- Path: sparky!uunet!sun-barr!west.West.Sun.COM!news2me.ebay.sun.com!exodus.Eng.Sun.COM!nonsequitur!narad
- From: narad@nonsequitur.Eng.Sun.COM (Chuck Narad)
- Newsgroups: comp.arch
- Subject: bcopy (was Re: CISC Microcode)
- Date: 25 Jul 1992 01:25:12 GMT
- Organization: Sun Microsystems
- Lines: 120
- Distribution: world
- Message-ID: <l71bboINN4en@exodus.Eng.Sun.COM>
- References: <1992Jul15.054020.1@eagle.wesleyan.edu>
- Reply-To: narad@nonsequitur.Eng.Sun.COM
- NNTP-Posting-Host: nonsequitur
-
- Having stared at the issue of accelerating bcopy for many years, I
- think I have a few comments to contribute on this thread.
-
- As Mr. Limes said, to a first order (UNIX == bcopy), so anything that can
- speed that up will free up more CPU cycles for important work like
- running emacs :-) There are two aspects to bcopy: that which happens
- in kernel space and that done in user-land.
-
- An ideal bcopy engine would be cache consistent (gets the most recent
- copy of any block), realigning (to service misaligned
- source/destination w.r.t. block size), and selectably cache polluting
- or not (some copy operations want the data just copied for immediate
- reuse, others don't, so 'pollution' of the cache is not always a bad
- thing). It would also move arbitrary-sized items, not just multiples
- of block-size. It would also (most important!) be able to accelerate
- the actual move in terms of CPU cycles used. Finally, it would be
- accessible to both user and kernel code.
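- A concrete shape for such an interface might look like the sketch
- below; the names (xcopy, XCOPY_POLLUTE) and the memmove stand-in are
- invented for illustration, not any real kernel's API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface for the "ideal" bcopy engine sketched above.
 * A real engine would be cache coherent, realigning, and would honor
 * the pollution hint; memmove() is a software stand-in with the same
 * copy semantics (any alignment, any size, overlap-safe). */
enum { XCOPY_POLLUTE = 1 };   /* caller wants the copied data cached */

static void xcopy(void *dst, const void *src, size_t len, int flags)
{
    (void)flags;              /* pollution hint is a no-op in software */
    memmove(dst, src, len);
}
```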
-
- From avg@rodan.UU.NET (Vadim Antonov):
-
- >My point is that the required circuit is cheap enough to have a really
- >good price/benefit ratio and i see no reason why the modern machines
- >lack it. You can always do small bcopy by CPU and bulk copy jobs offload
- >to the memory duplicator unit (a pair of mated DMA channels). Later, you
- >can add the third DMA channel and a DSP and get real good bang per buck
- >on array operations :-)
-
- As described above the bcopy engine is not a simple thing. But let's
- say that the hardware is simple, and supports only transfers that have
- the src/dest both block-aligned and an integer multiple of block
- sized. Leading and trailing block fragments can be handled in the CPU,
- so this restriction is really that the src/dest are aligned
- modulo[blocksize]. Now our hardware is a simple device consisting of
- two address counters (for source and destination) and a block counter
- (number of blocks to transfer) plus some controlling state machines.
- On the face of it this is an ideal solution, both low cost and
- effective for the most common copies such as copy-on-write of a page.
- Why isn't it seen more often in computers?
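- To make the division of labor concrete, here is a software sketch of
- driving such a minimal engine; BLK, the function names, and the
- memcpy stand-ins are all assumptions, not a real kernel interface:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 64  /* assumed block size; the real value is machine-dependent */

/* Stand-in for the two-address-counter DMA_COPY engine: given
 * block-aligned src/dst and a block count, it moves whole blocks.
 * In hardware this transfer would happen off-CPU. */
static void dma_copy_blocks(void *dst, const void *src, size_t nblk)
{
    memcpy(dst, src, nblk * BLK);
}

/* Copy where src and dst share the same offset modulo BLK: the CPU
 * handles the leading and trailing fragments, the engine handles the
 * aligned middle. */
static void copy_relaligned(char *dst, const char *src, size_t len)
{
    size_t off  = (uintptr_t)src % BLK;
    size_t head = off ? BLK - off : 0;       /* bytes to first block edge */
    if (head > len)
        head = len;
    memcpy(dst, src, head);                  /* leading fragment (CPU)    */
    dst += head; src += head; len -= head;
    size_t nblk = len / BLK;
    dma_copy_blocks(dst, src, nblk);         /* bulk blocks (engine)      */
    memcpy(dst + nblk * BLK, src + nblk * BLK, len % BLK); /* tail (CPU)  */
}
```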
-
- The answer is that this sort of device has too much overhead associated
- with it. A processor must contend for ownership of the DMA_COPY
- engine, set it up (possibly probing and prefaulting page mappings on
- the way), then either sit in a polling loop or switch contexts,
- taking an interrupt when the copy is done and re-awakening the
- requesting thread.
-
- It turns out that in UNIX, at least, the overhead is generally
- greater than the benefit: polling occupies the busses, cutting into
- the DMA_COPY engine's efficiency; the cost of allocating the device
- tends to be high; and interrupt overhead plus entering/exiting
- user-land tends to dominate the copy time. The device would only be
- useful for copies of many pages at a time, reducing its utility and
- thus making it unattractive.
-
- This sort of device is only useful from kernel land anyway, because user
- code would have to include the overhead of a system call to get to the
- resource.
-
- In my experience and analysis, the most useful general bcopy support is
- something that is tied closely to each processor, and can move one
- block at a time in a read-from-coherent-space, write-to-memory
- (coherently) pattern under block-by-block control of the processor.
- This can also be used to bfill/bzero pages efficiently. This allows us
- to accelerate relatively-aligned copies in kernel space, which covers
- many of the interesting cases including page copy, copy-on-write,
- networking buffers, etc. It doesn't help with userland bcopy,
- misaligned copies, or graphics copies including window fill, scrolling,
- or window movement; but it certainly helps at the system level.
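- A sketch of how such a per-processor block mover might be driven
- (blk_move, BLK, and PGSZ are illustrative assumptions; memcpy stands
- in for what would really be a single hardware block transfer):

```c
#include <stddef.h>
#include <string.h>

#define BLK  64            /* assumed block size */
#define PGSZ 4096          /* assumed page size  */

/* Hypothetical per-processor primitive: read one block from coherent
 * space and write it to memory (coherently), under CPU control. */
static void blk_move(void *dst, const void *src)
{
    memcpy(dst, src, BLK);
}

/* Page copy driven block-by-block by the CPU, as described above. */
static void page_copy(char *dst, const char *src)
{
    for (size_t b = 0; b < PGSZ / BLK; b++)
        blk_move(dst + b * BLK, src + b * BLK);
}

/* The same primitive does bzero: keep feeding it an all-zero block. */
static void page_zero(char *dst)
{
    static const char zero[BLK];            /* zero-initialized source */
    for (size_t b = 0; b < PGSZ / BLK; b++)
        blk_move(dst + b * BLK, zero);
}
```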
-
- The rest would be done by the CPU. Many interesting things can be done
- in a fancy-enough CPU to help; for instance, a machine with
- non-blocking loads would allow for better pipelining between loads and
- stores. A write-through cache (hisssssssss!) actually can perform CPU
- bcopy better than a copy-back cache, since victims (displaced blocks)
- that are modified don't require a copyback. Etc.
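- The non-blocking-load point can be sketched as a copy loop that
- separates a group of loads from the stores that drain them; whether
- the overlap actually happens depends on the compiler and the CPU:

```c
#include <stddef.h>
#include <stdint.h>

/* With non-blocking loads, issuing several loads before any of the
 * stores lets the cache fills overlap with each other and with the
 * stores.  This 4-way unrolled loop shows the scheduling. */
static void copy_pipelined(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t a = src[i],     b = src[i + 1];  /* loads issued first */
        uint64_t c = src[i + 2], d = src[i + 3];
        dst[i]     = a;  dst[i + 1] = b;          /* stores drain later */
        dst[i + 2] = c;  dst[i + 3] = d;
    }
    for (; i < n; i++)                            /* leftover words */
        dst[i] = src[i];
}
```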
-
- That copyback cache note is one that is often missed in bcopy analysis; if a
- bcopy by a processor goes through the cache, then it takes a cache fill
- (one block movement across the memory system), a write-allocate (to get
- the block that receives the data, another movement of a block), a bunch
- of cache-to-cache moves of data by the CPU, and potentially a copyback
- of the victim, which is a third movement of a block to/from the memory
- system. Eventually the destination will also get copied back, since it
- was modified by the processor; that means that this kind of data
- movement can cost up to 4 bus transits of a block, and in the steady
- state will require 3, meaning that the maximum copy bandwidth is 1/3 of
- the memory system bandwidth (approximately, ignoring a few factors).
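- The transit counting above can be written out explicitly (a worked
- tally of the argument, not a model of any particular machine):

```c
/* Block movements across the memory system for one block copied by a
 * CPU through a copy-back cache, counted as in the text above. */
enum {
    SRC_FILL     = 1,  /* cache fill of the source block             */
    DST_ALLOCATE = 1,  /* write-allocate of the destination block    */
    DST_COPYBACK = 1,  /* dirty destination eventually written back  */
    VICTIM       = 1,  /* possible copyback of a displaced block     */

    TRANSITS_STEADY = SRC_FILL + DST_ALLOCATE + DST_COPYBACK,  /* 3 */
    TRANSITS_WORST  = TRANSITS_STEADY + VICTIM                 /* 4 */
};
/* peak copy bandwidth = memory bandwidth / TRANSITS_STEADY = 1/3 */
```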
-
-
- From amolitor@eagle.wesleyan.edu ():
-
- > Relevent to this is the question 'how much of the stuff you just
- >moved do you *want* in cache later?' I suspect the answer is 'very little'
- >but this might be an interesting study. Do typical programs generally
- >move data, and then forget about it for a while? Do kernels? If the answers
- >vary, perhaps MORE INSTRUCTIONS is a good idea ;)
- >
- > Andrew "cisccisccisc" Molitor
-
- From dfields@hydra.urbana.mcd.mot.com (David Fields):
-
- >Take, for instance, the case of a copy-on-write fault. The OS will
- >copy the c-o-w page to a private one that the process can write. It
- >is destructive to have 2*PAGESIZE data flush out the primary cache.
-
- Good questions. The answer is that sometimes you want the data now,
- and sometimes you don't. For example, when moving data through
- several layers of network protocol stacks, there could be several
- successive copies that benefit from having the data in the cache
- after the first time. With a c-o-w the process that tried to write
- the data wants some portion of that page in its cache image *right*
- *now*; unfortunately it may want only one byte, while it *could*
- want the entire page.
-
- So in summary: with UNIX systems at least, it is easy and beneficial
- to accelerate aligned kernel copies; it is more difficult (but not
- impossible) to handle misaligned copies, if it is impossible for the
- HW guys (white hats) to convince the SW guys (black hats) to fix
- their d**n code; DMA engines are generally not a win for this
- problem due to the management overhead involved; and user bcopy is
- difficult to accelerate due to the overheads of requesting a kernel
- service.
-
- chuck/
-