NetNews Usenet Archive 1992 #16

Internet Message Format | 1992-07-25 | 6.9 KB

Path: sparky!uunet!sun-barr!west.West.Sun.COM!news2me.ebay.sun.com!exodus.Eng.Sun.COM!nonsequitur!narad
From: narad@nonsequitur.Eng.Sun.COM (Chuck Narad)
Newsgroups: comp.arch
Subject: bcopy (was Re: CISC Microcode)
Date: 25 Jul 1992 01:25:12 GMT
Organization: Sun Microsystems
Lines: 120
Distribution: world
Message-ID: <l71bboINN4en@exodus.Eng.Sun.COM>
References: <1992Jul15.054020.1@eagle.wesleyan.edu>
Reply-To: narad@nonsequitur.Eng.Sun.COM
NNTP-Posting-Host: nonsequitur

Having stared at the issue of accelerating bcopy for many years, I think I have a few comments to contribute on this thread. As Mr. Limes said, to a first order (UNIX == bcopy), so anything that can speed that up will free up more CPU cycles for important work like running emacs :-)

There are two aspects to bcopy, that which happens in kernel space and that done in User-land.

An ideal bcopy engine would be cache consistent (gets the most recent copy of any block), realigning (to service misaligned source/destination w.r.t. block size), and selectably cache polluting or not (some copy operations want the data just copied for immediate reuse, others don't, so 'pollution' of the cache is not always a bad thing). It would also move arbitrary-sized items, not just multiples of block-size. It would also (most important!) be able to accelerate the actual move in terms of CPU cycles used. Finally, it would be accessible to both user and kernel code.

From avg@rodan.UU.NET (Vadim Antonov):
>My point is that the required circuit is cheap enough to have a really
>good price/benefit ratio and i see no reason why the modern machines
>lack it. You can always do small bcopy by CPU and bulk copy jobs offload
>to the memory duplicator unit (a pair of mated DMA channels). Later, you
>can add the third DMA channel and a DSP and get real good bang per buck
>on array operations :-)

As described above the bcopy engine is not a simple thing.
But let's say that the hardware is simple, and supports only transfers where the src/dest are both block-aligned and the length is an integer multiple of the block size. Leading and trailing block fragments can be handled in the CPU, so this restriction is really that the src/dest are aligned modulo[blocksize]. Now our hardware is a simple device consisting of two address counters (for source and destination) and a block counter (number of blocks to transfer) plus some controlling state machines.

On the face of it this is an ideal solution, both low cost and effective for the most common copies such as copy-on-write of a page. Why isn't it seen more often in computers?

The answer is that this sort of device has too much overhead associated with it. A processor must contend for ownership of the DMA_COPY engine, set it up (possibly probing and prefaulting page mappings on the way), then either sit in a polling loop or switch contexts, planning to take an interrupt when the copy is done and reawaken the requesting thread.

It turns out that in UNIX, at least, the overhead is generally greater than the benefit. Polling occupies busses somewhat, cutting into the DMA_COPY efficiency, and the overhead of allocating a device tends to be high; in the interrupt case, the cost of taking the interrupt and entering/exiting user-land tends to dominate the copy time. The device would only be useful for copies of many pages at a time, reducing its utility and thus making it unattractive.

This sort of device is only useful from kernel land anyway, because user code would have to include the overhead of a system call to get to the resource.

In my experience and analysis, the most useful general bcopy support is something that is tied closely to each processor, and can move one block at a time in a read-from-coherent-space, write-to-memory (coherently) pattern under block-by-block control of the processor. This can also be used to bfill/bzero pages efficiently.
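
To make the block-by-block pattern concrete, here is a minimal sketch in C. It is an illustration under stated assumptions, not anyone's actual implementation: the 32-byte BLOCK_SIZE, the block_move() primitive, and the bcopy_blocks() name are all hypothetical, and block_move() is modeled with memcpy since the real thing would be hardware. The CPU handles the leading and trailing fragments and issues one block move per whole block; if src and dest are not aligned modulo the block size relative to each other, it falls back to a plain CPU copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 32u /* assumed block size in bytes (hypothetical) */

/* Stand-in for the hypothetical per-processor block mover: moves exactly
 * one aligned block. memcpy only models the effect of the hardware. */
static void block_move(void *dst, const void *src)
{
    memcpy(dst, src, BLOCK_SIZE);
}

/* Copy n bytes from src to dst (bcopy argument order, no overlap).
 * Returns the number of whole blocks handed to block_move(). */
static size_t bcopy_blocks(const void *src, void *dst, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    size_t blocks = 0;

    /* The block mover only helps when src and dst share the same
     * offset within a block (aligned modulo BLOCK_SIZE). */
    if ((uintptr_t)s % BLOCK_SIZE != (uintptr_t)d % BLOCK_SIZE) {
        memcpy(d, s, n); /* misaligned case: CPU does all of it */
        return 0;
    }

    /* Leading fragment, up to the next block boundary, by CPU. */
    size_t head = (BLOCK_SIZE - (uintptr_t)s % BLOCK_SIZE) % BLOCK_SIZE;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    s += head;
    d += head;
    n -= head;

    /* Whole blocks: one block_move() per block, under CPU control. */
    while (n >= BLOCK_SIZE) {
        block_move(d, s);
        s += BLOCK_SIZE;
        d += BLOCK_SIZE;
        n -= BLOCK_SIZE;
        blocks++;
    }

    /* Trailing fragment, by CPU. */
    memcpy(d, s, n);
    return blocks;
}
```

There is no engine to allocate, no polling loop, and no interrupt in this shape; the processor simply issues the next block move when it is ready. A bzero along the same lines would presumably keep the same head/body/tail split and replace the block move with a block fill.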
Per-processor block-move support of this kind lets us accelerate relatively-aligned copies in kernel space, which covers many of the interesting cases including page copy, copy-on-write, networking buffers, etc. It doesn't help with userland bcopy, misaligned copies, or graphics copies including window fill, scrolling, or window movement; but it certainly helps at the system level.

The rest would be done by the CPU. Many interesting things can be done in a fancy-enough CPU to help; for instance, a machine with non-blocking loads would allow for better pipelining between loads and stores. A write-through cache (hisssssssss!) actually can perform CPU bcopy better than a copy-back cache, since victims (displaced blocks) that are modified don't require a copyback. Etc.

That copyback cache note is one that is often missed in bcopy analysis. If a bcopy by a processor goes through the cache, then it takes a cache fill (one block movement across the memory system), a write-allocate (to get the block that receives the data, another movement of a block), a bunch of cache-to-cache moves of data by the CPU, and potentially a copyback of the victim, which is a third movement of a block to/from the memory system. Eventually the destination will also get copied back, since it was modified by the processor. That means that this kind of data movement can cost up to 4 bus transits of a block, and in the steady state will require 3, meaning that the maximum copy bandwidth is 1/3 of the memory system bandwidth (approximately, ignoring a few factors).

From amolitor@eagle.wesleyan.edu ():
> Relevent to this is the question 'how much of the stuff you just
>moved do you *want* in cache later?' I suspect the answer is 'very little'
>but this might be an interesting study. Do typical programs generally
>move data, and then forget about it for a while? Do kernels?
>If the answers vary, perhaps MORE INSTRUCTIONS is a good idea ;)
>
> Andrew "cisccisccisc" Molitor

From dfields@hydra.urbana.mcd.mot.com (David Fields):
>Take, for instance, the case of a copy-on-write fault. The OS will
>copy the c-o-w page to a private one that the process can write. It
>is distructive to have 2*PAGESIZE data flush out the primary cache.

Good questions. The answer is that sometimes you want the data now, and sometimes you don't. For example, moving data through several layers of network protocol stacks, there could be several successive copies that benefit from having the data in the cache after the first time. With a c-o-w the process that tried to write the data wants some portion of that page in its cache image *right* *now*; unfortunately it may want only one byte, while it *could* want the entire page.

So, in summary: with UNIX systems at least, it is easy and beneficial to accelerate aligned kernel copies; it is more difficult (but not impossible) to handle misaligned copies, if it is impossible for the HW guys (white hats) to convince the SW guys (black hats) to fix their d**n code; DMA engines are generally not a win for this problem due to the management overhead involved; and user bcopy is difficult to accelerate due to the overheads of requesting a kernel service.

chuck/