Path: sparky!uunet!sun-barr!ames!network.ucsd.edu!nic!davsmith
From: davsmith@nic.cerf.net (David Smith)
Newsgroups: comp.arch
Subject: Re: CISC Microcode (was Re: RISC Mainframe)
Message-ID: <2369@nic.cerf.net>
Date: 22 Jul 92 22:49:41 GMT
References: <BrM8Gv.E3r@zoo.toronto.edu> <ADAMS.92Jul21011202@PDV2.pdv2.fmr.maschinenbau.th-darmstadt.de> <Brsx7o.G69@zoo.toronto.edu>
Organization: CERFnet
Lines: 86

In article <Brsx7o.G69@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <ADAMS.92Jul21011202@PDV2.pdv2.fmr.maschinenbau.th-darmstadt.de> adams@pdv2.fmr.maschinenbau.th-darmstadt.de (Adams) writes:
>
>Uh, where do you find idle states? If the CPU is that much faster than
>the memory, it wants all the memory bandwidth it can get, and the memory
>designers sweat and strain to give it more. You get idle states only
>when the CPU is *slow*, so it can't use the full memory bandwidth.
>
>Dedicated DMA engines make sense if
>
> 1. the CPU can't move data around at full memory speed
> 2. normal execution doesn't use the full memory bandwidth
> 3. interrupt overhead is too high for timely device handling
> 4. bad hardware design cripples CPU data movement

1 and 2 are both true today. You can design a memory system that can
provide data faster than any processor available today can consume it.
This can be done by appropriate interleaving. It *is* difficult to
build a memory system that will provide data at the bandwidth the
processor would like in the data access pattern the processor would
like (random, with little prediction). That's why we put caches
between the processor and the memory system.

>It should be easy to see that as the CPU gets faster, it *can* move data
>around at full speed, and it wants all the bandwidth it can find (adding
>caches helps some, but doesn't solve the problem).

This is true *only* if the memory system's latency per fetch is low
enough. The problem is that the CPU's method of interacting with the
memory system is not the method which gives you the best results with
an interleaved memory system.

Interleaved memory systems can give you any bandwidth you would like.
The cost is that in order to achieve the bandwidth you must issue
memory fetches or stores asynchronously.

Interleaved memory systems work by pipelining memory accesses. A
simple example (though moderately useless) would be a two-way
interleaved system. This would have two banks of memory, each of which
has independent address decoding hardware. Bank 0 contains memory
addresses whose least significant bit is 0 (i.e. 0, 2, 4, etc.) while
Bank 1 contains those whose LSB is 1. Latency to decode and retrieve
the memory at an address is, we shall say, 2 cycles. OK, so we issue a
fetch request for address 0. Bank 0 gets this request and returns our
data 2 cycles later. If, one cycle after issuing the fetch for 0, we
issue a request for address 1 (or 3, 5, etc.), then the word at
location 1 is returned to us one cycle after the word at location 0.
If we can keep the address pipeline full we can receive data at the
rate of 1 word per cycle, the maximum bandwidth.
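
Here's a rough C sketch of that timing, just to make the pipelining
concrete. It models only the toy example above (2 banks, 2-cycle
latency, one request issued per cycle), not any real memory controller:

/* Toy model of the 2-way interleaved example: one fetch issued per
   cycle, bank picked by the low address bit, each bank taking 2
   cycles to hand back its word.  Illustration only. */
#include <stdio.h>

#define LATENCY 2               /* cycles per bank access */

int main(void)
{
    int addr;

    for (addr = 0; addr < 8; addr++) {
        int issue  = addr;              /* one request per cycle  */
        int bank   = addr & 1;          /* LSB selects the bank   */
        int arrive = issue + LATENCY;   /* data shows up 2 later  */

        printf("addr %d: bank %d, issued cycle %d, data on cycle %d\n",
               addr, bank, issue, arrive);
    }
    /* After the initial 2-cycle latency, one word arrives per cycle. */
    return 0;
}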

An interleave factor of 2 is almost pointless since a memory bank's
latency is rarely only 2 cycles. Typically to get best performance you
will have an interleave factor of 8 or more. However, in order to get
full bandwidth out of an 8-way interleaved memory we have to be capable
of issuing addresses 8 words ahead of where we are.
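
To put rough numbers on what that buys you: with a bank latency of L
cycles and enough requests in flight, a stream of M words costs about
L + M - 1 cycles, while a CPU that blocks on each fetch pays about
M * L. A quick back-of-the-envelope (my numbers, not measurements of
any real machine):

/* Back-of-the-envelope cycle counts for streaming 'words' words out
   of an interleaved memory: pipelined (addresses issued 'latency'
   ahead) versus a CPU that blocks on every fetch.  Illustration only. */
#include <stdio.h>

int main(void)
{
    long words   = 1024;    /* length of the stream                     */
    long latency = 8;       /* bank latency in cycles (the 8-way case)  */

    long pipelined = latency + words - 1;   /* one word/cycle once full */
    long blocking  = latency * words;       /* wait out every fetch     */

    printf("pipelined: %ld cycles (%.2f words/cycle)\n",
           pipelined, (double)words / pipelined);
    printf("blocking:  %ld cycles (%.2f words/cycle)\n",
           blocking, (double)words / blocking);
    return 0;
}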

All CPUs I have seen to date (not every CPU by any means - if you know
of counter examples, please post) cannot do asynchronous address
generation. When they request a word of memory they want it *NOW* or
within a cycle or two and will block until it arrives. This makes it
impossible to get full bandwidth out of an interleaved memory system if
the cache is removed.

Caches on interleaved memory systems can do pre-fetching; however,
there is a trade-off to be made between pre-fetching too little and
too much. Caches tend to fetch in blocks, so the start-up latency is
spread across a number of words, but the size of the block must be
large to make the latency cost trivial. There are a number of problems
with fetching large blocks. Thus, it is difficult to get full bandwidth
out of an interleaved memory system while going through the cache.
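
The amortization is easy to quantify under the same toy assumptions as
above: a block of B words costs roughly latency + B - 1 cycles to fill,
so the per-word cost is (latency + B - 1)/B, and that only gets close
to 1 cycle/word when B is fairly large. For example:

/* Per-word cost of cache block fills of various sizes, assuming the
   pipelined interleaved memory sketched above (8-cycle start-up
   latency, then one word per cycle).  Toy numbers, illustration only. */
#include <stdio.h>

int main(void)
{
    int latency = 8;
    int b;

    for (b = 1; b <= 64; b *= 2) {
        double per_word = (double)(latency + b - 1) / b;
        printf("block of %2d words: %.2f cycles/word\n", b, per_word);
    }
    /* Only the large blocks approach 1 cycle/word, and large blocks
       bring their own problems (wasted fetches, longer fill time). */
    return 0;
}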

In short, it is possible to build a memory system that will provide
data as quickly as any CPU could *theoretically* (with a 0-latency
memory system) move it; however, current CPU implementations do not
allow for exploiting that architecture.

There are several ways to exploit the architecture. Dedicated DMA
engines are one of them (because the DMA engine knows what the next N
fetches are going to be, so it can tell the memory system). Another,
more RISCy approach would be to provide pre-fetch instructions for the
same purpose.
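
For the pre-fetch-instruction flavour, the idea would be roughly the
following. prefetch() here stands for a hypothetical non-blocking
"start the fetch, don't wait for it" instruction; the stub just makes
the sketch compile and is not any existing compiler primitive:

/* Sketch of software pre-fetching over a large array: start the fetch
   for the word we will need several iterations from now, so an
   interleaved memory can pipeline it.  Hypothetical, not code for any
   current CPU. */
#define AHEAD 8                    /* issue addresses 8 words ahead   */

/* Stand-in for the hypothetical pre-fetch instruction; on a machine
   without one it does nothing and the loop is just normal code. */
static void prefetch(const double *addr) { (void)addr; }

double sum(const double *a, int n)
{
    double s = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        if (i + AHEAD < n)
            prefetch(&a[i + AHEAD]);   /* memory system starts early  */
        s += a[i];                     /* by now a[i] should be close */
    }
    return s;
}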

=====
David L. Smith
smithd@discos.com or davsmith@nic.cerf.net