Path: sparky!uunet!sun-barr!ames!network.ucsd.edu!nic!davsmith
From: davsmith@nic.cerf.net (David Smith)
Newsgroups: comp.arch
Subject: Re: CISC Microcode (was Re: RISC Mainframe)
Message-ID: <2369@nic.cerf.net>
Date: 22 Jul 92 22:49:41 GMT
References: <BrM8Gv.E3r@zoo.toronto.edu> <ADAMS.92Jul21011202@PDV2.pdv2.fmr.maschinenbau.th-darmstadt.de> <Brsx7o.G69@zoo.toronto.edu>
Organization: CERFnet
Lines: 86

In article <Brsx7o.G69@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <ADAMS.92Jul21011202@PDV2.pdv2.fmr.maschinenbau.th-darmstadt.de> adams@pdv2.fmr.maschinenbau.th-darmstadt.de (Adams) writes:
>
>Uh, where do you find idle states? If the CPU is that much faster than
>the memory, it wants all the memory bandwidth it can get, and the memory
>designers sweat and strain to give it more. You get idle states only
>when the CPU is *slow*, so it can't use the full memory bandwidth.
>
>Dedicated DMA engines make sense if
>
> 1. the CPU can't move data around at full memory speed
> 2. normal execution doesn't use the full memory bandwidth
> 3. interrupt overhead is too high for timely device handling
> 4. bad hardware design cripples CPU data movement

1 and 2 are both true today. You can design a memory system that can
provide data faster than any processor available today can consume it.
This can be done by appropriate interleaving. It *is* difficult to
build a memory system that will provide data at the bandwidth the
processor would like in the data access pattern the processor would
like (random, with little prediction). That's why we put caches
between the processor and the memory system.

>It should be easy to see that as the CPU gets faster, it *can* move data
>around at full speed, and it wants all the bandwidth it can find (adding
>caches helps some, but doesn't solve the problem).

This is true *only* if the memory system's latency per fetch is low
enough. The problem is that the CPU's method of interacting with the
memory system is not the method which gives you the best results with
an interleaved memory system.

Interleaved memory systems can give you any bandwidth you would like.
The cost is that in order to achieve the bandwidth you must issue
memory fetches or stores asynchronously.

Interleaved memory systems work by pipelining memory accesses. A
simple example (though moderately useless) would be a two-way
interleaved system. This would have two banks of memory, each of which
has independent address decoding hardware. Bank 0 contains memory
addresses whose least significant bit is 0 (i.e. 0, 2, 4, etc.) while
Bank 1 contains those whose LSB is 1. Latency to decode and retrieve
the memory at an address is, we shall say, 2 cycles. OK, so we issue a
fetch request for address 0. Bank 0 gets this request and returns our
data 2 cycles later. If, one cycle after issuing the fetch for 0, we
issue a request for address 1 (or 3, 5, etc.), then the word at
location 1 is returned to us one cycle after the word at location 0.
If we can keep the address pipeline full we can receive data at the
rate of 1 word per cycle, the maximum bandwidth.
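
Here's a rough C sketch of that timing, just to make the pipelining
concrete. It models only the toy example above (2 banks, 2-cycle
latency, one request issued per cycle), not any real memory controller:

/* Toy model of the 2-way interleaved example: one fetch issued per
   cycle, bank picked by the low address bit, each bank taking 2
   cycles to hand back its word.  Illustration only. */
#include <stdio.h>

#define LATENCY 2               /* cycles per bank access */

int main(void)
{
    int addr;

    for (addr = 0; addr < 8; addr++) {
        int issue  = addr;              /* one request per cycle  */
        int bank   = addr & 1;          /* LSB selects the bank   */
        int arrive = issue + LATENCY;   /* data shows up 2 later  */

        printf("addr %d: bank %d, issued cycle %d, data on cycle %d\n",
               addr, bank, issue, arrive);
    }
    /* After the initial 2-cycle latency, one word arrives per cycle. */
    return 0;
}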

An interleave factor of 2 is almost pointless since a memory bank's
latency is rarely only 2 cycles. Typically to get best performance you
will have an interleave factor of 8 or more. However, in order to get
full bandwidth out of an 8-way interleaved memory we have to be capable
of issuing addresses 8 words ahead of where we are.
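
To put rough numbers on what that buys you: with a bank latency of L
cycles and enough requests in flight, a stream of M words costs about
L + M - 1 cycles, while a CPU that blocks on each fetch pays about
M * L. A quick back-of-the-envelope (my numbers, not measurements of
any real machine):

/* Back-of-the-envelope cycle counts for streaming 'words' words out
   of an interleaved memory: pipelined (addresses issued 'latency'
   ahead) versus a CPU that blocks on every fetch.  Illustration only. */
#include <stdio.h>

int main(void)
{
    long words   = 1024;    /* length of the stream                     */
    long latency = 8;       /* bank latency in cycles (the 8-way case)  */

    long pipelined = latency + words - 1;   /* one word/cycle once full */
    long blocking  = latency * words;       /* wait out every fetch     */

    printf("pipelined: %ld cycles (%.2f words/cycle)\n",
           pipelined, (double)words / pipelined);
    printf("blocking:  %ld cycles (%.2f words/cycle)\n",
           blocking, (double)words / blocking);
    return 0;
}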

All CPUs I have seen to date (not every CPU by any means - if you know
of counter examples, please post) cannot do asynchronous address
generation. When they request a word of memory they want it *NOW* or
within a cycle or two and will block until it arrives. This makes it
impossible to get full bandwidth out of an interleaved memory system if
the cache is removed.

Caches on interleaved memory systems can do pre-fetching; however,
there is a trade-off to be made between pre-fetching too little and
too much. Caches tend to fetch in blocks, so the start-up latency is
spread across a number of words, but the size of the block must be
large to make the latency cost trivial. There are a number of problems
with fetching large blocks. Thus, it is difficult to get full bandwidth
out of an interleaved memory system while going through the cache.
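
The amortization is easy to quantify under the same toy assumptions as
above: a block of B words costs roughly latency + B - 1 cycles to fill,
so the per-word cost is (latency + B - 1)/B, and that only gets close
to 1 cycle/word when B is fairly large. For example:

/* Per-word cost of cache block fills of various sizes, assuming the
   pipelined interleaved memory sketched above (8-cycle start-up
   latency, then one word per cycle).  Toy numbers, illustration only. */
#include <stdio.h>

int main(void)
{
    int latency = 8;
    int b;

    for (b = 1; b <= 64; b *= 2) {
        double per_word = (double)(latency + b - 1) / b;
        printf("block of %2d words: %.2f cycles/word\n", b, per_word);
    }
    /* Only the large blocks approach 1 cycle/word, and large blocks
       bring their own problems (wasted fetches, longer fill time). */
    return 0;
}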

In short, it is possible to build a memory system that will provide
data as quickly as any CPU could *theoretically* (with a 0-latency
memory system) move it; however, current CPU implementations do not
allow for exploiting that architecture.

There are several ways to exploit the architecture. Dedicated DMA
engines are one of them (because the DMA engine knows what the next N
fetches are going to be, so it can tell the memory system). Another,
more RISCy approach would be to provide pre-fetch instructions for the
same purpose.
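
For the pre-fetch-instruction flavour, the idea would be roughly the
following. prefetch() here stands for a hypothetical non-blocking
"start the fetch, don't wait for it" instruction; the stub just makes
the sketch compile and is not any existing compiler primitive:

/* Sketch of software pre-fetching over a large array: start the fetch
   for the word we will need several iterations from now, so an
   interleaved memory can pipeline it.  Hypothetical, not code for any
   current CPU. */
#define AHEAD 8                    /* issue addresses 8 words ahead   */

/* Stand-in for the hypothetical pre-fetch instruction; on a machine
   without one it does nothing and the loop is just normal code. */
static void prefetch(const double *addr) { (void)addr; }

double sum(const double *a, int n)
{
    double s = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        if (i + AHEAD < n)
            prefetch(&a[i + AHEAD]);   /* memory system starts early  */
        s += a[i];                     /* by now a[i] should be close */
    }
    return s;
}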

=====
David L. Smith
smithd@discos.com or davsmith@nic.cerf.net