Xref: sparky comp.arch:12244 comp.sys.ibm.pc.hardware:35573
Path: sparky!uunet!haven.umd.edu!darwin.sura.net!gatech!psuvax1!rutgers!cbmvax!snark!esr
From: esr@snark.thyrsus.com (Eric S. Raymond)
Newsgroups: comp.arch,comp.sys.ibm.pc.hardware
Subject: Request For Comments --- IBM-clone cache-design tutorial for Buyer's Guide
Message-ID: <1k5rJG#1wnsqp9bCpVr2wtwkc4PyLSl=esr@snark.thyrsus.com>
Date: 11 Jan 93 23:43:56 GMT
Followup-To: poster
Lines: 141

I want to add a section on cache design and its performance impact to my
386/486 Hardware Buyers' Guide. This is the draft. Please email your comments
if you think you see an error or significant omission.

-------------------------------- CUT HERE ------------------------------------
C. Cache Flow

The most obscure of the important factors in the performance of a UNIX 486
system is the motherboard's memory cache size and design. The two questions
performance-minded buyers have to deal with are: (1) does the cache design
of a given motherboard work with UNIX, and (2) how much cache SRAM should
my system have?

Before normal clock speeds hit two digits in MHz, cache design wasn't a big
issue. But DRAM's memory-cycle times just aren't fast enough to keep up with
today's processors. Thus, your machine's memory controller caches memory
references in faster static RAM, reading from main memory in chunks that the
board designer hopes will be large enough to keep the CPU continuously fed
under a typical job load. If the cache system fails to work, the processor
will be slowed down to less than the memory's real access speed --- which,
given January 1993's typical 60ns DRAM parts, is about 16MHz.

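Where does that 16MHz figure come from? It's just the reciprocal of the
access time. Here's the back-of-the-envelope version in C (a deliberately
crude model; it ignores page-mode DRAM tricks, refresh, and bus overhead):

    #include <stdio.h>

    /* Crude rule of thumb: the fastest zero-wait-state clock a memory
     * part can keep up with is roughly 1 / (access time). */
    int main(void)
    {
        double dram_ns = 60.0;   /* typical DRAM part, January 1993 */
        double sram_ns = 20.0;   /* typical cache SRAM, January 1993 */

        printf("60ns DRAM -> about %.1f MHz\n", 1000.0 / dram_ns); /* ~16.7 */
        printf("20ns SRAM -> about %.1f MHz\n", 1000.0 / sram_ns); /* 50.0  */
        return 0;
    }
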
The 486 includes an 8K cache right on the processor chip. If memory accesses
were reliably sequential and well-localized, this would be fine.
Unfortunately, one side-effect of what's today considered "good programming
practice", with high-level languages using a lot of subroutine calls, is that
the program counter of a typical process hops around like crazy; locality is
really poor. This gives the caching system a workout. UNIX makes the problem
worse, because clock interrupts and other effects of multitasking design
degrade locality still further.

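To see what "poor locality" means in practice, here's a toy C program (not a
benchmark of any real board; the array size and stride are invented for the
example). Both loops compute the same sum, but the second one hops across the
array in big strides and so keeps missing a small cache:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)             /* 4MB of ints, bigger than any cache */

    int main(void)
    {
        int *a = malloc(N * sizeof *a);
        long seq = 0, strided = 0;
        int i, j;

        if (a == NULL)
            return 1;
        for (i = 0; i < N; i++)
            a[i] = 1;

        for (i = 0; i < N; i++)     /* good locality: neighbors hit the cache */
            seq += a[i];

        for (j = 0; j < 4096; j++)  /* poor locality: 16KB hops defeat it */
            for (i = j; i < N; i += 4096)
                strided += a[i];

        printf("sums: %ld %ld (identical; only the speed differs)\n",
               seq, strided);
        free(a);
        return 0;
    }
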
Thus, the 486's 8K internal primary cache is typically supplemented with an
external caching system using SRAM; in January 1993, 20ns SRAM is typical.
The size and design of your motherboard cache is one of the most critical
factors in your system's real performance.

Unfortunately, cache design is a complicated black art, and cache performance
isn't easy to predict or measure, especially under the rapidly variable
system loads characteristic of UNIX. Thus, the best advice your humble editor
can give is a collection of rules of thumb. Your mileage may vary...

Rule 1: Buy only motherboards that have been tested with UNIX
One of DOS's many sins is that it licenses poor hardware design; it's too
brain-dead to stretch the cache system much. Thus, bad cache designs that
will run DOS can completely hose UNIX, slowing the machine to a crawl or even
(in extreme cases) causing frequent random panics. Make sure your motherboard
or system has been tested with some UNIX variant.

Rule 2: Be sure you get enough cache
If your motherboard offers multiple cache sizes, make sure you know how much
is required to service the DRAM you plan to install. If possible, fill the
cache all the way -- cache-speed SRAM is getting pretty cheap and this may
save you many hassles later.
Bela Lubkin writes: "Excess RAM [over what your cache can support] is a very
bad idea: most designs prevent memory outside the external cache's cachable
range from being cached by the 486 internal cache either. Code running from
this memory runs up to 11 times slower than code running out of fully cached
memory."

Rule 3: "Enough cache" is at least 64K per 16MB of DRAM
Hardware caches are usually designed to achieve effective 0-wait-state
operation, rather than perform any significant buffering of data. As a
general rule applicable to all clones, a 64K cache handles up to 16MB of
memory; in a "direct-mapped" cache design (typical for clone hardware) more
is redundant. We'll have more to say about cache sizes below.

Rule 4: If possible, max out the board's cache -- it will save hassles later
Bela continues: "Get the largest cache size your motherboard supports, even
if you're not fully populating it with RAM. The motherboard manufacturer buys
cache chips in quantity, knows how to install them correctly, and you won't end
up throwing out the small chips later when you upgrade your main RAM."

A lot of fast chips are held back by poor cache systems and slow memory. The
50DX has a particular problem this way, because its clock cycle (20ns at
50MHz) is exactly as long as the access time of a 20ns cache SRAM, leaving no
margin. To avoid trouble, cloners often insert wait states at the cache,
slowing down the 50DX to the effective speed of a 50DX/2.

Worse than this, a lot of cloners have taken the 50DX/2 and 66DX/2 as
invitations to reuse old 25- and 33MHz board designs without change. The
trouble is that these chips take a double hit for each wait state, because
the wait states are timed by *external* cycles. And there can be lots of
them; a look at the CMOS setup screen of most 33MHz and 50MHz systems will
usually reveal many wait states.

Now for some basic cache-design terminology.

- "write-through" --- it wouldn't do to let your cache get out of sync with
- main memory. The simplest and slowest way to handle this is to arrange that
- every write to cache generates the corresponding write to main store. In
- effect, then, you only get cache speedup on reads.
-
- "write-back" --- for each cache address range in DRAM, writes are done to
- cache only until a new block has to be fetched due to out-of-range access (at
- which point the old one is flushed to DRAM). This is much faster, because
- you get cache speedup on writes as well as reads. It's also more expensive
- and trickier to get right.
-
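To see the two policies side by side, here's a toy one-line cache model in C.
It's an illustration only; the line size, the all-writes access pattern, and
the single line are invented, and a real controller tracks many lines at once:

    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE 16                 /* invented, just for illustration */

    /* What we count is DRAM traffic: write-through pays a DRAM write on
     * every store, write-back only when a dirty line gets evicted. */
    struct line { unsigned long tag; int valid, dirty; };

    static struct line l;
    static long dram_reads, dram_writes;

    static void touch(unsigned long addr, int is_write, int write_back)
    {
        unsigned long tag = addr / LINE_SIZE;

        if (!l.valid || l.tag != tag) {       /* miss: refill the line     */
            if (write_back && l.valid && l.dirty)
                dram_writes++;                /* flush the old dirty line  */
            dram_reads++;
            l.tag = tag;
            l.valid = 1;
            l.dirty = 0;
        }
        if (is_write) {
            if (write_back)
                l.dirty = 1;                  /* defer the DRAM write      */
            else
                dram_writes++;                /* write-through: pay it now */
        }
    }

    int main(void)
    {
        int policy;
        unsigned long a;

        for (policy = 0; policy <= 1; policy++) {
            memset(&l, 0, sizeof l);
            dram_reads = dram_writes = 0;
            for (a = 0; a < 1024; a++)        /* 1024 sequential stores    */
                touch(a, 1, policy);
            printf("%-14s %ld DRAM reads, %ld DRAM writes\n",
                   policy ? "write-back:" : "write-through:",
                   dram_reads, dram_writes);
        }
        return 0;
    }
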
To understand the next two terms, you need to think of main RAM as being
divided into consecutive, non-overlapping segments we'll call "regions". A
typical region size is 16MB. Think of cache SRAM as being divided into
similar, but much smaller segments (typically 64K each) which we'll call
"pages". When the processor reads from an address in a given region, and that
address is not already in the cache, the location and the 64K around it are
read into a page.

- "direct-mapped" --- describes a cache system in which each region has
- exactly one corresponding page (also called "one-way cache"). Typically,
- you get the page address in cache by throwing away the top 16 bits of the
- region address.
-
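As a concrete sketch of that mapping (using the 16MB-region / 64K-page model
above; real controllers work in much smaller lines, and the example address
is made up):

    #include <stdio.h>

    /* The low 16 bits pick a byte within the 64K page; the 8 bits above
     * them, within the 16MB region, form the tag that must match on a hit. */
    int main(void)
    {
        unsigned long addr   = 0x00ABCDEFUL;        /* an address in one region */
        unsigned long offset = addr & 0xFFFF;       /* where in the 64K page    */
        unsigned long tag    = (addr >> 16) & 0xFF; /* which 64K block it was   */

        printf("addr 0x%06lX -> tag 0x%02lX, page offset 0x%04lX\n",
               addr, tag, offset);
        return 0;
    }
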
- "two-way set-associative" --- each region has *two* possible lots (the cache
- must be at least twice as large as a direct-mapped cache for the same amount of
- memory). Because you can cache two sections of any given region, your odds of
- not having to fetch from DRAM (and, hence, your effective speed) go up.
-
There are also "four-way" caches. In general, an n-way cache has n pages per
region and improves your effective speed by some factor proportional to n.
However, for n > 1 you need some auxiliary SRAM storage for a beast called a
"tag table", and pay some computation overhead on each fetch. Diminishing
returns set in quickly, so one does not commonly see five-way or higher caches.

It follows from the above that a "two-way" cache will actually need 128K per
16MB, possibly plus some tag storage (typically around 2K, though the storage
may be on the cache controller chip itself). And a four-way cache would need
256K per 16MB.

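The same arithmetic in runnable form, following the rule of thumb above (data
SRAM only; tag storage is extra and often lives on the controller chip):

    #include <stdio.h>

    /* 64K of data SRAM per 16MB region for a direct-mapped (one-way)
     * design, times n for an n-way cache. */
    int main(void)
    {
        int ways, dram_mb;

        for (ways = 1; ways <= 4; ways *= 2)
            for (dram_mb = 16; dram_mb <= 64; dram_mb *= 2) {
                int regions = (dram_mb + 15) / 16;  /* 16MB regions, rounded up */
                printf("%d-way, %2dMB DRAM: %3dK data SRAM\n",
                       ways, dram_mb, regions * 64 * ways);
            }
        return 0;
    }
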
- "MFA" --- Most Frequently Accessed. A controller that supports MFA caching
- allocates cache pages in a fundamentally different way. Instead of doing
- simple address-mapping, it tracks the frequency of reference to each section of
- main memory that it picks up. When a new cache page is needed, the one
- occupying the least frequently used slot is tossed. This is very similar to
- the way OSs like UNIX manage virtual memory, and it's quite effective.
-
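In software terms, that replacement policy might look something like this
sketch (the slot count, the counters, and the reference string are all
invented for illustration; the real thing happens in controller hardware):

    #include <stdio.h>

    #define SLOTS 4

    /* Each slot remembers which block it holds and how often it has been
     * hit; on a miss, the least frequently used slot gets recycled. */
    struct slot { long block; long count; };

    static struct slot slots[SLOTS] = { {-1, 0}, {-1, 0}, {-1, 0}, {-1, 0} };

    static void reference(long block)
    {
        int i, victim = 0;

        for (i = 0; i < SLOTS; i++)
            if (slots[i].block == block) {     /* hit: just bump the count  */
                slots[i].count++;
                return;
            }
        for (i = 1; i < SLOTS; i++)            /* miss: find the least used */
            if (slots[i].count < slots[victim].count)
                victim = i;
        slots[victim].block = block;           /* toss it and reload        */
        slots[victim].count = 1;
    }

    int main(void)
    {
        long trace[] = { 1, 2, 1, 3, 1, 4, 5, 1, 2 };  /* made-up references */
        int i;

        for (i = 0; i < (int)(sizeof trace / sizeof trace[0]); i++)
            reference(trace[i]);
        for (i = 0; i < SLOTS; i++)
            printf("slot %d: block %ld, referenced %ld times\n",
                   i, slots[i].block, slots[i].count);
        return 0;
    }
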
The 8K primary cache on the 486 is write-through, four-way set-associative.
For UNIX use, try to get a write-back external cache, preferably with two- or
four-way set associativity and/or MFA. These features used to be quite
expensive, but newer cache controllers like the SC82C348 are bringing them
within everyone's reach.
-------------------------------- CUT HERE ------------------------------------
--
Eric S. Raymond <esr@snark.thyrsus.com>