Xref: sparky comp.arch:12244 comp.sys.ibm.pc.hardware:35573
Path: sparky!uunet!haven.umd.edu!darwin.sura.net!gatech!psuvax1!rutgers!cbmvax!snark!esr
From: esr@snark.thyrsus.com (Eric S. Raymond)
Newsgroups: comp.arch,comp.sys.ibm.pc.hardware
Subject: Request For Comments --- IBM-clone cache-design tutorial for Buyer's Guide
Message-ID: <1k5rJG#1wnsqp9bCpVr2wtwkc4PyLSl=esr@snark.thyrsus.com>
Date: 11 Jan 93 23:43:56 GMT
Followup-To: poster
Lines: 141

I want to add a section on cache design and its performance impact to my
386/486 Hardware Buyers' Guide. This is the draft. Please email your comments
if you think you see an error or significant omission.

-------------------------------- CUT HERE ------------------------------------
C. Cache Flow

The most obscure of the important factors in the performance of a UNIX 486
system is the motherboard's memory cache size and design. The two questions
performance-minded buyers have to deal with are: (1) does the cache design
of a given motherboard work with UNIX, and (2) how much cache SRAM should
my system have?

Before normal clock speeds hit two digits in MHz, cache design wasn't a big
issue. But DRAM's memory-cycle times just aren't fast enough to keep up with
today's processors. Thus, your machine's memory controller caches memory
references in faster static RAM, reading from main memory in chunks that the
board designer hopes will be large enough to keep the CPU continuously fed
under a typical job load. If the cache system fails to work, the processor
will be slowed down to less than the memory's real access speed --- which,
given January 1993's typical 60ns DRAM parts, is about 16MHz.

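Where does that 16MHz figure come from? It's just the reciprocal of the
access time. Here's the back-of-the-envelope version in C (a deliberately
crude model; it ignores page-mode DRAM tricks, refresh, and bus overhead):

    #include <stdio.h>

    /* Crude rule of thumb: the fastest zero-wait-state clock a memory
     * part can keep up with is roughly 1 / (access time). */
    int main(void)
    {
        double dram_ns = 60.0;   /* typical DRAM part, January 1993 */
        double sram_ns = 20.0;   /* typical cache SRAM, January 1993 */

        printf("60ns DRAM -> about %.1f MHz\n", 1000.0 / dram_ns); /* ~16.7 */
        printf("20ns SRAM -> about %.1f MHz\n", 1000.0 / sram_ns); /* 50.0  */
        return 0;
    }
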
The 486 includes an 8K cache right on the processor chip. If memory accesses
were reliably sequential and well-localized, this would be fine.
Unfortunately, one side-effect of what's today considered "good programming
practice", with high-level languages using a lot of subroutine calls, is that
the program counter of a typical process hops around like crazy; locality is
really poor. This gives the caching system a workout. UNIX makes the problem
worse, because clock interrupts and other effects of multitasking design
degrade locality still further.

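To see what "poor locality" means in practice, here's a toy C program (not a
benchmark of any real board; the array size and stride are invented for the
example). Both loops compute the same sum, but the second one hops across the
array in big strides and so keeps missing a small cache:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)             /* 4MB of ints, bigger than any cache */

    int main(void)
    {
        int *a = malloc(N * sizeof *a);
        long seq = 0, strided = 0;
        int i, j;

        if (a == NULL)
            return 1;
        for (i = 0; i < N; i++)
            a[i] = 1;

        for (i = 0; i < N; i++)     /* good locality: neighbors hit the cache */
            seq += a[i];

        for (j = 0; j < 4096; j++)  /* poor locality: 16KB hops defeat it */
            for (i = j; i < N; i += 4096)
                strided += a[i];

        printf("sums: %ld %ld (identical; only the speed differs)\n",
               seq, strided);
        free(a);
        return 0;
    }
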
Thus, the 486's 8K internal primary cache is typically supplemented with an
external caching system using SRAM; in January 1993, 20ns SRAM is typical.
The size and design of your motherboard cache is one of the most critical
factors in your system's real performance.

Unfortunately, cache design is a complicated black art, and cache performance
isn't easy to predict or measure, especially under the rapidly variable
system loads characteristic of UNIX. Thus, the best advice your humble editor
can give is a collection of rules of thumb. Your mileage may vary...

Rule 1: Buy only motherboards that have been tested with UNIX
One of DOS's many sins is that it licenses poor hardware design; it's too
brain-dead to stretch the cache system much. Thus, bad cache designs that
will run DOS can completely hose UNIX, slowing the machine to a crawl or even
(in extreme cases) causing frequent random panics. Make sure your motherboard
or system has been tested with some UNIX variant.

Rule 2: Be sure you get enough cache
If your motherboard offers multiple cache sizes, make sure you know how much
is required to service the DRAM you plan to install. If possible, fill the
cache all the way -- cache-speed SRAM is getting pretty cheap and this may
save you many hassles later.
Bela Lubkin writes: "Excess RAM [over what your cache can support] is a very
bad idea: most designs prevent memory outside the external cache's cachable
range from being cached by the 486 internal cache either. Code running from
this memory runs up to 11 times slower than code running out of fully cached
memory."

Rule 3: "Enough cache" is at least 64K per 16MB of DRAM
Hardware caches are usually designed to achieve effective 0-wait-state
operation, rather than perform any significant buffering of data. As a
general rule applicable to all clones, a 64K cache handles up to 16MB of
memory; in a "direct-mapped" cache design (typical for clone hardware) more
is redundant. We'll have more to say about cache sizes below.

Rule 4: If possible, max out the board's cache -- it will save hassles later
Bela continues: "Get the largest cache size your motherboard supports, even
if you're not fully populating it with RAM. The motherboard manufacturer buys
cache chips in quantity, knows how to install them correctly, and you won't end
up throwing out the small chips later when you upgrade your main RAM."

A lot of fast chips are held back by poor cache systems and slow memory. The
50DX has a particular problem this way, because its clock cycle (20ns at
50MHz) is exactly as long as the access time of a 20ns cache SRAM, leaving no
margin. To avoid trouble, cloners often insert wait states at the cache,
slowing down the 50DX to the effective speed of a 50DX/2.

Worse than this, a lot of cloners have taken the 50DX/2 and 66DX/2 as
invitations to reuse old 25- and 33MHz board designs without change. The
trouble is that these chips take a double hit for each wait state, because
the wait states are timed by *external* cycles. And there can be lots of
them; a look at the CMOS setup screen of most 33MHz and 50MHz systems will
usually reveal many wait states.

Now for some basic cache-design terminology.

- "write-through" --- it wouldn't do to let your cache get out of sync with
- main memory. The simplest and slowest way to handle this is to arrange that
- every write to cache generates the corresponding write to main store. In
- effect, then, you only get cache speedup on reads.
-
- "write-back" --- for each cache address range in DRAM, writes are done to
- cache only until a new block has to be fetched due to out-of-range access (at
- which point the old one is flushed to DRAM). This is much faster, because
- you get cache speedup on writes as well as reads. It's also more expensive
- and trickier to get right.
-
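To see the two policies side by side, here's a toy one-line cache model in C.
It's an illustration only; the line size, the all-writes access pattern, and
the single line are invented, and a real controller tracks many lines at once:

    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE 16                 /* invented, just for illustration */

    /* What we count is DRAM traffic: write-through pays a DRAM write on
     * every store, write-back only when a dirty line gets evicted. */
    struct line { unsigned long tag; int valid, dirty; };

    static struct line l;
    static long dram_reads, dram_writes;

    static void touch(unsigned long addr, int is_write, int write_back)
    {
        unsigned long tag = addr / LINE_SIZE;

        if (!l.valid || l.tag != tag) {       /* miss: refill the line     */
            if (write_back && l.valid && l.dirty)
                dram_writes++;                /* flush the old dirty line  */
            dram_reads++;
            l.tag = tag;
            l.valid = 1;
            l.dirty = 0;
        }
        if (is_write) {
            if (write_back)
                l.dirty = 1;                  /* defer the DRAM write      */
            else
                dram_writes++;                /* write-through: pay it now */
        }
    }

    int main(void)
    {
        int policy;
        unsigned long a;

        for (policy = 0; policy <= 1; policy++) {
            memset(&l, 0, sizeof l);
            dram_reads = dram_writes = 0;
            for (a = 0; a < 1024; a++)        /* 1024 sequential stores    */
                touch(a, 1, policy);
            printf("%-14s %ld DRAM reads, %ld DRAM writes\n",
                   policy ? "write-back:" : "write-through:",
                   dram_reads, dram_writes);
        }
        return 0;
    }
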
To understand the next two terms, you need to think of main RAM as being
divided into consecutive, non-overlapping segments we'll call "regions". A
typical region size is 16MB. Think of cache SRAM as being divided into
similar, but much smaller segments (typically 64K each) which we'll call
"pages". When the processor reads from an address in a given region, and that
address is not already in the cache, the location and the 64K around it are
read into a page.

- "direct-mapped" --- describes a cache system in which each region has
- exactly one corresponding page (also called "one-way cache"). Typically,
- you get the page address in cache by throwing away the top 16 bits of the
- region address.
-
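As a concrete sketch of that mapping (using the 16MB-region / 64K-page model
above; real controllers work in much smaller lines, and the example address
is made up):

    #include <stdio.h>

    /* The low 16 bits pick a byte within the 64K page; the 8 bits above
     * them, within the 16MB region, form the tag that must match on a hit. */
    int main(void)
    {
        unsigned long addr   = 0x00ABCDEFUL;        /* an address in one region */
        unsigned long offset = addr & 0xFFFF;       /* where in the 64K page    */
        unsigned long tag    = (addr >> 16) & 0xFF; /* which 64K block it was   */

        printf("addr 0x%06lX -> tag 0x%02lX, page offset 0x%04lX\n",
               addr, tag, offset);
        return 0;
    }
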
- "two-way set-associative" --- each region has *two* possible lots (the cache
- must be at least twice as large as a direct-mapped cache for the same amount of
- memory). Because you can cache two sections of any given region, your odds of
- not having to fetch from DRAM (and, hence, your effective speed) go up.
-
There are also "four-way" caches. In general, an n-way cache has n pages per
region and improves your effective speed by some factor proportional to n.
However, for n > 1 you need some auxiliary SRAM storage for a beast called a
"tag table", and pay some computation overhead on each fetch. Diminishing
returns set in quickly, so one does not commonly see five-way or higher caches.

It follows from the above that a "two-way" cache will actually need 128K per
16MB, possibly plus some tag storage (typically around 2K, though the storage
may be on the cache controller chip itself). And a four-way cache would need
256K per 16MB.

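The same arithmetic in runnable form, following the rule of thumb above (data
SRAM only; tag storage is extra and often lives on the controller chip):

    #include <stdio.h>

    /* 64K of data SRAM per 16MB region for a direct-mapped (one-way)
     * design, times n for an n-way cache. */
    int main(void)
    {
        int ways, dram_mb;

        for (ways = 1; ways <= 4; ways *= 2)
            for (dram_mb = 16; dram_mb <= 64; dram_mb *= 2) {
                int regions = (dram_mb + 15) / 16;  /* 16MB regions, rounded up */
                printf("%d-way, %2dMB DRAM: %3dK data SRAM\n",
                       ways, dram_mb, regions * 64 * ways);
            }
        return 0;
    }
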
- "MFA" --- Most Frequently Accessed. A controller that supports MFA caching
- allocates cache pages in a fundamentally different way. Instead of doing
- simple address-mapping, it tracks the frequency of reference to each section of
- main memory that it picks up. When a new cache page is needed, the one
- occupying the least frequently used slot is tossed. This is very similar to
- the way OSs like UNIX manage virtual memory, and it's quite effective.
-
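In software terms, that replacement policy might look something like this
sketch (the slot count, the counters, and the reference string are all
invented for illustration; the real thing happens in controller hardware):

    #include <stdio.h>

    #define SLOTS 4

    /* Each slot remembers which block it holds and how often it has been
     * hit; on a miss, the least frequently used slot gets recycled. */
    struct slot { long block; long count; };

    static struct slot slots[SLOTS] = { {-1, 0}, {-1, 0}, {-1, 0}, {-1, 0} };

    static void reference(long block)
    {
        int i, victim = 0;

        for (i = 0; i < SLOTS; i++)
            if (slots[i].block == block) {     /* hit: just bump the count  */
                slots[i].count++;
                return;
            }
        for (i = 1; i < SLOTS; i++)            /* miss: find the least used */
            if (slots[i].count < slots[victim].count)
                victim = i;
        slots[victim].block = block;           /* toss it and reload        */
        slots[victim].count = 1;
    }

    int main(void)
    {
        long trace[] = { 1, 2, 1, 3, 1, 4, 5, 1, 2 };  /* made-up references */
        int i;

        for (i = 0; i < (int)(sizeof trace / sizeof trace[0]); i++)
            reference(trace[i]);
        for (i = 0; i < SLOTS; i++)
            printf("slot %d: block %ld, referenced %ld times\n",
                   i, slots[i].block, slots[i].count);
        return 0;
    }
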
The 8K primary cache on the 486 is write-through, four-way set-associative.
For UNIX use, try to get a write-back external cache, preferably with two- or
four-way set associativity and/or MFA. These features used to be quite
expensive, but newer cache controllers like the SC82C348 are bringing them
within everyone's reach.
-------------------------------- CUT HERE ------------------------------------
--
Eric S. Raymond <esr@snark.thyrsus.com>