NetNews Usenet Archive 1992 #19


Text File | 1992-09-07 | 5.9 KB | 139 lines

Newsgroups: comp.arch
Path: sparky!uunet!gumby!wupost!cs.utexas.edu!convex!hamrick
From: hamrick@convex.com (Ed Hamrick)
Subject: Re: alpha and vector processing...
Message-ID: <1992Sep04.211955.22449@convex.com>
Sender: usenet@convex.com (news access account)
Nntp-Posting-Host: convex1.convex.com
Organization: CONVEX Computer Corporation, Richardson, Tx., USA
References: <1992Sep4.120503.1@uwovax.uwo.ca>
Date: Fri, 04 Sep 1992 21:19:55 GMT
X-Disclaimer: This message was written by a user at CONVEX Computer Corp. The opinions expressed are those of the user and not necessarily those of CONVEX.
Lines: 123

> From: brent@uwovax.uwo.ca (Brent Sterner)
> Not sure why this crossed my mind, but I'm curious about how well
> the alpha architecture might support vector processing. I'm familiar
> with the vector processing concepts as implemented for VAX 6000 and
> 9000 systems, as well as the implementation on the Cyber 2000. Can
> anyone comment on alpha futures in this direction? Thanks, b.

Here are several tidbits from the Alpha Architecture Reference Manual:

A.1
...
For Alpha, this means eventual 128-bit wide data paths.
...
1. Small (first-level) cache sizes will likely be in the range 2 KB to 64 KB
2. Small cache block sizes will likely be 16, 32, 64, or 128 bytes
3. Large (second- or third-level) cache sizes will likely be in the range 128 KB to 8 MB
4. Large cache block sizes will likely be 32, 64, 128, or 256 bytes

A.3.1
...
In some implementations, a series of writes that completely fill a cache block may be a factor of 10 faster than a series of writes that partially fill a cache block, when that cache block would give a read miss.
...
...
Implementors should give first priority to fast reads of aligned octawords and second priority to fast writes of full cache blocks. Partial-quadword writes need not have a fast repetition rate.

A.3.2
...
<stuff about cache misses>
...
Such accesses will likely be a factor of 30 slower than cache hits.
...

A.3.4
...
To avoid overrunning memory bandwidth, sequences of more than eight quadword Loads or Stores should be broken up with intervening instructions (if there is any useful work to be done).

For consecutive reads, implementors should give first priority to prefetching ascending cache blocks, and second priority to absorbing up to eight consecutive quadword Loads (aligned on a 64-byte boundary) without stalling.

For consecutive writes, implementors should give first priority to avoiding read overhead for fully written aligned cache blocks, and second priority to absorbing up to eight consecutive quadword Stores (aligned on a 64-byte boundary) without stalling.

A.3.5
1. Assume that at most two FETCH instructions can be outstanding at once ...
2. Assume, for maximum efficiency, that there should be about 64 unrelated memory access instructions (load or store) between a FETCH and the first actual data access to the prefetched data.
...
4. Assume that FETCH is worthwhile if, on average, at least half the data in a block will be accessed. Assume FETCH_M is worthwhile if, on average, at least half the data in a block will be modified.
5. Treat FETCH as a vector load. If a piece of code could usefully prefetch 4 operands, launch the first two prefetches, do about 128 memory references worth of work, then launch the next two prefetches, do about 128 more memory references worth of work, then start using the 4 sets of prefetched data.
6. Treat FETCH as having the same effect on a cache as a series of 64 quadword loads. If the loads would displace useful data, so will FETCH. If two sets of loads from specific addresses will thrash in a direct-mapped cache, so will two FETCH instructions using the same pair of addresses.

How well will the Alpha architecture do on problems well-mapped to vector architectures?
Two types of vector codes will run well on Alpha: small problems with small data sets that have a good cache hit rate, and large problems with unity-stride memory access patterns. Codes that will run very poorly (10x speed degradation) have non-unity-stride memory accesses or scatter/gather memory accesses.

Long-term (25 years from now <grin>), Alpha will be faced with the problem of having a BazillionHz processor that spends all of its time waiting on memory loads and stores to complete. The limiting factor for performance on vector codes (most accesses being cache misses) will be the limited number of registers in the architecture. Since there are only 32 floating-point registers, the architecture is limited to 32 outstanding floating-point loads to the memory system. The key question is, 25 years from now, will main memory be more than 32 cycles away (or 64, or 256, or 1024)?

Vector applications that ran well on the VAX-6000 vector unit and/or the VAX-9000 vector unit will run fairly well on high-end Alpha systems, assuming DEC does enough tuning of the generated code for Alpha. Applications that ran poorly on these systems will also run poorly on high-end Alpha systems. I believe the vector units on the VAX-6000 and VAX-9000 were failures and were dropped by DEC because of the limited need in the market for vector units that only run well from cache or with unity-stride accesses.

DEC has clearly designed the Alpha architecture for applications with data accesses that have good cache hit rates. Vector architectures tend to excel on applications with poor cache hit rates.

It's puzzling why Cray chose Alpha for their first MPP product (Y-MP add-on array processor), given Alpha's weaknesses on traditional vector applications. It's possible that Cray is doing this to offset their rather weak scalar performance and lack of a scalar cache.
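The register-limit argument above can be made concrete with a little arithmetic. What follows is a minimal sketch and not something taken from the post or the Alpha manual. It assumes every load misses the cache, that each in-flight load ties up one of the 32 floating-point registers, and that the processor can issue at most one load per cycle. The function name and the one-load-per-cycle cap are hypothetical choices made for illustration.

```python
def max_loads_per_cycle(outstanding_loads: int, latency_cycles: int) -> float:
    """Upper bound on sustained load rate when every load misses the cache.

    By Little's law, throughput = loads in flight / latency. The result is
    capped at 1.0 because this sketch assumes (hypothetically) that at most
    one load can be issued per cycle.
    """
    if outstanding_loads <= 0 or latency_cycles <= 0:
        raise ValueError("both arguments must be positive")
    return min(1.0, outstanding_loads / latency_cycles)


if __name__ == "__main__":
    registers = 32  # floating-point registers, per the post
    # Memory latencies (in cycles) that the post asks about.
    for latency in (32, 64, 256, 1024):
        rate = max_loads_per_cycle(registers, latency)
        print(f"latency {latency:4d} cycles: at most {rate} loads per cycle")
```

Under these assumptions the bound stays at one load per cycle while memory is 32 cycles away, then falls to one load every 8 cycles at a 256-cycle latency and one every 32 cycles at 1024.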
All in all, the Alpha seems to be an excellent piece of work by DEC to transition the VAX product line to new technology, but it isn't very well suited to vector processing problems.

Regards,
Ed Hamrick