- Newsgroups: comp.arch
- Path: sparky!uunet!gumby!wupost!cs.utexas.edu!convex!hamrick
- From: hamrick@convex.com (Ed Hamrick)
- Subject: Re: alpha and vector processing...
- Message-ID: <1992Sep04.211955.22449@convex.com>
- Sender: usenet@convex.com (news access account)
- Nntp-Posting-Host: convex1.convex.com
- Organization: CONVEX Computer Corporation, Richardson, Tx., USA
- References: <1992Sep4.120503.1@uwovax.uwo.ca>
- Date: Fri, 04 Sep 1992 21:19:55 GMT
- X-Disclaimer: This message was written by a user at CONVEX Computer
- Corp. The opinions expressed are those of the user and
- not necessarily those of CONVEX.
- Lines: 123
-
- > From: brent@uwovax.uwo.ca (Brent Sterner)
- > Not sure why this crossed my mind, but I'm curious about how well
- > the alpha architecture might support vector processing. I'm familiar
- > with the vector processing concepts as implemented for VAX 6000 and
- > 9000 systems, as well as the implementation on the Cyber 2000. Can
- > anyone comment on alpha futures in this direction? Thanks, b.
-
- Here are several tidbits from the Alpha Architecture Reference Manual:
-
- A.1
-
- ... For Alpha, this means eventual 128-bit wide data paths. ...
-
- 1. Small (first-level) cache sizes will likely be in the range
- 2 KB to 64 KB
-
- 2. Small cache block sizes will likely be 16, 32, 64, or 128 bytes
-
- 3. Large (second- or third-level) cache sizes will likely be in
- the range 128 KB to 8 MB
-
- 4. Large cache block sizes will likely be 32, 64, 128, or 256 bytes
-
- A.3.1
-
- ... In some implementations, a series of writes that completely
- fill a cache block may be a factor of 10 faster than a series of
- writes that partially fill a cache block, when that cache block
- would give a read miss. ...
-
- ... Implementors should give first priority to fast reads of aligned
- octawords and second priority to fast writes of full cache blocks.
- Partial-quadword writes need not have a fast repetition rate.
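- The A.3.1 advice can be sketched in C. This is my illustration, not the
- manual's; the 64-byte block size and the function names are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64  /* assumed cache-block size in bytes */

/* Writes every quadword, so each cache block it touches is
   completely filled -- no read-fill of the block is needed. */
void fill_full(uint64_t *dst, size_t nquads, uint64_t v)
{
    for (size_t i = 0; i < nquads; i++)
        dst[i] = v;
}

/* Writes one quadword per BLOCK bytes; every store partially
   fills a block, so the hardware must first read the block in
   (the slow case A.3.1 warns about). */
void fill_partial(uint64_t *dst, size_t nquads, uint64_t v)
{
    for (size_t i = 0; i < nquads; i += BLOCK / sizeof(uint64_t))
        dst[i] = v;
}
```

- Both loops store the same value; only the write pattern differs, and that
- difference in pattern is exactly what the manual says can cost a factor of 10.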
-
- A.3.2
-
- ... <stuff about cache misses> ... Such accesses will likely be a
- factor of 30 slower than cache hits. ...
-
- A.3.4
-
- ... To avoid overrunning memory bandwidth, sequences of more than
- eight quadword Loads or Stores should be broken up with intervening
- instructions (if there is any useful work to be done).
-
- For consecutive reads, implementors should give first priority to
- prefetching ascending cache blocks, and second priority to absorbing
- up to eight consecutive quadword Loads (aligned on a 64-byte boundary)
- without stalling.
-
- For consecutive writes, implementors should give first priority to
- avoiding read overhead for fully written aligned cache blocks, and
- second priority to absorbing up to eight consecutive quadword
- Stores (aligned on a 64-byte boundary) without stalling.
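- As a rough compiler's-eye sketch of the A.3.4 advice (my example; C source
- order only suggests, and does not guarantee, the issue order):

```c
#include <stddef.h>
#include <stdint.h>

/* Sums n quadwords, n a multiple of 8. Each iteration issues a
   group of eight consecutive aligned quadword loads, then breaks
   the load stream up with useful work (the adds) before the next
   group -- the pattern A.3.4 recommends. */
uint64_t sum8(const uint64_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i += 8) {
        uint64_t t0 = a[i]     + a[i + 1];
        uint64_t t1 = a[i + 2] + a[i + 3];
        uint64_t t2 = a[i + 4] + a[i + 5];
        uint64_t t3 = a[i + 6] + a[i + 7];
        s += (t0 + t1) + (t2 + t3);
    }
    return s;
}
```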
-
- A.3.5
-
- 1. Assume that at most two FETCH instructions can be outstanding at
- once ...
-
- 2. Assume, for maximum efficiency, that there should be about 64
- unrelated memory access instructions (load or store) between a
- FETCH and the first actual data access to the prefetched data. ...
-
- 4. Assume that FETCH is worthwhile if, on average, at least half the
- data in a block will be accessed. Assume FETCH_M is worthwhile if,
- on average, at least half the data in a block will be modified.
-
- 5. Treat FETCH as a vector load. If a piece of code could usefully
- prefetch 4 operands, launch the first two prefetches, do about
- 128 memory references worth of work, then launch the next two
- prefetches, do about 128 more memory references worth of work,
- then start using the 4 sets of prefetched data.
-
- 6. Treat FETCH as having the same effect on a cache as a series of
- 64 quadword loads. If the loads would displace useful data, so
- will FETCH. If two sets of loads from specific addresses will
- thrash in a direct-mapped cache, so will two FETCH instructions
- using the same pair of addresses.
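- Item 5 might look like the following simplified sketch in C. FETCH itself
- is an Alpha instruction with no C spelling; I'm using GCC's
- __builtin_prefetch as a stand-in, and the 512-byte block size comes from
- item 6's "64 quadword loads" (64 * 8 bytes):

```c
#include <stddef.h>

#define FETCH_BYTES 512                        /* 64 quadwords, per item 6 */
#define FETCH_DBLS  (FETCH_BYTES / sizeof(double))

/* Walks the array one FETCH-sized block at a time, launching a
   prefetch two blocks ahead so the data arrives while the current
   block's work (the adds) is still in flight -- item 5's pattern,
   simplified to one prefetch per block instead of two at a time. */
double sum_prefetched(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % FETCH_DBLS == 0 && i + 2 * FETCH_DBLS < n)
            __builtin_prefetch(&a[i + 2 * FETCH_DBLS], 0, 0);
        s += a[i];
    }
    return s;
}
```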
-
- How well will the Alpha architecture do on problems well-mapped to vector
- architectures?
-
- Two types of vector codes will run well on Alpha - small problems with
- small data sets that have a good cache hit rate, and large problems with
- unity-stride memory access patterns.
-
- Codes with non-unity-stride or scatter/gather memory access patterns will
- run very poorly, with on the order of a 10x speed degradation.
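- The stride distinction is easy to see in C (my example; the 64x64 geometry
- is arbitrary):

```c
#define N 64

/* Unity stride: consecutive addresses, at most one cache miss per
   block of elements -- the pattern Alpha is designed for. */
double sum_by_rows(const double m[N][N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Stride of N * 8 bytes: once the row length exceeds the cache-block
   size, each access lands in a different block, so nearly every
   access can miss -- the 10x-degradation case above. */
double sum_by_cols(const double m[N][N])
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

- Both functions compute the same sum; only the order of the memory accesses
- differs.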
-
- Long-term (25 years from now <grin>), Alpha will be faced with the problem
- of having a BazillionHz processor that spends all of its time waiting on
- memory loads and stores to complete. The limiting factor for performance
- on vector codes (most accesses being cache misses) will be the limited
- number of registers in the architecture. Since there are only 32 floating
- point registers, the architecture is limited to 32 outstanding floating-point
- loads to the memory system. The key question is, 25 years from now, will
- main memory be more than 32 cycles away (or 64, or 256, or 1024)?
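- The register-count limit can be put in rough numbers. This is my
- back-of-the-envelope framing (essentially Little's law), not anything from
- the manual: to sustain one load per cycle with memory `latency` cycles
- away, you need `latency` loads in flight, and 32 registers cap the
- in-flight count:

```c
/* Fraction of one-load-per-cycle throughput sustainable when at
   most `regs` loads can be outstanding and each takes
   `latency_cycles` to return. */
double peak_fraction(int regs, int latency_cycles)
{
    if (latency_cycles <= regs)
        return 1.0;            /* registers are not the bottleneck */
    return (double)regs / latency_cycles;
}
```

- With 32 registers and memory 256 cycles away, that's 32/256 = 1/8 of peak,
- which is exactly the worry above.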
-
- Vector applications that ran well on the VAX-6000 vector unit and/or the
- VAX-9000 vector unit will run fairly well on high-end Alpha systems, assuming
- DEC does enough tuning of the generated code for Alpha. Applications that
- ran poorly on these systems will also run poorly on high-end Alpha systems.
-
- I believe the vector units on the VAX-6000 and VAX-9000 were failures and
- were dropped by DEC because of the limited market need for vector units
- that only run well from cache or with unity-stride accesses.
-
- DEC has clearly designed the Alpha architecture for applications with
- data accesses that have good cache hit rates. Vector architectures tend
- to excel on applications with poor cache hit rates.
-
- It's puzzling why Cray chose Alpha for their first MPP product (Y-MP
- add-on array processor), given Alpha's weaknesses on traditional vector
- applications. It's possible that Cray is doing this to offset their
- rather weak scalar performance and lack of a scalar cache.
-
- All in all, the Alpha seems to be an excellent piece of work by DEC to
- transition the VAX product line to new technology, but it isn't very
- well suited to vector processing problems.
-
- Regards,
- Ed Hamrick
-