Xref: sparky comp.arch:11830 comp.sys.intel:2802
Newsgroups: comp.arch,comp.sys.intel
Path: sparky!uunet!timbuk.cray.com!walter.cray.com!ferris!bradc
From: bradc@ferris.cray.com (Bradley R. Carlile)
Subject: Re: Superscalar vs. multiple CPUs ?
Message-ID: <1992Dec21.134909.6185@walter.cray.com>
Keywords: VLIW, vector, superscalar
Lines: 87
Nntp-Posting-Host: ferris.cray.com
Organization: Cray Research, Inc.
References: <PCG.92Dec9154602@aberdb.aber.ac.uk> <1992Dec9.211737.23911@walter.cray.com> <PCG.92Dec11162630@aberdb.aber.ac.uk>
Date: 21 Dec 92 13:49:09 CST

> On 10 Dec 92 03:17:36 GMT, bradc@ferris.cray.com (Bradley R. Carlile) said:
>
> pcg> as far as I can see, 6 instruction issue per cycle is virtually
> pcg> pointless. The *limit* of superscalarity present in general purpose
>                     ^^^^^^^^^^^^^^^
> Piercarlo Grandi writes:
> Note the "general purpose"; if the codes exhibit high regularity in data
> access patterns then they are no longer "general purpose" codes, at
> least in my understanding of that term, which encompasses things like
> editors, databases, compilers, word processors, spreadsheets, ...

Well, these general purpose codes also involve code sections that can be
software pipelined: things like searching, sorting, list updates, ...
As I see it, there are several levels of parallelism, and we want to use
different techniques at different granularities (instruction, machine,
cluster, network, ...).  I agree there is a limit to instruction-level
parallelism that depends on the code, but we could use more than 4.
We may need to issue more than 2 instructions per cycle to get that
average level of speedup.

> pcg> codes is 4, and actually we are hard pressed to find many codes
> pcg> with superscalarity higher than 2.

> bradc> The limit of 4 or 6 may be the limit when programming a superscalar
> bradc> chip by simply letting the chip group instructions
> bradc> together.  *However*, one can use the technique of software
> bradc> pipelining as we used to on the VLIW machines of yesteryear:
> bradc> the FPS-120B (before 1976), FPS-164, and the FPS-264 (1985?).
>
> Ah yes, well known, for codes well suited to SIMD style computing. But
> frankly the jury is still out on software pipelining even in that case.
> It covers, like LIW/VLIW, a gray area between [super]scalar and vector,
> one maybe suited to short vector lengths.
>
> bradc> These machines could issue up to 10 instruction every cycle {I
> bradc> wrote software for 7 years that used these instructions}.
>
> But I still think that a proper vector processor is overall, except for
> particular cases, a better bet, especially because memory queues fit in
> more easily with a vector architecture than with everything else, and
> SIMD-style codes are great bandwidth eaters. And one can design vector
> architectures that do perform well even for the short vector length for
> which a LIW/VLIW seems designed.

My point here would be that VLIW is more general than vector processing;
my experience shows that more codes/loops can be software pipelined than
vectorized.

> Also, If you can put to good use software pipelining, and issue 10
> instructions per cycle, this means you have the same order of magnitude
> memory transactions, and you need the same order of magnitude register
> file depths, and so on. This looks like calling for vector instructions,
> vector registers, vector memory access.

You are not necessarily adding memory transactions.  Only two of those
10 instructions that I mentioned earlier were memory operations.  Memory issues
can be separated from the manner in which data is moved.

We need to be able to use different techniques at different levels: VLIW for
effective instruction issue, and other techniques to use caches like vector
registers.  An example of this is the CRAY APP (an 84-processor shared-memory
machine that uses these techniques) [I do not include this as an advertisement,
but as a proof of concept].

> bradc> In addition, compilers can automatically perform software
> bradc> pipelining. At FPS we had several. A "modern" example of a
> bradc> compiler includes Portland Group Inc.'s i860 compiler.
>
> Fascinating compilers, but the i860 story (which is often considered a
> sort of LIW architecture) seems to support my impressions: it would have
> been easier, even for the limited degree of parallelism available in the
> i860, to have proper vector style instructions; the ones that do exploit
> the multiple functional units of the i860 really amount to such, only
> with loads of complications and hazards, and difficulties with keeping
> the pipes fed. Now the i860 is a particularly awkward quasi-LIW design,
> and it should not be taken as a straw man, but still...

I think that the i860 is a VLIW processor (no one seems to use the term LIW
anymore).  However, we have used software pipelining effectively on
superscalar machines.  I agree the i860 has problems, but I feel I could
get codes to run faster on a VLIW (and more of them) than on a vector processor
of the same clock rate.

My basic point is that I think the VLIW is just a programmable vector processor.

Brad Carlile
Cray Research Superservers, Inc.
bradc@oregon.cray.com