Xref: sparky comp.arch:11830 comp.sys.intel:2802
Newsgroups: comp.arch,comp.sys.intel
Path: sparky!uunet!timbuk.cray.com!walter.cray.com!ferris!bradc
From: bradc@ferris.cray.com (Bradley R. Carlile)
Subject: Re: Superscalar vs. multiple CPUs ?
Message-ID: <1992Dec21.134909.6185@walter.cray.com>
Keywords: VLIW, vector, superscalar
Lines: 87
Nntp-Posting-Host: ferris.cray.com
Organization: Cray Research, Inc.
References: <PCG.92Dec9154602@aberdb.aber.ac.uk> <1992Dec9.211737.23911@walter.cray.com> <PCG.92Dec11162630@aberdb.aber.ac.uk>
Date: 21 Dec 92 13:49:09 CST

> On 10 Dec 92 03:17:36 GMT, bradc@ferris.cray.com (Bradley R. Carlile) said:
>
> pcg> as far as I can see, 6 instruction issue per cycle is virtually
> pcg> pointless. The *limit* of superscalarity present in general purpose
>                     ^^^^^^^^^^^^^^^
> Piercarlo Grandi writes:
> Note the "general purpose"; if the codes exhibit high regularity in data
> access patterns then they are no longer "general purpose" codes, at
> least in my understanding of that term, which encompasses things like
> editors, databases, compilers, word processors, spreadsheets, ...

Well, these general purpose codes also involve code sections that can be
software pipelined: things like searching, sorting, list updates, ...
As I see it, there are several levels of parallelism, and we want to use
different techniques at different granularities (instruction, machine,
cluster, network, ...).  I agree there is a limit to instruction-level
parallelism that depends on the code, but we could use more than 4.
We may need to issue more than 2 instructions per cycle to get that
average level of speedup.

> pcg> codes is 4, and actually we are hard pressed to find many codes
> pcg> with superscalarity higher than 2.

> bradc> The limit of 4 or 6 may be the limit when programming a superscalar
> bradc> chip by simply letting the chip group instructions
> bradc> together.  *However*, one can use the technique of software
> bradc> pipelining as we used to on the VLIW machines of yesteryear:
> bradc> the FPS-120B (before 1976), FPS-164, and the FPS-264 (1985?).
>
> Ah yes, well known, for codes well suited to SIMD style computing. But
> frankly the jury is still out on software pipelining even in that case.
> It covers, like LIW/VLIW, a gray area between [super]scalar and vector,
> one maybe suited to short vector lengths.
>
> bradc> These machines could issue up to 10 instruction every cycle {I
> bradc> wrote software for 7 years that used these instructions}.
>
> But I still think that a proper vector processor is overall, except for
> particular cases, a better bet, especially because memory queues fit in
> more easily with a vector architecture than with everything else, and
> SIMD-style codes are great bandwidth eaters. And one can design vector
> architectures that do perform well even for the short vector length for
> which a LIW/VLIW seems designed.

My point here would be that VLIW is more general than vector processing;
my experience shows that more codes/loops can be software pipelined than
vectorized.

> Also, If you can put to good use software pipelining, and issue 10
> instructions per cycle, this means you have the same order of magnitude
> memory transactions, and you need the same order of magnitude register
> file depths, and so on. This looks like calling for vector instructions,
> vector registers, vector memory access.

You are not necessarily adding memory transactions.  Only two of those
10 instructions that I mentioned earlier were memory operations.  Memory issues
can be separated from the manner in which data is moved.

We need to be able to use different techniques at different levels: VLIW for
effective instruction issue, and other techniques to use caches like vector
registers.  An example of this is the CRAY APP (an 84-processor shared-memory
machine that uses these techniques) [I do not include this as an advertisement,
but as a proof of concept].

> bradc> In addition, compilers can automatically perform software
> bradc> pipelining. At FPS we had several. A "modern" example of a
> bradc> compiler includes Portland Group Inc.'s i860 compiler.
>
> Fascinating compilers, but the i860 story (which is often considered a
> sort of LIW architecture) seems to support my impressions: it would have
> been easier, even for the limited degree of parallelism available in the
> i860, to have proper vector style instructions; the ones that do exploit
> the multiple functional units of the i860 really amount to such, only
> with loads of complications and hazards, and difficulties with keeping
> the pipes fed. Now the i860 is a particularly awkward quasi-LIW design,
> and it should not be taken as a straw man, but still...

I think that the i860 is a VLIW processor (no one seems to use the term LIW
anymore).  However, we have used software pipelining effectively on
superscalar machines.  I agree the i860 has problems, but I feel I could
get codes to run faster on a VLIW (and more of them) than on a vector processor
of the same clock rate.

My basic point is that I think the VLIW is just a programmable vector processor.

Brad Carlile
Cray Research Superservers, Inc.
bradc@oregon.cray.com