Xref: sparky comp.sys.intel:2785 comp.arch:11788
Path: sparky!uunet!pipex!warwick!uknet!gdt!aber!fronta.aber.ac.uk!pcg
From: pcg@aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Superscalar vs. multiple CPUs ?
Message-ID: <PCG.92Dec19165907@decb.aber.ac.uk>
Date: 19 Dec 92 16:59:07 GMT
References: <1992Dec7.012026.11482@athena.mit.edu>
	<1992Dec8.000357.26577@newsroom.utas.edu.au>
	<PCG.92Dec9154602@aberdb.aber.ac.uk>
	<1992Dec12.152224.168173@zeus.calpoly.edu>
Sender: news@aber.ac.uk (USENET news service)
Reply-To: pcg@aber.ac.uk (Piercarlo Grandi)
Organization: Prifysgol Cymru, Aberystwyth
Lines: 88
In-Reply-To: mneideng@thidwick.acs.calpoly.edu's message of 12 Dec 92 15:22:24 GMT
Nntp-Posting-Host: decb.aber.ac.uk

On 12 Dec 92 15:22:24 GMT, mneideng@thidwick.acs.calpoly.edu (Mark
Neidengard) said:

mneideng> I see nothing wrong with many instruction units on one chip;
mneideng> all you have to do is either A) use them like a sort of
mneideng> on-chip diastolic pipeline, or B) have the system process
mneideng> scheduler schedule different processes to use different units.

Wonderful ideas! What I would actually like is a dataflow machine, then.

mneideng> I could easily envision a chip with three FP adders and three
mneideng> FP multipliers (maybe not with today's fabrication, but in
mneideng> another year or so...) What you would have to do is increase
mneideng> the sophistication of the chip's internal pipeline controller
mneideng> and do a LOT of instruction predecoding.

Easy said, easy done :-). A lot of research is going into that; as
somebody posted, there is a law that says that exploitable parallelism
is a linear function of the ASPLOS number.

But my point is not about whether it can be done; my point is about
whether it is necessary or even useful.

Given that there has been some confusion as to which issue I thought we
were discussing, I will try to reformulate it.

The important limit to micro/macro parallelism is not the cleverness of
the CPU implementor; it is the inherent parallelism in the
application/algorithm/data structures.

As far as I have read and seen with my own eyes, let me insist, most
"general purpose" codes have limited degrees of exploitable micro/macro
parallelism, say 2 to 4. The underlying reason is that most such codes
are about serial tree/graph walking/updating.
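
To make this concrete, here is a made-up C fragment of the kind I mean
(mine, purely for illustration): each load depends on the result of the
previous one, so no number of functional units can overlap them.

	#include <stddef.h>

	/* Serial list walking: a loop-carried dependence chain. The
	   address of the next node is unknown until the current load
	   completes, so extra FUs mostly sit idle. */
	struct node { struct node *next; int val; };

	int sum_list(struct node *p)
	{
	    int sum = 0;
	    while (p != NULL) {
	        sum += p->val;   /* little work per node... */
	        p = p->next;     /* ...and the next address waits on this load */
	    }
	    return sum;
	}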

Codes with easily "minable" parallelism are, almost invariably, "special
purpose" codes; the underlying reason is that most such codes are about
sequential array scanning/filtering, and even on those something as
trivial as strides may cause problems.
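
For instance (an illustrative fragment of mine, not from any real code):
the first loop below is the easy unit-stride case; the second is just as
parallel on paper, but a large stride can defeat the caches and the
memory banks.

	/* Unit stride: independent iterations, easy to spread across
	   multiple FP units or to feed to a vector unit. */
	void add_unit(int n, double *a, double *b, double *c)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        c[i] = a[i] + b[i];
	}

	/* Strided: the arithmetic is just as independent, but each
	   access may touch a new cache line or hit the same memory
	   bank, so the memory system becomes the limit. */
	void add_strided(int n, int stride, double *a, double *b, double *c)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        c[i] = a[i * stride] + b[i * stride];
	}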

Then, in a distant second position, come "one purpose" codes, for very
special applications, such as those beloved by the NSA. If there is any
underlying commonality to "one purpose" codes, it is probably that they
are about associative access to memory, rather than graph walking or
array scanning.
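
By "associative access" I mean patterns like the following sketch (mine,
and much simplified): the data is addressed by key rather than by
position or by pointer.

	/* Associative access: "given a key, find the matching entry",
	   here as an open-addressed hash probe. Assumes the key is
	   present; a real table would also handle misses. */
	#define TABSIZE 4096

	struct entry { unsigned key; int val; };

	int lookup(struct entry tab[], unsigned key)
	{
	    unsigned h = key % TABSIZE;
	    while (tab[h].key != key)
	        h = (h + 1) % TABSIZE;   /* linear reprobe */
	    return tab[h].val;
	}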

Now, no architecture, be it superscalar/multiple functional
units/dataflow/systolic/VLIW or whatever other trickery, can extract
from "general purpose" codes more parallelism than they intrinsically
contain (Amdahl's law rules).
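
To put an illustrative number on it (my figures, not measurements): if
a fraction p of the work is parallelizable over n functional units, then

	speedup = 1 / ((1 - p) + p/n)

so even with a generous p = 0.75, n = 6 FUs give 1 / (0.25 + 0.125),
about 2.7; the serial quarter dominates long before the sixth unit
earns its keep.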

The real issue we are discussing here is whether superscalar/multiple
FUs/dataflow/systolic/VLIW are well suited to codes with high inherent,
array-like, parallelism. As to this I beg to submit that a special
purpose architecture, vector, for a special purpose data structure,
array, is still the most cost-effective bet. Hennessy and Patterson
seem to say as much in the first two pages of their book, incidentally.
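
The archetypal case is a SAXPY-style loop (here on doubles): a vector
machine runs it as a few vector instructions, one decode per 64 (or
whatever the vector length is) elements, where a superscalar must
rediscover the independence of the iterations at run time.

	/* saxpy: y = a*x + y, the classic vector kernel. */
	void saxpy(int n, double a, double *x, double *y)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        y[i] = a * x[i] + y[i];
	}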

So let's go back to the original question: given that we shall shortly
be able to have about half a dozen to a dozen functional units on a
die, how should these be arranged?

1) As a single systolic/dataflow engine (or rather a poor imitation
thereof, e.g. a supersuperscalar with lots of Tomasulization/register
renaming tricks)?

2) As multiple superscalar[+vector] CPUs?

I think that the second option looks more promising, from my armchair
(I would be happy to receive as a donation a state-of-the-art CAD system
for designing CPUs, and a fab line for 3 million transistor chips).

The reasons are that the superscalar part can take advantage of the
limited inherent microparallelism of most codes, the vector part can
deal with the occasional but very important (graphics, sound, etc.)
vector codes, and the limited degree of multiprocessing would allow
exploiting the limited degree of macro parallelism in most applications.

Just to make things clearer, I am thinking of the workstation of a few
years hence, with realtime animated video, interactive WYSIWYG
hypertext, an integrated telephone, and running DOS 7/Windows 3.9 :-).

This seems to be Intel's own vision, incidentally.
--
Piercarlo Grandi, Dept of CS, PC/UW@Aberystwyth <pcg@aber.ac.uk>
And the Italian sang, and sang. And his desperate invocations reached
the ears of his divine protector, the god of the joke