Xref: sparky comp.sys.intel:2785 comp.arch:11788
Path: sparky!uunet!pipex!warwick!uknet!gdt!aber!fronta.aber.ac.uk!pcg
From: pcg@aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Superscalar vs. multiple CPUs ?
Message-ID: <PCG.92Dec19165907@decb.aber.ac.uk>
Date: 19 Dec 92 16:59:07 GMT
References: <1992Dec7.012026.11482@athena.mit.edu>
	<1992Dec8.000357.26577@newsroom.utas.edu.au>
	<PCG.92Dec9154602@aberdb.aber.ac.uk>
	<1992Dec12.152224.168173@zeus.calpoly.edu>
Sender: news@aber.ac.uk (USENET news service)
Reply-To: pcg@aber.ac.uk (Piercarlo Grandi)
Organization: Prifysgol Cymru, Aberystwyth
Lines: 88
In-Reply-To: mneideng@thidwick.acs.calpoly.edu's message of 12 Dec 92 15:22:24 GMT
Nntp-Posting-Host: decb.aber.ac.uk

On 12 Dec 92 15:22:24 GMT, mneideng@thidwick.acs.calpoly.edu (Mark
Neidengard) said:

mneideng> I see nothing wrong with many instruction units on one chip;
mneideng> all you have to do is either A) use them like a sort of
mneideng> on-chip diastolic pipeline, or B) have the system process
mneideng> scheduler schedule different processes to use different units.

Wonderful ideas! What I would actually like is a dataflow machine, then.

mneideng> I could easily envision a chip with three FP adders and three
mneideng> FP multipliers (maybe not with today's fabrication, but in
mneideng> another year or so...) What you would have to do is increase
mneideng> the sophistication of the chip's internal pipeline controller
mneideng> and do a LOT of instruction predecoding.

Easy said, easy done :-). A lot of research is going into that; as
somebody posted, there is a law that says that exploitable parallelism
is a linear function of the ASPLOS number.

But my point is not about whether it can be done; my point is about
whether it is necessary or even useful.

Given that there has been some confusion as to which issue I thought we
were discussing, I will try to reformulate it.

The important limit to micro/macro parallelism is not the cleverness of
the CPU implementor; it is the inherent parallelism in the
application/algorithm/data structures.

As far as I have read and seen with my own eyes, let me insist, most
"general purpose" codes have limited degrees of exploitable micro/macro
parallelism, say 2 to 4. The underlying reason is that most such codes
are about serial tree/graph walking/updating.
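
To make this concrete, here is a made-up C fragment of the kind I mean
(mine, purely for illustration): each load depends on the result of the
previous one, so no number of functional units can overlap them.

	#include <stddef.h>

	/* Serial list walking: a loop-carried dependence chain. The
	   address of the next node is unknown until the current load
	   completes, so extra FUs mostly sit idle. */
	struct node { struct node *next; int val; };

	int sum_list(struct node *p)
	{
	    int sum = 0;
	    while (p != NULL) {
	        sum += p->val;   /* little work per node... */
	        p = p->next;     /* ...and the next address waits on this load */
	    }
	    return sum;
	}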

Codes with easily "minable" parallelism are, almost invariably, "special
purpose" codes; the underlying reason is that most such codes are about
sequential array scanning/filtering, and even on those something as
trivial as strides may cause problems.
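
For instance (an illustrative fragment of mine, not from any real code):
the first loop below is the easy unit-stride case; the second is just as
parallel on paper, but a large stride can defeat the caches and the
memory banks.

	/* Unit stride: independent iterations, easy to spread across
	   multiple FP units or to feed to a vector unit. */
	void add_unit(int n, double *a, double *b, double *c)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        c[i] = a[i] + b[i];
	}

	/* Strided: the arithmetic is just as independent, but each
	   access may touch a new cache line or hit the same memory
	   bank, so the memory system becomes the limit. */
	void add_strided(int n, int stride, double *a, double *b, double *c)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        c[i] = a[i * stride] + b[i * stride];
	}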

Then, in a distant second position, come "one purpose" codes, for very
special applications, such as those beloved by the NSA. If there is any
underlying commonality to "one purpose" codes, it is probably that they
are about associative access to memory, rather than graph walking or
array scanning.
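
By "associative access" I mean patterns like the following sketch (mine,
and much simplified): the data is addressed by key rather than by
position or by pointer.

	/* Associative access: "given a key, find the matching entry",
	   here as an open-addressed hash probe. Assumes the key is
	   present; a real table would also handle misses. */
	#define TABSIZE 4096

	struct entry { unsigned key; int val; };

	int lookup(struct entry tab[], unsigned key)
	{
	    unsigned h = key % TABSIZE;
	    while (tab[h].key != key)
	        h = (h + 1) % TABSIZE;   /* linear reprobe */
	    return tab[h].val;
	}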

Now, no architecture, be it superscalar/multiple functional
units/dataflow/systolic/VLIW or whatever other trickery, can extract
from "general purpose" codes more parallelism than they intrinsically
contain (Amdahl's law rules).
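
To put an illustrative number on it (my figures, not measurements): if
a fraction p of the work is parallelizable over n functional units, then

	speedup = 1 / ((1 - p) + p/n)

so even with a generous p = 0.75, n = 6 FUs give 1 / (0.25 + 0.125),
about 2.7; the serial quarter dominates long before the sixth unit
earns its keep.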

The real issue we are discussing here is whether superscalar/multiple
FUs/dataflow/systolic/VLIW are well suited to codes with high inherent,
array-like, parallelism. As to this I beg to submit that a special
purpose architecture, vector, for a special purpose data structure,
array, is still the most cost-effective bet. Hennessy and Patterson
seem to say as much in the first two pages of their book, incidentally.
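
The archetypal case is a SAXPY-style loop (here on doubles): a vector
machine runs it as a few vector instructions, one decode per 64 (or
whatever the vector length is) elements, where a superscalar must
rediscover the independence of the iterations at run time.

	/* saxpy: y = a*x + y, the classic vector kernel. */
	void saxpy(int n, double a, double *x, double *y)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        y[i] = a * x[i] + y[i];
	}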

So let's go back to the original question: given that we shall shortly
be able to have about half a dozen to a dozen functional units on a
die, how should these be arranged?

1) As a single systolic/dataflow engine (or rather a poor imitation
thereof, e.g. a supersuperscalar with lots of Tomasulization/register
renaming tricks)?

2) As multiple superscalar[+vector] CPUs?

I think that the second option looks more promising, from my armchair
(I would be happy to receive as a donation a state-of-the-art CAD system
for designing CPUs, and a fab line for 3 million transistor chips).

The reasons are that the superscalar part can take advantage of the
limited inherent microparallelism of most codes, the vector part can
deal with the occasional but very important (graphics, sound, etc.)
vector codes, and the limited degree of multiprocessing would allow
exploiting the limited degree of macro parallelism in most applications.

Just to make things clearer, I am thinking of the workstation of a few
years hence, with realtime animated video, interactive WYSIWYG
hypertext, an integrated telephone, and running DOS 7/Windows 3.9 :-).

This seems to be Intel's own vision, incidentally.
--
Piercarlo Grandi, Dept of CS, PC/UW@Aberystwyth <pcg@aber.ac.uk>
And the Italian sang, and sang. And his desperate invocations reached
the ears of his divine protector, the god of the joke