- Xref: sparky comp.sys.intel:2826 comp.arch:11901
- Newsgroups: comp.sys.intel,comp.arch
- Path: sparky!uunet!enterpoop.mit.edu!bloom-picayune.mit.edu!athena.mit.edu!solman
- From: solman@athena.mit.edu (Jason W Solinsky)
- Subject: Re: Superscalar vs. multiple CPUs ?
- Message-ID: <1992Dec23.172413.8798@athena.mit.edu>
- Sender: news@athena.mit.edu (News system)
- Nntp-Posting-Host: m37-318-11.mit.edu
- Organization: Massachusetts Institute of Technology
- References: <WAYNE.92Dec4093422@backbone.uucp> <37595@cbmvax.commodore.com> <PCG.92Dec23150744@decb.aber.ac.uk>
- Date: Wed, 23 Dec 1992 17:24:13 GMT
- Lines: 93
-
- In article <PCG.92Dec23150744@decb.aber.ac.uk>, pcg@aber.ac.uk (Piercarlo Grandi) writes:
- |> On 21 Dec 92 13:33:18 GMT, solman@athena.mit.edu (Jason W Solinsky) said:
- |>
- |> solman> If you define the codes which we are concerned with to be codes
- |> solman> which can only exploit ILP, then of course the level of
- |> solman> parallelism is limited, but you are not dealing with general
- |> solman> purpose computing anymore.
- |>
- |> Ah, this discussion was indeed mostly about ILP/superscalarity/VLIW.
-
- This is where our misunderstanding lies. It is my argument that hyperscalar
- chips will be able to exploit macro-level parallelism. A hyperscalar processor
- would be able to issue eight instructions from a single thread simultaneously,
- but its primary purpose would be to issue eight instructions from several
- different threads simultaneously. This is where the talk of getting rid of
- "imaginary lines" comes from. The processor of the future must be able to
- execute one highly parallelizable thread very quickly (which as you note
- could be easily done with vector processors) AND execute several different
- threads simultaneously.
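- As a toy illustration of the issue model above (the eight-wide figure comes
- from the post; the greedy fill-across-threads policy and every name below
- are my own invention, not any real design):

```python
from collections import deque

ISSUE_WIDTH = 8  # eight instructions per cycle, as in the post

def issue_cycle(threads):
    """threads: list of deques of ready instructions, in priority order.
    Fill up to ISSUE_WIDTH slots, draining earlier threads first."""
    issued = []
    for q in threads:
        while q and len(issued) < ISSUE_WIDTH:
            issued.append(q.popleft())
        if len(issued) == ISSUE_WIDTH:
            break
    return issued

# One thread with lots of ILP can fill the whole machine...
print(issue_cycle([deque(f"T0.op{j}" for j in range(10))]))
# ...while several modest threads share the same eight slots.
print(issue_cycle([deque(f"T{i}.op{j}" for j in range(3)) for i in range(4)]))
```

- The same eight slots serve either workload; no "imaginary line" divides
- the machine into fixed per-thread halves.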
-
- |> pcg> Indeed pipeline designs with more than a few stages of pipelining
- |> pcg> run into huge problems, and are worth doing only if a significant
- |> pcg> proportion of SIMD-like operation is expected. Pipeline bubbles
- |> pcg> start to become a significant problem beyond 4 pipeline stages on
- |> pcg> general purpose codes, even on non superscalar architectures.
- |>
- |> solman> This can be taken care of by interleaving different threads in
- |> solman> the software, or using hardware which will take care of the
- |> solman> interleaving on its own. The above statement is only true when
- |> solman> the compiler is too dumb to notice higher level parallelism.
- |>
- |> Well, here you are saying, if I read you right, that if an application
- |> is suited to MIMD style computing then multithreading is the answer.
- |> This in itself is nearly a tautology. If you are also implying that many
- |> more applications are suited to *massive* MIMD style (macro) parallelism
- |> than is commonly believed, I would be skeptical; what I believe is that
- |> quite a few *important* applications can be successfully (macro)
- |> parallelized, MIMD style. For example compiling GNU Emacs; each
- |> compilation of each source module can proceed in parallel, and indeed
- |> one can spawn a separate thread for each function in each source module.
-
- In the near future, we are not going to have the capability to put "massive"
- numbers of execution units on a chip. I would be surprised if we saw chips
- with more than 50 execution units in the near future. What I am saying is that
- this level of MIMD (a macro-parallelism times ILP product of a little over 100)
- can be supported efficiently by enough applications to call it general purpose.
- I think both the example given here and the paper that has been cited support
- this. More importantly, the tendency of software with greater levels of
- abstraction to exhibit more macro-parallelism suggests that the parallelizability
- of software will continue to outpace that of the hardware.
-
- |> solman> The key question in choosing how large register files and caches
- |> solman> should be, is "How large a {register file or cache} do I need
- |> solman> for `good' performance on the algorithms I want to run?"
- |> solman> Invariably, the size chosen is too small some of the time, while
- |> solman> much of it is left unused at other times. In the multiple CPU
- |> solman> version, this still happens. In the hyperscalar version,
- |> solman> however, some of the execution units and threads will need a
- |> solman> larger {cache or reg file} and some will be unable to utilize
- |> solman> the existing space, but because they can share the same caches
- |> solman> and register files, it is far less likely for performance to be
- |> solman> limited by cache or register file size.
- |>
- |> I agree with the sentiment here; partitioning resources can indeed lead
- |> to starvation in one place but to overcommitment in another.
- |>
- |> The problem is that sharing resources can have tremendous costs; sharing
- |> register files between multiple parallel functional units looks nice
- |> until one figures out the cost in terms of multiporting.
-
- One thing is for sure, we can't solve the problems of resource allocation
- simply by extending current methods. Just adding ports to register files
- would be a horrendously expensive idea. Instead, just spread several files
- with a couple of ports around and tag the data to signify where it came
- from and where it's going. There are many different ways to do it. Just
- extending the current trend of multi-porting isn't one of them, though.
- Besides, it will take too many cycles to wait for data to cross the entire
- length of the chip. This should only be necessary when the local register
- file is full.
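- Here is one way the tagged, banked scheme sketched above could look. The
- bank counts, capacities, and latencies are invented for illustration; the
- post specifies none of them:

```python
LOCAL_LATENCY = 1    # assumed: reading your own bank is fast
REMOTE_LATENCY = 4   # assumed: crossing the chip costs extra cycles

class BankedRegFile:
    """Several small register files, each with few ports; every value
    carries a tag saying which bank holds it."""

    def __init__(self, n_banks, regs_per_bank):
        self.banks = [{} for _ in range(n_banks)]
        self.capacity = regs_per_bank

    def write(self, home_bank, name, value):
        # Prefer the writer's local bank; spill to a neighbour only when full.
        if len(self.banks[home_bank]) >= self.capacity:
            home_bank = (home_bank + 1) % len(self.banks)
        self.banks[home_bank][name] = value
        return (home_bank, name)            # the tag: where the data lives

    def read(self, reader_bank, tag):
        home_bank, name = tag
        latency = LOCAL_LATENCY if home_bank == reader_bank else REMOTE_LATENCY
        return self.banks[home_bank][name], latency

rf = BankedRegFile(n_banks=4, regs_per_bank=8)
tag = rf.write(0, "r1", 42)
print(rf.read(0, tag))   # local read: (42, 1)
print(rf.read(2, tag))   # remote read pays the crossing cost: (42, 4)
```

- Most reads stay local and cheap; only a full local file forces the slow
- trip across the chip, which is the behaviour argued for above.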
-
- |> It's obvious you know this, but maybe you are underestimating this cost;
- |> moreover avoiding partitioning will really solve problems only when
- |> there are huge swings in resource allocations by different threads; if
- |> these are fairly predictable, then static partitioning does not look so
- |> bad. I reckon that this is the case in many instances.
-
- I'm sure it is, but the optimal partitioning changes from user to user. If you
- want to use a single design as a general purpose uP, then this no longer looks
- like a good idea. This is especially the case when you consider that most users
- will be multi-tasking several programs at once and these are unlikely to be
- very similar in resource consumption.
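- A toy Monte Carlo run makes the shared-versus-partitioned point concrete.
- Every number here (register counts, demand range) is invented; only the
- qualitative claim is from the discussion:

```python
import random

random.seed(1)
TOTAL = 64            # assumed total registers on chip
STATIC = TOTAL // 2   # static split: a fixed half per thread

static_starved = shared_starved = 0
for _ in range(1000):
    # two threads whose register demand swings from phase to phase
    d1, d2 = random.randint(0, 48), random.randint(0, 48)
    if d1 > STATIC or d2 > STATIC:   # a fixed half is sometimes too small
        static_starved += 1
    if d1 + d2 > TOTAL:              # the shared pool fails only when the sum does
        shared_starved += 1

print(static_starved, shared_starved)  # the shared pool starves far less often
```

- When per-thread demands are unpredictable, as with dissimilar multi-tasked
- programs, the shared pool wins; static partitioning only catches up when
- demands are steady enough to size the partitions in advance.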
-
- Jason W. Solinsky
-