- Path: sparky!uunet!usc!elroy.jpl.nasa.gov!swrinde!gatech!hubcap!fpst
- From: lycmit@x1sun6.ccl.itri.org.tw (Yin-Chih Lin)
- Newsgroups: comp.parallel
- Subject: Could I compete the low-end CM5 with multiple SPARCstations? (Summary)
- Message-ID: <1992Jul27.140928.26067@hubcap.clemson.edu>
- Date: 24 Jul 92 17:35:40 GMT
- Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
- Organization: Clemson University
- Lines: 583
- Approved: parallel@hubcap.clemson.edu
-
- Greetings:
-
- Pardon me for using the net bandwidth again. I have received many replies
- on the subject, so I suppose it is worthwhile to summarize the responses
- and repost them. (I apologize if anyone thinks otherwise.)
-
- Most of the notes imply that it is *almost impossible* to challenge the
- CM5 with SPARCstations alone, without extra facilities (vector processors,
- a high-speed network, good software).
-
- However, that might *not be true* in my case. I know another team in
- ITRI/CCL has developed a VP processor for SPARCstations (I haven't
- inquired about the actual spec and performance of this processor). I also
- believe there are vendors providing FDDI add-on cards for SBus.
- The only real pain is the *software*. It is true that it is almost
- impossible for my organization to sanction such a project (especially
- when an IBM ES-9000/820 machine will be available from the National High
- Speed Computer Center in my country. I don't know *why* they selected IBM).
-
- But I also got some directions from the messages I received. Some of
- them (PVM, p4) are positive encouragement for me to pursue what I want
- with the current facilities.
-
- Anyway, I would be glad to hear comments on the subject at any
- time. Please feel free to email me if you have further directions.
-
- Thanks go to the following people (in the order they appear in my mailbox,
- as of July 24, 12:00 - London time plus 8 hours):
-
- Peter Su psu@cs.duke.edu
- Dirk Grunwald grunwald@foobar.cs.colorado.edu
- Jon Mellott jon@alpha.ee.ufl.edu
- Jerry Callen jcallen@Think.COM
- K. J. Chang kjchang@hpl.hp.com
- Steve Roy ssr@ama.caltech.edu
- Scott Townsend fsset@bach.lerc.nasa.gov
- Rowan Hughes csrdh@marlin.jcu.edu.au
- Mark Homewood fred@meiko.com
- Matthias Schumann schumanm@informatik.tu-muenchen.de
- Mike Davis davis@ee.udel.edu
- Michael Kolatis kolatis@cs.utk.edu
- Dave dave@sep.stanford.edu
- Henrik Klagges henrik@tazdevil.llnl.gov
- (anonymous) vu0208@bingsuns.cc.binghamton.edu
- Bob Felderman feldy@isi.edu
- Ian L. Kaplan ian@MasPar.COM
- Bill Maniatty maniattb@cs.rpi.edu
- Keith Bierman Keith.Bierman@Eng.Sun.COM
- Paul Whiting Paul.G.Whiting@mel.dit.csiro.au
-
- Here are the original message and replies:
-
- ------------------------------ ( original ) --------------------------------
-
- We have many Sun SPARCstations (from the 1, 1+, and IPC to the 2, 670MP, and
- SPARCstation 10 in the near future), and the new multi-threaded
- OS (SunOS 5.0) will be installed on these computers.
-
- I find that Thinking Machines Corp.'s latest MPP machine, the CM-5, also
- adopts SPARC processors (22 MIPS per processor) as its nodes.
- According to the Oct. 1991 issue of "Parallel Processing", the 32-node
- CM5 model will sell for about $1.4 million.
-
- I am curious: if I connect 32 SPARCstation 10 model 30 (33MHz SuperSPARC
- module with 86.1 MIPS - so Sun Micro advertises!?) computers (a loosely-
- coupled MIMD multicomputer?) by Ethernet, ISDN, or FDDI, with the multi-
- threaded OS, perhaps using facilities like Linda-style distributed data
- structures, is it possible for me to obtain reasonable performance
- compared with the 32-node TMC CM5 (a SIMD machine)?
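-
- (To make the Linda idea concrete: a minimal bag-of-tasks sketch in C-Linda
- style is below. The in()/out()/eval() operations and the ?formal syntax are
- Linda language extensions, so it compiles only under a C-Linda compiler, and
- the task and worker counts are made-up values.)
-
-     /* Hypothetical C-Linda bag-of-tasks sketch; requires a C-Linda
-        compiler for in()/out()/eval() and the ?formal syntax. */
-     #include <stdio.h>
-
-     #define NTASKS  256
-     #define NWORKER 32
-
-     int worker(void)
-     {
-         int t;
-         for (;;) {
-             in("task", ?t);            /* withdraw any task tuple      */
-             out("result", t * t);      /* pretend work: square it      */
-         }
-     }
-
-     int real_main(int argc, char *argv[])  /* C-Linda entry point */
-     {
-         int i, partial, sum = 0;
-
-         for (i = 0; i < NWORKER; i++)
-             eval("worker", worker());  /* spawn one worker process     */
-         for (i = 0; i < NTASKS; i++)
-             out("task", i);            /* put task tuples into the bag */
-
-         for (i = 0; i < NTASKS; i++) {
-             in("result", ?partial);    /* block until a result appears */
-             sum += partial;
-         }
-         printf("sum = %d\n", sum);
-         return 0;
-     }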
-
- I will highly appreciate any comments. Please direct
- your opinion (or abuse) to me by email, so the news bandwidth won't be
- wasted (except by this post!).
-
- Thanks,
-
- P.S. I would also like to know: has anyone configured their SPARCstations
- as a message-passing MIMD multicomputer rather than merely as LAN-
- based networked workstations?
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- Yin-Chih Lin Industrial Technology Research
- email: lycmit@x1sun6.ccl.itri.org.tw Institute (ITRI)
- phone: 886-35-917331
- fax: 886-35-917503 Computer & Comm. Research Labs.(CCL)
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-
- ------------------------------ ( reply #1 ) --------------------------------
-
- From: Peter Su <psu@cs.duke.edu>
-
- The answer depends on two things:
-
- 1) Would your code be able to take advantage of the CM-5 vector units?
- 2) Does your code need to do a lot of communication/synchronization?
-
- If you have code that breaks up into big, independent chunks which do
- not vectorize easily, and you have libraries to manage workstation
- clusters, then I can't see why the CM-5 would be much better.
-
- Oh, the CM-5 probably has more usable memory per node, since data
- processing nodes don't run the full OS.
-
- Pete
-
- ------------------------------ ( reply #2 ) --------------------------------
-
- From: Dirk Grunwald <grunwald@cs.colorado.edu>
-
- The CM-5 also has 200-MFLOPS vector processors and a very fast network.
- Most importantly, it has good software.
-
- ------------------------------ ( reply #3 ) --------------------------------
-
- From: Jon Mellott <jon@alpha.ee.ufl.edu>
-
- It depends a lot on the type of tasks that you are doing. If your tasks
- are characterized by lots of communication between processes/processors,
- then the CM-5 will probably eat you alive. Also, the CM-5 does have
- the option of a vector processor for each processor module, so you
- could potentially be chewed up there also.
-
- BTW, if your intent is to do something like this, then you might
- consider going to something like the SS10 model 54 (four processors).
- You'd be money ahead on a per processor basis since you wouldn't have
- to buy as many pizza boxes per processor. Also, if your code is
- amenable to multi-threading then you might be able to derive a big
- benefit from running on a multiprocessor machine, especially from an
- inter-thread communications standpoint.
-
- Lemme see, the SS10 model 54 is listing at something like $50K, so
- for a million bucks you could get twenty of them. That'd give you
- eighty processors. You could spend the savings on FDDI interfaces
- so you don't get butchered by ethernet latencies.
-
- BTW, this sort of thing has been described before in the literature.
- I don't recall where, but it has been done before.
-
- Good luck,
- Jon Mellott
- High Speed Digital Architecture Laboratory
- (jon@alpha.ee.ufl.edu)
-
- ------------------------------ ( reply #4 ) --------------------------------
-
- From: Jerry Callen <jcallen@Think.COM>
-
- [Claimer: I work for Thinking Machines.]
-
- There is a lot more to a CM-5 than SPARCs and memory.
-
- Specifically, there are two high-bandwidth, low-latency networks. The
- "control network" can perform a number of combining operations (max,
- add, or, xor), broadcasting and synchronization functions. The "data
- network" allows point-to-point communications between multiple nodes
- SIMULTANEOUSLY at very high bandwidths (>5MB/sec/node). This gives a
- 32 node machine a (very conservatively rated) aggregate bandwidth of
- 160MB/sec. You just cannot get this kind of bandwidth out of Ethernet
- or even FDDI. Also, the network interface is mapped into user memory
- space, so you don't have to go through operating system layers to get
- at it.
-
- There are also 4 vector units on each node; each vector unit has a
- dedicated bank of memory, yielding very high memory bandwidth. Each
- vector unit can do 32MFLOPS (peak, but I have seen codes get >75% of
- this).
-
- Also, a point of clarification: the CM-5 is NOT a "SIMD machine." While
- the networks provide excellent support for a data-parallel programming
- model, TMC supports a message-passing programming model as well.
-
- Call TMC at (617) 234-1000 and ask for a CM-5 Technical Summary if you
- would like more information.
-
- -- Jerry Callen
- jcallen@world.std.com (preferred)
- jcallen@think.com (OK, too)
- {uunet,harvard}!think!jcallen (if you must)
-
- ------------------------------ ( reply #5 ) --------------------------------
-
- From: K. J. Chang <kjchang@hplabsz.hpl.hp.com>
-
- Of course! That's why there are at least 20 new RISC-based workstations
- being designed here in Silicon Valley. Mainframe and supercomputer vendors
- do not release their SPECfp92 or SPECint92 numbers because their products
- are not cost-effective.
-
- You may want to read Chap. 2 of "Computer Architecture: A Quantitative
- Approach" (John Hennessy and David Patterson), especially Figure 2.25,
- which tells you that those expensive computers' MFLOPS are only in the
- range of 8.3 to 24.4.
-
- --
- K J Chang, Hewlett-Packard ICBD R & D, (())_-_(())
- Palo Alto, CA 94304 | (* *) |
- Internet: kjchang@hpl.hp.com a UCLA Bruin --> { \_@_/ }
- voice mail: (415)857-4604 `-----'
-
- ------------------------------ ( reply #6 ) --------------------------------
-
- From: Steve Roy <ssr@ama.caltech.edu>
-
- There are three main differences that ensure that a network of
- workstations will not compete with the CM5.
-
- The first difference is a hardware one: the SPARC in the CM5 is really
- just a controller/scheduler for custom vector/memory processors. Each
- SPARC controls 4 of these processors. These vector boards are much,
- much faster than the SPARC for number crunching, the main point of
- this sort of machine. The fact that it has SPARCs in it is sort of
- irrelevant for number-crunching considerations.
-
- The second difference is the speed of the communications: the CM5 has
- a very fast and sophisticated network that allows arbitrary
- permutation communication to proceed at no less than 5MByte/sec per
- processor. This is the high-bandwidth network; there is also a lower-
- bandwidth but lower-latency network that is used primarily for
- fast synchronizations between processors. There is a third network for
- diagnostics. The networks are redundant, so the machine can continue
- to work given any single point of failure, an important consideration for
- a massively parallel system.
-
- The third difference is the integrated and powerful software that
- allows easy parallel coding and debugging. The debugger is X-based
- and allows you to click on (for example) an array name and display its
- contents either as numbers or as a contour plot. The first compiler
- is an F90 variant; there will later be ports of C* and presumably
- *Lisp.
-
- It was a good question, though. I hope you post a summary to the net.
-
- Steve Roy.--
- ----------------------------------------------------------------------
- Steve Roy | Life. Don't talk to me about life.
- ssr@ama.caltech.edu |
- |
-
- ------------------------------ ( reply #7) --------------------------------
-
- From: Scott Townsend <fsset@bach.lerc.nasa.gov>
-
- While the individual processors will be comparable to those in the CM5,
- as a group your clustered processors will approximate the CM5 only if your
- parallel application does almost zero communication.
-
- We've run some experiments here with a cluster of RS6000's. On applications
- which require very little communication relative to the computation rate,
- things run very well. As soon as you get into more typical parallel
- applications, things slow down REAL fast!
-
- The main reason for this is latency. Local networks are not designed for
- very low latency. Unless the interconnect in a parallel processor has _very_
- low latency compared to CPU speed, you spend all your time waiting for
- messages instead of computing.
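-
- To put rough numbers on this, consider a simple model: if each compute step
- takes t_comp and each message exchange costs t_lat + bytes/bandwidth, the
- efficiency is t_comp / (t_comp + t_comm). The C sketch below evaluates this
- with purely illustrative latency and bandwidth figures (assumptions, not
- measurements of any particular network).
-
-     /* Back-of-envelope latency model of one compute/exchange cycle.
-        All latency/bandwidth numbers are illustrative assumptions. */
-     #include <stdio.h>
-
-     struct net { char *name; double latency_us; double mbytes_per_sec; };
-
-     int main(void)
-     {
-         double compute_us = 1000.0;   /* assumed work between exchanges */
-         double msg_bytes  = 1024.0;   /* assumed message size           */
-         struct net nets[] = {
-             { "Ethernet + kernel stack", 1500.0,  1.0 },
-             { "FDDI + kernel stack",     1000.0, 12.0 },
-             { "user-level MPP network",     5.0,  5.0 },
-         };
-         int i;
-
-         for (i = 0; i < 3; i++) {
-             /* 1 MB/sec is 1 byte/usec, so bytes / (MB/sec) gives usec */
-             double comm_us = nets[i].latency_us
-                            + msg_bytes / nets[i].mbytes_per_sec;
-             double eff = compute_us / (compute_us + comm_us);
-             printf("%-24s  efficiency %3.0f%%\n", nets[i].name, eff * 100.0);
-         }
-         return 0;
-     }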
-
-
- --
- Scott Townsend, Sverdrup Technology Inc. NASA Lewis Research Center Group
- fsset@bach.lerc.nasa.gov
-
- ------------------------------ ( reply #8) --------------------------------
-
- From: Rowan Hughes <csrdh@marlin.jcu.edu.au>
-
- You'd need to connect them together with at least FDDI. The bandwidth between
- nodes is the biggest problem with MPP machines; 1000Mb/s would be better.
- Also, you probably won't be able to use more than 16 nodes, since the
- search/latency overheads become prohibitive on a single communication rail,
- i.e. you've got no secondary search engine.
-
- At the moment the CM-5 is just a bunch of SPARCstations in one box; the vector
- CPUs haven't been fabricated yet (and probably never will be). The $1.4M is
- really just for a bit of software.
-
- --
- Rowan Hughes James Cook University
- Marine Modelling Unit Townsville, Australia.
- Dept. Civil and Systems Engineering csrdh@marlin.jcu.edu.au
-
- ------------------------------ ( reply #9) --------------------------------
-
- From: Mark Homewood <fred@meiko.co.uk>
-
- Besides designing FPUs for SPARC, Meiko manufactures machines based around
- something like what you have described. We buy commodity boards or workstation
- components and produce (hopefully) cost-effective parallel supercomputers.
- We do find it necessary to add some extra communication bandwidth and
- reduced latency. Clearly a 32-node TMC is out to lunch; we do a similar-sized
- machine for a fraction of the price. The major difference is that ours can be
- used as a network of Suns as well, and at the same time.
-
- There is some public-domain software which lets you use your machines as a
- parallel machine. It is called PVM (Parallel Virtual Machine); try
- anonymous ftp to <research.att.com> in dir <netlib/pvm>. There is a Readme
- and an index. You can compile and configure it for your network.
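-
- To give a feel for the programming model, here is a minimal master/worker
- sketch in C. A caveat: the calls follow the later PVM 3-style interface
- (pvm_spawn, pvm_send, pvm_recv); the release on netlib spells its interface
- differently, and the spawned program name "worker" below is only a
- placeholder for your own binary.
-
-     /* Minimal PVM master/worker sketch (PVM 3-style calls; the netlib
-        release uses a different spelling of this interface). The same
-        binary acts as master when it has no PVM parent. */
-     #include <stdio.h>
-     #include "pvm3.h"
-
-     #define NWORK 4
-
-     int main(int argc, char *argv[])
-     {
-         int parent = pvm_parent();   /* PvmNoParent => we are the master */
-         int tids[NWORK];
-         int i, n, sum;
-
-         if (parent == PvmNoParent) {
-             /* "worker" is a placeholder: spawn NWORK copies of this code */
-             pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NWORK, tids);
-             for (i = 0; i < NWORK; i++) {      /* one integer per worker */
-                 pvm_initsend(PvmDataDefault);
-                 pvm_pkint(&i, 1, 1);
-                 pvm_send(tids[i], 1);
-             }
-             sum = 0;
-             for (i = 0; i < NWORK; i++) {      /* gather squared results */
-                 pvm_recv(-1, 2);
-                 pvm_upkint(&n, 1, 1);
-                 sum += n;
-             }
-             printf("sum of squares = %d\n", sum);
-         } else {                             /* worker: square and reply */
-             pvm_recv(parent, 1);
-             pvm_upkint(&n, 1, 1);
-             n = n * n;
-             pvm_initsend(PvmDataDefault);
-             pvm_pkint(&n, 1, 1);
-             pvm_send(parent, 2);
-         }
-         pvm_exit();
-         return 0;
-     }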
-
- Though as one of your repliers states, latency is the thing. This is what
- Meiko (and TMC, though over-priced) are selling.
-
- Fred
-
- Meiko Limited
-
- fred@meiko.com
-
- ------------------------------ ( reply #10) -------------------------------
-
- From: Matthias Schumann <schumanm@informatik.tu-muenchen.de>
-
- Hi,
-
- Here at the Technical University of Munich, an environment for parallel
- programming called MMK was developed; it runs on Intel iPSC/2 and iPSC/860
- MIMD supercomputers.
- There is another version called MMKx, which offers the same functions
- (task creation and deletion / communication / semaphores, synchronization /
- .....) on a network of Sun workstations connected over an Ethernet.
-
- The question about the CM5 is application-dependent, I would say. Code
- that is written especially for SIMD machines will run faster on the
- CM. With general-purpose code the performance of the Sun network
- will look better, but whether it could reach the CM's performance I have
- no idea, especially since I have no information about the interprocessor
- communication network in the CM5.
-
- Ciao
- Matt
-
- --
- | Matthias Schumann Department of Computer Science/SAB TU Muenchen |
- | schumanm@informatik.tu-muenchen.de Arcisstr.21 |
- | Phone: +49 89 2105 2036 8000 Muenchen 2 |
-
- ------------------------------ ( reply #11) -------------------------------
-
- From: Mike Davis <davis@ee.udel.edu>
-
- I would be interested in any responses you get on this subject. I'm involved
- in a project looking into running distributed simulations on a group of FDDI-
- connected SPARCstation 10's running Linda. Since the closest thing we have
- to compare to is an older 16-processor Sequent, I can't help with your CM5
- questions.
-
- thanks
- mike
-
- davis@ee.udel.edu
-
- ------------------------------ ( reply #12) -------------------------------
-
- From: Michael Kolatis <kolatis@cs.utk.edu>
-
- Let's see, from what I know so far, one node of a TMC CM-5
- will have a SPARC chip that performs about 5 Mflops. Thus,
- with 32 nodes, you would expect a peak of 160 Mflops.
-
- Now, a SPARCstation 10 with 4 processors is supposed to perform
- at 19.7 Mflops per processor (i.e. almost 80 Mflops). Thus,
- 32 * 80 = 2560 Mflops = 2.56 Gflops.
-
- However, each node of a CM-5 will eventually include 4 vector
- processors (TMC calls them "floating point accelerators") which
- will increase the node power to 128 Mflops per node (which makes
- a 32 node CM-5 capable of 4 Gflops).
-
- Thus, theoretically, with the communication speed losses involved
- with the SPARC 10's and the lack of a parallel programming language for
- them, the SPARCs would lose (also, note that if a SPARC 10 does have
- 4 processors, your connection would have 128 processors--not 32).
-
- However, hooking the SPARC 10's together using PVM (Parallel Virtual
- Machine) as the parallel software would be very interesting...
- and certainly makes the effort toward massively parallel
- computing very viable.
-
- Michael Kolatis
- UTenn
- Joint Institute for Computational Science
-
- ------------------------------ ( reply #13) -------------------------------
-
- From: Dave <dave@sep.stanford.edu>
-
- Unfortunately not: each CM-5 node also has 4 custom vector units, running
- at 120 MFlops peak each. That, and the high-bandwidth network, are what you
- are really paying for with a CM-5.
-
- ------------------------------ ( reply #14) -------------------------------
-
- From: Henrik Klagges <henrik@tazdevil.llnl.gov>
-
- I think it is a good idea. At LLNL we have a cluster of RS6000s, connected
- via a very fast network, with good results.
-
- --
-
- Cheers, Henrik
- MPCI at LLNL
- IBM Research
-
- ------------------------------ ( reply #15) -------------------------------
-
- From: (anonymous) <vu0208@bingsuns.cc.binghamton.edu>
-
- If you want to do what you have said in your post, you might create a
- loosely-coupled, loosely-performing MIMD!! Why? Because the CM-5 has 3
- extra things that you won't have: (1) a fast communication network,
- rather than your FDDI/Ethernet/ISDN etc.; (2) each node has a
- dedicated 22-MIPS RISC microprocessor with four vector pipes
- providing a total of 128 MFlops peak speed!! (3) it is capable of
- playing the dual roles of SIMD and MIMD at the same time.
-
- So I guess by just connecting 32 SPARCstations you won't hit the 128-
- MFlops performance; however, it may be at least 1/4 of a 16-node CM-5.
-
- Let's see what you get after you are done with your proposed project.
-
- Keep me posted; I will appreciate it.
-
- Thanks
-
- ------------------------------ ( reply #16) -------------------------------
-
- From: Bob Felderman <feldy@isi.edu>
-
- I suspect not. If you could, then TMC would quickly go out of business.
- My guess is that their interconnection network is MUCH better than
- any of your choices.
- For one thing, your choices were mostly shared-medium channels. If your
- task requires significant communication, you'll get killed by the
- Ethernet or FDDI etc. If your tasks are mostly independent, you'll
- probably do just fine.
-
-
- Bob Felderman
- USC/Information Sciences Institute
- 4676 Admiralty Way
- Marina del Rey, CA 90292-6695
- (310) 822-1511 x222
- (310) 823-6714 (fax)
-
- feldy@isi.edu
-
- ------------------------------ ( reply #17) -------------------------------
-
- From: Ian L. Kaplan <ian@MasPar.COM>
-
- Whether you can get performance out of a network of 32 SPARCstations
- comparable to what can be obtained from a 32-processor CM-5 depends
- on the application. If you have an application with little or no
- communication (some Monte Carlo simulations fall into this category),
- then the answer is yes. The more interprocessor communication, the
- more the CM-5 will win. They have put a lot of effort into designing
- their communication network, and it is much faster than anything you
- are likely to find connecting workstations. Another issue is software
- overhead in handling communication. I don't know what TMC has done in
- this area (we are one of their competitors, so they don't tell us
- these things), but I imagine that they have worked to reduce this
- overhead. This is not as important in SMP Solaris, so it is probably
- not as tightly tuned.
-
- Of course what you should really do is buy a MasPar machine, which
- will give you more processing power per dollar than either TMC or Sun
- can provide.
-
- Ian Kaplan
- ian@maspar.com
-
- ------------------------------ ( reply #18) -------------------------------
-
- From: Bill Maniatty <maniattb@cs.rpi.edu>
-
- I'd be skeptical of getting the same level of performance. I think the
- expense of the networked connections (latency in particular) would force
- you to use a larger granularity of task decomposition. I haven't played with
- a CM-5, so I would not really know for sure, but that is my gut reaction.
- You might also get a different resource-contention pattern because of the
- different operating system/compiler technology. I don't think they are
- really equivalent; each is appropriate for a different sort of task.
-
- Bill
- --
- |
- | maniattb@cs.rpi.edu - in real life Bill Maniatty
- |
-
- ------------------------------ ( reply #19) -------------------------------
-
- From: Keith Bierman <Keith.Bierman@Eng.Sun.COM>
-
- The CM-5 has much faster interprocessor communication (fast custom hw,
- special sw); you won't come close to its performance for most codes.
-
- Many people use networks of workstations for distributed processing.
- They use Express, Strand, PVM and a bevy of research efforts. Wander
- through comp.parallel, or a local university library (if they have a
- good collection of conference papers and technical reports).
-
- ----------------------------- ( reply #20) -------------------------------
-
- From: Paul Whiting <pgw@mel.dit.csiro.au>
-
- G'day,
-
- Yes, this is possible to do using the macros from Argonne National Lab in the US.
-
- I will forward you another piece of mail about them and how to obtain them.
-
- regards,
-
- ---------------------------------------------------------------------
- Paul Whiting | CSIRO - DIT
- Computer Scientist | 723 Swanston Street
- High Performance Computing Program | Carlton 3053, Australia
- ---------------------------------------------------------------------
- E-mail: pgw@mel.dit.csiro.au Phone: + 61 3 282 2666
-
- There's no problem a good surf can't fix.
- ---------------------------------------------------------------------
-
- (****************************** p4 info **********************************)
-
- p4
-
-
- p4 is a library of macros and subroutines developed at Argonne National
- Laboratory for programming a variety of parallel machines. Its predecessor
- was the m4-based "Argonne macros" system described in the Holt, Rinehart, and
- Winston book "Portable Programs for Parallel Processors" by Lusk, Overbeek,
- et al., from which p4 takes its name. The current p4 system maintains the same
- basic computational models described there (monitors for the shared-memory
- model, message-passing for the distributed-memory model, and support for
- combining the two models) while significantly increasing ease and flexibility
- of use.
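-
- To give a flavor of the message-passing model, a minimal master/slave sketch
- in C follows. The calling sequence is reconstructed from memory of the p4
- documentation (p4_initenv, p4_create_procgroup, p4_send/p4_recv), so treat
- the exact signatures as assumptions and check the reference manual in the
- distribution; it also assumes a procgroup file naming the machines to use.
-
-     /* Minimal p4 master/slave sketch. Signatures reconstructed from
-        memory of the p4 manual; verify against the distribution. */
-     #include <stdio.h>
-     #include "p4.h"
-
-     #define TAG 77
-
-     int main(int argc, char **argv)
-     {
-         int  myid, nprocs, i, type, from, len;
-         char *buf;
-
-         p4_initenv(&argc, argv);    /* parse p4 command-line options   */
-         p4_create_procgroup();      /* start slaves per procgroup file */
-
-         myid   = p4_get_my_id();
-         nprocs = p4_num_total_ids();
-
-         if (myid == 0) {            /* master: collect one message each */
-             for (i = 1; i < nprocs; i++) {
-                 type = TAG; from = -1; buf = (char *)0; len = 0;
-                 p4_recv(&type, &from, &buf, &len);
-                 printf("master got %d bytes from node %d\n", len, from);
-                 p4_msg_free(buf);
-             }
-         } else {                    /* slave: send one message to master */
-             p4_send(TAG, 0, "hello", 6);
-         }
-
-         p4_wait_for_end();          /* clean shutdown */
-         return 0;
-     }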
-
- The current release is version 0.2. New features added since version 0.1
- include:
-
- + xdr for communication in a heterogeneous network
-
- + asynchronous communication of large messages
-
- + global operations (broadcast, global sum, max, etc.)
-
- + both master-slave and SPMD models for message-passing programs
-
- + an improved and simplified Fortran interface
-
- + improved error handling
-
- + an optional secure server
-
- + ports to more machines
-
- p4 is intended to be portable, simple to install and use, and efficient. It
- can be used to program networks of workstations, advanced parallel
- supercomputers like the Intel Touchstone Delta and the Alliant Campus
- HiPPI-based system, and single shared-memory multiprocessors. It has
- currently been installed on the following list of machines: Sequent Symmetry,
- Encore Multimax, Alliant FX/8, FX/800, and FX/2800, Cray X/MP, Sun, NeXT, DEC,
- Silicon Graphics, and IBM RS6000 workstations, Stardent Titan, BBN GP-1000 and
- TC-2000, Intel IPSC/860, Intel Touchstone Delta, and Alliant Campus. It will
- soon be ported to the CM-5 and to the Intel Paragon. It is not difficult to
- port to new systems.
-
- Work in progress also includes use of monitors in Fortran and shared memory
- among multiple processes on multiprocessor workstations. A useful companion
- system is the upshot logging and X-based trace examination facility.
-
- You can obtain the complete distribution of p4 by anonymous ftp from
- info.mcs.anl.gov. Take the file p4.tar.Z from the directory pub/p4. The
- distribution contains all source code, installation instructions, a reference
- manual, and a collection of examples in both C and Fortran. The files
- alog.tar.Z and upshot.tar.Z contain logging and display facilities that can be
- used with both p4 and other systems.
-
- To ask questions about p4, report bugs, contribute examples, etc., send mail
- to p4@mcs.anl.gov.
-
- Rusty Lusk
- lusk@mcs.anl.gov
-