NetNews Usenet Archive 1992 #16

Internet Message Format | 1992-07-24 | 23.8 KB

Path: sparky!uunet!dtix!darwin.sura.net!wupost!uwm.edu!linac!att!ucbvax!x1sun6.ccl.itri.org.tw!lycmit
From: lycmit@x1sun6.ccl.itri.org.tw (Yin-Chih Lin)
Newsgroups: comp.arch
Subject: Could I compete the low-end CM5 with multiple SPARCstations? (Summary)
Message-ID: <9207240335.AA06549@x1sun6.ccl.itri.org.tw>
Date: 24 Jul 92 17:35:32 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Lines: 583

Greetings:

Pardon me for using the net bandwidth again. I have received many replies on this subject, so I thought it would be useful to summarize the responses and repost them. (I apologize if anyone disagrees.)

Most of the notes imply that it is *almost impossible* to challenge the CM5 with SPARCstations without extra facilities (vector processors, a high-speed network, good software). However, that might be *not true* in my case. I know another team in ITRI/CCL has developed VP processors for SPARCstations (I haven't asked about the actual specification and performance of this processor). I also believe there are vendors that provide FDDI add-on cards for the SBus. The only sore point is the *software*.

It is true that my organization is unlikely to sanction such a project (especially when an IBM ES-9000/820 machine will be available from the National High Speed Computer Center in my country. I don't know *why* they selected IBM). But I also got some directions from the messages I received. Some of them (PVM, p4) are positive encouragement for me to look for what I want with the current facilities.

Anyway, I would be glad to hear any comments on the subject at any time. Please feel free to email me if you have further directions.

Thanks go to the following people (in the order of my mailbox, up to July 24, 12:00, London time plus 8 hours):

Peter Su             psu@cs.duke.edu
Dirk Grunwald        grunwald@foobar.cs.colorado.edu
Jon Mellott          jon@alpha.ee.ufl.edu
Jerry Callen         jcallen@Think.COM
K. J. Chang          kjchang@hpl.hp.com
Steve Roy            ssr@ama.caltech.edu
Scott Townsend       fsset@bach.lerc.nasa.gov
Rowan Hughes         csrdh@marlin.jcu.edu.au
Mark Homewood        fred@meiko.com
Matthias Schumann    schumanm@informatik.tu-muenchen.de
Mike Davis           davis@ee.udel.edu
Michael Kolatis      kolatis@cs.utk.edu
Dave                 dave@sep.stanford.edu
Henrik Klagges       henrik@tazdevil.llnl.gov
(anonymous)          vu0208@bingsuns.cc.binghamton.edu
Bob Felderman        feldy@isi.edu
Ian L. Kaplan        ian@MasPar.COM
Bill Maniatty        maniattb@cs.rpi.edu
Keith Bierman        Keith.Bierman@Eng.Sun.COM
Paul Whiting         Paul.G.Whiting@mel.dit.csiro.au

Here are the original message and replies:

------------------------------ ( original ) --------------------------------

We have many Sun SPARCstations (from the 1, 1+, and IPC to the 2 and 670MP, with the SPARCstation 10 in the near future), and the new multi-threaded OS (SunOS 5.0) will be installed on these computers. I find that Thinking Machines Corp.'s latest MPP machine, the CM-5, has also adopted SPARC processors (22 MIPS per processor) as its nodes. According to the Oct. 1991 issue of "Parallel Processing", the 32-node CM5 model will sell for about $1.4 million.

I am curious: suppose I connect 32 SPARCstation 10 model 30 computers (33MHz SuperSPARC module with 86.1 MIPS, as advertised by Sun Micro!?) as loosely-coupled MIMD multicomputers(?) by Ethernet, ISDN or FDDI, with the multi-threaded OS, and perhaps use facilities from Linda-like distributed data structures. Is it possible for me to obtain reasonable performance compared with the 32-node TMC CM5 (SIMD machine)?

I would highly appreciate any comments. Please direct your opinion (or abuse) to me by email, so the news bandwidth won't be wasted (except this one!).

Thanks,

P.S. I would also like to know: has anyone configured their SPARCstations as a message-passing MIMD multicomputer rather than merely as LAN-based networked workstations?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Yin-Chih Lin
Industrial Technology Research Institute (ITRI)
Computer & Comm. Research Labs. (CCL)
email: lycmit@x1sun6.ccl.itri.org.tw
phone: 886-35-917331   fax: 886-35-917503
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

------------------------------ ( reply #1 ) --------------------------------
From: Peter Su <psu@cs.duke.edu>

The answer depends on two things:

1) Would your code be able to take advantage of the CM-5 vector units?
2) Does your code need to do a lot of communication/synchronization?

If you have code that breaks up into big, independent chunks which do not vectorize easily, and you have libraries to manage workstation clusters, then I can't see why the CM-5 would be much better.

Oh, the CM-5 probably has more usable memory per node, since data processing nodes don't run the full OS.

Pete

------------------------------ ( reply #2 ) --------------------------------
From: Dirk Grunwald <grunwald@cs.colorado.edu>

The CM-5 also has 200Mf vector processors and a very fast network. Most importantly, it has good software.

------------------------------ ( reply #3 ) --------------------------------
From: Jon Mellott <jon@alpha.ee.ufl.edu>

It depends a lot on the type of tasks that you are doing. If your tasks are characterized by lots of communication between processes/processors, then the CM-5 will probably eat you alive. Also, the CM-5 does have the option of a vector processor for each processor module, so you could potentially be chewed up there also.

BTW, if your intent is to do something like this, then you might consider going to something like the SS10 model 54 (four processors). You'd be money ahead on a per processor basis since you wouldn't have to buy as many pizza boxes per processor.
Also, if your code is amenable to multi-threading then you might be able to derive a big benefit from running on a multiprocessor machine, especially from an inter-thread communications standpoint.

Lemme see, the SS10 model 54 is listing at something like $50K, so for a million bucks you could get twenty of them. That'd give you eighty processors. You could spend the savings on FDDI interfaces so you don't get butchered by ethernet latencies.

BTW, this sort of thing has been described before in the literature. I don't recall where, but it has been done before.

Good luck,
Jon Mellott
High Speed Digital Architecture Laboratory
(jon@alpha.ee.ufl.edu)

------------------------------ ( reply #4 ) --------------------------------
From: Jerry Callen <jcallen@Think.COM>

[Claimer: I work for Thinking Machines.]

There is a lot more to a CM-5 than SPARCs and memory. Specifically, there are two high-bandwidth, low-latency networks. The "control network" can perform a number of combining operations (max, add, or, xor), broadcasting and synchronization functions. The "data network" allows point-to-point communications between multiple nodes SIMULTANEOUSLY at very high bandwidths (>5MB/sec/node). This gives a 32 node machine a (very conservatively rated) aggregate bandwidth of 160MB/sec. You just cannot get this kind of bandwidth out of Ethernet or even FDDI. Also, the network interface is mapped into user memory space, so you don't have to go through operating system layers to get at it.

There are also 4 vector units on each node; each vector unit has a dedicated bank of memory, yielding very high memory bandwidth. Each vector unit can do 32MFLOPS (peak, but I have seen codes get >75% of this).

Also, a point of clarification: the CM-5 is NOT a "SIMD machine." While the networks provide excellent support for a data-parallel programming model, TMC supports a message-passing programming model as well.
Call TMC at (617) 234-1000 and ask for a CM-5 Technical Summary if you would like more information.

-- Jerry Callen
jcallen@world.std.com (preferred)
jcallen@think.com (OK, too)
{uunet,harvard}!think!jcallen (if you must)

------------------------------ ( reply #5 ) --------------------------------
From: K. J. Chang <kjchang@hplabsz.hpl.hp.com>

Of course! That's why there are at least 20 new RISC-based workstations being designed here in Silicon Valley. Mainframe or supercomputer vendors do not release their SPECfp92 or SPECint92 figures because their products are not cost-effective.

You may want to read Chap. 2 of "Computer Architecture: A Quantitative Approach" (John Hennessy and David Patterson), especially Figure 2.25, which tells you that those expensive computers' MFLOPS are in the range of 8.3 to 24.4 only.

-- K J Chang, Hewlett-Packard ICBD R & D, Palo Alto, CA 94304
Internet: kjchang@hpl.hp.com
voice mail: (415)857-4604
a UCLA Bruin

------------------------------ ( reply #6 ) --------------------------------
From: Steve Roy <ssr@ama.caltech.edu>

There are three main differences that ensure that a network of workstations will not compete with the CM5.

The first difference is a hardware one: the Sparc in the CM5 is really just a controller/scheduler for custom vector/memory processors. Each Sparc controls 4 of these processors. These vector boards are much, much faster than the Sparc for number crunching, the main point of this sort of machine. The fact that it has Sparcs in it is sort of irrelevant for number crunching considerations.

The second difference is the speed of the communications: the CM5 has a very fast and sophisticated network that allows arbitrary permutation communication to proceed at not less than 5MByte/sec per processor. This is the high bandwidth network; there is also a lower bandwidth but also lower latency network that is used primarily for fast synchronizations between processors.
There is a third network for diagnostics. The networks are redundant so the machine can continue to work given any single point failure, an important consideration for a massively parallel system.

The third difference is the integrated and powerful software that allows easy parallel coding and debugging. The debugger is X-based and allows you to click on (for example) an array name and display its contents either as numbers or as a contour plot. The first compiler is an F90 variant; there will later be ports of C* and presumably *Lisp.

It was a good question tho. I hope you post a summary to the net.

Steve Roy.

--
----------------------------------------------------------------------
Steve Roy            | Life. Don't talk to me about life.
ssr@ama.caltech.edu  |

------------------------------ ( reply #7 ) --------------------------------
From: Scott Townsend <fsset@bach.lerc.nasa.gov>

While the individual processors will be comparable to those in the CM5, as a group your clustered processors will approximate the CM5 only if your parallel application does almost zero communications.

We've run some experiments here with a cluster of RS6000's. On applications which require very little communication relative to the computation rate, things run very well. As soon as you get into more typical parallel applications, things slow down REAL fast!

The main reason for this is latency. Local networks are not designed for very low latency. Unless the interconnect in a parallel processor has _very_ low latency compared to CPU speed, you spend all your time waiting for messages instead of computing.

-- Scott Townsend, Sverdrup Technology Inc.
NASA Lewis Research Center Group
fsset@bach.lerc.nasa.gov

------------------------------ ( reply #8 ) --------------------------------
From: Rowan Hughes <csrdh@marlin.jcu.edu.au>

You'd need to connect them together with at least FDDI. The bandwidth between nodes is the biggest problem with MPP machines. 1000Mb/s would be better.
Also, you probably won't be able to use more nodes than 16, since the search/latency overheads become prohibitive on a single communication rail, ie you've got no secondary search engine.

At the moment the CM-5 is just a bunch of Sparc stations in one box; the vector cpu's haven't been fabricated yet (and probably never will be). The $1.4M is really just for a bit of software.

-- Rowan Hughes
James Cook University
Marine Modelling Unit, Townsville, Australia.
Dept. Civil and Systems Engineering
csrdh@marlin.jcu.edu.au

------------------------------ ( reply #9 ) --------------------------------
From: Mark Homewood <fred@meiko.co.uk>

Besides designing FPU's for SPARC, Meiko manufactures machines based around something like what you have described. We buy commodity boards or workstation components and produce (hopefully) cost effective parallel supercomputers. We do find it necessary to add some extra communication bandwidth and reduced latency. Clearly a 32node TMC is out to lunch; we do a similar size machine for a fraction of the price. The major difference is that ours can be used as a network of Suns as well, and at the same time.

There is some public domain software which lets you use your machines as a parallel machine. This is called PVM (Parallel Virtual Machine); try an anonymous ftp to <research.att.com> in dir <netlib/pvm>. There is a Readme and index. You can compile and configure for your network. Though as one of your repliers states, latency is the thing. This is what Meiko (and TMC, though over-priced) are selling.

Fred
Meiko Limited
fred@meiko.com

------------------------------ ( reply #10 ) -------------------------------
From: Matthias Schumann <schumanm@informatik.tu-muenchen.de>

Hi,

here at the Technical University Munich an environment for parallel programming called MMK was developed, which runs on Intel iPSC 2 / 860 MIMD supercomputers.
There is another version called MMKx, which offers the same functions (task creation, deletion / communication / semaphores, synchronization / .....) on a network of Sun workstations connected over an Ethernet.

The question about the CM5 is application dependent, I would say. Code that is written especially for SIMD machines will run faster on the CM. If you have general-purpose code, the performance of the Sun network will increase, but whether it could reach the CM's performance I have no idea, since I have no information about the interprocessor communication network in the CM5.

Ciao
Matt

--
| Matthias Schumann                    Department of Computer Science/SAB TU Muenchen |
| schumanm@informatik.tu-muenchen.de   Arcisstr.21                                    |
| Phone: +49 89 2105 2036              8000 Muenchen 2                                |

------------------------------ ( reply #11 ) -------------------------------
From: Mike Davis <davis@ee.udel.edu>

I would be interested in any responses you get on this subject. I'm involved in a project looking into running distributed simulations on a group of FDDI-connected SPARCstation 10's running Linda. Since the closest thing we have to compare to is an older 16-processor Sequent, I can't help with your CM5 questions.

thanks
mike
davis@ee.udel.edu

------------------------------ ( reply #12 ) -------------------------------
From: Michael Kolatis <kolatis@cs.utk.edu>

Let's see, from what I know so far, one node of a TMC CM-5 will have a SPARC chip that performs about 5 Mflops. Thus, with 32 nodes, you would expect a peak of 160 Mflops.

Now, a SPARCstation 10 with 4 processors is supposed to perform at 19.7 Mflops per processor (i.e. almost 80 Mflops). Thus, 32 * 80 = 2560 Mflops = 2.56 Gflops.

However, each node of a CM-5 will eventually include 4 vector processors (TMC calls them "floating point accelerators") which will increase the node power to 128 Mflops per node (which makes a 32 node CM-5 capable of 4 Gflops).
Thus, theoretically, with the communication speed losses involved with the SPARC 10's and the lack of a parallel programming language for them, the SPARC's would lose (also, note that if a SPARC 10 does have 4 processors, your connection would have 128 processors--not 32).

However, hooking the SPARC 10's together using PVM (Parallel Virtual Machine) as parallel software would be very interesting... and certainly makes the efforts toward massively parallel computing very viable.

Michael Kolatis
UTenn Joint Institute for Computational Science

------------------------------ ( reply #13 ) -------------------------------
From: Dave <dave@sep.stanford.edu>

Unfortunately not; each CM-5 node also has 4 custom vector units running at 120MFlop peak each. That, and the high bandwidth network, are what you are really paying for with a CM-5.

------------------------------ ( reply #14 ) -------------------------------
From: Henrik Klagges <henrik@tazdevil.llnl.gov>

I think it is a good idea. At LLNL we have a cluster of RS6000, connected via a very fast network, with good results.

-- Cheers, Henrik
MPCI at LLNL
IBM Research

------------------------------ ( reply #15 ) -------------------------------
From: (anonymous) <vu0208@bingsuns.cc.binghamton.edu>

If you want to do what you have said in your post, you might create a loosely-coupled loosely-performed-MIMD!! Why? Because the CM-5 has some extra things that you won't have:

(1) A fast communication network, rather than your FDDI/Ethernet/ISDN etc.
(2) Each node has a dedicated 22-MIPS RISC microprocessor with four vector pipes providing a total of 128 MFlops peak speed!!
(3) The machine is capable of playing a dual role at the same time: SIMD and MIMD.

So I guess by just connecting 32 Sparcstations you won't hit the 129 MFlops performance; however, it may be at least 1/4 of a 16-node CM-5.

Let's see what you get after you are done with your proposed project. Keep me posted, I will appreciate it.
Thanks

------------------------------ ( reply #16 ) -------------------------------
From: Bob Felderman <feldy@isi.edu>

I suspect not. If you could, then TMC would quickly go out of business. My guess is that their interconnection network is MUCH better than any of your choices. For one thing, your choices were mostly shared-medium channels.

If your tasks require significant communication, you'll get killed by the Ethernet or FDDI etc. If your tasks are mostly independent, you'll probably do just fine.

Bob Felderman
USC/Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292-6695
(310) 822-1511 x222
(310) 823-6714 (fax)
feldy@isi.edu

------------------------------ ( reply #17 ) -------------------------------
From: Ian L. Kaplan <ian@MasPar.COM>

Whether you can get performance out of a network of 32 SPARCstations comparable to what can be obtained from a 32 processor CM-5 depends on the application. If you have an application with little or no communication (some Monte Carlo simulations fall into this category) then the answer is yes. The more interprocessor communication, the more the CM-5 will win. They have put a lot of effort into designing their communication network and it is much faster than anything you are likely to find connecting workstations.

Another issue is software overhead in handling communication. I don't know what TMC has done in this area (we are one of their competitors, so they don't tell us these things), but I imagine that they have worked to reduce this overhead. This is not as important in SMP Solaris, so it is probably not as tightly tuned.

Of course what you should really do is buy a MasPar machine, which will give you more processing power per dollar than either TMC or Sun can provide.

Ian Kaplan
ian@maspar.com

------------------------------ ( reply #18 ) -------------------------------
From: Bill Maniatty <maniattb@cs.rpi.edu>

I'd be skeptical of getting the same level of performance.
I think the expense of the networked connections (latency in particular) would force you to use a larger granularity of task decomposition. I haven't played with a CM-5 so I would not really know for sure, but that is my gut reaction. You might also get a different resource contention pattern because of the different Operating System/Compiler technology. I don't think they are really equivalent; each is appropriate for a different sort of task.

Bill

--
maniattb@cs.rpi.edu - in real life Bill Maniatty

------------------------------ ( reply #19 ) -------------------------------
From: Keith Bierman <Keith.Bierman@Eng.Sun.COM>

The CM-5 has much faster interprocessor communication (fast custom hw, special sw); you won't come close to their performance for most codes.

Many people use networks of workstations for distributed processing. They use Express, Strand, PVM and a bevy of research efforts. Wander through comp.parallel, or a local university library (if they have a good collection of conference papers and technical reports).

------------------------------ ( reply #20 ) -------------------------------
From: Paul Whiting <pgw@mel.dit.csiro.au>

G'day,

yes, this is possible to do using the macros from Argonne National Lab in the US. I will forward you another piece of mail about them and how to obtain them.

regards,

---------------------------------------------------------------------
Paul Whiting                        | CSIRO - DIT
Computer Scientist                  | 723 Swanston Street
High Performance Computing Program  | Carlton 3053, Australia
---------------------------------------------------------------------
E-mail: pgw@mel.dit.csiro.au        Phone: + 61 3 282 2666
There's no problem a good surf can't fix.
---------------------------------------------------------------------

(****************************** p4 info **********************************)

p4

p4 is a library of macros and subroutines developed at Argonne National Laboratory for programming a variety of parallel machines.
Its predecessor was the m4-based "Argonne macros" system described in the Holt, Rinehart, and Winston book "Portable Programs for Parallel Processors", by Lusk, Overbeek, et al., from which p4 takes its name. The current p4 system maintains the same basic computational models described there (monitors for the shared-memory model, message-passing for the distributed-memory model, and support for combining the two models) while significantly increasing ease and flexibility of use.

The current release is version 0.2. New features added since version 0.1 include:

+ xdr for communication in a heterogeneous network
+ asynchronous communication of large messages
+ global operations (broadcast, global sum, max, etc.)
+ both master-slave and SPMD models for message-passing programs
+ an improved and simplified Fortran interface
+ improved error handling
+ an optional secure server
+ ports to more machines

p4 is intended to be portable, simple to install and use, and efficient. It can be used to program networks of workstations, advanced parallel supercomputers like the Intel Touchstone Delta and the Alliant Campus HiPPI-based system, and single shared-memory multiprocessors. It has currently been installed on the following list of machines: Sequent Symmetry, Encore Multimax, Alliant FX/8, FX/800, and FX/2800, Cray X/MP, Sun, NeXT, DEC, Silicon Graphics, and IBM RS6000 workstations, Stardent Titan, BBN GP-1000 and TC-2000, Intel IPSC/860, Intel Touchstone Delta, and Alliant Campus. It will soon be ported to the CM-5 and to the Intel Paragon. It is not difficult to port to new systems.

Work in progress also includes use of monitors in Fortran and shared memory among multiple processes on multiprocessor workstations. A useful companion system is the upshot logging and X-based trace examination facility.

You can obtain the complete distribution of p4 by anonymous ftp from info.mcs.anl.gov. Take the file p4.tar.Z from the directory pub/p4.
The distribution contains all source code, installation instructions, a reference manual, and a collection of examples in both C and Fortran. The files alog.tar.Z and upshot.tar.Z contain logging and display facilities that can be used with both p4 and other systems.

To ask questions about p4, report bugs, contribute examples, etc., send mail to p4@mcs.anl.gov.

Rusty Lusk
lusk@mcs.anl.gov
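
For readers who want to cross-check the numbers quoted in the replies, the peak-rate arithmetic can be sketched in a few lines of Python. The figures are the ones given in replies #3, #4 and #12. The function names are hypothetical, and the nominal link rates used in the last function (10 Mbit/s for Ethernet, 100 Mbit/s for FDDI) are assumptions that do not appear in the thread. These are peak figures only and say nothing about sustained performance.

```python
# Peak-rate arithmetic using figures quoted in the replies above.
# All function names are hypothetical, chosen for this sketch.

NODES = 32


def cm5_aggregate_bandwidth_mb_s(nodes=NODES, per_node_mb_s=5.0):
    """Reply #4: at least 5 MB/s per node on the CM-5 data network."""
    return nodes * per_node_mb_s


def cm5_peak_mflops(nodes=NODES, per_node_mflops=128.0):
    """Reply #12: four vector units give 128 Mflops per node."""
    return nodes * per_node_mflops


def ss10_cluster_peak_mflops(boxes=NODES, cpus_per_box=4, per_cpu_mflops=19.7):
    """Reply #12: four-processor SPARCstation 10, 19.7 Mflops per processor."""
    return boxes * cpus_per_box * per_cpu_mflops


def processors_for_budget(budget_usd=1_000_000, box_price_usd=50_000,
                          cpus_per_box=4):
    """Reply #3: SS10 model 54 at roughly $50K per four-processor box."""
    return (budget_usd // box_price_usd) * cpus_per_box


def shared_medium_share_mb_s(link_mbit_s, nodes=NODES):
    """Best-case per-node share of ONE shared link, in MB/s.

    Assumes the nominal link rate is split evenly and ignores protocol
    overhead and latency, so real figures would be lower.
    """
    return link_mbit_s / 8.0 / nodes


if __name__ == "__main__":
    print("CM-5 aggregate bandwidth (MB/s):", cm5_aggregate_bandwidth_mb_s())
    print("CM-5 peak (Mflops):", cm5_peak_mflops())
    print("SS10 cluster peak (Mflops):", ss10_cluster_peak_mflops())
    print("Processors for $1M:", processors_for_budget())
    print("FDDI share per node (MB/s):", shared_medium_share_mb_s(100))
    print("Ethernet share per node (MB/s):", shared_medium_share_mb_s(10))
```

Reply #12 rounds 4 * 19.7 up to 80 Mflops per box, which is why it reports 2560 Mflops rather than the unrounded product. Under the assumed nominal rates, the per-node share of a single shared FDDI ring comes out well below the 5 MB/s per node quoted for the CM-5 data network, which is the point replies #4 and #16 make about shared-medium channels.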