NetNews Usenet Archive 1992 #19



Newsgroups: comp.parallel
Path: sparky!uunet!gatech!hubcap!fpst
From: swamy@cs.ubc.ca (H.V. Sreekantaswamy)
Subject: SUMMARY: Communication startup costs
Message-ID: <1992Sep3.123312.14637@hubcap.clemson.edu>
Apparently-To: comp-parallel@uunet.uu.net
Followup-To: comp.parallel
Sender: usenet@cs.ubc.ca (Usenet News)
Organization: Computer Science, University of B.C., Vancouver, B.C., Canada
Date: Wed, 2 Sep 92 23:27:37 GMT
Approved: parallel@hubcap.clemson.edu
Lines: 226

Here are the replies I got to my request about communication startup
costs on various distributed memory machines. I have only included the
replies that had any information or pointers.

Thanks to everyone who replied.

--Swamy

------------------------------------------------------------------------
H.V. Sreekantaswamy                 E-mail: swamy@cs.ubc.ca
Dept. of Computer Science           Tel:    (604)822-3731
University of British Columbia              (604)224-4431 (Home)
Vancouver, B.C., Canada V6T 1Z2
------------------------------------------------------------------------

Original Request:
----------------

I am interested in knowing typical values for communication startup
costs on various distributed memory machines such as CM-5, nCUBE,
Intel IPSC, Delta, Paragon, etc. What I mean by communication startup
cost is the time it takes to begin the actual message transmission from
the time the user program makes the appropriate function call. This
should include all the operating system or runtime system overheads
in initiating the communication in addition to hardware setup time.

I would appreciate it if anyone could point to any papers/reports that
discuss these overheads or give the values for any system.

If there is sufficient interest, I will post a summary of replies.

Thanks,

--Swamy

*******************************************
From: <rsb@mcc.com>

See Gordon Bell's article in the August CACM. He provides some of the
communication costs you seem to want.

-------------------
From: <yoshio@CS.UCLA.EDU>

You could look at "low-latency message communication support for the
ap1000" in this year's ACM computer architecture conference.

----------------------
From: Roland Ruehl <ruehl@iis.ethz.ch>
>Organization: IIS, Swiss Fed. Inst. of Technology

Suppose the delay T_del of a cell-to-cell communication of a message of
size M is decomposed into latency and bandwidth contributions:

        T_del = T_lat + M / B

and T_c is the time to compute one DAXPY iteration locally. The following
table is an extract from my thesis (all times are in microseconds and B
is in MBytes/sec):

Machine         | T_c           | T_lat         | B             | Reference
----------------+---------------+---------------+---------------+-----------
nCube/2         | 2.56          | 154           | 1.7           | (1 + 2)
iPSC/i860       | 0.31          | M<=100: 75    | 2.5           | (1 + 2)
                |               | M >100: 136   |               |
Delta           | 0.31          | 12.5          | 2.1           | (1 + 2)
CM5 (scalar)    | 0.5           | >= 3          | 5 ... 20      | (3)
    (vector)    | 0.18          | >= 3          | 5 ... 20      |
AP1000          | 1.00          | 20            | 8.5           | (4)

References:

1) LINPACK performance measured by Dongarra
2) Tutorial by Dongarra at Supercomputing 91
3) CM5 Performance profile
4) my thesis

I would be interested in getting the numbers for other new machines
(like the Paragon or Parsytec GC) if you can get them.

---------------------------------------------------------------------------
Roland Ruehl                       Phone:   +41 1 256 51 46
Integrated Systems Laboratory      Telex:   53 178 ethbi ch
ETH-Zuerich                        Telefax: +41 1 252 09 94
Switzerland                        E-mail:  ruehl@iis.ethz.ch

-------------------------
From: <jan@ceres.neuroinformatik.ruhr-uni-bochum.de>

One example you haven't mentioned is the transputer. The current
generation takes ~20 cycles for the appropriate instruction; add a few
cycles for loading operands - say 25 cycles at 25 MHz = one microsecond.
Data transfer is done by a dedicated DMA machine, which also reschedules
the process at the end of a transfer. As transputers use synchronous
communication, a message can be any length.

The new generation, due out at the end of the year, will have automatic
demultiplexing/multiplexing for the physical links and a worm-hole routing
chip; IN now takes 16 cycles - let's say 20 cycles at 50 MHz = 400
nanoseconds.

--
Jan Vorbrueggen, jan@neuroinformatik.ruhr-uni-bochum.de

------------------------
From: <deb@k2.dartmouth.edu>
>Organization: Dartmouth College, Hanover, NH

A paper by Marco Annaratone, which I think came out sometime this year in
IEEE TPDS, has some latency numbers. The machines they report on are
the Ncube, Intel ipsc/860 and the Paragon.

If you can't get hold of it, send me e-mail after 3 weeks. I am in the
process of moving right now.

--Deb Banerjee

--------------------------------------------
From: <duncan@meiko.co.uk>

Swamy,

Your definition of communications startup time is poor. It allows
someone to quote a figure which is just the time taken to put data
into a FIFO. It doesn't require any processing at the destination.

The only reason for communicating is to have a second process cooperate
with the first, so you should be measuring the time taken to have
something occur on a remote processor. The standard way of doing this
is the so-called "ping-pong" benchmark, in which process 1 sends a
message to process 2, which replies. The message latency is half the
round-trip time as measured on process 1.

--
Best Wishes
Duncan Roweth

-------------------------------------------------------------
| Meiko Limited             Phone : +44 454 616171          |
| 650 Aztec West            FAX   : +44 454 618188          |
| Bristol BS12 4SD          E-Mail: duncan@meiko.co.uk      |
| England                                                   |
-------------------------------------------------------------

-----------------------------
From: <jc@prosun.first.gmd.de>

Hello Swamy,

a colleague has studied the message startup time for the PEACE
operating system on the SUPRENUM distributed memory machine.

The paper is called: "Overcoming the Startup Time Problem in Distributed
Memory Architectures" by W. Schroeder-Preikschat from Gmd First in
Berlin, Germany.
You can obtain it from "Proceedings of the 24th Hawaii International
Conference on System Sciences", Vol. 1, pp. 551-559, 1991.

Additional information on PEACE or SUPRENUM can be obtained by anonymous
ftp at ftp.gmd.de in the directory gmd/peace

Cheers,
Joerg

--
Joerg Cordsen            | INTERNET: jc@first.gmd.de
c/o GMD-First Berlin     | KOMEX:    cordsen@kmx.gmd.dbp.de
O-1199 Berlin-Adlershof  | TEL:      +49 30 6704 5159
Rudower Chausee 5 (13.7) | FAX:      +49 30 6704 5088

---------------------------
From: <Mark.Monger@eng.sun.com>

I'm interested in what you find out. I've been using numbers of 2ms for
workstation RPC and a very *round* number of 100us for MPP message
overhead. The 100us could be very wrong.

-------------------------
From: John Salmon <johns@haggis.ccsf.caltech.edu>

Rik Littlefield from PNL (rik@ccsf.caltech.edu) has done a lot of work
unraveling the mysteries of communication timing on the Delta. The
situation is far from simple. I know he has given a couple of talks on
the subject. I recommend contacting him to see if he's got anything
written up.

By the way, you're asking the right question. Vendors will try to tell
you some hardware number that is only distantly related to the "real"
cost to go from user-code to user-code.

John

------------------------------------------------------------------------
From: <alan@msc.edu>
To: <swamy@cs.ubc.ca>
>Organization: Minnesota Supercomputer Center, Inc.

Injecting a message into the CM-5 data network takes just over one
microsecond. On a relatively uncongested communication pattern it will
reach the other end in another microsecond or so.

Note 'startup time' cannot be directly measured, only indirectly, by
comparing the time of an n unit message with an n+1 unit message. This
function is not linear for all n, however.

<This should include all the operating system or runtime system overheads
<in initiating the communication in addition to hardware setup time.

TMC did the Right Thing and kept the operating system totally out of the
picture. The communication FIFOs are directly mapped into user memory
space, so there is no "system" overhead for communication. (I wish they
had gone further and mapped the FIFOs into the CPU registers themselves,
a la the J machine.)

I'm glad you posted your request. I think too many people emphasize
bandwidth and forget that low latency is equally important.

--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN 55415
Ph: +1 612 626 1737
Internet: alan@msc.edu
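
------------------------------------------------------------------------
Alan Klietz's remark that startup time can only be measured indirectly,
together with Roland Ruehl's model T_del = T_lat + M / B, suggests one
possible procedure: time messages of several sizes, fit a straight line,
and read the intercept as T_lat and the reciprocal of the slope as B.
The Python sketch below illustrates this under stated assumptions. The
function names are hypothetical, M is assumed to be in bytes, 1 MByte is
assumed to be 10^6 bytes, and the sample timings are synthesized from the
nCube/2 row of the table, not measured. As the replies point out, the
relationship is not linear for all message sizes (the iPSC/i860 row has
two latency regimes), so a fit like this is only meaningful within a
single regime.

```python
def model_delay_us(m_bytes, t_lat_us, b_mbytes_per_s):
    """T_del = T_lat + M / B, in microseconds.

    Assumes M is in bytes and 1 MByte/sec = 1 byte/microsecond
    (i.e. 1 MByte = 10**6 bytes).
    """
    return t_lat_us + m_bytes / b_mbytes_per_s


def one_way_latency_us(round_trip_us):
    """Ping-pong convention: message latency is half the round trip."""
    return round_trip_us / 2.0


def fit_latency_bandwidth(sizes_bytes, times_us):
    """Least-squares line through (M, T_del) points.

    Returns (T_lat in microseconds, B in MBytes/sec). Hypothetical helper:
    the intercept estimates the startup cost, 1/slope the bandwidth.
    """
    n = len(sizes_bytes)
    if n < 2 or n != len(times_us):
        raise ValueError("need at least two (size, time) pairs")
    mean_m = sum(sizes_bytes) / n
    mean_t = sum(times_us) / n
    sxx = sum((m - mean_m) ** 2 for m in sizes_bytes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((m - mean_m) * (t - mean_t)
              for m, t in zip(sizes_bytes, times_us))
    slope = sxy / sxx                  # microseconds per byte
    if slope <= 0:
        raise ValueError("time must grow with message size")
    intercept = mean_t - slope * mean_m
    return intercept, 1.0 / slope


if __name__ == "__main__":
    # Synthetic one-way timings generated from the nCube/2 table row
    # (T_lat = 154 microseconds, B = 1.7 MBytes/sec); not measurements.
    sizes = [0, 256, 1024, 4096]
    times = [model_delay_us(m, 154.0, 1.7) for m in sizes]
    t_lat, bandwidth = fit_latency_bandwidth(sizes, times)
    print("T_lat ~ %.1f us, B ~ %.2f MBytes/sec" % (t_lat, bandwidth))
```

With real data, the times passed to fit_latency_bandwidth would be
one-way latencies, for example half of each measured ping-pong round
trip as in Duncan Roweth's description.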