- Newsgroups: comp.parallel
- Path: sparky!uunet!gatech!hubcap!fpst
- From: swamy@cs.ubc.ca (H.V. Sreekantaswamy)
- Subject: SUMMARY: Communication startup costs
- Message-ID: <1992Sep3.123312.14637@hubcap.clemson.edu>
- Apparently-To: comp-parallel@uunet.uu.net
- Followup-To: comp.parallel
- Sender: usenet@cs.ubc.ca (Usenet News)
- Organization: Computer Science, University of B.C., Vancouver, B.C., Canada
- Date: Wed, 2 Sep 92 23:27:37 GMT
- Approved: parallel@hubcap.clemson.edu
- Lines: 226
-
- Here are the replies I got for my request about communication startup
- costs on various distributed memory machines. I have only included
- the replies that had any information or pointers.
-
- Thanks to everyone who replied.
- --Swamy
-
- ------------------------------------------------------------------------
- H.V. Sreekantaswamy               E-mail: swamy@cs.ubc.ca
- Dept. of Computer Science         Tel: (604)822-3731
- University of British Columbia         (604)224-4431 (Home)
- Vancouver, B.C., Canada V6T 1Z2
- ------------------------------------------------------------------------
-
- Original Request:
- ----------------
- I am interested in knowing typical values for communication startup
- costs on various distributed memory machines such as CM-5, nCUBE,
- Intel iPSC, Delta, Paragon, etc. What I mean by communication startup
- cost is the time it takes to begin the actual message transmission
- from the time the user program makes the appropriate function call.
- This should include all the operating system or runtime system overheads
- in initiating the communication in addition to hardware setup time.
- I would appreciate it if anyone could point to any papers/reports that
- discuss these overheads or give the values for any system.
-
- If there is sufficient interest, I will post a summary of replies.
-
- Thanks,
- --Swamy
- *******************************************
-
- From: <rsb@mcc.com>
-
-
- See Gordon Bell's article in the August CACM. He provides some of the
- communication costs you seem to want.
-
- -------------------
-
- From: <yoshio@CS.UCLA.EDU>
-
- You could look at "Low-latency message communication support
- for the AP1000" in this year's ACM computer architecture conference.
-
- ----------------------
- From: Roland Ruehl <ruehl@iis.ethz.ch>
- >Organization: IIS, Swiss Fed. Inst. of Technology
-
-
- Suppose the delay T_del of a communication from cell to cell of a message
- of size M is decomposed into latency and bandwidth contributions:
-
- T_del = T_lat + M / B
-
- and T_c is the time to compute one DAXPY iteration locally.
-
- The following table is an extract from my thesis (all times are in
- microseconds and B is in MBytes/sec):
-
- Machine         | T_c   | T_lat         | B          | Reference
- ----------------+-------+---------------+------------+-----------
- nCube/2         | 2.56  | 154           | 1.7        | (1 + 2)
- iPSC/i860       | 0.31  | M<=100: 75    | 2.5        | (1 + 2)
-                 |       | M >100: 136   |            |
- Delta           | 0.31  | 12.5          | 2.1        | (1 + 2)
- CM5 (scalar)    | 0.5   | >= 3          | 5 ... 20   | (3)
-     (vector)    | 0.18  | >= 3          | 5 ... 20   |
- AP1000          | 1.00  | 20            | 8.5        | (4)
-
- References:
-
- 1) LINPACK performance measured by Dongarra
- 2) Tutorial by Dongarra at Supercomputing 91
- 3) CM5 Performance profile
- 4) my thesis
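
The model T_del = T_lat + M / B can be evaluated directly from the table. The following is only a minimal sketch: it assumes M is given in bytes and that 1 MByte/sec equals 1 byte per microsecond, neither of which is stated in the reply. The function and dictionary names are made up for illustration, and only the rows with a single T_lat and B value are included.

```python
# Sketch of the delay model T_del = T_lat + M / B.
# Assumptions (not stated in the post): M is in bytes, and
# 1 MByte/sec is treated as 1 byte per microsecond.

def t_del(t_lat_us, bandwidth_mb_s, msg_bytes):
    """Return the modelled cell-to-cell delay in microseconds."""
    return t_lat_us + msg_bytes / bandwidth_mb_s

# (T_lat in microseconds, B in MBytes/sec), copied from the table above.
MACHINES = {
    "nCube/2": (154.0, 1.7),
    "Delta": (12.5, 2.1),
    "AP1000": (20.0, 8.5),
}

def delay_table(msg_bytes):
    """Modelled delay in microseconds for each machine at one message size."""
    return {name: t_del(lat, bw, msg_bytes)
            for name, (lat, bw) in MACHINES.items()}

if __name__ == "__main__":
    for size in (0, 100, 1024):
        for name, delay in delay_table(size).items():
            print(f"{name:10s} M={size:5d} bytes  T_del={delay:8.1f} us")
```

For short messages the T_lat term dominates the result; for long messages the M / B term does.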
-
- I would be interested in getting the numbers for other new machines
- (like the Paragon or Parsytec GC) if you can get them.
-
- ---------------------------------------------------------------------------
-
- Roland Ruehl                      Phone:   +41 1 256 51 46
- Integrated Systems Laboratory     Telex:   53 178 ethbi ch
- ETH-Zuerich                       Telefax: +41 1 252 09 94
- Switzerland                       E-mail:  ruehl@iis.ethz.ch
-
- -------------------------
-
- From: <jan@ceres.neuroinformatik.ruhr-uni-bochum.de>
-
- One example you haven't mentioned is the transputer. The current generation
- takes ~20 cycles for the appropriate instruction; add a few cycles for
- loading operands - say 25 cycles at 25 MHz = one microsecond. Data transfer
- is done by a dedicated DMA machine, which also reschedules the process at
- the end of a transfer. As transputers use synchronous communication, a
- message can be any length. The new generation, due out at the end of the year,
- will have automatic demultiplexing/multiplexing for the physical links and
- a wormhole routing chip; IN now takes 16 cycles - let's say 20 cycles at
- 50 MHz = 400 nanoseconds.
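
The cycle arithmetic above is simply cycles divided by clock frequency. A tiny sketch (the helper name is made up for illustration) reproduces both figures:

```python
def instruction_time_ns(cycles, clock_mhz):
    """Time in nanoseconds taken by `cycles` clock cycles at `clock_mhz` MHz."""
    # One cycle at f MHz lasts 1000 / f nanoseconds.
    return cycles * 1000.0 / clock_mhz

if __name__ == "__main__":
    print(instruction_time_ns(25, 25))  # current generation: about one microsecond
    print(instruction_time_ns(20, 50))  # new generation: about 400 nanoseconds
```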
-
- -- Jan Vorbrueggen, jan@neuroinformatik.ruhr-uni-bochum.de
- ------------------------
- From: <deb@k2.dartmouth.edu>
- >Organization: Dartmouth College, Hanover, NH
-
-
- A paper by Marco Annaratone, which I think came out sometime this year
- in IEEE TPDS, has some latency numbers. The machines it reports on
- are the nCUBE, the Intel iPSC/860, and the Paragon.
-
- If you don't get hold of it, send me e-mail after 3 weeks. I am in the
- process of moving right now.
-
-
- --Deb Banerjee
-
- --------------------------------------------
- From: <duncan@meiko.co.uk>
-
- Swamy
-
- Your definition of communications startup time is poor. It allows someone
- to quote a figure which is just the time taken to put data into a FIFO. It
- doesn't require any processing at the destination. The only reason for
- communicating is to have a second process cooperate with the first so you
- should be measuring the time taken to have something occur on a
- remote processor.
-
- The standard way of doing this is the so-called "ping-pong" benchmark, in which
- process 1 sends a message to process 2, which replies. The message latency is half
- the round-trip time as measured on process 1.
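
The ping-pong scheme can be sketched in a few lines. This is an illustration only, not a benchmark for any of the machines discussed: it assumes two threads in one Python process connected by a local socket pair, so the number it reports reflects the host operating system and the interpreter rather than an MPP network. All function names are made up for the example.

```python
import socket
import threading
import time

def _recv_exact(sock, nbytes):
    """Read exactly nbytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = sock.recv(nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)

def _echo(sock, nbytes, rounds):
    """'Process 2': receive each message and send it straight back."""
    for _ in range(rounds):
        sock.sendall(_recv_exact(sock, nbytes))

def ping_pong_latency(nbytes=1, rounds=1000):
    """Return the average one-way latency in seconds (half the round trip)."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=_echo, args=(b, nbytes, rounds))
    worker.start()
    payload = b"x" * nbytes
    try:
        start = time.perf_counter()
        for _ in range(rounds):
            a.sendall(payload)       # ping
            _recv_exact(a, nbytes)   # pong
        elapsed = time.perf_counter() - start
    finally:
        a.close()       # unblocks the echo thread if the loop failed early
        worker.join()
        b.close()
    # The timer runs only on 'process 1'; halve the per-round time.
    return elapsed / rounds / 2.0

if __name__ == "__main__":
    print(f"one-way latency: {ping_pong_latency() * 1e6:.1f} us")
```

Averaging over many rounds smooths out timer resolution and scheduling noise, and because the timer runs on one side only, no clock synchronisation between the two ends is needed.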
-
-
- --
-
- Best Wishes
- Duncan Roweth
-
- -------------------------------------------------------------
- | Meiko Limited Phone : +44 454 616171 |
- | 650 Aztec West FAX : +44 454 618188 |
- | Bristol BS12 4SD E-Mail: duncan@meiko.co.uk |
- | England |
- -------------------------------------------------------------
- -----------------------------
- From: <jc@prosun.first.gmd.de>
-
- Hello Swamy,
-
- a colleague has studied the message
- startup time for the PEACE operating system on
- the SUPRENUM distributed memory machine.
- The paper is called:
- "Overcoming the Startup Time Problem in
- Distributed Memory Architectures"
- by
- W. Schroeder-Preikschat from GMD-First in
- Berlin, Germany.
-
- You can obtain it from "Proceedings of the 24th
- Hawaii International Conference on System Sciences",
- Vol. 1, pp. 551-559, 1991. Additional information on
- PEACE or SUPRENUM can be obtained by anonymous ftp at
- ftp.gmd.de in the directory gmd/peace
-
- Cheers, Joerg
- --
- Joerg Cordsen | INTERNET: jc@first.gmd.de
- c/o GMD-First Berlin | KOMEX: cordsen@kmx.gmd.dbp.de
- O-1199 Berlin-Adlershof | TEL: +49 30 6704 5159
- Rudower Chaussee 5 (13.7) | FAX: +49 30 6704 5088
- ---------------------------
- From: <Mark.Monger@eng.sun.com>
-
- I'm interested in what you find out. I've been using numbers of 2ms for
- workstation RPC and a very *round* number of 100us for MPP message
- overhead. The 100us could be very wrong.
- -------------------------
- From: John Salmon <johns@haggis.ccsf.caltech.edu>
-
- Rik Littlefield from PNL (rik@ccsf.caltech.edu) has done a lot of work
- unraveling the mysteries of communication timing on the Delta. The
- situation is far from simple. I know he has given a couple of talks
- on the subject. I recommend contacting him to see if he's got anything
- written up.
-
- By the way, you're asking the right question. Vendors will try to tell
- you some hardware number that is only distantly related to the "real" cost
- to go from user-code to user-code.
-
- John
- ------------------------------------------------------------------------
- From: <alan@msc.edu>
- To: <swamy@cs.ubc.ca>
- >Organization: Minnesota Supercomputer Center, Inc.
-
-
- Injecting a message into the CM-5 data network takes just over one
- microsecond. On a relatively uncongested communication pattern
- it will reach the other end in another microsecond or so.
-
- Note that 'startup time' cannot be measured directly, only indirectly, by
- comparing the time of an n unit message with an n+1 unit message. This
- function is not linear for all n, however.
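
One common way to make this indirect estimate is to time messages of several sizes and fit a straight line: the intercept at size zero is the startup estimate and the slope is the per-unit cost. The sketch below swaps in an ordinary least-squares fit for the n versus n+1 comparison described above. Since the function is not linear for all n, the fit is only meaningful over a size range where it is roughly linear. The function name is made up, and the data in the usage example are synthetic, not measurements from any machine.

```python
def fit_startup(sizes, times):
    """Least-squares fit of times = startup + per_unit * sizes.

    Returns (startup, per_unit). Only meaningful over a range of sizes
    in which the measured times are roughly linear in the message size.
    """
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_unit = sxy / sxx
    startup = mean_y - per_unit * mean_x
    return startup, per_unit

if __name__ == "__main__":
    # Synthetic data for illustration: 100 us startup, 0.5 us per unit.
    sizes = [8, 64, 256, 1024]
    times = [100.0 + 0.5 * s for s in sizes]
    print(fit_startup(sizes, times))
```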
-
- <This should include all the operating system or runtime system overheads
- <in initiating the communication in addition to hardware setup time.
-
- TMC did the Right Thing and kept the operating system totally out of the
- picture. The communication FIFOs are directly mapped into user memory
- space so there is no "system" overhead for communication. (I wish
- they had gone further and mapped the FIFOs into the CPU registers themselves,
- a la the J machine.)
-
- I'm glad you posted your request. I think too many people emphasize
- bandwidth and forget that low latency is equally important.
-
- --
- Alan E. Klietz
- Minnesota Supercomputer Center, Inc.
- 1200 Washington Avenue South
- Minneapolis, MN 55415
- Ph: +1 612 626 1737 Internet: alan@msc.edu
-
-
-