Newsgroups: comp.parallel
Path: sparky!uunet!gatech!hubcap!fpst
From: gottlieb@allan.ultra.nyu.edu (Allan Gottlieb)
Subject: Re: Kendall Square machine with >32 nodes
In-Reply-To: lenos@tardis's message of Sun, 13 Dec 1992 00:37:15 GMT
Message-ID: <1992Dec15.134517.7504@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Nntp-Posting-Host: allan.ultra.nyu.edu
Organization: New York University, Ultracomputer project
References: <lenos.724207035@tardis.union.edu>
Date: 14 Dec 92 12:13:04
Approved: parallel@hubcap.clemson.edu
Lines: 227

In article <lenos.724207035@tardis.union.edu> lenos@tardis (Scott Leno) writes:

I have heard that Cornell and NCSC have KSR machines with 128 and 64 nodes
each. What I was wondering was how they do when they have to reference nodes
on another ring of 32. I have seen info on how a program does on a single
ring of 32 nodes, but nothing about how it does when it is spread over more
than one ring of nodes (i.e. >32 procs). Thanks for any info you might have.
peace,
Scott

Here is the KSR part of a paper I presented at PACTA'92 in Barcelona
this September.

.\" New Century Schoolbook fonts
.fp 1 NR \" normal
.fp 2 NI \" italic
.fp 3 NB \" bold
.sz 11
.nr pp 11
.nr ps 1v \" They want double space before paragraph
.nr sp 12
.nr fp 10
.pl 26c
.m1 1c
.m2 0
.m3 0
.m4 0
.ll 14c
.tp
.(l C
.sz +2
.b "Architectures for Parallel Supercomputing"
.sz -2
.sp .5c
Allan Gottlieb
.sp 1.5c
Ultracomputer Research Laboratory
New York University
715 Broadway, Tenth Floor
New York NY 10003 USA
.)l
.sp 1c
.sh 1 Introduction
.lp
In this talk, I will describe the architectures of new commercial
offerings from Kendall Square Research, Thinking Machines
Corporation, Intel Corporation, and the MasPar Computer Corporation.
These products span much of the currently active design space for
parallel supercomputers, including shared-memory and message-passing,
MIMD and SIMD, and processor sizes from a square millimeter to
hundreds of square centimeters. However, there is at least one
commercially important class omitted: the parallel vector
supercomputers, whose death at the hands of the highly parallel
invaders has been greatly exaggerated (shades of Mark Twain). Another
premature death notice may have been given to FORTRAN, since all these
machines speak (or rather understand) this language\*-but that is
another talk.
.sh 1 "New Commercial Offerings"
.lp
I will describe the architectures of four new commercial offerings:
The shared-memory MIMD KSR1 from Kendall Square Research; two
message-passing MIMD computers, the Connection Machine CM-5 from
Thinking Machines Corporation and the Paragon XP/S from Intel
Corporation; and the SIMD MP-1 from the MasPar Computer Corporation.
Much of this section is adapted from material prepared for the
forthcoming second edition of
.i "Highly Parallel Computing" ,
a book I co-author with George Almasi from IBM's T.J. Watson Research
Center.
.sh 2 "The Kendall Square Research KSR1"
.lp
The KSR1 is a shared-memory MIMD computer with private, consistent
caches, that is, each processor has its own cache and the system
hardware guarantees that the multiple caches are kept in agreement.
In this regard the design is similar to the MIT Alewife [ACDJ91] and the
Stanford Dash [LLSJ92]. There are, however, three significant differences
between the KSR1 and the two university designs. First, the Kendall
Square machine is a large-scale, commercial effort: the current design
supports 1088 processors and can be extended to tens of thousands.
Second, the KSR1 features an ALLCACHE memory, which we explain below.
Finally, the KSR1, like the Illinois Cedar [GKLS84], is a hierarchical
design: a small machine is a ring or
.q "Selection Engine"
of up to 32 processors (called an SE:0); to achieve
1088 processors, an SE:1 ring of 34 SE:0 rings is assembled. Larger
machines would use yet higher level rings. More information on the
KSR1 can be found in [Roth92].
.sh 3 Hardware
.lp
A 32-processor configuration (i.e. a full SE:0 ring) with 1 gigabyte
of memory and 10 gigabytes of disk requires 6 kilowatts of power and 2
square meters of floor space. This configuration has a peak
computational performance of 1.28 GFLOPS and a peak I/O bandwidth of
420 megabytes/sec. In a March 1992 posting to the comp.parallel
electronic newsgroup, Tom Dunigan reported that a 32-processor KSR1 at
the Oak Ridge National Laboratory attained 513 MFLOPS on the
1000\(mu1000 LINPACK benchmark. A full SE:1 ring with 1088 processors
equipped with 34.8 gigabytes of memory and 1 terabyte of disk would
require 150 kilowatts and 74 square meters. Such a system would have
a peak floating point performance of 43.5 GFLOPS and a peak I/O
bandwidth of 15.3 gigabytes/sec.
.pp
Each KSR1 processor is a superscalar 64-bit unit able to issue up to
two instructions every 50 ns, giving a peak performance rating of 40
MIPS. (KSR is more conservative and rates the processor as 20 MIPS
since only one of the two instructions issued can be computational,
but I feel that both instructions should be counted. If there is any
virtue in peak MIPS ratings, and I am not sure there is, it is that
the ratings are calculated the same way for all architectures.) Since
a single floating point instruction can perform a multiply and an add,
the peak floating point performance is 40 MFLOPS. At present, a KSR1
system contains from eight to 1088 processors (giving a system-wide
peak of 43,520 MIPS and 43,520 MFLOPS) all sharing a common virtual
address space of one million megabytes.
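.pp
To make the peak-rate arithmetic explicit, here is a small C sketch
(illustrative only; every constant comes from the figures quoted
above, not from any KSR documentation) rederiving the per-processor
and system-wide peaks from the 50 ns clock and the dual-issue width.
Newlines come from puts so that no backslash escapes appear in this
troff source.
.(l L
#include <stdio.h>

int main(void)
{
    double cycle  = 50e-9;             /* seconds per clock (20 MHz)        */
    double mips   = 2.0 / cycle / 1e6; /* two instructions issued per clock */
    double mflops = 2.0 / cycle / 1e6; /* one multiply-add (2 flops)/clock  */

    printf("per processor: %.0f peak MIPS, %.0f peak MFLOPS", mips, mflops);
    puts("");
    printf("full SE:0 (32 procs): %.2f peak GFLOPS", 32 * mflops / 1e3);
    puts("");
    printf("full SE:1 (1088 procs): %.1f peak GFLOPS, %.1f GB of memory",
           1088 * mflops / 1e3, 1088 * 32.0 / 1e3);
    puts("");
    return 0;
}
.)l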
.pp
The processor is implemented as a four-chip set consisting of a
control unit and three co-processors, with all chips fabricated in 1.2
micron CMOS. Up to two instructions are issued on each clock cycle.
The floating point co-processor supports IEEE single and double
precision and includes linked triads similar to the multiply and add
instructions found in the Intel Paragon. The integer/logical
co-processor contains its own set of thirty-two 64-bit registers and
performs the usual arithmetic and logical operations. The final
co-processor provides a 32-MB/sec I/O channel at each processor. Each
processor board also contains a 256KB data cache and a 256KB
instruction cache. These caches are conventional in organization
though large in size, and should not be confused with the ALLCACHE
(main) memory discussed below.
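.pp
For concreteness, a linked triad is the familiar DAXPY-style
operation; the C loop below (a sketch, not vendor code) contains
exactly one multiply and one add per element, the pattern such an
instruction executes in a single operation:
.(l L
/* linked triad: y = y + a*x, one multiply and one add per element */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}
.)l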
.sh 3 "ALLCACHE Memory and the Ring of Rings"
.lp
Normally, caches are viewed as small temporary storage vehicles for
data, whose permanent copy resides in central memory. The KSR1 is
more complicated in this respect. It does have, at each processor,
standard instruction and data caches, as mentioned above. However,
these are just the first-level caches.
.i Instead
of having main memory to back up these first-level caches, the KSR1
has second-level caches, which are then backed up by
.i disks .
That is,
there is no central memory; all machine-resident data and instructions
are contained in one or more caches, which is why KSR uses the term
ALLCACHE memory. The data (as opposed to control) portion of the
second-level caches is implemented using the same DRAM technology
normally found in central memory. Thus, although they function as
caches, these structures have the capacity and performance of main memory.
.pp
When a (local, second-level) cache miss occurs on processor A,
the address is sent around the SE:0 ring. If the requested address
resides in B, another one of the processor/local-cache pairs on the same
SE:0 ring, B
forwards the cache line (a 128-byte unit, called a subpage by KSR) to A,
again using the (unidirectional) SE:0 ring. Depending on the access
performed, B may keep a copy of the subpage (thus sharing it with A) or
may cause all existing copies to be invalidated (thus giving A
exclusive access to the subpage). When the response arrives at A, it
is stored in the local cache, possibly evicting previously stored
data. (If this is the only copy of the old data, special actions are
taken not to evict it.) Measurements at Oak Ridge indicate a 6.7 microsecond
latency for their (32-processor) SE:0 ring.
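.pp
The protocol just described can be abstracted as in the C sketch
below (my own model, not KSR code). Note that because the request and
the response travel the same unidirectional ring, a satisfied request
always costs one full circuit, wherever B happens to sit:
.(l L
#define RING_SIZE 32

/* holds[i] is 1 if cell i caches the subpage; it stands in for
   the real directory lookup performed at each cell */
static int holds[RING_SIZE];

/* The request circulates from A; the responder forwards the subpage,
   which continues around the ring back to A. Returns total hops,
   or -1 if no cell on this SE:0 has the subpage. */
static int se0_lookup(int a)
{
    int cell = a, hops = 0;
    do {
        cell = (cell + 1) % RING_SIZE;        /* one unidirectional hop */
        hops++;
        if (holds[cell])
            return hops + (RING_SIZE - hops); /* always RING_SIZE */
    } while (cell != a);
    return -1;                        /* miss: escalate via the ARD */
}
.)l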
.pp
If the requested address resides in processor/local-cache C, which is
located on
.i another
SE:0 ring, the situation is more interesting. Each SE:0 includes an
ARD (ALLCACHE routing and directory cell), containing a large
directory with an entry for every subpage stored on the entire
SE:0.\**
.(f
\**Actually an entry for every page giving the state of every subpage.
.)f
If the ARD determines that the subpage is not contained in the current
ring, the request is sent
.q up
the hierarchy to the (unidirectional) SE:1 ring,
which is composed solely of ARDs, each essentially a copy of the ARD
.q below
it. When the request reaches the SE:1 ARD above the SE:0 ring
containing C, the request is sent down and traverses the ring to C, where
it is satisfied. The response from C continues on the SE:0 ring to
the ARD, goes back up, then around the SE:1 ring, down to the SE:0
ring containing A, and finally around this ring to A.
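.pp
A crude way to see the cost of this longer path: under the
simplifying assumption (mine, not KSR's) that each leg again costs a
full circuit of its ring, and taking the measured 6.7 microsecond
SE:0 figure as the cost of one 32-cell circuit, an inter-ring
reference carries roughly a three-fold penalty:
.(l L
#include <stdio.h>

#define SE0_CELLS 32          /* processors per SE:0          */
#define SE1_CELLS 34          /* ARDs on the SE:1             */
#define SE0_LATENCY_US 6.7    /* Oak Ridge, full SE:0 circuit */

int main(void)
{
    /* one circuit of A's SE:0, one of the SE:1, one of C's SE:0 */
    int hops = SE0_CELLS + SE1_CELLS + SE0_CELLS;
    printf("hops %d, crude latency estimate %.1f microseconds",
           hops, SE0_LATENCY_US * hops / SE0_CELLS);
    puts("");
    return 0;
}
.)l
This ignores the up and down transitions and any contention, so it is
at best a flavor of the answer to the >32 node question, not a
measurement.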
.pp
Another difference between the KSR1 caches and the more conventional
variety is size. These are BIG caches, 32MB per processor. Recall
that they replace the conventional main memory and hence are
implemented using dense DRAM technology.
.pp
The SE:0 bandwidth is 1 GB/sec. and the SE:1 bandwidth can be
configured to be 1, 2, or 4 GB/sec., with larger values more
appropriate for systems with many SE:0s (cf. the fat-trees used in the
CM-5). Readers interested in a performance comparison between ALLCACHE
and more conventional memory organizations should read [SJG92].
Another architecture using the ALLCACHE design is the Data Diffusion
Machine from the Swedish Institute of Computer Science [HHW90].
.sh 4 Software
.lp
The KSR operating system is an extension of the OSF/1 version of Unix.
As is often the case with shared-memory systems, the KSR operating
system runs on the KSR1 itself and not on an additional
.q host
system. The latter approach is normally used on message-passing
systems like the CM-5, in which case only a subset of the OS functions
runs directly on the main system. Using the terminology of [AG89], the
KSR operating system is symmetric, whereas the CM-5 uses a
master-slave approach. Processor allocation is performed dynamically
by the KSR operating system, i.e. the number of processors assigned to
a specific job varies with time.
.pp
A fairly rich software environment is supplied, including the X window
system with the Motif user interface; FORTRAN, C, and COBOL; the
ORACLE relational database management system; and AT&T's Tuxedo for
transaction processing.
.pp
A FORTRAN programmer may request automatic parallelization of his/her
program or may specify the parallelism explicitly; a C programmer has
only the latter option.
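.pp
For concreteness, here is what explicit parallelism in C might look
like, written against a POSIX-style threads interface. This is a
minimal sketch of the programming model only; the exact threads
library shipped with the KSR1 may differ.
.(l L
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double x[N], part[NTHREADS];

static void *partial_sum(void *arg)    /* each thread sums one slice */
{
    int id = *(int *)arg;
    int i, lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (i = lo; i < hi; i++)
        part[id] += x[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS], i;
    double total = 0.0;

    for (i = 0; i < N; i++)
        x[i] = 1.0;
    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, partial_sum, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += part[i];              /* combine after join: no races */
    }
    printf("total = %.0f", total);
    puts("");
    return 0;
}
.)l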
.sh 2 "The TMC Connection Machine CM-5"
.lp
[Omitted to save space--question was about KSR]
.sh 2 "The Intel Paragon XP/S"
.lp
[Omitted to save space--question was about KSR]
.sh 2 "The MasPar MP-1"
.lp
[Omitted to save space--question was about KSR]