Newsgroups: comp.parallel
Path: sparky!uunet!gatech!hubcap!fpst
From: gottlieb@allan.ultra.nyu.edu (Allan Gottlieb)
Subject: Re: Kendall Square machine with >32 nodes
In-Reply-To: lenos@tardis's message of Sun, 13 Dec 1992 00:37:15 GMT
Message-ID: <1992Dec15.134517.7504@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Nntp-Posting-Host: allan.ultra.nyu.edu
Organization: New York University, Ultracomputer project
References: <lenos.724207035@tardis.union.edu>
Date: 14 Dec 92 12:13:04
Approved: parallel@hubcap.clemson.edu
Lines: 227

In article <lenos.724207035@tardis.union.edu> lenos@tardis (Scott Leno) writes:

I have heard that Cornell and NCSC have KSR machines with 128 and 64 nodes
each. What I was wondering was how they do when they have to reference nodes
on another ring of 32. I have seen info on how a program does on a single
ring of 32 nodes, but nothing about how it does when it is spread over more
than one ring of nodes (i.e. >32 procs). Thanks for any info you might have.
peace,
Scott

Here is the KSR part of a paper I presented at PACTA'92 in Barcelona
this September.

.\" New Century Schoolbook fonts
.fp 1 NR \" normal
.fp 2 NI \" italic
.fp 3 NB \" bold
.sz 11
.nr pp 11
.nr ps 1v \" They want double space before paragraph
.nr sp 12
.nr fp 10
.pl 26c
.m1 1c
.m2 0
.m3 0
.m4 0
.ll 14c
.tp
.(l C
.sz +2
.b "Architectures for Parallel Supercomputing"
.sz -2
.sp .5c
Allan Gottlieb
.sp 1.5c
Ultracomputer Research Laboratory
New York University
715 Broadway, Tenth Floor
New York NY 10003 USA
.)l
.sp 1c
.sh 1 Introduction
.lp
In this talk, I will describe the architectures of new commercial
offerings from Kendall Square Research, Thinking Machines
Corporation, Intel Corporation, and the MasPar Computer Corporation.
These products span much of the currently active design space for
parallel supercomputers, including shared-memory and message-passing,
MIMD and SIMD, and processor sizes from a square millimeter to
hundreds of square centimeters. However, there is at least one
commercially important class omitted: the parallel vector
supercomputers, whose death at the hands of the highly parallel
invaders has been greatly exaggerated (shades of Mark Twain). Another
premature death notice may have been given to FORTRAN, since all these
machines speak (or rather understand) this language\*-but that is
another talk.
.sh 1 "New Commercial Offerings"
.lp
I will describe the architectures of four new commercial offerings:
The shared-memory MIMD KSR1 from Kendall Square Research; two
message-passing MIMD computers, the Connection Machine CM-5 from
Thinking Machines Corporation and the Paragon XP/S from Intel
Corporation; and the SIMD MP-1 from the MasPar Computer Corporation.
Much of this section is adapted from material prepared for the
forthcoming second edition of
.i "Highly Parallel Computing" ,
a book I co-author with George Almasi from IBM's T.J. Watson Research
Center.
.sh 2 "The Kendall Square Research KSR1"
.lp
The KSR1 is a shared-memory MIMD computer with private, consistent
caches, that is, each processor has its own cache and the system
hardware guarantees that the multiple caches are kept in agreement.
In this regard the design is similar to the MIT Alewife [ACDJ91] and the
Stanford Dash [LLSJ92]. There are, however, three significant differences
between the KSR1 and the two university designs. First, the Kendall
Square machine is a large-scale, commercial effort: the current design
supports 1088 processors and can be extended to tens of thousands.
Second, the KSR1 features an ALLCACHE memory, which we explain below.
Finally, the KSR1, like the Illinois Cedar [GKLS84], is a hierarchical
design: a small machine is a ring or
.q "Selection Engine"
of up to 32 processors (called an SE:0); to achieve
1088 processors, an SE:1 ring of 34 SE:0 rings is assembled. Larger
machines would use yet higher level rings. More information on the
KSR1 can be found in [Roth92].
.sh 3 Hardware
.lp
A 32-processor configuration (i.e. a full SE:0 ring) with 1 gigabyte
of memory and 10 gigabytes of disk requires 6 kilowatts of power and 2
square meters of floor space. This configuration has a peak
computational performance of 1.28 GFLOPS and a peak I/O bandwidth of
420 megabytes/sec. In a March 1992 posting to the comp.parallel
electronic newsgroup, Tom Dunigan reported that a 32-processor KSR1 at
the Oak Ridge National Laboratory attained 513 MFLOPS on the
1000\(mu1000 LINPACK benchmark. A full SE:1 ring with 1088 processors
equipped with 34.8 gigabytes of memory and 1 terabyte of disk would
require 150 kilowatts and 74 square meters. Such a system would have
a peak floating point performance of 43.5 GFLOPS and a peak I/O
bandwidth of 15.3 gigabytes/sec.
.pp
Each KSR1 processor is a superscalar 64-bit unit able to issue up to
two instructions every 50 ns, giving a peak performance rating of 40
MIPS. (KSR is more conservative and rates the processor as 20 MIPS
since only one of the two instructions issued can be computational,
but I feel that both instructions should be counted. If there is any
virtue in peak MIPS ratings, and I am not sure there is, it is that
the ratings are calculated the same way for all architectures.) Since
a single floating point instruction can perform a multiply and an add,
the peak floating point performance is 40 MFLOPS. At present, a KSR1
system contains from eight to 1088 processors (giving a system-wide
peak of 43,520 MIPS and 43,520 MFLOPS) all sharing a common virtual
address space of one million megabytes.
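.pp
To make the peak-rate arithmetic explicit, here is a small C sketch
(illustrative only; every constant comes from the figures quoted
above, not from any KSR documentation) rederiving the per-processor
and system-wide peaks from the 50 ns clock and the dual-issue width.
Newlines come from puts so that no backslash escapes appear in this
troff source.
.(l L
#include <stdio.h>

int main(void)
{
    double cycle  = 50e-9;             /* seconds per clock (20 MHz)        */
    double mips   = 2.0 / cycle / 1e6; /* two instructions issued per clock */
    double mflops = 2.0 / cycle / 1e6; /* one multiply-add (2 flops)/clock  */

    printf("per processor: %.0f peak MIPS, %.0f peak MFLOPS", mips, mflops);
    puts("");
    printf("full SE:0 (32 procs): %.2f peak GFLOPS", 32 * mflops / 1e3);
    puts("");
    printf("full SE:1 (1088 procs): %.1f peak GFLOPS, %.1f GB of memory",
           1088 * mflops / 1e3, 1088 * 32.0 / 1e3);
    puts("");
    return 0;
}
.)l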
.pp
The processor is implemented as a four-chip set consisting of a
control unit and three co-processors, with all chips fabricated in 1.2
micron CMOS. Up to two instructions are issued on each clock cycle.
The floating point co-processor supports IEEE single and double
precision and includes linked triads similar to the multiply and add
instructions found in the Intel Paragon. The integer/logical
co-processor contains its own set of thirty-two 64-bit registers and
performs the usual arithmetic and logical operations. The final
co-processor provides a 32-MB/sec I/O channel at each processor. Each
processor board also contains a 256KB data cache and a 256KB
instruction cache. These caches are conventional in organization
though large in size, and should not be confused with the ALLCACHE
(main) memory discussed below.
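.pp
For concreteness, a linked triad is the familiar DAXPY-style
operation; the C loop below (a sketch, not vendor code) contains
exactly one multiply and one add per element, the pattern such an
instruction executes in a single operation:
.(l L
/* linked triad: y = y + a*x, one multiply and one add per element */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}
.)l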
.sh 3 "ALLCACHE Memory and the Ring of Rings"
.lp
Normally, caches are viewed as small temporary storage vehicles for
data, whose permanent copy resides in central memory. The KSR1 is
more complicated in this respect. It does have, at each processor,
standard instruction and data caches, as mentioned above. However,
these are just the first-level caches.
.i Instead
of having main memory to back up these first-level caches, the KSR1
has second-level caches, which are then backed up by
.i disks .
That is,
there is no central memory; all machine-resident data and instructions
are contained in one or more caches, which is why KSR uses the term
ALLCACHE memory. The data (as opposed to control) portion of the
second-level caches is implemented using the same DRAM technology
normally found in central memory. Thus, although they function as
caches, these structures have the capacity and performance of main memory.
.pp
When a (local, second-level) cache miss occurs on processor A,
the address is sent around the SE:0 ring. If the requested address
resides in B, another one of the processor/local-cache pairs on the same
SE:0 ring, B
forwards the cache line (a 128-byte unit, called a subpage by KSR) to A,
again using the (unidirectional) SE:0 ring. Depending on the access
performed, B may keep a copy of the subpage (thus sharing it with A) or
may cause all existing copies to be invalidated (thus giving A
exclusive access to the subpage). When the response arrives at A, it
is stored in the local cache, possibly evicting previously stored
data. (If this is the only copy of the old data, special actions are
taken not to evict it.) Measurements at Oak Ridge indicate a 6.7 microsecond
latency for their (32-processor) SE:0 ring.
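.pp
The protocol just described can be abstracted as in the C sketch
below (my own model, not KSR code). Note that because the request and
the response travel the same unidirectional ring, a satisfied request
always costs one full circuit, wherever B happens to sit:
.(l L
#define RING_SIZE 32

/* holds[i] is 1 if cell i caches the subpage; it stands in for
   the real directory lookup performed at each cell */
static int holds[RING_SIZE];

/* The request circulates from A; the responder forwards the subpage,
   which continues around the ring back to A. Returns total hops,
   or -1 if no cell on this SE:0 has the subpage. */
static int se0_lookup(int a)
{
    int cell = a, hops = 0;
    do {
        cell = (cell + 1) % RING_SIZE;        /* one unidirectional hop */
        hops++;
        if (holds[cell])
            return hops + (RING_SIZE - hops); /* always RING_SIZE */
    } while (cell != a);
    return -1;                        /* miss: escalate via the ARD */
}
.)l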
.pp
If the requested address resides in processor/local-cache C, which is
located on
.i another
SE:0 ring, the situation is more interesting. Each SE:0 includes an
ARD (ALLCACHE routing and directory cell), containing a large
directory with an entry for every subpage stored on the entire
SE:0.\**
.(f
\**Actually an entry for every page giving the state of every subpage.
.)f
If the ARD determines that the subpage is not contained in the current
ring, the request is sent
.q up
the hierarchy to the (unidirectional) SE:1 ring,
which is composed solely of ARDs, each essentially a copy of the ARD
.q below
it. When the request reaches the SE:1 ARD above the SE:0 ring
containing C, the request is sent down and traverses the ring to C, where
it is satisfied. The response from C continues on the SE:0 ring to
the ARD, goes back up, then around the SE:1 ring, down to the SE:0
ring containing A, and finally around this ring to A.
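.pp
A crude way to see the cost of this longer path: under the
simplifying assumption (mine, not KSR's) that each leg again costs a
full circuit of its ring, and taking the measured 6.7 microsecond
SE:0 figure as the cost of one 32-cell circuit, an inter-ring
reference carries roughly a three-fold penalty:
.(l L
#include <stdio.h>

#define SE0_CELLS 32          /* processors per SE:0          */
#define SE1_CELLS 34          /* ARDs on the SE:1             */
#define SE0_LATENCY_US 6.7    /* Oak Ridge, full SE:0 circuit */

int main(void)
{
    /* one circuit of A's SE:0, one of the SE:1, one of C's SE:0 */
    int hops = SE0_CELLS + SE1_CELLS + SE0_CELLS;
    printf("hops %d, crude latency estimate %.1f microseconds",
           hops, SE0_LATENCY_US * hops / SE0_CELLS);
    puts("");
    return 0;
}
.)l
This ignores the up and down transitions and any contention, so it is
at best a flavor of the answer to the >32 node question, not a
measurement.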
.pp
Another difference between the KSR1 caches and the more conventional
variety is size. These are BIG caches, 32MB per processor. Recall
that they replace the conventional main memory and hence are
implemented using dense DRAM technology.
.pp
The SE:0 bandwidth is 1 GB/sec. and the SE:1 bandwidth can be
configured to be 1, 2, or 4 GB/sec., with larger values more
appropriate for systems with many SE:0s (cf. the fat-trees used in the
CM-5). Readers interested in a performance comparison between ALLCACHE
and more conventional memory organizations should read [SJG92].
Another architecture using the ALLCACHE design is the Data Diffusion
Machine from the Swedish Institute of Computer Science [HHW90].
.sh 4 Software
.lp
The KSR operating system is an extension of the OSF/1 version of Unix.
As is often the case with shared-memory systems, the KSR operating
system runs on the KSR1 itself and not on an additional
.q host
system. The latter approach is normally used on message-passing
systems like the CM-5, in which case only a subset of the OS functions
runs directly on the main system. Using the terminology of [AG89], the
KSR operating system is symmetric, whereas the CM-5 uses a
master-slave approach. Processor allocation is performed dynamically
by the KSR operating system, i.e. the number of processors assigned to
a specific job varies with time.
.pp
A fairly rich software environment is supplied, including the X window
system with the Motif user interface; FORTRAN, C, and COBOL; the
ORACLE relational database management system; and AT&T's Tuxedo for
transaction processing.
.pp
A FORTRAN programmer may request automatic parallelization of his/her
program or may specify the parallelism explicitly; a C programmer has
only the latter option.
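.pp
For concreteness, here is what explicit parallelism in C might look
like, written against a POSIX-style threads interface. This is a
minimal sketch of the programming model only; the exact threads
library shipped with the KSR1 may differ.
.(l L
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double x[N], part[NTHREADS];

static void *partial_sum(void *arg)    /* each thread sums one slice */
{
    int id = *(int *)arg;
    int i, lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (i = lo; i < hi; i++)
        part[id] += x[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS], i;
    double total = 0.0;

    for (i = 0; i < N; i++)
        x[i] = 1.0;
    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, partial_sum, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += part[i];              /* combine after join: no races */
    }
    printf("total = %.0f", total);
    puts("");
    return 0;
}
.)l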
.sh 2 "The TMC Connection Machine CM-5"
.lp
[Omitted to save space--question was about KSR]
.sh 2 "The Intel Paragon XP/S"
.lp
[Omitted to save space--question was about KSR]
.sh 2 "The MasPar MP-1"
.lp
[Omitted to save space--question was about KSR]