- Newsgroups: comp.parallel
- Path: sparky!uunet!usc!cs.utexas.edu!sdd.hp.com!ncr-sd!ncrcae!hubcap!fpst
- From: gottlieb@allan.ultra.nyu.edu (Allan Gottlieb)
- Subject: Info on some new parallel machines
- Message-ID: <1992Dec18.175429.28010@hubcap.clemson.edu>
- Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
- Nntp-Posting-Host: allan.ultra.nyu.edu
- Organization: New York University, Ultracomputer project
- Date: 18 Dec 92 12:31:15
- Approved: parallel@hubcap.clemson.edu
- Lines: 627
-
- A week or two ago, in response to a request for information on KSR,
- I posted the KSR section of a paper I presented at PACTA '92 in
- Barcelona in September. I received a number of requests for a posting
- of the entire paper, which I duly sent. Unfortunately, it seems to have
- disappeared somewhere between here and Clemson, so I am trying again.
- I doubt that anyone will get this twice, but if so, please let me know
- and accept my apologies.
-
- Allan Gottlieb
-
- .\" Format via
- .\" troff -me filename
- .\" New Century Schoolbook fonts
- .\" Delete next three lines if you don't have the font
- .fp 1 NR \" normal
- .fp 2 NI \" italic
- .fp 3 NB \" bold
- .sz 11
- .nr pp 11
- .nr ps 1v .\" They want double space before paragraph
- .nr sp 12
- .nr fp 10
- .pl 26c
- .m1 1c
- .m2 0
- .m3 0
- .m4 0
- .ll 14c
- .tp
- .(l C
- .sz +2
- .b "Architectures for Parallel Supercomputing
- .sz -2
- .sp .5c
- Allan Gottlieb
- .sp 1.5c
- Ultracomputer Research Laboratory
- New York University
- 715 Broadway, Tenth Floor
- New York NY 10003 USA
- .)l
- .sp 1c
- .sh 1 Introduction
- .lp
- In this talk, I will describe the architectures of new commercial
- offerings from Kendall Square Research, Thinking Machines
- Corporation, Intel Corporation, and the MasPar Computer Corporation.
- These products span much of the currently active design space for
- parallel supercomputers, including shared-memory and message-passing,
- MIMD and SIMD, and processor sizes from a square millimeter to
- hundreds of square centimeters. However, there is at least one
- commercially important class omitted: the parallel vector
- supercomputers, whose death at the hands of the highly parallel
- invaders has been greatly exaggerated (shades of Mark Twain). Another
- premature death notice may have been given to FORTRAN since all these
- machines speak (or rather understand) this language\*-but that is
- another talk.
- .sh 1 "New Commercial Offerings"
- .lp
- I will describe the architectures of four new commercial offerings:
- The shared-memory MIMD KSR1 from Kendall Square Research; two
- message-passing MIMD computers, the Connection Machine CM-5 from
- Thinking Machines Corporation and the Paragon XP/S from Intel
- Corporation; and the SIMD MP-1 from the MasPar Computer Corporation.
- Much of this section is adapted from material prepared for the
- forthcoming second edition of
- .i "Highly Parallel Computing" ,
- a book I co-authored with George Almasi from IBM's T.J. Watson Research
- Center.
- .sh 2 "The Kendall Square Research KSR1"
- .lp
- The KSR1 is a shared-memory MIMD computer with private, consistent
- caches, that is, each processor has its own cache and the system
- hardware guarantees that the multiple caches are kept in agreement.
- In this regard the design is similar to the MIT Alewife [ACDJ91] and the
- Stanford Dash [LLSJ92]. There are, however, three significant differences
- between the KSR1 and the two University designs. First, the Kendall
- Square machine is a large-scale, commercial effort: the current design
- supports 1088 processors and can be extended to tens of thousands.
- Second, the KSR1 features an ALLCACHE memory, which we explain below.
- Finally, the KSR1, like the Illinois Cedar [GKLS84], is a hierarchical
- design: a small machine is a ring or
- .q "Selection Engine"
- of up to 32 processors (called an SE:0); to achieve
- 1088 processors, an SE:1 ring of 34 SE:0 rings is assembled. Larger
- machines would use yet higher level rings. More information on the
- KSR1 can be found in [Roth92].
- .sh 3 Hardware
- .lp
- A 32-processor configuration (i.e. a full SE:0 ring) with 1 gigabyte
- of memory and 10 gigabytes of disk requires 6 kilowatts of power and 2
- square meters of floor space. This configuration has a peak
- computational performance of 1.28 GFLOPS and a peak I/O bandwidth of
- 420 megabytes/sec. In a March 1992 posting to the comp.parallel
- electronic newsgroup, Tom Dunigan reported that a 32-processor KSR1 at
- the Oak Ridge National Laboratory attained 513 MFLOPS on the
- 1000\(mu1000 LINPACK benchmark. A full SE:1 ring with 1088 processors
- equipped with 34.8 gigabytes of memory and 1 terabyte of disk would
- require 150 kilowatts and 74 square meters. Such a system would have
- a peak floating point performance of 43.5 GFLOPS and a peak I/O
- bandwidth of 15.3 gigabytes/sec.
- .pp
- Each KSR1 processor is a superscalar 64-bit unit able to issue up to
- two instructions every 50ns., giving a peak performance rating of 40
- MIPS. (KSR is more conservative and rates the processor as 20 MIPS
- since only one of the two instructions issued can be computational, but
- I feel that both instructions should be counted. If there is any
- virtue in peak MIPS ratings, and I am not sure there is, it is that
- the ratings are calculated the same way for all architectures.) Since
- a single floating point instruction can perform a multiply and an add,
- the peak floating point performance is 40 MFLOPS. At present, a KSR1
- system contains from eight to 1088 processors (giving a system-wide
- peak of 43,520 MIPS and 43,520 MFLOPS) all sharing a common virtual
- address space of one million megabytes.
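- .pp
- The peak figures quoted above are simple arithmetic from the cycle
- time and the issue width. The short C sketch below is mine, not KSR's,
- and just redoes the multiplications.
- .(l
- .eo
- .ft CW
- /* Back-of-the-envelope check of the KSR1 peak rates quoted above. */
- #include <stdio.h>
-
- int main(void)
- {
-     double cycle  = 50e-9;            /* 50 ns clock                        */
-     double mips   = 2 / cycle / 1e6;  /* two instructions per cycle         */
-     double mflops = 2 / cycle / 1e6;  /* one FP instruction: multiply + add */
-     int    procs  = 1088;             /* full SE:1 ring: 34 rings of 32     */
-
-     printf("per processor:   %.0f MIPS, %.0f MFLOPS\n", mips, mflops);
-     printf("1088 processors: %.0f MIPS, %.1f GFLOPS\n",
-            procs * mips, procs * mflops / 1000);
-     printf("memory: %.1f GB at 32 MB per processor\n", procs * 32 / 1000.0);
-     return 0;
- }
- .ft P
- .ec
- .)l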
- .pp
- The processor is implemented as a four-chip set consisting of a
- control unit and three co-processors, with all chips fabricated in 1.2
- micron CMOS. Up to two instructions are issued on each clock cycle.
- The floating point co-processor supports IEEE single and double
- precision and includes linked triads similar to the multiply and add
- instructions found in the Intel Paragon. The integer/logical
- co-processor contains its own set of thirty-two 64-bit registers and
- performs the usual arithmetic and logical operations. The final
- co-processor provides a 32-MB/sec I/O channel at each processor. Each
- processor board also contains a 256KB data cache and a 256KB
- instruction cache. These caches are conventional in organization
- though large in size, and should not be confused with the ALLCACHE
- (main) memory discussed below.
- .sh 3 "ALLCACHE Memory and the Ring of Rings"
- .lp
- Normally, caches are viewed as small temporary storage vehicles for
- data, whose permanent copy resides in central memory. The KSR1 is
- more complicated in this respect. It does have, at each processor,
- standard instruction and data caches, as mentioned above. However,
- these are just the first-level caches.
- .i Instead
- of having main memory to back up these first-level caches, the KSR1
- has second-level caches, which are then backed up by
- .i disks .
- That is,
- there is no central memory; all machine resident data and instructions
- are contained in one or more caches, which is why KSR uses the term
- ALLCACHE memory. The data (as opposed to control) portion of the
- second-level caches is implemented using the same DRAM technology
- normally found in central memory. Thus, although they function as
- caches, these structures have the capacity and performance of main memory.
- .pp
- When a (local, second-level) cache miss occurs on processor A,
- the address is sent around the SE:0 ring. If the requested address
- resides in B, another one of the processor/local-cache pairs on the same
- SE:0 ring, B
- forwards the cache line (a 128-byte unit, called a subpage by KSR) to A
- again using the (unidirectional) SE:0 ring. Depending on the access
- performed, B may keep a copy of the subpage (thus sharing it with A) or
- may cause all existing copies to be invalidated (thus giving A
- exclusive access to the subpage). When the response arrives at A, it
- is stored in the local cache, possibly evicting previously stored
- data. (If this is the only copy of the old data, special actions are
- taken not to evict it.) Measurements at Oak Ridge indicate a 6.7 microsecond
- latency for their (32-processor) SE:0 ring.
- .pp
- If the requested address resides in processor/local-cache C, which is
- located on
- .i another
- SE:0 ring, the situation is more interesting. Each SE:0 includes an
- ARD (ALLCACHE routing and directory cell), containing a large
- directory with an entry for every subpage stored on the entire
- SE:0.\**
- .(f
- \**Actually, the directory holds an entry for every page, giving the
- state of each of its subpages.
- .)f
- If the ARD determines that the subpage is not contained in the current
- ring, the request is sent
- .q up
- the hierarchy to the (unidirectional) SE:1 ring,
- which is composed solely of ARDs, each essentially a copy of the ARD
- .q below
- it. When the request reaches the SE:1 ARD above the SE:0 ring
- containing C, the request is sent down and traverses the ring to C, where
- it is satisfied. The response from C continues on the SE:0 ring to
- the ARD, goes back up, then around the SE:1 ring, down to the SE:0
- ring containing A, and finally around this ring to A.
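- .pp
- To make this two-level search concrete, here is a toy C model of the
- lookup path. It is my own illustration, not KSR's implementation:
- subpage states, invalidation, and eviction are ignored, and all the
- names are invented.
- .(l
- .eo
- .ft CW
- /* Toy model of the ALLCACHE two-level lookup.  Subpage states,
-    invalidation, and eviction are ignored; all names are invented. */
- #include <stdio.h>
-
- #define RINGS    34          /* SE:0 rings hanging off the SE:1 ring  */
- #define PROCS    32          /* processor/local-cache pairs per SE:0  */
- #define SUBPAGES 4096        /* toy address space, in 128-byte units  */
-
- /* owner[r][p][s] != 0 means local cache p on ring r holds subpage s. */
- static char owner[RINGS][PROCS][SUBPAGES];
-
- /* The ARD of ring r: does any cache on ring r hold subpage s? */
- static int ard_has(int r, int s)
- {
-     int p;
-     for (p = 0; p < PROCS; p++)
-         if (owner[r][p][s])
-             return 1;
-     return 0;
- }
-
- /* A request from a processor on ring 'home': first around the local
-    SE:0 ring, then up to the SE:1 ring of ARD copies, then down into
-    and around the owning SE:0 ring. */
- static void lookup(int home, int s)
- {
-     int r, p;
-
-     if (ard_has(home, s)) {
-         printf("subpage %d satisfied on local ring %d\n", s, home);
-         return;
-     }
-     for (r = 0; r < RINGS; r++)          /* consult the SE:1-level ARDs  */
-         if (r != home && ard_has(r, s)) {
-             for (p = 0; p < PROCS; p++)  /* descend into the owning ring */
-                 if (owner[r][p][s])
-                     printf("subpage %d found in cache %d on ring %d\n",
-                            s, p, r);
-             return;
-         }
-     printf("subpage %d not resident; it must come from disk\n", s);
- }
-
- int main(void)
- {
-     owner[7][3][42] = 1;
-     lookup(0, 42);          /* remote hit, by way of the SE:1 ring */
-     lookup(7, 42);          /* hit on the local SE:0 ring          */
-     lookup(0, 99);          /* not resident anywhere               */
-     return 0;
- }
- .ft P
- .ec
- .)l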
- .pp
- Another difference between the KSR1 caches and the more conventional
- variety is size. These are BIG caches, 32MB per processor. Recall
- that they replace the conventional main memory and hence are
- implemented using dense DRAM technology.
- .pp
- The SE:0 bandwidth is 1 GB/sec. and the SE:1 bandwidth can be
- configured to be 1, 2, or 4 GB/sec., with larger values more
- appropriate for systems with many SE:0s (cf. the fat-trees used in the
- CM-5). Readers interested in a performance comparison between ALLCACHE
- and more conventional memory organizations should read [SJG92].
- Another architecture using the ALLCACHE design is the Data Diffusion
- Machine from the Swedish Institute of Computer Science [HHW90].
- .sh 4 Software
- .lp
- The KSR operating system is an extension of the OSF/1 version of Unix.
- As is often the case with shared-memory systems, the KSR operating
- system runs on the KSR1 itself and not on an additional
- .q host
- system. The latter approach is normally used on message-passing
- systems like the CM-5, in which case only a subset of the OS functions
- run directly on the main system. Using the terminology of [AG89], the
- KSR operating system is symmetric, whereas the CM-5 uses a
- master-slave approach. Processor allocation is performed dynamically
- by the KSR operating system, i.e. the number of processors assigned to
- a specific job varies with time.
- .pp
- A fairly rich software environment is supplied including the X window
- system with the Motif user interface; FORTRAN, C, and COBOL; the
- ORACLE relational database management system; and AT&T's Tuxedo for
- transaction processing.
- .pp
- A FORTRAN programmer may request automatic parallelization of his/her
- program or may specify the parallelism explicitly; a C programmer has
- only the latter option.
- .sh 2 "The TMC Connection Machine CM-5"
- .lp
- Thinking Machines Corporation has become well known for its SIMD
- Connection Machines CM-1 and CM-2. Somewhat
- surprisingly, its next offering, the CM-5, has moved into the MIMD world
- (although, as we shall see, there is still hardware support for a
- synchronous style of programming). Readers seeking additional
- information should consult [TMC91].
- .sh 3 Architecture
- .lp
- At the very coarsest level of detail, the CM-5 is simply a
- message-passing MIMD machine, another descendant of the Caltech Cosmic
- Cube [Seit85]. But such a description leaves out a great deal. The
- interconnection topology is a fat tree, there is support for SIMD, a
- combining control network is provided, vector units are available, and
- the machine is powerful. We discuss each of these in turn.
- .pp
- A fat tree is a binary tree in which links higher in the tree have
- greater bandwidth (e.g. one can keep the clock constant and use wider
- busses near the root). Unlike hypercube machines such as CM-1 and
- CM-2, a node in the CM-5 has a constant number of nearest neighbors
- independent of the size of the machine. In addition, the bandwidth
- available per processor for random communication patterns remains
- constant as the machine size increases, whereas this bandwidth
- decreases for meshes (or non-fat trees). Local communication is
- favored by the CM-5 but by only a factor of 4 over random
- communication (20MB/sec vs. 5MB/sec), which is much less than in other
- machines such as CM-2. Also attached to this fat tree are I/O
- interfaces. The device side of these interfaces can support 20MB/sec;
- higher speed devices are accommodated by ganging together multiple
- interfaces. (If the destination node for the I/O is far from the
- interface, the sustainable bandwidth is also limited by the fat
- tree to 5MB/sec.)
- .pp
- The fat tree just discussed is actually one of three networks on the
- CM-5. In addition to this
- .q "data network" ,
- there is a diagnostic network used for fault detection and a control
- network that we turn to next.
- One function of the control network is to provide rapid
- synchronization of the processors, which is accomplished by a
- global OR operation that completes shortly after the last
- participating processor sets its value. This
- .q "cheap barrier"
- permits the main advantage of SIMD (permanent synchrony implying no
- race conditions) without requiring that the processors always execute
- the same instruction.
- .pp
- A second function of the control network is to provide a form of
- hardware combining, specifically to support reduction and parallel
- prefix calculations. A parallel prefix computation for a given binary
- operator \(*f (say addition) begins with each processor specifying a
- value and ends with each processor obtaining the sum of the values
- provided by itself and all lower-numbered processors. These parallel
- prefix computations may be viewed as the synchronous, and hence
- deterministic, analogue of the fetch-and-phi operation found in the
- NYU Ultracomputer [GGKM83]. The CM-5 supports addition, maximum,
- logical OR, and XOR. Two variants are also supplied: a parallel
- suffix and a segmented parallel prefix (and suffix). With a segmented
- operation (think of worms, not virtual memory, and see [SCHW80]), each
- processor can set a flag indicating that it begins a segment and the
- prefix computation is done separately for each segment. Reduction
- operations are similar: each processor supplies a value and all
- processors obtain the sum of all values (again max, OR, and XOR are
- supported as well).
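- .pp
- For readers who have not met these operations, the following sequential
- C sketch (an illustration of mine, not TMC code) gives the semantics of
- a segmented parallel prefix sum; on the CM-5 the control network
- produces all the prefix values at once, but the values computed are the
- same.
- .(l
- .eo
- .ft CW
- /* Semantics of a segmented parallel prefix (+-scan): processor i
-    contributes value[i]; a nonzero begins[i] marks the start of a
-    segment; prefix[i] receives the sum of value[j] for all j <= i in
-    i's segment.  Sequential reference code only. */
- #include <stdio.h>
-
- void segmented_prefix_sum(const int *value, const int *begins,
-                           int *prefix, int n)
- {
-     int i, running = 0;
-
-     for (i = 0; i < n; i++) {
-         if (begins[i])
-             running = 0;          /* a new segment restarts the scan */
-         running += value[i];
-         prefix[i] = running;
-     }
- }
-
- int main(void)
- {
-     int value[8]  = { 3, 1, 4, 1, 5, 9, 2, 6 };
-     int begins[8] = { 1, 0, 0, 1, 0, 0, 1, 0 };
-     int prefix[8], i;
-
-     segmented_prefix_sum(value, begins, prefix, 8);
-     for (i = 0; i < 8; i++)
-         printf("%d ", prefix[i]);   /* prints 3 4 8 1 6 15 2 8 */
-     printf("\n");
-     return 0;
- }
- .ft P
- .ec
- .)l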
- .pp
- Each node of a CM-5 contains a SPARC microprocessor for scalar
- operations (users are advised against coding in assembler, a hint that
- the engine may change), a 64KB cache, and up to 32 MB of local memory.
- Memory is accessed 64 bits at a time (plus 8 bits for ECC). An option
- available with the CM-5 is the incorporation of 4 vector units in
- between each processor and its associated memory. When the vector
- units are installed, memory is organized as four 8 MB banks, one
- connected to each unit. Each vector unit can perform both
- floating-point and integer operations, either one at a peak rate of 32
- mega 64-bit operations per second.
- .pp
- As mentioned above, the CM-5 is quite a powerful computer. With the
- vector units present, each node has a peak performance of 128 64-bit
- MFLOPS or 128 64-bit integer MOPS. The machine is designed for a
- maximum of 256K nodes but the current implementation is
- .q "limited"
- to 16K due to restrictions on cable lengths. Since the peak
- computational rate for a 16K node system exceeds 2 teraflops, one might
- assert that the age of (peak)
- .q "teraflop computing"
- has arrived. However, as I write this in May 1992, the largest
- announced delivery of a CM-5 is a 1K node configuration without vector
- units. A full 16K system would cost about one-half billion U.S.
- dollars.
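- .pp
- The 2-teraflop figure quoted above is just the product of the per-unit
- rate, the number of vector units per node, and the node count; the
- small sketch below (mine) checks it.
- .(l
- .eo
- .ft CW
- /* Back-of-the-envelope check of the CM-5 peak figures quoted above. */
- #include <stdio.h>
-
- int main(void)
- {
-     double per_unit = 32e6;           /* 32 M 64-bit operations/sec per vector unit */
-     double per_node = 4 * per_unit;   /* four vector units per node                 */
-     double nodes    = 16 * 1024;      /* current 16K-node limit                     */
-
-     printf("per node:  %.0f MFLOPS\n", per_node / 1e6);
-     printf("16K nodes: %.2f peak TFLOPS\n", nodes * per_node / 1e12);
-     return 0;
- }
- .ft P
- .ec
- .)l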
- .sh 3 "Software and Environment"
- .lp
- In addition to the possibly thousands of computation nodes just
- described, a CM-5 contains a few control processors that act as hosts
- into which users log in. The reason for multiple control processors is
- that the system administrator can divide the CM-5 into partitions,
- each with an individual control processor as host. The host provides
- a conventional
- .sm UNIX -like
- operating system; in particular users can timeshare a single
- partition. Each computation node runs an operating system microkernel
- supporting a subset of the full functionality available on the control
- processor acting as its host (a master-slave approach; see [AG89]).
- .pp
- Parallel versions of Fortran, C, and Lisp are provided. CM Fortran is
- a mild extension of Fortran 90. Additional features include a
- \f(CWforall\fP statement and vector-valued subscripts. For an example
- of the latter, assume that \f(CWA\fP and \f(CWP\fP are vectors of size
- 20 with all \f(CWP(I)\fP between 1 and 20, then \f(CWA=A(P)\fP does
- the 20 parallel assignments \f(CWA(I)=A(P(I))\fP.
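- .pp
- In C terms the vector-valued subscript is a gather. A sketch of the
- semantics of \f(CWA=A(P)\fP follows, with the usual shift from
- Fortran's 1-based subscripts to C's 0-based ones and with a temporary
- to capture the rule that every right-hand side is read before any
- element is stored; it is an illustration of the semantics, not TMC code.
- .(l
- .eo
- .ft CW
- /* Semantics of the CM Fortran statement A = A(P): a gather of A through
-    the index vector P.  The temporary captures the rule that every
-    right-hand side is read before any element is stored. */
- #include <stdio.h>
- #include <string.h>
-
- #define N 20
-
- void gather_in_place(double a[N], const int p[N])
- {
-     double t[N];
-     int i;
-
-     for (i = 0; i < N; i++)
-         t[i] = a[p[i] - 1];         /* P(I) is 1-based in the text above */
-     memcpy(a, t, sizeof t);
- }
-
- int main(void)
- {
-     double a[N];
-     int    p[N], i;
-
-     for (i = 0; i < N; i++) {
-         a[i] = i + 1;               /* A = 1, 2, ..., 20     */
-         p[i] = N - i;               /* P reverses the vector */
-     }
-     gather_in_place(a, p);
-     for (i = 0; i < N; i++)
-         printf("%g ", a[i]);        /* prints 20 19 ... 1 */
-     printf("\n");
-     return 0;
- }
- .ft P
- .ec
- .)l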
- .pp
- An important contribution is the CM Scientific Software Library, a
- growing set of numerical routines hand-tailored to exploit the CM-5
- hardware. Although primarily intended for the CM Fortran user, the
- library is also usable from TMC's versions of C and Lisp, C* and
- *Lisp. To date the library developers have concentrated on linear
- algebra, FFTs, random number generators, and statistical analyses.
- .pp
- In addition to supporting the data parallel model of computing
- typified by Fortran 90, the CM-5 also supports synchronous (i.e.
- blocking) message passing in which the sender does not proceed until
- its message is received. (This is the rendezvous model used in
- Ada and CSP.) Limited support for asynchronous message passing is
- provided and further support is expected.
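- .pp
- The blocking rule is easy to state in code. The sketch below
- illustrates rendezvous semantics between two threads using POSIX
- threads; it is only an illustration of the model, not the CM-5
- message-passing library, and all the names are mine.
- .(l
- .eo
- .ft CW
- /* Minimal sketch of rendezvous (blocking) message passing between two
-    threads, written with POSIX threads purely for illustration. */
- #include <pthread.h>
- #include <stdio.h>
-
- static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
- static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
- static int mailbox;
- static int full = 0;
-
- void send_msg(int msg)              /* returns only after msg is received */
- {
-     pthread_mutex_lock(&lock);
-     mailbox = msg;
-     full = 1;
-     pthread_cond_broadcast(&cv);
-     while (full)                    /* wait for the rendezvous to complete */
-         pthread_cond_wait(&cv, &lock);
-     pthread_mutex_unlock(&lock);
- }
-
- int recv_msg(void)
- {
-     int msg;
-
-     pthread_mutex_lock(&lock);
-     while (!full)
-         pthread_cond_wait(&cv, &lock);
-     msg = mailbox;
-     full = 0;
-     pthread_cond_broadcast(&cv);    /* release the blocked sender */
-     pthread_mutex_unlock(&lock);
-     return msg;
- }
-
- static void *receiver(void *arg)
- {
-     printf("received %d\n", recv_msg());
-     return arg;
- }
-
- int main(void)
- {
-     pthread_t t;
-
-     pthread_create(&t, NULL, receiver, NULL);
-     send_msg(42);                   /* blocks until the receiver takes 42 */
-     pthread_join(t, NULL);
-     return 0;
- }
- .ft P
- .ec
- .)l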
- .sh 2 "The Intel Paragon XP/S"
- .lp
- The Intel Paragon XP/S Supercomputer [Inte91] is powered by a
- collection of up to 4096 Intel i860 XP processors and can be
- configured to provide peak performance ranging from 5 to 300 GFLOPS
- (64-bit). The processing nodes are connected in a rectangular mesh
- pattern, unlike the hypercube connection pattern used in the earlier
- Intel iPSC/860.
- .pp
- The i860 XP node processor chip (2.5 million transistors)
- has a peak performance of 75 MFLOPS (64-bit)
- and 42 MIPS when operating at 50 MHz.
- The chip contains 16KByte data and instruction caches,
- and can issue a multiply and add instruction in one cycle
- [DS90].
- The maximum bandwidth from cache to floating point unit is
- 800 MBytes/s.
- Communication bandwidth
- between any two nodes is 200 MByte/sec
- full duplex. Each node also has 16-128 MBytes of memory and
- a second i860 XP processor devoted to
- communication.
- .pp
- The prototype for the Paragon, the Touchstone Delta, was installed at
- Caltech\** in 1991
- .(f
- \**^The machine is owned by the Concurrent Supercomputing Consortium,
- an alliance of universities, laboratories, federal agencies, and
- industry.
- .)f
- and immediately began to compete with the CM2 Connection Machine for
- the title of
- .q "world's fastest supercomputer" .
- The lead changed
- hands several times.\**
- .(f
- \**\^One point of reference is the 16 GFLOPS reported at the
- Supercomputing '91 conference for seismic modeling on the CM2
- [MS91].
- .)f
- .pp
- The Delta system consists of 576 nodes arranged in a mesh that has 16
- rows and 36 columns. Thirty-three of the columns form a computational
- array of 528 numeric nodes (computing nodes) that each contain an
- Intel i860 microprocessor and 16 MBytes of memory. This computational
- array is flanked on each side by a column of I/O nodes that each
- contain a 1.4 GByte disk (the number of disks is to be doubled later).
- The last column contains two HIPPI interfaces (100 Mbyte/sec each) and
- an assortment of tape, ethernet, and service nodes. Routing chips are
- used to provide internode communication with an internode speed of 25
- MByte/sec and a latency of 80 microseconds. The peak performance of
- the i860 processor is 60 MFLOPS (64-bit), which translates to a peak
- performance for the Delta of over 30 GFLOPS (64-bit).
- Achievable speeds in the range 1-15 GFLOPS have been claimed.
- Total memory is 8.4 GBytes; on-line disk capacity is 45 GBytes, to be
- increased to 90 GBytes.
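- .pp
- Once again the aggregate Delta figures follow directly from the
- per-node numbers; the sketch below (mine) simply redoes the
- multiplications.
- .(l
- .eo
- .ft CW
- /* Check of the Touchstone Delta aggregate figures from the per-node data. */
- #include <stdio.h>
-
- int main(void)
- {
-     int    numeric_nodes = 528;     /* 16 rows by 33 columns              */
-     int    io_nodes      = 32;      /* two flanking columns of 16         */
-     double node_mflops   = 60.0;    /* peak, 64-bit, per i860             */
-     double node_mem_mb   = 16.0;
-     double disk_gb       = 1.4;     /* one disk per I/O node, to be doubled */
-
-     printf("peak:   %.1f GFLOPS\n", numeric_nodes * node_mflops / 1000);
-     printf("memory: %.1f GBytes\n", numeric_nodes * node_mem_mb / 1000);
-     printf("disk:   %.0f GBytes (%.0f after doubling)\n",
-            io_nodes * disk_gb, 2 * io_nodes * disk_gb);
-     return 0;
- }
- .ft P
- .ec
- .)l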
- .pp
- The operating system being developed for the Delta consists of OSF/1
- with extensions for massively parallel systems. The extensions
- include a decomposition of OSF/1 into a pure Mach kernel (OSF/1 is
- based on Mach), and a modular server framework that can be used to
- provide distributed file, network, and process management service.
- .pp
- The system software for interprocess communication is compatible with
- that of the iPSC/860. The Express environment is also available.
- Language support includes Fortran and C.
- The Consortium intends to allocate 80% of the Delta's time for
- .q "Grand Challenge"
- problems (q.v.).
- .sh 2 "The MasPar MP-1"
- .lp
- Given the success of the CM1 and CM2, it is not surprising to see another
- manufacturer produce a machine in the same architectural class (SIMD, tiny
- processor). What perhaps
- .i "is"
- surprising is that Thinking Machines, with the new CM-5, has moved to an
- MIMD design. The MasPar Computer
- Corporation's MP-1 system, introduced in 1990, features an SIMD array of up
- to 16K 4-bit processors organized as a 2-dimensional array with each
- processor connected to its 8 nearest neighbors (i.e., the NEWS of CM1 plus
- the four diagonals). MasPar refers to this interconnection topology as the
- X-Net. The MP-1 also contains an array control unit that fetches and
- decodes instructions, computes addresses and other scalars, and sends
- control signals to the processor array.
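- .pp
- To picture the X-Net, the sketch below enumerates the eight neighbors
- of a PE, taking the full 16K machine to be a square 128\(mu128 array;
- whether the array edges wrap around is not stated above, so the
- \f(CWWRAP\fP flag is merely an assumption for the example.
- .(l
- .eo
- .ft CW
- /* Enumerate the eight X-Net neighbors (N, S, E, W plus the four
-    diagonals) of a PE, taking a full 16K MP-1 to be a square 128 x 128
-    array.  Whether the array edges wrap around is not stated in the
-    text, so WRAP is purely an assumption for this example. */
- #include <stdio.h>
-
- #define ROWS 128
- #define COLS 128
- #define WRAP 1
-
- void neighbors(int r, int c)
- {
-     int dr, dc;
-
-     for (dr = -1; dr <= 1; dr++)
-         for (dc = -1; dc <= 1; dc++) {
-             int nr = r + dr, nc = c + dc;
-             if (dr == 0 && dc == 0)
-                 continue;                   /* skip the PE itself */
-             if (WRAP) {
-                 nr = (nr + ROWS) % ROWS;    /* wrap around the edges */
-                 nc = (nc + COLS) % COLS;
-             } else if (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS)
-                 continue;                   /* fall off a non-wrapping edge */
-             printf("(%d,%d) ", nr, nc);
-         }
-     printf("\n");
- }
-
- int main(void)
- {
-     neighbors(0, 0);      /* corner PE   */
-     neighbors(64, 64);    /* interior PE */
-     return 0;
- }
- .ft P
- .ec
- .)l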
- .pp
- An MP-1 system of maximum size has a peak speed of 26 GIPS (32-bit
- operations) or 550 MFLOPS (double precision) and dissipates about a
- kilowatt (not including I/O). The maximum memory size is 1GB and the
- maximum bandwidth to memory is 12 GB/sec. When the X-Net is used, the
- maximum aggregate inter-PE communication bandwidth is 23GB/sec. In
- addition, a three-stage global routing network is provided, utilizing
- custom routing chips and achieving up to 1.3 GB/sec aggregate bandwidth.
- This same network is also connected to a 256 MB I/O RAM buffer that is in
- turn connected to a frame buffer and various I/O devices.
- .pp
- Although the processor is internally a 4-bit device (e.g. the datapaths are
- 4 bits wide), it contains 40 programmer-visible, 32-bit registers and
- supports integer operands of 1, 8, 16, 32, or 64 bits. In addition, the
- same hardware performs 32- and 64-bit floating point operations. This last
- characteristic is reminiscent of the CM1 design, but not the CM2 with its
- separate Weiteks. Indeed, a 16K MP-1 does perform 16K floating point adds
- as fast as it performs one, whereas a 64K CM2 performs only 2K floating
- point adds concurrently (one per Weitek). The tradeoff is naturally in
- single processor floating point speed. The larger, and hence less
- numerous, Weiteks produce several MFLOPS each; the MP-1 processors achieve
- only a few dozen KFLOPS (which surpasses the older CM1 processors).
- .pp
- MasPar is able to package 32 of these 4-bit processors on a single chip,
- illustrating the improved technology now available (two-level metal, 1.6
- micron CMOS with 450,000 transistors) compared to the circa 1985 technology
- used in CM1, which contained only 16 1-bit processors per chip. Each
- 14"x19" processor board contains 1024 processors, clocked at 80ns, and
- 16 MB of ECC memory, the latter organized as 16KB per processor and
- implemented using page mode 1Mb DRAMs.
- .pp
- A DECstation 5000 is used as a host and manages program execution, user
- interface, and network communications for an MP-1 system. The languages
- supported include data parallel versions of FORTRAN and C as well as the
- MasPar Parallel Application Language (MPL) that permits direct program
- control of the hardware. Ultrix, DEC's version of UNIX, runs on the host
- and provides a standard user interface. DEC markets the MP-1 as the DECmpp
- 12000.
- .pp
- Further information on the MP-1 can be found in [Chri90], [Nick90],
- [Blan90], and [Masp91]. An unconventional assessment of virtual
- processors, as used for example in CM2, appears in [Chri91].
- .uh References
- .(b I F
- .ll 14c
- .ti 0
- [ACDJ91]
- Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David
- Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa,
- Dan Nussbaum, Mike Parkin, and Donald Yeung,
- .q "The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor" ,
- in
- .i "Proceedings of Workshop on Scalable Shared Memory Multiprocessors" ,
- Kluwer Academic Publishers,
- 1991.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [AG89]
- George Almasi and Allan Gottlieb,
- .i "Highly Parallel Computing" ,
- Benjamin/Cummings,
- 1989, 519 pages.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Blan90]
- Tom Blank,
- .q "The MasPar MP-1 Architecture" ,
- .i "IEEE COMPCON Proceedings" ,
- 1990, pp. 20-24.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Chri90]
- Peter Christy,
- .q "Software to Support Massively Parallel Computing on the MasPar MP-1" ,
- .i "IEEE COMPCON Proceedings" ,
- 1990,
- pp. 29-33.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Chri91]
- Peter Christy,
- .q "Virtual Processors Considered Harmful" ,
- .i "Sixth Distributed Memory Computing Conference Proceedings" ,
- 1991.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [DS90]
- Robert B.K. Dewar and Matthew Smosna,
- .i "Microprocessors: A Programmers View" ,
- McGraw-Hill, New York, 1990.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [GKLS84]
- Daniel Gajski, David Kuck, Duncan Lawrie, and Ahmed Sameh,
- .q "Cedar" ,
- in
- .i "Supercomputers: Design and Applications" ,
- Kai Hwang, ed., 1984.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [HHW90]
- E. Hagersten, S. Haridi, and D.H.D. Warren,
- .q "The Cache-Coherent Protocol of the Data Diffusion Machine" ,
- .i "Cache and Interconnect Architectures in Multiprocessors" ,
- edited by Michel Dubois and Shreekant Thakkar, 1990.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Inte91]
- Intel Corporation literature, November 1991.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [LLSJ92]
- Dan Lenoski, James Laudon, Luis Stevens, Truman Joe,
- Dave Nakahira, Anoop Gupta, and John Hennessy,
- .q "The DASH Prototype: Implementation and Performance" ,
- .i "Proc. 19th Annual International Symposium on Computer Archtecture" ,
- May, 1992,
- Gold Coast, Australia,
- pp. 92-103.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Masp91]
- .q "MP-1 Family Massively Parallel Computers" ,
- MasPar Computer Corporation,
- 1991.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [MS91]
- Jacek Myczkowski and Guy Steele,
- .q "Seismic Modeling at 14 gigaflops on the Connection Machine" ,
- .i "Proc. Supercomputing '91" ,
- Albuquerque, November, 1991.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Nick90]
- John R. Nickolls,
- .q "The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer" ,
- .i "IEEE COMPCON Proceedings" , 1990, pp. 25-28.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Roth92]
- James Rothnie,
- .q "Overview of the KSR1 Computer System" ,
- Kendall Square Research Report TR 9202001,
- March, 1992.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [Seit85]
- Charles L. Seitz,
- .q "The Cosmic Cube" ,
- .i "Communications of the ACM" ,
- .b 28
- (1),
- January 1985,
- pp. 22-33.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [SJG92]
- Per Stenstrom, Truman Joe, and Anoop Gupta,
- .q "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures" ,
- .i "Proceedings, 19th International Symposium on Computer Architecture" ,
- 1992.
- .)b
- .(b I F
- .ll 14c
- .ti 0
- [TMC91]
- .q "The Connection Machine CM-5 Technical Summary" ,
- Thinking Machines Corporation,
- 1991.
- .)b
-
-