SUPERCOMPUTING TUTORIAL Part 1.

    The following is a basic overview of Supercomputing and some of
    the key issues within this market segment. It was written for the
    PC-LIB BBS Supercomputer SIG, (612) 435-2554, located in
    Burnsville, Minnesota, by ZIDEK Inc. The Supercomputer SIG is
    devoted to all aspects of Supercomputing, Computer Architecture,
    Parallel Processing and Scientific and Engineering Software.

    ZIDEK Inc. is an international Supercomputer Systems Services
    and Consulting firm located at:

        13195 Flamingo Court
        Apple Valley, Minnesota 55124
        Telephone: (612) 432-2835
        FAX: (612) 432-4492
        Telex: 910-576-0061

    This text may be freely distributed within the public domain for
    educational purposes provided that credit and acknowledgement be
    given to the authors at ZIDEK Inc.

    Copyright ZIDEK Inc. 1987

OVERVIEW

There are currently five manufacturers of supercomputers with products in the marketplace: Cray (the Cray X-MP and Cray-2), CDC/ETA (the CYBER 205 and ETA-10), Fujitsu (the VP), Hitachi (the S810), and NEC (the SX). All of these machines are vector processors; that is, a single computer instruction may be used to call in a large number of operands, which then flow pair-wise through one or more arithmetic units or pipelines where the specified operation is executed in a segmented, parallel or overlapped fashion. Usually, the longer the operand array (or vector), the greater the effectiveness per operation.

However, in many cases users are faced with numerical problems that cannot easily be organized in vectorized form, or, if they can, the vector length is very short. If this is true for any significant part of the computation, the overall performance slows to nearly the speed of a single processor, the so-called scalar speed of the machine. This phenomenon is often referred to as Amdahl's law. It is for this reason that computer users and designers stress the importance of fast scalar speed, frown on vector technology, and promote parallel processing instead.

With parallel processing, the problem is decomposed into a number of (possibly interacting) subproblems and these are spread among a plurality of closely coupled processors. In this case, the decomposition need not be done on a functional basis. But if the problem cannot be broken into approximately equal independent segments, the maximum possible performance may not be sustainable in most designs. It is also generally true that whenever a problem permits vector processing, it can in most cases be reformulated as a parallel problem. There are also cases where the problem cannot be vectorized but is nevertheless highly parallel. What is needed is a computer capable of easily exploiting all the various forms of parallelism (including parallelism of the vector type) and which has the fastest possible scalar speed. Recognizing this, current supercomputer vendors are in a race to design, develop, and introduce systems in the next decade that will embody these desirable attributes.

When a program is transferred to a Cray (or other vector processing computer), the Fortran language compiler detects parallel portions and identifies those which can be expressed in vector form. These portions are executed by the vector unit; the rest (the scalar portion) of the program is executed by the sequential portion of the machine, which in most instances issues one instruction at a time.
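The interplay of vector speed, scalar speed, and overall performance described above can be made concrete with a short sketch. The C fragment below is not part of the original tutorial; the saxpy kernel, the 10x vector speed, and the vectorized fractions are illustrative assumptions only. It shows the kind of loop a vectorizing compiler maps onto the vector unit, together with an Amdahl's law estimate of the overall speedup when only part of the work vectorizes.

    #include <stdio.h>

    #define N 1024

    /* Vectorizable kernel: the iterations are independent, so a single
     * vector instruction stream can process long runs of operands. */
    void saxpy(double *a, const double *b, const double *c, double s, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

    /* Amdahl's law: if a fraction f of the run time vectorizes and the
     * vector unit is v times faster than scalar code, the overall
     * speedup is 1 / ((1 - f) + f / v). */
    double amdahl_speedup(double f, double v)
    {
        return 1.0 / ((1.0 - f) + f / v);
    }

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        saxpy(a, b, c, 3.0, N);
        printf("a[10] = %g\n", a[10]);

        /* Even with a 10x vector unit, a job that is only 70%
         * vectorizable gains far less than 10x overall. */
        printf("70%% vectorized: %.2fx overall\n", amdahl_speedup(0.70, 10.0));
        printf("95%% vectorized: %.2fx overall\n", amdahl_speedup(0.95, 10.0));
        return 0;
    }

A run of this sketch would report roughly 2.7x and 6.9x overall for the two cases, which is why fast scalar speed matters as much as raw vector speed.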
IBM, DEC, CDC, UNISYS and other major manufacturers are providing two to eight CPUs with which to experiment. ELXSI, Intel, BBN, Multiflow, Ncube, Thinking Machines, Floating Point Systems, and other new small start-up firms are building systems with an even larger number of CPUs. However, none of these appear to be comparable in general overall performance to current supercomputers.

There are only a few instances where any parallel processor systems are capable of being utilized in a highly effective way. This includes current supercomputers.

Supercomputers primarily execute scientific and engineering programs; the overwhelming majority of these programs are written in the high-level language Fortran. Typical users have thousands to hundreds of thousands of lines of existing Fortran code which they regularly execute. Additionally, the typical user regularly generates new programs (to solve new scientific or engineering problems). For these new problems, the language of choice for most users is Fortran, although the languages PASCAL, MODULA-2, ADA, and "C" are gaining some recognition.

Today, insufficient software exists for any of these systems to demonstrate their true capability. Vector processor software is far ahead of parallel processor software, but is still inadequate to demonstrate the full capability of the hardware. To date, very few programs have been written that achieve more than 50% of the vector or parallel processing capability. Little research work has gone into studying the optimization of programs for highly parallel computer systems. The five current supercomputer manufacturers, however, offer Fortran vectorizers/optimizers which enable users to interact with their programs to produce more effective program code.

With respect to parallel supercomputing, there has been even less optimizing work done. Cray Research and ETA Systems are developing UNIX-based systems that offer enhanced Fortran parallel processing features, based on the ANSI 8X standard, for their new parallel supercomputer designs: the Cray-2, Cray X-MP, Cray Y-MP, and Cray-3, and the ETA-10 and ETA-30, respectively. Except for the recently announced Hitachi S-810-80, the Japanese vendors have not yet embraced parallel supercomputing in any forthcoming commercial design.

Many of the basic technological advances which are expected in semiconductors, biotechnology, aircraft, nuclear power, etc. depend for their realization on the availability of higher-performance supercomputers. Consequently, the hardware and software aspects of high-performance computing are the focus of much research activity. There are at least 100 experimental parallel processing projects throughout the world, mostly at universities. This research has identified a number of key questions, which include:

Hardware:

 o What is the optimum interconnection method between parallel processors?
 o Will synchronization costs be high?
 o How should parallelism be controlled?
 o Can hardware be built to support both fine-grained and coarse-grained parallelism?
 o Can caching be synchronized or be made coherent for a large number of parallel processors?
 o What level of granularity will provide the most efficient execution?
 o Can parallel architectures be made extensible?

Software:

 o Can automatic software to parallelize existing programs be developed?
 o Can existing languages support parallelism?
 o What is the optimum granularity?
 o How can algorithms and programs be mapped onto parallel architectures?
 o What are general guidelines for developing parallel algorithms or programs?
 o How does one debug a parallel processor?
 o How can an operating system support parallel processing?

All commercially extant parallel processing systems have, to one extent or another, addressed only subsets of these issues in their respective designs.

Interconnection

The interconnection method affects speed and generality. It affects the speed of a parallel processor because an inadequate interconnection can create a bottleneck which slows down computation. It affects generality because some interconnection structures are well adapted to certain computations but are poorly structured for others.

Control of Parallelism

There are two basic control philosophies: static and dynamic. Static resource allocation requires a detailed analysis of the computation prior to execution. Such an analysis is only available for a few specialized programs, and we are not aware that it can be achieved today by compilers or other software in the general case. Dynamic resource allocation has classically been achieved in multiprocessors by the operating system. In theory, this approach has general applicability. In practice it is not particularly useful in speeding up a single large computation, because operating systems tend to introduce high overhead.

Granularity

Some research approaches (e.g. data flow) generally tend to deal only with parallelism at the level of elementary operations such as add or multiply (fine-grained parallelism). Other approaches (e.g. multiprocessors such as the 4-processor Cray X-MP or the ETA-10) can exploit parallelism only when the problem can be decomposed into large, essentially independent subcomputations (coarse-grained parallelism).

Cache

Many designs employ cache memories to compensate for relatively low-performance interconnection designs or slow main memory systems. This type of structure introduces the coherency problem, in which the data in different caches becomes inconsistent. In the absence of caches, high-performance interconnection structures and fast main memory are essential.

Level of Granularity

Different research machines are effective at different levels of granularity, and consequently the research community has addressed the problem of which level of granularity is most efficient.

Extensibility

There is a widespread belief that applications of the future will require hundreds or thousands of times the computing power currently available. Moreover, it is thought that these applications will be highly amenable to parallel processing of some form. It is therefore desirable to have an architecture which can be extended to very large designs utilizing hundreds, thousands or millions of processors. Such an architecture could be extended whenever the price of components decreases or as soon as customers are willing to pay higher prices for larger machines.

Automatic parallelization

Supercomputer users have an enormous investment in application programs, which must be protected. These programs are mostly written in Fortran. Unfortunately, there can be no compiler which can make Fortran code highly parallel throughout the computation. A Fortran program is sometimes parallel and sometimes not. For this reason, the research and special-purpose machines perform comparatively poorly on run-of-the-mill Fortran programs. Because of their fast scalar speed, current supercomputers show very high performance for program segments which do not contain much parallelism.
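As a hedged illustration of this point (it does not come from the original text, and is written in C rather than Fortran for brevity), the two loops below show why automatic parallelization can only go so far. The first loop's iterations are independent and can be spread across processors or fed to a vector unit; the second carries a dependence from one iteration to the next, so it must run essentially at scalar speed no matter how clever the compiler is.

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 0.5 * i; }

        /* Independent iterations: each a[i] depends only on b[i], so a
         * compiler can vectorize or parallelize this loop freely. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        /* Loop-carried dependence: each a[i] needs the a[i-1] computed
         * in the previous iteration, so the iterations cannot simply be
         * run at the same time.  Code like this keeps a run-of-the-mill
         * program partly scalar. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

        printf("a[%d] = %g\n", N - 1, a[N - 1]);
        return 0;
    }

Recurrences of this kind can sometimes be restructured by hand (for example as a parallel prefix computation), which is the sort of reformulation the vectorizers and pre-processors discussed below try to assist.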
Compilers which identify the parallelism available in Fortran programs have been in use for some time. These compilers have been used for vector processing supercomputers as well as for multiprocessors such as the Alliant. Also available are Fortran pre-processors, such as Pacific Sierra's VAST and Kuck & Associates' KAP.

Existing Languages to support Parallelism

Because ordinary Fortran programs are usually not suitable for research machines, some researchers have argued that special languages, such as ID, OCCAM, SISAL, VAL, etc., might be used to generate more parallelism, and perhaps enough to achieve high performance. Unfortunately, these efforts are as a rule not applicable to the mainstream supercomputer market, because the users are, at this time, either unable to afford to rewrite, or unwilling to rewrite, existing application programs.

Optimum granularity

For most parallel architecture implementations, the programmer must adjust the granularity to suit the implementation.

Mapping

In many research machines, performance depends critically on which sections of program code are assigned to which processors, on which segments of data are assigned to which memory units, and on similar issues. These issues are collectively known as the mapping problem. The challenge for the programmer using vector or parallel machines is to devise algorithms and arrange the computations so that the architectural features of a particular machine are fully utilized. General-purpose machines are generally less likely to demand this kind of mapping to achieve high performance.

Guidelines for developing parallel algorithms

As applications migrate to parallel computers, a central question becomes how algorithms should be written to exploit parallelism. Forcing the algorithm designer or programmer to figure out and program explicit parallel control and synchronization is recognized as not being the best approach. Explicit hand-coded algorithms introduce a new level of complexity, complicate debugging by introducing time-dependent errors, and can reduce the portability and robustness of algorithms by forcing the recoding of programs for each different model of parallel computer and, in some cases, for the same computer as individual processors are added or removed for repair. Unfortunately, in spite of massive worldwide research, no general unifying principles for parallel processing have yet been discovered.

Debugging

Debugging parallel algorithms is a difficult task indeed, because of their intrinsic susceptibility to time-dependent errors (see the short sketch following the Operating System Support discussion below). Programmers dealing with real-time events and operating systems have been wrestling with these problems since the beginning of the computer age, and undoubtedly this will continue to be a problem. Even in the absence of general unifying principles of parallel processing, it is a problem that confronts all supercomputer vendors, and all are devoting considerable resources to developing workable debugging support programs.

Operating System Support

Many, but not all, multiprocessors use the operating system to support parallelism. When the operating system is used in this way, it appears to be a major problem to prevent overheads from cancelling the benefits of parallelism.
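To make the notion of a time-dependent error concrete, the following sketch uses modern POSIX threads purely as an illustration; threading of this form is not part of the environment described in this tutorial, and every name in the sketch is an assumption. Two threads add halves of an array into a shared total without synchronization; because the updates interleave differently on every run, the printed result varies from run to run, which is exactly the behaviour that makes parallel programs hard to debug. Uncommenting the mutex lines makes the result repeatable.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    static double data[N];
    static double total = 0.0;              /* shared, unprotected */
    /* static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; */

    /* Each thread adds its half of the array into the shared total.
     * Because the update of 'total' is not atomic, the two threads can
     * interleave and lose updates: a time-dependent error. */
    static void *partial_sum(void *arg)
    {
        int half = *(int *)arg;
        int lo = half * (N / 2), hi = lo + N / 2;
        for (int i = lo; i < hi; i++) {
            /* pthread_mutex_lock(&lock);    -- uncommenting these two   */
            total += data[i];
            /* pthread_mutex_unlock(&lock);  -- lines removes the race   */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        int zero = 0, one = 1;

        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        pthread_create(&t0, NULL, partial_sum, &zero);
        pthread_create(&t1, NULL, partial_sum, &one);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Expected 1000000, but the unsynchronized version typically
         * prints a different (and varying) value on each run. */
        printf("total = %.0f (expected %d)\n", total, N);
        return 0;
    }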
The Market

Cray Research currently dominates the supercomputer market, with approximately 63 percent of the worldwide installed base. The market is characterized by an increasing recognition of computer simulation or modeling as a highly productive alternative to traditional experimental engineering techniques such as prototype construction. The market has been broadened by continuing price/performance improvements and deepened by an increasing appetite for more and more simulation. Supercomputers are used in applications such as environmental forecasting, aerodynamics, structural analysis, nuclear research, computational physics, astronomy, chemical engineering, fluid dynamics, molecular science, image processing, graphics, electronic circuit design, and combustion analysis.

The supercomputer market includes both government and commercial sectors. U.S. Government laboratories and agencies have been the historic testing grounds for large-scale, innovative computers, and they are the principal targets for supercomputer systems. The U.S. Government sector has a shorter selling cycle than the commercial sector. U.S. Government installations have grown at a compound rate of 37.3% over the last five years.

The rapidly developing commercial sector exists because of dramatic price/performance improvements in supercomputers during the last decade, because of the learning curve generally experienced in the utilization of these systems, and because recently hired university graduates who have used supercomputers demand the fastest computers when they enter industry. Commercial sector selling cycles can range from six months to four years. Installations in the commercial sector have grown at a compound rate of 47.3% over the last five years.

Mini-supercomputers

The term is a misnomer. It could mean a physically small supercomputer or, as it is used in the trade media, it can describe computers that are less powerful than supercomputers. This is also contradictory, because the term supercomputer is defined by authoritative sources as "the most powerful computers at any given point in time". In technical publications, marketing literature, and the general news media, other terms used for this market segment include "near", "entry", and "affordable" supercomputers, and such variants as "crayettes" and "personal" supercomputers.

There is also a growing tendency to attach the label "supercomputer" to almost any machine that employs vector, multiprocessing or parallel computing architectural concepts in its design. Also frequently described as supercomputers are architectures aimed at the artificial intelligence domain, particularly those employing parallel processing and dataflow techniques. Such "super intelligent" machines will no doubt depend on many of the same technologies as supercomputers, from both the architectural and the device technology point of view. But, in our opinion, at this point in their development they are not powerful enough to be included in the supercomputer category.

Some array processors are capable, in a highly restricted range of applications, of achieving performance comparable to that of supercomputers, but these cannot be categorized as supercomputers because they lack general-purpose supercomputer capability across a broad spectrum of applications. These machines are therefore not directly competitive with supercomputers, since they have substantially lower performance, much smaller memories, and so on. The distinction is also inherent in the economics: the cost of manufacture of a typical supercomputer is several times the average selling price of a typical mini-supercomputer.
This segment of the computer market is growing very rapidly, and there seems little doubt that some of the participants will continue to grow for the time being. While it might be thought that this growth could erode supercomputer sales, statistics, market research and our own experience do not support this view. It is more likely that the success of "mini-supercomputers" will foster new scientific and engineering applications, much as DEC's VAX has, and this in turn will drive a larger demand for supercomputers.

The major strength of this group of competitors is a price/performance advantage over minicomputer-based and departmental-level scientific and engineering computing. The group is actually composed of two distinct sub-groups: those that offer compatibility with current supercomputers (i.e., Cray), either on a total operating system basis or on a Fortran compiler basis, and those that are incompatible with supercomputers and manifest new architectural and software innovations. Virtually all support the UNIX operating system environment.

The subgroup that offers supercomputer software compatibility can exploit the large collection of scientific and engineering applications software that has already been developed and vectorized. Therefore, by ensuring compatibility, these vendors minimize applications software development.

The other subgroup comprising this market consists of the non-compatibles. Because of frantic competition, not only with entrenched vendors but also with others within this segment, this group is perhaps one of the most innovative. One of the most promising of these, Alliant, has a very innovative parallel and vector processor architecture and a Fortran compiler that supports both automatic vectorization and automatic parallelization.

Other examples of innovative design include Intel's iPSC system, based on the Caltech Cosmic Cube architecture (Ncube and FPS are also building variants of this design); Culler Scientific; and Multiflow, which is offering a system based on a VLIW (very long instruction word) design similar to CDC's Cyberplus system. Elxsi is offering a parallel multiprocessing system based on a very fast interprocessor system bus architecture, and many others are building machines that utilize hundreds, and in some cases thousands, of microprocessors.

Both groups offer systems with claimed superior price/performance in the under-$500k scientific/engineering minicomputer market. Compared to large-scale, general-purpose scientific systems, the major weaknesses of most of these systems are limited system throughput and the lack of robust system software. Most are depending either on the UNIX market or on the supercomputer applications software market to fill the software void. Another major weakness of this group is that most of the entrants do not have the financial resources or the mature marketing, sales, distribution, service and corporate management infrastructures needed to compete on a large scale in both domestic and international markets against established major vendors.

The incompatible and esoteric designs will encounter a key bottleneck: the amount of existing software and applications that can be converted to use parallelism. Existing minicomputer suppliers will also have to respond to price/performance pressure, and parallel architecture is not the only way to do so. The history of esoteric non-von Neumann architectures is full of failed attempts at commercialization.
Outlook and Conclusion

The market has now been estimated by many sources to be in excess of $1 billion worldwide, and at its current growth rate it may reach $2 billion in the early-1990s time frame.

1) Architecture

Parallel architecture is generally agreed to be the next step in higher-performance supercomputing. Cray, ETA, Supercomputers Inc., and other competitors are all developing parallel supercomputers.

2) Software

A successful entry into the supercomputer market requires a software system compatible with the existing user environment, one which not only permits new applications but also protects users with large investments in software.