SUPERCOMPUTING TUTORIAL Part 1.

    The following is a basic overview of Supercomputing and some of
    the key issues within this market segment. It was written for the
    PC-LIB BBS Supercomputer SIG, (612) 435-2554, located in
    Burnsville, Minnesota, by ZIDEK Inc. The Supercomputer SIG is
    devoted to all aspects of Supercomputing, Computer Architecture,
    Parallel Processing and Scientific and Engineering Software.

    ZIDEK Inc. is an international Supercomputer Systems Services
    and Consulting firm located at:

        13195 Flamingo Court
        Apple Valley, Minnesota 55124
        Telephone: (612) 432-2835
        FAX: (612) 432-4492
        Telex: 910-576-0061

    This text may be freely distributed within the public domain for
    educational purposes provided that credit and acknowledgement be
    given to the authors at ZIDEK Inc.

    Copyright ZIDEK Inc. 1987

OVERVIEW

There are currently five manufacturers of supercomputers with products in the marketplace: Cray (the Cray X-MP and Cray-2), CDC/ETA (the CYBER 205 and ETA-10), Fujitsu (the VP), Hitachi (the S810), and NEC (the SX). All of these machines are vector processors; that is, a single computer instruction may be used to call in a large number of operands, which then flow pair-wise through one or more arithmetic units or pipelines where the specified operation is executed in a segmented, parallel or overlapped fashion. Usually, the longer the operand array (or vector), the greater the effectiveness per operation.

However, in many cases users are faced with numerical problems that cannot easily be organized in vectorized form, or, if they can, the vector length is very short. If this is true for any significant part of the computation, the overall performance slows to nearly the speed of a single processor, the so-called scalar speed of the machine. This phenomenon is often referred to as Amdahl's law. It is for this reason that computer users and designers stress the importance of fast scalar speed, frown on vector technology, and promote parallel processing instead.

With parallel processing, the problem is decomposed into a number of (possibly interacting) subproblems and these are spread among a plurality of closely coupled processors. In this case, the decomposition need not be done on a functional basis. But if the problem cannot be broken into approximately equal independent segments, the maximum possible performance may not be sustainable in most designs. It is also generally true that whenever a problem permits vector processing, it can in most cases be reformulated as a parallel problem. There are also cases where the problem cannot be vectorized but is nevertheless highly parallel. What is needed is a computer capable of easily exploiting all the various forms of parallelism (including parallelism of the vector type) and which has the fastest possible scalar speed. Recognizing this, current supercomputer vendors are in a race to design, develop, and introduce systems in the next decade that will embody these desirable attributes.

When a program is transferred to a Cray (or other vector processing computer), the Fortran language compiler detects parallel portions and identifies those which can be expressed in vector form. These portions are executed by the vector unit; the rest (the scalar portion) of the program is executed by the sequential portion of the machine, which in most instances issues one instruction at a time.
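The interplay of vector speed, scalar speed, and overall performance described above can be made concrete with a short sketch. The C fragment below is not part of the original tutorial; the saxpy kernel, the 10x vector speed, and the vectorized fractions are illustrative assumptions only. It shows the kind of loop a vectorizing compiler maps onto the vector unit, together with an Amdahl's law estimate of the overall speedup when only part of the work vectorizes.

    #include <stdio.h>

    #define N 1024

    /* Vectorizable kernel: the iterations are independent, so a single
     * vector instruction stream can process long runs of operands. */
    void saxpy(double *a, const double *b, const double *c, double s, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

    /* Amdahl's law: if a fraction f of the run time vectorizes and the
     * vector unit is v times faster than scalar code, the overall
     * speedup is 1 / ((1 - f) + f / v). */
    double amdahl_speedup(double f, double v)
    {
        return 1.0 / ((1.0 - f) + f / v);
    }

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        saxpy(a, b, c, 3.0, N);
        printf("a[10] = %g\n", a[10]);

        /* Even with a 10x vector unit, a job that is only 70%
         * vectorizable gains far less than 10x overall. */
        printf("70%% vectorized: %.2fx overall\n", amdahl_speedup(0.70, 10.0));
        printf("95%% vectorized: %.2fx overall\n", amdahl_speedup(0.95, 10.0));
        return 0;
    }

A run of this sketch would report roughly 2.7x and 6.9x overall for the two cases, which is why fast scalar speed matters as much as raw vector speed.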
IBM, DEC, CDC, UNISYS and other major manufacturers are providing two to eight CPUs with which to experiment. ELXSI, Intel, BBN, Multiflow, Ncube, Thinking Machines, Floating Point Systems, and other new small start-up firms are building systems with an even larger number of CPUs. However, none of these appear to be comparable in general overall performance to current supercomputers.

There are only a few instances where any parallel processor systems are capable of being utilized in a highly effective way. This includes current supercomputers.

Supercomputers primarily execute scientific and engineering programs; the overwhelming majority of these programs are written in the high-level language Fortran. Typical users have thousands to hundreds of thousands of lines of existing Fortran code which they regularly execute. Additionally, the typical user regularly generates new programs (to solve new scientific or engineering problems). For these new problems, the language of choice for most users is Fortran, although the languages PASCAL, MODULA-2, ADA, and "C" are gaining some recognition.

Today, insufficient software exists for any of these systems to demonstrate their true capability. Vector processor software is far ahead of parallel processor software, but is still inadequate to demonstrate the full capability of the hardware. To date, very few programs have been written that achieve more than 50% of the vector or parallel processing capability. Little research work has gone into studying the optimization of programs for highly parallel computer systems. The five current supercomputer manufacturers, however, offer Fortran vectorizers/optimizers which enable users to interact with their programs to produce more effective program code.

With respect to parallel supercomputing, there has been even less optimizing work done. Cray Research and ETA Systems are developing UNIX-based systems that offer enhanced Fortran parallel processing features, based on the ANSI 8X standard, for their new parallel supercomputer designs: the Cray-2, Cray X-MP, Cray Y-MP, and Cray-3, and the ETA-10 and ETA-30, respectively. Except for the recently announced Hitachi S-810-80, the Japanese vendors have not yet embraced parallel supercomputing in any forthcoming commercial design.

Many of the basic technological advances which are expected in semiconductors, biotechnology, aircraft, nuclear power, etc. depend for their realization on the availability of higher-performance supercomputers. Consequently, the hardware and software aspects of high-performance computing are the focus of much research activity. There are at least 100 experimental parallel processing projects throughout the world, mostly at universities. This research has identified a number of key questions, which include:

Hardware:

 o What is the optimum interconnection method between parallel processors?
 o Will synchronization costs be high?
 o How should parallelism be controlled?
 o Can hardware be built to support both fine-grained and coarse-grained parallelism?
 o Can caching be synchronized or be made coherent for a large number of parallel processors?
 o What level of granularity will provide the most efficient execution?
 o Can parallel architectures be made extensible?

Software:

 o Can automatic software to parallelize existing programs be developed?
 o Can existing languages support parallelism?
 o What is the optimum granularity?
 o How can algorithms and programs be mapped onto parallel architectures?
 o What are general guidelines for developing parallel algorithms or programs?
 o How does one debug a parallel processor?
 o How can an operating system support parallel processing?

All commercially extant parallel processing systems have, to one extent or another, addressed only subsets of these issues in their respective designs.

Interconnection

The interconnection method affects speed and generality. It affects the speed of a parallel processor because an inadequate interconnection can create a bottleneck which slows down computation. It affects generality because some interconnection structures are well adapted to certain computations but are poorly structured for others.

Control of Parallelism

There are two basic control philosophies: static and dynamic. Static resource allocation requires a detailed analysis of the computation prior to execution. Such an analysis is only available for a few specialized programs, and we are not aware that it can be achieved today by compilers or other software in the general case. Dynamic resource allocation has classically been achieved in multiprocessors by the operating system. In theory, this approach has general applicability. In practice it is not particularly useful in speeding up a single large computation, because operating systems tend to introduce high overhead.

Granularity

Some research approaches (e.g. data flow) generally tend to deal only with parallelism at the level of elementary operations such as add or multiply (fine-grained parallelism). Other approaches (e.g. multiprocessors such as the 4-processor Cray X-MP or the ETA-10) can exploit parallelism only when the problem can be decomposed into large, essentially independent subcomputations (coarse-grained parallelism).

Cache

Many designs employ cache memories to compensate for relatively low-performance interconnection designs or slow main memory systems. This type of structure introduces the coherency problem, in which the data in different caches becomes inconsistent. In the absence of caches, high-performance interconnection structures and fast main memory are essential.

Level of Granularity

Different research machines are effective at different levels of granularity, and consequently the research community has addressed the problem of which level of granularity is most efficient.

Extensibility

There is a widespread belief that applications of the future will require hundreds or thousands of times the computing power currently available. Moreover, it is thought that these applications will be highly amenable to parallel processing of some form. It is therefore desirable to have an architecture which can be extended to very large designs utilizing hundreds, thousands or millions of processors. Such an architecture could be extended whenever the price of components decreases or as soon as customers are willing to pay higher prices for larger machines.

Automatic parallelization

Supercomputer users have an enormous investment in application programs, which must be protected. These programs are mostly written in Fortran. Unfortunately, there can be no compiler which can make Fortran code highly parallel throughout the computation. A Fortran program is sometimes parallel and sometimes not. For this reason, the research and special-purpose machines perform comparatively poorly on run-of-the-mill Fortran programs. Because of their fast scalar speed, current supercomputers show very high performance for program segments which do not contain much parallelism.
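As a hedged illustration of this point (it does not come from the original text, and is written in C rather than Fortran for brevity), the two loops below show why automatic parallelization can only go so far. The first loop's iterations are independent and can be spread across processors or fed to a vector unit; the second carries a dependence from one iteration to the next, so it must run essentially at scalar speed no matter how clever the compiler is.

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 0.5 * i; }

        /* Independent iterations: each a[i] depends only on b[i], so a
         * compiler can vectorize or parallelize this loop freely. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        /* Loop-carried dependence: each a[i] needs the a[i-1] computed
         * in the previous iteration, so the iterations cannot simply be
         * run at the same time.  Code like this keeps a run-of-the-mill
         * program partly scalar. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

        printf("a[%d] = %g\n", N - 1, a[N - 1]);
        return 0;
    }

Recurrences of this kind can sometimes be restructured by hand (for example as a parallel prefix computation), which is the sort of reformulation the vectorizers and pre-processors discussed below try to assist.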
Compilers which identify the parallelism available in Fortran programs have been in use for some time. These compilers have been used for vector processing supercomputers as well as for multiprocessors such as the Alliant. Also available are Fortran pre-processors, such as Pacific Sierra's VAST and Kuck & Associates' KAP.

Existing Languages to support Parallelism

Because ordinary Fortran programs are usually not suitable for research machines, some researchers have argued that special languages, such as ID, OCCAM, SISAL, VAL, etc., might be used to generate more parallelism, and perhaps enough to achieve high performance. Unfortunately, these efforts are as a rule not applicable to the mainstream supercomputer market, because the users are, at this time, either unable to afford to rewrite, or unwilling to rewrite, existing application programs.

Optimum granularity

For most parallel architecture implementations, the programmer must adjust the granularity to suit the implementation.

Mapping

In many research machines, performance depends critically on which sections of program code are assigned to which processors, on which segments of data are assigned to which memory units, and on similar issues. These issues are collectively known as the mapping problem. The challenge for the programmer using vector or parallel machines is to devise algorithms and arrange the computations so that the architectural features of a particular machine are fully utilized. General-purpose machines are generally less likely to demand this kind of mapping to achieve high performance.

Guidelines for developing parallel algorithms

As applications migrate to parallel computers, a central question becomes how algorithms should be written to exploit parallelism. Forcing the algorithm designer or programmer to figure out and program explicit parallel control and synchronization is recognized as not being the best approach. Explicit hand-coded algorithms introduce a new level of complexity, complicate debugging by introducing time-dependent errors, and can reduce the portability and robustness of algorithms by forcing the recoding of programs for each different model of parallel computer and, in some cases, for the same computer as individual processors are added or removed for repair. Unfortunately, in spite of massive worldwide research, no general unifying principles for parallel processing have yet been discovered.

Debugging

Debugging parallel algorithms is a difficult task indeed, because of their intrinsic susceptibility to time-dependent errors (see the short sketch following the Operating System Support discussion below). Programmers dealing with real-time events and operating systems have been wrestling with these problems since the beginning of the computer age, and undoubtedly this will continue to be a problem. Even in the absence of general unifying principles of parallel processing, it is a problem that confronts all supercomputer vendors, and all are devoting considerable resources to developing workable debugging support programs.

Operating System Support

Many, but not all, multiprocessors use the operating system to support parallelism. When the operating system is used in this way, it appears to be a major problem to prevent overheads from cancelling the benefits of parallelism.
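To make the notion of a time-dependent error concrete, the following sketch uses modern POSIX threads purely as an illustration; threading of this form is not part of the environment described in this tutorial, and every name in the sketch is an assumption. Two threads add halves of an array into a shared total without synchronization; because the updates interleave differently on every run, the printed result varies from run to run, which is exactly the behaviour that makes parallel programs hard to debug. Uncommenting the mutex lines makes the result repeatable.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    static double data[N];
    static double total = 0.0;              /* shared, unprotected */
    /* static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; */

    /* Each thread adds its half of the array into the shared total.
     * Because the update of 'total' is not atomic, the two threads can
     * interleave and lose updates: a time-dependent error. */
    static void *partial_sum(void *arg)
    {
        int half = *(int *)arg;
        int lo = half * (N / 2), hi = lo + N / 2;
        for (int i = lo; i < hi; i++) {
            /* pthread_mutex_lock(&lock);    -- uncommenting these two   */
            total += data[i];
            /* pthread_mutex_unlock(&lock);  -- lines removes the race   */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        int zero = 0, one = 1;

        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        pthread_create(&t0, NULL, partial_sum, &zero);
        pthread_create(&t1, NULL, partial_sum, &one);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Expected 1000000, but the unsynchronized version typically
         * prints a different (and varying) value on each run. */
        printf("total = %.0f (expected %d)\n", total, N);
        return 0;
    }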
The Market

Cray Research currently dominates the supercomputer market, with approximately 63 percent of the worldwide installed base. The market is characterized by an increasing recognition of computer simulation or modeling as a highly productive alternative to traditional experimental engineering techniques such as prototype construction. The market has been broadened by continuing price/performance improvements and deepened by an increasing appetite for more and more simulation. Supercomputers are used in applications such as environmental forecasting, aerodynamics, structural analysis, nuclear research, computational physics, astronomy, chemical engineering, fluid dynamics, molecular science, image processing, graphics, electronic circuit design, and combustion analysis.

The supercomputer market includes both government and commercial sectors. U.S. Government laboratories and agencies have been the historic testing grounds for large-scale, innovative computers, and they are the principal targets for supercomputer systems. The U.S. Government sector has a shorter selling cycle than the commercial sector. U.S. Government installations have grown at a compound rate of 37.3% over the last five years.

The rapidly developing commercial sector exists because of dramatic price/performance improvements in supercomputers during the last decade, because of the learning curve generally experienced in the utilization of these systems, and because recently hired university graduates who have used supercomputers demand the fastest computers when they enter industry. Commercial sector selling cycles can range from six months to four years. Installations in the commercial sector have grown at a compound rate of 47.3% over the last five years.

Mini-supercomputers

The term is a misnomer. It could mean a physically small supercomputer or, as it is used in the trade media, it can describe computers that are less powerful than supercomputers. This is also contradictory, because the term supercomputer is defined by authoritative sources as "the most powerful computers at any given point in time". In technical publications, marketing literature, and the general news media, other terms used for this market segment include "near", "entry", and "affordable" supercomputers, and such variants as "crayettes" and "personal" supercomputers.

There is also a growing tendency to attach the label "supercomputer" to almost any machine that employs vector, multiprocessing or parallel computing architectural concepts in its design. Also frequently described as supercomputers are architectures aimed at the artificial intelligence domain, particularly those employing parallel processing and dataflow techniques. Such "super intelligent" machines will no doubt depend on many of the same technologies as supercomputers, from both the architectural and the device technology point of view. But, in our opinion, at this point in their development they are not powerful enough to be included in the supercomputer category.

Some array processors are capable, in a highly restricted range of applications, of achieving performance comparable to that of supercomputers, but these cannot be categorized as supercomputers because they lack general-purpose supercomputer capability across a broad spectrum of applications. These machines are therefore not directly competitive with supercomputers, since they have substantially lower performance, much smaller memories, and so on. The distinction is also inherent in the economics: the cost of manufacture of a typical supercomputer is several times the average selling price of a typical mini-supercomputer.
This segment of the computer market is growing very rapidly, and there seems little doubt that some of the participants will continue to grow for the time being. While it might be thought that this growth could erode supercomputer sales, statistics, market research and our own experience do not support this view. It is more likely that the success of "mini-supercomputers" will foster new scientific and engineering applications, much as DEC's VAX has, and this in turn will drive a larger demand for supercomputers.

The major strength of this group of competitors is a price/performance advantage over minicomputer-based and departmental-level scientific and engineering computing. The group is actually composed of two distinct sub-groups: those that offer compatibility with current supercomputers (i.e., Cray), either on a total operating system basis or on a Fortran compiler basis, and those that are incompatible with supercomputers and manifest new architectural and software innovations. Virtually all support the UNIX operating system environment.

The subgroup that offers supercomputer software compatibility can exploit the large collection of scientific and engineering applications software that has already been developed and vectorized. Therefore, by ensuring compatibility, these vendors minimize applications software development.

The other subgroup comprising this market consists of the non-compatibles. Because of frantic competition, not only with entrenched vendors but also with others within this segment, this group is perhaps one of the most innovative. One of the most promising of these, Alliant, has a very innovative parallel and vector processor architecture and a Fortran compiler that supports both automatic vectorization and automatic parallelization.

Other examples of innovative design include Intel's iPSC system, based on the Caltech Cosmic Cube architecture (Ncube and FPS are also building variants of this design); Culler Scientific; and Multiflow, which is offering a system based on a VLIW (very long instruction word) design similar to CDC's Cyberplus system. Elxsi is offering a parallel multiprocessing system based on a very fast interprocessor system bus architecture, and many others are building machines that utilize hundreds, and in some cases thousands, of microprocessors.

Both groups offer systems with claimed superior price/performance in the under-$500k scientific/engineering minicomputer market. Compared to large-scale, general-purpose scientific systems, the major weaknesses of most of these systems are limited system throughput and the lack of robust system software. Most are depending either on the UNIX market or on the supercomputer applications software market to fill the software void. Another major weakness of this group is that most of the entrants do not have the financial resources or the mature marketing, sales, distribution, service and corporate management infrastructures needed to compete on a large scale in both domestic and international markets against established major vendors.

The incompatible and esoteric designs will encounter a key bottleneck: the amount of existing software and applications that can be converted to use parallelism. Existing minicomputer suppliers will also have to respond to price/performance pressure, and parallel architecture is not the only way to do so. The history of esoteric non-von Neumann architectures is full of failed attempts at commercialization.
Outlook and Conclusion

The market has now been estimated by many sources to be in excess of $1 billion worldwide, and at its current growth rate it may reach $2 billion in the early-1990s time frame.

1) Architecture

Parallel architecture is generally agreed to be the next step in higher-performance supercomputing. Cray, ETA, Supercomputers Inc., and other competitors are all developing parallel supercomputers.

2) Software

A successful entry into the supercomputer market requires a software system compatible with the existing user environment, one which not only permits new applications but also protects users with large investments in software.