- Beowulf HOWTO
- Jacek Radajewski and Douglas Eadline
- v1.1.1, 22 November 1998
-
- This document introduces the Beowulf Supercomputer architecture and
- provides background information on parallel programming, including
- links to other more specific documents, and web pages.
- ______________________________________________________________________
-
- Table of Contents
-
-
-
- 1. Preamble
-
- 1.1 Disclaimer
- 1.2 Copyright
- 1.3 About this HOWTO
- 1.4 About the authors
- 1.5 Acknowledgements
-
- 2. Introduction
-
- 2.1 Who should read this HOWTO ?
- 2.2 What is a Beowulf ?
- 2.3 Classification
-
- 3. Architecture Overview
-
- 3.1 What does it look like ?
- 3.2 How to utilise the other nodes ?
- 3.3 How does Beowulf differ from a COW ?
-
- 4. System Design
-
- 4.1 A brief background on parallel computing.
- 4.2 The methods of parallel computing
- 4.2.1 Why more than one CPU?
- 4.2.2 The Parallel Computing Store
- 4.2.2.1 Single-tasking Operating System
- 4.2.2.2 Multi-tasking Operating System:
- 4.2.2.3 Multitasking Operating Systems with Multiple CPUs:
-          4.2.2.4 Threads on a Multitasking Operating System with extra CPUs
- 4.2.2.5 Sending Messages on Multitasking Operating Systems with extra CPUs:
- 4.3 Architectures for parallel computing
- 4.3.1 Hardware Architectures
- 4.3.2 Software API Architectures
- 4.3.2.1 Messages
- 4.3.2.2 Threads
- 4.3.3 Application Architecture
- 4.4 Suitability
- 4.5 Writing and porting parallel software
- 4.5.1 Determine concurrent parts of your program
- 4.5.2 Estimate parallel efficiency
- 4.5.3 Describing the concurrent parts of your program
- 4.5.3.1 Explicit Methods
- 4.5.3.2 Implicit Methods
-
- 5. Beowulf Resources
-
- 5.1 Starting Points
- 5.2 Documentation
- 5.3 Papers
- 5.4 Software
- 5.5 Beowulf Machines
- 5.6 Other Interesting Sites
- 5.7 History
-
- 6. Source code
-
- 6.1 sum.c
- 6.2 sigmasqrt.c
- 6.3 prun.sh
-
-
- ______________________________________________________________________
-
-
-
- 1. Preamble
-
- 1.1. Disclaimer
-
- We will not accept any responsibility for any incorrect information
- within this document, nor for any damage it might cause when applied.
-
-
- 1.2. Copyright
-
-  Copyright © 1997 - 1998 Jacek Radajewski and Douglas Eadline.
- Permission to distribute and modify this document is granted under the
- GNU General Public Licence.
-
-
- 1.3. About this HOWTO
-
- Jacek Radajewski started work on this document in November 1997 and
- was soon joined by Douglas Eadline. Over a few months the Beowulf
- HOWTO grew into a large document, and in August 1998 it was split into
- three documents: Beowulf HOWTO, Beowulf Architecture Design HOWTO, and
- the Beowulf Installation and Administration HOWTO. Version 1.0.0 of
- the Beowulf HOWTO was released to the Linux Documentation Project on
- 11 November 1998. We hope that this is only the beginning of what
- will become a complete Beowulf Documentation Project.
-
-
- 1.4. About the authors
-
-
-  · Jacek Radajewski works as a Network Manager, and is studying for an
-  honors degree in computer science at the University of Southern
-  Queensland, Australia.  Jacek's first contact with Linux was in
-  1995 and it was love at first sight.  Jacek built his first Beowulf
-  cluster in May 1997 and has been playing with the technology ever
-  since, always trying to find new and better ways of setting things
-  up.  You can contact Jacek by sending e-mail to jacek@usq.edu.au
-
-  · Douglas Eadline, Ph.D. is President and Principal Scientist at
-  Paralogic, Inc., Bethlehem, PA, USA.  Trained as a
-  Physical/Analytical Chemist, he has been involved with computers
-  since 1978 when he built his first single board computer for use
-  with chemical instrumentation.  Dr. Eadline's interests now include
-  Linux, Beowulf clusters, and parallel algorithms.  Dr. Eadline can
-  be contacted by sending email to deadline@plogic.com
-
-
- 1.5. Acknowledgements
-
-  The writing of the Beowulf HOWTO was a long process and is finally
- complete, thanks to many individuals. I would like to thank the
- following people for their help and contribution to this HOWTO.
-
-  · Becky for her love, support, and understanding.
-
-  · Tom Sterling, Don Becker, and other people at NASA who started the
-  Beowulf project.
-
-  · Thanh Tran-Cong and the Faculty of Engineering and Surveying for
-  making the topcat Beowulf machine available for experiments.
-
-  · My supervisor Christopher Vance for many great ideas.
-
-  · My friend Russell Waldron for great programming ideas, his general
-  interest in the project, and support.
-
-  · My friend David Smith for proof reading this document.
-
-  · Many other people on the Beowulf mailing list who provided me with
-  feedback and ideas.
-
-  · All the people who are responsible for the Linux operating system
-  and all the other free software packages used on topcat and other
-  Beowulf machines.
-
-
- 2. Introduction
-
-
-  As the performance of commodity computer and network hardware
-  increases, and their prices decrease, it becomes more and more
-  practical to build parallel computational systems from off-the-shelf
-  components, rather than buying CPU time on very expensive
-  supercomputers.  In fact, the price to performance ratio of a Beowulf
-  type machine is between three and ten times better than that of
-  traditional supercomputers.  The Beowulf architecture scales well and
-  is easy to construct, and you only pay for the hardware, as most of
-  the software is free.
-
-
- 2.1. Who should read this HOWTO ?
-
- This HOWTO is designed for a person with at least some exposure to the
- Linux operating system. Knowledge of Beowulf technology or
- understanding of more complex operating system and networking concepts
- is not essential, but some exposure to parallel computing would be
- advantageous (after all you must have some reason to read this
- document). This HOWTO will not answer all possible questions you
- might have about Beowulf, but hopefully will give you ideas and guide
- you in the right direction. The purpose of this HOWTO is to provide
- background information, links and references to more advanced
- documents.
-
-
- 2.2. What is a Beowulf ?
-
- Famed was this Beowulf: far flew the boast of him, son of Scyld, in
- the Scandian lands. So becomes it a youth to quit him well with his
- father's friends, by fee and gift, that to aid him, aged, in after
- days, come warriors willing, should war draw nigh, liegemen loyal: by
- lauded deeds shall an earl have honor in every clan. Beowulf is the
- earliest surviving epic poem written in English. It is a story about
-  a hero of great strength and courage who defeated a monster called
- Grendel. See ``History'' to find out more about the Beowulf hero.
-
- There are probably as many Beowulf definitions as there are people who
- build or use Beowulf Supercomputer facilities. Some claim that one
- can call their system Beowulf only if it is built in the same way as
-  NASA's original machine.  Others go to the other extreme and call
- Beowulf any system of workstations running parallel code. My
- definition of Beowulf fits somewhere between the two views described
- above, and is based on many postings to the Beowulf mailing list:
-
-
- Beowulf is a multi computer architecture which can be used for
- parallel computations. It is a system which usually consists of one
- server node, and one or more client nodes connected together via
- Ethernet or some other network. It is a system built using commodity
- hardware components, like any PC capable of running Linux, standard
- Ethernet adapters, and switches. It does not contain any custom
- hardware components and is trivially reproducible. Beowulf also uses
- commodity software like the Linux operating system, Parallel Virtual
- Machine (PVM) and Message Passing Interface (MPI). The server node
- controls the whole cluster and serves files to the client nodes. It
- is also the cluster's console and gateway to the outside world. Large
- Beowulf machines might have more than one server node, and possibly
- other nodes dedicated to particular tasks, for example consoles or
- monitoring stations. In most cases client nodes in a Beowulf system
- are dumb, the dumber the better. Nodes are configured and controlled
- by the server node, and do only what they are told to do. In a disk-
- less client configuration, client nodes don't even know their IP
- address or name until the server tells them what it is. One of the
- main differences between Beowulf and a Cluster of Workstations (COW)
- is the fact that Beowulf behaves more like a single machine rather
- than many workstations. In most cases client nodes do not have
- keyboards or monitors, and are accessed only via remote login or
- possibly serial terminal. Beowulf nodes can be thought of as a CPU +
- memory package which can be plugged in to the cluster, just like a CPU
- or memory module can be plugged into a motherboard.
-
-
- Beowulf is not a special software package, new network topology or the
- latest kernel hack. Beowulf is a technology of clustering Linux
- computers to form a parallel, virtual supercomputer. Although there
- are many software packages such as kernel modifications, PVM and MPI
- libraries, and configuration tools which make the Beowulf architecture
- faster, easier to configure, and much more usable, one can build a
-  Beowulf class machine using a standard Linux distribution without any
- additional software. If you have two networked Linux computers which
- share at least the /home file system via NFS, and trust each other to
- execute remote shells (rsh), then it could be argued that you have a
- simple, two node Beowulf machine.
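-
-  As a rough sketch of what that minimal setup can already do, the
-  commands below split the square root sum from section ``Source
-  code'' between the local machine and a second host.  The hostname
-  node2 and the location of the binaries are assumptions made for this
-  illustration only; any rsh-trusted host that can reach the same
-  programs would do.
-
-  # assumes sigmasqrt and sum sit in the NFS shared home directory
-  mkfifo output
-  rsh node2 $HOME/sigmasqrt 1 500000000 > output < /dev/null &
-  $HOME/sigmasqrt 500000001 1000000000 > output < /dev/null &
-  cat output | $HOME/sum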
-
-
-
- 2.3. Classification
-
- Beowulf systems have been constructed from a variety of parts. For
- the sake of performance some non-commodity components (i.e. produced
- by a single manufacturer) have been employed. In order to account
- for the different types of systems and to make discussions about
- machines a bit easier, we propose the following simple classification
- scheme:
-
- CLASS I BEOWULF:
-
-  This class of machine is built entirely from commodity "off-the-shelf"
- parts. We shall use the "Computer Shopper" certification test to
- define commodity "off-the-shelf" parts. (Computer Shopper is a 1 inch
- thick monthly magazine/catalog of PC systems and components.) The test
- is as follows:
-
- A CLASS I Beowulf is a machine that can be assembled from parts found
- in at least 3 nationally/globally circulated advertising catalogs.
-
- The advantages of a CLASS I system are:
-
-  · hardware is available from multiple sources (low prices, easy
-  maintenance)
-
-  · no reliance on a single hardware vendor
-
-  · driver support from the Linux community
-
-  · usually based on standards (SCSI, Ethernet, etc.)
-
- The disadvantages of a CLASS I system are:
-
-  · best performance may require CLASS II hardware
-
- CLASS II BEOWULF
-
- A CLASS II Beowulf is simply any machine that does not pass the
- Computer Shopper certification test. This is not a bad thing.
- Indeed, it is merely a classification of the machine.
-
- The advantages of a CLASS II system are:
-
-  · Performance can be quite good!
-
- The disadvantages of a CLASS II system are:
-
-  · driver support may vary
-
-  · reliance on a single hardware vendor
-
-  · may be more expensive than CLASS I systems.
-
- One CLASS is not necessarily better than the other. It all depends on
- your needs and budget. This classification system is only intended to
- make discussions about Beowulf systems a bit more succinct. The
- "System Design" section may help determine what kind of system is best
- suited for your needs.
-
-
-
-
-
- 3. Architecture Overview
-
-
-
- 3.1. What does it look like ?
-
- I think that the best way of describing the Beowulf supercomputer
- architecture is to use an example which is very similar to the actual
- Beowulf, but familiar to most system administrators. The example that
- is closest to a Beowulf machine is a Unix computer laboratory with a
- server and a number of clients. To be more specific I'll use the DEC
- Alpha undergraduate computer laboratory at the Faculty of Sciences,
- USQ as the example. The server computer is called beldin and the
- client machines are called scilab01, scilab02, scilab03, up to
- scilab20. All clients have a local copy of the Digital Unix 4.0
- operating system installed, but get the user file space (/home) and
- /usr/local from the server via NFS (Network File System). Each client
- has an entry for the server and all the other clients in its
- /etc/hosts.equiv file, so all clients can execute a remote shell (rsh)
- to all others. The server machine is a NIS server for the whole
- laboratory, so account information is the same across all the
- machines. A person can sit at the scilab02 console, login, and have
- the same environment as if he logged onto the server or scilab15. The
- reason all the clients have the same look and feel is that the
- operating system is installed and configured in the same way on all
- machines, and both the user's /home and /usr/local areas are
- physically on the server and accessed by the clients via NFS. For
- more information on NIS and NFS please read the NIS and NFS HOWTOs.
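-
-  For illustration only, the client side of such a setup might contain
-  entries roughly like the ones below.  The exact NFS mount options are
-  an assumption, not a prescription, and the host names are the ones
-  from the example above.
-
-  # /etc/fstab on a scilab client: /home and /usr/local come from beldin
-  beldin:/home        /home        nfs   defaults   0 0
-  beldin:/usr/local   /usr/local   nfs   defaults   0 0
-
-  # /etc/hosts.equiv lists the server and every client, so any of these
-  # hosts can start an rsh session on any other without a password
-  beldin
-  scilab01
-  scilab02
-  ...
-  scilab20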
-
-
-
- 3.2. How to utilise the other nodes ?
-
-
- Now that we have some idea about the system architecture, let us take
- a look at how we can utilise the available CPU cycles of the machines
-  in the computer laboratory.  Any person can log on to any of the
-  machines and run a program in their home directory, but they can also
-  spawn the same job on a different machine simply by executing a remote
-  shell.  For example, assume that we want to calculate the sum of the
- square roots of all integers between 1 and 10 inclusive. We write a
- simple program called sigmasqrt (please see ``source code'') which
- does exactly that. To calculate the sum of the square roots of
- numbers from 1 to 10 we execute :
-
- [jacek@beldin sigmasqrt]$ time ./sigmasqrt 1 10
- 22.468278
-
- real 0m0.029s
- user 0m0.001s
- sys 0m0.024s
-
-
-  The time command allows us to check the wall-clock time (the elapsed
-  time) of running this job.  As we can see, this example took only a
-  small fraction of a second (0.029 sec) to execute, but what if I want
-  to add the square roots of all integers from 1 to 1 000 000 000 ?  Let
-  us try this, and again calculate the wall-clock time.
-
-
- [jacek@beldin sigmasqrt]$ time ./sigmasqrt 1 1000000000
- 21081851083600.559000
-
- real 16m45.937s
- user 16m43.527s
- sys 0m0.108s
-
-
-
-
- This time, the execution time of the program is considerably longer.
- The obvious question to ask is what can we do to speed up the
- execution time of the job? How can we change the way the job is
- running to minimize the wall-clock time of running this job? The
- obvious answer is to split the job into a number of sub-jobs and to
- run these sub-jobs in parallel on all computers. We could split one
- big addition task into 20 parts, calculating one range of square roots
- and adding them on each node. When all nodes finish the calculation
- and return their results, the 20 numbers could be added together to
- obtain the final solution. Before we run this job we will make a
- named pipe which will be used by all processes to write their results.
-
-
- [jacek@beldin sigmasqrt]$ mkfifo output
- [jacek@beldin sigmasqrt]$ ./prun.sh & time cat output | ./sum
- [1] 5085
- 21081851083600.941000
- [1]+ Done ./prun.sh
-
- real 0m58.539s
- user 0m0.061s
- sys 0m0.206s
-
-
-
- This time we get about 58.5 seconds. This is the time from starting
- the job until all the nodes have finished their computations and
- written their results into the pipe. The time does not include the
- final addition of the twenty numbers, but this time is a very small
- fraction of a second and can be ignored. We can see that there is a
- significant improvement in running this job in parallel. In fact the
- parallel job ran about 17 times faster, which is very reasonable for a
- 20 fold increase in the number of CPUs. The purpose of the above
- example is to illustrate the simplest method of parallelising
- concurrent code. In practice such simple examples are rare and
-  different techniques (PVM and MPI APIs) are used to achieve the
- parallelism.
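-
-  For the record, those two timings give the speed-up directly: the
-  sequential run took 16m45.9s (about 1006 seconds) and the parallel
-  run about 58.5 seconds, so the speed-up is roughly 1006 / 58.5 = 17.2,
-  or an efficiency of about 17.2 / 20 = 86% across the 20 CPUs used.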
-
-
-
- 3.3. How does Beowulf differ from a COW ?
-
- The computer laboratory described above is a perfect example of a
- Cluster of Workstations (COW). So what is so special about Beowulf,
- and how is it different from a COW? The truth is that there is not
-  much difference, but Beowulf does have a few unique characteristics.
-  First of all, in most cases client nodes in a Beowulf cluster do not
-  have keyboards, mice, video cards or monitors.  All access to the
- client nodes is done via remote connections from the server node,
- dedicated console node, or a serial console. Because there is no need
- for client nodes to access machines outside the cluster, nor for
- machines outside the cluster to access client nodes directly, it is a
- common practice for the client nodes to use private IP addresses like
- the 10.0.0.0/8 or 192.168.0.0/16 address ranges (RFC 1918
- http://www.alternic.net/rfcs/1900/rfc1918.txt.html). Usually the only
- machine that is also connected to the outside world using a second
-  network card is the server node.  The most common way of using the
-  system is to access the server's console directly, or to telnet or
-  remotely log in to the server node from a personal workstation.  Once on
- the server node, users can edit and compile their code, and also spawn
- jobs on all nodes in the cluster. In most cases COWs are used for
- parallel computations at night, and over weekends when people do not
- actually use the workstations for every day work, thus utilising idle
- CPU cycles. Beowulf on the other hand is a machine usually dedicated
- to parallel computing, and optimised for this purpose. Beowulf also
-  gives a better price/performance ratio as it is built from off-the-shelf
-  components and runs mainly free software.  Beowulf also has more
-  single system image features which help the users to see the Beowulf
-  cluster as a single computing workstation.
-
-
-
- 4. System Design
-
- Before you purchase any hardware, it may be a good idea to consider
-  the design of your system.  There are basically two hardware issues
-  involved with the design of a Beowulf system: the type of nodes or
-  computers you are going to use, and the way you connect the computer
-  nodes.  There is one software issue that may affect your hardware
-  decisions: the communication library or API.  A more detailed
- discussion of hardware and communication software is provided later in
- this document.
-
- While the number of choices is not large, there are some important
-  design decisions that must be made when constructing a Beowulf
-  system.  Because the science (or art) of "parallel computing" has
-  many different interpretations, an introduction is provided below.  If
-  you do not like to read background material, you may skip this
-  section, but it is advised that you read section ``Suitability''
-  before you make your final hardware decisions.
-
-
- 4.1. A brief background on parallel computing.
-
- This section provides background on parallel computing concepts. It
- is NOT an exhaustive or complete description of parallel computing
- science and technology. It is a brief description of the issues that
- may be important to a Beowulf designer and user.
- As you design and build your Beowulf, many of these issues described
- below will become important in your decision process. Due to its
- component nature, a Beowulf Supercomputer requires that we consider
- many factors carefully because they are now under our control. In
- general, it is not all that difficult to understand the issues
- involved with parallel computing. Indeed, once the issues are
- understood, your expectations will be more realistic and success will
- be more likely. Unlike the "sequential world" where processor speed is
- considered the single most important factor, processor speed in the
- "parallel world" is just one of several factors that will determine
- overall system performance and efficiency.
-
-
-
- 4.2. The methods of parallel computing
-
- Parallel computing can take many forms. From a user's perspective, it
- is important to consider the advantages and disadvantages of each
- methodology. The following section attempts to provide some
- perspective on the methods of parallel computing and indicate where
- the Beowulf machine falls on this continuum.
-
-
- 4.2.1. Why more than one CPU?
-
- Answering this question is important. Using 8 CPUs to run your word
- processor sounds a little like "over-kill" -- and it is. What about a
- web server, a database, a rendering program, or a project scheduler?
- Maybe extra CPUs would help. What about a complex simulation, a fluid
- dynamics code, or a data mining application. Extra CPUs definitely
- help in these situations. Indeed, multiple CPUs are being used to
- solve more and more problems.
-
-  The next question usually is: "Why do I need two or four CPUs?  I will
-  just wait for the 986 turbo-hyper chip."  There are several reasons:
-
- 1. Due to the use of multi-tasking Operating Systems, it is possible
- to do several things at once. This is a natural "parallelism" that
- is easily exploited by more than one low cost CPU.
-
- 2. Processor speeds have been doubling every 18 months, but what about
- RAM speeds or hard disk speeds? Unfortunately, these speeds are not
- increasing as fast as the CPU speeds. Keep in mind most
- applications require "out of cache memory access" and hard disk
- access. Doing things in parallel is one way to get around some of
- these limitations.
-
- 3. Predictions indicate that processor speeds will not continue to
- double every 18 months after the year 2005. There are some very
- serious obstacles to overcome in order to maintain this trend.
-
-  4. Depending on the application, parallel computing can speed things
-  up anywhere from 2 to 500 times (in some cases even
-  more).  Such performance is not available using a single
- processor. Even supercomputers that at one time used very fast
- custom processors are now built from multiple "commodity- off-the-
- shelf" CPUs.
-
-  If you need speed - either due to a compute bound problem, an I/O
-  bound problem, or both - parallel computing is worth considering.
-  Because parallel computing is implemented in a variety of ways,
-  solving your problem in parallel will require some very important
-  decisions to be made.  These decisions may dramatically affect the
-  portability, performance, and cost of your application.
-
-
-  Before we get technical, let's take a look at a real "parallel
- computing problem" using an example with which we are familiar -
- waiting in long lines at a store.
-
-
- 4.2.2. The Parallel Computing Store
-
- Consider a big store with 8 cash registers grouped together in the
- front of the store. Assume each cash register/cashier is a CPU and
- each customer is a computer program. The size of the computer program
- (amount of work) is the size of each customer's order. The following
- analogies can be used to illustrate parallel computing concepts.
-
-
- 4.2.2.1. Single-tasking Operating System
-
-  One cash register is open (in use) and must process each customer one
-  at a time.
-
- Computer Example: MS DOS
-
-
- 4.2.2.2. Multi-tasking Operating System:
-
-  One cash register is open, but now we process only a part of each order
- at a time, move to the next person and process some of their order.
- Everyone "seems" to be moving through the line together, but if no one
- else is in the line, you will get through the line faster.
-
- Computer Example: UNIX, NT using a single CPU
-
-
- 4.2.2.3. Multitasking Operating Systems with Multiple CPUs:
-
- Now we open several cash registers in the store. Each order can be
- processed by a separate cash register and the line can move much
- faster. This is called SMP - Symmetric Multi-processing. Although
- there are extra cash registers open, you will still never get through
- the line any faster than just you and a single cash register.
-
- Computer Example: UNIX and NT with multiple CPUs
-
-
-
-  4.2.2.4.  Threads on a Multitasking Operating System with extra CPUs
-
- If you "break-up" the items in your order, you might be able to move
- through the line faster by using several cash registers at one time.
- First, we must assume you have a large amount of goods, because the
- time you invest "breaking up your order" must be regained by using
- multiple cash registers. In theory, you should be able to move
- through the line "n" times faster than before*; where "n" is the
-  number of cash registers.  When the cashiers need to get sub-totals,
- they can exchange information quickly by looking and talking to all
- the other "local" cash registers. They can even snoop around the other
- cash registers to find information they need to work faster. There is
- a limit, however, as to how many cash registers the store can
- effectively locate in any one place.
-
-  Amdahl's Law will also limit the application speed-up to the slowest
-  sequential portion of the program.
-
- Computer Example: UNIX or NT with extra CPU on the same motherboard
- running multi-threaded programs.
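-
-  For reference, a common statement of Amdahl's Law (quoted here as
-  background, it is not derived in this document) is:
-
-     speed-up(n) = 1 / ( s + (1 - s)/n )
-
-  where s is the fraction of the work that must stay sequential and n
-  is the number of CPUs (cash registers).  For example, if only 10% of
-  the work is sequential (s = 0.1), then even with n = 8 the speed-up
-  is 1 / (0.1 + 0.9/8), or about 4.7, not 8.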
-
-
- 4.2.2.5. Sending Messages on Multitasking Operating Systems with
- extra CPUs:
-
- In order to improve performance, the store adds 8 cash registers at
- the back of the store. Because the new cash registers are far away
- from the front cash registers, the cashiers must call on the phone to
- send their sub-totals to the front of the store. This distance adds
- extra overhead (time) to communication between cashiers, but if
- communication is minimized, it is not a problem. If you have a
- really big order, one that requires all the cash registers, then as
-  before your speed can be improved by using all cash registers at the
-  same time, but the extra overhead must be considered.  In some cases, the
- store may have single cash registers (or islands of cash registers)
- located all over the store - each cash register (or island) must
- communicate by phone. Since all the cashiers working the cash
- registers can talk to each other by phone, it does not matter too much
- where they are.
-
- Computer Example: One or several copies of UNIX or NT with extra CPUs
- on the same or different motherboard communicating through messages.
-
- The above scenarios, although not exact, are a good representation of
- constraints placed on parallel systems. Unlike a single CPU (or cash
-  register), communication is an issue.
-
-
- 4.3. Architectures for parallel computing
-
- The common methods and architectures of parallel computing are
- presented below. While this description is by no means exhaustive, it
- is enough to understand the basic issues involved with Beowulf design.
-
-
- 4.3.1. Hardware Architectures
-
-
- There are basically two ways parallel computer hardware is put
- together:
-
-
- 1. Local memory machines that communicate by messages (Beowulf
- Clusters)
-
- 2. Shared memory machines that communicate through memory (SMP
- machines)
-
- A typical Beowulf is a collection of single CPU machines connected
- using fast Ethernet and is, therefore, a local memory machine. A 4
- way SMP box is a shared memory machine and can be used for parallel
- computing - parallel applications communicate using shared memory.
- Just as in the computer store analogy, local memory machines
- (individual cash registers) can be scaled up to large numbers of CPUs,
-  while the number of CPUs a shared memory machine can have (the number
-  of cash registers you can place in one spot) is limited by memory
-  contention.
-
- It is possible, however, to connect many shared memory machines to
- create a "hybrid" shared memory machine. These hybrid machines "look"
- like a single large SMP machine to the user and are often called NUMA
- (non uniform memory access) machines because the global memory seen by
- the programmer and shared by all the CPUs can have different
- latencies. At some level, however, a NUMA machine must "pass
- messages" between local shared memory pools.
-
- It is also possible to connect SMP machines as local memory compute
- nodes. Typical CLASS I motherboards have either 2 or 4 CPUs and are
- often used as a means to reduce the overall system cost. The Linux
- internal scheduler determines how these CPUs get shared. The user
- cannot (at this point) assign a specific task to a specific SMP
- processor. The user can however, start two independent processes or a
- threaded processes and expect to see a performance increase over a
- single CPU system.
-
-
- 4.3.2. Software API Architectures
-
-  There are basically two ways to "express" concurrency in a program:
-
- 1. Using Messages sent between processors
-
- 2. Using operating system Threads
-
- Other methods do exist, but these are the two most widely used. It is
-  important to remember that the expression of concurrency is not
-  necessarily controlled by the underlying hardware.  Both Messages and
-  Threads can be implemented on SMP, NUMA-SMP, and clusters - although,
-  as explained below, efficiency and portability are important issues.
-
-
- 4.3.2.1. Messages
-
-  Historically, message passing technology reflected the design of
- early local memory parallel computers. Messages require copying data
- while Threads use data in place. The latency and speed at which
- messages can be copied are the limiting factor with message passing
- models. A Message is quite simple: some data and a destination
- processor. Common message passing APIs are PVM or MPI. Message
-  passing can be efficiently implemented using Threads, and Messages work
-  well both on an SMP machine and between clusters of machines.  The
-  advantage of using messages on an SMP machine, as opposed to Threads,
-  is that if you decide to use clusters in the future it is easy to add
-  machines or scale your application.
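-
-  The sketch below shows what a minimal message passing program can
-  look like in C.  It is not part of the original examples; it assumes
-  an MPI implementation such as LAM/MPI is installed and would
-  typically be compiled with mpicc and started with mpirun.  Each
-  process computes a partial sum of square roots and sends it as a
-  message (some data plus a destination) to process 0.
-
-  /* msum.c - minimal MPI message passing sketch */
-
-  #include <stdio.h>
-  #include <math.h>
-  #include <mpi.h>
-
-  int main(int argc, char** argv) {
-
-    int rank, size, p;
-    long i, chunk = 1000000;          /* work per process (arbitrary) */
-    double partial = 0.0, total, recv;
-    MPI_Status status;
-
-    MPI_Init(&argc, &argv);
-    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
-    MPI_Comm_size(MPI_COMM_WORLD, &size);
-
-    /* each process works on its own range of integers */
-    for (i = rank * chunk + 1; i <= (rank + 1) * chunk; i++)
-      partial = partial + sqrt((double)i);
-
-    if (rank == 0) {
-      total = partial;
-      for (p = 1; p < size; p++) {    /* collect one message per process */
-        MPI_Recv(&recv, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &status);
-        total = total + recv;
-      }
-      printf("%lf\n", total);
-    } else {
-      MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
-    }
-
-    MPI_Finalize();
-    return 0;
-  }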
-
-
- 4.3.2.2. Threads
-
- Operating system Threads were developed because shared memory SMP
- (symmetrical multiprocessing) designs allowed very fast shared memory
- communication and synchronization between concurrent parts of a
- program. Threads work well on SMP systems because communication is
- through shared memory. For this reason the user must isolate local
- data from global data, otherwise programs will not work properly. In
- contrast to messages, a large amount of copying can be eliminated with
- threads because the data is shared between processes (threads). Linux
- supports POSIX threads. The problem with threads is that it is
- difficult to extend them beyond one SMP machine and because data is
- shared between CPUs, cache coherence issues can contribute to
- overhead. Extending threads beyond the SMP boundary efficiently
- requires NUMA technology which is expensive and not natively supported
- by Linux. Implementing threads on top of messages has been done
-  (http://syntron.com/ptools/ptools_pg.htm), but Threads are often
- inefficient when implemented using messages.
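-
-  A minimal POSIX threads sketch in C is shown below (again, it is not
-  from the original examples).  The shared array partial[] is the
-  global data; each thread writes only its own slot, which is the kind
-  of isolation of local from global data described above.  It assumes a
-  Linux system with libpthread and would be linked with -lpthread -lm.
-
-  /* tsum.c - minimal POSIX threads sketch */
-
-  #include <stdio.h>
-  #include <math.h>
-  #include <pthread.h>
-
-  #define NTHREADS 4
-  #define CHUNK    1000000L
-
-  double partial[NTHREADS];           /* global, one slot per thread */
-
-  void* worker(void* arg) {
-
-    long id = (long)arg, i;
-    double sum = 0.0;                 /* local to this thread */
-
-    for (i = id * CHUNK + 1; i <= (id + 1) * CHUNK; i++)
-      sum = sum + sqrt((double)i);
-
-    partial[id] = sum;                /* no other thread touches this slot */
-    return NULL;
-  }
-
-  int main(void) {
-
-    pthread_t tid[NTHREADS];
-    double total = 0.0;
-    long t;
-
-    for (t = 0; t < NTHREADS; t++)
-      pthread_create(&tid[t], NULL, worker, (void*)t);
-
-    for (t = 0; t < NTHREADS; t++) {
-      pthread_join(tid[t], NULL);     /* wait, then read the shared slot */
-      total = total + partial[t];
-    }
-
-    printf("%lf\n", total);
-    return 0;
-  }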
-
- The following can be stated about performance:
-
-
-
-
-
-
-
-
-                 SMP machine          cluster of machines    scalability
-                 performance          performance
-                 -----------          -------------------    -----------
-    messages     good                 best                   best
-
-    threads      best                 poor*                  poor*
-
-    * requires expensive NUMA technology.
-
-
-
-
- 4.3.3. Application Architecture
-
- In order to run an application in parallel on multiple CPUs, it must
-  be explicitly broken into concurrent parts.  A standard single CPU
-  application will run no faster on multiple processors than it does on
-  a single processor.  There are some tools and compilers that can
- break up programs, but parallelizing codes is not a "plug and play"
- operation. Depending on the application, parallelizing code can be
- easy, extremely difficult, or in some cases impossible due to
- algorithm dependencies.
-
- Before the software issues can be addressed the concept of Suitability
- needs to be introduced.
-
-
- 4.4. Suitability
-
- Most questions about parallel computing have the same answer:
-
- "It all depends upon the application."
-
- Before we jump into the issues, there is one very important
- distinction that needs to be made - the difference between CONCURRENT
- and PARALLEL. For the sake of this discussion we will define these
- two concepts as follows:
-
- CONCURRENT parts of a program are those that can be computed
- independently.
-
- PARALLEL parts of a program are those CONCURRENT parts that are
- executed on separate processing elements at the same time.
-
- The distinction is very important, because CONCURRENCY is a property
- of the program and efficient PARALLELISM is a property of the machine.
- Ideally, PARALLEL execution should result in faster performance. The
- limiting factor in parallel performance is the communication speed and
- latency between compute nodes. (Latency also exists with threaded SMP
- applications due to cache coherency.) Many of the common parallel
- benchmarks are highly parallel and communication and latency are not
-  the bottleneck.  This type of problem can be called "obviously
- parallel". Other applications are not so simple and executing
- CONCURRENT parts of the program in PARALLEL may actually cause the
- program to run slower, thus offsetting any performance gains in other
- CONCURRENT parts of the program. In simple terms, the cost of
- communication time must pay for the savings in computation time,
- otherwise the PARALLEL execution of the CONCURRENT part is
- inefficient.
-
-  The task of the programmer is to determine what CONCURRENT parts of
-  the program SHOULD be executed in PARALLEL and what parts SHOULD NOT.
-  The answer to this will determine the EFFICIENCY of the application.  The
- following graph summarizes the situation for the programmer:
-
-
- | *
- | *
- | *
- % of | *
- appli- | *
- cations | *
- | *
- | *
- | *
- | *
- | *
- | ****
- | ****
- | ********************
- +-----------------------------------
- communication time/processing time
-
-
-
- In a perfect parallel computer, the ratio of communication/processing
- would be equal and anything that is CONCURRENT could be implemented in
-  PARALLEL.  Unfortunately, real parallel computers, including shared
- memory machines, are subject to the effects described in this graph.
- When designing a Beowulf, the user may want to keep this graph in mind
- because parallel efficiency depends upon ratio of communication time
- and processing time for A SPECIFIC PARALLEL COMPUTER. Applications
- may be portable between parallel computers, but there is no guarantee
- they will be efficient on a different platform.
-
- IN GENERAL, THERE IS NO SUCH THING AS A PORTABLE AND EFFICIENT
- PARALLEL PROGRAM
-
-  There is yet another consequence to the above graph.  Since efficiency
-  depends upon the comm./process. ratio, just changing one component of
-  the ratio does not necessarily mean a specific application will perform
-  faster.  A change in processor speed, while keeping the communication
-  speed the same, may have non-intuitive effects on your program.  For
-  example, doubling or tripling the CPU speed, while keeping the
-  communication speed the same, may make some previously efficient
-  PARALLEL portions of your program more efficient if they are
-  executed SEQUENTIALLY.  That is, it may now be faster to run the
-  previously PARALLEL parts as SEQUENTIAL.  Furthermore, running
-  inefficient parts in parallel will actually keep your application from
-  reaching its maximum speed.  Thus, by adding a faster processor, you may
-  actually slow down your application (you are keeping the new CPU
-  from running at its maximum speed for that application).
-
- UPGRADING TO A FASTER CPU MAY ACTUALLY SLOW DOWN YOUR APPLICATION
-
- So, in conclusion, to know whether or not you can use a parallel
- hardware environment, you need to have some insight into the
- suitability of a particular machine to your application. You need to
- look at a lot of issues including CPU speeds, compiler, message
-  passing API, network, etc.  Please note that just profiling an
-  application does not give the whole story.  You may identify a
-  computationally heavy portion of your program, but you do not know the
-  communication cost for this portion.  It may be that for a given
-  system, the communication cost makes parallelizing this code
-  inefficient.
-
-
- A final note about a common misconception. It is often stated that "a
- program is PARALLELIZED", but in reality only the CONCURRENT parts of
- the program have been located. For all the reasons given above, the
- program is not PARALLELIZED. Efficient PARALLELIZATION is a property
- of the machine.
- 4.5. Writing and porting parallel software
-
- Once you decide that you need parallel computing and would like to
- design and build a Beowulf, a few moments considering your application
- with respect to the previous discussion may be a good idea.
-
- In general there are two things you can do:
-
- 1. Go ahead and construct a CLASS I Beowulf and then "fit" your
- application to it. Or run existing parallel applications that you
- know work on your Beowulf (but beware of the portability and
-  efficiency issues mentioned above).
-
- 2. Look at the applications you need to run on your Beowulf and make
- some estimations as to the type of hardware and software you need.
-
- In either case, at some point you will need to look at the efficiency
- issues. In general, there are three things you need to do:
-
-  1. Determine concurrent parts of your program
-
-  2. Estimate parallel efficiency
-
-  3. Describe the concurrent parts of your program
-
- Let's look at these one at a time.
-
-
- 4.5.1. Determine concurrent parts of your program
-
- This step is often considered "parallelizing your program".
- Parallelization decisions will be made in step 2. In this step, you
- need to determine data dependencies.
-
-  From a practical standpoint, applications may exhibit two types of
-  concurrency: compute (number crunching) and I/O (database).  Although
-  in many cases compute and I/O concurrency are orthogonal, there are
-  applications that require both.  There are tools available that can
- perform concurrency analysis on existing applications. Most of these
- tools are designed for FORTRAN. There are two reasons FORTRAN is
- used: historically most number crunching applications were written in
- FORTRAN and it is easier to analyze. If no tools are available, then
-  this step can be somewhat difficult for existing applications.
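-
-  The toy program below (a sketch added for illustration, not part of
-  the original examples) shows the kind of data dependency such an
-  analysis looks for: the first loop is CONCURRENT, the second is not.
-
-  /* deps.c - data dependency illustration */
-
-  #include <stdio.h>
-
-  #define N 8
-
-  int main(void) {
-
-    double a[N], b[N], c[N];
-    int i;
-
-    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0; }
-
-    /* no data dependency: each a[i] uses only b[i] and c[i], so the
-       iterations are CONCURRENT and could run on different CPUs */
-    for (i = 0; i < N; i++)
-      a[i] = b[i] * c[i];
-
-    /* data dependency: a[i] needs a[i-1] from the previous iteration,
-       so these iterations cannot simply be run side by side */
-    for (i = 1; i < N; i++)
-      a[i] = a[i-1] + b[i];
-
-    printf("%lf\n", a[N-1]);
-    return 0;
-  }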
-
-
- 4.5.2. Estimate parallel efficiency
-
- Without the help of tools, this step may require trial and error tests
- or just a plain old educated guess. If you have a specific
- application in mind, try to determine if it is CPU limited (compute
- bound) or hard disk limited (I/O bound). The requirements of your
- Beowulf may be quite different depending upon your needs. For
- example, a compute bound problem may need a few very fast CPUs and
- high speed low latency network, while an I/O bound problem may work
-  better with a larger number of slower CPUs and fast Ethernet.
-
-  This recommendation often comes as a surprise to most people because
-  the standard assumption is that faster processors are always better.
-  While this is true if you have an unlimited budget, real systems may
-  have cost constraints within which performance should be
-  maximized.  For I/O bound
- problems, there is a little known rule (called the Eadline-Dedkov Law)
- that is quite helpful:
-
- For two given parallel computers with the same cumulative CPU
- performance index, the one which has slower processors (and a probably
- correspondingly slower interprocessor communication network) will have
- better performance for I/O-dominant applications.
-
- While the proof of this rule is beyond the scope of this document, you
-  may find it interesting to download the paper Performance Considerations
- for I/O-Dominant Applications on Parallel Computers (Postscript format
- 109K ) (ftp://www.plogic.com/pub/papers/exs-pap6.ps)
-
- Once you have determined what type of concurrency you have in your
- application, you will need to estimate how efficient it will be in
- parallel. See Section ``Software'' for a description of Software
- tools.
-
- In the absence of tools, you may try to guess your way through this
-  step.  If a compute bound loop is measured in minutes and the data can be
- transferred in seconds, then it might be a good candidate for
- parallelization. But remember, if you take a 16 minute loop and break
- it into 32 parts, and your data transfers require several seconds per
- part, then things are going to get tight. You will reach a point of
- diminishing returns.
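-
-  To put rough numbers on that last example (the transfer times below
-  are assumptions chosen for illustration): a 16 minute loop is about
-  960 seconds, so 32 equal parts mean about 30 seconds of computation
-  each.  With 5 seconds of data transfer per part the speed-up drops
-  from 32 to roughly 960 / 35 = 27; with 20 seconds per part it drops
-  to roughly 960 / 50 = 19, and splitting the loop further only makes
-  the communication share larger.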
-
-
- 4.5.3. Describing the concurrent parts of your program
-
- There are several ways to describe concurrent parts of your program:
-
- 1. Explicit parallel execution
-
- 2. Implicit parallel execution
-
- The major difference between the two is that explicit parallelism is
- determined by the user where implicit parallelism is determined by the
- compiler.
-
-
- 4.5.3.1. Explicit Methods
-
-  These are basically methods where the user must modify source code
-  specifically for a parallel computer.  The user must either add
-  messages using PVM or MPI or add threads using POSIX threads.  (Keep in
-  mind, however, that threads cannot move between SMP motherboards.)
-
- Explicit methods tend to be the most difficult to implement and debug.
- Users typically embed explicit function calls in standard FORTRAN 77
- or C/C++ source code. The MPI library has added some functions to
- make some standard parallel methods easier to implement (i.e.
- scatter/gather functions). In addition, it is also possible to use
- standard libraries that have been written for parallel computers.
-  Keep in mind, however, the portability vs. efficiency trade-off.
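-
-  As a sketch of the scatter/gather style functions just mentioned (this
-  example is an illustration, not part of the original text), the root
-  process below scatters an array of numbers, every process sums the
-  square roots of its own slice, and MPI_Reduce adds the partial sums
-  back together on the root.
-
-  /* scatter.c - MPI scatter/reduce sketch */
-
-  #include <stdio.h>
-  #include <stdlib.h>
-  #include <math.h>
-  #include <mpi.h>
-
-  #define PER_PROC 4                  /* elements handed to each process */
-
-  int main(int argc, char** argv) {
-
-    int rank, size, i;
-    double *data = NULL;
-    double slice[PER_PROC], partial = 0.0, total = 0.0;
-
-    MPI_Init(&argc, &argv);
-    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
-    MPI_Comm_size(MPI_COMM_WORLD, &size);
-
-    if (rank == 0) {                  /* only the root holds the full array */
-      data = (double*) malloc(size * PER_PROC * sizeof(double));
-      for (i = 0; i < size * PER_PROC; i++)
-        data[i] = (double)(i + 1);
-    }
-
-    /* hand each process its own PER_PROC element slice */
-    MPI_Scatter(data, PER_PROC, MPI_DOUBLE,
-                slice, PER_PROC, MPI_DOUBLE, 0, MPI_COMM_WORLD);
-
-    for (i = 0; i < PER_PROC; i++)
-      partial = partial + sqrt(slice[i]);
-
-    /* add all the partial sums together on the root */
-    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
-
-    if (rank == 0) {
-      printf("%lf\n", total);
-      free(data);
-    }
-
-    MPI_Finalize();
-    return 0;
-  }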
-
-
- For historical reasons, most number crunching codes are written in
-  FORTRAN.  For this reason, FORTRAN has the largest amount of support
-  (tools, libraries, etc.) for parallel computing.  Many programmers now
-  use C or re-write existing FORTRAN applications in C with the notion
-  that C will allow faster execution.  While this may be true, as C is the
- closest thing to a universal machine code, it has some major
- drawbacks. The use of pointers in C makes determining data
- dependencies extremely difficult. Automatic analysis of pointers is
- extremely difficult. If you have an existing FORTRAN program and think
- that you might want to parallelize it in the future - DO NOT CONVERT
- IT TO C!
-
-
-
-
-
- 4.5.3.2. Implicit Methods
-
- Implicit methods are those where the user gives up some (or all) of
- the parallelization decisions to the compiler. Examples are FORTRAN
- 90, High Performance FORTRAN (HPF), Bulk Synchronous Parallel (BSP),
- and a whole collection of other methods that are under development.
-
- Implicit methods require the user to provide some information about
- the concurrent nature of their application, but the compiler will then
- make many decisions about how to execute this concurrency in parallel.
- These methods provide some level of portability and efficiency, but
- there is still no "best way" to describe a concurrent problem for a
- parallel computer.
-
-
- 5. Beowulf Resources
-
-
-
- 5.1. Starting Points
-
-
-
-  · Beowulf mailing list.  To subscribe send mail to beowulf-
-  request@cesdis.gsfc.nasa.gov with the word subscribe in the message
-  body.
-
-  · Beowulf Homepage http://www.beowulf.org
-
-  · Extreme Linux http://www.extremelinux.org
-
-  · Extreme Linux Software from Red Hat http://www.redhat.com/extreme
-
-
-
- 5.2. Documentation
-
-
-
-  · The latest version of the Beowulf HOWTO
-  http://www.sci.usq.edu.au/staff/jacek/beowulf.
-
-  · Building a Beowulf System
-  http://www.cacr.caltech.edu/beowulf/tutorial/building.html
-
-  · Jacek's Beowulf Links
-  http://www.sci.usq.edu.au/staff/jacek/beowulf.
-
-  · Beowulf Installation and Administration HOWTO (DRAFT)
-  http://www.sci.usq.edu.au/staff/jacek/beowulf.
-
-  · Linux Parallel Processing HOWTO
-  http://yara.ecn.purdue.edu/~pplinux/PPHOWTO/pphowto.html
-
-
-
- 5.3. Papers
-
-
-
-  · Chance Reschke, Thomas Sterling, Daniel Ridge, Daniel Savarese,
-  Donald Becker, and Phillip Merkey.  A Design Study of Alternative
-  Network Topologies for the Beowulf Parallel Workstation.
-  Proceedings, Fifth IEEE International Symposium on High Performance
-  Distributed Computing, 1996.
-  http://www.beowulf.org/papers/HPDC96/hpdc96.html
-
-  · Daniel Ridge, Donald Becker, Phillip Merkey, and Thomas Sterling.
-  Harnessing the Power of Parallelism in a Pile-of-PCs.  Proceedings,
-  IEEE Aerospace, 1997.
-  http://www.beowulf.org/papers/AA97/aa97.ps
-
-
-  · Thomas Sterling, Donald J. Becker, Daniel Savarese, Michael R.
-  Berry, and Chance Reschke.  Achieving a Balanced Low-Cost Architecture
-  for Mass Storage Management through Multiple Fast Ethernet Channels
-  on the Beowulf Parallel Workstation.  Proceedings, International
-  Parallel Processing Symposium, 1996.
-  http://www.beowulf.org/papers/IPPS96/ipps96.html
-
-
-  · Donald J. Becker, Thomas Sterling, Daniel Savarese, Bruce Fryxell,
-  Kevin Olson.  Communication Overhead for Space Science Applications
-  on the Beowulf Parallel Workstation.  Proceedings, High Performance
-  and Distributed Computing, 1995.
-  http://www.beowulf.org/papers/HPDC95/hpdc95.html
-
-
-
-  · Donald J. Becker, Thomas Sterling, Daniel Savarese, John E.
-  Dorband, Udaya A. Ranawak, Charles V. Packer.  BEOWULF: A PARALLEL
-  WORKSTATION FOR SCIENTIFIC COMPUTATION.  Proceedings, International
-  Conference on Parallel Processing, 1995.
-  http://www.beowulf.org/papers/ICPP95/icpp95.html
-
-  · Papers at the Beowulf site
-  http://www.beowulf.org/papers/papers.html
-
-
-
-
- 5.4. Software
-
-
-  · PVM - Parallel Virtual Machine
-  http://www.epm.ornl.gov/pvm/pvm_home.html
-
-  · LAM/MPI (Local Area Multicomputer / Message Passing Interface)
-  http://www.mpi.nd.edu/lam
-
-  · BERT77 - FORTRAN conversion tool http://www.plogic.com/bert.html
-
-  · Beowulf software from the Beowulf Project Page
-  http://beowulf.gsfc.nasa.gov/software/software.html
-
-  · Jacek's Beowulf-utils ftp://ftp.sci.usq.edu.au/pub/jacek/beowulf-
-  utils
-
-  · bWatch - cluster monitoring tool
-  http://www.sci.usq.edu.au/staff/jacek/bWatch
-
-
-
-
- 5.5. Beowulf Machines
-
-
-  · Avalon consists of 140 Alpha processors, 36 GB of RAM, and is
-  probably the fastest Beowulf machine, cruising at 47.7 Gflops and
-  ranking 114th on the Top 500 list.  http://swift.lanl.gov/avalon/
-
-  · Megalon - A Massively PArallel CompuTer Resource (MPACTR) consists
-  of 14 quad CPU Pentium Pro 200 nodes and 14 GB of RAM.
-  http://megalon.ca.sandia.gov/description.html
-
-  · theHIVE - Highly-parallel Integrated Virtual Environment is another
-  fast Beowulf Supercomputer.  theHIVE is a 64 node, 128 CPU machine
-  with a total of 4 GB of RAM.  http://newton.gsfc.nasa.gov/thehive/
-
-  · Topcat is a much smaller machine and consists of 16 CPUs and 1.2 GB
-  of RAM.  http://www.sci.usq.edu.au/staff/jacek/topcat
-
-  · MAGI cluster - this is a very interesting site with many good
-  links.  http://noel.feld.cvut.cz/magi/
-
-
-
-
- 5.6. Other Interesting Sites
-
-
-
-  · SMP Linux http://www.linux.org.uk/SMP/title.html
-
-  · Paralogic - Buy a Beowulf http://www.plogic.com
-
-
- 5.7. History
-
-
-  · Legends - Beowulf http://legends.dm.net/beowulf/index.html
-
-  · The Adventures of Beowulf
-  http://www.lnstar.com/literature/beowulf/beowulf.html
-
-
- 6. Source code
-
-
- 6.1. sum.c
-
-
- /* Jacek Radajewski jacek@usq.edu.au */
- /* 21/08/1998 */
-
-  #include <stdio.h>
-  #include <stdlib.h>   /* atof() */
-  #include <math.h>
-
- int main (void) {
-
- double result = 0.0;
- double number = 0.0;
- char string[80];
-
-
-  while (scanf("%79s", string) == 1) {   /* bounded read, stop at EOF */
-
- number = atof(string);
- result = result + number;
- }
-
- printf("%lf\n", result);
-
- return 0;
-
- }
-
- 6.2. sigmasqrt.c
-
-
- /* Jacek Radajewski jacek@usq.edu.au */
- /* 21/08/1998 */
-
-  #include <stdio.h>
-  #include <stdlib.h>   /* atol(), exit() */
-  #include <math.h>
-
- int main (int argc, char** argv) {
-
- long number1, number2, counter;
- double result;
-
- if (argc < 3) {
- printf ("usage : %s number1 number2\n",argv[0]);
- exit(1);
- } else {
- number1 = atol (argv[1]);
- number2 = atol (argv[2]);
- result = 0.0;
- }
-
- for (counter = number1; counter <= number2; counter++) {
- result = result + sqrt((double)counter);
- }
-
- printf("%lf\n", result);
-
- return 0;
-
- }
-
-
-
-
-
- 6.3. prun.sh
-
-
-
- #!/bin/bash
- # Jacek Radajewski jacek@usq.edu.au
- # 21/08/1998
-
- export SIGMASQRT=/home/staff/jacek/beowulf/HOWTO/example1/sigmasqrt
-
- # $OUTPUT must be a named pipe
- # mkfifo output
-
- export OUTPUT=/home/staff/jacek/beowulf/HOWTO/example1/output
-
- rsh scilab01 $SIGMASQRT 1 50000000 > $OUTPUT < /dev/null&
- rsh scilab02 $SIGMASQRT 50000001 100000000 > $OUTPUT < /dev/null&
- rsh scilab03 $SIGMASQRT 100000001 150000000 > $OUTPUT < /dev/null&
- rsh scilab04 $SIGMASQRT 150000001 200000000 > $OUTPUT < /dev/null&
- rsh scilab05 $SIGMASQRT 200000001 250000000 > $OUTPUT < /dev/null&
- rsh scilab06 $SIGMASQRT 250000001 300000000 > $OUTPUT < /dev/null&
- rsh scilab07 $SIGMASQRT 300000001 350000000 > $OUTPUT < /dev/null&
- rsh scilab08 $SIGMASQRT 350000001 400000000 > $OUTPUT < /dev/null&
- rsh scilab09 $SIGMASQRT 400000001 450000000 > $OUTPUT < /dev/null&
- rsh scilab10 $SIGMASQRT 450000001 500000000 > $OUTPUT < /dev/null&
- rsh scilab11 $SIGMASQRT 500000001 550000000 > $OUTPUT < /dev/null&
- rsh scilab12 $SIGMASQRT 550000001 600000000 > $OUTPUT < /dev/null&
- rsh scilab13 $SIGMASQRT 600000001 650000000 > $OUTPUT < /dev/null&
- rsh scilab14 $SIGMASQRT 650000001 700000000 > $OUTPUT < /dev/null&
- rsh scilab15 $SIGMASQRT 700000001 750000000 > $OUTPUT < /dev/null&
- rsh scilab16 $SIGMASQRT 750000001 800000000 > $OUTPUT < /dev/null&
- rsh scilab17 $SIGMASQRT 800000001 850000000 > $OUTPUT < /dev/null&
- rsh scilab18 $SIGMASQRT 850000001 900000000 > $OUTPUT < /dev/null&
- rsh scilab19 $SIGMASQRT 900000001 950000000 > $OUTPUT < /dev/null&
- rsh scilab20 $SIGMASQRT 950000001 1000000000 > $OUTPUT < /dev/null&
-
-
-