Linux Parallel Processing HOWTO
Prof. Hank Dietz
Purdue University School of Electrical and Computer Engineering
Still under construction... 15 March 1996
If you are only interested in using an SMP to execute multiple
independent serial programs simultaneously, there is nothing you
have to do other than get appropriate SMP hardware and install SMP Linux on it.
That's not what this document is about.
This document provides an overview of how to use multiple
processors to speed-up execution of an individual program.
Use of SMP Linux
clusters of networked Linux systems,
and attached (parallel)
processors hosted by Linux, are all discussed. If you don't already
know what Linux is, you may want to take a quick peek at the Linux
Documentation Project home page.
Parallel Processing refers to the concept of speeding-up the
execution of a program by dividing the program into multiple fragments
that can execute simultaneously, each on its own processor. A program
being executed across n processors might execute n
times faster than it would using a single processor.
Traditionally, multiple processors were provided within a specially
designed "parallel computer"; along these lines, Linux now supports
SMP Pentium systems in which multiple processors
share a single memory and bus interface within a single computer. It
is also possible for a group of computers (for example, a group of PCs
each running Linux) to be interconnected by a network to form a
parallel-processing cluster. The third alternative
for parallel computing using Linux is to use a Linux system as a
"host" for a specialized attached parallel processing
Do I Want To Use Parallel Processing?
Probably not -- for most applications. Although use of multiple
processors can speed-up many operations, most applications cannot yet
benefit from parallel processing. Basically, parallel processing is
appropriate only if:
Your application has enough parallelism to make good use of multiple
processors. In part, this is a matter of identifying portions of the
program that can execute independently and simultaneously on separate
processors, but you will also find that some things that could execute
in parallel might actually slow execution if executed in parallel
using a particular system. For example, a program that takes four
seconds to execute within a single machine might be able to execute in
only one second of processor time on each of four machines, but no
speedup would be achieved if it took three seconds or more for these
machines to coordinate their actions.
Either the particular application program you are interested in has
already been parallelized (rewritten to take advantage of
parallel processing) or you are willing to do at least some new coding
to take advantage of parallel processing.
You are interested in researching, or at least becoming familiar with,
issues involving parallel processing. Parallel processing using Linux
systems isn't necessarily difficult, but it is not familiar to most
computer users, and there isn't any book called "Parallel Processing
for Dummies"... at least not yet. This HOWTO is a good starting
point, not all you need to know.
The good news is that if all the above are true, you'll find that
parallel processing using Linux can yield supercomputer performance
for some programs that perform complex computations or operate on
large data sets. What's more, it can do that using cheap hardware...
which you might already own. It is worthwhile noting that a parallel
Linux system can still be used for other things when it is not busy
executing a parallel job.
If parallel processing is not what you want, but you would
like to achieve at least a modest improvement in performance, there is
still hope. The Linux in High-Performance
Computing page used to be a good reference; now, you
might try this.
Although parallel processing has been used for many years in many
systems, it is still somewhat unfamiliar to most computer users.
Thus, before discussing the various alternatives, it is important to
become familiar with a few commonly used terms.
SIMD (Single Instruction stream, Multiple Data stream)
SIMD refers to a parallel execution model in which all processors
execute the same operation at the same time, but each processor is
allowed to operate upon its own data. This model naturally fits the
concept of performing the same operation on every element of an array,
and is thus often associated with vector or array manipulation.
Because all operations are inherently synchronized, interactions among
SIMD processors tend to be easily and efficiently implemented.
MIMD (Multiple Instruction stream, Multiple Data stream)
MIMD refers to a parallel execution model in which each processor is
essentially acting independently. This model most naturally fits the
concept of decomposing a program for parallel execution on a
functional basis; for example, one processor might update a database
file while another processor generates a graphic display of the new
entry. This is a more flexible model than SIMD execution, but it is
achieved at the risk of debugging nightmares called race
conditions, in which a program may intermittently fail due to
timing variations reordering the operations of one processor relative
to those of another.
The bandwidth of a communication system is the maximum amount of data
that can be transmitted in a unit of time... once data transmission
has begun. Bandwidth for serial connections is often measured in
baud or bits/second, which generally
correspond to 1/10 to 1/8 that many Bytes each second. For example, a
1,200 baud modem transfers about 120 Bytes/s, whereas a 155 Mbit/s ATM
network connection is nearly 130,000 times faster, transferring about
about 17 million Bytes/s. High bandwidth allows large blocks of data
to be transferred efficiently between processors.
The latency of a communication system is the minimum time taken to
transmit one object, including any send and receive software
overhead. Latency is very important in parallel processing because it
determines the minimum useful grain size, the minimum
run time for a segment of code to yield speed-up through parallel
execution. Basically, if a segment of code runs for less time than it
takes to transmit its result value (i.e., latency), executing that
code segment serially on the processor that needed the result value
would be faster because it would avoid the communication.
Message passing is a model for interactions between processors within
a parallel system. In general, a message is constructed by software
on one processor and is sent through an interconnection network to
another processor, which then must accept and act upon the message
contents. Although the overhead in handling each message (latency)
may be high, there are typically few restrictions on how much
information each message may contain. Thus, message passing can yield
high bandwidth making it a very effective way to transmit a large
block of data from one processor to another. However, to minimize the
need for expensive message passing operations, data structures within
a parallel program must be spread across the processors so that most
data referenced by each processor is in its local memory... this task
is known as data layout.
Shared memory is a model for interactions between processors within a
parallel system. Systems like the multi-processor Pentium machines
running Linux physically share a single memory among
their processors, so that a value written to shared memory by one
processor can be directly accessed by any processor. Alternatively,
logically shared memory can be implemented for
systems in which each processor has it own memory by converting each
non-local memory reference into an appropriate inter-processor
communication. Either implementation of shared memory is generally
considered easier to use than message passing. Physically shared
memory can have both high bandwidth and low latency, but only when
multiple processors do not try to access the bus simultaneously; thus,
data layout still can seriously impact performance, and cache effects,
etc., can make it difficult to determine what the best layout is.
Aggregate Function or Collective Communication
In both the message passing and shared memory models, a communication
is initiated by a single processor; in contrast, aggregate function or
collective communication is an inherently parallel communication model
in which an entire group of processors act together. The simplest such
action is a barrier synchronization, in which each
individual processor waits until every processor in the group has
arrived at the barrier. By having each processor output a datum as a
side-effect of reaching a barrier, it is possible to have the
communication hardware return a value to each processor which is an
arbitrary function of the values collected from all processors. For
example, the return value might be the answer to the question "did any
processor find a solution?" or it might be the sum of one value from
each processor. Latency can be very low, but bandwidth per processor
also tends to be low. Traditionally, this model is used primarily to
control parallel execution rather than to distribute data values.
SMP (Symmetric Multi-Processor)
SMP refers to the operating system concept of a group of processors
working together as peers, so that any piece of work could be done
equally well by any processor. Typically, SMP implies the combination
of MIMD and shared memory.
Attached processors are essentially special-purpose computers that are
connected to a host system to accelerate specific
types of computation. For example, many video and audio cards for PCs
contain attached processors designed, respectively, to accelerate
common graphics operations and audio DSP (Digital
Signal Processing). There is also a wide range of attached
array processors, so called because they are designed
to accelerate arithmetic operations on arrays. In fact, many
commercial supercomputers are really attached processors with
RAID, Redundant Array of Inexpensive Disks, is a simple technology for
increasing both the bandwidth and reliability of disk I/O. Although
there are many different variations, all have two key concepts in
common. First, each data block is striped across a
group of n+k disk drives such that each drive only has to
read or write 1/n of the data... yielding n
times the bandwidth of one drive. Second, redundant data is written
so that data can be recovered if a disk drive fails; this is important
because otherwise if any one of the n+k drives were to fail,
the entire file system could be lost.
Organization of this Document
The remainder of this document is subdivided into four parts. The
first three parts correspond to the three different types of hardware
configurations supporting parallel processing using Linux:
Linux systems directly support MIMD execution
using shared memory, although message passing is also easily
implemented. Although Linux supports SMP configurations up to 32
processors, most SMP Pentium systems have either two or four identical
processors. SMP use of Linux is quite young, so a number of "generic"
SMP reference materials are included in this discussion.
A Cluster of networked
machines, each running Linux,
can be used as a parallel processing system that directly supports
MIMD execution and message passing, perhaps also providing logically
shared memory. Simulated SIMD execution and aggregate function
communication also can be supported, depending on the networking
method used. The number of processors in a cluster can range from two
to thousands, primarily limited by the physical wiring constraints of
the network. In some cases, various types of machines can be mixed
within a cluster; for example, a network combining DEC Alpha and
Pentium Linux systems would be a heterogeneous
Either as an add-in card or as an external box, attached
processors can provide a Linux system with formidable
processing power for specific types of applications. For example,
inexpensive ISA cards are available that provide multiple DSP
processors offering hundreds of MFLOPS for compute-bound problems.
However, these add-in boards are just processors; they
generally do not run an OS, have disk or console I/O capability, etc.
To make such systems useful, the Linux "host" must provide these
The fourth part of this document covers the miscellaneous
issues that are not directly linked to one of the above
configurations, but that apply generically to parallel processing.
This includes references to parallel disk I/O (e.g., RAID), non-Linux
parallel processing, and other reference information.
This page was last modified