Linux Parallel Processing Using Clusters
Prof. Hank Dietz
Purdue University School of Electrical and Computer Engineering
hankd@ecn.purdue.edu
Still under construction... 17 February 1997
This document attempts to give an overview of cluster parallel
processing using Linux.  Clusters are currently both the most popular
and the most varied approach, ranging from a conventional network of
workstations (NOW) to essentially custom parallel
machines that just happen to use Linux PCs as processor nodes.  There
is also quite a lot of software support for parallel processing using
clusters of Linux machines.
Why A Cluster?
Cluster parallel processing offers several important advantages:
- 
Each of the machines in a cluster can be a complete system, usable for
a wide range of other computing applications.  This leads many people
to suggest that cluster parallel computing can simply claim all the
"wasted cycles" of workstations sitting idle on people's desks.  It is
not really so easy to salvage those cycles, and it will probably slow
your co-worker's screen saver, but it can be done.
- 
The current explosion in networked systems means that most of the
hardware for building a cluster is being sold in high volume, with
correspondingly low "commodity" prices as the result.  Further savings
come from the fact that only one video card, monitor, and keyboard are
needed for each cluster (although you will have to swap these to each
machine to perform the initial installation of Linux, once running, a
typical Linux PC does not need a "console").  In comparison, SMP and
attached processors are much smaller markets, tending toward somewhat
higher price per unit performance.
- 
Cluster computing can scale to very large systems.  While it
is currently hard to find a Linux-compatible SMP with many more than
four processors, most commonly available network hardware easily
builds a cluster with up to 16 machines.  With a little work, hundreds
or even thousands of machines can be networked.  In fact, the entire
Internet can be viewed as one truly huge cluster.
- 
The fact that replacing a "bad machine" within a cluster is trivial
compared to fixing a partly faulty SMP yields much higher availability
for carefully designed cluster configurations.  This becomes important
not only for particular applications that cannot tolerate significant
service interruptions, but also for general use of systems containing
enough processors so that single-machine failures are fairly common.
(For example, even though the average time to failure of a PC might
be two years, in a cluster with 32 machines, the probability that at
least one will fail within 6 months is quite high.)
OK, so clusters are free or cheap and can be very large and highly
available...  why doesn't everyone use a cluster?  Well, there are
problems too:
- 
With a few exceptions, network hardware is not designed for parallel
processing.  Typically latency is very high and bandwidth relatively
low compared to SMP and attached processors.  For example, SMP latency
is generally no more than a few microseconds, but is commonly hundreds
or thousands of microseconds for a cluster.  SMP communication
bandwidth is often more than 100 MBytes/second, whereas even the
fastest ATM network connections are more than five times slower.
- 
There is very little software support for treating a cluster as a
single system.
Thus, the basic story is that clusters offer great potential, but that
potential may be very difficult to achieve for most applications.  The
good news is that there is quite a lot of software support that will
help you achieve good performance for programs that are well suited to
this environment, and there are also networks designed specifically to
widen the range of programs that can achieve good performance.
Network Hardware
Computer networking is an exploding field...  but you already knew
that if you've read, listened to, or watched the news at any time in
the past two years.  An ever-increasing range of networking
technologies and products are being developed, and most are available
in forms that could be used to make a parallel-processing cluster using
a group of machines running Linux.
Unfortunately, no one network technology solves all problems best; in
fact, the range of approach, cost, and performance is at first hard to
believe.  For example, using standard commercially-available hardware,
the cost per machine networked ranges from less than $5 to over
$4,000.  The delivered bandwidth and latency each also vary over four
orders of magnitude.
A Tabular Overview
To get things started, the following table gives a brief, and
incomprehensibly dense, summary of the properties of many of the
alternative networks.
Everything At A Glance (Decreasing Bandwidth/Latency/Cost per Node)
| Network Type | Prog. Model
 | Available? | Linux? | $/Node | $/Port | Port | $/Hub | Max. PEs
 | Bandwidth (Mbits/s)
 | Latency (µs)
 | n-PE O( )
 | 
| *HiPPI |  | Product |  | $3,500 | $1,500 | EISA, PCI
 | $30,000 | 16/hub | 1,600.0 |  | log n | 
| CAPERS | Library | Public | Yes | $2 | $2 | SPP |  | 2 | 1.2 | 3 |  | 
| *Serial HiPPI |  | Product |  | $4,500 | $2,500 | PCI | $30,000 | 16/hub | 1,200.0 |  | log n | 
| *SCI |  |  |  |  |  |  |  |  | 1,000.0 |  |  | 
| *FC |  |  |  |  |  |  |  |  | 1,062.0 |  |  | 
| *Myrinet | Library | Product | Yes | $1,800 | $1,300 | PCI | $2,000 | 8/hub | 1,280.0 | > 9 | log n | 
| *ParaStation | HAL | Product | Yes | $2,000 | $2,000 | PCI |  | > 100 | 125.0 | 2 | sqrt n | 
| *SHRIMP |  | Research | Yes |  |  | EISA | ? | ? | 180.0 | 5 | sqrt n | 
| *ParaPC |  | Research |  |  |  | EISA |  | 2? | 40.0 | 5 |  | 
| TTL_PAPERS | Library | Public | Yes | $100 | $5 | SPP | $800 | 8/hub | 1.6 | 3 | log n | 
| *ParaStation | Socket Library
 | Product | Yes | $2,000 | $2,000 | PCI |  | > 100 | 93.0 | 13 | sqrt n | 
| PLIP | Socket | Copyleft | Yes | $2 | $2 | SPP |  | 2 | 1.2 | 1,000 |  | 
| ATM | AAL5 | Product | Yes | $3,000 | $1,000 | PCI | $35,000 | 16/hub | 155.0 | 100 | log n | 
| Fast Ethernet (unswitched)
 | Socket | Product | Yes | $600 | $250 | PCI | $5,000 | 16/hub | 100.0 | 1,000 | > n | 
| Ethernet | Socket | Product | Yes | $100 | $100 | ISA |  | 200 | 10.0 | 1,000 | > n | 
| Fast Ethernet (switched)
 | Socket | Product | Yes | $1,500 | $250 | PCI | $20,000 | 16/hub | 100.0 | 1,000 | log n | 
| ATM | Socket | Product | Yes | $3,000 | $1,000 | PCI | $35,000 | 16/hub | 155.0 | 1,000 | log n | 
| Ethernet (switched)
 | Socket | Product | Yes | $200 | $100 | ISA | $1,500 | 16/hub | 10.0 | 1,000 | log n | 
| *ARCNET | Socket | Product | Yes | $200 | $200 | ISA |  | 255 | 2.5 | 1,000 | n | 
| SLIP | Socket | Copyleft | Yes | $2 | $2 | RS232 |  | 2 | 0.1 | 10,000 |  | 
What does the table mean?  Well, different types of networks are
listed in approximate order of decreasing bandwidth divided by latency
divided by cost per node; in other words, the higher the bandwidth and
lower the latency and cost, the earlier the table entry.  Entries that
begin with * are based on incomplete information...  which does not
imply that the other entries are 100% accurate.  Neither does the
absence of FireWire,
Token Ring (IBM Tropic chipset), etc., imply that they cannot be used;
rather, it means I couldn't get enough information to assemble a
reasonable table entry and would be happy to receive more information.
The Programming Model column describes the programmer
interface used to access the network; most network hardware supports a
unix socket model, although that implies relatively high latency, so
ParaStation and ATM also provide lower-level interfaces and CAPERS and
TTL_PAPERS provide an interface library.  Available?
indicates how you can get this type of network; basically, you buy a
product, copyleft and public stuff is freely available (with some
restrictions for copyleft), and research items are pretty much
one-of-a-kind.  Is there currently support for using this network to
link machines running Linux?  If so, there is a yes under the
Linux? column.
Cost of a network is a little difficult to measure; there are many
parameters and different configurations.  However, most networks are
structured as either:
- 
Each machine has an interface board which is connected to other
interface boards, or
- 
Each machine is connected to a centralized hub or switch, which may in
turn be connected to other hubs.
The average total cost per node (machine networked) is listed in the
$/Node column.  The portion of this cost which goes
for the interface card and/or cable is listed in the
$/Port column, and the Port column
describes the type of interface used.  ISA, EISA, and PCI are all
busses that require interface cards, whereas most PCs need only a
cable to connect via the SPP (parallel printer port) or RS232 (serial
or modem port).  If the network uses a hub or switch, the cost of that
is listed under $/Hub.  The maximum number of PEs
(machines) in a network is listed under Max. PEs;
however, hub-based networks can be scaled to huge sizes by linking
hubs, so, for example, HiPPI isn't limited to 16 machines, but 16 per
hub.  Of course, costs quoted in this table might not reflect
reality....
Perhaps the most mysterious quality of a network is its performance.
Still, it is generally true that higher bandwidth and lower latency
are both good.  The Bandwidth and
Latency columns quote approximate numbers that the
respective networks are unlikely to better; they are not averages and
I can't swear that they are accurate.  The table also doesn't account
for oddities like the fact that many operations that would take
multiple communications on the other networks can be accomplished in a
single aggregate communication using TTL_PAPERS.  In any case, it is
very easy to see that differences between networks are often less than
subtle.  These differences can become even more dramatic when larger
numbers of processors are networked; the n-PE O(
) column gives the approximate order of communication
slowdown when n machines are networked.  For example, Fast
Ethernet performance looks pretty much the same with either switched
or unswitched hub, but the unswitched hub will yield far poorer
performance if many machines are networked.
Network Hardware Definitions
Although the above table really says a lot, it doesn't allow for very
much description of each of the different types of network hardware.
Thus, the following definitions should help.
- 
HiPPI (High Performance Parallel Interface) 
- 
HiPPI was originally intended to provide very high bandwidth for
transfer of huge data sets between a supercomputer and another machine
(a supercomputer, frame buffer, disk array, etc.), and has become the
dominant standard for supercomputers.  Although it is an oxymoron,
Serial HiPPI is also becoming popular, typically
using a fiber optic cable instead of the 32-bit wide standard
(parallel) HiPPI cables.  Over the past few years, HiPPI crossbar
switches have become common and prices have dropped sharply;
unfortunately, serial HiPPI is still pricey, and that is what PCI bus
interface cards generally support.  Worse still, Linux doesn't yet
support HiPPI.  A good overview of HiPPI is
maintained by CERN; they also maintain a rather long list of
HiPPI vendors.
- 
CAPERS (Cable Adapter for Parallel Execution and Rapid Synchronization)
- 
CAPERS is a spin-off of the PAPERS project at the
Purdue University School of Electrical and Computer Engineering.  In
essence, it defines a software protocol for using an ordinary "LapLink"
SPP-to-SPP cable to implement the PAPERS library for two Linux PCs.
The idea doesn't scale, but you can't beat the price.  As with PAPERS,
to improve system security, there is a minor kernel
patch recommended, but not required.
- 
SCI (Scalable Coherent Interconnect)
- 
The goal of SCI is essentially to provide a high performance mechanism
that can support coherent shared memory access across large numbers of
machines, as well various types of block message transfers.  It is
fairly safe to say that the designed bandwidth and latency of SCI are
both "awesome" in comparison to most other network technologies.  The
catch is that SCI is not widely available as cheap production units,
and there isn't any Linux support.  A good set of links over-viewing
SCI is maintained by CERN.
- 
FC (Fibre Channel)
- 
The goal of FC is to provide high-performance block I/O (an FC frame
carries a 2,048 byte data payload), particularly for sharing disks and
other storage devices that can be directly connected to the FC rather
than connected through a computer.  Bandwidth-wise, FC is specified to
be relatively fast, running anywhere between 133 and 1,062 Mbits/s.
If FC becomes popular as a high-end SCSI replacement, it may quickly
become a cheap technology; however, it is not yet and is not supported
by Linux.  A good collection of FC references is maintained by the Fibre Channel
Association.
- 
Myrinet
- 
Myrinet is a local area network
(LAN) designed to also serve as a "system area network" (SAN), i.e.,
the network within a cabinet full of machines connected as a parallel
system.  It is fairly conventional in structure, but has a reputation
for being particularly well-implemented (using custom VLSI).  The
drivers for Linux are said to perform very well, although there are
shockingly large performance variations with different PCI bus
implementations for the host computers.
- 
ParaStation (formerly ParaPC)
- 
The ParaStation
project at University of Karlsruhe Department of Informatics is
building a PVM-compatible custom low-latency network.  They first
constructed a two-processor ParaPC prototype using a custom EISA card
interface and PCs running BSD UNIX, and then built larger clusters
using DEC Alphas.  ParaStation is now (January 1997) available for
Linux.  The PCI cards are being made in cooperation with a company
called Hitex.
- 
SHRIMP (Scalable, High-Performance, Really Inexpensive Multi-Processor)
- 
The SHRIMP project
at the Princeton University Computer Science Department is building a
parallel computer using PCs running Linux as the processing elements.
They have developed a simple two-processor prototype using a
dual-ported RAM on a custom EISA card interface, and are working on a
prototype that will scale to large configurations using a custom
interface card to connect to a "hub" that is essentially the same mesh
routing network used in the Intel Paragon.
Considerable effort has gone into developing low-overhead "virtual
memory mapped communication" hardware and support software.
- 
TTL_PAPERS (TTL Purdue's Adapter for Parallel Execution and Rapid Synchronization)
- 
The PAPERS project
at the Purdue University School of Electrical and Computer Engineering
is building scalable, low-latency, aggregate function communication
hardware and software.  There have been eight different types of
PAPERS hardware built that connect to PCs/workstations via the SPP
(Standard Parallel Port), including incredibly simple public domain
designs using TTL logic.  One such design is available
commercially.  Unlike the custom hardware designs from other
universities, TTL_PAPERS clusters have been assembled at several
universities.  Bandwidth is severly limited by the SPP connections,
but PAPERS implements very low latency aggregate function
communications; even the fastest message-oriented systems cannot
provide comparable performance on those aggregate functions.  Although
PAPERS clusters have been built using AIX and OSF/1 machines,
Linux-based PCs are the platforms best supported.  To improve system
security, there is a minor kernel
patch recommended, but not required.
- 
PLIP (Parallel Line Interface Protocol)
- 
For just the cost of a "LapLink" cable, PLIP allows two Linux machines
to communicate through standard parallel ports using standard
socket-based software.  In terms of bandwidth, latency, and
scalability, this is not a very serious network technology; however,
the near-zero cost and the software compatibility are useful.  
- 
ATM (Asynchronous Transfer Mode)
- 
Unless you've been in a coma for the past year, you have probably heard
a lot about how ATM is the future...  well, sort-of.  ATM is
cheaper than HiPPI and faster than Fast Ethernet, and it can be used
over very long distances.  The ATM network protocol is also designed
to provide a lower-overhead software interface and to more efficiently
manage small messages and real-time communications (e.g., digital
audio and video).  Best of all, it is the highest-bandwidth network
that Linux currently supports.  The bad news is that ATM isn't cheap,
and there are still quite a few compatibility problems across
vendors.  An overview of
Linux ATM development is available.
- 
Fast Ethernet
- 
Although there are really quite a few different technologies calling
themselves "fast Ethernet," this term most often refers to a hub-based
100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
10 Mbits/s devices and cables.  As might be expected, anything called
Ethernet is generally priced for a volume market, and these interfaces
are generally about 1/4 the price of 155 Mbits/s ATM cards.  The catch
is that having a bunch of machines dividing the bandwidth of a single
100 Mbits/s "bus" (unswitched hub) yields performance that might not
even be as good on average as using 10 Mbits/s Ethernet with a
switched hub that can give each machine's connection a full 10
Mbits/s.  Switched hubs that can provide 100 Mbits/s for each machine
simultaneously are very difficult to build and thus approach the cost
of ATM switches, but they do yield much higher total network bandwidth.
 Also note that, as described below for Ethernet, the Beowulf
project at NASA has been developing support that offers
improved performance by load sharing across multiple Fast Ethernets.
- 
Ethernet
- 
For several years now, 10 Mbits/s Ethernet has been the standard
network technology.  Good Ethernet interface cards can be purchased
for under $50, and a fair number of PCs now have an Ethernet
controller built-into the motherboard.  For lightly-used networks,
Ethernet connections can be organized as a multi-tap bus without a hub;
such configurations can serve up to 200 machines with minimal cost,
but are not appropriate for parallel processing.  Adding an unswitched
hub does not really help performance.  However, switched hubs that can
provide full bandwidth to simultaneous connections cost only about
$100 per port.  Linux supports an amazing range of Ethernet
interfaces, but it is important to keep in mind that variations in the
interface hardware can yield significant performance differences.  See
the Hardware Compatibility HOWTO for comments on which are supported
and how well they work.
 An interesting way to improve performance is offered by the 16-machine
Linux cluster work done in the Beowulf
project at NASA CESDIS.  There, Donald Becker, who is the
author of many Ethernet card drivers, has developed support for load
sharing across multiple Ethernet networks that shadow each other
(i.e., share the same network addresses).  This load sharing is
built-into the standard Linux distribution, and is done invisibly at
the socket operation level.  Because hub cost is significant, having
each machine connected to two or more hub-less or unswitched hub
Ethernet networks can be a very cost-effective way to improve
performance.  In fact, in situations where one machine is the network
performance bottleneck, load sharing using shadow networks works much
better than using a single switched hub network.
- 
ARCNET
- 
ARCNET is a local area network that is primarily intended for use in
embedded real-time control systems.  Like Ethernet, the network is
physically organized either as taps on a bus or one or more hubs,
however, unlike Ethernet, it uses a token-based protocol logically
structuring the network as a ring.  Packet headers are small (3 or 4
bytes) and messages can carry as little as a single byte of data.
Thus, ARCNET yields more consistent performance than Ethernet, with
bounded delays, etc.  Unfortunately, it is slower than Ethernet and
less popular, making it more expensive.  More information is available
from the ARCNET Trade Association.
- 
SLIP (Serial Line Interface Protocol)
- 
Although SLIP is firmly planted at the low end of the performance
spectrum, SLIP (or CSLIP or PPP) allows two machines to perform
socket communication via ordinary RS232 serial ports.  The RS232 ports
can be connected using a null-modem RS232 serial cable, or they can
even be connected via dial-up through a modem.  In any case, latency
is high and bandwidth is low, so SLIP should be used only when no other
alternatives are available.  It is worth noting, however, that most
PCs have two RS232 ports, so it would be possible to network a group
of machines simply by connecting the machines as a linear array or as
a ring.  There is even load sharing software called EQL.
Network Software Interface
Before moving on to discuss the software support for parallel
applications, it is useful to first briefly cover the basics of
low-level software interface to the network hardware.  There are
really only three basic choices:  sockets, device drivers, and
user-level libraries.
Sockets
As can be seen in the above table, by far the most common low-level
interface is a socket interface.  Sockets have been a part of unix for
over a decade, and most standard network hardware is designed to
support at least two types of socket protocols:  UDP and TCP.  Both
types of socket allow you to send arbitrary size blocks of data from
one machine to another, but there are several important differences.
Typically, both yield a minimum latency of around 1,000 µs,
although performance can be far worse depending on network traffic.
These socket types are the basic network software interface for most
of the portable, higher-level, parallel processing software; for
example, PVM uses a combination of UDP and TCP, so knowing the
difference will help you tune performance.  For even better
performance, you can also use these mechanisms directly in your
program.  The following is just a simple overview of UDP and TCP; see
the manual pages and a good network programming book for details.
UDP Protocol (SOCK_DGRAM)
UDP is an Unreliable Datagram Protocol; in other
words, it allows each block to be sent as an individual message, but a
message might be lost in transmission.  In fact, depending on network
traffic, UDP messages can be lost, can arrive multiple times, or can
arrive in an order different from that in which they were sent.  The
sender of a UDP message does not automatically get an acknowledgment,
so it is up to user-written code to detect and compensate for these
problems.  Fortunately, UDP does ensure that if a message arrives, the
message contents are intact (i.e., you never get just part of a
message).
The nice thing about UDP is that it tends to be the fastest socket
protocol.  Further, UDP is "connectionless," which means that each
message is essentially independent of all others.  A good analogy is
that each message is like a letter to be mailed; you might send
multiple letters to the same address, but each one is independent of
the others and there is no limit on how many people you can send
letters to.
TCP Protocol (SOCK_STREAM)
Unlike UDP, TCP is a reliable, connection-based,
protocol.  Each block sent is not seen as a message, but as a block of
data within an apparently continuous stream of bytes being transmitted
through a connection between sender and receiver.  This is very
different from UDP messaging because each block is simply part of the
byte stream and it is up to the user code to figure-out how to extract
each block from the byte stream; there are no markings separating
messages.  Further, the connections are more fragile with respect to
network problems, and only a limited number of connections can exist
simultaneously for each process.  Because it is reliable, TCP
generally implies significantly more overhead than UDP.
There are, however, a few pleasant surprises about TCP.  One is that,
if multiple messages are sent through a connection, TCP is able to
pack them together in a buffer to better match network hardware packet
sizes, potentially yielding better-than-UDP performance for groups of
short or oddly-sized messages.  The other bonus is that networks
constructed using reliable direct physical links between machines can
easily and efficiently simulate TCP connections.  For example, this was
done for the ParaStation's "Socket Library" interface software, which
provides TCP semantics using user-level calls that differ from the
standard TCP OS calls only by the addition of the prefix
PSS to each function name.
Device Drivers
When it comes to actually pushing data onto the network or pulling data
off the network, the standard unix software interface is a part of the
unix kernel called a device driver.  UDP and TCP don't just transport
data, they also imply a fair amount of overhead for socket management.
For example, something has to manage the fact that multiple TCP
connections can share a single physical network interface. In
contrast, a device driver for a dedicated network interface only needs
to implement a few simple data transport functions.  These device
driver functions can then be invoked by user programs by using
open() to identify the proper device and then using
system calls like read() and write() on the
open "file."  Thus, each such operation could transport a block of
data with little more than the overhead of a system call, which might
be as fast as 100 µs.
Writing a device driver to be used with Linux is not hard...  if you
know precisely how the device hardware works.  If you are not
sure how it works, don't guess.  Debugging device drivers isn't fun
and mistakes can fry hardware.  However, if that hasn't scared you
off, it may be possible to write a device driver to, for example, use
dedicated Ethernet cards as dumb but fast direct machine-to-machine
connections without the usual Ethernet protocol overhead.  In fact,
that's pretty much what some early Intel supercomputers did....  Look
at the Device Driver HOWTO for more information.
User-Level Libraries
If you've taken an OS course, user-level access to hardware device
registers is exactly what you have been taught never to do, because
one of the primary purposes of an OS is to control device access.
However, an OS call is commonly 100 µs of overhead.  For custom
network hardware like TTL_PAPERS, which can perform over 30 network
operations in 100 µs, such OS call overhead is intolerable.  The
only way to avoid that overhead is to have user-level code -- a
user-level library -- directly access hardware device registers.
Thus, the question becomes one of how a user-level library can access
hardware directly, yet not compromise the OS control of device access
rights.
On a typical system, the only way for a user-level library to directly
access hardware device registers is to:
- 
At user program start-up, use an OS call to map the page of memory
address space containing the device registers into the user process
virtual memory map.  There is no standard Linux call for this purpose,
but it is relatively simple to write a device driver to perform this
function.  Further, this device driver can control access by only
mapping the page(s) containing the specific device registers needed,
thereby maintaining OS access control.
- 
Access device registers without an OS call by simply loading or storing
to the mapped addresses.  For example, *((char *) 0x1234) =
5;would store the byte value 5 into memory location 1234
(hexadecimal).
Fortunately, it happens that Linux for the Intel 386 (and compatible
processors) offers an even better solution:
- 
Using the ioperm()OS call from a privileged process,
get permission to access the precise I/O port addresses that correspond
to the device registers.  Alternatively, permission can be managed by
an independent privileged user process (i.e., a "meta OS") using thegiveioperm()OS call patch for Linux.
- 
Access device registers without an OS call by using 386 port I/O
instructions.
This second solution is preferable because it is common that multiple
devices have their registers within a single page, in which case the
first technique would not provide protection against accessing other
device registers that happened to reside in the same page as the ones
intended.  Of course, the down side is that 386 port I/O instructions
cannot be coded in C -- instead, you will need to use a bit of
assembly code.  The GCC-wrapped (usable in C programs) inline assembly
code function for a port input of a byte value is:
extern inline unsigned char
inb(unsigned short port)
{
    unsigned char _v;
__asm__ __volatile__ ("inb %w1,%b0"
                      :"=a" (_v)
                      :"d" (port), "0" (0));
    return _v;
}
Similarly, the GCC-wrapped code for a byte port output is:
extern inline void
outb(unsigned char value,
unsigned short port)
{
__asm__ __volatile__ ("outb %b0,%w1"
                      :/* no outputs */
                      :"a" (value), "d" (port));
}
Cluster-Parallel Support Software
Ok, so you've selected cluster hardware... now how are you going to
write parallel programs for it?  Well, there is actually quite a lot
of support software to help you.  This section provides pointers to
further information, with a brief overview for each.
To aid you in finding what you need, the pointers are organized into
groups.  The first group is basic message-passing library support,
generally built on top of sockets.  The second group is higher-level
parallel programming support; many of these programming systems are
actually built on top of the messaging support in the first group.
Libraries
There are actually quite a few libraries.  The following have all been
ported to Linux or might be directly usable with Linux systems.
- 
http://www.epm.ornl.gov/pvm/pvm_home.html
- 
PVM (Parallel Virtual Machine) is a freely-available,
portable, message-passing library generally implemented on top of UNIX
sockets.  This includes single-processor and SMP Linux machines, as
well as clusters of Linux machines linked by socket-capable networks
(e.g., SLIP, PLIP, Ethernet, ATM).  In fact, PVM will even work across
groups of machines in which a variety of different types of processors,
configurations, and physical networks are used --
Heterogeneous Clusters -- even to the scale of
treating machines linked by the Internet as a parallel cluster.  Best
of all, PVM is freely available and is clearly the de-facto standard
for message-passing cluster parallel computing.  PVM also provides
facilities for parallel job control.
 It is important to note, however, that PVM message-passing calls
generally add significant overhead to standard socket operations,
which already had high latency.  Further, the message handling calls
do not constitute a particularly "friendly" programming model, so PVM
is commonly used as the "portable message library target" for
high-level language parallel compilers.
- 
http://www.mcs.anl.gov:80/mpi/
- 
Although PVM is the de-facto standard message-passing library,
MPI (Message Passing Interface) is the official
standard.  This page is the home page for the MPI standard.
- 
http://www.osc.edu/lam.html
- 
This is the home page for LAM (Local Area
Multicomputer), a full implementation of the MPI communication
standard for workstation clusters using a conventional network.  The
system includes a variety of development and debugging aids.
- 
ftp://ftp.epcc.ed.ac.uk/pub/chimp/release/
- 
CHIMP, a freely available MPI implementation.
- 
http://garage.ecn.purdue.edu/~papers
- 
The PAPERS project provides a high-level parallel processing library
that is designed to be used with either the CAPERS or PAPERS hardware.
The library is designed to be called from C or C++ routines.
- 
http://info2.rus.uni-stuttgart.de:81/rus/dfn_rpc/README_df
- 
The DFN-RPC, a Remote Procedure Call Tool, was developed to distribute
and parallelize scientific-technical application programs between a
workstation and a compute server or a cluster. The interface is
optimized for applications written in fortran, but the DFN-RPC can
also be used in a C environment.  If you want to install the DFN-RPC
on PC with Linux then you must use at least Rel.  1.0.49alpha, better
is Rel. 1.0.60beta. 
- 
http://www.cmpharm.ucsf.edu/~srp/batch/systems.html
- 
DQS 3.1 is an experimental queueing system that has been developed and
tested under Linux.
- 
http://www.cs.wisc.edu/condor/
- 
Condor is a distributed resource management system that can manage
large heterogeneous clusters of workstations.  Its design has been
motivated by the needs of users who would like to use the unutilized
capacity of such clusters for their long-running,
computation-intensive jobs.  Condor preserves a large measure of the
originating machine's environment on the execution machine, even if
the originating and execution machines do not share a common file
and/or password systems.  Condor jobs that consist of a single process
are automatically checkpointed and migrated between workstations as
needed to ensure eventual completion.
 A Linux port is in progress; more information is available at http://www.cs.wisc.edu/condor/linux/linux.html.
Contact greger@cae.wisc.edu
for details.
Parallel Programming Systems
Generally somewhat higher-level than libraries, the following
programming systems may be usable with Linux.  Note that many of these
systems are built upon one of the libraries listed in the previous
section.
- 
http://www.cs.virginia.edu/~mentat/
- 
Mentat is an object-oriented parallel processing system that works
with workstation clusters and has been ported to Linux.  Mentat
Programming Language (MPL) is an object-oriented programming language
based on C++.  The Mentat run-time system uses something vaguely
resembling non-blocking remote procedure calls.
- 
http://www.informatik.uni-stuttgart.de/ipvr/bv/p3/p3.html
- 
Parallaxis is a structured programming language that extends Modula-2
with "virtual processors and connections" for data parallelism (a SIMD
model).  The Parallaxis software comprises compilers for sequential and
parallel computer systems, a debugger (extensions to the gdb and xgbd
debugger), and a large variety of sample algorithms from different
areas, especially image processing.  This runs on sequential Linux
systems and PVM clusters.
- 
http://suif.stanford.edu/~scales/sam.html
- 
Jade is a parallel programming language that extends C to exploit
coarse-grain concurrency in sequential, imperative programs.  It
assumes a distributed shared memory model, which is implemented by SAM
for workstation clusters using PVM.
- 
http://www.cs.washington.edu/research/projects/orca3/zpl/www/
- 
ZPL is an array-based programming language intended to support
engineering and scientific applications.  It runs under PVM on
workstation clusters.
- 
http://www.csl.sri.com/GLU.html
- 
GLU (Granular Lucid) is a very high-level programming system based on
a hybrid programming model that combines intensional (Lucid) and
imperative models.  Does it run under Linux?
- 
http://www.extreme.indiana.edu/sage/
- 
pC++ is a language extention to C++ that permits data-parallel style
operations using "collections of objects" from some base "element"
class.  It is a preprocessor generating C++ code that can run under
PVM.
- 
http://www.cs.arizona.edu/sr/www/index.html
- 
SR (Synchronizing Resources) is a concurrent programming language in
which resources encapsulate processes and the variables they share;
operations provide the primary mechanism for process interaction. SR
provides a novel integration of the mechanisms for invoking and
servicing operations. Consequently, all of local and remote procedure
call, rendezvous, message passing, dynamic process creation,
multicast, and semaphores are supported. SR also supports shared
global variables and operations.
 It has been ported to Linux, and may be able to run on networked Linux
systems?
- 
http://www.myrias.com/
- 
Myrias is a company selling a software product called Parallel
Application Management System (PAMS).  PAMS provides very simple
directives for virtual shared memory parallel processing.  Are
networks of Linux machines supported?
References of General Interest
The following are references to various cluster-related projects that
may be of general interest.  This includes a mix of generic cluster
references and pointers to Linux cluster sites.
- 
http://www.cs.huji.ac.il/mosix/
- 
This is the home page for MOSIX, a project trying to modify the BSD
kernel to provide dynamic load balancing and preemptive process
migration across a networked group of PCs.
- 
http://now.cs.berkeley.edu/
- 
This is the home page for the Berkeley NOW project.  There is a lot
work going on here, all aimed toward "demonstrating a practical 100
processor system in the next few years."  Alas, they don't use Linux.
- 
http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html
- 
This is the home page for the Beowulf project, which has been the
leading Linux-based cluster project using conventional network
hardware.  In fact, it was as part of this project that Linux kernel
support for load-sharing across multiple Ethernet interfaces was
developed.  This project has also been the source of a number of
Ethernet and 100 Mbit/s Fast Ethernet drivers for Linux.  Currently,
they are using PVM for a variety of parallel applications.
- 
http://www.geli.com/
- 
Geli Engineering provides sales, installation, support and distributed
computing consulting services for workstation clusters using BSD
Pentium systems with 100 Mbits/s Ethernet.
- 
http://www.cs.cornell.edu/Info/People/mdw/mdw.html
- 
This page describes some of the Ethernet cluster and ATM work going on
at the Systems Group at the Cornell University Computer Science
Department.
- 
http://www.cs.sunysb.edu/~manish/locust/
- 
The Locust project is building a distributed virtual shared memory
system that uses compile-time information to hide message-latency and
to reduce network traffic at run time.  Pupa is the underlying
communication subsystem of Locust, and is implemented using Ethernet to
connect 486 PCs under FreeBSD.
- 
http://www.cs.cmu.edu/afs/cs/project/multiC-sys-sw/WWW/top.html
- 
This is the World Wide Web home page of the ARPA/CSTO Multicomputing
System Software project.  Lots of cluster stuff, nothing about Linux
in particular.
 
This page was last modified
.