Linux Parallel Processing Using Clusters
Prof. Hank Dietz
Purdue University School of Electrical and Computer Engineering
hankd@ecn.purdue.edu
Still under construction... 17 February 1997
This document attempts to give an overview of cluster parallel
processing using Linux. Clusters are currently both the most popular
and the most varied approach, ranging from a conventional network of
workstations (NOW) to essentially custom parallel
machines that just happen to use Linux PCs as processor nodes. There
is also quite a lot of software support for parallel processing using
clusters of Linux machines.
Why A Cluster?
Cluster parallel processing offers several important advantages:
-
Each of the machines in a cluster can be a complete system, usable for
a wide range of other computing applications. This leads many people
to suggest that cluster parallel computing can simply claim all the
"wasted cycles" of workstations sitting idle on people's desks. It is
not really so easy to salvage those cycles, and it will probably slow
your co-worker's screen saver, but it can be done.
-
The current explosion in networked systems means that most of the
hardware for building a cluster is being sold in high volume, with
correspondingly low "commodity" prices as the result. Further savings
come from the fact that only one video card, monitor, and keyboard are
needed for each cluster (although you will have to swap these to each
machine to perform the initial installation of Linux; once running, a
typical Linux PC does not need a "console"). In comparison, SMP and
attached processors are much smaller markets, tending toward somewhat
higher price per unit performance.
-
Cluster computing can scale to very large systems. While it
is currently hard to find a Linux-compatible SMP with many more than
four processors, most commonly available network hardware easily
builds a cluster with up to 16 machines. With a little work, hundreds
or even thousands of machines can be networked. In fact, the entire
Internet can be viewed as one truly huge cluster.
-
The fact that replacing a "bad machine" within a cluster is trivial
compared to fixing a partly faulty SMP yields much higher availability
for carefully designed cluster configurations. This becomes important
not only for particular applications that cannot tolerate significant
service interruptions, but also for general use of systems containing
enough processors so that single-machine failures are fairly common.
(For example, even though the average time to failure of a PC might
be two years, in a cluster with 32 machines, the probability that at
least one will fail within 6 months is quite high: assuming independent,
exponentially distributed failures, each machine survives 6 months with
probability of roughly exp(-0.25), or about 0.78, so the chance that all
32 survive is about 0.78^32, which is well under 0.1%.)
OK, so clusters are free or cheap and can be very large and highly
available... why doesn't everyone use a cluster? Well, there are
problems too:
-
With a few exceptions, network hardware is not designed for parallel
processing. Typically latency is very high and bandwidth relatively
low compared to SMP and attached processors. For example, SMP latency
is generally no more than a few microseconds, but is commonly hundreds
or thousands of microseconds for a cluster. SMP communication
bandwidth is often more than 100 MBytes/second, whereas even the
fastest ATM network connections are more than five times slower.
-
There is very little software support for treating a cluster as a
single system.
Thus, the basic story is that clusters offer great potential, but that
potential may be very difficult to achieve for most applications. The
good news is that there is quite a lot of software support that will
help you achieve good performance for programs that are well suited to
this environment, and there are also networks designed specifically to
widen the range of programs that can achieve good performance.
Network Hardware
Computer networking is an exploding field... but you already knew
that if you've read, listened to, or watched the news at any time in
the past two years. An ever-increasing range of networking
technologies and products are being developed, and most are available
in forms that could be used to make a parallel-processing cluster using
a group of machines running Linux.
Unfortunately, no one network technology solves all problems best; in
fact, the range of approaches, costs, and performance is at first hard to
believe. For example, using standard commercially-available hardware,
the cost per machine networked ranges from less than $5 to over
$4,000. The delivered bandwidth and latency each also vary over four
orders of magnitude.
A Tabular Overview
To get things started, the following table gives a brief, and
incomprehensibly dense, summary of the properties of many of the
alternative networks.
Everything At A Glance (Decreasing Bandwidth/Latency/Cost per Node)

Network Type               | Prog. Model    | Available? | Linux? | $/Node | $/Port | Port      | $/Hub   | Max. PEs | Bandwidth (Mbits/s) | Latency (µs) | n-PE O( )
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
*HiPPI                     |                | Product    |        | $3,500 | $1,500 | EISA, PCI | $30,000 | 16/hub   | 1,600.0             |              | log n
CAPERS                     | Library        | Public     | Yes    | $2     | $2     | SPP       |         | 2        | 1.2                 | 3            |
*Serial HiPPI              |                | Product    |        | $4,500 | $2,500 | PCI       | $30,000 | 16/hub   | 1,200.0             |              | log n
*SCI                       |                |            |        |        |        |           |         |          | 1,000.0             |              |
*FC                        |                |            |        |        |        |           |         |          | 1,062.0             |              |
*Myrinet                   | Library        | Product    | Yes    | $1,800 | $1,300 | PCI       | $2,000  | 8/hub    | 1,280.0             | > 9          | log n
*ParaStation               | HAL            | Product    | Yes    | $2,000 | $2,000 | PCI       |         | > 100    | 125.0               | 2            | sqrt n
*SHRIMP                    |                | Research   | Yes    |        |        | EISA      | ?       | ?        | 180.0               | 5            | sqrt n
*ParaPC                    |                | Research   |        |        |        | EISA      |         | 2?       | 40.0                | 5            |
TTL_PAPERS                 | Library        | Public     | Yes    | $100   | $5     | SPP       | $800    | 8/hub    | 1.6                 | 3            | log n
*ParaStation               | Socket Library | Product    | Yes    | $2,000 | $2,000 | PCI       |         | > 100    | 93.0                | 13           | sqrt n
PLIP                       | Socket         | Copyleft   | Yes    | $2     | $2     | SPP       |         | 2        | 1.2                 | 1,000        |
ATM                        | AAL5           | Product    | Yes    | $3,000 | $1,000 | PCI       | $35,000 | 16/hub   | 155.0               | 100          | log n
Fast Ethernet (unswitched) | Socket         | Product    | Yes    | $600   | $250   | PCI       | $5,000  | 16/hub   | 100.0               | 1,000        | > n
Ethernet                   | Socket         | Product    | Yes    | $100   | $100   | ISA       |         | 200      | 10.0                | 1,000        | > n
Fast Ethernet (switched)   | Socket         | Product    | Yes    | $1,500 | $250   | PCI       | $20,000 | 16/hub   | 100.0               | 1,000        | log n
ATM                        | Socket         | Product    | Yes    | $3,000 | $1,000 | PCI       | $35,000 | 16/hub   | 155.0               | 1,000        | log n
Ethernet (switched)        | Socket         | Product    | Yes    | $200   | $100   | ISA       | $1,500  | 16/hub   | 10.0                | 1,000        | log n
*ARCNET                    | Socket         | Product    | Yes    | $200   | $200   | ISA       |         | 255      | 2.5                 | 1,000        | n
SLIP                       | Socket         | Copyleft   | Yes    | $2     | $2     | RS232     |         | 2        | 0.1                 | 10,000       |
What does the table mean? Well, different types of networks are
listed in approximate order of decreasing bandwidth divided by latency
divided by cost per node; in other words, the higher the bandwidth and
lower the latency and cost, the earlier the table entry. Entries that
begin with * are based on incomplete information... which does not
imply that the other entries are 100% accurate. Neither does the
absence of FireWire,
Token Ring (IBM Tropic chipset), etc., imply that they cannot be used;
rather, it means I couldn't get enough information to assemble a
reasonable table entry and would be happy to receive more information.
The Programming Model column describes the programmer
interface used to access the network; most network hardware supports a
unix socket model, although that implies relatively high latency, so
ParaStation and ATM also provide lower-level interfaces and CAPERS and
TTL_PAPERS provide an interface library. Available?
indicates how you can get this type of network; basically, you buy a
product, copyleft and public stuff is freely available (with some
restrictions for copyleft), and research items are pretty much
one-of-a-kind. Is there currently support for using this network to
link machines running Linux? If so, there is a yes under the
Linux? column.
Cost of a network is a little difficult to measure; there are many
parameters and different configurations. However, most networks are
structured as either:
-
Each machine has an interface board which is connected to other
interface boards, or
-
Each machine is connected to a centralized hub or switch, which may in
turn be connected to other hubs.
The average total cost per node (machine networked) is listed in the
$/Node column. The portion of this cost which goes
for the interface card and/or cable is listed in the
$/Port column, and the Port column
describes the type of interface used. ISA, EISA, and PCI are all
busses that require interface cards, whereas most PCs need only a
cable to connect via the SPP (parallel printer port) or RS232 (serial
or modem port). If the network uses a hub or switch, the cost of that
is listed under $/Hub. The maximum number of PEs
(machines) in a network is listed under Max. PEs;
however, hub-based networks can be scaled to huge sizes by linking
hubs, so, for example, HiPPI isn't limited to 16 machines, but 16 per
hub. Of course, costs quoted in this table might not reflect
reality....
Perhaps the most mysterious quality of a network is its performance.
Still, it is generally true that higher bandwidth and lower latency
are both good. The Bandwidth and
Latency columns quote approximate numbers that the
respective networks are unlikely to better; they are not averages and
I can't swear that they are accurate. The table also doesn't account
for oddities like the fact that many operations that would take
multiple communications on the other networks can be accomplished in a
single aggregate communication using TTL_PAPERS. In any case, it is
very easy to see that differences between networks are often less than
subtle. These differences can become even more dramatic when larger
numbers of processors are networked; the n-PE O(
) column gives the approximate order of communication
slowdown when n machines are networked. For example, Fast
Ethernet performance looks pretty much the same with either switched
or unswitched hub, but the unswitched hub will yield far poorer
performance if many machines are networked.
Network Hardware Definitions
Although the above table really says a lot, it doesn't allow for very
much description of each of the different types of network hardware.
Thus, the following definitions should help.
-
HiPPI (High Performance Parallel Interface)
-
HiPPI was originally intended to provide very high bandwidth for
transfer of huge data sets between a supercomputer and another machine
(a supercomputer, frame buffer, disk array, etc.), and has become the
dominant standard for supercomputers. Although it is an oxymoron,
Serial HiPPI is also becoming popular, typically
using a fiber optic cable instead of the 32-bit wide standard
(parallel) HiPPI cables. Over the past few years, HiPPI crossbar
switches have become common and prices have dropped sharply;
unfortunately, serial HiPPI is still pricey, and that is what PCI bus
interface cards generally support. Worse still, Linux doesn't yet
support HiPPI. A good overview of HiPPI is
maintained by CERN; they also maintain a rather long list of
HiPPI vendors.
-
CAPERS (Cable Adapter for Parallel Execution and Rapid Synchronization)
-
CAPERS is a spin-off of the PAPERS project at the
Purdue University School of Electrical and Computer Engineering. In
essence, it defines a software protocol for using an ordinary "LapLink"
SPP-to-SPP cable to implement the PAPERS library for two Linux PCs.
The idea doesn't scale, but you can't beat the price. As with PAPERS,
to improve system security, there is a minor kernel
patch recommended, but not required.
-
SCI (Scalable Coherent Interconnect)
-
The goal of SCI is essentially to provide a high performance mechanism
that can support coherent shared memory access across large numbers of
machines, as well as various types of block message transfers. It is
fairly safe to say that the designed bandwidth and latency of SCI are
both "awesome" in comparison to most other network technologies. The
catch is that SCI is not widely available as cheap production units,
and there isn't any Linux support. A good set of links over-viewing
SCI is maintained by CERN.
-
FC (Fibre Channel)
-
The goal of FC is to provide high-performance block I/O (an FC frame
carries a 2,048 byte data payload), particularly for sharing disks and
other storage devices that can be directly connected to the FC rather
than connected through a computer. Bandwidth-wise, FC is specified to
be relatively fast, running anywhere between 133 and 1,062 Mbits/s.
If FC becomes popular as a high-end SCSI replacement, it may quickly
become a cheap technology; however, it is not yet and is not supported
by Linux. A good collection of FC references is maintained by the Fibre Channel
Association.
-
Myrinet
-
Myrinet is a local area network
(LAN) designed to also serve as a "system area network" (SAN), i.e.,
the network within a cabinet full of machines connected as a parallel
system. It is fairly conventional in structure, but has a reputation
for being particularly well-implemented (using custom VLSI). The
drivers for Linux are said to perform very well, although there are
shockingly large performance variations with different PCI bus
implementations for the host computers.
-
ParaStation (formerly ParaPC)
-
The ParaStation
project at the University of Karlsruhe Department of Informatics is
building a PVM-compatible custom low-latency network. They first
constructed a two-processor ParaPC prototype using a custom EISA card
interface and PCs running BSD UNIX, and then built larger clusters
using DEC Alphas. ParaStation is now (January 1997) available for
Linux. The PCI cards are being made in cooperation with a company
called Hitex.
-
SHRIMP (Scalable, High-Performance, Really Inexpensive Multi-Processor)
-
The SHRIMP project
at the Princeton University Computer Science Department is building a
parallel computer using PCs running Linux as the processing elements.
They have developed a simple two-processor prototype using a
dual-ported RAM on a custom EISA card interface, and are working on a
prototype that will scale to large configurations using a custom
interface card to connect to a "hub" that is essentially the same mesh
routing network used in the Intel Paragon.
Considerable effort has gone into developing low-overhead "virtual
memory mapped communication" hardware and support software.
-
TTL_PAPERS (TTL Purdue's Adapter for Parallel Execution and Rapid Synchronization)
-
The PAPERS project
at the Purdue University School of Electrical and Computer Engineering
is building scalable, low-latency, aggregate function communication
hardware and software. There have been eight different types of
PAPERS hardware built that connect to PCs/workstations via the SPP
(Standard Parallel Port), including incredibly simple public domain
designs using TTL logic. One such design is available
commercially. Unlike the custom hardware designs from other
universities, TTL_PAPERS clusters have been assembled at several
universities. Bandwidth is severely limited by the SPP connections,
but PAPERS implements very low latency aggregate function
communications; even the fastest message-oriented systems cannot
provide comparable performance on those aggregate functions. Although
PAPERS clusters have been built using AIX and OSF/1 machines,
Linux-based PCs are the platforms best supported. To improve system
security, there is a minor kernel
patch recommended, but not required.
-
PLIP (Parallel Line Interface Protocol)
-
For just the cost of a "LapLink" cable, PLIP allows two Linux machines
to communicate through standard parallel ports using standard
socket-based software. In terms of bandwidth, latency, and
scalability, this is not a very serious network technology; however,
the near-zero cost and the software compatibility are useful.
-
ATM (Asynchronous Transfer Mode)
-
Unless you've been in a coma for the past year, you have probably heard
a lot about how ATM is the future... well, sort-of. ATM is
cheaper than HiPPI and faster than Fast Ethernet, and it can be used
over very long distances. The ATM network protocol is also designed
to provide a lower-overhead software interface and to more efficiently
manage small messages and real-time communications (e.g., digital
audio and video). Best of all, it is the highest-bandwidth network
that Linux currently supports. The bad news is that ATM isn't cheap,
and there are still quite a few compatibility problems across
vendors. An overview of
Linux ATM development is available.
-
Fast Ethernet
-
Although there are really quite a few different technologies calling
themselves "fast Ethernet," this term most often refers to a hub-based
100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
10 Mbits/s devices and cables. As might be expected, anything called
Ethernet is generally priced for a volume market, and these interfaces
are generally about 1/4 the price of 155 Mbits/s ATM cards. The catch
is that having a bunch of machines dividing the bandwidth of a single
100 Mbits/s "bus" (unswitched hub) yields performance that might not
even be as good on average as using 10 Mbits/s Ethernet with a
switched hub that can give each machine's connection a full 10
Mbits/s. Switched hubs that can provide 100 Mbits/s for each machine
simultaneously are very difficult to build and thus approach the cost
of ATM switches, but they do yield much higher total network bandwidth.
Also note that, as described below for Ethernet, the Beowulf
project at NASA has been developing support that offers
improved performance by load sharing across multiple Fast Ethernets.
-
Ethernet
-
For several years now, 10 Mbits/s Ethernet has been the standard
network technology. Good Ethernet interface cards can be purchased
for under $50, and a fair number of PCs now have an Ethernet
controller built into the motherboard. For lightly-used networks,
Ethernet connections can be organized as a multi-tap bus without a hub;
such configurations can serve up to 200 machines with minimal cost,
but are not appropriate for parallel processing. Adding an unswitched
hub does not really help performance. However, switched hubs that can
provide full bandwidth to simultaneous connections cost only about
$100 per port. Linux supports an amazing range of Ethernet
interfaces, but it is important to keep in mind that variations in the
interface hardware can yield significant performance differences. See
the Hardware Compatibility HOWTO for comments on which are supported
and how well they work.
An interesting way to improve performance is offered by the 16-machine
Linux cluster work done in the Beowulf
project at NASA CESDIS. There, Donald Becker, who is the
author of many Ethernet card drivers, has developed support for load
sharing across multiple Ethernet networks that shadow each other
(i.e., share the same network addresses). This load sharing is
built into the standard Linux distribution, and is done invisibly at
the socket operation level. Because hub cost is significant, having
each machine connected to two or more hub-less or unswitched hub
Ethernet networks can be a very cost-effective way to improve
performance. In fact, in situations where one machine is the network
performance bottleneck, load sharing using shadow networks works much
better than using a single switched hub network.
-
ARCNET
-
ARCNET is a local area network that is primarily intended for use in
embedded real-time control systems. Like Ethernet, the network is
physically organized either as taps on a bus or as one or more hubs;
however, unlike Ethernet, it uses a token-based protocol logically
structuring the network as a ring. Packet headers are small (3 or 4
bytes) and messages can carry as little as a single byte of data.
Thus, ARCNET yields more consistent performance than Ethernet, with
bounded delays, etc. Unfortunately, it is slower than Ethernet and
less popular, making it more expensive. More information is available
from the ARCNET Trade Association.
-
SLIP (Serial Line Interface Protocol)
-
Although SLIP is firmly planted at the low end of the performance
spectrum, SLIP (or CSLIP or PPP) allows two machines to perform
socket communication via ordinary RS232 serial ports. The RS232 ports
can be connected using a null-modem RS232 serial cable, or they can
even be connected via dial-up through a modem. In any case, latency
is high and bandwidth is low, so SLIP should be used only when no other
alternatives are available. It is worth noting, however, that most
PCs have two RS232 ports, so it would be possible to network a group
of machines simply by connecting the machines as a linear array or as
a ring. There is even load sharing software called EQL.
Network Software Interface
Before moving on to discuss the software support for parallel
applications, it is useful to first briefly cover the basics of
low-level software interface to the network hardware. There are
really only three basic choices: sockets, device drivers, and
user-level libraries.
Sockets
As can be seen in the above table, by far the most common low-level
interface is a socket interface. Sockets have been a part of unix for
over a decade, and most standard network hardware is designed to
support at least two types of socket protocols: UDP and TCP. Both
types of socket allow you to send arbitrary size blocks of data from
one machine to another, but there are several important differences.
Typically, both yield a minimum latency of around 1,000 µs,
although performance can be far worse depending on network traffic.
These socket types are the basic network software interface for most
of the portable, higher-level, parallel processing software; for
example, PVM uses a combination of UDP and TCP, so knowing the
difference will help you tune performance. For even better
performance, you can also use these mechanisms directly in your
program. The following is just a simple overview of UDP and TCP; see
the manual pages and a good network programming book for details.
UDP Protocol (SOCK_DGRAM)
UDP is officially the User Datagram Protocol, but you can treat it as an Unreliable Datagram Protocol; in other
words, it allows each block to be sent as an individual message, but a
message might be lost in transmission. In fact, depending on network
traffic, UDP messages can be lost, can arrive multiple times, or can
arrive in an order different from that in which they were sent. The
sender of a UDP message does not automatically get an acknowledgment,
so it is up to user-written code to detect and compensate for these
problems. Fortunately, UDP does ensure that if a message arrives, the
message contents are intact (i.e., you never get just part of a
message).
The nice thing about UDP is that it tends to be the fastest socket
protocol. Further, UDP is "connectionless," which means that each
message is essentially independent of all others. A good analogy is
that each message is like a letter to be mailed; you might send
multiple letters to the same address, but each one is independent of
the others and there is no limit on how many people you can send
letters to.
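For instance, a bare-bones UDP sender might look like the following sketch. The port number and command-line hostname are arbitrary placeholders, and any real program would have to layer its own acknowledgment and retransmission on top of this.

/*
 * Minimal UDP sender sketch.  The port number (5432) and the use of a
 * hostname from the command line are arbitrary placeholders; real code
 * must also add its own acknowledgment and retry logic, because UDP
 * datagrams can be lost, duplicated, or reordered.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

#define EXAMPLE_PORT 5432

int
main(int argc, char **argv)
{
    int s;
    struct hostent *h;
    struct sockaddr_in addr;
    char msg[] = "hello";

    if (argc != 2) {
        fprintf(stderr, "usage: %s destination-host\n", argv[0]);
        exit(1);
    }
    s = socket(AF_INET, SOCK_DGRAM, 0);         /* a UDP socket */
    h = gethostbyname(argv[1]);                 /* look up the other node */
    if (s < 0 || h == 0) exit(1);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(EXAMPLE_PORT);
    memcpy(&addr.sin_addr, h->h_addr, h->h_length);

    /* Each sendto() is one independent datagram ("letter"). */
    sendto(s, msg, sizeof(msg), 0, (struct sockaddr *) &addr, sizeof(addr));
    close(s);
    return 0;
}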
TCP Protocol (SOCK_STREAM)
Unlike UDP, TCP is a reliable, connection-based
protocol. Each block sent is not seen as a message, but as a block of
data within an apparently continuous stream of bytes being transmitted
through a connection between sender and receiver. This is very
different from UDP messaging because each block is simply part of the
byte stream and it is up to the user code to figure out how to extract
each block from the byte stream; there are no markings separating
messages. Further, the connections are more fragile with respect to
network problems, and only a limited number of connections can exist
simultaneously for each process. Because it is reliable, TCP
generally implies significantly more overhead than UDP.
There are, however, a few pleasant surprises about TCP. One is that,
if multiple messages are sent through a connection, TCP is able to
pack them together in a buffer to better match network hardware packet
sizes, potentially yielding better-than-UDP performance for groups of
short or oddly-sized messages. The other bonus is that networks
constructed using reliable direct physical links between machines can
easily and efficiently simulate TCP connections. For example, this was
done for the ParaStation's "Socket Library" interface software, which
provides TCP semantics using user-level calls that differ from the
standard TCP OS calls only by the addition of the prefix PSS to each
function name.
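To make the stream-versus-message distinction concrete, the following sketch shows a TCP client that prefixes each block with its length so the receiver can find the message boundaries; the server hostname and port number are placeholders.

/*
 * Minimal TCP client sketch.  The server name and port number (5433) are
 * placeholders.  Because TCP delivers an undifferentiated byte stream,
 * the block is preceded by its length so that the receiver can tell
 * where one "message" ends and the next begins.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

#define EXAMPLE_PORT 5433

int
main(int argc, char **argv)
{
    int s;
    struct hostent *h;
    struct sockaddr_in addr;
    char block[] = "one block of data";
    unsigned int len = htonl(sizeof(block));    /* length prefix */

    if (argc != 2) {
        fprintf(stderr, "usage: %s server-host\n", argv[0]);
        exit(1);
    }
    s = socket(AF_INET, SOCK_STREAM, 0);        /* a TCP socket */
    h = gethostbyname(argv[1]);
    if (s < 0 || h == 0) exit(1);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(EXAMPLE_PORT);
    memcpy(&addr.sin_addr, h->h_addr, h->h_length);
    if (connect(s, (struct sockaddr *) &addr, sizeof(addr)) < 0) exit(1);

    write(s, &len, sizeof(len));          /* first the block length...  */
    write(s, block, sizeof(block));       /* ...then the block itself   */
    close(s);
    return 0;
}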
Device Drivers
When it comes to actually pushing data onto the network or pulling data
off the network, the standard unix software interface is a part of the
unix kernel called a device driver. UDP and TCP don't just transport
data, they also imply a fair amount of overhead for socket management.
For example, something has to manage the fact that multiple TCP
connections can share a single physical network interface. In
contrast, a device driver for a dedicated network interface only needs
to implement a few simple data transport functions. These device
driver functions can then be invoked by user programs by using open()
to identify the proper device, and then using system calls like
read() and write() on the open "file." Thus, each such operation could
transport a block of data with little more than the overhead of a
system call, which might be as fast as 100 µs.
Writing a device driver to be used with Linux is not hard... if you
know precisely how the device hardware works. If you are not
sure how it works, don't guess. Debugging device drivers isn't fun
and mistakes can fry hardware. However, if that hasn't scared you
off, it may be possible to write a device driver to, for example, use
dedicated Ethernet cards as dumb but fast direct machine-to-machine
connections without the usual Ethernet protocol overhead. In fact,
that's pretty much what some early Intel supercomputers did.... Look
at the Device Driver HOWTO for more information.
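From the user side, talking to such a dedicated driver might look roughly like the following sketch; the device name /dev/mynet0 and the fixed block size are purely hypothetical.

/*
 * Sketch of user-level use of a hypothetical dedicated network device
 * driver.  The device name /dev/mynet0 and the fixed 1 Kbyte block size
 * are assumptions; a real driver would define its own conventions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int
main(void)
{
    char out[1024] = "data for the machine at the other end";
    char in[1024];
    int fd = open("/dev/mynet0", O_RDWR);   /* hypothetical device */

    if (fd < 0) { perror("open"); exit(1); }

    /* Each call costs little more than one system call (~100 µs). */
    write(fd, out, sizeof(out));            /* push a block onto the wire */
    read(fd, in, sizeof(in));               /* pull a block off the wire  */

    close(fd);
    return 0;
}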
User-Level Libraries
If you've taken an OS course, user-level access to hardware device
registers is exactly what you have been taught never to do, because
one of the primary purposes of an OS is to control device access.
However, an OS call commonly costs around 100 µs of overhead. For custom
network hardware like TTL_PAPERS, which can perform over 30 network
operations in 100 µs, such OS call overhead is intolerable. The
only way to avoid that overhead is to have user-level code -- a
user-level library -- directly access hardware device registers.
Thus, the question becomes one of how a user-level library can access
hardware directly, yet not compromise the OS control of device access
rights.
On a typical system, the only way for a user-level library to directly
access hardware device registers is to:
-
At user program start-up, use an OS call to map the page of memory
address space containing the device registers into the user process
virtual memory map. There is no standard Linux call for this purpose,
but it is relatively simple to write a device driver to perform this
function. Further, this device driver can control access by only
mapping the page(s) containing the specific device registers needed,
thereby maintaining OS access control.
-
Access device registers without an OS call by simply loading or storing
to the mapped addresses. For example, *((char *) 0x1234) = 5; would
store the byte value 5 into memory location 1234 (hexadecimal).
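Under those assumptions, the two steps might look roughly like the following sketch; the device name /dev/mydev is made up, and the driver is assumed to implement mmap() for exactly the register page.

/*
 * Sketch of the two steps above, assuming a custom device driver (the
 * name /dev/mydev is made up) that implements mmap() for exactly the
 * one page that holds the device registers, and assuming 4 Kbyte pages.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

int
main(void)
{
    int fd = open("/dev/mydev", O_RDWR);        /* hypothetical device */
    volatile char *regs;

    if (fd < 0) { perror("open"); exit(1); }

    /* Step 1: map the register page into this process (one OS call). */
    regs = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ((void *) regs == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Step 2: ordinary loads/stores now touch the device registers. */
    regs[0] = 5;

    munmap((void *) regs, 4096);
    close(fd);
    return 0;
}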
Fortunately, it happens that Linux for the Intel 386 (and compatible
processors) offers an even better solution:
-
Using the ioperm() OS call from a privileged process, get permission
to access the precise I/O port addresses that correspond to the device
registers. Alternatively, permission can be managed by an independent
privileged user process (i.e., a "meta OS") using the giveioperm() OS
call patch for Linux.
-
Access device registers without an OS call by using 386 port I/O
instructions.
This second solution is preferable because it is common that multiple
devices have their registers within a single page, in which case the
first technique would not provide protection against accessing other
device registers that happened to reside in the same page as the ones
intended. Of course, the down side is that 386 port I/O instructions
cannot be coded in C -- instead, you will need to use a bit of
assembly code. The GCC-wrapped (usable in C programs) inline assembly
code function for a port input of a byte value is:
extern inline unsigned char
inb(unsigned short port)
{
    unsigned char _v;

    __asm__ __volatile__ ("inb %w1,%b0"
                          :"=a" (_v)
                          :"d" (port), "0" (0));
    return _v;
}
Similarly, the GCC-wrapped code for a byte port output is:
extern inline void
outb(unsigned char value,
     unsigned short port)
{
    __asm__ __volatile__ ("outb %b0,%w1"
                          :/* no outputs */
                          :"a" (value), "d" (port));
}
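As a sketch of how the pieces fit together, the following assumes the device sits on the first parallel port (base address 0x378, typical but not guaranteed), that the program runs with enough privilege for ioperm() to succeed, and that <sys/io.h> provides ioperm() along with inb()/outb() wrappers essentially like those above.

/*
 * Sketch of putting the pieces together.  The base address 0x378 (the
 * usual first parallel port) and the register offsets are assumptions
 * about where the device lives, and the program must run with enough
 * privilege for ioperm() to succeed (e.g., as root).  On many Linux C
 * libraries, <sys/io.h> supplies ioperm() together with inb()/outb()
 * wrappers essentially like the ones shown above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>          /* ioperm(), inb(), outb() */

int
main(void)
{
    unsigned short base = 0x378;        /* assumed SPP base address */

    /* Get permission for the three SPP register ports. */
    if (ioperm(base, 3, 1) < 0) { perror("ioperm"); exit(1); }

    outb(0x0f, base);                            /* store to the data register */
    printf("status = 0x%02x\n", inb(base + 1));  /* load the status register   */

    ioperm(base, 3, 0);                 /* relinquish port access */
    return 0;
}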
Cluster-Parallel Support Software
Ok, so you've selected cluster hardware... now how are you going to
write parallel programs for it? Well, there is actually quite a lot
of support software to help you. This section provides pointers to
further information, with a brief overview for each.
To aid you in finding what you need, the pointers are organized into
groups. The first group is basic message-passing library support,
generally built on top of sockets. The second group is higher-level
parallel programming support; many of these programming systems are
actually built on top of the messaging support in the first group.
Libraries
There are actually quite a few libraries. The following have all been
ported to Linux or might be directly usable with Linux systems.
-
http://www.epm.ornl.gov/pvm/pvm_home.html
-
PVM (Parallel Virtual Machine) is a freely-available,
portable, message-passing library generally implemented on top of UNIX
sockets. PVM runs on single-processor and SMP Linux machines, as
well as clusters of Linux machines linked by socket-capable networks
(e.g., SLIP, PLIP, Ethernet, ATM). In fact, PVM will even work across
groups of machines in which a variety of different types of processors,
configurations, and physical networks are used --
Heterogeneous Clusters -- even to the scale of
treating machines linked by the Internet as a parallel cluster. Best
of all, PVM is freely available and is clearly the de-facto standard
for message-passing cluster parallel computing. PVM also provides
facilities for parallel job control.
It is important to note, however, that PVM message-passing calls
generally add significant overhead to standard socket operations,
which already have high latency. Further, the message-handling calls
do not constitute a particularly "friendly" programming model, so PVM
is commonly used as the "portable message library target" for
high-level language parallel compilers. (A minimal PVM sketch appears
after this list.)
-
http://www.mcs.anl.gov:80/mpi/
-
Although PVM is the de-facto standard message-passing library,
MPI (Message Passing Interface) is the official
standard. This page is the home page for the MPI standard.
-
http://www.osc.edu/lam.html
-
This is the home page for LAM (Local Area
Multicomputer), a full implementation of the MPI communication
standard for workstation clusters using a conventional network. The
system includes a variety of development and debugging aids.
-
ftp://ftp.epcc.ed.ac.uk/pub/chimp/release/
-
CHIMP, a freely available MPI implementation.
-
http://garage.ecn.purdue.edu/~papers
-
The PAPERS project provides a high-level parallel processing library
that is designed to be used with either the CAPERS or PAPERS hardware.
The library is designed to be called from C or C++ routines.
-
http://info2.rus.uni-stuttgart.de:81/rus/dfn_rpc/README_df
-
The DFN-RPC, a Remote Procedure Call Tool, was developed to distribute
and parallelize scientific-technical application programs between a
workstation and a compute server or a cluster. The interface is
optimized for applications written in Fortran, but the DFN-RPC can
also be used in a C environment. To install the DFN-RPC on a PC
running Linux, you must use at least Rel. 1.0.49alpha; Rel. 1.0.60beta
is better.
-
http://www.cmpharm.ucsf.edu/~srp/batch/systems.html
-
DQS 3.1 is an experimental queueing system that has been developed and
tested under Linux.
-
http://www.cs.wisc.edu/condor/
-
Condor is a distributed resource management system that can manage
large heterogeneous clusters of workstations. Its design has been
motivated by the needs of users who would like to use the unutilized
capacity of such clusters for their long-running,
computation-intensive jobs. Condor preserves a large measure of the
originating machine's environment on the execution machine, even if
the originating and execution machines do not share common file
and/or password systems. Condor jobs that consist of a single process
are automatically checkpointed and migrated between workstations as
needed to ensure eventual completion.
A Linux port is in progress; more information is available at http://www.cs.wisc.edu/condor/linux/linux.html.
Contact greger@cae.wisc.edu
for details.
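As mentioned in the PVM entry above, here is a minimal PVM sketch: a single binary that, when run by hand, acts as the master, spawns one copy of itself as a worker, and exchanges a single integer with it. The program name "pvmdemo", the message tag, and the trivial doubling "work" are placeholders; the binary must be compiled with -lpvm3 and installed where the PVM daemon can find it.

/*
 * A minimal PVM sketch: the same binary acts as the master (spawning one
 * copy of itself) or as the worker, and the two exchange a single integer.
 * The program name "pvmdemo", the message tag, and the doubling "work"
 * are placeholders; compile with -lpvm3 and install the binary where the
 * PVM daemon can find it (normally ~/pvm3/bin/$PVM_ARCH).
 */
#include <stdio.h>
#include <pvm3.h>

#define TAG 1                   /* arbitrary message tag */

int
main(void)
{
    int ptid;

    pvm_mytid();                /* enroll this task in PVM */
    ptid = pvm_parent();

    if (ptid == PvmNoParent) {  /* no parent, so act as the master */
        int child, n = 42, reply;

        if (pvm_spawn("pvmdemo", (char **) 0, PvmTaskDefault,
                      "", 1, &child) != 1) {
            fprintf(stderr, "spawn failed\n");
            pvm_exit();
            return 1;
        }
        pvm_initsend(PvmDataDefault);   /* pack and send n to the worker */
        pvm_pkint(&n, 1, 1);
        pvm_send(child, TAG);

        pvm_recv(child, TAG);           /* block until the reply arrives */
        pvm_upkint(&reply, 1, 1);
        printf("worker returned %d\n", reply);
    } else {                    /* spawned copy: act as the worker */
        int n;

        pvm_recv(ptid, TAG);
        pvm_upkint(&n, 1, 1);
        n *= 2;                         /* trivial stand-in for real work */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(ptid, TAG);
    }
    pvm_exit();                 /* leave PVM cleanly */
    return 0;
}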
Parallel Programming Systems
Generally somewhat higher-level than libraries, the following
programming systems may be usable with Linux. Note that many of these
systems are built upon one of the libraries listed in the previous
section.
-
http://www.cs.virginia.edu/~mentat/
-
Mentat is an object-oriented parallel processing system that works
with workstation clusters and has been ported to Linux. Mentat
Programming Language (MPL) is an object-oriented programming language
based on C++. The Mentat run-time system uses something vaguely
resembling non-blocking remote procedure calls.
-
http://www.informatik.uni-stuttgart.de/ipvr/bv/p3/p3.html
-
Parallaxis is a structured programming language that extends Modula-2
with "virtual processors and connections" for data parallelism (a SIMD
model). The Parallaxis software comprises compilers for sequential and
parallel computer systems, a debugger (extensions to the gdb and xgdb
debuggers), and a large variety of sample algorithms from different
areas, especially image processing. This runs on sequential Linux
systems and PVM clusters.
-
http://suif.stanford.edu/~scales/sam.html
-
Jade is a parallel programming language that extends C to exploit
coarse-grain concurrency in sequential, imperative programs. It
assumes a distributed shared memory model, which is implemented by SAM
for workstation clusters using PVM.
-
http://www.cs.washington.edu/research/projects/orca3/zpl/www/
-
ZPL is an array-based programming language intended to support
engineering and scientific applications. It runs under PVM on
workstation clusters.
-
http://www.csl.sri.com/GLU.html
-
GLU (Granular Lucid) is a very high-level programming system based on
a hybrid programming model that combines intensional (Lucid) and
imperative models. Does it run under Linux?
-
http://www.extreme.indiana.edu/sage/
-
pC++ is a language extension to C++ that permits data-parallel style
operations using "collections of objects" from some base "element"
class. It is a preprocessor generating C++ code that can run under
PVM.
-
http://www.cs.arizona.edu/sr/www/index.html
-
SR (Synchronizing Resources) is a concurrent programming language in
which resources encapsulate processes and the variables they share;
operations provide the primary mechanism for process interaction. SR
provides a novel integration of the mechanisms for invoking and
servicing operations. Consequently, all of local and remote procedure
call, rendezvous, message passing, dynamic process creation,
multicast, and semaphores are supported. SR also supports shared
global variables and operations.
It has been ported to Linux; can it also run across networked Linux
systems?
-
http://www.myrias.com/
-
Myrias is a company selling a software product called Parallel
Application Management System (PAMS). PAMS provides very simple
directives for virtual shared memory parallel processing. Are
networks of Linux machines supported?
References of General Interest
The following are references to various cluster-related projects that
may be of general interest. This includes a mix of generic cluster
references and pointers to Linux cluster sites.
-
http://www.cs.huji.ac.il/mosix/
-
This is the home page for MOSIX, a project trying to modify the BSD
kernel to provide dynamic load balancing and preemptive process
migration across a networked group of PCs.
-
http://now.cs.berkeley.edu/
-
This is the home page for the Berkeley NOW project. There is a lot
of work going on here, all aimed toward "demonstrating a practical 100
processor system in the next few years." Alas, they don't use Linux.
-
http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html
-
This is the home page for the Beowulf project, which has been the
leading Linux-based cluster project using conventional network
hardware. In fact, it was as part of this project that Linux kernel
support for load-sharing across multiple Ethernet interfaces was
developed. This project has also been the source of a number of
Ethernet and 100 Mbit/s Fast Ethernet drivers for Linux. Currently,
they are using PVM for a variety of parallel applications.
-
http://www.geli.com/
-
Geli Engineering provides sales, installation, support and distributed
computing consulting services for workstation clusters using BSD
Pentium systems with 100 Mbits/s Ethernet.
-
http://www.cs.cornell.edu/Info/People/mdw/mdw.html
-
This page describes some of the Ethernet cluster and ATM work going on
at the Systems Group at the Cornell University Computer Science
Department.
-
http://www.cs.sunysb.edu/~manish/locust/
-
The Locust project is building a distributed virtual shared memory
system that uses compile-time information to hide message-latency and
to reduce network traffic at run time. Pupa is the underlying
communication subsystem of Locust, and is implemented using Ethernet to
connect 486 PCs under FreeBSD.
-
http://www.cs.cmu.edu/afs/cs/project/multiC-sys-sw/WWW/top.html
-
This is the World Wide Web home page of the ARPA/CSTO Multicomputing
System Software project. Lots of cluster stuff, nothing about Linux
in particular.