Hank Dietz
School of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907-1285
hankd@ecn.purdue.edu
PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) has proven that a group of ordinary PCs and/or workstations can function as a tightly-coupled parallel system. Unfortunately, PAPERS is hardware. True, TTL_PAPERS is very simple hardware, but it is still a separate box, with its own power supply, that you build or purchase.
In contrast, although WAPERS supports the full user-level AFAPI (Aggregate Function Application Program Interface), the WAPERS hardware is entirely passive. Literally, WAPERS is a wiring pattern. How does a wiring pattern implement aggregate functions? The basic building block for TTL_PAPERS aggregates is NAND; WAPERS replaces this by a wired-AND of open-collector outputs from standard parallel ports (SPPs). Depending on the precise electrical properties of the SPPs and cables, WAPERS is even modularly scalable.
What's the catch? There are three. (1) WAPERS yields somewhat lower performance than TTL_PAPERS. (2) Scaling may be limited to as few as about 8 machines. (3) WAPERS can fry parallel port hardware if things are not configured correctly. However, for a lot of small-scale cluster applications, WAPERS is an amazingly simple way to get the performance you need.
Although the concept of using the SPP for communication between machines is widely accepted (e.g., using "LapLink" cables), few would think of it as a viable approach for parallel processing... but why not? The traditional views of parallel and distributed processing rest on a set of basic assumptions that are incompatible with achieving good performance using such an interface. So, WAPERS does things differently:
Thus, WAPERS does not perform any magic; it merely uses a parallel computation model that naturally yields simpler hardware and lower latency... and even though it is an electrically passive design, WAPERS does implement the most important functions directly in hardware. The low-latency synchronization and communication WAPERS provides allow users to take full advantage of the "loosely synchronous" execution models associated with fine-grain to moderate-grain parallelism.
For most versions of PAPERS, especially TTL_PAPERS, bitwise NAND plays a key role in implementing both barrier synchronization and other aggregate functions -- but WAPERS does not contain NAND gates. The TTL_PAPERS barrier logic also depends on a single bit of state stored in a flip-flop, which WAPERS lacks. In fact, WAPERS contains no active logic at all.
An SPP actually has three separate I/O registers, which are mapped to consecutive I/O addresses starting at 0x278, 0x378, or 0x3bc. The first register is an 8-bit data output port, which WAPERS could use to implement an 8-bit broadcast bus, but that we generally ignore for electrical reasons. The second register is a 5-bit status input port that serves no purpose for WAPERS. The last register is a 4-bit control output port, which WAPERS uses to implement the 4-bit AND.
Each bit of the SPP control output was originally an open-collector TTL signal pulled up to +5 volts through a 4.7K ohm resistor, and the logic level of each signal can be read back by reading the feedback register at the same I/O address. Three of the four control outputs are inverted; actually, all four bits were originally driven by 7405 inverting open-collector buffers, but the 0x04 bit was inverted twice on output while the other three bits were inverted once on output and once on input via the feedback register. Although the different paths can yield slightly different characteristics, we can essentially ignore the differences except in the lowest-level WAPERS software, which must correct for the fact that three lines have the inverted sense.
The open-collector outputs are only driven low; they slowly drift high through the pull-up resistance. Normally, reading the value on the feedback port will get you what you output -- however, when a bit is set to the high output value, an external connection could harmlessly pull the signal low by sinking the current provided by the pull-up resistance. Thus, each of these control lines is simultaneously both an output and an input. If we connect several of these I/O lines, the voltage is logic high only if all the outputs were high, i.e., all the signals will be ANDed together.
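To make the wired-AND behavior concrete, here is a minimal software model of it. This is only an illustration of the logic described above, not code from the WAPERS AFAPI; the function name `wired_and` and the convention that a 1 bit means "line released" are assumptions made for this sketch.

```python
from functools import reduce

def wired_and(outputs, width=4):
    """Return the levels seen on `width` shared open-collector lines.

    `outputs` holds one integer per machine. A 0 bit means that machine
    is pulling the line low; a 1 bit means it has released the line and
    is letting the pull-up resistor take it high.
    """
    mask = (1 << width) - 1
    # A line is high only if no machine is pulling it low.
    return reduce(lambda a, b: a & b, outputs, mask) & mask

# Three machines: only the 0x01 line is released by all of them.
print(hex(wired_and([0b1011, 0b0111, 0b1101])))  # prints 0x1
```

Note that a machine that releases all four lines (0x0F) has no effect on the result, which is why every control line can act as an input and an output at the same time.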
Does this wired-AND really work? Well, yes, but there are a few constraints:
Well, maybe it is more than a few constraints.... ;-) The gist of it is that wired-AND of the control port signals will work fine if your systems are properly configured, and using cheap ISA SPP cards is a pretty good way to hedge the bet.
So, what good is a 4-bit wired-AND anyway? The answer is that these four lowly signals implement both fast barrier synchronization and bitwise aggregate communication functions. Here is how.
Before discussing how WAPERS uses the wired-AND signals to implement fast hardware barriers, it is useful to briefly review a bit of the history of PAPERS-style barrier hardware.
A barrier synchronization is an n-way synchronization in which each processor:
It doesn't take a flash of inspiration to realize that the signal in step 2 is essentially the logical AND of signals sent by each of the processors. Way back in 1987 we figured that out, and further realized that by selecting between a constant 1 and the signal out of each processor, the inputs to the AND could be controlled to allow any arbitrary set of processors to participate.
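The masking idea in the paragraph above can be written in a couple of lines. This is a sketch of the logic only; the name `barrier_done` and the list-of-booleans representation are hypothetical and do not come from any PAPERS source.

```python
def barrier_done(arrived, participating):
    """AND of per-processor signals, with non-participants forced to 1.

    arrived[p]       -- True once processor p has signaled arrival
    participating[p] -- True if processor p takes part in this barrier
    """
    # Selecting a constant 1 for a non-participant removes it from the AND.
    return all(a or not p for a, p in zip(arrived, participating))

# Processor 1 is masked out, so its missing arrival does not matter.
print(barrier_done([True, False, True], [True, False, True]))   # prints True
print(barrier_done([True, False, False], [True, False, True]))  # prints False
```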
The thing we did not realize back then is that, using computers that can receive interrupts from other devices, a second barrier synchronization mechanism is needed to confirm that the barrier logic has been reset by all processors before any processor can attempt another barrier synchronization on the primary unit. The result was that our first barrier logic implemented four-cycle barriers:
We then had the insight that, by using a set/reset flip-flop, we could combine steps 1 and 3, and also steps 2 and 4, to create a two-cycle barrier system. In this scheme, two barrier AND (actually NAND) trees are used such that when waiting at one, we are resetting the other. This is the two-cycle design used in most PAPERS units, including the widely disseminated November 1994 TTL_PAPERS design.
Without the flip-flop, it is not possible for WAPERS to perform two-cycle barriers using just two AND trees. The problem lies in the fact that the previous barrier signal must still be available for other processors to read while some processors are initiating the next barrier. Thus, the previous barrier cannot be reset until after the next barrier has completed -- and some processors may be starting a third barrier. The somewhat strange conclusion is that by cycling through three separate barrier AND trees, WAPERS can implement two-cycle barriers. At any barrier, WAPERS is essentially preserving the state of the previous barrier signal while checking-in at the next barrier and resetting the third barrier.
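The three-tree rotation is easier to believe after watching it run. The following is a small simulation sketch, not the AFAPI implementation: it assumes that at barrier k a processor releases line k mod 3, pulls line (k+1) mod 3 low, and leaves line (k-1) mod 3 alone. The class `WiredAnd`, the generator `processor`, and the random scheduler are all invented for this illustration.

```python
import random

class WiredAnd:
    """Three wired-AND lines shared by n processors (pure software model)."""
    def __init__(self, n):
        # out[p][i] is what processor p drives on line i (1 = released).
        # Assumed starting state: lines 0 and 1 held low, line 2 released.
        self.out = [[0, 0, 1] for _ in range(n)]

    def read(self, line):
        # A line reads high only if every processor has released it.
        return int(all(o[line] for o in self.out))

def processor(p, bus, barriers, log):
    for k in range(barriers):
        cur, nxt = k % 3, (k + 1) % 3
        bus.out[p][nxt] = 0        # reset the line for the following barrier
        bus.out[p][cur] = 1        # check in at this barrier
        log.append(("arrive", p, k))
        while not bus.read(cur):   # the previous line is left untouched
            yield                  # let other processors run
        log.append(("pass", p, k))
        yield

def simulate(n=4, barriers=6, seed=0):
    """Run n processors under a random interleaving; return the event log."""
    rng = random.Random(seed)
    bus, log = WiredAnd(n), []
    procs = [processor(p, bus, barriers, log) for p in range(n)]
    live = list(range(n))
    while live:
        p = rng.choice(live)
        try:
            next(procs[p])
        except StopIteration:
            live.remove(p)
    return log

log = simulate()
print(sum(1 for e in log if e[0] == "pass"), "barrier passes completed")
```

In this model no processor passes barrier k until every processor has arrived at it, even though fast processors may already be checking in at barrier k+1 while slow ones are still reading the line for barrier k.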
Bits 0x08, 0x04, and 0x02 of the control register are used as the three barrier signals. The slight additional complication in this assignment is that the sense of bit 0x04 is not inverted, while the other two bits are inverted. This is compensated for by the lowest-level WAPERS port I/O software.
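The sense correction amounts to a single XOR. The sketch below assumes, as described earlier, that bits 0x01, 0x02, and 0x08 are inverted and bit 0x04 is not; the constant and function names are hypothetical and are not taken from the AFAPI sources.

```python
CTRL_DATA = 0x01   # AND data line
CTRL_BAR2 = 0x02   # barrier line
CTRL_BAR4 = 0x04   # barrier line (the one bit that is not inverted)
CTRL_BAR8 = 0x08   # barrier line
INVERTED = CTRL_DATA | CTRL_BAR2 | CTRL_BAR8   # 0x0B

def levels_to_register(levels):
    """Convert desired pin levels (1 = high) to a control register value."""
    return (levels ^ INVERTED) & 0x0F

def register_to_levels(reg):
    """Convert a value read from the control register to pin levels."""
    return (reg ^ INVERTED) & 0x0F

# Releasing all four lines (all pins high) means writing 0x04.
print(hex(levels_to_register(0x0F)))  # prints 0x4
```

Because XOR is its own inverse, the same mask works in both directions, so the rest of the software can reason purely in terms of pin levels.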
Given a 4-bit control output in which three lines are used to implement barrier synchronization, only a single line (corresponding to the 0x01 bit) is available to implement data communication.
The bad news is that transmitting data using a single bit path is essentially software-intensive serial communication, and the raw data rates are not much better than high-speed RS232C serial ports -- typically between 100k and 200k baud. However, there are a few important differences that make this much more capable than just a multi-tap serial line:
Thus, although one would expect WAPERS to yield about 1/4 the aggregate function performance of TTL_PAPERS (which has a 4-bit data path), actual performance using the 1-bit AND is often much closer to that of TTL_PAPERS. In fact, trying to take advantage of the 4-bit pathway of TTL_PAPERS makes it difficult to implement some optimizations, so WAPERS will occasionally outperform TTL_PAPERS.
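As a rough picture of what a 1-bit AND path can compute, the sketch below moves a word across the single data line one bit at a time. It is an assumption-laden illustration: the bit order, the function names `serial_and` and `serial_or`, and the use of De Morgan's law to obtain OR are choices made here, and the real AFAPI routines (with their barriers and optimizations) should be read from the source code.

```python
def serial_and(values, width=8):
    """Bitwise AND across all machines, one bit per use of the data line."""
    result = 0
    for bit in reversed(range(width)):          # most significant bit first
        # Each machine puts one bit on the line; the wire computes the AND.
        line = all((v >> bit) & 1 for v in values)
        result = (result << 1) | int(line)
    return result

def serial_or(values, width=8):
    """OR via De Morgan: AND the complements, then complement the result."""
    mask = (1 << width) - 1
    return ~serial_and([~v & mask for v in values], width) & mask

print(hex(serial_and([0xF0, 0x3C, 0xFC])))  # prints 0x30
print(hex(serial_or([0x01, 0x10, 0x80])))   # prints 0x91
```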
Notice that here we have only hinted at the complexity of some of the WAPERS AFAPI routines... remember that the full source code is freely available, and it provides the most detailed and up-to-date reference for the algorithms used.
The standard WAPERS AFAPI requires the 4-bit AND described above, but it is also possible to implement an 8-bit shared broadcast bus without arbitration hardware... if your port hardware can support it.
The 8-bit data output register is accompanied by an 8-bit feedback register that can be used to read the state of the output signals. Originally, the 8-bit output was driven by a 74LS374, which is an octal TTL driver with tri-state outputs... so it is natural to think of placing the 74LS374 in the high-impedance output state and using the feedback register as an 8-bit input register. The catch is that the SPP simply hardwired the tri-state control to always enable output, so the input trick did not work unless you literally cut a board trace and installed a jumper (of course, cutting a trace and installing a jumper within a single-chip CMOS implementation of SPP is a bit difficult ;-). Besides, if you did that, you had an input-only register; to implement an 8-bit broadcast bus, we need software control over the tri-state signal.
Before we discuss the port configurations that do allow us to have an 8-bit bus, it is useful to ponder what will happen if we use cabling that connects these 8-bit SPP outputs anyway. Why? Well, it is nice to have just one cable wiring specification, WAPERS can detect and avoid using the faulty 8-bit bus, and full WAPERS functionality on appropriate ports can actually be implemented by simply connecting the corresponding pins across all the ports, which is particularly easy (discussed in section 2.2 as Design 2). So, what happens when actively driven, supposedly "TTL-compatible," outputs are tied together?
We engineers always have been taught not to tie actively driven outputs together, and port hardware varies quite a lot, so surprisingly few people really have a feel for what will happen if you do this. After surveying many of my colleagues, I've formulated the following answer. First, things will probably slowly (compared to microsecond port access times) overheat and die if two drivers on a single line disagree; this is probably a larger problem with CMOS than with TTL. We would expect the worst case to be a single CMOS driver pulling high against a group of drivers pulling low; the CMOS pulling high may try to source more current than it can dissipate the heat for. So, let's assume that all the drivers will be in the same state... which state: logic high or low? Most people figured low was safer, because the drivers all agree on ground (the grounds are connected) and both TTL and CMOS are generally good at sinking whatever differences might exist. In contrast, the logic high may differ quite a bit: from just over 2 volts for TTL drivers to 5 volts (or even slightly higher, depending on power supplies) for CMOS. However, if every line is low and one CMOS driver is set high, that driver will fry quickly, so a couple of people felt all high is safer. The bottom line is that things will only be damaged by thermal problems which take a little while to develop, so having software set all the lines in the same state should be ok, and WAPERS AFAPI versions that use the 8-bit broadcast bus take precautions to ensure that things stay ok (forcing all lines low when in doubt). Basically, if AFAPI outputs a value on the 8-bit bus and sees something else in the feedback register, it assumes that somebody else is not in the high impedance state. Still, software errors could fry hardware, so seriously consider using just a five-wire connection (Design 1) if you have SPP port hardware.
Ok, we now know how to keep an unusable port from frying despite an inappropriate cable, but what ports are usable? At least in theory, the PS/2 (bidirectional) port is ideal, and EPP and ECP can emulate that. The catch is that some ports will only allow software to tri-state disable the 8-bit output in modes where the control register lines are not open-collector. If you have one of those ports, forget about the 8-bit stuff -- no version of WAPERS can work without open-collector control lines.
The tri-state control for the 8-bit output is bit 0x20 of the control register. If this bit is on, the 8-bit output is disabled (you cannot disable individual output lines, but only the entire chip). Thus, it is up to the WAPERS AFAPI to ensure that, at any given time, at most one processor has its 8-bit output enabled. How do we do that? The answer is that we use barriers around each 8-bit bus action... not really very different from how we do 1-bit AND transmissions. If there is ever any ambiguity about which processor should have access to the 8-bit bus, the ambiguity is resolved using the 1-bit AND facility and the resulting conflict-free bus write order is enforced by barriers.
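The "at most one driver" rule can be sketched as a tiny model. This is not the AFAPI code: the class `Bus8`, its methods, and the sequential stand-in for the surrounding barriers are assumptions made for illustration; only the meaning of bit 0x20 comes from the text above.

```python
TRISTATE = 0x20  # control-register bit: set = 8-bit data output disabled

class Bus8:
    """Software model of n ports sharing the eight data lines."""
    def __init__(self, n):
        self.ctrl = [TRISTATE] * n   # everyone starts in high impedance
        self.data = [0x00] * n

    def drivers(self):
        """Return the processors whose 8-bit output is currently enabled."""
        return [p for p, c in enumerate(self.ctrl) if not c & TRISTATE]

    def broadcast(self, sender, value):
        # Stand-in for the barrier before the bus action: make sure
        # every port has its data output disabled.
        for p in range(len(self.ctrl)):
            self.ctrl[p] |= TRISTATE
        self.data[sender] = value & 0xFF
        self.ctrl[sender] &= ~TRISTATE          # only the sender drives
        if self.drivers() != [sender]:
            raise RuntimeError("more than one port is driving the bus")
        received = [self.data[sender]] * len(self.ctrl)
        # Stand-in for the barrier after the bus action: release the bus.
        self.ctrl[sender] |= TRISTATE
        return received

bus = Bus8(4)
print([hex(v) for v in bus.broadcast(2, 0x5A)])
```

On real hardware the two loops over all ports would be replaced by barrier synchronizations, since each machine can only change its own control register.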
The result is raw bandwidth typically somewhat over 1Mb/s... still slow, but quite usable given that the latency is on the order of a few microseconds. You should notice two things about this bandwidth. First, it is 8x the raw WAPERS AND bandwidth and 2x the TTL_PAPERS bandwidth, but, unlike those, implements only broadcast. Second, because it is an 8-bit wide path, it is not as easy to optimize the transmissions, so the optimized WAPERS AND bandwidth can actually be better under certain circumstances. In summary, the 8-bit bus is a nice facility, but it is probably best to view it as providing more consistently good bandwidth rather than improving the best case.
As suggested earlier, some versions of the WAPERS AFAPI actually check for availability and proper operation of the 8-bit bus at runtime. This adds a little overhead to some library functions, but is the safest approach that allows the 8-bit bus to be used wherever possible. In general, we do not recommend using 8-bit WAPERS cables unless you are absolutely sure that these connections will be harmless.
To facilitate some level of asynchronous operation, some versions of PAPERS provide a separate interrupt broadcast facility so that any processor can signal the others. Such an interrupt does not really generate a hardware interrupt on each processor, rather, it sets a flag that each processor can read at an appropriate time. Generally, this facility is used primarily by the parallel meta-OS to enforce gang-scheduling, etc.
Unfortunately, WAPERS does not provide any such mechanism. The user-level AFAPI signals are fully supported, but WAPERS does not provide an "out-of-band" parallel interrupt facility for meta-OS use. We feel this is a minor issue because WAPERS is intended for experimentation with dedicated applications; it is not recommended as the interconnection network for clusters to serve multiple users running/developing multiple parallel applications.
Unlike most research prototype supercomputers, WAPERS is a fully public domain hardware design and software intended to be widely replicated. It is hoped that the fine-grain capabilities of WAPERS, and the various more sophisticated PAPERS units, in linking conventional computers will bring a qualitative change to the fields of cluster, network, and heterogeneous supercomputing.
As discussed above, WAPERS is not really hardware, but a wiring pattern. In this section, we detail two different ways to implement an appropriate wiring pattern. The first method implements only the bare minimum wiring, but yields a box that can be safely used with any parallel ports implementing the open-collector control outputs. The second method creates a version that is even easier to build and also implements the 8-bit broadcast bus, but doesn't look as nice and is potentially a bit risky to use with ports not supporting tri-state disable of the 8-bit data output.
It is a quirk of our society that it is often cheaper and easier to modify a more complex, but standard, thing than to build something simple from scratch. This design takes full advantage of that fact by modifying a mechanical parallel printer switch to construct a WAPERS unit. A photo of the completed version appears on the cover of this document.
Although you could easily enough buy DB25 connectors, a box, and wires as individual components, for less than $10 you can buy a 4-to-1 mechanical parallel printer switch that contains all of the above, complete with mounting hardware. Do not get a smart all-electronic switch box; you want one that has a dial that you have to turn to select what is connected. When you open the box (probably removing four screws) and look inside, you'll find something like:
Basically, it is 5 DB25 connectors (yielding a 5-machine WAPERS unit) wired to a big, fat, switch. The next few steps may be easier to do if you first remove the switch and connectors from their mountings in the box.
Disconnect the switch by desoldering the wires from the DB25 connectors. A useful suggestion is to always plug a mating DB25 connector to the connector whose pins you will be soldering/desoldering on -- this way, even if you get the pins a little too hot, they will not become misaligned when the plastic around them gets soft.
Once you have disconnected all the wires from the DB25 connectors, you have just a few connections to make. The pin/contact assignment for each of the lines is given in Table 1. WAPERS connections are completely symmetric; all PEs are connected identically to five "posts". Table 1 lists the pin numbers in the order they appear on each DB25 connector. Notice that most pins (those not listed) are unconnected.
Table 1: DB25 Pin Connections, Design 1

+----------+------+---------------+-------------------+
| PE Pin # | Post | Standard Name | Use In WAPERS Box |
+----------+------+---------------+-------------------+
| 1        | Data | Strobe        | AND Data, 0x01    |
| 14       | Bar2 | AutoFD        | Barrier, 0x02     |
| 16       | Bar4 | Init          | Barrier, 0x04     |
| 17       | Bar8 | Slct          | Barrier, 0x08     |
| 18       | Gnd  | Gnd           | Signal ground     |
| 19       | Gnd  | Gnd           | Signal ground     |
| 20       | Gnd  | Gnd           | Signal ground     |
| 21       | Gnd  | Gnd           | Signal ground     |
| 22       | Gnd  | Gnd           | Signal ground     |
| 23       | Gnd  | Gnd           | Signal ground     |
| 24       | Gnd  | Gnd           | Signal ground     |
| 25       | Gnd  | Gnd           | Signal ground     |
+----------+------+---------------+-------------------+
How do you make these connections? Odds are that you have a bundle of appropriate-length wires connected to the switch that you just removed, so you can desolder and reuse those wires.
Pins 18 through 25 are all ground and are all next to each other... so start by soldering a wire to pin 25 and then solder-bridging across pins 18 through 25. Then connect a wire to each of pins 1, 14, 16, and 17. Do this for each of the DB25 connectors, and then remount them in the box, twist and solder the corresponding wires together, wrap the soldered-wire connections with electrical tape, and you're done inside. It should look something like this:
The only thing remaining is to finish the box. Closing the box is easy enough (probably replacing four screws), but the front panel probably now has a hole in it where the switch used to be. We recommend covering the front panel, hole and all, with an appropriate label. Here's the one we used:
That's it. You now have a neat little WAPERS unit suitable for connecting up to five machines. Each machine simply gets connected to the WAPERS box via a straight-through DB25-to-DB25 cable.
If your open-collector ports can sink enough current, you can also use this same WAPERS unit as a scalable module. For example, to connect eight machines, you would simply connect four machines to the DB25 connectors of each of two such WAPERS units, and then connect the fifth DB25 connectors of the two units to each other using a straight-through DB25-to-DB25 cable. If you are lucky enough to have parallel port hardware and cables with the right electrical characteristics, you can connect up to 11 machines with 3 units, 14 with 4 units, etc. In general, up to 2 + 3x machines could be connected using x WAPERS units, where x>=0. Note that two machines can be connected to each other using a cable without any WAPERS unit... but that cable is essentially Design 2.
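The 2 + 3x figure follows from counting connectors when the units are chained in a line. The short check below assumes five connectors per unit and one straight-through cable between neighboring units; the function name `max_machines` is made up for this sketch, and it says nothing about whether your ports can actually sink the required current.

```python
def max_machines(units, ports_per_unit=5):
    """Machines supported by `units` WAPERS boxes chained in a line."""
    if units == 0:
        return 2   # two machines joined directly by one cable
    # Each unit-to-unit cable uses up one connector on each of two units.
    return units * ports_per_unit - 2 * (units - 1)

for x in range(5):
    print(x, "units ->", max_machines(x), "machines")
```

For five-connector units this reduces to 5x - 2(x - 1) = 2 + 3x, matching the counts given above (5, 8, 11, 14, ...).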
Although Design 1, the WAPERS box, quickly yields a functional and fairly serious-looking unit, there are a few advantages to instead constructing a single, multi-connector, WAPERS cable (Design 2):
Of course, the WAPERS cable has some disadvantages too. Perhaps the most important disadvantage is that there is the potential for the 8-bit bus line drivers to fry due to software errors, which cannot happen with the WAPERS box wired as described above. Another disadvantage is that the cable is not modularly scalable; heck, you cannot even change the set-at-cable-assembly-time distance between machines. Also, the WAPERS cable is a custom cable, and thus might be slightly more expensive. Finally, the recommended construction uses unshielded ribbon cable that connects the signal ground lines pin-to-pin rather than grouping them as a single high-quality ground; the expected result is poorer noise immunity.
The wiring pattern for a WAPERS cable is incredibly simple: each pin on every connector is tied to the corresponding pin on every other connector. The resulting signal assignments are:
Table 2: DB25 Wire Use, Design 2

+----------+---------------+------------------------+
| PE Pin # | Standard Name | Use In WAPERS Cable    |
+----------+---------------+------------------------+
| 1        | Strobe        | AND Data, 0x01         |
| 2        | D0            | Bus Data, 0x01         |
| 3        | D1            | Bus Data, 0x02         |
| 4        | D2            | Bus Data, 0x04         |
| 5        | D3            | Bus Data, 0x08         |
| 6        | D4            | Bus Data, 0x10         |
| 7        | D5            | Bus Data, 0x20         |
| 8        | D6            | Bus Data, 0x40         |
| 9        | D7            | Bus Data, 0x80         |
| 10       | Ack           | Ignored (unused input) |
| 11       | Busy          | Ignored (unused input) |
| 12       | PE            | Ignored (unused input) |
| 13       | SlctIn        | Ignored (unused input) |
| 14       | AutoFD        | Barrier, 0x02          |
| 15       | Error         | Ignored (unused input) |
| 16       | Init          | Barrier, 0x04          |
| 17       | Slct          | Barrier, 0x08          |
| 18       | Gnd           | Signal ground          |
| 19       | Gnd           | Signal ground          |
| 20       | Gnd           | Signal ground          |
| 21       | Gnd           | Signal ground          |
| 22       | Gnd           | Signal ground          |
| 23       | Gnd           | Signal ground          |
| 24       | Gnd           | Signal ground          |
| 25       | Gnd           | Signal ground          |
+----------+---------------+------------------------+
For the special case of a two-machine WAPERS cable, simply purchase a DB25-to-DB25 straight through cable... that's all you need.
However, to make a WAPERS cable for n machines, things are a bit more complex:
The result is a very unobtrusive custom cable that simply chains from the port of each machine to the next. Better still, if you do not want to make this cable yourself, most local computer stores and cable suppliers will make it for you at a reasonable cost.
When all the network hardware that you have is a bunch of wires, which is all WAPERS provides, it is clearly necessary that a bit of cleverness be employed in the support software. There are actually three major problems that the software must attack: determining the port configuration, avoiding illegal hardware states, and efficiently implementing the AFAPI.
The WAPERS software must determine, or at least attempt to confirm the user's specification of, the port hardware configuration being used. TTL_PAPERS and CAPERS use only the minimum port functionality that is common to all types of parallel ports, but WAPERS needs the ports to do more. WAPERS requires open-collector control output, and might additionally use tri-state data output if that ability is present on all machines within a cluster. It is entirely the user's responsibility to determine the appropriateness of using their port hardware for WAPERS before attempting to use WAPERS, and hardware damage may result from an attempt to use an incorrectly configured port.
Although we do not know of any test procedure which is 100% safe and effective in determining the port characteristics, the portinfo program provided with WAPERS AFAPI attempts to determine this information. It also attempts to guide you to the highest-performance port and port configuration. Inside WAPERS AFAPI itself, there are only simplified, occasionally executed, checks to confirm that the port configuration matches that specified by the user.
Finding a parallel port consists of looking for something that responds like a parallel port at one of the base I/O addresses where ports are generally found: 0x278, 0x378, and 0x3bc. Typically, one outputs a non-0xff value to the data output port (the base address) and then reads it back -- if you get the same value you wrote, it is likely that a port is present. The catch is that this test does not work if the data output is tri-state disabled, so you really want to enable the tri-state output first (turn off bit 0x20 on base + 2).
Ok, so it is a port. Is it an ECP port? If so, address base + 0x402 should be the ECP extended control register (ECR). Because 0x3bc + 0x402 yields an address that is generally used by other PC hardware, ports at 0x3bc cannot be used as ECP. If reading the ECR yields a value whose low two bits are 01 (that is, value & 0x03 is 0x01), then we have an ECP (whose FIFO is empty and not full).
If it wasn't an ECP, perhaps it is an EPP? The EPP uses the low bit of port address base + 1 as the EPP status; first we must clear that, then write to an EPP extended port address (base + 3, 4, 5, or 6), and then read the value back. The reset generally happens when base + 1 is touched or 0x01 is written there (wouldn't it be nice if such things were truly standardized?). If data output to base + 3 is then seen at base + 3, you have an EPP.
Suppose it was neither ECP nor EPP, is it a simple bidirectional port (also called a PS/2 port)? The way to test this is to tri-state disable the data output port (output 0x20 to base + 2), output some non-0xff data value to base, and then read from base. If what you output is what you read back, your port probably is not tri-state disabled, so it must be an ordinary SPP (Standard Parallel Port).
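The probe sequence described above can be sketched as follows. This is not the portinfo program: the `probe` function, the `FakePort` stand-in, and the assumption that a disabled data port reads back 0xFF are all invented for illustration, and the EPP test is left out. Real port I/O would replace the `inb` and `outb` callables, with all of the risks already mentioned.

```python
ECR_OFFSET = 0x402   # ECP extended control register, relative to base
TRISTATE = 0x20      # control bit: set = 8-bit data output disabled

def probe(base, inb, outb):
    """Classify the port at `base` as 'none', 'ecp', 'bidir', or 'spp'."""
    ctrl = inb(base + 2)
    outb(base + 2, ctrl & ~TRISTATE & 0xFF)      # enable data output first
    for pattern in (0x55, 0xAA):                 # non-0xff test values
        outb(base, pattern)
        if inb(base) != pattern:
            return "none"
    if base != 0x3BC and (inb(base + ECR_OFFSET) & 0x03) == 0x01:
        return "ecp"
    outb(base + 2, (ctrl | TRISTATE) & 0xFF)     # try to disable output
    outb(base, 0x55)
    readback = inb(base)
    outb(base + 2, ctrl & ~TRISTATE & 0xFF)      # re-enable before leaving
    return "spp" if readback == 0x55 else "bidir"

class FakePort:
    """Very rough software stand-in for a parallel port (an assumption)."""
    def __init__(self, base, kind):
        self.base, self.kind = base, kind
        self.data, self.ctrl = 0x00, 0x0C

    def outb(self, addr, value):
        if self.kind == "none":
            return
        if addr == self.base:
            self.data = value
        elif addr == self.base + 2:
            self.ctrl = value

    def inb(self, addr):
        if self.kind == "none":
            return 0xFF                          # nothing answers
        if addr == self.base:
            floating = self.kind in ("bidir", "ecp") and self.ctrl & TRISTATE
            return 0xFF if floating else self.data
        if addr == self.base + 2:
            return self.ctrl
        if addr == self.base + ECR_OFFSET and self.kind == "ecp":
            return 0x15                          # FIFO empty, not full
        return 0xFF

for kind in ("none", "spp", "bidir", "ecp"):
    port = FakePort(0x378, kind)
    print(kind, "->", probe(0x378, port.inb, port.outb))
```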
Ok, hopefully we now know what type of port we have. Ah, but we really do not know enough about it yet -- some ports, of whatever type, do not implement the control output lines using the open-collector drivers that WAPERS depends on. So we have to test that the open-collector outputs are indeed implemented as open-collector outputs.
The traditional open collector output configuration in a parallel port uses a 4.7K ohm pull-up to +5V. When the open-collector output is high, the resistance to ground is very large (essentially an open circuit). Thus, if you happen to have a DC volt meter, you can confirm the port construction fairly safely and easily by using the fact that connecting two identical resistors in series between +5V and ground (0V) should yield +2.5V at the point between the resistors. If the port is not open collector, the odds are that the resistance between +5V and the reference point is a lot less than 4.7K ohms, and consequently you will read a value significantly more than +2.5V. The circuit needed to make this measurement is:
No harm should come to the port hardware in either case, since the 4.7K ohm resistance is large enough to keep source current within bounds for an ordinary driver. If your port is not open collector and had an effective resistance of zero ohms to +5V, it would still source only about 1 mA (5V divided by the 4.7K ohm resistor you used).
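The numbers in this test are just voltage-divider arithmetic, which the sketch below reproduces. It assumes the external 4.7K ohm resistor is connected from the control pin to ground, so that it forms a divider with whatever pulls the pin toward +5V; the function names are hypothetical.

```python
def pin_voltage(r_pullup, r_test=4700.0, vcc=5.0):
    """Voltage at the pin with r_test to ground and r_pullup to vcc."""
    return vcc * r_test / (r_pullup + r_test)

def worst_case_current_ma(r_test=4700.0, vcc=5.0):
    """Current through r_test if the pin is tied straight to vcc, in mA."""
    return 1000.0 * vcc / r_test

print(round(pin_voltage(4700.0), 2), "V with a 4.7K pull-up")
print(round(pin_voltage(100.0), 2), "V with a 100 ohm path to +5V")
print(round(worst_case_current_ma(), 2), "mA worst case")
```

A reading near 2.5V suggests the traditional 4.7K ohm pull-up; a reading much closer to 5V suggests an actively driven output.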
You don't have a voltage meter and/or do not want to deal with connecting a 4.7K ohm resistor? Well, if your parallel port came with a manual, perhaps it is time to do some reading....
In any case, remember: it is not our fault if you see a little puff of smoke appear where the driver chips used to be. WAPERS is electrically marginal; we know it, we have warned you, and now you know it. Also, our portinfo program is not fully trustworthy. In summary, neither the PAPERS group nor Purdue University can be held responsible for any problems that attempts to use WAPERS may cause.
The software must ensure that the port hardware settings of all machines always yield a safe global state. This is important because, unlike the TTL_PAPERS interface protocols, if the WAPERS protocols are applied incorrectly, the port hardware of one or more machines can be damaged.
The level of software protection against the occurrence of illegal hardware states varies widely across WAPERS library releases -- check the source code for your WAPERS AFAPI version to see how it handles such things. In any case, we are fighting a losing battle in the sense that there always will be some ways in which an undetected hardware error (i.e., noise, short, or other cable problem) or software error could cause port hardware to fry. You can minimize the probability of serious problems by:
Keep in mind that even the best software protection against illegal hardware states is effective only when that software is running. For example, the port probes done during boot are very likely to cause illegal states at least temporarily, so you might want to physically disconnect the machines during boot or run the WAPERS software to initialize the port within the boot process.
Like the old Saturday Night Live skits about harmless little toys like "Bag O' Glass" used to say: Kid, be careful!
The whole point of WAPERS is to provide a network supporting low-latency AFAPI communications across two or more machines. Any software that works with the user-level AFAPI will work with WAPERS AFAPI.
Although we now have a unified AFAPI release, the WAPERS AFAPI is customized for the WAPERS "hardware" and what it can do. Thus, WAPERS AFAPI provides the same library interface, but it is not just a minor variation on the TTL_PAPERS AFAPI; it is a re-implementation optimized to yield good performance using WAPERS' dumb, passive, hardware. To see how the implementations differ, look at the AFAPI sources:
http://aggregate.org/AFAPI/
In this paper, we have presented the complete public domain design of the WAPERS hardware. This design represents the simplest possible mechanism to efficiently support barrier synchronization, aggregate communication, and group interrupt capabilities -- using unmodified conventional workstations or personal computers as the processing elements of a fine-grain parallel machine.
WAPERS is electrically marginal, and does not scale to very large clusters, but it makes for a very low cost first experience in using tightly-coupled cluster parallel processing. In fact, the ultra-low cost can make WAPERS very appropriate for linking together those wimpy old 386 machines that were in dead storage taking-up valuable office or lab space; even a 386 WAPERS cluster can implement a decent video wall.
If you like WAPERS, but want something better and are willing to deal with more complex hardware, take a look at the various PAPERS designs at:
http://aggregate.org/AFN/Hardware/
Most importantly, let us know what you think about WAPERS and how you use it. Also tell us if you tried and failed. Send comments to hankd@engr.uky.edu