This hypertext document presents a brief illustrated history of the development of PAPERS, Purdue's Adapter for Parallel Execution and Rapid Synchronization.
The first PAPERS prototype, this box was designed and constructed in less than two weeks... although we subsequently spent several months debugging and testing it. The unit connects to four PCs using standard printer cables to go from each PC to the corresponding Centronics connector mounted on the back of PAPERS0. Within the Oak box (with the top attached by velcro) are a power supply, rear-mounted Centronics parallel port connectors, wire-wrapped main circuit board, and wire-wrapped 40-LED display board. The LED display was a tad excessive, and was the vast majority of the power draw, but it looked cute and was helpful in debugging.
The full dynamic barrier mechanism is implemented using just one AMD 22V10 PAL per processor, with a group of TTL drivers used to ensure proper interface levels for the parallel printer ports of four PCs. Each logical barrier synchronization requires 4 port operations (cycles), organized as a barrier followed by an "anti-barrier": the barrier to synchronize, the anti-barrier to ensure that all participating processors have detected that synchronization was achieved.
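As a rough illustration of the barrier/anti-barrier pairing, here is a minimal software sketch. It is not the PAL logic. The `BarrierUnit` class, its method names, and the split of the 4 port operations into write, read, write, read are all assumptions made for illustration.

```python
class BarrierUnit:
    """Toy model (an assumption, not the real hardware): one 'ready' line per PE."""

    def __init__(self, num_pes=4):
        self.ready = [False] * num_pes
        self.port_ops = [0] * num_pes  # port operations issued by each PE

    def write_ready(self, pe, value):
        """PE raises or drops its line (one port write)."""
        self.ready[pe] = value
        self.port_ops[pe] += 1

    def read_all_ready(self, pe):
        """PE checks whether every line is raised (one port read)."""
        self.port_ops[pe] += 1
        return all(self.ready)

    def read_all_clear(self, pe):
        """PE checks whether every line is dropped (one port read)."""
        self.port_ops[pe] += 1
        return not any(self.ready)


def logical_barrier(unit):
    """Run one logical barrier for all PEs; return True if it completed."""
    pes = range(len(unit.ready))
    # Barrier: every PE raises its line, then observes that all are raised.
    for pe in pes:
        unit.write_ready(pe, True)
    seen_go = [unit.read_all_ready(pe) for pe in pes]
    # Anti-barrier: every PE drops its line, then observes that all are
    # dropped, so no PE can race ahead into the next barrier.
    for pe in pes:
        unit.write_ready(pe, False)
    seen_clear = [unit.read_all_clear(pe) for pe in pes]
    return all(seen_go) and all(seen_clear)


unit = BarrierUnit(4)
done = logical_barrier(unit)
```

In this sequential model each read succeeds on the first try, so every PE issues exactly 4 port operations. Real PEs arrive at different times and would have to poll.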
The only communication supported by the hardware was a 1-bit multibroadcast intended to provide a means for voting on membership in a new barrier mask. However, we observed that arbitrary communication patterns and functions of the aggregate state also could be implemented using this hardware, and thus was born the concept of associating aggregate communications with barriers. If the data bit sent by a PE was the same as the last value sent, the transmission required 4 cycles; otherwise, a fifth cycle was needed to give the changed data bit time to settle (in essence, avoiding a race with the barrier GO signal). Unfortunately, the electrical characteristics of the parallel printer port turned out to be a lot more "interesting" than we had expected. We spent a lot of time in the MSEE 190 undergraduate laboratory debugging noise problems, which isn't too big a surprise when you combine the port characteristics with the wire-wrap jungle inside the box....
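The 4-versus-5 cycle rule for the 1-bit multibroadcast can be turned into a quick cost estimate. The sketch below is only a back-of-the-envelope model. The function name `multibroadcast_cycles` and the assumption that the data line starts out at 0 are not from the original design.

```python
def multibroadcast_cycles(bits, initial=0):
    """Total port cycles for one PE to send a sequence of 1-bit values.

    Assumes (hypothetically) that the data line starts at `initial`.
    A repeated bit costs 4 cycles; a changed bit costs 5, the extra
    cycle giving the new value time to settle.
    """
    total = 0
    last = initial
    for bit in bits:
        total += 4 if bit == last else 5
        last = bit
    return total


# 0, 0, 1, 1, 0 -> 4 + 4 + 5 + 4 + 5 cycles
cost = multibroadcast_cycles([0, 0, 1, 1, 0])
```

Under this model, a stream whose bit value alternates on every send pays the fifth cycle each time, while a constant stream never does.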
Because the parallel printer port supports generation of interrupts, we were very careful to make PAPERS0 able to generate interrupts either when a barrier synchronization was achieved or when a processor sent a "parallel interrupt request." How the interrupt would be triggered was software selectable. However, we quickly realized that latency and OS problems made it unwise for a parallel interrupt to generate a "real" interrupt on the PC, so generation of real PC interrupts was always disabled in our library.
Given how well PAPERS0 worked, and that we had a pretty good idea of what kind of electrical surprises to expect from the printer ports, we decided to design and build a simplified version that would use only TTL parts -- no PALs. This unit used just 8 standard TTL chips to implement a non-partitionable static barrier version of PAPERS0.
Aside from the logic simplification, the TTL PAPERS unit incorporated a few improvements. An obvious change is that the front panel has just one LED for each PE and one LED that acts as a power indicator. This brought the power consumption down to a level that allowed us to use a cheap AC wall adapter unit, a few capacitors, and a 7805 voltage regulator as the power supply. We also simplified construction, reduced signal noise, and cut cost by directly connecting cables to the circuit board rather than connecting them to a rear-mounted connector that is in turn connected to the circuit board. Because the cable mounts and power supply no longer mandated a large box, we were able to make the box much smaller, and changed to a design that used a pivoting "hood" for the cover. Another change was that all connections were soldered rather than wire-wrapped, greatly improving reliability.
For all you woodworkers out there, the TTL PAPERS box is Oak with a cover made out of Poplar.
Just as TTL PAPERS was simplifying the PAPERS0 design, PAPERS1 was attempting to create a higher-performance enhanced version. Because it was to be the high-performance version, the PAPERS1 hardware has the dubious honor of having undergone so many revisions that we lost count long ago. Phrases like "no, that PAL design is ancient... it's from last week" come to mind....
In any case (and, incidentally, the case of PAPERS1 is made of Pine), PAPERS1 does yield optimum performance. It is a full dynamic barrier mechanism with a variety of enhanced data communication operations, and all PAPERS1 operations require just 2 cycles. Inside the case....
Each PE corresponds to some TTL drivers and two AMD 22V10 PALs: one "barrier PAL" and one "communication PAL." This separation allowed PAPERS1 to perform both 1-bit multibroadcast (like PAPERS0) and a 4-bit multibroadcast that we now call "putget." (Currently, PAPERS1 has been upgraded by replacing 1-bit multibroadcast with NANDing as described below.) Making PAPERS1 capable of 2-cycle data transmission required a bit of cleverness implemented by a carefully crafted state machine. In essence, PAPERS1 internally simulates a multi-cycle data transmission using its own clock; this can yield near peak performance, but also makes it necessary to tune the internal clocking to the PC port and cable characteristics.
The board is wire-wrapped, but, unlike PAPERS0, wires were routed very carefully to minimize interference. Having 8 PALs meant a 300mA AC adapter wouldn't suffice, so there is a small switching power supply inside and an on/off switch on the back. The method for connecting cables and the LED display resemble those of the first TTL PAPERS.
It is also worthwhile to note that PAPERS1 upgraded the PAPERS0 concept of a parallel interrupt to include a special "interrupt acknowledge" barrier. Although the library still doesn't generally use this mechanism, it is used to achieve the initial synchronization when a parallel code begins execution. There is also a hardware ID feature that allows a PE to determine which PAPERS PE it is connected as -- earlier versions of PAPERS had to be explicitly given their PE number.
Just when we thought that things couldn't be done any simpler than the first TTL PAPERS, we realized a few things:
All of this led to a second TTL PAPERS that was built using only 5 standard TTL chips. Well, it was really 6 chips if you count the additional TTL driver chip we added later to brighten the LEDs. By the way, the case is a slightly rounded version of the earlier TTL PAPERS case, but made with Aspen instead of Oak.
Something else wonderful happened with this prototype: we finally got a place where we could keep a PAPERS unit connected long term. Up to this time, we had been borrowing a few of the 486DX2/66 machines in the MSEE 190 undergraduate laboratory, but we could only use those machines when classes didn't need them. The cluster of 486DX33 machines shown with the second TTL PAPERS box finally gave us a place to experiment without having to compete with undergraduate students for access to the machines.
All the above is good stuff, but does it scale? Although this Oak and Aspen box is physically even smaller than the first two 4 PE TTL PAPERS boxes, it was put together for the sole purpose of demonstrating that the design scales at least to an 8 PE cluster. We were not very particular about how it would prove scalability, so the functionality is essentially a supercharged version of what TTL PAPERS supports, with the state machine timing properties of PAPERS1. Thus, it provides static barrier synchronization with 4-bit NANDing, all done with 2 cycle speed. Inside the box....
The single wire-wrapped circuit board is very densely packed with a variety of TTL and AMD PAL parts implementing a non-partitionable static barrier mechanism. Unfortunately, the wire-wrap was apparently a bit too dense, because we measured something on the order of a 2 volt spike on one wire that wasn't supposed to be doing anything at the time... a few capacitors cleared up these little problems, but left us cursing wire-wrap. We also used a different method to connect the cables to the board: we made a little DIP header for each group of similar signals from the PCs. We will not do that again either. Oh yeah. It also was the first PAPERS unit to need a heat sink on its 7805 and a fan.
Ok, so this one was an evolutionary dead end. Anyway, it works and it is fun to watch the 8 bi-color LEDs as it plays one of our MIMD multi-voice music demos.
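For readers who have not met the 4-bit NANDing mentioned above, here is a minimal software sketch of the aggregate: each PE contributes a 4-bit value and every PE reads back the bitwise NAND of all contributions. The function names `nand4` and `or4` are made up for illustration. The OR trick is just De Morgan's law, shown here as one plausible use rather than as a description of the PAPERS library.

```python
from functools import reduce


def nand4(values):
    """Bitwise NAND across 4-bit values, one value per PE."""
    anded = reduce(lambda a, b: a & b, values, 0xF)
    return ~anded & 0xF  # keep only the low 4 bits


def or4(values):
    """Bitwise OR built from NAND: each PE sends its value complemented."""
    return nand4([~v & 0xF for v in values])


# Four PEs: the AND of their values is 0b1010, so the NAND is 0b0101.
result = nand4([0b1111, 0b1011, 0b1110, 0b1111])
```

Because NAND of the complemented inputs equals OR of the inputs, a single NAND primitive is enough for a PE to learn whether any PE set a given bit.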
Don't ask. We tried to quickly build a 16 PE static barrier 2 cycle unit using a TI FPGA and some simple 4 PE signal-conditioning boards (shown in the photo above). Results: (1) we proved that many Purdue EE students don't know how to solder and (2) doing something like this in a rush virtually guarantees failure. It was a good experience in learning how to use Mentor Graphics and the FPGAs, making our own printed circuit boards, etc. Although we completed several of the 4 PE signal conditioning boards, we'll probably never bother finishing this prototype -- the design is now obsolete.
Ok, you can ask about this one. Actually, that's why we built it (and because the 16 PE unit wasn't going to be ready in time for our booth at Supercomputing ;-). This is the unit we've been building in significant quantity and supporting as a full public domain hardware design and support software release.
Basically, this version of PAPERS is just like the second TTL PAPERS unit, except:
Although the particular box pictured above is made out of Aspen, other copies have front and rear panels made out of Oak, Cherry, Walnut, Poplar, Pine, and Mahogany. Incidentally, although the PE numbering is arbitrary, the unit shown in the above photo has the display numbered with higher PE numbers corresponding to lower positions on the panel, which is the reverse of our "standard" numbering.
We are using this version of TTL_PAPERS for our permanent clusters. For example, the following photo shows our first Pentium cluster. These machines were donated in Summer 1995 by Intel specifically for the PAPERS project. Each holds a Pentium 90, 32M RAM, and 700M disk.
Since other people at other places also have been building this type of TTL_PAPERS unit (e.g., Prof. Will Cohen has built one at the University of Alabama at Huntsville), it is useful to take a closer look at some of the construction details for TTL_PAPERS. The back of the box is....
From this view, you can see the power and cable connections on the rear panel. Opening the box....
This photograph reveals the construction of the box itself as well as the installation of the circuit board and cable connections. Notice that all the signal ground connections are made by mechanically connecting and soldering directly to a common ground post in the base of the unit -- this provides both a better electrical ground and a solid physical connection to help ensure that the cables will not pull out (this type of ground connection was used on all but PAPERS0 and PAPERS1). Partially removing the board....
If you back the board away from the front panel, you can see how the LEDs are mounted on the "wrong side" of the board and how they fit into the front panel mounting holes. In fact, this fit is generally tight enough that no separate mounting hardware is needed to attach the board to the box.
Although the TTL_PAPERS design of November 1994 was widely accepted, and a few other universities have built clusters using that design, we still have trouble getting some people to take it seriously because it only connects 4 machines. True, we did detail how to scale to larger systems, but that left a lot of people unconvinced. There was also the problem that scaling to a larger cluster meant building a whole new PAPERS unit... you can't incrementally expand the unit. In contrast, TTL_PAPERS 950801 is an 8-processor unit that modularly scales to thousands of processors....
The practical maximum number of machines that can be placed in a single rack is 8; thus, larger systems would most naturally be composed of multiple 8-machine racks. Ideally, the cluster should be able to be constructed by simply connecting the PAPERS modules housed within each rack of 8 machines. Further, to minimize wiring distances within each rack, the PAPERS module should really be placed in the middle of the rack (rather than being a stand-alone box). TTL_PAPERS 950801, which is designed to be built into a slide-out drawer within a wooden 8-machine rack, meets these goals by implementing a modular version of the TTL_PAPERS design of November 1994.
As our first attempt at a modular design, there have been quite a few new problems to be solved. Perhaps the most difficult question is: what interconnect pattern should be used to link the units of multiple 8-machine racks? The answer we have implemented is that TTL_PAPERS 950801 units can be linked in a tree structure with a fan-out of five and an increase in operation time of <200ns for each level in the tree. Thus, a two-level tree allows up to 8 + 5*8, or 48, machines; a three-level tree allows up to 8 + 5*8 + 5*5*8, or 248, machines. A four-level cluster could use as many as 8 + 5*8 + 5*5*8 + 5*5*5*8, or 1248, machines while adding only about 4 * 200ns, or 0.8 microseconds, to the time for each basic operation.
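The capacity figures above follow a simple geometric sum, which the sketch below reproduces. The function and parameter names are invented for illustration. The delay function treats the quoted <200ns per level as an upper bound and counts levels the same way the 4 * 200ns example does.

```python
def max_machines(levels, fan_out=5, per_unit=8):
    """Largest cluster for a tree of the given depth.

    Level 0 is the root unit (per_unit machines); each deeper level has
    fan_out times as many units as the level above it.
    """
    return sum(per_unit * fan_out ** depth for depth in range(levels))


def added_delay_ns(levels, per_level_ns=200):
    """Upper bound on the extra time per basic operation, in nanoseconds."""
    return levels * per_level_ns


for levels in range(1, 5):
    print(levels, max_machines(levels), added_delay_ns(levels))
```

The loop prints 8, 48, 248, and 1248 machines for one through four levels, matching the totals in the text, with the four-level delay bound coming out at 800ns (0.8 microseconds).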
This tree-structured expandability unfortunately implies that there are actually four distinct configurations of TTL_PAPERS 950801 boards: stand alone, root node, internal node, and leaf node. Although the same board layout works for all, there are significant population and wiring differences between these configurations. Further, we did not have space for enough drivers for the internal node configuration, so separate driver boards are needed for very large clusters.
The TTL_PAPERS unit we released at the IEEE/ACM Supercomputing conference in November 1994 was a sanitized and improved version of the earlier 4-processor TTL_PAPERS units. Likewise, the TTL_PAPERS 951201, which we will be releasing at Supercomputing 1995, is an improved version of the modularly scalable 8-processor TTL_PAPERS 950801.
None of the changes from TTL_PAPERS 950801 is particularly dramatic, but there are dozens of incremental improvements. Many of these improvements relate to the electrical and mechanical properties of the board, but a few extensions have been made to the functionality. Although TTL_PAPERS 951201 only supports a tree fan-out of four at each level (versus five for the 950801 design), there is no need for additional drivers and all four board configurations are supported without major wiring changes. There is also a new interface on the board that facilitates connection of external logic to support real-time control applications (e.g., an external timer or sensor can trigger a barrier).
When will more details on the TTL_PAPERS 951201 design be publicly available? At and after our Supercomputing 1995 research exhibit, December 5-7, 1995.... Until then, here's a peek inside:
Hey... isn't that just a photo of a cable? Yup. To be precise, it is what is sometimes called a LapLink cable, but we call it CAPERS: the Cable Adapter for Parallel Execution and Rapid Synchronization. While we were busy improving the TTL_PAPERS library, we realized that it would be possible to implement a version for two processors using only a passive parallel cable connection between the machines. This doesn't scale, but it sure makes it easy and cheap to try out the library. Aside from the inability to scale, performance is slightly worse than using the TTL_PAPERS hardware.
After we had completed the TTL_PAPERS 960801 board layout, we found that we had a little corner of the layout rectangle unused (because the 960801 layout actually has two boards: a main board and a boardlet used to support scaling). After getting price quotes on the two-layer plated-through board, we discovered that the price would be unaffected by whatever we decided to put in that corner. So we couldn't leave it blank. ;-)
One idea was to make a PAPERS keychain. Another passing whim was to make it an image of the group's business card. However, we finally figured out something useful to do with this space: PAPERS_JR.
PAPERS_JR is essentially a CAPERS unit, but it adds two features:
Neither of the above TTL_PAPERS-like features is implemented in the same way as in TTL_PAPERS; rather, both are implemented in a way that allows the PAPERS_JR library to work with CAPERS hardware for everything that CAPERS supports.
Of course, PAPERS_JR is both very small and very simple: just two TTL parts. The wooden case for PAPERS_JR (mahogany, in the photo ;-) is exceptionally simple to make, and looks a lot like a solid block of wood with a hole in it for cooling the 7805 voltage regulator. It makes a great stocking-stuffer. ;-)
Although TTL_PAPERS 951201 makes a nice, modularly scalable, eight-processor TTL_PAPERS, there are a few annoying things about that design which we have fixed in the TTL_PAPERS 960801 design. Of course, everything one fixes breaks something else.... So, here's what's new:
In summary, this four-processor design is not really worthwhile if all you want to connect is four or eight machines. However, if you expect to have a lot more than four, and/or to incrementally add more machines, this is the design you want.
Thank you for visiting the PAPERS museum. As further developments are made (or whenever we get around to it) new exhibits will be added. If you have any suggestions or comments, please send them to hankd@engr.uky.edu.