Although the TTL_PAPERS design of November 1994 was widely accepted, and a few other universities have built clusters using that design, we still have trouble getting some people to take it seriously because it connects only 4 machines. True, we did detail how to scale to larger systems, but that left a lot of people unconvinced. There was also the problem that scaling to a larger cluster meant building a whole new PAPERS unit; the existing unit could not be expanded incrementally. In contrast, TTL_PAPERS 950801 is an 8-processor unit that scales modularly to thousands of processors.
The practical maximum number of machines that can be placed in a single rack is 8; thus, larger systems would most naturally be composed of multiple 8-machine racks. Ideally, it should be possible to construct the cluster simply by connecting the PAPERS modules housed within each rack of 8 machines. Further, to minimize wiring distances within each rack, the PAPERS module should really be placed in the middle of the rack (rather than being a stand-alone box). TTL_PAPERS 950801, which is designed to be built into a slide-out drawer within a wooden 8-machine rack, meets these goals by implementing a modular version of the TTL_PAPERS design of November 1994.
Because this is our first attempt at a modular design, there have been quite a few new problems to solve. Perhaps the most difficult question is: what interconnect pattern should be used to link the units of multiple 8-machine racks? The answer we have implemented is that TTL_PAPERS 950801 units can be linked in a tree structure with a fan-out of five and an increase in operation time of less than 200ns for each level in the tree. Thus, a two-level tree allows up to 8 + 5*8, or 48, machines; a three-level tree allows up to 8 + 5*8 + 5*5*8, or 248, machines. A four-level cluster could use as many as 8 + 5*8 + 5*5*8 + 5*5*5*8, or 1248, machines while adding only about 4 * 200ns, or 0.8 microseconds, to the time for each basic operation.
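The capacity arithmetic above can be checked with a short sketch. This is only an illustration in Python, not part of the PAPERS software; the function and parameter names (max_machines, added_delay_ns, machines_per_unit, fan_out, per_level_ns) are made up for this example. The delay function simply mirrors the "4 * 200ns" figure quoted for a four-level tree and treats 200ns as an upper bound per level; applying it to other depths is an assumption.

```python
def max_machines(levels, machines_per_unit=8, fan_out=5):
    """Upper bound on machines in a tree of PAPERS units with `levels` levels.

    Level 0 is the root unit; each unit at one level links to `fan_out`
    units at the next, and every unit connects `machines_per_unit` machines.
    """
    return sum(machines_per_unit * fan_out ** depth for depth in range(levels))


def added_delay_ns(levels, per_level_ns=200):
    """Approximate upper bound on added operation time, in nanoseconds.

    Assumes (as in the four-level example) one per-level delay for each level.
    """
    return levels * per_level_ns


if __name__ == "__main__":
    for levels in (2, 3, 4):
        print(levels, "levels:", max_machines(levels), "machines, about",
              added_delay_ns(levels), "ns added")
```

With the default fan-out of five and 8 machines per unit, this reproduces the 48, 248, and 1248 machine totals given above.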
This tree-structured expandability unfortunately implies that there are actually four distinct configurations of TTL_PAPERS 950801 boards: stand-alone, root node, internal node, and leaf node. Although the same board layout works for all four, there are significant population and wiring differences between these configurations. Further, we did not have space on the board for enough drivers for the internal node configuration, so separate driver boards are needed for very large clusters.