Hardware Barrier Synchronization
For A Cluster Of Personal Computers+
T. Muhammad
MS Thesis Defense
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907-1285
February 2, 1995
+ This work was supported in part by the Office of Naval
Research (ONR) under grant number N00014-91-J-4013.
- Treats a group of machines as a single parallel machine
- Workstations and PCs offer high performance at low cost
- Networks offer increasing bandwidth at decreasing cost
- Politically correct (use the machines you already have)
- Fine-grain parallelism needs low latency
- Networks are designed for bandwidth, not latency
- Minimum latency of 1,000-5,000 microseconds
- Need efficient barrier synchronization
- Data parallelism and SPMD workforce models
- SIMD and VLIW emulation
- Mengjou Lin et al. used an ATM cluster of 4 Suns to compare
  the performance of AAL5, API, PVM, and BSD stream sockets;
  latency was 800µs to 3,000µs
- Chengchang Huang used an ATM cluster of 11 SPARC-10s with PVM
  and AAL5; latency was >1,000µs
- NAS (NASA Ames) tested various clusters with PVM;
  latency was >700µs for IBM Allnode,
  2,000µs to 3,000µs for FDDI and Ethernet
- Thekkath proposed a "remote memory" model using 4 DEC R3000
  machines with ATM hardware; latency as low as 30µs to 45µs?
- Each processor signals its presence at the barrier
- It waits for all participating processors
- When all have arrived, execution resumes
- Software/messaging implementations are inefficient
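As a minimal sketch of these semantics (illustrative C, not PAPERS
code): a centralized sense-reversing software barrier, one of the
simple software schemes the hardware is competing against.

    #include <stdatomic.h>

    #define NPROC 4                    /* participating processors */

    static atomic_int count = NPROC;   /* arrivals still expected  */
    static atomic_int sense = 0;       /* flips once per barrier   */

    void barrier(void)
    {
        int my_sense = !atomic_load(&sense);
        if (atomic_fetch_sub(&count, 1) == 1) {
            /* last to arrive: reset and release everyone */
            atomic_store(&count, NPROC);
            atomic_store(&sense, my_sense);
        } else {
            /* spin until all have arrived */
            while (atomic_load(&sense) != my_sense)
                ;
        }
    }

Every arrival serializes through shared state (or, in a cluster,
through messages), which is exactly why a dedicated hardware AND
tree is so much faster.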
- First used by Harry Jordan in the FEM (Finite Element Machine), 1978:
  priority chain hardware
- Burroughs FMP:
  AND tree hardware, tree-node partitioning
- Fuzzy barrier by Gupta:
  delayed firing, overly complex hardware
- Thinking Machines CM-5:
  control network works like an AND tree
- Cray T3D:
  like the FMP, but 8 bits wide
- Everybody built static barrier hardware
- Static: one barrier stream
- Dynamic: multiple independent barrier streams
  (arbitrary partitioning of the machine)
- Any set of processors can be a barrier group
- A group can be represented by a bit mask
- The concept of arbitrary masks came from H. Dietz and T.
  Schwederski at Purdue in 1987 as an outgrowth of PASM SIMD
  enable logic
- The runtime partitioning method was not specified
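For example (an illustrative C encoding): with up to 32 processors
a group fits in one machine word, and the hardware AND tree simply
computes this test in parallel.

    typedef unsigned int mask_t;  /* one bit per processor (up to 32) */

    /* example group: processors 0, 2, and 3 */
    mask_t example_group = (1u << 0) | (1u << 2) | (1u << 3);

    /* arrived has bit i set once processor i signals the barrier */
    int all_present(mask_t arrived, mask_t group)
    {
        return (arrived & group) == group;
    }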
- 1987
  - Basic concepts of barrier MIMD
  - Compiler technology based on VLIW scheduling
- 1990
  - SBM using a "barrier processor" and mask queue
  - DBM using a "barrier processor" and associative mask memory
  - Compiler technology based on timing analysis
- 1993
  - Improved DBM design with runtime partitioning support
- 1987
  - PASM implements SBM
- 1987
  - CARP (Compiler-oriented Architecture Research at Purdue)
    machine design: barrier MIMD using custom VLSI
- 1993
  - CARDBoard (Compiler-oriented Architecture Research
    Demonstration Board) system design: DBM using RISC
    microprocessors on a custom board
- 1994
  - PAPERS (Purdue's Adapter for Parallel Execution and Rapid
    Synchronization) cluster design: DBM using PCs and an
    external adapter
  - PAPERS0 implements the improved DBM
- Initially intended to be an SBM
- Simple global AND hardware
- Extended load processor interface (like PASM)
- The load address is decoded as a barrier request
- The load does not complete (logic inserts memory wait states)
  until all processors are present
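To software, that load interface would look roughly like this (a
sketch; the address and decode are illustrative, not an actual CARD
memory map):

    /* hypothetical memory-mapped barrier location */
    #define BARRIER ((volatile unsigned *)0xB0000000u)

    static inline void barrier(void)
    {
        /* the load itself is the barrier request; hardware inserts
           wait states until every processor has issued its load */
        (void)*BARRIER;
    }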
- Implement "distributed" DBM hardware
- Replicate the OR-AND tree for each processor
- Uses barrier masks stored locally by each processor
- Extended load interface, but some address bits are decoded
  as the barrier mask
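Extending the earlier sketch (the field layout is an assumption for
illustration): the barrier group mask rides along in low-order
address bits of the load, so each barrier load names its partition.

    #define BARRIER_BASE 0xB0000000u   /* hypothetical region base */

    static inline void barrier_group(unsigned mask)
    {
        /* mask shifted up to keep the access word-aligned */
        volatile unsigned *addr = (volatile unsigned *)
            (BARRIER_BASE | ((unsigned long)mask << 2));
        (void)*addr;   /* stalls until all PEs named in mask arrive */
    }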
- Runtime partitioning seemed desirable, but:
  - Must use a separate data network to agree on new masks
  - Once partitioned, subgroups can't recombine
- Invented in October 1993
- Adds to the basic CARD DBM design:
  - The load address includes a one-bit flag value
  - The load returns a bit vector gathering the flag bits from
    all processors
  - Every processor sends its n-bit mask, each bit to the
    corresponding processor
- Partitioning doesn't need a separate data network
- Recombining subgroups works
- Wiring complexity goes from O(n) to O(n²)
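A sketch of the resulting primitive (the barrier_gather() interface
is an illustration, not the real PAPERS library): every barrier also
gathers one flag bit from each processor, which is all it takes to
repartition with no separate data network.

    /* assumed primitive: synchronize the PEs named in mask and
       return a bit vector with each PE's one-bit flag */
    extern unsigned barrier_gather(unsigned mask, int flag);

    /* split the current group in two: each PE announces its side,
       and the gathered vector *is* the new mask; recombining later
       is just a barrier on the old, wider mask */
    unsigned split_group(unsigned cur_mask, int join_side_a)
    {
        unsigned side_a = barrier_gather(cur_mask, join_side_a);
        return join_side_a ? side_a : (cur_mask & ~side_a);
    }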
- The CARD project was delayed waiting for:
  - Microprocessors (first the TMS320C30, then the AMD 29050
    and PowerPC 601)
  - Xilinx glue logic design support tools
  - More design and construction experience
    (no new hardware since 1987)
- The improved DBM is very new;
  we needed to test the design concepts
- Use standard PCs as processing elements
- The barrier unit is an external box connected to all PCs
- Use a custom interface card:
  ISA, EISA, VESA, or PCI
- Use a standard interface:
  RS232, parallel printer port, SCSI
- Why we use the parallel printer port:
  - Number of usable signal lines
  - Simplicity of the hardware interface (TTL logic levels)
  - Relatively easy direct software access to the port
    (very low latency)
- The Load interface cannot be implemented!
  - The printer port is mapped as an I/O device
  - Signals are controlled by data bus reads/writes of I/O registers
  - This requires a minimum of two port accesses, at 1-5
    microseconds each
- Running Linux (UNIX) on each processor blurs timing
  - Must use a memory element to ensure barrier GO signals are
    not missed
  - Must distinguish between barrier request and barrier seen
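Concretely, a barrier then looks something like this (a sketch for
Linux on x86; 0x378 is the usual LPT1 base, but the bit assignments
here are illustrative, not the actual PAPERS0 encoding):

    #include <sys/io.h>   /* inb(), outb(); needs ioperm(0x378, 3, 1) */

    #define LPT 0x378     /* data register; status is at LPT+1 */
    #define GO  0x80      /* illustrative: latched GO bit in status */

    void pport_barrier(unsigned char my_bits)
    {
        outb(my_bits, LPT);               /* access 1: signal arrival */
        while ((inb(LPT + 1) & GO) == 0)  /* access 2+: poll for GO   */
            ;                             /* each access costs 1-5µs  */
    }

The latch matters because Linux can deschedule a process between the
write and the read; without a memory element the GO pulse could come
and go unseen.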
- Needed to achieve initial synchronization
  (a known barrier state)
- To recover from program errors
  (including mask errors)
- To use PAPERS for parallel OS functions
  (without an additional network)
- The simplest form is a global OR of the IRQs from all PEs
- PAPERS is partitionable, so an interrupt should affect only
  the specified partition and use an AND-OR tree implementation
- Receiver determines which processors can interrupt it
- Requestor determines the group it will interrupt
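The two alternatives differ only in who holds the mask; in C-as-logic
form (illustrative, written as the AND-OR tree evaluated at each
receiver i):

    #define NPROC 4
    extern int req[NPROC];             /* PE j's interrupt request */
    extern unsigned recv_mask[NPROC];  /* receiver-held masks      */
    extern unsigned req_mask[NPROC];   /* requestor-held masks     */

    /* receiver-masked: PE i chooses who may interrupt it */
    int irq_recv(int i) {
        int irq = 0;
        for (int j = 0; j < NPROC; j++)
            irq |= req[j] & (int)((recv_mask[i] >> j) & 1);
        return irq;
    }

    /* requestor-masked: PE j names the group it interrupts */
    int irq_req(int i) {
        int irq = 0;
        for (int j = 0; j < NPROC; j++)
            irq |= req[j] & (int)((req_mask[j] >> i) & 1);
        return irq;
    }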
- Connects four PCs or workstations using the improved DBM
- For each processor:
  - One PLA implements the barrier and interrupt logic
  - Data bits are buffered through TTL drivers
  - A 10-LED status display
- Connection to the PC is made via a Centronics printer cable
- The PLAs have common internal logic but different connections
  between chips
- It really works... with speed limited by the slow ports
- Low-latency barrier synchronization
- Low-latency data communication
- Fine-grain MIMD and SPMD with barriers
- Fine-grain SIMD emulation (see the sketch below)
- Fine-grain VLIW emulation
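For example, SIMD emulation on barrier MIMD amounts to a barrier
after every emulated instruction (a sketch; execute_step() and the
enable flag are hypothetical names, with PASM-style enable masking):

    extern void execute_step(int pc);   /* hypothetical per-PE work */

    void simd_emulate(int nsteps, int enabled, unsigned all_pes)
    {
        for (int pc = 0; pc < nsteps; pc++) {
            if (enabled)
                execute_step(pc);    /* disabled PEs skip the work */
            barrier_group(all_pes);  /* lockstep after every step  */
        }
    }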
- What do we mean by fine grain?
- The minimum PAPERS0 communication takes 11µs on a 4 MFLOPS machine
- PAPERS0 grain size is about 44 FLOPs
- Newer PAPERS units are at 2.5µs, or about 10 FLOPs
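The grain-size figure is simply latency times peak rate:

    $\text{grain} \approx t_{\text{comm}} \times R_{\text{peak}}
        = 11\,\mu\text{s} \times 4\,\text{MFLOPS} = 44\ \text{FLOPs}$

That is, the work a PE could have done while synchronizing; at 2.5µs
the same machine wastes only about 10 FLOPs per operation.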
- The improved DBM needs one-bit multi-broadcast; thus, PAPERS0
  can implement any aggregate communication function without
  routing conflicts (see the sketch below)
- Later versions of PAPERS expand this:
  - ANY and ALL tests
  - Multibit global OR, multibroadcast
  - Voting operations
- Voting operations allow PAPERS to be used for scheduling
  access to other resources
  (e.g., a high-bandwidth network)
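As a sketch of why one bit per barrier suffices (reusing the assumed
barrier_gather() interface from above): any word-wide aggregate can
be bit-sliced across successive barriers, e.g. a multibit global OR.

    /* word-wide global OR: bit b of the result is set iff any PE
       in the group had bit b set in its value */
    unsigned global_or(unsigned mask, unsigned my_val, int bits)
    {
        unsigned result = 0;
        for (int b = 0; b < bits; b++)
            if (barrier_gather(mask, (my_val >> b) & 1) != 0)
                result |= 1u << b;
        return result;
    }

ANY/ALL and voting fall out the same way: ANY is a nonzero gather,
ALL compares the gather against the mask, and a vote is just the
gathered bit vector itself.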
- "Standard" printer port?
- Driving TTL levels through lousy 10' cables and too many
  connectors...
- Can go from 4 or 5 port accesses per operation to just 2
- Fancy LED displays help debugging, but...
- A cheap AC adapter + 7805 regulator works as well as a $50
  power supply
- Wire wrap creates debugging problems
- PAPERS is an effective way to interconnect computers in a
  cluster
- PAPERS0 is the first system to make a PC/workstation cluster
  capable of fine-grain mixed-mode parallel execution
- Experiments with PAPERS0 have thus far spawned 5 generations
  of other PAPERS prototypes, including a publicly demonstrated
  and released simplified version (TTL_PAPERS)
- Implementations for larger clusters:
  - TTL_PAPERS SBM design for 32 processors
  - Full DBM PAPERS design for 16 processors
  - Plans to scale to 128 or 256 processors within this year
- Design of a minimal custom interface board that will allow
  the Load interface to be used with existing PAPERS designs
- A high-performance PCI-interface PAPERS... hopefully leading
  to CARDBoard and perhaps even a CARP machine
- Use of the unidirectional parallel printer port for parallel
  processing
- Simplified construction of the prototypes
- Portably yields good performance
- I played a leading role in designing, implementing, and
  debugging the first DBM ever built
- My thesis explains how and why the barrier mechanism evolved
  into the improved DBM
- PAPERS works very well, and has been very well received by
  the parallel processing research community:
  - Equipment loans/donations from TI, IBM, DEC, etc.
  - A 20'x20' research exhibit at IEEE/ACM Supercomputing '94
  - Publications
- H. G. Dietz, T. M. Chung, T. I. Mattox, and T. Muhammad,
  "Purdue's Adapter for Parallel Execution and Rapid
  Synchronization: The TTL_PAPERS Design," submitted to the
  International Conference on Parallel Processing, August 1995.
- H. G. Dietz, T. Muhammad, and T. I. Mattox, TTL Implementation
  of Purdue's Adapter for Parallel Execution and Rapid
  Synchronization, Purdue University School of Electrical
  Engineering, December 1994.
- H. G. Dietz, W. E. Cohen, T. Muhammad, and T. I. Mattox,
  "Compiler Techniques For Fine-Grain Execution On Workstation
  Clusters Using PAPERS," 7th Annual Workshop on Languages and
  Compilers for Parallel Computing (also to appear as a book
  chapter from Springer-Verlag), pp. 3.1-3.15, Cornell
  University, August 1994.
- H. G. Dietz, T. Muhammad, J. B. Sponaugle, and T. I. Mattox,
  PAPERS: Purdue's Adapter for Parallel Execution and Rapid
  Synchronization, Purdue University School of Electrical
  Engineering, Technical Report TR-EE 94-11, March 1994.
Hypertext Index
- Clustering
- Parallel Processing
- Cluster Computing Latencies
- Barrier Synchronization
- Other Hardware Barriers
- Static Vs. Dynamic Barriers
- Barrier Groups (Masks)
- Theoretical Barrier Work At Purdue
- Experimental Barrier Work At Purdue
- Original Concept Of CARD Barrier Hardware
- Basic CARD DBM Design
- Problems With The Basic CARD DBM Design
- Improved CARD DBM Design
- Improved CARD DBM Design
- Why PAPERS?
- Generic PAPERS Cluster Concept
- What Hardware Interface?
- Problems
- PAPERS Barrier Logic (For One Processor)
- Parallel Interrupts?
- Interrupt Architecture For PAPERS
- The Two Interrupt Mask Alternatives
- Generic PAPERS Block Diagram
- PAPERS0 Implementation
- PAPERS0 Logic Schematic
- PAPERS0 Display Schematic
- Performance Of PAPERS0
- What Can A PAPERS Cluster Do?
- PAPERS As A Communication Network
- Lessons From PAPERS0
- Conclusion
- Future Work
- Significance Of My Contribution
- Publications From This Work