Global Info For Parallel Processing

What A Little Global Information Can Do For Parallel Processing

Hank Dietz
Assoc. Prof. of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907-1285
hankd@ecn.purdue.edu
http://dynamo.ecn.purdue.edu/~hankd

These slides are taken from a presentation given at UAH on Oct. 9, 1995.

Abstract

To be able to achieve speedup through parallel execution, a program must be able to be decomposed into sections of code that can execute simultaneously, each on its own processing element. However, simultaneous execution does not imply independent execution; all sections of the code are really working together to produce the program's unified, coherent, final state. The odd thing is that most parallel computing hardware and software models do not directly provide means for accessing this global, aggregate, information.

PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) is fast, simple, hardware that provides such access for a cluster of PCs or workstations. This talk begins by outlining the PAPERS hardware and the synchronized aggregate communication library developed for it. This very strong execution model can be used to efficiently implement a variety of weaker (but more familiar) models, including SIMD and VLIW control flow, coherent shared memory, and even asynchronous message-passing. Both the implementations and performance measurements will be presented.

Parallel Computation Models

Execution mode
- MIMD (really SPMD)
- SIMD
- VLIW
Communication/synchronization model
- Shared memory/locks or semaphores
- Message passing/signals or active messages
- Aggregate functions/barriers

What Is PAPERS?

Purdue's Adapter for Parallel Execution and Rapid Synchronization
Hardware support for parallel processing on a cluster
- Typical latency of a few usec
- Efficient SIMD and VLIW emulation
Works with almost any workstations and/or PCs
Public domain hardware design and software

Skip The Advertising -- What Is PAPERS?

PAPERS is a barrier synchronization unit:
1. Signal present at barrier
2. Wait for all participating PEs
3. When all present, resume execution
Aggregate communication is done only as a side-effect of a barrier synchronization
Parallel interrupts/signals are also supported

Aggregate Communication

As a side-effect of a barrier, sample data from all PEs:
- Voting operations
- ANY and ALL tests
- Global OR, AND, or NAND
- Multibroadcast
- Anything else that's useful...
Statically-scheduled communication!

What's Inside TTL_PAPERS?

Synchronization hardware
- Pass
- Barrier & anti-barrier
- Lock
Data transmission hardware
- 4-bit NAND (and that's all!)

Parallel interrupt logic

Any PE requests interrupt
All PEs acknowledge

4 PE TTL_PAPERS

Inside 4 PE TTL_PAPERS

8 PE TTL_PAPERS

Measured Latency

Using TTL_PAPERS, 486DX33, Linux 1.1.75
Standard printer port connection

TTL_PAPERS Operation	4 PE Latency (usec)	`n` PE Latency (usec)
Partition	0.1	0.1
Barrier sync.	2.5	2.5
ANY test	6.3	6.3
ALL test	6.3	6.3
4-bit Global OR	6.3	6.3
1-bit Multibroadcast	6.3	6.3*ceil(`n`/4)

The PAPERS Library Execution Model

Communication is a side-effect of barrier sync.
- All PEs together effect communication
- Communication role of each PE is determined by that PE
- Aggregate functions are computed in parallel within the "network"
Very low latency hardware and interface
No software layering
Barrier synchronization and voting operations supporting MIMD, SIMD, and VLIW execution modes

How Is The Library Used?

Portable C code optimized for Gnu C compiler
- Hand coding using C, Fortran (via f2c), or Pascal (via p2c)
- Compiler target using C as a "portable assembly language"
  - Tuples using register temporaries
  - Instruction-level analysis easily parameterized for machine differences, even heterogeneous clusters
  - E.g., PAPERS MPL compiler (Will Cohen at UAH)
Usable for fine-grain MIMD, SIMD, and VLIW
Can be used to augment PVM, MPI, etc.

Data Types

A datum from each processor, not blocks from memory
Data types supported:
- 64 and 32 bit floating point
- 64, 32, 16, and 8 bit signed integer
- 64, 32, 16, 8, and 1 bit unsigned integer
- The k least significant bits of a 64, 32, 16, and 8 bit unsigned integer
Data format conversions for heterogeneous clusters occur invisibly when portable type definitions are used

Library Summary

p_init()
p_exit(ecode)
p_error(mesg)
p_enqueue(mask)
p_wait()
p_waitvec(flag)
p_any(flag)
p_all(flag)
p_putgettype(datum, source)
p_gathertype(array, datum)
p_reduceOptype(datum)
p_bcastPuttype(datum)
p_bcastGettype()
p_scanOptype(datum)
p_rankOptype(datum)
p_population()
p_enumerate()
p_selectFirst()
p_selectOne()
p_voteCount(vote)
p_vote(vote)
p_matchCounttype(vote)
p_matchtype(vote)

Performance

Machine Barrier Sync. 32-bit Putget (permutation) 64-bit Broadcast

MasPar MP-1 0.1 44.0 31.0

Cray T3D (PVM) 21.0 82.0

486 Linux (TTL_PAPERS) 2.5 216.0 137.0

Intel Paragon XP/S 530.0 700.0 210.0

486 Linux (Ethernet PVM3) 49,000.0 100,000.0 40,000.0

Machine	Barrier Sync.	32-bit Putget (permutation)	64-bit Broadcast
MasPar MP-1	0.1	44.0	31.0
Cray T3D (PVM)	21.0		82.0
486 Linux (TTL_PAPERS)	2.5	216.0	137.0
Intel Paragon XP/S	530.0	700.0	210.0
486 Linux (Ethernet PVM3)	49,000.0	100,000.0	40,000.0

Implementing SIMD Execution

All SIMD PEs execute the same instruction at the same time
- All PEs get identical code
- Barriers inserted to ensure timing
If any PE executes a block, all must
PEs can be disabled (SIMD masking)

SIMD Masking

Two ways to be disabled for c=a+b;
- if (!disabled) c=a+b;
- c=(((a+b) & -!disabled)|(c & ~-!disabled));
How to track enable nesting?
- Use an enable stack
- Count disable nesting depth [Keryell & Paris]
What happens if no PE is enabled?
- Execute code anyway (C*)
- Jump over code (MPL)
MPL's SIMD if
```
if (parallel_expr) {
    stat_a;
} else {
    stat_b;
}
```
C + PAPERS Library Coding
```
    if (disabled) ++disabled;
    EVAL_ENABLED {
        if (!parallel_expr) disabled = 1;
    }
    if (p_all(disabled)) goto Else;
    EVAL_ENABLED {
        stat_a;
    }
Else:
    if (disabled < 2) disabled ^= 1;
    if (p_all(disabled)) goto Exit;
    EVAL_ENABLED {
        stat_b;
    }
Exit:
    if (disabled) --disabled;
```
MPL's SIMD while
```
while(parallel_expr) {
    stat;
}
```
C + PAPERS Library Coding
```
    if (disabled) ++disabled;
Loop:
    EVAL_ENABLED {
        if (!parallel_expr)
            disabled = 1;
    }
    if (p_all(disabled)) goto Exit;
    EVAL_ENABLED {
        stat;
    }
    goto Loop;
Exit:
    --disabled;
```
MPL's SIMD router (RVal) Communication
```
plural int x;
x = router[parallel_expr1].parallel_expr2;
```
- Only PEs selected by active PEs evaluate parallel_expr2
- Result stored only in active PEs
C + PAPERS Library Coding
```
t1 = parallel_expr1;
{
    register int disabled_tmp = disabled;
    disable = !(p_vote(disabled ? NPROC : t1) & OUR_MASK);
    EVAL_ENABLED {
        t2 = parallel_expr2;
    }
    disabled = disabled_tmp;
}
t = p_putget32(t2, t1);
if (!disabled) x = t;
```
Implementing VLIW Execution
- Each PE has only the operations it executes
- All PEs have the same multi-way branch code
  - Use p_any() for binary branches
  - Use p_reduceOr8() for multi-way branches
- Use p_putgetType() to simulate a "shared register file"
Implementing Shared Memory
- There is more than one type...
  - Each PE has its own vars, others can access
  - All PEs see the same vars
- Access time a fn of address?
- Coherence?
  - It's up to the programmer (i.e., incoherent)
  - Various flavors of "weak" coherence
  - Fully coherent
- Atomicity?
Some Fine Points
- If you have enough local memory, all PEs see the same vars is a better model....
- If MIMD PEs are truly operating on local time, can global coherence exist?
- Can arbitrary size objects be managed atomically without serializing all memory access?
Synchronous Shared Memory
- When a PE wants to write, it sends a parallel signal and waits for a barrier
- If a PE sees a signal pending (by polling or hardware interrupt), it joins the barrier
- Once all PEs have joined, the PEs vote on who wants to write
- In turn, each writer broadcasts to all PEs
- If a writer sees overlapping data, it adjusts its write request to preserve atomicity in resolving the race
Performance of Synchronous Shared Memory
- Implemented using C++ classes for "shared" data types
  - Lvalue store causes s_write() call
  - Rvalue fetch causes s_poll() call (~1 usec)
- Parameters:
  - 32-bit shared memory address
  - 32-bit datum size (in bytes)
  - Arbitrary data values
- For a 32-bit object across 8 386DX33:
  - Read takes ~1 usec
  - Write ranged from 61 to 302 usec
    (32-bit broadcast is ~130 usec!)
  Shared Memory Synchronization
  - Barrier synchronization is a null write!
    (about 10 usec)
  - Trivial, due to synchronous race resolution:
    
    Locks
    Combining operations (reductions)
  Shared Memory Lock Code
```
void
s_lock(s_locktype *lock)
{
  do {
    /* Wait for lock to be open... */
    while (*lock != S_LOCKOPEN) {
      s_poll();
    }

    /* Try to grab it for us */
    *lock = IPROC;
    s_write(((uint8 *) lock), sizeof(*lock));

    /* Continue looping if we lost the race... */
  } while (*lock != IPROC);
}
```
  Asynchronous Message Passing
  - A very common model these days...
    
    PVM (Parallel Virtual Machine)
    MPI (Message Passing Interface)
  - PE-to-PE, memory buffer to memory buffer
  - Not a good match for our hardware...?
  Implementation of PVM
  - When a PE wants to send a message, it sends a parallel signal and waits for a barrier
  - If a PE sees a signal pending, it joins the barrier
  - Once all PEs have joined, the PEs vote on who wants to send a message
  - In turn, each sender broadcasts its send opcode and then sends its message + tag (no source PE id!)
  - Buffering problems are minimized by perfect global flow control
  Performance of PVM Implementation
  - Reimplementation not quite done...
    (not simply porting PVM!)
  - No daemon processes
  - Buffer management is simplified, but overhead is still high
  - Time is ~30 usec + data transmit time
  - Group operations take about the same time as PE-to-PE send
  Summary
  - Barrier synchronization is a good thing
    
    Establishes global time
    Creates a globally coherent state
    Allows N-way interactions (e.g., voting, multicast) to test properties of the global state
    Facilitates run-time static scheduling
  - Barrier synchronization is easy to implement
    
    4 PE TTL_PAPERS unit uses just 8 TTL chips
    See http://garage.ecn.purdue.edu/~papers
  - Building "parallel computers" that cannot directly access a global state doesn't make sense

What A Little Global Information Can Do For Parallel Processing

Abstract

Parallel Computation Models

What Is PAPERS?

Skip The Advertising -- What Is PAPERS?

Aggregate Communication

What's Inside TTL_PAPERS?

4 PE TTL_PAPERS

Inside 4 PE TTL_PAPERS

8 PE TTL_PAPERS

Measured Latency

The PAPERS Library Execution Model

How Is The Library Used?

Data Types

Library Summary

Performance

Implementing SIMD Execution

SIMD Masking

MPL's SIMD `if`

C + PAPERS Library Coding

MPL's SIMD `while`

C + PAPERS Library Coding

MPL's SIMD `router` (RVal) Communication

C + PAPERS Library Coding

Implementing VLIW Execution

Implementing Shared Memory

Some Fine Points

Synchronous Shared Memory

Performance of Synchronous Shared Memory

Shared Memory Synchronization

Shared Memory Lock Code

Asynchronous Message Passing

Implementation of PVM

Performance of PVM Implementation

Summary

What A Little Global Information Can Do For Parallel Processing

Abstract

Parallel Computation Models

What Is PAPERS?

Skip The Advertising -- What Is PAPERS?

Aggregate Communication

What's Inside TTL_PAPERS?

4 PE TTL_PAPERS

Inside 4 PE TTL_PAPERS

8 PE TTL_PAPERS

Measured Latency

The PAPERS Library Execution Model

How Is The Library Used?

Data Types

Library Summary

Performance

Implementing SIMD Execution

SIMD Masking

MPL's SIMD if

C + PAPERS Library Coding

MPL's SIMD while

C + PAPERS Library Coding

MPL's SIMD router (RVal) Communication

C + PAPERS Library Coding

Implementing VLIW Execution

Implementing Shared Memory

Some Fine Points

Synchronous Shared Memory

Performance of Synchronous Shared Memory

Shared Memory Synchronization

Shared Memory Lock Code

Asynchronous Message Passing

Implementation of PVM

Performance of PVM Implementation

Summary

MPL's SIMD `if`

MPL's SIMD `while`

MPL's SIMD `router` (RVal) Communication