Global Info For Parallel Processing
What A Little Global Information
Can Do For Parallel Processing
Hank Dietz
Assoc. Prof. of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907-1285
hankd@ecn.purdue.edu
http://dynamo.ecn.purdue.edu/~hankd
These slides are taken from a presentation given at UAH
on Oct. 9, 1995.
Abstract
To be able to achieve speedup through parallel execution, a program
must be able to be decomposed into sections of code that can execute
simultaneously, each on its own processing element. However,
simultaneous execution does not imply independent execution; all
sections of the code are really working together to produce the
program's unified, coherent, final state. The odd thing is that most
parallel computing hardware and software models do not directly
provide means for accessing this global, aggregate, information.
PAPERS (Purdue's Adapter for Parallel Execution and Rapid
Synchronization) is fast, simple, hardware that provides such access
for a cluster of PCs or workstations. This talk begins by outlining
the PAPERS hardware and the synchronized aggregate communication
library developed for it. This very strong execution model can be
used to efficiently implement a variety of weaker (but more familiar)
models, including SIMD and VLIW control flow, coherent shared memory,
and even asynchronous message-passing. Both the implementations and
performance measurements will be presented.
Parallel Computation Models
-
Execution mode
-
MIMD (really SPMD)
-
SIMD
-
VLIW
-
Communication/synchronization model
-
Shared memory/locks or semaphores
-
Message passing/signals or active messages
-
Aggregate functions/barriers
What Is PAPERS?
-
Purdue's Adapter for Parallel Execution and Rapid Synchronization
-
Hardware support for parallel processing on a cluster
-
Typical latency of a few usec
-
Efficient SIMD and VLIW emulation
-
Works with almost any workstations and/or PCs
-
Public domain hardware design and software
Skip The Advertising -- What Is PAPERS?
-
PAPERS is a barrier synchronization unit:
-
Signal present at barrier
-
Wait for all participating PEs
-
When all present, resume execution
-
Aggregate communication is done only as a
side-effect of a barrier synchronization
-
Parallel interrupts/signals are also supported
Aggregate Communication
-
As a side-effect of a barrier, sample data from all PEs:
-
Voting operations
-
ANY and ALL tests
-
Global
OR, AND, or NAND
-
Multibroadcast
-
Anything else that's useful...
-
Statically-scheduled communication!
What's Inside TTL_PAPERS?
-
Synchronization hardware
-
Pass
-
Barrier & anti-barrier
-
Lock
-
Data transmission hardware
-
4-bit NAND (and that's all!)
-
Parallel interrupt logic
-
Any PE requests interrupt
-
All PEs acknowledge
4 PE TTL_PAPERS
Inside 4 PE TTL_PAPERS
8 PE TTL_PAPERS
Measured Latency
-
Using TTL_PAPERS, 486DX33, Linux 1.1.75
-
Standard printer port connection
| TTL_PAPERS Operation |
4 PE Latency (usec) |
n PE Latency (usec) |
| Partition |
0.1 |
0.1 |
| Barrier sync. |
2.5 |
2.5 |
| ANY test |
6.3 |
6.3 |
| ALL test |
6.3 |
6.3 |
| 4-bit Global OR |
6.3 |
6.3 |
| 1-bit Multibroadcast |
6.3 |
6.3*ceil(n/4) |
The PAPERS Library Execution Model
-
Communication is a side-effect of barrier sync.
-
All PEs together effect communication
-
Communication role of each PE is determined by that PE
-
Aggregate functions are computed in parallel within the "network"
-
Very low latency hardware and interface
-
No software layering
-
Barrier synchronization and voting operations supporting MIMD, SIMD,
and VLIW execution modes
How Is The Library Used?
-
Portable C code optimized for Gnu C compiler
-
Hand coding using C, Fortran (via f2c), or Pascal (via p2c)
-
Compiler target using C as a "portable assembly language"
-
Tuples using
register temporaries
-
Instruction-level analysis easily parameterized for machine
differences, even heterogeneous clusters
-
E.g., PAPERS MPL compiler (Will Cohen at UAH)
-
Usable for fine-grain MIMD, SIMD, and VLIW
-
Can be used to augment PVM, MPI, etc.
Data Types
-
A datum from each processor, not blocks from memory
-
Data types supported:
-
64 and 32 bit floating point
-
64, 32, 16, and 8 bit signed integer
-
64, 32, 16, 8, and 1 bit unsigned integer
-
The k least significant bits of a 64, 32, 16, and 8 bit unsigned
integer
-
Data format conversions for heterogeneous clusters occur invisibly when
portable type definitions are used
Library Summary
p_init()
p_exit(ecode)
p_error(mesg)
p_enqueue(mask)
p_wait()
p_waitvec(flag)
p_any(flag)
p_all(flag)
p_putgettype(datum, source)
p_gathertype(array, datum)
p_reduceOptype(datum)
p_bcastPuttype(datum)
p_bcastGettype()
p_scanOptype(datum)
p_rankOptype(datum)
p_population()
p_enumerate()
p_selectFirst()
p_selectOne()
p_voteCount(vote)
p_vote(vote)
p_matchCounttype(vote)
p_matchtype(vote)
Performance
| Machine |
Barrier Sync. |
32-bit Putget (permutation) |
64-bit Broadcast |
| MasPar MP-1 |
0.1 |
44.0 |
31.0 |
| Cray T3D (PVM) |
21.0 |
|
82.0 |
| 486 Linux (TTL_PAPERS) |
2.5 |
216.0 |
137.0 |
| Intel Paragon XP/S |
530.0 |
700.0 |
210.0 |
| 486 Linux (Ethernet PVM3) |
49,000.0 |
100,000.0 |
40,000.0 |
Implementing SIMD Execution
-
All SIMD PEs execute the same instruction at the same time
-
All PEs get identical code
-
Barriers inserted to ensure timing
-
If any PE executes a block, all must
-
PEs can be disabled (SIMD masking)
SIMD Masking
-
Two ways to be disabled for
c=a+b;
-
if (!disabled) c=a+b;
-
c=(((a+b) & -!disabled)|(c & ~-!disabled));
-
How to track enable nesting?
-
Use an enable stack
-
Count disable nesting depth [Keryell & Paris]
-
What happens if no PE is enabled?
-
Execute code anyway (C*)
-
Jump over code (MPL)
MPL's SIMD if
if (parallel_expr) {
stat_a;
} else {
stat_b;
}
C + PAPERS Library Coding
if (disabled) ++disabled;
EVAL_ENABLED {
if (!parallel_expr) disabled = 1;
}
if (p_all(disabled)) goto Else;
EVAL_ENABLED {
stat_a;
}
Else:
if (disabled < 2) disabled ^= 1;
if (p_all(disabled)) goto Exit;
EVAL_ENABLED {
stat_b;
}
Exit:
if (disabled) --disabled;
MPL's SIMD while
while(parallel_expr) {
stat;
}
C + PAPERS Library Coding
if (disabled) ++disabled;
Loop:
EVAL_ENABLED {
if (!parallel_expr)
disabled = 1;
}
if (p_all(disabled)) goto Exit;
EVAL_ENABLED {
stat;
}
goto Loop;
Exit:
--disabled;
MPL's SIMD router (RVal) Communication
plural int x;
x = router[parallel_expr1].parallel_expr2;
-
Only PEs selected by active PEs evaluate parallel_expr2
-
Result stored only in active PEs
C + PAPERS Library Coding
t1 = parallel_expr1;
{
register int disabled_tmp = disabled;
disable = !(p_vote(disabled ? NPROC : t1) & OUR_MASK);
EVAL_ENABLED {
t2 = parallel_expr2;
}
disabled = disabled_tmp;
}
t = p_putget32(t2, t1);
if (!disabled) x = t;
Implementing VLIW Execution
-
Each PE has only the operations it executes
-
All PEs have the same multi-way branch code
-
Use
p_any() for binary branches
-
Use
p_reduceOr8() for multi-way branches
-
Use
p_putgetType() to simulate
a "shared register file"
Implementing Shared Memory
-
There is more than one type...
-
Each PE has its own vars, others can access
-
All PEs see the same vars
-
Access time a fn of address?
-
Coherence?
-
It's up to the programmer (i.e., incoherent)
-
Various flavors of "weak" coherence
-
Fully coherent
-
Atomicity?
Some Fine Points
-
If you have enough local memory, all PEs see the same vars is a better
model....
-
If MIMD PEs are truly operating on local time, can global coherence
exist?
-
Can arbitrary size objects be managed atomically without serializing
all memory access?
Synchronous Shared Memory
-
When a PE wants to write, it sends a parallel signal
and waits for a barrier
-
If a PE sees a signal pending (by polling or hardware interrupt),
it joins the barrier
-
Once all PEs have joined, the PEs vote on who wants to write
-
In turn, each writer broadcasts to all PEs
-
If a writer sees overlapping data, it adjusts its write request to
preserve atomicity in resolving the race
Performance of Synchronous Shared Memory
-
Implemented using C++ classes for "shared" data types
-
Lvalue store causes s_write() call
-
Rvalue fetch causes s_poll() call (~1 usec)
-
Parameters:
-
32-bit shared memory address
-
32-bit datum size (in bytes)
-
Arbitrary data values
-
For a 32-bit object across 8 386DX33:
-
Read takes ~1 usec
-
Write ranged from 61 to 302 usec
(32-bit broadcast is ~130 usec!)
Shared Memory Synchronization
-
Barrier synchronization is a null write!
(about 10 usec)
-
Trivial, due to synchronous race resolution:
-
Locks
-
Combining operations (reductions)
Shared Memory Lock Code
void
s_lock(s_locktype *lock)
{
do {
/* Wait for lock to be open... */
while (*lock != S_LOCKOPEN) {
s_poll();
}
/* Try to grab it for us */
*lock = IPROC;
s_write(((uint8 *) lock), sizeof(*lock));
/* Continue looping if we lost the race... */
} while (*lock != IPROC);
}
Asynchronous Message Passing
-
A very common model these days...
-
PVM (Parallel Virtual Machine)
-
MPI (Message Passing Interface)
-
PE-to-PE, memory buffer to memory buffer
-
Not a good match for our hardware...?
Implementation of PVM
-
When a PE wants to send a message, it sends a parallel signal
and waits for a barrier
-
If a PE sees a signal pending, it joins the barrier
-
Once all PEs have joined, the PEs vote on who wants to send
a message
-
In turn, each sender broadcasts its send opcode and then sends
its message + tag (no source PE id!)
-
Buffering problems are minimized by perfect global flow control
Performance of PVM Implementation
-
Reimplementation not quite done...
(not simply porting PVM!)
-
No daemon processes
-
Buffer management is simplified, but overhead is still high
-
Time is ~30 usec + data transmit time
-
Group operations take about the same time as PE-to-PE send
Summary
-
Barrier synchronization is a good thing
-
Establishes global time
-
Creates a globally coherent state
-
Allows N-way interactions (e.g., voting, multicast)
to test properties of the global state
-
Facilitates run-time static scheduling
-
Barrier synchronization is easy to implement
-
Building "parallel computers" that cannot directly access a
global state doesn't make sense