The following article originally appeared in the September 6, 1996, issue of HPCwire. It is reproduced here with their permission.

by Alan Beck, editor in chief                                         HPCwire

  West Lafayette, Indiana -- PAPERS (Purdue's Adapter for Parallel Execution
and Rapid Synchronization) is custom hardware that allows a cluster of
unmodified PCs and/or workstations to function as a fine-grain parallel
computer capable of MIMD, SIMD, and VLIW execution. Its developers assert
that the total time taken to perform a typical barrier synchronization
using PAPERS is about 3 microseconds, and a wide range of aggregate
communication operations are also supported with very low latency.

  In order to learn more about the functionality and pragmatic potential of
PAPERS, HPCwire interviewed Hank Dietz, associate professor of electrical and
computer engineering at Purdue University and a principal PAPERS developer.
Following are selected excerpts from that discussion.


  HPCwire: Please tell us about the background and fundamental concepts of
PAPERS.

  DIETZ: "I've worked in compilers for a long time. In doing very fine-
grained compiler code scheduling and timing analysis, we found that the most
useful hardware construct was a barrier synchronization mechanism. About
three years ago, we found a way to build a very efficient one that could plug
into standard PCs or workstations. That's how the PAPERS project was born.

  "This barrier mechanism doesn't just do barriers, however. In order to
implement the full barrier mechanism that we wanted, we also had to build in
some communication originally intended for collecting and transmitting
barrier group masks -- basically bit masks saying which processors were
involved in each barrier. The hardware that does that turns out to be a
special case of a more general notion called aggregate function
communication, where as a side effect of every barrier synchronization each
processor can put out a piece of data and specify exactly what function of
the data collected from all processors in the barrier group it would like to
receive in return.

  "So basically PAPERS is a very simple piece of custom hardware that plugs
into a group of workstations, PCs or other machines and gives not just a very
low-latency communication mechanism but also one capable of sampling global
state very cheaply. With PAPERS, one operation is sufficient to sample
everybody's state."
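
  As an illustration only, the aggregate function model Dietz describes can
be sketched in software. The Python below is a single-process model, not
PAPERS hardware and not AFAPI code. The name aggregate_barrier, the encoding
of the barrier group mask as an integer, and the sample values are all
assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical software model of one aggregate function operation.
# Nothing here is PAPERS hardware or the AFAPI library.
Aggregate = Callable[[List[int]], int]


def aggregate_barrier(mask: int,
                      data: Sequence[int],
                      wanted: Dict[int, Aggregate]) -> Dict[int, int]:
    """Model one barrier: bit i of `mask` set means PE i is in the group.

    Each PE in the group contributes data[i] and names, in wanted[i], the
    function of the collected data that it wants to receive in return.
    """
    members = [pe for pe in range(len(data)) if (mask >> pe) & 1]
    missing = [pe for pe in members if pe not in wanted]
    if missing:
        raise ValueError(f"no function requested by PEs {missing}")
    collected = [data[pe] for pe in members]  # data from the group only
    return {pe: wanted[pe](collected) for pe in members}


# Four PEs; the mask 0b1011 puts PEs 0, 1 and 3 in the barrier group.
data = [5, 2, 9, 7]
results = aggregate_barrier(0b1011, data, {0: sum, 1: max, 3: min})
print(results)
```

  In this model PE 2 is outside the group, so its value never reaches the
others, while each participating PE gets back a different function of the
same collected data from a single operation.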

  HPCwire: Is PAPERS a unique technology?

  DIETZ: "So far. Normally people talk about shared memory and message
passing. Even though the PAPERS logic -- the aggregate function model -- can
fake either of those, it's fundamentally a different computation model in
terms of the interaction between parallel-processing elements." 

  HPCwire: Your literature claims that PAPERS can turn a NOW (network of
workstations) into something virtually indistinguishable from a
supercomputer. Isn't this an overstatement?

  DIETZ: "It's an overstatement in the sense that obviously if you take a
bunch of 386s and tie them together, you still have a bunch of 386s. But it's
not an overstatement in this sense: Most of the differences between a
traditional NOW and a traditional supercomputer revolve around the fact that
the latter has very low-latency communication and a way of sampling the
global state, so you have a cheap way of effecting global control -- for
example, SIMD and VLIW execution models. It turns out that PAPERS actually
does provide that.

  "True, we're not quite as fast as some supercomputers have been in
providing those functions. But we're much closer to the fastest
supercomputers than to the traditional NOWs. And, in fact, PAPERS provides
much faster global control than some supercomputers, such as the Intel
Paragon or IBM SP2."

  HPCwire: Can you document these claims with benchmark figures?

  DIETZ: "We have performed benchmarks against the Paragon and other
machines. The figures are posted on-line and have also been published
in several papers. For the Paragon, minimum communication time between a
couple of processors is on the order of a couple of hundred microseconds,
whereas a barrier sync across everybody on the PAPERS unit is on the order of
three microseconds. Other kinds of aggregate communication operations take
no more than tens of microseconds."

  HPCwire: How many workstations can be linked before there's a slowdown?

  DIETZ: "Normally people think about networks as switched. PAPERS has no
switching. Consequently, when you scale it up, all you're adding is wire and
gate delays: there are no extra logic delays, buffering stages or switching
stages. We've already built prototypes that can literally scale up to
thousands of processors, and the slowdown on the basic operations is on the
order of one or two hundred nanoseconds."

  HPCwire: Aren't there peculiar programming challenges involved?

  DIETZ: "Absolutely. Most people use PVM or MPI as the programming
environment for workstation clusters. There, typically one PE initiates, and
one PE receives. That's not the way PAPERS works. Our programming
environment, AFAPI (Aggregate Function Application Program Interface), has a
full set of operations and can even do things that look like ordinary network
communications -- for example, a permutation communication across all the
processors. The catch is that it's not one operation per processor but one
operation involving all the processors. 

  "Let's say I want to do an arbitrary multibroadcast. Each processor outputs
its piece of data and says who it wants to read the data from. Because of the
way the hardware is structured, this becomes a single operation for the
library. This is quite different from each processor asynchronously deciding
to talk to another processor. It's not a whole bunch of point-to-point links.
It's literally an N-to-N communication."
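
  The multibroadcast Dietz describes can be modeled in a few lines. This is
a sketch of the semantics only, written as ordinary single-process Python;
the function name multibroadcast and its argument layout are hypothetical
and are not taken from AFAPI.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def multibroadcast(outputs: Sequence[T], sources: Sequence[int]) -> List[T]:
    """Model one N-to-N operation (hypothetical name, not an AFAPI call).

    outputs[i] is the datum PE i puts out; sources[i] is the PE that PE i
    wants to read from. The result lists what each PE receives.
    """
    n = len(outputs)
    if len(sources) != n:
        raise ValueError("every PE must name exactly one source")
    for src in sources:
        if not 0 <= src < n:
            raise ValueError(f"source {src} is not a valid PE number")
    return [outputs[src] for src in sources]


outputs = ["a", "b", "c", "d"]

# Arbitrary multibroadcast: PEs 0 and 1 both read from PE 1.
received = multibroadcast(outputs, [1, 1, 3, 0])

# A permutation is the special case where every source is named once.
permuted = multibroadcast(outputs, [3, 0, 1, 2])
print(received, permuted)
```

  The point of the model is that all of the requests are resolved together
in one collective step, rather than as separate sends and receives between
pairs of processors.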

  HPCwire: Exactly what does this mean for the programmer?

  DIETZ: "AFAPI is different. If you're writing C, you can't just take your
PVM or MPI codes and run them unchanged. You have to sit down and think a
little bit. But if you do this, you get vastly improved performance and
virtually no operating system overhead.

  "I believe it's actually easier to use than PVM or MPI for two reasons. One
is that there's no concept of buffer management. The other is that, since all
our operations are cheap, you don't have to worry about restructuring your
code, hiding latency, vectorizing messages, etc.

  "We also have a major compiler effort going on. Don't forget: PAPERS
started out as a compiler project. Jointly with Will Cohen at the University
of Alabama at Huntsville, we have developed a port of the MasPar C dialect
called MPL. We've taken the full compiler for that and retargeted it to
generate code for PAPERS clusters.

  "Rather than thinking of a PAPERS cluster as a traditional NOW, it's better
to conceive of it as a dedicated parallel supercomputer that just happens to
be made out of commodity boxes with a custom connection."

  HPCwire: PAPERS requires hardware. How costly is it?

  DIETZ: "The software and hardware designs are not proprietary; they're all
fully public-domain. And we like keeping things that way. For a two- or four-
processor system, a custom board is not even required. You can buy the parts
from Radio Shack and assemble them on your kitchen table; it would cost about
$50 to $60. Scalable versions are a bit more expensive to build, because the
PAPERS modules have additional hardware for hooking them together.

  "Also, although the AFAPI is normally used with a cluster of UNIX systems
connected by both PAPERS and a conventional network, the same programming
interface works with other hardware configurations. For example, SHMAPERS
AFAPI uses UNIX System V shared memory on SMP hardware; CAPERS AFAPI works
with just two machines connected by a standard 'LapLink' cable."

  HPCwire: What kind of applications are PAPERS systems currently supporting?

  DIETZ: "Seven universities are currently playing with it -- using the
technology in everything from scientific applications to a chess-playing
program. At least one company is looking into an embedded PAPERS-based system
for medical equipment. We also are developing VGA video-wall applications."


  For more information, see the PAPERS Web site
