Technical Summary: SWAR Technology

*February 4, 1997, by Hank Dietz*

The basic concept of SWAR, SIMD Within A Register, is that operations
on word-length registers can be used to speed up computations by
performing SIMD parallel operations on `n` field values of
`k`/`n` bits each, packed into a `k`-bit register. However, making use of
SWAR technology can be awkward, and some SWAR operations are actually
more expensive than the corresponding sequences of serial operations
because they require additional instructions to enforce the field
partitioning.

To illustrate this point, let's consider a greatly simplified SWAR mechanism that manages four 8-bit fields within each 32-bit register. The values in two registers might be represented as:

D 7:0 | C 7:0 | B 7:0 | A 7:0 |

H 7:0 | G 7:0 | F 7:0 | E 7:0 |

This simply indicates that each register is viewed as essentially a
vector of four independent 8-bit integer values. Alternatively, think
of `A` and `E` as values in processing element 0 (PE0), `B` and `F`
as values in PE1, and so forth.
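To make the layout concrete, here is a minimal C sketch of packing four 8-bit values into a 32-bit word and reading one field back. The names `swar_pack8` and `swar_field8` are hypothetical helpers introduced only for illustration; the field order (PE0 in bits 7:0) follows the register picture above.

```c
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word.
   a lands in PE0 (bits 7:0), d in PE3 (bits 31:24). */
static inline uint32_t swar_pack8(uint8_t d, uint8_t c, uint8_t b, uint8_t a)
{
    return ((uint32_t)d << 24) | ((uint32_t)c << 16) |
           ((uint32_t)b << 8)  |  (uint32_t)a;
}

/* Extract the 8-bit field held by processing element pe (0..3). */
static inline uint8_t swar_field8(uint32_t x, unsigned pe)
{
    return (uint8_t)(x >> (8u * pe));
}
```

For example, `swar_pack8(0xDD, 0xCC, 0xBB, 0xAA)` gives `0xDDCCBBAA`, and `swar_field8` of that word for PE0 gives `0xAA`.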

The remainder of this document briefly reviews the basic classes of SIMD parallel operations on these integer vectors and how these functions can be implemented.

Some SWAR operations can be performed trivially using ordinary 32-bit
integer operations, without concern for the fact that the operation is
really intended to operate independently in parallel on these 8-bit
fields. We call any such SWAR operation *polymorphic*, since
the function is unaffected by the field types (sizes).

Testing if any field is non-zero is polymorphic, as are all bitwise
logic operations. For example, an ordinary bitwise-and operation (C's
`&` operator) performs a bitwise and no matter what the
field sizes are. A simple bitwise and of the above registers yields:

D & H 7:0 | C & G 7:0 | B & F 7:0 | A & E 7:0 |

Because each result bit `k` of a bitwise and depends only on
bit `k` of each operand, all field sizes are supported using the same
single instruction.
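A small C sketch of the two polymorphic operations just mentioned. The function names are hypothetical; the bodies are nothing more than the ordinary 32-bit operations, which is the whole point.

```c
#include <stdint.h>

/* Bitwise and of every field. Correct for any field width, because
   result bit k depends only on bit k of each operand. */
static inline uint32_t swar_and(uint32_t x, uint32_t y)
{
    return x & y;
}

/* Returns 1 if at least one field of x is non-zero, else 0.
   The field width never enters into the test. */
static inline int swar_any(uint32_t x)
{
    return x != 0;
}
```

The same two functions serve equally well for four 8-bit fields, two 16-bit fields, or thirty-two 1-bit fields.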

Unfortunately, lots of important SWAR operations are not polymorphic.
Arithmetic operations such as add, subtract, multiply, and divide are
all subject to carry/borrow interactions between fields. We call such
SWAR operations *partitioned*, because each such operation must
effectively partition the operands and result to prevent interactions
between fields. However, there are actually three different methods
that can be used to achieve this effect.

Perhaps the most obvious approach to implementing partitioned operations is to provide hardware support for "partitioned parallel instructions" that cut the carry/borrow logic between fields. This approach can yield the highest performance, but it requires a change to the processor's instruction set and generally places many restrictions on field size (e.g., 8-bit fields might be supported, but not 12-bit fields).

The AMD/Cyrix/Intel MMX, Digital MAX, HP MAX, and Sun VIS all implement restricted versions of partitioned instructions. Unfortunately, these different instruction set extensions have significantly different restrictions, making algorithms somewhat non-portable between them. For example, consider the following sampling of partitioned operations:

Instruction | AMD/Cyrix/Intel MMX | Digital MAX | HP MAX | Sun VIS |
---|---|---|---|---|
Add | 8, 16, & 32 | | 16 | 16 & 32 |
Multiply | 16 | | | 8 x 16 |
Compare | 8, 16, & 32 | | | 16 & 32 |
Merge Max | | 8 & 16 | | |
Absolute Difference | | 8 | | 8 |

In the table, the numbers indicate the field sizes, in bits, for which each operation is supported. Even though the table omits many instructions including all the more exotic ones, it is clear that there are many differences. The direct result is that high-level languages (HLLs) really are not very effective as programming models, and portability is generally poor.

Implementing partitioned operations using partitioned instructions can certainly be efficient, but what do you do if the partitioned operation you need is not supported by the hardware? The answer is that you use a series of ordinary instructions to perform the operation with carry/borrow across fields, and then correct for the undesired field interactions.

This is a purely software approach, and the corrections do introduce overhead, but it works with fully general field partitioning. This approach is also fully general in that it can be used either to fill gaps in the hardware support for partitioned instructions, or it can be used to provide full functionality for target machines that have no hardware support at all. In fact, by expressing the code sequences in a language like C, this approach allows SWAR programs to be fully portable.

The question immediately arises: precisely how inefficient is it to simulate SWAR partitioned operations using unpartitioned operations with correction code? Well, that is certainly the $64k question... but many operations are not as difficult as one might expect.

Consider implementing a four-element 8-bit integer vector add of two
source vectors, `x`+`y`, using ordinary 32-bit operations.

An ordinary 32-bit add might actually yield the correct result, but
not if any 8-bit field carries into the next field. Thus, our goal is
simply to ensure that such a carry does not occur. Because adding two
`k`-bit fields generates a result of at most `k`+1 bits,
we can ensure that no carry occurs by simply "masking out" the
most significant bit of each field. This is done by bitwise anding
each operand with `0x7f7f7f7f` and then performing an
ordinary 32-bit add.

t = ((x & 0x7f7f7f7f) + (y & 0x7f7f7f7f));

That result is correct... except for the most significant bit within each field. Computing the correct value for each field is simply a matter of doing two 1-bit partitioned adds of the most significant bits from `x` and `y` to the 7-bit carry result which was computed for `t`. Fortunately, a 1-bit partitioned add is implemented by an ordinary exclusive or operation. Thus, the result is simply:

(t ^ ((x ^ y) & 0x80808080))

Ok, well, maybe that isn't so simple. After all, it is six operations to do just four adds. However, notice that the number of operations is not a function of how many fields there are... so, with more fields, we get speedup. In fact, we may get speedup anyway simply because the fields were loaded and stored in a single (integer vector) operation, register availability may be improved, and there are fewer dynamic code scheduling dependencies (because partial word references are avoided).
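Putting the two steps above together gives the following C sketch. The function name `swar_add8` is hypothetical; the two expressions are the ones from the text, and each field wraps around modulo 256 on its own.

```c
#include <stdint.h>

/* Four independent 8-bit adds inside one 32-bit word. */
static inline uint32_t swar_add8(uint32_t x, uint32_t y)
{
    /* Add the low 7 bits of every field; with the top bit of each
       field cleared, no carry can cross into the next field. */
    uint32_t t = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);

    /* Fold in the top bit of each field with a 1-bit add (xor). */
    return t ^ ((x ^ y) & 0x80808080u);
}
```

As a check, `swar_add8(0xFF01807F, 0x01FF8001)` yields `0x00000080`: the three upper fields each wrap to zero without disturbing a neighbor, and the lowest field is `0x7F + 0x01 = 0x80`.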

While the other two approaches to partitioned operation implementation both center on getting the maximum space utilization for the registers, it can be computationally more efficient to instead control the field values so that inter-field carry/borrow events should never occur. For example, if we know that all the field values being added are such that no field overflow will occur, a partitioned add operation can be implemented using an ordinary add instruction; in fact, given this constraint, an ordinary add instruction appears polymorphic, and is usable for any field sizes without correction code. The question thus becomes how to ensure that field values will not cause carry/borrow events.

One way to ensure this property is to implement partitioned instructions that can restrict the range of field values. The Digital MAX vector minimum and maximum instructions can be viewed as hardware support for clipping field values to avoid inter-field carry/borrow.

However, suppose that we do not have partitioned instructions that can
efficiently restrict the range of field values... is there a
sufficient condition that can be cheaply imposed to ensure
carry/borrow events will not interfere with adjacent fields? The
answer lies in analysis of the arithmetic properties. Adding two
`k`-bit numbers generates a result with at most
`k`+1 bits; thus, a field of `k`+1 bits can safely
contain such an operation despite using ordinary instructions.

Thus, suppose that the 8-bit fields in our earlier example are now 7-bit fields with 1-bit "carry/borrow spacers":

D' | D 6:0 | C' | C 6:0 | B' | B 6:0 | A' | A 6:0 |

A vector of 7-bit adds is performed as follows. Let us assume that,
prior to the start of any partitioned operation, all the carry spacer
bits (`A'`, `B'`, `C'`, and `D'`) have the value 0. By simply executing
an ordinary add operation, all the fields obtain the correct 7-bit
values; however, some spacer bit values might now be 1. We can correct
this by just one more conventional operation, masking out the spacer
bits. Our 7-bit integer vector add, `x`+`y`, is thus:

((x + y) & 0x7f7f7f7f)

This is just two instructions for four adds, clearly yielding good speedup.
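As a C sketch, the spacer-bit add looks like this. The name `swar_add7` is hypothetical, and the function assumes its precondition rather than checking it: bit 7 of every byte in both operands must already be 0.

```c
#include <stdint.h>

/* Four independent 7-bit adds, one per byte, with bit 7 of each byte
   acting as a carry spacer. Precondition: all spacer bits of x and y
   are 0 on entry. */
static inline uint32_t swar_add7(uint32_t x, uint32_t y)
{
    /* Any carry out of a 7-bit field lands in its own spacer bit,
       which the mask then clears again. */
    return (x + y) & 0x7f7f7f7fu;
}
```

Each field wraps modulo 128, so `swar_add7(0x7F014000, 0x017F4000)` yields `0x00000000`.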

The sharp reader may have noticed that setting the spacer bits to 0
does not work for subtract operations. The correction is, however,
remarkably simple. To compute `x`-`y`, we simply
ensure the initial condition that the spacers in `x` are all
1, while the spacers in `y` are all 0. In the worst case,
we would thus get:

(((x | 0x80808080) - y) & 0x7f7f7f7f)

However, the additional bitwise or operation can often be optimized out by ensuring that the operation generating the value for `x` used `| 0x80808080` rather than `& 0x7f7f7f7f` as the last step.

Which method should be used for SWAR partitioned operations? The answer is simply "whichever yields the best speedup." Interestingly, the ideal method to use may be different for different field sizes within the same program running on the same machine.
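For completeness, here is the worst-case spacer-bit subtract described above as a C sketch. The name `swar_sub7` is hypothetical, and the function assumes the spacer bits of `y` are 0 on entry.

```c
#include <stdint.h>

/* Four independent 7-bit subtracts, x - y, one per byte.
   Precondition: bit 7 of every byte of y is 0. */
static inline uint32_t swar_sub7(uint32_t x, uint32_t y)
{
    /* Setting each spacer in x gives every field a 1 to borrow from,
       so a borrow never reaches the neighboring field. */
    return ((x | 0x80808080u) - y) & 0x7f7f7f7fu;
}
```

Each field wraps modulo 128: `swar_sub7(0x00010203, 0x01010101)` yields `0x7F000102`, since the top field computes 0 - 1 and wraps to `0x7F` while the others are unaffected.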

Although some parallel computations, including many operations on image
pixels, have the property that the `i`th value in a vector is
a function only of values that appear in the `i`th position
of the operand vectors, this is generally not the case. For example,
even pixel operations such as smoothing require values from adjacent
pixels as operands, and transformations like FFTs require more complex
(less localized) communication patterns.

It is not difficult to efficiently implement 1-dimensional nearest
neighbor communication for SWAR using unpartitioned shift operations.
For example, to move a value from PE`i` to PE(`i`+1), a simple shift
operation suffices. If the fields are 8 bits in length, we would use:

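(x << 8)

Still, it isn't always quite that simple. For example, to move a value from PE`i` to PE(`i`-1), a right shift is needed, but C does not guarantee that a right shift of a signed value shifts in zeros, so the bits vacated at the top of the register should be explicitly cleared:

((x >> 8) & 0x00ffffff)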

Adding "wrap-around connections" is also reasonably efficient using
unpartitioned shifts. For example, to move a value from
PE`i` to PE(`i`+1) with wraparound:

((x << 8) | ((x >> 24) & 0x000000ff))
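The three communication patterns above, collected into a C sketch for 8-bit fields. The function names are hypothetical. Because the operands here are declared `uint32_t`, the right shifts already shift in zeros; the masks are kept only to mirror the expressions in the text, which make no assumption about signedness.

```c
#include <stdint.h>

/* PEi -> PE(i+1); PE0 receives 0 and the old PE3 value is lost. */
static inline uint32_t swar_shift_up8(uint32_t x)
{
    return x << 8;
}

/* PEi -> PE(i-1); PE3 receives 0 and the old PE0 value is lost. */
static inline uint32_t swar_shift_down8(uint32_t x)
{
    return (x >> 8) & 0x00ffffffu;
}

/* PEi -> PE(i+1) with wraparound; the old PE3 value moves to PE0. */
static inline uint32_t swar_rotate_up8(uint32_t x)
{
    return (x << 8) | ((x >> 24) & 0x000000ffu);
}
```

Starting from `0xDDCCBBAA`, these give `0xCCBBAA00`, `0x00DDCCBB`, and `0xCCBBAADD` respectively.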

The real problem comes when more general communication patterns must
be implemented. Only the HP MAX instruction set supports arbitrary
rearrangement of fields with a single instruction, which is called
`Permute`. This `Permute` instruction is really
misnamed; not only can it perform an arbitrary permutation of the
fields, but it also allows repetition. In short, it implements an
arbitrary `x[y]` operation.

Unfortunately, `x[y]` is very difficult to implement without
such an instruction. The code sequence is generally both long and
inefficient; in fact, it is sequential code. This is very
disappointing. The relatively high speed of `x[y]`
operations in the MasPar MP1/MP2 and Thinking Machines CM1/CM2/CM200
SIMD supercomputers was one of the key reasons these machines performed
well. However, `x[y]` has always been slower than nearest
neighbor communication, even on those supercomputers, so many
algorithms have been designed to minimize the need for
`x[y]` operations. In short, without hardware support, it
is probably best to develop SWAR algorithms as though
`x[y]` wasn't legal... or at least isn't cheap.
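To see why the software version is sequential, consider this C sketch of `x[y]` for four 8-bit fields. It is only an illustration: the name `swar_permute8` is hypothetical, the convention that each field of `y` selects a source field through its low 2 bits is an assumption made here, and none of this describes how the HP MAX instruction encodes its operands.

```c
#include <stdint.h>

/* Field i of the result is field (y_i & 3) of x, where y_i is the
   8-bit field of y held by PEi. Repeated indices are allowed. */
static inline uint32_t swar_permute8(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 4; ++i) {
        unsigned src = (y >> (8u * i)) & 3u;       /* index held by PEi   */
        uint32_t val = (x >> (8u * src)) & 0xffu;  /* fetch field x[src]  */
        r |= val << (8u * i);                      /* deposit into PEi    */
    }
    return r;
}
```

With `x = 0xDDCCBBAA`, the index vector `0x00010203` reverses the fields to give `0xAABBCCDD`, and an index vector of all zeros broadcasts PE0 to give `0xAAAAAAAA`. Note that the work grows with the number of fields, unlike the partitioned operations shown earlier.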

A recurrence is a computation in which there is an apparently sequential relationship between values being computed. However, if these recurrences involve associative operations, it may be possible to recode the computation using a tree-structured parallel algorithm.

The most common type of parallelizable recurrence is probably the class known as associative reductions. For example, to compute the sum of a vector's values, one commonly writes purely sequential C code like:

t = 0; for (i=0; i<MAX; ++i) t += x[i];

However, the order of the additions is rarely important. Floating point and saturation math can yield different answers if the order of additions is changed, but ordinary wrap-around integer additions will yield the same results independent of addition order. Thus, we can re-write this sequence into a tree-structured parallel summation in which we first add pairs of values, then pairs of those partial sums, and so forth, until a single final sum results. For a vector of four 8-bit values, just two addition steps are needed; the first step does two 8-bit adds, yielding two 16-bit result fields (each containing a 9-bit result):

t = ((x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff));

The second step adds these two 9-bit values in 16-bit fields to produce a single 10-bit result:

((t + (t >> 16)) & 0x000003ff)

Actually, the second step performs two 16-bit field adds... but the top 16-bit add is meaningless, which is why the result is masked to a single 10-bit result value.
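The two reduction steps above fit in a very short C function. The name `swar_sum8` is hypothetical; the expressions are the ones given in the text.

```c
#include <stdint.h>

/* Sum of the four unsigned 8-bit fields of x, returned as an
   ordinary integer in the range 0..1020. */
static inline uint32_t swar_sum8(uint32_t x)
{
    /* Step 1: two pairwise adds, each a 9-bit sum in a 16-bit field. */
    uint32_t t = (x & 0x00ff00ffu) + ((x >> 8) & 0x00ff00ffu);

    /* Step 2: add the two partial sums; keep only the 10-bit total. */
    return (t + (t >> 16)) & 0x000003ffu;
}
```

For instance, `swar_sum8(0x01020304)` is 10, and the largest case, `swar_sum8(0xFFFFFFFF)`, is 1020 (`0x3FC`), which still fits in the 10-bit result.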

Scans, also known as "parallel prefix" operations, are somewhat harder to implement efficiently. This is because, unlike reductions, scans produce partitioned results. For this reason, scans can be implemented using a fairly obvious sequence of partitioned operations.
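One possible such sequence, sketched in C for an inclusive prefix sum over four 8-bit fields. This is an illustration under stated assumptions rather than the only way to do it: the names `swar_add8` and `swar_scan_add8` are hypothetical, the partitioned add uses the software correction-code method, and every running total wraps modulo 256 within its field.

```c
#include <stdint.h>

/* Partitioned 8-bit add built from ordinary 32-bit operations. */
static inline uint32_t swar_add8(uint32_t x, uint32_t y)
{
    uint32_t t = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);
    return t ^ ((x ^ y) & 0x80808080u);
}

/* Inclusive prefix sum: field i of the result is the sum of
   fields 0..i of x, computed in two shift-and-add steps. */
static inline uint32_t swar_scan_add8(uint32_t x)
{
    /* Step 1: each PE adds in the value of its lower neighbor. */
    uint32_t t = swar_add8(x, x << 8);

    /* Step 2: each PE adds in the partial sum from two PEs below. */
    return swar_add8(t, t << 16);
}
```

With PE0 through PE3 holding 1, 2, 3, and 4 (`x = 0x04030201`), the result is `0x0A060301`, that is, 1, 3, 6, and 10. Unlike the reduction, every field of the result is meaningful, which is why the adds must be partitioned.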

The instruction set extensions of each microprocessor tend to focus on speedup for a particular group of multimedia applications. Different processor designs focussed on different application codes, and thus added different types of instructions; which is best depends on what you are doing, and we have no desire to get involved in arguments about which is best. While there is nothing wrong with tuning the instruction set for specific application routines, this process does not generally yield a good target for high-level language compilers. The resulting instruction set understands only certain bit-field widths, and not all operations can be applied to all supported bit-field widths.

We suggest that SWAR is not just for a few hand-optimized multimedia routines. For example, 8-bit values might represent characters in a text editing system just as well as they represent pixels for multimedia; a 2-bit number is sufficient to represent each "base pair" in a SIMD parallel search of a genome database. To make SWAR a fully general-purpose mechanism, we need to construct a more complete, more regular, and more portable target model for the compiler. This paper has presented an overview of how such a target model can be built to be reasonably efficient for most operations on most computers... with or without any flavor of specialized support instructions. Our portable compiler target specification will be posted here shortly.
