SWARC: SIMD Within A Register C

Prof. Hank Dietz, hankd@engr.uky.edu
Randy Fisher, randall.fisher@ieee.org

Originally Posted July 30, 1998 at http://dynamo.ecn.purdue.edu/~hankd/SWAR/Scc.html

SWARC, pronounced "swh-are-see," is a C-like language designed to simplify writing portable code using SWAR parallelism. SWAR is our generic name for a variation on SIMD, the Single Instruction Multiple Data parallelism that has been used in supercomputers for decades. Unlike traditional SIMD hardware, SWAR (Simd Within A Register) hardware parallelism width is not a constant, but is a function of the precision of the values being operated upon -- lower precision yields greater parallelism. The primary motivation behind the SWARC language is that SWAR hardware mechanisms in the latest generation of microprocessors offer excellent speedups, but vary widely and have highly machine-dependent programming models that require hand coding of individual instructions and offer virtually no compile-time optimization.

Although having a very smart C compiler automatically generate optimized native SWAR code is a good long-term goal, quite a lot of new compiler technology must be developed for this, and building such a C compiler is not currently practical. To develop and test the new compiler technology, and also to provide an immediately usable SWAR programming abstraction, SWARC is a relatively small and simple language that can be used to generate code modules that can call and be called by ordinary C code. Thus, for example, SWARC has no I/O operations, etc.; these are all provided by the interface to C.

The following sections of this HTML document briefly overview the SWARC language, the SWARC compiler (Scc), a simple example program, and an "evil" example program.

The SWARC Language

The key concept in SWAR is the use of operations on word-sized values to perform parallel operations on fields within a word. Thus, the most distinctive feature of SWAR programming is the need to specify the field sizes and their alignment within a word. The field sizes and alignments are highly machine dependent, so, for portability, the programmer's specification should not require particular layouts, but should provide minimal constraints on how a particular layout can be selected by the compiler. To interface to C as a module compiler, SWARC also needs to be able to specify data types that have layouts equivalent to ordinary C data.

SWARC's solution is very simple. A SWARC data type essentially specifies a first-class single-dimension array type whose elements are either ordinary C-typed objects or are fields with a specified minimum precision. For example, the C declarations:

char c[100];
short s[100];
int i[100];
unsigned int u[100];

are equivalent to the SWARC C-layout declarations:

char[100] c;
short[100] s;
int[100] i;
unsigned int[100] u;

The slight notational difference between C and SWARC is due to the fact that C arrays are not fully first-class objects. For example, given the above declarations, i=s; is a perfectly legal SWARC array assignment that does element-by-element conversion from short to int copying the array value of s into i. In C, i=s; would be illegal; in fact, the closest that we could come to making this work for C would be to wrap each array in a C struct (since they are first-class C objects), but that still would not work, because the two struct types would be incompatible.

However, specifying first-class arrays of C objects is only the C interface portion of SWARC's data typing mechanism. Borrowing the bit-field specification from C struct declaration syntax, SWARC allows first-class arrays of appropriately-aligned SWAR data to be declared as:

char:7[100] ca;
short:[100] ia;

This notation uses any of the basic C types char, short, int, or float, but specifies the minimum number of bits for each field and allows the SWARC compiler to select the most appropriate internal layout (i.e., field alignment and ordering). The declaration of ca specifies that it is an array of at least 100 elements, each of which is an integer field with at least 7 bits. Similarly, ia is an array of at least 100 elements, each of which is an integer field with at least as many bits as a C short (i.e., 16 bits). Thus, the programmer does not know very much about how ca and ia will be represented, but ia=ca; works as expected -- and so would ia=c; or i=ca;.

About now is when you might be thinking that C++ classes and templates could solve these problems without requiring a whole new language. The answer is not quite. It would be very difficult for the C++ compiler to optimize across multiple first-class array operations, or even to do the basic static analysis involving SWAR optimizations.

In addition to this major difference between C types and SWARC types, there are a variety of minor differences. To simplify the compiler analysis, SWARC does not support C struct or union types, pointers, nor even multi-dimensional arrays; of course, there is nothing preventing programmers from coding using these types in C functions that interact with SWARC code. To make SWARC more usable for programmers, there are also a variety of extensions. In addition to the C-like sizeof built-in operation, SWARC provides widthof and precisionof; widthof and precisionof can be applied to an expression to return constant values that are, respectively, the number of elements or the number of bits of precision. For declaring variables or type casting, typeof can be used to generate a type specification from an expression. For type casting, degenerate null-subscripted types can be used to change the precision of elements without changing the number of elements. For example:

int x;
int:8[3] y;
int[3] z;
typeof(y) y2;           /* equivalent to int:8[3] y2; */
x = sizeof(z);          /* x = # bytes in z's memory image */
x = widthof(z);         /* x = 3 */
x = precisionof(z);     /* x = 8 */
y = ((int:8[]) z);      /* y = (z's value as an int:8[3]) */

It is important to note that, unlike C's sizeof, the SWARC construct returns a potentially non-constant value -- the SWARC compiler is even free to use multiple different-sized layouts for values of the same declared type. This operator is used only to pass C code the current size of the value's memory image.

There are also several type specifiers closely related to C storage classes. As in C, data and functions that are defined either later in this file or in another, separately compiled, file can be declared as extern. Also like C, data and functions whose names are to be visible only within the current file are declared static, and data can be declared register as a hint to the compiler that the value need not be placed in memory or as const to suggest that the data is temporarily read-only. SWARC integer values have attributes similar to C integers by default, signed and using modular arithmetic, but can also be explicitly declared as such; the alternatives are, respectively, unsigned and saturation.

The language specification is long and not yet complete, but understanding SWARC data typing is really the key. Operating on these first-class arrays is done using ordinary C operators and SIMD semantics. For example, the if statement, if given an array-valued conditional, will perform the usual SIMD enable-masking when executing the clauses. There are also a few new statements, operators, and some extended semantics. For example, there is a where construct that behaves like if, but always executes the clauses (i.e., unlike if, it implements enable masking only, not control flow). There are new binary operators for maximum, minimum, and average. Further, as was done in Thinking Machines' C* language, assignment operators can be used to perform unary reductions, e.g., +=y would return the single value that is the sum of the elements of y. Most of these extensions should sound familiar if you have used SIMD supercomputers like the old Thinking Machines, MasPar, and DAP systems.

This is not a serious summary, but you can get a reasonable feel for the language by scanning through the PCCTS grammar that was used to build the SWARC compiler's front-end:

swarc :
	  ( CCODE | ( { "extern" | "static" } ( data | func ) ) )+
	;

data :
	  type Name ( "," Name )* ";"
	;

func :
	  "void" Name "(" ( "void" | ( type Name ( "," type Name )*
	  ) ) ")" ( ( block ) | ";" )
	;

type :
	  ( "register" | "const" | "modular" | "saturation" |
	  "signed" | "unsigned" )* ( "char" | "short" | "int" |
	  "float" ) { ":" { num } } { "[" ( expr | ) "]" }
	| "typeof" "(" expr ")"
	;

block :
	  "{" ( data )* { stat ( stat )* } "}"
	;

stat :
	  block
	| CCODE
	| Name ":" stat
	| "goto" Name ";"
	| "if" "(" expr ")" stat { "else" stat }
	| "while" "(" expr ")" stat
	| "do" stat "while" "(" expr ")" ";"
	| "for" "(" expr ";" expr ";" expr ")" stat
	| "continue" { expr } ";"
	| "break" { expr } ";"
	| "return" ";"
	| ";"
	| "where" "(" expr ")" stat { "elsewhere" stat }
	| "everywhere" stat
	| ident "(" { expr ( "," expr )* } ")" ";"
	| expr ";"
	;

expr :
	  expr0
	;

expr0 :
	  expr1 { "=" expr | "&&=" expr | "||=" expr | "?>="
	  expr | "?<=" expr | "+/=" expr | "+=" expr | "-="
	  expr | "*=" expr | "/=" expr | "%=" expr | ">>=" expr |
	  "<<=" expr | "&=" expr | "^=" expr | "|=" expr }
	;

expr1 :
	  expr2 { "?" expr2 ":" expr2 }
	;

expr2 :
	  expr3 ( "||" expr3 )*
	;

expr3 :
	  expr4 ( "&&" expr4 )*
	;

expr4 :
	  expr5 ( "|" expr5 )*
	;

expr5 :
	  expr6 ( "^" expr6 )*
	;

expr6 :
	  expr7 ( "&" expr7 )*
	;

expr7 :
	  expr8 ( "==" expr8 | "!=" expr8 )*
	;

expr8 :
	  expr9 ( "<" expr9 | ">" expr9 | "<=" expr9 | ">="
	  expr9 | "?>" expr9 | "?<" expr9 | "+/" expr9 )*
	;

expr9 :
	  expr10 ( ">>" expr10 | "<<" expr10 )*
	;

expr10 :
	  expr11 ( "+" expr11 | "-" expr11 )*
	;

expr11 :
	  expr12 ( "*" expr12 | "/" expr12 | "%" expr12 )*
	;

expr12 :
	  "-" expr12
	| "!" expr12
	| "~" expr12
	| "++" expr12
	| "--" expr12
	| "(" type ")" expr12
	| "sizeof" "(" ( type | ident ) ")"
	| expr13 { ( "[" expr "]" ) | ( "[<<" expr "]" ) | (
	  "[<<%" expr "]" ) | ( "[>>" expr "]" ) | (
	  "[>>%" expr "]" ) | "++" | "--" }
	| "&&=" expr12
	| "||=" expr12
	| "?>=" expr12
	| "?<=" expr12
	| "+/=" expr12
	| "+=" expr12
	| "&=" expr12
	| "*=" expr12
	| "|=" expr12
	| "^=" expr12
	;

expr13 :
	  num
	| vnum
	| ident
	| "(" expr ")"
	;

ident :
	  Name
	;

vnum :
	  "{" expr { ".." expr } ( "," expr { ".." expr } )* "}"
	| String ( String )*
	;

num :
	  Octal_Number
	| Decimal_Number
	| Hex_Number
	| Binary_Number
	| Char
	| "widthof" "(" expr ")"
	| "precisionof" "(" expr ")"
	;

The SWARC Compiler, Scc

The SWARC compiler is called Scc (for Swar C compiler). It is a very sophisticated, roughly 50K source line, research compiler, using a variety of compiler optimization technologies that are truly "bleeding edge." It was created from scratch using PCCTS and other compiler tools developed by our group at Purdue, and the intent is to make it a Public Domain release, stable enough to be usable for production work.

Right now, as an alpha-test version, a lot of stuff works and a lot (e.g., 3DNow! floating point) doesn't. Currently, it only generates code for MMX variants... although it does understand the variations quite well. We are currently pursuing funding to continue our work in implementing new targets and developing new optimizations.

Despite its "researchy" status, Scc provides a command line interface very similar to that of the Gnu C compiler, gcc; in fact, Scc automatically invokes portions of gcc to handle SWARC source preprocessing and the relatively mundane compilation of Scc-generated C code. Like gcc, Scc associates file types with the suffix of the file name. The following file types are understood:

file.a
An archive file to include at link time.
file.c
A C source file. If given on the Scc command line, it is catenated into the output C code file.
file.h
A C header file. If given on the Scc command line, it is catenated into the output C header file.
file.o
An object file to include at link time.
file.s
An assembly code file to include at assembly time.
file.S
An assembly code file to preprocess and then include at assembly time.
file.Sc
A SWARC source file.
file.Sh
A SWARC header file.
file.Si
A preprocessed SWARC source file.

The command line options implemented thus far for Scc include:

-c
Compile but do not link. Because Scc allows multiple input files to be combined, the default object file name is not derived from the SWARC source file names, but is Scout.o.
-k
Keep both .c and .h (C code and header) files generated by Scc; normally these files are deleted as soon as they are no longer needed. By default, the kept files will be named Scout.c and Scout.h in the current directory. This is equivalent to giving both -kc and -kh.
-kc
Keep the .c (C code) file generated by Scc; normally this file is deleted as soon as it is no longer needed. By default, the kept file will be named Scout.c in the current directory.
-kh
Keep the .h (C header) file generated by Scc; normally this file is deleted as soon as it is no longer needed. By default, the kept file will be named Scout.h in the current directory.
-kSi
Keep all .Si files (preprocessed SWARC) generated by Scc; normally these files are deleted as soon as they are no longer needed. By default, the kept file will be named by replacing the .Sc or .Sh suffix of the SWARC source file with .Si.
-llibrary
When generating the executable file, add -llibrary to the gcc command line.
-m6x86MX
The target machine model is IA32 MMX with Cyrix extended MMX using a Cyrix 6x86MX processor. The extended MMX instructions add a variety of integer operations and conditional (masked) memory-to-register moves. Characteristics like pipeline structure and timing are considered in optimizing the MMX code generated, and the processor model is also reflected in the performance estimates.
-mK6
The target machine model is IA32 MMX using an AMD K6 processor. Characteristics like pipeline structure and timing are considered in optimizing the MMX code generated, and the processor model is also reflected in the performance estimates.
-mK6-2
The target machine model is IA32 MMX, with 3DNow! enhancements, using an AMD K6-2 processor. Although the target is primarily MMX code, 3DNow! instructions are used where appropriate; most 3DNow! instructions deal with floating point, but there are even a few improvements for integer operations. Characteristics like pipeline structure and timing are considered in optimizing the MMX 3DNow! code generated, and the processor model is also reflected in the performance estimates.
-mMII
The target machine model is IA32 MMX with Cyrix extended MMX using a Cyrix MII processor. The extended MMX instructions add a variety of integer operations and conditional (masked) memory-to-register moves. Characteristics like pipeline structure and timing are considered in optimizing the MMX code generated, and the processor model is also reflected in the performance estimates.
-mMMX
The target machine model is IA32 MMX using generic characteristics "averaged" across all current MMX implementations. Characteristics like pipeline structure and timing are considered in optimizing the MMX code generated, and the processor model is also reflected in the performance estimates. This currently is the default target, but should be explicitly specified because the default target may change in future versions of Scc.
-mPentiumII
The target machine model is IA32 MMX using an Intel PentiumII processor. Characteristics like pipeline structure and timing are considered in optimizing the MMX code generated, and the processor model is also reflected in the performance estimates.
-mPentiumMMX
The target machine model is IA32 MMX using an Intel Pentium processor with MMX. Characteristics like pipeline structure and timing are considered in optimizing the MMX code generated, and the processor model is also reflected in the performance estimates.
-o file
Name the output executable as file. By default, the executable would be named a.out in the current directory.
-oc file.c
Name the output C code as file.c. By default, the output C code would be named Scout.c in the current directory. Note that specifying a name for this file does not imply that the file should be kept; use -k or -kc for that.
-oh file.h
Name the output C header as file.h. By default, the output C code would be named Scout.h in the current directory. Note that specifying a name for this file does not imply that the file should be kept; use -k or -kh for that.
-oo file.o
Name the output object module as file.o. Specifying this option implicitly sets -c as well.
-os file.s
Name the output assembly language code as file.s. Specifying this option implicitly sets -S as well.
-p
Enable performance estimation messages. Scc's static timing models do not accurately predict true execution time. However, Scc does compute a detailed static estimate of the expected number of clock cycles for each basic block it generates. It also computes a somewhat cruder estimate of the number of clock cycles that would be needed for conventional word-oriented C code to serially execute the specified vector operations. If Scc's estimates suggest that conventional C code might actually be faster than the parallel code it generates for a particular function, an appropriate warning will be generated. The basic block estimates are also placed as comments in the generated C code, and the per-function static estimates will be output as informational messages if the -W option is also given.
-v
Output verbose version information and trace of actions.
-C
When preprocessing SWARC code (.Sc or .Sh suffix files), the preprocessor should preserve comments.
-Dmacro
When preprocessing SWARC code (.Sc or .Sh suffix files), the preprocessor should first define macro. If macro is just the name of a macro, it is defined as 1; alternatively, macro can be of the form name=value.
-E
Stop after preprocessing SWARC source code (into files with the .Si suffix). Setting this option implicitly also sets -kSi.
-Ec
Stop after preprocessing the generated C code. As is customary for gcc -E, the preprocessed C code is sent to the standard output rather than a file.
-I path
When preprocessing SWARC code (.Sc or .Sh suffix files), the preprocessor should add path to the path search list for processing of #include directives.
-Ic path
When preprocessing the generated C code, the preprocessor should add path to the path search list for processing of #include directives.
-Lpath
When linking, the C compiler should add path to the library path search list.
-Oseconds
Enable exhaustive optimization. Because the potential speedups are often very small for SWAR execution (typically 1.5x-6x), it is particularly important that Scc generate the best possible code, with minimal unexpected overhead. In addition to various more traditional optimizations, Scc uses a modified exhaustive search to simultaneously optimize the pipelined code schedule and register allocation for each basic block. If these searches are allowed to run to completion, the results are always provably optimal schedules for the target processor model used. But exhaustive search can take a long time, so, by default, Scc essentially truncates the search by accepting the first valid schedule encountered -- which is usually fairly good (thanks to a heuristic prescheduler). This option allows you to specify the maximum number of seconds that the search should continue on each basic block without finding a new best schedule and register allocation. Giving -O2 often results in a 20% improvement over the default -O0, but there is no easy way to predict the benefit in using larger numbers. The search usually completes without truncation in less than 2 seconds for a basic block with as many as a couple hundred instructions. Note that Scc will sometimes report failure to schedule a block because the only schedules it found would have required spilling one or more registers; a future version of Scc will take a bit less harsh view of spills, but breaking-up the block by hand is currently the only solution.
-P
When processing SWARC code (.Sc or .Sh suffix files), neither the compiler nor the preprocessor should generate # line number directives. The result is that error messages generated by gcc will refer to source line numbers in the generated C code file, rather than in the original source files. This is intended mostly for debugging and hand-tweaking the C code. However, although later releases will improve, Scc's current line number tracking is very sloppy... so it may actually be easier to track where errors came from by examining the generated C code at the locations flagged when compiled with -P.
-S
Stop after generating assembly code. By default, the assembly language file will be named Scout.s, but -os can be used to rename it.
-Sc
Stop after generating C code and header. By default, the output C code and header respectively would be named Scout.c and Scout.h, both in the current directory. Specifying this option implicitly sets -k as well.
-Umacro
When preprocessing SWARC code (.Sc or .Sh suffix files), the preprocessor should first undefine macro.
-W
Bump warning level for more messages. Currently, it may be useful to specify -W up to two times.

Thus, the basic output of Scc is C code (including inline native assembly code for SWAR operations) and a C interface header file, but Scc is capable of directly producing executable files. In use, Scc handles SWARC code nearly identically to how gcc handles C code.

A Simple Example Program

The following is a simple SWARC program, test.Sc:

void addfour(int[2]x) { x += 4; }

${
main()
{
  int a[2]; a[0] = 1; a[1] = 2;
  addfour(a);
  printf("a={%d,%d}\n", a[0], a[1]);
}
$}

This program doesn't really show-off many of the language or compiler features, but is short enough that we can reasonably include all phases of the code for this example. The addfour function is SWARC code that takes a C-layout first-class array of 2 integers (passed by address), and simply adds 4 to each of its elements. The main is ordinary C code, which can be used in-line within SWARC programs as shown above.

Compiled by Scc -Sc -P, the header file Scout.h is generated as:

/* compiled by Scc alpha-test v980728 for generic IA32 MMX target */
extern void addfour(int *x);

And the C code file Scout.c:

/* compiled by Scc alpha-test v980728 for generic IA32 MMX target */
#include "Sc.h"
void addfour(int *x)
{
    extern mmx_t mmx_cpool[];
    register mmx_t *_cpool = &(mmx_cpool[0]);

    {
	movq_m2r(*(((mmx_t *) x) + 0), mm0);
	paddd_m2r(*(_cpool + 2), mm0);
	movq_r2m(mm0, *(((mmx_t *) x) + 0));
    }

_return:
    emms();
}


main()
{
  int a[2]; a[0] = 1; a[1] = 2;
  addfour(a);
  printf("a={%d,%d}\n", a[0], a[1]);
}


/* MMX constant pool */
mmx_t mmx_cpool[] = {
/*    0 */	0x0000000000000000LL,
/*    1 */	0xffffffffffffffffLL,
/*    2 */	0x0000000400000004LL
};

The actual assembly language translation of the program, as generated by Scc test.Sc -S -O2 is:

	.file	"Scout.c"
	.version	"01.01"
gcc2_compiled.:
.text
	.align 4
.globl addfour
	.type	 addfour,@function
addfour:
	pushl %ebp
	movl %esp,%ebp
	movl 8(%ebp),%edx
	movl $mmx_cpool,%eax
#APP
	movq (%edx), %mm0
	paddd 16(%eax), %mm0
	movq %mm0, (%edx)
	emms
#NO_APP
	leave
	ret
.Lfe1:
	.size	 addfour,.Lfe1-addfour
.section	.rodata
.LC0:
	.string	"a={%d,%d}\n"
.text
	.align 4
.globl main
	.type	 main,@function
main:
	pushl %ebp
	movl %esp,%ebp
	subl $8,%esp
	movl $1,-8(%ebp)
	movl $2,-4(%ebp)
	leal -8(%ebp),%eax
	pushl %eax
	call addfour
	pushl -4(%ebp)
	pushl -8(%ebp)
	pushl $.LC0
	call printf
	leave
	ret
.Lfe2:
	.size	 main,.Lfe2-main
.globl mmx_cpool
.data
	.align 4
	.type	 mmx_cpool,@object
mmx_cpool:
	.long 0
	.long 0
	.long -1
	.long -1
	.long 4
	.long 4
	.size	 mmx_cpool,24
	.ident	"GCC: (GNU) 2.7.2.3"

Notice that the MMX instructions appear without additional overhead, simply using gcc's inline assembly facility with our libmmx wrappers. However, the gcc code does have a significant flaw: the last .align 4 should really have been an .align 8 to ensure optimum performance, but this minor bug will be fixed in future versions of Scc (if it is not fixed in gcc first).

Compiling test.Sc directly into an executable file called test is done by Scc test.Sc -o test. When the file test is executed on an MMX-capable machine, the output is simply:

a={5,6}

An "Evil" Example Program

Ok, so by now you're starting to wonder how clever this compiler is... after all, you could have written those three MMX instructions yourself. Well, you asked for it. Here's a trivial little example that doesn't generate trivial little MMX code:

char[14] c;

void
nasty(unsigned int:2[32]two, typeof(two) twotoo,
unsigned int:4[32]four)
{
	/* Two bit divide... */
	two = two / twotoo;

	/* Four bit field average...
	   but first must promote 2->4 bit
	*/
	four +/= two;

	/* Eight bit add...
	   of missaligned c to itself rotated
	*/
	c += c[<<% 1];
}

This little gem is called torture.Sc, and was compiled by Scc torture.Sc -Sc -P -O2000 to generate the following code. But wait a second... -O2000? Doesn't that mean the Scc optimizer can take up to 2000 seconds? Yup... but it actually took 0.86 seconds for the full compile, meaning that the block was scheduled optimally (at least according to our timing model for a generic MMX target). The output C code is:

/* compiled by Scc alpha-test v980728 for generic IA32 MMX target */
#include "Sc.h"
char c[14];
void nasty(mmx_t *two,
mmx_t *twotoo,
mmx_t *four)
{
    extern mmx_t mmx_cpool[];
    register mmx_t *_cpool = &(mmx_cpool[0]);

    {
	movq_m2r(*(((mmx_t *) two) + 0), mm0);
	movq_m2r(*(((mmx_t *) twotoo) + 0), mm1);
	movq_r2r(mm1, mm2);
	psrlq_i2r(1, mm1);
	movq_r2r(mm0, mm3);
	psrlq_i2r(1, mm0);
	pandn_r2r(mm3, mm1);
	movq_r2r(mm3, mm4);
	pand_r2r(mm0, mm3);
	movq_r2r(mm2, mm5);
	pandn_r2r(mm0, mm2);
	por_r2r(mm3, mm1);
	por_r2r(mm2, mm1);
	pandn_r2r(mm4, mm5);
	pand_m2r(*(_cpool + 2), mm1);
	pand_m2r(*(_cpool + 3), mm5);
	por_r2r(mm5, mm1);
	movq_m2r(*(((mmx_t *) four) + 0), mm6);
	movq_r2m(mm1, *(((mmx_t *) two) + 0));
	movq_m2r(*(((mmx_t *) four) + 1), mm7);
	movq_r2r(mm1, mm5);
	pand_m2r(*(_cpool + 4), mm1);
	movq_r2r(mm5, mm4);
	pand_m2r(*(_cpool + 5), mm5);
	psllq_i2r(16, mm5);
	por_r2r(mm5, mm1);
	movq_r2r(mm1, mm5);
	pand_m2r(*(_cpool + 6), mm1);
	pand_m2r(*(_cpool + 7), mm5);
	psllq_i2r(8, mm1);
	por_r2r(mm1, mm5);
	movq_r2r(mm5, mm1);
	pand_m2r(*(_cpool + 8), mm5);
	pand_m2r(*(_cpool + 9), mm1);
	psllq_i2r(4, mm5);
	por_r2r(mm5, mm1);
	movq_r2r(mm1, mm5);
	pand_m2r(*(_cpool + 10), mm1);
	pand_m2r(*(_cpool + 11), mm5);
	psllq_i2r(2, mm1);
	psrlq_i2r(32, mm4);
	por_r2r(mm1, mm5);
	movq_r2r(mm4, mm1);
	pand_m2r(*(_cpool + 4), mm4);
	pand_m2r(*(_cpool + 5), mm1);
	psllq_i2r(16, mm1);
	por_r2r(mm1, mm4);
	movq_r2r(mm4, mm1);
	pand_m2r(*(_cpool + 6), mm4);
	pand_m2r(*(_cpool + 7), mm1);
	psllq_i2r(8, mm4);
	por_r2r(mm4, mm1);
	movq_r2r(mm1, mm4);
	pand_m2r(*(_cpool + 8), mm1);
	pand_m2r(*(_cpool + 9), mm4);
	psllq_i2r(4, mm1);
	por_r2r(mm1, mm4);
	movq_r2r(mm4, mm1);
	pand_m2r(*(_cpool + 10), mm4);
	pand_m2r(*(_cpool + 11), mm1);
	psllq_i2r(2, mm4);
	movq_r2r(mm6, mm0);
	psrlq_i2r(1, mm6);
	por_r2r(mm4, mm1);
	pand_m2r(*(_cpool + 12), mm6);
	movq_r2r(mm5, mm4);
	psrlq_i2r(1, mm5);
	pand_m2r(*(_cpool + 12), mm5);
	por_r2r(mm4, mm0);
	paddd_r2r(mm5, mm6);
	pand_m2r(*(_cpool + 13), mm0);
	movq_r2r(mm7, mm5);
	psrlq_i2r(1, mm7);
	paddd_r2r(mm0, mm6);
	pand_m2r(*(_cpool + 12), mm7);
	movq_r2r(mm1, mm4);
	psrlq_i2r(1, mm1);
	pand_m2r(*(_cpool + 12), mm1);
	por_r2r(mm4, mm5);
	paddd_r2r(mm1, mm7);
	pand_m2r(*(_cpool + 13), mm5);
	movq_r2m(mm6, *(((mmx_t *) four) + 0));
	paddd_r2r(mm5, mm7);
	movq_m2r(*(((mmx_t *) &c) + 0), mm5);
	movq_r2m(mm7, *(((mmx_t *) four) + 1));
	movq_m2r(*(((mmx_t *) &c) + 1), mm4);
	movq_r2r(mm4, mm1);
	pand_m2r(*(_cpool + 14), mm4);
	movq_r2r(mm4, mm0);
	psllq_i2r(8, mm4);
	movq_r2r(mm5, mm3);
	psrlq_i2r(56, mm5);
	pand_m2r(*(_cpool + 15), mm4);
	pand_m2r(*(_cpool + 16), mm5);
	movq_r2r(mm3, mm2);
	psllq_i2r(8, mm3);
	por_r2r(mm5, mm4);
	pand_m2r(*(_cpool + 15), mm3);
	psrlq_i2r(40, mm0);
	pand_m2r(*(_cpool + 17), mm0);
	por_r2r(mm0, mm3);
	paddb_r2r(mm1, mm4);
	paddb_r2r(mm3, mm2);
	pand_m2r(*(_cpool + 14), mm4);
	movq_r2m(mm2, *(((mmx_t *) &c) + 0));
	pand_m2r(*(_cpool + 18), mm1);
	por_r2r(mm1, mm4);
	movq_r2m(mm4, *(((mmx_t *) &c) + 1));
    }

_return:
    emms();
}


/* MMX constant pool */
mmx_t mmx_cpool[] = {
/*    0 */	0x0000000000000000LL,
/*    1 */	0xffffffffffffffffLL,
/*    2 */	0x5555555555555555LL,
/*    3 */	0xaaaaaaaaaaaaaaaaLL,
/*    4 */	0x000000000000ffffLL,
/*    5 */	0x00000000ffff0000LL,
/*    6 */	0x0000ff000000ff00LL,
/*    7 */	0x000000ff000000ffLL,
/*    8 */	0x00f000f000f000f0LL,
/*    9 */	0x000f000f000f000fLL,
/*   10 */	0x0c0c0c0c0c0c0c0cLL,
/*   11 */	0x0303030303030303LL,
/*   12 */	0x7777777777777777LL,
/*   13 */	0x1111111111111111LL,
/*   14 */	0x0000ffffffffffffLL,
/*   15 */	0xffffffffffffff00LL,
/*   16 */	0x00000000000000ffLL,
/*   17 */	0x0000000000ffffffLL,
/*   18 */	0xffff000000000000LL
};

Ok. So our compiler really does do something non-trivial.

Still, the real question is how fast does the output code run? We don't really know... we don't yet have one of each of the different types of MMX targets. However, the compiler timing estimates on this code were surprisingly consistent. The AMD K6 and Cyrix 6x86MX and MII processors all yielded code that Scc estimated was about 4.7x as fast as word-oriented serial C code; for the AMD K6-2 and Intel PentiumII and Pentium with MMX, Scc predicts a 6.0x speedup (as it also does for the generic MMX model). Why? In this case, all the targets are using the same MMX instructions, so it is just a matter of pipeline scheduling, and having two pipes beats having only one.