If you are trying to improve program execution speed, you certainly need to be able to accurately measure execution time. There are various UNIX commands for this, such as time and gprof, but these do not allow as much control over precisely what is being measured as you will want in this course. Thus, this assignment, required for both EE599 and EE699, will have you experiment with a few timing methods.
Your program really isn't going to do very much; it will simply copy one array into another. However, it will do it five ways....
Declare the following:

#define BIG (8 * 1024 * 1024)
volatile union {
    long long d64[BIG];
    int d32[BIG * 2];
    short d16[BIG * 4];
    char d8[BIG * 8];
} source, dest;
The above code assumes that you are using GCC, which makes the field types 64, 32, 16, and 8 bits, respectively. Your code should simply copy source into dest. However, it should do it five different ways:

dest = source;
for (i=0; i<BIG; ++i) dest.d64[i] = source.d64[i];
for (i=0; i<(BIG*2); ++i) dest.d32[i] = source.d32[i];
for (i=0; i<(BIG*4); ++i) dest.d16[i] = source.d16[i];
for (i=0; i<(BIG*8); ++i) dest.d8[i] = source.d8[i];
For each copy, your code should report, per byte moved, the elapsed UNIX time (in nanoseconds) and the number of processor clock cycles. The data structure is large enough (64MB) that the caches do not cause subsequent copy operations to benefit significantly from cache use on earlier copies; however, you may want to run each copy technique several times to get more stable timing values.
The output of your program should be formatted using printf calls in the following sequence:

printf("Copy method\tns/B\tticks/B\n");
printf("Compiler\t%g\t%g\n", ...
printf("long long loop\t%g\t%g\n", ...
printf("int loop\t%g\t%g\n", ...
printf("short loop\t%g\t%g\n", ...
printf("char loop\t%g\t%g\n", ...
Notice that %g expects a floating-point value; even the number of processor ticks per byte isn't necessarily an integer.
You should compile your code two different ways: once without any optimization, and once with -O2.
Traditionally, computers have been shockingly sloppy about measuring time. Early systems commonly counted zero-crossings of the AC power, giving 1/60s timing resolution with substantial processor overhead. Modern systems generally use a combination of a low-resolution hardware clock, a high-resolution (1193180Hz) countdown timer, and software to reconcile the two hardware mechanisms. You can access UNIX system time using OS functions like clock() and gettimeofday().
Processor designers needed more detailed information about the performance of real codes (so that future processors could perform better), so most modern processors also contain performance counters -- one of which simply counts processor clock cycles. These counters are not portable across processor families, but, ever since the original Pentium, all IA32 processors have used the exact same instruction sequence to read the processor tick count register. Wrapped as C code, the sequence is:
union tick_union {
    unsigned long long val64;
    struct {
        unsigned int low32;
        unsigned int hi32;
    } lowhi;
};

unsigned long long ReadTick(void)
{
    register union tick_union result;
    register unsigned int msr = 0;
    __asm __volatile(
        ".byte 0xf; .byte 0x31 # Read the cycle counter\n"
        " movl %%edx, %0       # high order bits\n"
        " movl %%eax, %1       # low order bits\n"
        : "=g" (result.lowhi.hi32), "=g" (result.lowhi.low32)
        : "g" (msr)
        : "eax", "edx");
    return(result.val64);
}
Note that your readings will vary somewhat because it takes instructions to read the tick counter, so pipeline scheduling and cache/memory delays can limit repeatability to no better than about +/- 15 clock cycles. Of course, things like OS interference can change the tick count by much more.
This project is due by 11:59PM, Sunday, March 2, 2003. Submission details will be discussed in class. You will submit the following: