If you are trying to improve program execution speed, you certainly need to be able to accurately measure execution time. There are various UNIX commands for this, such as time and gprof, but these do not allow as much control over precisely what is being measured as you will want in this course. Thus, this assignment, required for both EE599 and EE699, will have you experiment with a few timing methods.
Your program really isn't going to do very much; it will simply copy one array into another. However, it will do it five ways....
Declare the following:

#define BIG (8 * 1024 * 1024)
volatile union {
    long long d64[BIG];
    int d32[BIG * 2];
    short d16[BIG * 4];
    char d8[BIG * 8];
} source, dest;
The above code assumes that you are using GCC, which makes the field types 64, 32, 16, and 8 bits, respectively. Your code should simply copy source into dest. However, it should do it five different ways:

dest = source;
for (i=0; i<BIG; ++i) dest.d64[i] = source.d64[i];
for (i=0; i<(BIG*2); ++i) dest.d32[i] = source.d32[i];
for (i=0; i<(BIG*4); ++i) dest.d16[i] = source.d16[i];
for (i=0; i<(BIG*8); ++i) dest.d8[i] = source.d8[i];
For each copy, your code should report, per byte moved, the elapsed UNIX time (in nanoseconds) and the number of processor clock cycles. The data structure is large enough (64MB) that the caches do not cause subsequent copy operations to benefit significantly from cache use on earlier copies; however, you may want to run each copy technique several times to get more stable timing values.
The output of your program should be formatted using printf calls in the following sequence:

printf("Copy method\tns/B\tticks/B\n");
printf("Compiler\t%g\t%g\n", ...
printf("long long loop\t%g\t%g\n", ...
printf("int loop\t%g\t%g\n", ...
printf("short loop\t%g\t%g\n", ...
printf("char loop\t%g\t%g\n", ...
Notice that %g expects a floating-point value; even the number of processor ticks per byte isn't necessarily an integer.
You should compile your code two different ways: once without any optimization, and once with -O2.
Traditionally, computers have been shockingly sloppy about measuring time. Early systems commonly counted zero-crossings of the AC power, giving 1/60s timing resolution with substantial processor overhead. Modern systems generally use a combination of a low-resolution hardware clock, a high-resolution (1193180Hz) countdown timer, and software to reconcile the two hardware mechanisms. You can access UNIX system time using OS functions like clock() and gettimeofday().
Processor designers needed more detailed information about the performance of real codes (so that future processors could perform better), so most modern processors also contain performance counters -- one of which simply counts processor clock cycles. These counters are not portable across processor families, but, ever since the original Pentium, all IA32 processors have used the exact same instruction sequence to read the processor tick count register. Wrapped as C code, the sequence is:
union tick_union {
    unsigned long long val64;
    struct {
        unsigned int low32;
        unsigned int hi32;
    } lowhi;
};

unsigned long long ReadTick(void)
{
    register union tick_union result;
    register unsigned int msr = 0;
    __asm __volatile(
        ".byte 0xf; .byte 0x31 # Read the cycle counter\n"
        " movl %%edx, %0       # high order bits\n"
        " movl %%eax, %1       # low order bits\n"
        : "=g" (result.lowhi.hi32), "=g" (result.lowhi.low32)
        : "g" (msr)
        : "eax", "edx");
    return(result.val64);
}
Note that your readings will vary somewhat because it takes instructions to read the tick counter, so pipeline scheduling and cache/memory delays can limit repeatability to no better than about +/- 15 clock cycles. Of course, things like OS interference can change the tick count by much more.
This project is due by 11:59PM, Sunday, March 2, 2003. Submission details will be discussed in class. You will submit the following: