Sample EE699 Project Paper

This paper, prepared by Hank Dietz, is an example of how you might create your project write-up for EE699.

Introduction: bzip2

Although the cost of archiving a byte of data keeps falling, compression remains important because good compression makes storage devices appear larger at no additional hardware cost. Additionally, if compression and decompression are fast enough, the reduction in the volume of data that must be read from or written to slow disk drives can actually yield a net speedup. The problem is simply that good compression tends to be computationally expensive... and some of the best techniques are subject to a variety of patent restrictions.

One of the best freely available compression programs, at least in terms of the degree of compression achieved, is bzip2. The full portable C source code is available at http://sources.redhat.com/bzip2/. Unfortunately, despite a highly tuned implementation and a very efficient algorithm, bzip2 spends nearly 100% of its time computing. Worse still, it is significantly slower than tools like gzip (which compresses only slightly less well). For example, compressing the source TAR file of bzip2 on an Athlon MP 1800+ Linux system yielded the following:

File                  Bytes    Real    User
----                  -----    ----    ----
bzip2-1.0.2.tar     1771520
bzip2-1.0.2.tar.gz   665198  0.246s  0.220s
bzip2-1.0.2.tar.bz2  624837  3.119s  3.040s
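
Timings like these can be gathered with the shell's time builtin, which reports the real (wall-clock) and user (user-mode CPU) times shown above. The exact commands and options used are not recorded here, but they were presumably along these lines:

time gzip -c bzip2-1.0.2.tar  > bzip2-1.0.2.tar.gz
time bzip2 -c bzip2-1.0.2.tar > bzip2-1.0.2.tar.bz2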

Although bzip2 does achieve about 6% better compression, it does so at the expense of nearly 14X more computation (3.040s versus 0.220s of user time, a ratio of about 13.8)! For this reason, I decided to try to speed up bzip2.

What's Slow?

To determine which parts of the bzip2 code were responsible for the poor performance, I began by changing the Makefile to compile with -pg (and without -fomit-frame-pointer, which interferes with profiling), so that I could use gprof to analyze the performance in more detail. Using the bzip2 source TAR file as test input, the following routines were responsible for most of the compute time:

 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
41.25      1.25     1.25        1     1.25     1.70  fallbackSort
14.85      1.70     0.45  2156817     0.00     0.00  mainGtU
12.21      2.07     0.37   225210     0.00     0.00  fallbackQSort3
11.22      2.41     0.34        2     0.17     0.49  mainSort
 7.92      2.65     0.24        2     0.12     0.12  generateMTFValues
 4.95      2.80     0.15    40386     0.00     0.00  mainQSort3
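
For reference, the profiling workflow is roughly the following; the CFLAGS line is a sketch, not the stock Makefile's exact flags:

# in the Makefile, something like:
#   CFLAGS = -Wall -O2 -pg
make clean ; make
./bzip2 -k -f bzip2-1.0.2.tar    # run the instrumented binary
gprof ./bzip2 gmon.out | head    # flat profile, as shown above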

The good news is that fallbackSort() accounts for a large fraction of the time (about 41% of it is self time), so speeding it up should yield a good speedup for the program as a whole; by Amdahl's law, even doubling the speed of code accounting for 41% of the runtime would make the whole program about 1/(1 - 0.41/2), or roughly 1.26X, faster. Even better, most of the time in that routine is spent in a single inner loop. The bad news is that the code was already heavily optimized:

/*-- find the next non-singleton bucket --*/
k = r + 1;
/* advance past 1 bits one at a time until word-aligned... */
while (ISSET_BH(k) && UNALIGNED_BH(k)) k++;
if (ISSET_BH(k)) {
  /* ...then skip whole words of 1s, then finish bit by bit */
  while (WORD_BH(k) == 0xffffffff) k += 32;
  while (ISSET_BH(k)) k++;
}
l = k - 1;
if (l >= nblock) break;
/* do the same for the following run of 0 bits */
while (!ISSET_BH(k) && UNALIGNED_BH(k)) k++;
if (!ISSET_BH(k)) {
  while (WORD_BH(k) == 0x00000000) k += 32;
  while (!ISSET_BH(k)) k++;
}
r = k - 1;
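
For context, ISSET_BH(), UNALIGNED_BH(), and WORD_BH() operate on a table of bits packed into an array of 32-bit words. From the way they are used, their definitions must be essentially the following (bhtab being the word array):

#define ISSET_BH(zz)      (bhtab[(zz) >> 5] & (1 << ((zz) & 31)))
#define WORD_BH(zz)       bhtab[(zz) >> 5]
#define UNALIGNED_BH(zz)  ((zz) & 0x01f)

That is, the high bits of a bit index select a word, the low five bits select a bit within that word, and UNALIGNED_BH() is nonzero until the index reaches a word boundary.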

A careful reading reveals that this code is essentially measuring runs of 1s and 0s in a large table of bits. It already takes advantage of the fact that, once it has reached a word boundary in the bit table, it can examine 32 bits at a time. That optimization makes an order-of-magnitude difference in runtime... but it's already been done. The only other trick would be to replace the bit-at-a-time linear scan within a word with a log-time SWAR (SIMD Within A Register) computation of the run length. To do that, I rewrote the inner loop as:

k = r + 1;
{
  /* How many bits are 1 from here? */
  RUInt32 w = WORD_BH(k);
  w >>= (k & 31);
  k += trailing1s(w);
  if ((k & 31) == 0) {
    while ((w = WORD_BH(k)) == 0xffffffff) k += 32;
    k += trailing1s(w);
  }
}
l = k - 1;
if (l >= nblock) break;
{
  /* How many bits are 0 from here? */
  RUInt32 w = ~WORD_BH(k);
  w >>= (k & 31);
  k += trailing1s(w);
  if ((k & 31) == 0) {
    while ((w = WORD_BH(k)) == 0) k += 32;
    k += trailing1s(~w);
  }
}
r = k - 1;

Although places like Hacker's Delight suggest reasonably fast ways to implement trailing1s(), I wanted a fast, branchless implementation. After a few attempts, I created:

inline static UInt32
trailing1s(RUInt32 x)
{
  /* Branchless binary search: at each step, if the low half of the
     current window is all 1s, add its width to the count and shift
     it away.  The subtract-then-sign-extend trick turns "were they
     all 1s?" into an all-0s or all-1s mask without a branch. */
  RUInt32 t, c = 0;
  t = ((x & 0xffff) - 0xffff); t = (((Int32) t) >> 31);
  t = ((~t) & 16); c += t; x >>= t;
  t = ((x & 0xff) - 0xff); t = (((Int32) t) >> 31);
  t = ((~t) & 8); c += t; x >>= t;
  t = ((x & 0xf) - 0xf); t = (((Int32) t) >> 31);
  t = ((~t) & 4); c += t; x >>= t;
  t = ((x & 0x3) - 0x3); t = (((Int32) t) >> 31);
  t = ((~t) & 2); c += t; x >>= t;
  /* up to two 1 bits can remain; counting both (rather than just
     one) also makes trailing1s(0xffffffff) correctly return 32 */
  return(c + (x & 1) + ((x >> 1) & x & 1));
}
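
As a quick sanity check, the branchless routine can be compared against an obvious bit-at-a-time reference. The following standalone harness is my own sketch, not part of bzip2; the typedefs are assumptions standing in for bzip2's own:

/* Standalone sanity check for trailing1s(): compare the branchless
   version above against a naive bit-at-a-time reference. */
#include <stdio.h>

typedef unsigned int UInt32;
typedef int Int32;
typedef UInt32 RUInt32;

inline static UInt32
trailing1s(RUInt32 x)
{
  RUInt32 t, c = 0;
  t = ((x & 0xffff) - 0xffff); t = (((Int32) t) >> 31);
  t = ((~t) & 16); c += t; x >>= t;
  t = ((x & 0xff) - 0xff); t = (((Int32) t) >> 31);
  t = ((~t) & 8); c += t; x >>= t;
  t = ((x & 0xf) - 0xf); t = (((Int32) t) >> 31);
  t = ((~t) & 4); c += t; x >>= t;
  t = ((x & 0x3) - 0x3); t = (((Int32) t) >> 31);
  t = ((~t) & 2); c += t; x >>= t;
  return(c + (x & 1) + ((x >> 1) & x & 1));
}

static UInt32
slow1s(UInt32 x)
{
  UInt32 c = 0;
  while (x & 1) { ++c; x >>= 1; }
  return(c);
}

int
main(void)
{
  /* test runs of n trailing 1s, bare and with garbage above the run */
  UInt32 n, bad = 0;
  for (n = 0; n <= 32; ++n) {
    UInt32 run = (n >= 32) ? 0xffffffff : ((1u << n) - 1);
    UInt32 x = (n <= 30) ? (run | (0xa5a5a5a5u << (n + 1))) : run;
    bad += (trailing1s(run) != slow1s(run));
    bad += (trailing1s(x) != slow1s(x));
  }
  printf("%s\n", bad ? "MISMATCH" : "all ok");
  return(bad != 0);
}

Compiled with gcc and run, it should print all ok.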

That change gave about a 10% speedup on that loop... less than I had hoped for, but still a noticeable improvement.

To improve performance further, I also tried several other, smaller changes.

These additional changes had virtually no effect on the time for my test case. However, for very large compression problems, fallbackSort() is not as dominant a component of the execution time, and there the additional changes made the performance improvement more stable across different inputs.

It is significant that the Makefile for bzip2 automatically tests correctness of the compressor using a short validation suite. Thus, I did not have to worry about introducing errors in my optimizations; such errors were immediately flagged as part of the make process.

Results

For the test case, the final breakdown measured by gprof was:

 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
40.14      1.18     1.18        1     1.18     1.65  fallbackSort
13.95      1.59     0.41   224388     0.00     0.00  fallbackQSort3
13.27      1.98     0.39  2156817     0.00     0.00  mainGtU
11.90      2.33     0.35        2     0.17     0.47  mainSort
 7.48      2.55     0.22        2     0.11     0.11  generateMTFValues
 5.44      2.71     0.16    40386     0.00     0.00  mainQSort3
 2.38      2.78     0.07        2     0.04     0.04  sendMTFValues
 2.04      2.84     0.06   342979     0.00     0.00  fallbackSimpleSort
 1.36      2.88     0.04   111832     0.00     0.00  mainSimpleSort
 1.36      2.92     0.04      357     0.00     0.00  copy_input_until_stop

However, this breakdown required the use of -pg and relatively "wimpy" optimization flags when compiling the C code. Using a more ambitious set of compiler optimizations (including tuning for the Athlon architecture; a sketch of such flags follows the timing comparison below), gcc produced the following results:

bzip2-1.0.2.tar.bz2  624837  3.119s  3.040s  original
bzip2-1.0.2.tar.bz2  624837  2.930s  2.890s  optimized

That is just over 1.05X, or about a 5% speedup.
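
The exact flag set is not recorded here; as a purely hypothetical example, the kind of gcc options meant would be along these lines:

CFLAGS = -O3 -march=athlon -fomit-frame-pointer -funroll-loops

(All of these are standard gcc flags; -march=athlon enables Athlon-specific instruction selection and scheduling.)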

As a second test case, I used a huge email archive. Execution time went from 91.95s to 87.80s, for a speedup of about 5%.

In conclusion, there is now very little unoptimized code in the slow portions of the program. None of that code can benefit from exploiting compile-time constants (i.e., partial evaluation would not help). There just isn't much that can be done except to change the algorithms and data structures. In particular, fallbackSort() uses a very clever sorting technique, but there may be faster ways to accomplish the same sort.


Hank Dietz
hankd@engr.uky.edu