This paper, prepared by Hank Dietz, is an example of how you might create your project write-up for EE699.
Despite the fact that it is always getting cheaper to archive a byte of data, compression remains important because good compression makes storage devices appear larger at no additional hardware cost. Additionally, if compression and decompression are fast enough, the reduction in the volume of data that must be read from or written to slow disk drives can actually yield a speedup. The problem is simply that good compression tends to be computationally expensive... and some of the best techniques are subject to a variety of patent restrictions.
One of the best freely-available compression programs, at least in terms of the degree of compression achieved, is bzip2. The full portable C source code is available at http://sources.redhat.com/bzip2/. Unfortunately, despite a highly-tuned implementation and a very efficient algorithm, bzip2 spends nearly 100% of its time computing. Worse still, it is significantly slower than tools like gzip (which compresses only slightly less well). For example, compressing the source TAR file of bzip2 using an Athlon MP 1800+ Linux system yielded the following:
    File                   Bytes    Real    User
    ----                   -----    ----    ----
    bzip2-1.0.2.tar      1771520
    bzip2-1.0.2.tar.gz    665198  0.246s  0.220s
    bzip2-1.0.2.tar.bz2   624837  3.119s  3.040s
Although bzip2 does achieve about 6% better compression, it does so at the expense of nearly 14X more computation! For this reason, I decided to try to speed up bzip2.
In order to determine which parts of the bzip2 code were responsible for the poor performance, I began by changing the Makefile to compile with -pg (and without -fomit-frame-pointer), so that I could use gprof to analyze the performance in more detail. Using the bzip2 source TAR file as test input, the following routines were responsible for most of the compute time:
      %   cumulative   self              self     total
     time   seconds   seconds    calls   s/call   s/call  name
    41.25      1.25     1.25        1     1.25     1.70  fallbackSort
    14.85      1.70     0.45  2156817     0.00     0.00  mainGtU
    12.21      2.07     0.37   225210     0.00     0.00  fallbackQSort3
    11.22      2.41     0.34        2     0.17     0.49  mainSort
     7.92      2.65     0.24        2     0.12     0.12  generateMTFValues
     4.95      2.80     0.15    40386     0.00     0.00  mainQSort3
The good news is that fallbackSort() takes a large fraction of the time, so speeding it up should yield good speedup for the program as a whole. Even better, most of the time in that routine is spent in a single inner loop. The bad news is that the code already was heavily optimized:
    /*-- find the next non-singleton bucket --*/
    k = r + 1;
    while (ISSET_BH(k) && UNALIGNED_BH(k)) k++;
    if (ISSET_BH(k)) {
       while (WORD_BH(k) == 0xffffffff) k += 32;
       while (ISSET_BH(k)) k++;
    }
    l = k - 1;
    if (l >= nblock) break;
    while (!ISSET_BH(k) && UNALIGNED_BH(k)) k++;
    if (!ISSET_BH(k)) {
       while (WORD_BH(k) == 0x00000000) k += 32;
       while (!ISSET_BH(k)) k++;
    }
    r = k - 1;
A careful reading reveals that the code is essentially counting runs of 1s and 0s in a large table of bits. The code already takes advantage of the fact that, once it has reached a word boundary in the bit table, it can compare 32 bits at a time. That optimization makes an order-of-magnitude difference in runtime... but it had already been done. The only other trick would be to replace the linear summing of run lengths with a log-time SWAR (SIMD Within A Register) reduction. To do that, I rewrote the inner loop as:
    k = r + 1;
    {  /* How many bits are 1 from here? */
       RUInt32 w = WORD_BH(k);
       w >>= (k & 31);
       k += trailing1s(w);
       if ((k & 31) == 0) {
          while ((w = WORD_BH(k)) == 0xffffffff) k += 32;
          k += trailing1s(w);
       }
    }
    l = k - 1;
    if (l >= nblock) break;
    {  /* How many bits are 0 from here? */
       RUInt32 w = ~WORD_BH(k);
       w >>= (k & 31);
       k += trailing1s(w);
       if ((k & 31) == 0) {
          while ((w = WORD_BH(k)) == 0) k += 32;
          k += trailing1s(~w);
       }
    }
    r = k - 1;
Although places like Hacker's Delight suggest reasonably fast ways to implement trailing1s(), I wanted a fast, branchless implementation. After a few attempts, I created:
    inline static UInt32 trailing1s(RUInt32 x)
    {
       RUInt32 t, c = 0;

       /* If the low 16 bits of x are all 1s, t becomes 16: add it to the
          count and shift those bits away; otherwise t is 0 and nothing
          happens.  Repeat for 8, 4, and 2 bits, then pick up the last
          bit directly. */
       t = ((x & 0xffff) - 0xffff); t = (((Int32) t) >> 31); t = ((~t) & 16); c += t; x >>= t;
       t = ((x & 0xff) - 0xff);     t = (((Int32) t) >> 31); t = ((~t) & 8);  c += t; x >>= t;
       t = ((x & 0xf) - 0xf);       t = (((Int32) t) >> 31); t = ((~t) & 4);  c += t; x >>= t;
       t = ((x & 0x3) - 0x3);       t = (((Int32) t) >> 31); t = ((~t) & 2);  c += t; x >>= t;
       return(c + (x & 1));
    }
That change gave about a 10% speedup on that loop... less than I had hoped for, but still a noticeable improvement.
To improve performance further, I tried several other changes.
These additional changes had virtually no effect on the time for my test case. However, for very large compression problems, fallbackSort() wasn't as dominant a component of the execution time. These additional changes made the performance improvement more stable over different compression problems.
It is significant that the Makefile for bzip2 automatically tests correctness of the compressor using a short validation suite. Thus, I did not have to worry about introducing errors in my optimizations; such errors were immediately flagged as part of the make process.
For the test case, the final breakdown measured by gprof was:
      %   cumulative   self              self     total
     time   seconds   seconds    calls   s/call   s/call  name
    40.14      1.18     1.18        1     1.18     1.65  fallbackSort
    13.95      1.59     0.41   224388     0.00     0.00  fallbackQSort3
    13.27      1.98     0.39  2156817     0.00     0.00  mainGtU
    11.90      2.33     0.35        2     0.17     0.47  mainSort
     7.48      2.55     0.22        2     0.11     0.11  generateMTFValues
     5.44      2.71     0.16    40386     0.00     0.00  mainQSort3
     2.38      2.78     0.07        2     0.04     0.04  sendMTFValues
     2.04      2.84     0.06   342979     0.00     0.00  fallbackSimpleSort
     1.36      2.88     0.04   111832     0.00     0.00  mainSimpleSort
     1.36      2.92     0.04      357     0.00     0.00  copy_input_until_stop
However, this breakdown required the use of -pg and relatively "wimpy" optimization flags for the C compilation. Using a more ambitious set of compiler optimizations (including enabling optimization for the Athlon architecture), gcc produced the following results:
    bzip2-1.0.2.tar.bz2   624837  3.119s  3.040s  original
    bzip2-1.0.2.tar.bz2   624837  2.930s  2.890s  optimized
That is just over 1.05X, or about a 5% speedup.
As a second test case, I used a huge email archive. Execution time went from 91.95s to 87.80s, again a speedup of about 5%.
In conclusion, there is now very little unoptimized code in the slow portions of the program. None of that code can benefit from exploiting compile-time constants (i.e., partial evaluation would not help). There just isn't much more that can be done short of changing the algorithms and data structures. In particular, fallbackSort() uses a very clever sorting technique, but there may be faster ways to accomplish the same sort.
Hank Dietz hankd@engr.uky.edu