As if you didn't know, this document is always under construction....
Don't see your question above? You can send email to
hankd@engr.uky.edu.
Maybe I'll even answer. ;-)
We pronounce NAK like "N-Ack" -- just like the "Negative Acknowledgement" signal.
NAK stands for NVIDIA Athlon cluster in Kentucky.
The name NAK was selected in part to reflect our recognition that people will not want to seriously acknowledge or duplicate the design of this machine because it is so cheap and built using such old and "wimpy" hosts. However, NAK also pairs with the name of our next planned machine, ACK. ACK will demonstrate the same "nothing but power and ground" concept for which we built NAK, but using a node design with which we expect the mainstream supercomputing community to be much more comfortable.
Although NAK is the latest machine in the sequence of "performance engineered network" machines that began with KLAT2 (Kentucky Linux Athlon Testbed 2) and KASY0 (Kentucky ASYmmetric Zero), NAK isn't about testing a new network design method. In fact, the network is an FNN very much like KLAT2 had. Unlike KASY0, NAK even uses a very conventional physical layout. The innovation in NAK concerns the concept of using GPUs as the primary computation engines, running applications on the GPUs such that most of the time the hosts are providing nothing but power and ground. We'll sometimes call this idea NoBuPAG (pronounced "No-Boo-Pag"), although we're not in love with that abbreviation.
NAK is a GPU cluster -- arguably the first true GPU cluster. Use of the GPUs isn't as "optional acceleration" for some tasks with the CPUs doing most of the work; the intent is that programs run almost entirely within the GPUs. GPU "threads" or "virtual processors" are the processing elements of NAK, not host processor cores. The nodes of NAK are not seen as processors, but as "containers" for the shared-memory parallel systems that are the GPUs.
While it is true that recycling node hardware from KASY0 made NAK amazingly cheap and very "green," the use of Athlon XP hosts, a simple PCI interface to each GPU, and as much memory in the GPUs as in the hosts underscores the emphasis on computations truly residing within the GPUs.
When we built KLAT2, we made a lot of references to Klaatu, the fellow from outer space in the classic 1951 science fiction movie The Day The Earth Stood Still. Expect to see a lot of old movie references for NAK (and ACK) too, but to the 1927 film Metropolis.
Why refer to Metropolis? Society in Metropolis is divided into two classes: the planners/managers and the workers. The whole point of the movie is that the managers must give the workers full consideration, rather than blindly commanding them to serve their desires. We like the analogy of the GPUs as the workers and the CPUs as the managers -- it's about maximizing benefit by focusing on the GPUs.
Coincidentally, there is a newly restored version of Metropolis being released this year... which also fits with the new life for KASY0's nodes.
NAK is green in several ways:
We don't know yet. It certainly will not exceed 9 single-precision TFLOPS. Here's the data we have thus far:
Number of Bodies | Total GFLOPS (single precision)
---|---
512 | 1,830
1,024 | 3,766
2,048 | 3,885
4,096 | 3,925
8,192 | 3,940
16,384 | 3,946
32,768 | 3,949
65,536 | 3,950
131,072 | 3,951
262,144 | 3,952
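For context on how numbers like these are usually derived: GPU N-body benchmarks conventionally charge about 20 floating-point operations per pairwise body-body interaction (the convention used by NVIDIA's CUDA N-body sample), so the rate is roughly 20*N^2 FLOPs per time step divided by the measured step time. Here is a minimal sketch of that accounting in C; the 20-FLOP constant, the timings, and the function name are illustrative assumptions, not our measurement code.

```c
#include <stdio.h>

/* Hypothetical helper: convert an N-body timing into GFLOPS using the
 * common "20 FLOPs per body-body interaction" convention (an assumption
 * here; other codes may count differently, per the discussion below). */
static double nbody_gflops(long bodies, long steps, double seconds)
{
    double interactions = (double)bodies * (double)bodies * (double)steps;
    return (20.0 * interactions) / seconds / 1.0e9;
}

int main(void)
{
    /* Made-up example: 65,536 bodies, 100 steps in 21.75 s would report
     * roughly 395 GFLOPS for one GPU; cluster totals sum over all GPUs. */
    printf("%.1f GFLOPS\n", nbody_gflops(65536, 100, 21.75));
    return 0;
}
```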
Good question. Everybody agrees that a GFLOPS (pronounced "Gig-Ah-Flops") is a billion (1,000,000,000) floating-point operations per second. It's less clear why we don't write it as "BFlOpS," but hey, it's not my fault. The "S" is capitalized because it stands for seconds; the plural of FLOP is written with a lowercase "s" as FLOPs. However one writes it, there are two major ambiguities in the definition of GFLOPS:
Obviously, any operation on a floating-point value, right? Well, yes and no. Computing sin(x) is really done by evaluating a series expansion, so does sin(x) count as one FLOP or as however many addition/subtraction/multiplication/division operations are in the series expansion used? The opposite effect occurs with things like the "Cray FLOP" measure: an absolute value operation on an old Cray was implemented by a three-instruction sequence, so it was counted as 3 FLOPs; however, everybody knows all you have to do is zero the sign bit, which takes only a single (integer) bitwise-AND instruction -- no FLOPs at all. How you count can make a huge difference. If your code only does addition and multiplication, there is general agreement on how to count those... but even a subtraction causes ambiguity: is it one subtraction or an addition of the negative?
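To make the "Cray FLOP" point concrete, here is a little sketch of our own (not Cray's code): absolute value done with no floating-point instructions at all, just an integer AND that clears the sign bit. Whether you count this as 0, 1, or 3 FLOPs is exactly the bookkeeping ambiguity described above.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Absolute value of an IEEE-754 single by clearing the sign bit.
 * No floating-point instruction executes, so is it 0 FLOPs?
 * The old Cray accounting would have charged 3. */
static float fabs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined type pun */
    bits &= 0x7fffffffu;              /* zero the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

int main(void)
{
    printf("%f %f\n", fabs_by_mask(-3.25f), fabs_by_mask(3.25f));
    return 0;
}
```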
The Top500 list essentially defines FLOPs by a formula that counts additions and multiplications assuming a particular algorithm; in the past it allowed other algorithms (see Jack Dongarra's January 18, 2001 report and note the NEC and Cray machines using Strassen's Algorithm), yet overly generously counted the FLOPs as though Gaussian Elimination were used. For our Linpack (tuned HPL) performance numbers, we use the standard Gaussian Elimination algorithm and will quote the FLOPs counted by the Top500's formula.
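To make "the Top500's formula" concrete: the Gaussian Elimination operation count for an n x n system is dominated by a 2n^3/3 term for the factorization plus a lower-order n^2 term for the solves, and dividing by the measured run time gives FLOPS. A hedged sketch follows; the exact n^2 coefficient below is just one common convention (an assumption on our part), and it hardly matters at realistic problem sizes.

```c
#include <stdio.h>

/* Convert an HPL/Linpack problem size and run time into GFLOPS.
 * The 2n^3/3 term is the Gaussian Elimination factorization count;
 * the lower-order n^2 solve term is written as 2n^2 here, but other
 * coefficients appear in the literature and it is negligible anyway. */
static double hpl_gflops(double n, double seconds)
{
    double flop = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flop / seconds / 1.0e9;
}

int main(void)
{
    /* e.g., a (made-up) n = 40,000 run finishing in 1,000 seconds */
    printf("%.1f GFLOPS\n", hpl_gflops(40000.0, 1000.0));
    return 0;
}
```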
Obviously, accurate enough. ;-) Unfortunately, it is very difficult to determine how much accuracy remains after a non-trivial computation is performed using a specific precision, yet precision (number of bits used to store a value) is all that one can directly control. An excellent overview is given in What Every Computer Scientist Should Know About Floating-Point Arithmetic; it isn't exactly light reading, but at least it's lighter than the IEEE 754/854 standards. The standards provide for different bases (e.g., 2, 10, 16), rounding modes, predictive infinities, NaN (Not-a-Number), denormalized arithmetic, etc. The result is that fully compliant implementations of floating point can have a very wide range of accuracy... and there also are many "slightly" non-compliant versions that omit some of the more complex features (which have very little impact, if any, on accuracy). Grossly inferior accuracy, such as the old Crays yielded, is essentially gone from modern machines except for explicit options to perform low-precision versions of inverse, square root, or inverse square root.
Although the Top500 list has a history of accepting whatever floating point was native to the machine (again, see Jack Dongarra's January 18, 2001 report), the latest Top500 FAQ includes an attempt to specify "64 bit floating point arithmetic" -- but, as discussed above, that isn't a well-defined thing. Another interesting point is that 64-bit isn't always 64-bit: because PCs have complicated operations (e.g., sin(x)) implemented as single instructions, x87 floating point registers actually hold 80-bit results that get converted to 64-bits (or even 32-bits) when stored into memory. Thus, PCs actually get significantly higher accuracy than machines with true 64-bit floating-point... sometimes, even 32-bit stored results are more accurate than 64-bit values computed on other machines! The analysis is even more complex in that PC processors using 3DNow!, SSE, and SSE2 -- and GPUs -- do not promise use of an 80-bit internal representation.
In summary, few real-world phenomena can be directly measured to accuracies much greater than the 24-bit mantissa of a typical 32-bit IEEE floating-point number. (Ever see a 24-bit linear Analog-to-Digital converter?) Thus, single precision (roughly 32-bit) values are useful for most carefully-coded floating-point algorithms. For example, the Computational Fluid Dynamics (CFD) code that got us a Gordon Bell award in 2000 works beautifully with 32-bit 3DNow! arithmetic. Double precision allows one to be slightly sloppier about accuracy analysis and also provides a significantly wider dynamic range (more exponent bits). Half precision 16-bit values are now commonly used in DSP applications. To us, all these are valid "FLOP precisions" -- but you should specify which you're counting, and we do.
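As a small illustration of why "carefully-coded" matters more than raw precision: naively summing many single-precision values loses digits, while the same 32-bit data summed with compensated (Kahan) summation tracks a double-precision reference closely. This sketch is ours, purely for illustration, not a fragment of our CFD code; compile it without -ffast-math so the compensation isn't optimized away.

```c
#include <stdio.h>

/* Sum the same 32-bit data three ways: naive float, Kahan-compensated
 * float, and double as a reference. Careful coding recovers most of
 * the accuracy without ever leaving single precision. */
int main(void)
{
    const int n = 10000000;
    float naive = 0.0f, kahan = 0.0f, c = 0.0f;
    double ref = 0.0;
    for (int i = 0; i < n; i++) {
        float x = 0.1f;               /* not exactly representable */
        naive += x;
        float y = x - c;              /* Kahan: carry the lost low bits */
        float t = kahan + y;
        c = (t - kahan) - y;
        kahan = t;
        ref += (double)x;
    }
    printf("naive=%f kahan=%f double=%f\n", naive, kahan, ref);
    return 0;
}
```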
Oh yes... you also need to specify what program code you're running because some codes do lots of useful stuff that isn't FLOPs, but the above discussion is already rather long.... ;-)
If you know us, you know we have a history of setting new price/performance records in high-performance computing. NAK is very good in this respect, but the reuse of 7-year-old hardware makes it very difficult to account for cost. It could be argued that the bulk of NAK's components have a negative cost because one would have had to pay to get the parts hauled away. At the University of Kentucky, that's called the "Waste Recharge" cost, and is budgeted as 0.99% of the purchase price -- about $195 for the stuff reused in NAK.
An itemized cost of NAK will be posted here soon, but the new stuff was well under $6,000, including many spare parts (arguably more important for a machine with many 7-year-old components). KLAT2 was around $650/GFLOPS. KASY0 was around $84/GFLOPS. Assuming junk parts are free, NAK's price/performance may be as good as $0.65/GFLOPS. If one does the wildly unrealistic exercise of counting old parts at original cost, it's about $2.83/GFLOPS. For now, let's just say that it is easily the cheapest 64-node GPU system ever built and probably around $1/GFLOPS....
Would $1/GFLOPS be a world record? It is difficult to say. In recent years, price/performance records have become muddied by a variety of people claiming performance -- usually on GPUs -- that is significantly higher than the theoretical peak of the hardware they used. (I believe the impolite term for this is lying, but one should never assume malice where incompetence would suffice.) The records are further muddied by Ambric and ATI claiming $0.42 and $0.13 per GFLOPS, respectively, but without including host system cost... while requiring a relatively expensive host that, unlike our hosts, isn't old junk.
We will not play any of those games. We'll publish real numbers as soon as we have them. For now, all we are claiming is that the price/performance is far superior to that of a typical conventional cluster supercomputer, even one augmented by GPUs.
Don't know yet.
Although we have occasionally used such applications as burn-in tests, we have more important things ready to run on NAK.
Each node runs Linux. It's very straightforward except that the nodes are diskless and the network requires the FNN drivers.
NAK's configuration is:
It is appropriate to note that NAK also has an additional "head" that looks a lot like another node except it has 3GB of RAM and a hard disk drive. That head is used to isolate NAK's internal network and also runs Perceus to provide the diskless nodes of NAK with their images. The head is connected to the nodes via a 12th switch that is connected to 8 of the 11 switches inside NAK's FNN, thus giving a conventional tree connection across the boot Ethernet NICs for the purpose of distributing the images.
Using ATI 4650 AGP GPUs, instead of NVIDIA 9500 PCI GPUs, to upgrade the KASY0 nodes should have given us a nearly 3X faster cluster at slightly lower cost and power consumption. We originally had planned on doing that, or on a half-and-half cluster running OpenCL across both ATI and NVIDIA GPUs.
Unfortunately, our testing revealed that AMD/ATI software -- including the IA32 driver that is officially supposed to work with these cards -- currently uses SSE3 instructions without checking the CPUID to see if they're supported (violating AMD's own coding guidelines). The result is that both OpenCL and CAL programs die on illegal instructions. Oddly, OpenGL code works, but performance was half that of the NVIDIA cards under OpenGL!
In fairness, the AMD/ATI OpenCL SDK documentation does state that SSE3 is required for OpenCL to run on the host. However, it is not explicit about SSE3 being needed when targeting the GPU and that note isn't even on the support click path to the broken driver. There is an environment option discussed on the developer forum which is supposed to make the code run on any IA32 processor... perhaps it disables the SSE3 instructions, but it's actually the SSE2 code and CLFlush extension that cause the problems. In fact, I don't understand why AMD, who didn't support SSE2 in any of their 32-bit processors (it was added to AMD64), is distributing a 32-bit SDK and driver requiring SSE2, CLFlush, and SSE3. The vast majority of motherboards with an AGP slot cannot even take AMD processors that support these extensions.
Enough complaining about AMD/ATI's broken software support; let's fix it. Well, we can't. We tried writing a SIGILL trap handler to emulate the missing SSE3 instructions, but it turns out that some SSE2 instructions are silently mutated into non-equivalent MMX instructions on processors that do not support them. Patching the object code of the driver is very nasty yet technically feasible for us, but distributing the resulting code could have unpleasant legal implications. Through both the AMD/ATI developer forum and a higher-level personal contact, we have been unable even to get permission to fix this problem for them; indeed, nobody at AMD/ATI has yet acknowledged that they see this as a problem.
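For the curious, the trap-and-emulate attempt looked roughly like the sketch below -- a minimal reconstruction for illustration, not our actual handler. It is IA32/glibc specific, and the actual instruction decoding/emulation is the hard part we've stubbed out; as noted, the silent SSE2-to-MMX mutation means no trap ever fires for those cases, which is exactly why this approach failed.

```c
/* IA32/Linux sketch of a SIGILL trap-and-emulate handler (illustrative
 * reconstruction). Build 32-bit: gcc -m32 ... */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static void sigill_handler(int sig, siginfo_t *si, void *vctx)
{
    ucontext_t *ctx = (ucontext_t *)vctx;
    unsigned char *ip = (unsigned char *)ctx->uc_mcontext.gregs[REG_EIP];
    (void)sig; (void)si;

    /* Here one would decode ip[0], ip[1], ... and, for a recognized
     * SSE3 opcode, emulate its effect on the saved register state,
     * then advance REG_EIP past the instruction. We never got that
     * far for SSE2: those instructions don't fault at all. */
    fprintf(stderr, "SIGILL at %p: opcode bytes %02x %02x %02x\n",
            (void *)ip, ip[0], ip[1], ip[2]);
    exit(1);  /* stub: give up instead of emulating */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = sigill_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGILL, &sa, NULL);

    /* ud2 always raises SIGILL, standing in here for an unsupported
     * SSE3 instruction on an Athlon XP. */
    __asm__ volatile ("ud2");
    return 0;
}
```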
In summary, I am personally deeply disappointed with AMD/ATI's handling of this issue. I believe that all IA32 support code should be capable of running with any IA32 processor, and the use of CPUID testing to select codings for alternative instruction set extensions will be even more important in the AMD64/Intel64 versions (thanks to a multitude of extension sets and some divergence between AMD and Intel). Ideally, I can see the CPUID testing happening at install time, rather than at runtime, but AMD's own guidelines say it should be done and I heartily agree.
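Here is a hedged sketch of the kind of check we mean, using GCC's <cpuid.h>: test the CPUID leaf 1 feature bits for SSE2, CLFLUSH, and SSE3 before dispatching into code paths that use them. The bit positions are the architecturally defined ones; the fallback code-path names in the comment are made up for illustration.

```c
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#include <stdio.h>

/* CPUID leaf 1 feature bits (architecturally defined). */
#define EDX_CLFSH (1u << 19)   /* CLFLUSH */
#define EDX_SSE2  (1u << 26)
#define ECX_SSE3  (1u << 0)

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 unavailable; use the plain IA32 path\n");
        return 0;
    }
    printf("SSE2:    %s\n", (edx & EDX_SSE2)  ? "yes" : "no");
    printf("CLFLUSH: %s\n", (edx & EDX_CLFSH) ? "yes" : "no");
    printf("SSE3:    %s\n", (ecx & ECX_SSE3)  ? "yes" : "no");

    /* A driver or SDK would use these tests to pick a code path,
     * e.g., sse3_kernel() vs. plain_ia32_kernel() (names made up),
     * instead of assuming the extensions exist. */
    return 0;
}
```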
It should be said, however, that we don't hold a grudge even though we've been burnt on ATI software several times since we began a two-year development effort using CTM. Our current plans are that ACK will be an ATI GPU cluster. Let's hope that we don't find still more problems with AMD/ATI that will divert us from that plan....
There are really two new things about NAK:
There may be additions to that list as we complete our software environment....
Well, we had them. Actually, they work shockingly well at hosting GPUs. Simple runs of NVIDIA's CUDA benchmarks reveal that GPU performance suffers by less than 2% when comparing a beefy 2.6GHz i7 host with PCI-E to our 2GHz Athlon XP host with PCI! Wow! Let me say that again: there is no significant benefit to a fast 64-bit host and PCI-E!
How can that be true? Gamers will tell you that there is a huge performance difference between cards and host configurations... but interactive games tend to have so many host-GPU interactions that only a tiny fraction of the peak performance of either is delivered. In contrast, our goal is getting as close as possible to peak GPU performance. That implies the computation doesn't leave the GPU often, nor for much data. Host speed and the GPU bus interface only determine the speed of those infrequent interactions. Like we said, it really is all about nothing but power and ground being demanded from the host most of the time.
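A back-of-the-envelope model shows why: if each GPU-resident step takes some fixed time and only occasionally exchanges a modest amount of data with the host, the bus speed only shows up in that exchange term. The sketch below works that out; every number in it (step times, bytes exchanged, bus bandwidths) is a made-up placeholder, not a measurement from NAK.

```c
#include <stdio.h>

/* Toy model: fraction of time lost to host interaction when a code
 * exchanges `bytes` with the host once every `steps_per_exchange`
 * GPU-resident steps of `t_gpu` seconds each. All inputs here are
 * illustrative placeholders, not NAK measurements. */
static double overhead(double t_gpu, int steps_per_exchange,
                       double bytes, double bus_bytes_per_sec)
{
    double t_exchange = bytes / bus_bytes_per_sec;
    return t_exchange / (steps_per_exchange * t_gpu + t_exchange);
}

int main(void)
{
    double mb = 1.0e6;
    /* Exchange 4MB every 100 steps of 20ms each, over ~100MB/s
     * (PCI-ish) vs ~2GB/s (PCI-E-ish) effective bandwidth (assumed). */
    printf("slow bus: %.2f%% lost\n",
           100.0 * overhead(0.020, 100, 4.0 * mb, 100.0 * mb));
    printf("fast bus: %.2f%% lost\n",
           100.0 * overhead(0.020, 100, 4.0 * mb, 2000.0 * mb));
    return 0;
}
```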
Of course, running across a cluster, many applications will depend on the performance of the network. The network design for NAK attempts to match the PCI bus performance, and the Athlon XP with 512MB of memory has plenty of speed and buffer space to perform network operations efficiently. In summary, across a cluster we may find that the network performance limits the GPUs -- as is the case for many codes on most well-designed clusters. However, a faster host CPU and bus interface wouldn't change that situation significantly, and the cost of a network that would profit significantly from a faster host would be many times the total cost of our system.
One last note about the Athlon XP processors: they are not really slower than modern cores. Head-to-head comparisons of a single thread running on the Athlon XP vs. the i7 typically show the Athlon XP to be more than 2X faster! Put another way, the quad-core i7 is about 3X faster than the Athlon XP, but it needs 8 threads to get that speed. There are a few applications where the i7 blows away the Athlon XP, but that seems to be due to memory accesses. For example, the "smart remove selection" GIMP plug-in runs an order of magnitude faster on the i7, but it does deliberately random access to pixels within an image -- all or most of an image fits in the i7's 8MB cache, while virtually every access is a miss for the Athlon XP's 256KB cache. Fortunately, we run a lot of code that is written to work within a smaller cache.
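The cache effect we're describing is easy to see with a tiny micro-benchmark: chase random indices through a buffer small enough for a 256KB-class cache versus one that needs an 8MB-class cache. The sizes and loop counts below are arbitrary; this is an illustration of the mechanism, not a reproduction of the GIMP plug-in.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Randomly access `n` ints a fixed number of times and return a
 * checksum so the loop isn't optimized away. A small working set
 * stays in cache; a large one misses on nearly every access on a
 * 256KB-cache machine. */
static long random_walk(int *buf, size_t n, long accesses)
{
    long sum = 0;
    unsigned r = 12345;
    for (long i = 0; i < accesses; i++) {
        r = r * 1103515245u + 12345u;     /* cheap LCG */
        sum += buf[r % n];
    }
    return sum;
}

int main(void)
{
    const long accesses = 50 * 1000 * 1000;
    size_t sizes[] = { 128u * 1024, 32u * 1024 * 1024 };  /* bytes */
    for (int s = 0; s < 2; s++) {
        size_t n = sizes[s] / sizeof(int);
        int *buf = calloc(n, sizeof(int));
        clock_t t0 = clock();
        long sum = random_walk(buf, n, accesses);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%8zu KB working set: %.2f s (checksum %ld)\n",
               sizes[s] / 1024, secs, sum);
        free(buf);
    }
    return 0;
}
```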
Of course they should. ;-) However, 512MB each is a reasonable size for system software and network buffers. Actually, we originally tested nodes with 512MB, but initially built them with 1GB in each. A month later, when we realized more 512MB sticks would be useful in building HAK, we brought the NAK nodes back down to 512MB each.
Strange as it sounds, NAK actually has an unusually large memory where it counts: inside the GPUs. Completely independent of the host memory, each GPU contains 1GB of memory. That is as big as most commodity GPU cards get. However, each GPU has a fairly modest number of processing elements (4 little 8-wide SIMD engines, or what NVIDIA calls 32 CUDA cores), so the memory per GPU processing element -- roughly 32MB per CUDA core -- is actually significantly larger than on higher-end cards. This makes it easier for a computation to stay within a GPU longer.
Since our current research work does not require local disks, NAK nodes are completely diskless; there isn't even a floppy. Disk space is "borrowed" from one of several servers or from another cluster (also in our lab) that has a disk on each node. Similarly, NAK has no video displays on any of its nodes, so that functionality also is obtained via connections to other systems. All of this can be done via relatively "thick" connections behind our lab firewall.
It should be noted that NAK does have an administrative boot server (minimal "head node") on the left rack with a status display, keyboard, and mouse in the middle rack, but we do not rely on performance of that system for much. In fact, the boot server is essentially another NAK node that also happens to have local disks, etc. However, the boot server is connected in a funny place relative to the FNN: using just one NIC, it is two hops from any node's boot NIC.
KASY0's network was built using 100Mb/s Ethernet, and so we had plenty of that hardware. In a 4-NIC FNN configuration, the network is saturating the PCI bus; Gig-E also would be limited by the PCI bus, and a 64-port switch would be expensive with no significant advantage over the FNN. Yeah, it's unconventional, but we proved years ago that FNN topologies are superior, and they still are.
In fact, we've recently revived a little audio demo that plays multi-voice music by randomly assigning nodes to each note and toggling the PC speakers to make the sound. The music sounds very good on NAK. When HAK was first assembled, we used a conventional network design with wider switches... and the music demo didn't sound quite right, even using just 64 nodes connected via two 48-port switches with Gig-E (uplink) connections between the two switches. Basically, the problem is that OpenMPI barriers are used to sync the machines once every 100 notes, but on HAK the network structure introduced enough variability in when each node completed a barrier to make the attack on some notes audibly "blurry." Making the nodes sync every 16 notes reduces the variability enough for HAK to sound good too, but it is really quite impressive how much more consistent the barrier latency is using an FNN with 24-port switches than using a conventional network with just two wider switches connected by a fast uplink.
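That kind of barrier jitter is easy to measure on any cluster: time many MPI barriers and look at the spread, not just the average. Below is a minimal sketch of such a measurement (ours, for illustration -- not the music demo itself).

```c
#include <mpi.h>
#include <stdio.h>

/* Time repeated MPI_Barrier calls and report min/avg/max at rank 0.
 * On a network with consistent barrier latency, max stays close to
 * min; a wide spread is what makes note attacks sound "blurry". */
int main(int argc, char **argv)
{
    const int iters = 1000;
    int rank;
    double min = 1e9, max = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* warm up / align ranks */
    for (int i = 0; i < iters; i++) {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;
        if (t < min) min = t;
        if (t > max) max = t;
        total += t;
    }
    if (rank == 0)
        printf("barrier latency: min %.1f us, avg %.1f us, max %.1f us\n",
               min * 1e6, (total / iters) * 1e6, max * 1e6);

    MPI_Finalize();
    return 0;
}
```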
Elsewhere.
Our machine room has 210A of 120VAC (about 25kW) and a 5-ton air conditioner. NAK needs less than half of that... significantly less than KASY0 needed.
Sort-of. We removed the side-vent fans (most would be blocked by the cases next to them anyway). In KASY0, we then stacked them and put them in the back of the case they came from as a redundant rear exhaust. Many fan deaths later, NAK has just a single rear-mounted fan per node and we have many spares. 7 years after removing them, we're still pondering what to do with the 264 wire thingies that were used to keep folks from sticking their fingers in the fans when they were side-mounted. ;-)
As of May 2010, thanks largely to the GPU temperature monitoring shown on the NAK Status page, we have discovered a little problem with these cases. We have regularly seen as much as a 22°C temperature difference between nodes under identical load! Quite a lot of that variation is due to the GPUs themselves, but most of the coldest machines are ones with the open holes on the end of a rack shelf.... Unfortunately, the vent in the case front has high enough resistance to airflow that the holes in the side of the case actually are a major intake even when another case is pushed up against the side with the holes. KASY0 didn't have temperature monitoring support, and KASY0's physical layout naturally reduced the impact of this misdirection of airflow, so we simply didn't know about this issue.
So, how do we fix this? Well, there isn't a problem with the holes where they are sucking cold air. The prime exception is above the display for NAK: because we didn't seal the back of that open shelf, the holes in the nodes above can suck air from the hot side of the cluster. So, we will probably seal that area and/or tape over the holes on those cases. For the other hot nodes, we found that simply opening some of the disk bay covers significantly reduced temperature. However, we're going to try mounting a fan in the optional front fan mount of the hottest nodes. The GPUs are mounted low in the case, as the front air intake fan will be, so we expect that to create better in-case airflow than if we open the relatively high disk bay covers.
I don't know and I don't care. You might as well ask "couldn't you cover course material faster if you didn't have any students?" A large part of why we build things is to give students the experience and understanding.
Incidentally, we didn't have any pizza this time. It was a variety of nicer goodies from a wholesale club and sandwiches from a sub place. For that matter, I had to keep reminding helpers to take some food. Clearly, they came to be involved in the build, not because we bribed them with food.
I assume you mean this photo, or perhaps you mean this one (which is the same but with a black background).
3 standard wire-frame racks. There are 23 nodes on each of the two end racks. Yes, we know you count 24 boxes on the left rack; that's because the one with the CDROM drive isn't a node, but the boot server and status monitor. The center rack holds only 18 nodes, but it also holds the network switches and a status display for the cluster. The switches are mounted vertically using standard rack-mount hardware tied to the lower rack edge and screwed into a wooden mounting plate on the top shelf. Airflow is from front to back for all the nodes, so you're looking at the cold side.