References: Performance & Supercomputers
This semester the material on performance and supercomputers is
merged -- after all, supercomputing is all about performance.
The new slides are here.
From performance analysis, the key ideas are really the tabular
breakdown of expected instruction execution counts, CPIs, and
clock period; the concepts of real, user, and system time; the
different types of benchmarks; and Amdahl's law (a small worked
sketch of the CPU-time and Amdahl's-law formulas appears right
after the list below). Here are a few other interesting things
about performance:
-
The most prominent benchmark is HPL (High
Performance Linpack), which solves systems of linear
equations mostly by doing lots of matrix multiplies. This is a
particularly "supercomputer friendly" benchmark because
performance of communication between PEs becomes less important
as the problem is scaled up, and the benchmark allows scaling
the problem as big as can fit in the machine rather than timing
the same-size problem on all machines. The results are reported
as the FLOPS obtained in running the benchmark (a small sketch of
that FLOPS computation also appears after this list). The Top 500
list, which ranks the world's supercomputers by this metric, is
one everyone watches closely... which has been good for UK in
that machines operated by CCS have historically placed well on
it, something fewer than ten US universities can claim. That
said, UK is not on the most recent list.
-
It is worth noting that the machines on the Top 500 list have recently
made a turn toward the huge -- with lots of machines now
having more than a million cores! Although many-core
chips (mostly GPUs) are now common in these machines, we're
still on a plateau where scaling up is largely a matter of money
rather than new technology, and it seems there's always more
money for matters of national pride (i.e., countries fighting
for positions at the top of the list). Additionally, when the
price/performance improvements due to new technologies slow,
budget for big machines in general seems to go up to compensate.
Machine cost is not listed, but has definitely gone up sharply
over the last decade. You should notice that the US has a lot
of machines on the list, but hasn't always had the fastest...
although right now Frontier does put the US on top.
-
A nice reference for standard benchmarks is SPEC, the Standard Performance
Evaluation Corporation. The text has always been fond of SPEC,
but it's good to understand that there are many
benchmark suites out there, and how much they really matter to
you depends on how much your application(s) look like them. For
example, the HPL benchmark makes very intensive use of
double-precision floating-point multiply-add, but doesn't even
count integer operations. Many applications are dominated by the
performance of integer, or even character, processing.
-
Here's another interesting tidbit: the US government has a
variety of metrics that they use for determining if a computer
can be exported to a particular country. For example, President
Obama set the export limit as 3.0 TFLOPS on March 16, 2012. It's
getting hard to control spread of computing technology given the
dominance of cluster supercomputers built using largely
commodity parts. In August 2022, the Biden administration imposed new restrictions on the sale
of various high-end processors and GPUs to China and Russia,
and it seems that will be followed by export restrictions on the
technologies and equipment used to make high-end chips. A
document describing how to measure performance for export
control is A PRACTITIONER'S GUIDE TO ADJUSTED PEAK PERFORMANCE from
the U.S. Department of Commerce Bureau of Industry and Security.
The CAPITALIZATION of that title is theirs... I guess this is
important enough to shout about? ;-)
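As promised above, here is a minimal C sketch of the two formulas from
the performance-analysis material: CPU time computed from a tabular
instruction mix (per-class counts and CPIs times the clock period),
and Amdahl's-law speedup. The instruction classes, counts, CPIs, and
clock rate in main() are made-up numbers purely for illustration, not
taken from any real benchmark.

  #include <stdio.h>

  /* CPU time = (sum over instruction classes of count * CPI) * clock period */
  double cpu_time(const long long count[], const double cpi[],
                  int classes, double clock_ns) {
      double cycles = 0.0;
      for (int i = 0; i < classes; i++)
          cycles += (double)count[i] * cpi[i];
      return cycles * clock_ns * 1e-9;   /* nanoseconds -> seconds */
  }

  /* Amdahl's law: speedup = 1 / ((1 - f) + f / s),
     where f is the fraction of run time that is sped up by factor s */
  double amdahl(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main(void) {
      /* made-up instruction mix: ALU, load/store, branch */
      long long count[] = { 500000000LL, 300000000LL, 200000000LL };
      double    cpi[]   = { 1.0,         2.0,         1.5 };
      printf("CPU time: %g s\n",
             cpu_time(count, cpi, 3, 0.5));   /* 0.5 ns clock = 2 GHz */
      /* e.g., 90% of the work parallelized across 64 PEs */
      printf("Amdahl speedup: %g\n", amdahl(0.9, 64.0));
      return 0;
  }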
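And here is the FLOPS sketch promised in the HPL item above. HPL
conventionally credits a run on an N-by-N system with roughly
(2/3)N^3 + 2N^2 floating-point operations, so the reported rate is
just that nominal count divided by the wall-clock time; the problem
size and time below are invented example values.

  #include <stdio.h>

  /* Nominal HPL operation count for solving an N x N dense linear system:
     roughly (2/3)N^3 for the LU factorization plus 2N^2 for the solves. */
  double hpl_flop_count(double n) {
      return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
  }

  int main(void) {
      double n = 100000.0;      /* made-up problem size */
      double seconds = 3600.0;  /* made-up wall-clock time */
      printf("Reported rate: %g GFLOPS\n",
             hpl_flop_count(n) / seconds / 1e9);
      return 0;
  }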
Of course, supercomputing might be fundamentally about
larger-scale use of parallel processing, but parallel
processing is really the key to high performance in any modern
computer system. The amount of parallel processing used
inside a typical cell phone exceeds that used in most
supercomputers less than three decades ago. Looking at
supercomputers gives a glimpse of what's coming to more mundane
systems sooner than you'd expect.
The textbook places emphasis on shared memory multiprocessors
(SMP stuff) and cache coherence issues. We covered a bit of that
here, and coherence basics in memory systems, but it's a small
piece of the whole pie because these systems really don't scale
very large. That said, AMD is pushing 256 cores.
More generally, you should be aware of SIMD (including GPUs and
the not-so-scalable SWAR/vector models) and MIMD, and also the
terms Cluster, Farm, Warehouse Scale Computer, Grid, and Cloud.
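To make the SWAR idea concrete, here is a minimal C sketch (mine, not
from the slides) of the classic trick of treating one 64-bit register
as eight independent byte lanes; swar_add8 is just an illustrative
name I made up.

  #include <stdint.h>
  #include <stdio.h>

  /* SWAR: add eight unsigned bytes at once inside one 64-bit word.
     Clearing the high bit of every byte keeps carries from spilling
     into the neighboring lane; the high bits are patched back with XOR. */
  uint64_t swar_add8(uint64_t a, uint64_t b) {
      const uint64_t H = 0x8080808080808080ULL;  /* high bit of each byte */
      uint64_t sum = (a & ~H) + (b & ~H);        /* lane-safe add of low 7 bits */
      return sum ^ ((a ^ b) & H);                /* restore high bits (mod 256) */
  }

  int main(void) {
      uint64_t a = 0x0102030405060708ULL;
      uint64_t b = 0x10F0100010001000ULL;
      printf("%016llx\n",
             (unsigned long long)swar_add8(a, b));  /* 11f2130415061708 */
      return 0;
  }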
In the discussion of interconnection networks, you should be
aware of Latency, Bandwidth, and Bisection Bandwidth, as well as
some understanding of network topologies including Direct
Connections (the book calls these "fully connected"), Toroidal
Hyper-Meshes (e.g., Rings, Hypercubes), Trees, Fat Trees, and
Flat Neighborhood Networks (FNNs), as well as the hardware building
blocks: Hubs, Switches, and Routers.
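For a concrete feel for those network metrics, here is a minimal C
sketch of the usual first-order model: a message costs
(latency + bytes/bandwidth) seconds, and bisection bandwidth is the
number of links crossing a worst-case cut times the per-link
bandwidth. The 64-node ring vs. hypercube comparison and the 25 GB/s
link figure are made-up textbook-style numbers, not measurements of
any real machine.

  #include <stdio.h>

  /* First-order message cost: startup latency plus bytes / bandwidth. */
  double msg_time(double latency_s, double bytes, double bw_bytes_per_s) {
      return latency_s + bytes / bw_bytes_per_s;
  }

  int main(void) {
      double bw = 25e9;  /* assumed 25 GB/s per link */
      printf("1 MB message: %g s\n", msg_time(2e-6, 1e6, bw));

      /* Bisection bandwidth = links cut by a worst-case bisection * link bw.
         Bisecting a 64-node ring cuts 2 links; bisecting a 64-node
         hypercube (a 6-cube) cuts 64/2 = 32 links. */
      printf("ring bisection bandwidth:      %g GB/s\n", 2  * bw / 1e9);
      printf("hypercube bisection bandwidth: %g GB/s\n", 32 * bw / 1e9);
      return 0;
  }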
The concept of quantum computing as a form of parallel processing
without using parallel hardware was also very briefly introduced.
You will find a lot of information about high-end parallel
processing at aggregate.org. Professor Dietz and the University
of Kentucky have long been leaders in this field, so Dietz has
written quite a few documents that explain all aspects of this
technology. One good, but very old, overview is the Linux
Documentation Project's Parallel Processing HOWTO; a particularly good overview of
network topologies appears in this paper
describing FNNs.
A quick summary of what things look like in Spring 2024:
-
The photo at the top of this page is now slightly out-of-date.
As of the November 2023 Top500 list, Frontier is still the
fastest supercomputer and the only one passing the 1 EFLOPS
mark, but the exact core count went down slightly and the
performance went up slightly to 1.19 EFLOPS.
-
Nearly all desktop/laptop processors are pipelined, superscalar,
SWAR implementations with 2-64 cores on each processor chip.
Intel's Xeon Phi processors were the first with up to about 60
cores per chip and 512-bit SWAR, but they were discontinued a
while back. However, as Intel has had a bit of a struggle with
their fabs, AMD is strongly back in the game and, with the
possibility of 256 cores coming soon (96 shipping in Zen 4), is
arguably leading over Intel. Meanwhile, ARM continues to be very
popular because it is sold as IP that you can use to build your
own custom chips.
-
Nearly all supercomputers are clusters (although many are now
pre-packaged enough to be marketed as massively parallel
computers) and, since Fall 2017, virtually all 500 of the Top500 supercomputers are Linux
systems using PC processors; also note that a USA
machine, Frontier, now tops the list and is the first over 1
Exaflop/s.
-
GPUs are appearing everywhere (although the HW/SW technology for
them is still evolving), and now commonly include very fast
support for the bfloat16 computations used in neural
network AI models (a small sketch of the bfloat16 format appears
just after this list). NVidia GPUs have come to dominate the
high-performance computing market, but AMD GPUs are serious
contenders, with lower cost but limited by things like the fact
that NVidia's CUDA remains the dominant programming environment
for general-purpose GPU computing. What's the fastest GPU?
How about the
GeForce RTX 3090Ti,
which claims "78 RT-TFLOPs, 40 Shader-TFLOPs and 320 Tensor-TFLOPs"
peak performance. Yes, that single card is claiming around 50X
the performance of the 128-node KASY0 cluster from 2003!
-
The slow transition to integrating GPUs on the processor chip
continues, as does the transition from IA32/AMD64 to ARM64.
GPUs are in most Top500 machines; but ARM64 processors, which
first surfaced on the Top500 list several years ago, are not
common. Still, there are new players like Ampere making the 128-core Altra Max -- oh, and guess what
processor is inside Apple's new chips: yup, ARMs. We're also
starting to see RISC-V cores
instead of ARMs in various systems, because you don't even have
to pay to use that IP core...
-
Another developing trend is the use of multi-chip modules,
stacking, and other technologies that allow circuit density to
keep increasing even if Moore's law predictions are not being
met. There's some particularly cool stacking of chips explained
in the video about AMD 3D
V-Cache and Hybrid Bond 3D. There's also Cerebras making 1.2-trillion-transistor wafer-scale chips!
These technologies are also why SC22 was dominated by
high-performance cooling system technologies... for the first
time, literally more plumbing was on display than computer
circuitry, and that also was true at SC23.
-
Clouds are a very popular way to handle applications that need
lots of memory/storage, but not so much processing resource;
there is a particularly strong push for software as a
service with cloud subscriptions rather than software
purchases.
-
IoT (Internet of Things), the
idea that everything should be connected, continues to develop,
with societal issues ranging from simple privacy and
ownership-rights questions to potentially life-threatening things
like "car hacking".
-
Quantum computing has become a very intense research focus, but
it still isn't clear it will ever be practical. In the past few
years, there's been a lot of negative reaction to overblown
claims made earlier. However, even if it has a low probability
of success, the potential payoff is huge, so it is worth
seriously investigating. I have argued that quantum-inspired
computing technologies have already been a big win. In any case,
SC23 saw a variety of new qubit technologies being developed --
quite different from the formerly leading qubit technologies
like IBM's superconducting transmons.
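Since bfloat16 shows up in the GPU item above, here is a minimal C
sketch of what the format actually is: the top 16 bits of an
IEEE-754 single-precision value (same 8-bit exponent, only 7
mantissa bits), which is why converting to it is so cheap in
hardware. The truncating conversion below is just the simplest
choice; real hardware often rounds instead.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* bfloat16 is the top 16 bits of an IEEE-754 binary32 value:
     1 sign bit, the same 8-bit exponent, and a 7-bit mantissa.
     This version simply truncates. */
  uint16_t float_to_bf16(float f) {
      uint32_t bits;
      memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits safely */
      return (uint16_t)(bits >> 16);
  }

  float bf16_to_float(uint16_t h) {
      uint32_t bits = (uint32_t)h << 16;  /* lost mantissa bits become zero */
      float f;
      memcpy(&f, &bits, sizeof f);
      return f;
  }

  int main(void) {
      float x = 3.14159265f;
      uint16_t h = float_to_bf16(x);
      printf("%f -> 0x%04x -> %f\n", x, (unsigned)h, bf16_to_float(h));
      return 0;
  }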
One last note:
Tesla's Full Self Driving Chip is a great example of supercomputing moving into mass-market devices.