References: Performance & Supercomputers
The material on performance and supercomputers is now
merged -- after all, supercomputing is all about performance.
The new slides are here.
From performance analysis, the key ideas are the tabular
breakdown of expected instruction execution counts, CPIs, and
clock period; the concepts of real, user, and system time; the
different types of benchmarks; and Amdahl's law. (A small worked
example of these formulas appears right after the list below.)
Here are a few other interesting things about performance:
-
The most prominent benchmark is HPL (High
Performance Linpack), which solves systems of linear
equations mostly by doing lots of matrix multiplies. This is a
particularly "supercomputer friendly" benchmark because the
performance of communication between PEs becomes less important
as the problem is scaled up, and the benchmark allows scaling
the problem as big as can fit in the machine rather than timing
the same-size problem on all machines. The results are reported
as the FLOPS obtained in running the benchmark (the FLOPS
arithmetic is shown in the worked example after this list). The
Top500 list, which ranks the 500 fastest supercomputers in the
world by this metric, is one everyone watches closely... which
has been good for UK in that machines operated by CCS
historically placed well on it, something fewer than ten US
universities can claim. That said, UK hasn't been on the list
for a while; universities have essentially been priced out as
the cost of Top500 machines has kept increasing.
-
It is worth noting that the machines on the Top 500 list have recently
made a turn toward the huge -- with lots of machines now
having more than a million cores! Although many-core
chips (mostly GPUs) are now common in these machines, we're
still on a plateau where scaling up is largely a matter of money
rather than new technology, and it seems there's always more
money for matters of national pride (i.e., countries fighting
for positions at the top of the list). Additionally, when the
price/performance improvements due to new technologies slow,
budget for big machines in general seems to go up to compensate.
Machine cost is not listed, but has definitely gone up sharply
over the last decade. You should notice that the US has a lot
of machines on the list, and right now we have the
fastest... but we often don't.
-
A nice reference for standard benchmarks is SPEC, the Standard Performance
Evaluation Corporation. The text has always been fond of SPEC,
but it's good to understand that there are many
benchmark suites out there, and how much they really matter to
you depends on how much your application(s) look like them. For
example, the HPL benchmark makes intense use of double-precision
floating-point multiply-add, but doesn't even
count integer operations. Many applications are dominated by the
performance of integer, or even character, processing.
-
Here's another interesting tidbit: the US government has a
variety of metrics that they use for determining if a computer
can be exported to a particular country. For example, President
Obama set the export limit at 3.0 TFLOPS on March 16, 2012. It's
getting hard to control the spread of computing technology given the
dominance of cluster supercomputers built using largely
commodity parts. In August 2022, the Biden administration imposed new restrictions on the sale
of various high-end processors and GPUs to China and Russia,
and those restrictions were tightened in 2023 and 2024. The Trump
administration started imposing additional restrictions in March 2025,
but at this writing it isn't clear what the final policy will be.
A document describing how to measure performance for export
control is A PRACTITIONER'S GUIDE TO ADJUSTED PEAK PERFORMANCE from
the U.S. Department of Commerce Bureau of Industry and Security.
The CAPITALIZATION of that title is theirs... I guess this is
important enough to shout about? ;-)
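Since the formulas above are easy to mangle, here is a minimal
worked sketch in C. All of the instruction counts, CPIs, times,
and problem sizes below are made-up values purely for
illustration, not measurements of any real machine; the point is
just to show how the CPU-time equation, Amdahl's law, and the
FLOPS metric reported by benchmarks like HPL are computed:

    /* Worked examples of the basic performance formulas.
       All numbers are illustrative, not real measurements. */
    #include <stdio.h>

    int main(void)
    {
        /* CPU time = (sum over instruction classes of count*CPI) * clock period */
        double count[3] = { 2.0e9, 1.0e9, 0.5e9 };  /* ALU, load/store, branch */
        double cpi[3]   = { 1.0,   2.0,   1.5   };  /* cycles per instruction  */
        double clock_period = 1.0 / 3.0e9;          /* 3 GHz clock             */
        double cycles = 0.0;
        for (int i = 0; i < 3; i++) cycles += count[i] * cpi[i];
        printf("CPU time = %g s\n", cycles * clock_period);

        /* Amdahl's law: speedup = 1 / ((1-f) + f/s), where f is the fraction
           of execution time that benefits from a speedup factor of s. */
        double f = 0.9, s = 16.0;
        printf("Amdahl speedup = %g\n", 1.0 / ((1.0 - f) + f / s));

        /* FLOPS as reported by benchmarks like HPL: floating-point operations
           performed divided by wall-clock time.  A dense NxN matrix multiply
           does about 2*N^3 floating-point operations. */
        double n = 10000.0, seconds = 25.0;         /* pretend measurement */
        printf("~%g GFLOPS\n", (2.0 * n * n * n) / seconds / 1.0e9);
        return 0;
    }

Note that user+system time (what the UNIX time command reports)
is what the CPU-time equation models, while real (wall-clock)
time also includes waiting on I/O and other processes; FLOPS
numbers are normally computed from wall-clock time.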
Of course, supercomputing might be fundamentally about
larger-scale use of parallel processing, but parallel
processing is really the key to high performance in any modern
computer system. The amount of parallel processing used
inside a typical cell phone exceeds that used in most
supercomputers from less than three decades ago. Looking at
supercomputers gives a glimpse of what's coming to more mundane
systems sooner than you'd expect.
The textbook places emphasis on shared memory multiprocessors
(SMP stuff) and cache coherence issues. We covered a bit of that
here, and coherence basics in memory systems, but it's a small
piece of the whole pie because these systems really don't scale
very large. That said, AMD is working toward 256 cores.
As of Spring 2025, they are shipping 192-core chips.
More generally, you should be aware of SIMD (including GPUs and
the not-so-scalable SWAR/vector models) and MIMD, and also the
terms Cluster, Farm, Warehouse Scale Computer, Grid, and Cloud.
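To make the SWAR idea concrete, here is a minimal sketch (my
example, not taken from the course materials) of SIMD Within A
Register: four 8-bit additions performed at once inside an
ordinary 32-bit word, with masking so that carries cannot cross
lane boundaries:

    /* Minimal SWAR sketch: add four unsigned 8-bit lanes packed into
       32-bit words without letting carries cross lane boundaries. */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    static uint32_t swar_add_u8x4(uint32_t a, uint32_t b)
    {
        /* Add the low 7 bits of each lane, then fold the top bits back in
           with XOR so a carry out of one lane cannot reach its neighbor. */
        uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
        return low ^ ((a ^ b) & 0x80808080u);
    }

    int main(void)
    {
        uint32_t a = 0x01FF7F10u;  /* lanes 0x01, 0xFF, 0x7F, 0x10 */
        uint32_t b = 0x02017F01u;  /* lanes 0x02, 0x01, 0x7F, 0x01 */
        /* prints 0300FE11: lanes 0x03, 0x00 (wrapped), 0xFE, 0x11 */
        printf("%08" PRIX32 "\n", swar_add_u8x4(a, b));
        return 0;
    }

The same masking trick works in 64-bit registers and for other
lane widths; hardware SWAR/vector instruction sets simply give
you much wider registers and handle the lane separation for you.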
In the discussion of interconnection networks, you should be
aware of Latency, Bandwidth, and Bisection Bandwidth, and have
some understanding of network topologies including Direct
Connections (the book calls these "fully connected"), Toroidal
Hyper-Meshes (e.g., Rings, Hypercubes), Trees, Fat Trees, and
Flat Neighborhood Networks (FNNs), as well as Hubs, Switches,
and Routers.
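As a rough illustration (the numbers and formulas here are my
own sketch, not the textbook's), bisection bandwidth is the
bisection width (the minimum number of links cut when splitting
the machine into two equal halves) times the per-link bandwidth,
and it varies enormously across topologies:

    /* Approximate bisection widths for a few common topologies with N
       nodes (N assumed to be an even power of two where that matters).
       Bisection bandwidth = bisection width * per-link bandwidth. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double N = 1024.0;            /* number of nodes (example value) */
        double link_gbps = 100.0;     /* per-link bandwidth (example)    */

        double ring      = 2.0;              /* halving a ring cuts 2 links   */
        double mesh2d    = sqrt(N);          /* 2D mesh: cut along one axis   */
        double hypercube = N / 2.0;          /* log2(N)-dimensional hypercube */
        double full      = (N / 2.0) * (N / 2.0); /* direct (fully connected) */

        printf("ring:            %6g links  %10g Gb/s\n", ring, ring * link_gbps);
        printf("2D mesh:         %6g links  %10g Gb/s\n", mesh2d, mesh2d * link_gbps);
        printf("hypercube:       %6g links  %10g Gb/s\n", hypercube, hypercube * link_gbps);
        printf("fully connected: %6g links  %10g Gb/s\n", full, full * link_gbps);
        return 0;
    }

At N = 1024 the spread is huge, which is exactly the tension that
fat trees and FNNs try to resolve: enough bisection bandwidth for
real workloads without the unaffordable link count of a direct
(fully connected) network.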
The concept of quantum computing as a form of parallel processing
without using parallel hardware was also very briefly introduced.
You will find a lot of information about high-end parallel
processing at aggregate.org. Professor Dietz and the University
of Kentucky have long been leaders in this field, so Dietz has
written quite a few documents that explain all aspects of this
technology. One good, but very old, overview is the Linux
Documentation Project's Parallel Processing HOWTO; a particularly good overview of
network topologies appears in this paper
describing FNNs.
A quick summary of what things look like in Fall 2025:
-
As of the November 2025 Top500 list, the AMD-based El Capitan is
still in the top position. Basically, Intel and NVIDIA are
quickly becoming second choices to AMD. Why? One reason is that
Intel has moved to a mix of fast and slow cores, while AMD's
Epyc and Threadripper processors have all fast cores, which
makes load balancing much easier. As for NVIDIA, well, they've
been unfriendly to the supercomputing market in a variety of
ways, beginning with their driver EULA disallowing "cheap" cards
from being used in clusters, and more recently they decided to
stop supporting 64-bit floats (which are the norm for
supercomputing and are well supported by AMD GPUs).
-
There is now a strongly visible segmentation of the
high-performance computing market along the lines of
supercomputing vs. AI training. This kind of fragmentation
didn't really happen for other applications, and it has become
increasingly obvious since 2024. AI training is far less
demanding in terms
of interprocessor communication and float accuracy. NVIDIA is
actually pushing "4-bit floating point" arithmetic! The
dominance of non-portable libraries written in CUDA has also
kept NVIDIA in the leading position for AI, although libraries
are now getting ported...
-
Nearly all desktop/laptop processors are pipelined, superscalar,
SWAR implementations with 2-64 cores per processor chip.
Intel's Xeon Phi processors were the first with up to about 60
cores per chip and 512-bit SWAR, but they were discontinued a
while back. However, Intel has had a bit of a struggle with
their fabs, and the Trump administration has not been
consistently helping the industry. Meanwhile, AMD seems to be
doing well, leading the high-end market with 192-core
processors. Of
course, ARM continues to be very popular because it is sold as
IP that you can use to build your own custom chips (as Apple now
does). However, for 32-bit cores, ARM is starting to lose out to
RISC-V, which is vaguely
comparable but has free IP.
-
Nearly all supercomputers are clusters (although many are now
pre-packaged enough to be marketed as massively parallel
computers) and, since Fall 2017, virtually all 500 of the Top500 supercomputers are Linux
systems using PC cores. The catch is that the huge scales
make denser packaging more appropriate, and AMD has been leading
the way to huge core counts per processor by packaging multiple
dies together on a single substrate. In truth, connecting
multiple dies together seems to be the way to keep Moore's law
going for a while longer. The network hardware for huge-scale
systems is also getting exotic, and fiber is becoming more
common as growing physical machine size increases communication
delays.
-
The slow transition to integrating GPUs on the processor chip
continues, and GPUs are everywhere. The integrated GPUs are also
becoming more capable for non-graphical computation.
-
Another developing trend is the use of multi-chip modules,
stacking, and other technologies that allow circuit density to
keep increasing even if Moore's law predictions are not being
met. There's some particularly cool stacking of chips in AMD's
3D V-Cache and Hybrid Bond 3D packaging. There's also Cerebras
making 1.2-trillion-transistor wafer-scale chips!
These technologies are also why IEEE/ACM SC has been dominated
by high-performance cooling system technologies since 2022. A
weird related trend has been the idea of using Crypto mining as
a "shock absorber" to obtain some profit while burning power
plant output that was not needed at that moment, because many
power generation methods don't throttle easily.
-
Clouds are a very popular way to handle applications that need
lots of memory/storage, but not so much processing resource;
there is a particularly strong push for software as a
service with cloud subscriptions rather than software
purchases. I would say that most classical "mainframe" computing
applications are now outsourced to run on clouds, which does
have some scary implications in terms of lots of unrelated
things sharing a single failure point.
-
IoT (Internet of Things), the
idea that everything should be connected, continues to develop,
with various societal issues ranging from simple privacy and
ownership rights issues to potentially life-threatening things
like "car hacking."
-
Quantum computing has become a very intense research focus, but
it still isn't clear it will ever be practical. In the past few
years, there's been a lot of negative reaction to overblown
claims made earlier. However, even if it has a low probability
of success, the potential payoff is huge, so it is worth
seriously investigating. I have argued that quantum-inspired
computing technologies have already been a big win. In any case,
SC23 saw a variety of new qubit technologies being developed --
quite different from the formerly leading qubit technologies
like IBM's superconducting transmons. SC24 had a bit more
mainstream quantum computing emphasis, but it is still a tiny
part of SC overall. The quantum presence at SC25 was strong, but
there seems to be increasing disagreement on how quantum systems
should be built and used.
One last note:
Tesla's Full Self Driving Chip is a great example of
supercomputing moving into mass-market devices.