References: Performance & Supercomputers
The material on performance and supercomputers is now
merged -- after all, supercomputing is all about performance.
The new slides are here.
From performance analysis, the key ideas are the tabular
breakdown of expected instruction execution counts, CPIs, and
clock period; the concepts of real, user, and system time; the
different types of benchmarks; and Amdahl's law. (A small worked
example of these formulas appears right after the list below.)
Here are a few other interesting things about performance:
-
The most prominent benchmark is HPL (High
Performance Linpack), which solves systems of linear
equations mostly by doing lots of matrix multiplies. This is a
particularly "supercomputer friendly" benchmark because the
performance of communication between PEs becomes less important
as the problem is scaled up, and the benchmark allows scaling
the problem as big as can fit in the machine rather than timing
the same-size problem on all machines. The results are reported
as the FLOPS obtained in running the benchmark (the FLOPS
arithmetic is shown in the worked example after this list). The
Top500 list, which ranks the 500 fastest supercomputers in the
world by this metric, is one everyone watches closely... which
has been good for UK in that machines operated by CCS
historically placed well on it, something fewer than ten US
universities can claim. That said, UK hasn't been on the list
for a while; universities have essentially been priced out as
the cost of Top500 machines has kept increasing.
-
It is worth noting that the machines on the Top 500 list have recently
made a turn toward the huge -- with lots of machines now
having more than a million cores! Although many-core
chips (mostly GPUs) are now common in these machines, we're
still on a plateau where scaling up is largely a matter of money
rather than new technology, and it seems there's always more
money for matters of national pride (i.e., countries fighting
for positions at the top of the list). Additionally, when the
price/performance improvements due to new technologies slow,
budget for big machines in general seems to go up to compensate.
Machine cost is not listed, but has definitely gone up sharply
over the last decade. You should notice that the US has a lot
of machines on the list, and right now we have the
fastest... but we often don't.
-
A nice reference for standard benchmarks is SPEC, the Standard Performance
Evaluation Corporation. The text has always been fond of SPEC,
but it's good to understand that there are many
benchmark suites out there, and how much they really matter to
you depends on how much your application(s) look like them. For
example, the HPL benchmark makes intense use of double-precision
floating-point multiply-add, but doesn't even
count integer operations. Many applications are dominated by the
performance of integer, or even character, processing.
-
Here's another interesting tidbit: the US government has a
variety of metrics that they use for determining if a computer
can be exported to a particular country. For example, President
Obama set the export limit at 3.0 TFLOPS on March 16, 2012. It's
getting hard to control the spread of computing technology given the
dominance of cluster supercomputers built using largely
commodity parts. In August 2022, the Biden administration imposed new restrictions on the sale
of various high-end processors and GPUs to China and Russia,
and those restrictions were tightened in 2023 and 2024. The Trump
administration started imposing additional restrictions in March 2025,
but at this writing it isn't clear what the final policy will be.
A document describing how to measure performance for export
control is A PRACTITIONER'S GUIDE TO ADJUSTED PEAK PERFORMANCE from
the U.S. Department of Commerce Bureau of Industry and Security.
The CAPITALIZATION of that title is theirs... I guess this is
important enough to shout about? ;-)
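Since the formulas above are easy to mangle, here is a minimal
worked sketch in C. All of the instruction counts, CPIs, times,
and problem sizes below are made-up values purely for
illustration, not measurements of any real machine; the point is
just to show how the CPU-time equation, Amdahl's law, and the
FLOPS metric reported by benchmarks like HPL are computed:

    /* Worked examples of the basic performance formulas.
       All numbers are illustrative, not real measurements. */
    #include <stdio.h>

    int main(void)
    {
        /* CPU time = (sum over instruction classes of count*CPI) * clock period */
        double count[3] = { 2.0e9, 1.0e9, 0.5e9 };  /* ALU, load/store, branch */
        double cpi[3]   = { 1.0,   2.0,   1.5   };  /* cycles per instruction  */
        double clock_period = 1.0 / 3.0e9;          /* 3 GHz clock             */
        double cycles = 0.0;
        for (int i = 0; i < 3; i++) cycles += count[i] * cpi[i];
        printf("CPU time = %g s\n", cycles * clock_period);

        /* Amdahl's law: speedup = 1 / ((1-f) + f/s), where f is the fraction
           of execution time that benefits from a speedup factor of s. */
        double f = 0.9, s = 16.0;
        printf("Amdahl speedup = %g\n", 1.0 / ((1.0 - f) + f / s));

        /* FLOPS as reported by benchmarks like HPL: floating-point operations
           performed divided by wall-clock time.  A dense NxN matrix multiply
           does about 2*N^3 floating-point operations. */
        double n = 10000.0, seconds = 25.0;         /* pretend measurement */
        printf("~%g GFLOPS\n", (2.0 * n * n * n) / seconds / 1.0e9);
        return 0;
    }

Note that user+system time (what the UNIX time command reports)
is what the CPU-time equation models, while real (wall-clock)
time also includes waiting on I/O and other processes; FLOPS
numbers are normally computed from wall-clock time.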
Of course, supercomputing might be fundamentally about
larger-scale use of parallel processing, but parallel
processing is really the key to high performance in any modern
computer system. The amount of parallel processing used
inside a typical cell phone exceeds that used in most
supercomputers from less than three decades ago. Looking at
supercomputers gives a glimpse of what's coming to more mundane
systems sooner than you'd expect.
The textbook places emphasis on shared memory multiprocessors
(SMP stuff) and cache coherence issues. We covered a bit of that
here, and coherence basics in memory systems, but it's a small
piece of the whole pie because these systems really don't scale
very large. That said, AMD is working toward 256 cores.
As of Spring 2025, they are shipping 192-core chips.
More generally, you should be aware of SIMD (including GPUs and
the not-so-scalable SWAR/vector models) and MIMD, and also the
terms Cluster, Farm, Warehouse Scale Computer, Grid, and Cloud.
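To make the SWAR idea concrete, here is a minimal sketch (my
example, not taken from the course materials) of SIMD Within A
Register: four 8-bit additions performed at once inside an
ordinary 32-bit word, with masking so that carries cannot cross
lane boundaries:

    /* Minimal SWAR sketch: add four unsigned 8-bit lanes packed into
       32-bit words without letting carries cross lane boundaries. */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    static uint32_t swar_add_u8x4(uint32_t a, uint32_t b)
    {
        /* Add the low 7 bits of each lane, then fold the top bits back in
           with XOR so a carry out of one lane cannot reach its neighbor. */
        uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
        return low ^ ((a ^ b) & 0x80808080u);
    }

    int main(void)
    {
        uint32_t a = 0x01FF7F10u;  /* lanes 0x01, 0xFF, 0x7F, 0x10 */
        uint32_t b = 0x02017F01u;  /* lanes 0x02, 0x01, 0x7F, 0x01 */
        /* prints 0300FE11: lanes 0x03, 0x00 (wrapped), 0xFE, 0x11 */
        printf("%08" PRIX32 "\n", swar_add_u8x4(a, b));
        return 0;
    }

The same masking trick works in 64-bit registers and for other
lane widths; hardware SWAR/vector instruction sets simply give
you much wider registers and handle the lane separation for you.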
In the discussion of interconnection networks, you should be
aware of Latency, Bandwidth, and Bisection Bandwidth, and have
some understanding of network topologies including Direct
Connections (the book calls these "fully connected"), Toroidal
Hyper-Meshes (e.g., Rings, Hypercubes), Trees, Fat Trees, and
Flat Neighborhood Networks (FNNs), as well as Hubs, Switches,
and Routers.
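As a rough illustration (the numbers and formulas here are my
own sketch, not the textbook's), bisection bandwidth is the
bisection width (the minimum number of links cut when splitting
the machine into two equal halves) times the per-link bandwidth,
and it varies enormously across topologies:

    /* Approximate bisection widths for a few common topologies with N
       nodes (N assumed to be an even power of two where that matters).
       Bisection bandwidth = bisection width * per-link bandwidth. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double N = 1024.0;            /* number of nodes (example value) */
        double link_gbps = 100.0;     /* per-link bandwidth (example)    */

        double ring      = 2.0;              /* halving a ring cuts 2 links   */
        double mesh2d    = sqrt(N);          /* 2D mesh: cut along one axis   */
        double hypercube = N / 2.0;          /* log2(N)-dimensional hypercube */
        double full      = (N / 2.0) * (N / 2.0); /* direct (fully connected) */

        printf("ring:            %6g links  %10g Gb/s\n", ring, ring * link_gbps);
        printf("2D mesh:         %6g links  %10g Gb/s\n", mesh2d, mesh2d * link_gbps);
        printf("hypercube:       %6g links  %10g Gb/s\n", hypercube, hypercube * link_gbps);
        printf("fully connected: %6g links  %10g Gb/s\n", full, full * link_gbps);
        return 0;
    }

At N = 1024 the spread is huge, which is exactly the tension that
fat trees and FNNs try to resolve: enough bisection bandwidth for
real workloads without the unaffordable link count of a direct
(fully connected) network.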
The concept of quantum computing as a form of parallel processing
without using parallel hardware was also very briefly introduced.
You will find a lot of information about high-end parallel
processing at aggregate.org. Professor Dietz and the University
of Kentucky have long been leaders in this field, so Dietz has
written quite a few documents that explain all aspects of this
technology. One good, but very old, overview is the Linux
Documentation Project's Parallel Processing HOWTO; a particularly good overview of
network topologies appears in this paper
describing FNNs.
A quick summary of what things look like in Fall 2025:
-
As of the November 2025 Top500 list, the AMD-based El Capitan is
still in the top position. Basically, Intel and NVIDIA are
quickly becoming second choices to AMD. Why? One reason is that
Intel has moved to a mix of fast and slow cores, while AMD's
Epyc and Threadripper processors have all fast cores, which
makes load balancing much easier. As for NVIDIA, well, they've
been unfriendly to the supercomputing market in a variety of
ways, beginning with their driver EULA disallowing "cheap" cards
from being used in clusters, and more recently they decided to
stop supporting 64-bit floats (which are the norm for
supercomputing and are well supported by AMD GPUs).
-
There is now a strongly visible segmentation of the
high-performance computing market along the lines of
supercomputing vs. AI training. This kind of fragmentation
didn't really happen for other applications, and it has become
increasingly obvious since 2024. AI training is far less
demanding in terms
of interprocessor communication and float accuracy. NVIDIA is
actually pushing "4-bit floating point" arithmetic! The
dominance of non-portable libraries written in CUDA has also
kept NVIDIA in the leading position for AI, although libraries
are now getting ported...
-
Nearly all desktop/laptop processors are pipelined, superscalar,
SWAR implementations with 2-64 cores per processor chip.
Intel's Xeon Phi processors were the first with up to about 60
cores per chip and 512-bit SWAR, but they were discontinued a
while back. However, Intel has had a bit of a struggle with
their fabs, and the Trump administration has not been
consistently helping the industry. Meanwhile, AMD seems to be
doing well, leading the high-end market with 192-core
processors. Of
course, ARM continues to be very popular because it is sold as
IP that you can use to build your own custom chips (as Apple now
does). However, for 32-bit cores, ARM is starting to lose out to
RISC-V, which is vaguely
comparable but has free IP.
-
Nearly all supercomputers are clusters (although many are now
pre-packaged enough to be marketed as massively parallel
computers) and, since Fall 2017, virtually all 500 of the Top500 supercomputers are Linux
systems using PC cores. The catch is that the huge scales
make denser packaging more appropriate, and AMD has been leading
the way to huge core counts per processor by packaging multiple
dies together on a single substrate. In truth, connecting
multiple dies together seems to be the way to keep Moore's law
going for a while longer. The network hardware for huge-scale
systems is also getting exotic, and fiber is becoming more
common as growing physical machine size increases communication
delays.
-
The slow transition to integrating GPUs on the processor chip
continues, and GPUs are everywhere. The integrated GPUs are also
becoming more capable for non-graphical computation.
-
Another developing trend is the use of multi-chip modules,
stacking, and other technologies that allow circuit density to
keep increasing even if Moore's law predictions are not being
met. There's some particularly cool stacking of chips in AMD's
3D V-Cache and Hybrid Bond 3D packaging. There's also Cerebras
making 1.2-trillion-transistor wafer-scale chips!
These technologies are also why IEEE/ACM SC has been dominated
by high-performance cooling system technologies since 2022. A
weird related trend has been the idea of using Crypto mining as
a "shock absorber" to obtain some profit while burning power
plant output that was not needed at that moment, because many
power generation methods don't throttle easily.
-
Clouds are a very popular way to handle applications that need
lots of memory/storage, but not so much processing resource;
there is a particularly strong push for software as a
service with cloud subscriptions rather than software
purchases. I would say that most classical "mainframe" computing
applications are now outsourced to run on clouds, which does
have some scary implications in terms of lots of unrelated
things sharing a single failure point.
-
IoT (Internet of Things), the
idea that everything should be connected, continues to develop,
with various societal issues ranging from simple privacy and
ownership rights issues to potentially life-threatening things
like "car hacking."
-
Quantum computing has become a very intense research focus, but
it still isn't clear it will ever be practical. In the past few
years, there's been a lot of negative reaction to overblown
claims made earlier. However, even if it has a low probability
of success, the potential payoff is huge, so it is worth
seriously investigating. I have argued that quantum-inspired
computing technologies have already been a big win. In any case,
SC23 saw a variety of new qubit technologies being developed --
quite different from the formerly leading qubit technologies
like IBM's superconducting transmons. SC24 had a bit more
mainstream quantum computing emphasis, but it is still a tiny
part of SC overall. The quantum presence at SC25 was strong, but
there seems to be increasing disagreement on how quantum systems
should be built and used.
One last note:
Tesla's Full Self Driving Chip is a great example of
supercomputing moving into mass-market devices.