References: Performance & Supercomputers
This semester the material on performance and supercomputers is
merged -- after all, supercomputing is all about performance.
The new slides are here.
From performance analysis, the key ideas are really the tabular
breakdown of expected instruction execution counts, CPIs, and
clock period; the concepts of real, user, and system time; the
different types of benchmarks; and Amdahl's law (a small worked
sketch of the CPU-time and Amdahl's-law formulas appears right
after the list below). Here are a few other interesting things
about performance:
-
The most prominent benchmark is HPL (High
Performance Linpack), which solves systems of linear
equations mostly by doing lots of matrix multiplies. This is a
particularly "supercomputer friendly" benchmark because
performance of communication between PEs becomes less important
as the problem is scaled up, and the benchmark allows scaling
the problem as big as can fit in the machine rather than timing
the same-size problem on all machines. The results are reported
as the FLOPS obtained in running the benchmark (a small sketch of
that FLOPS computation also appears after this list). The Top 500
list, which ranks the world's supercomputers by this metric, is
one everyone watches closely... which has been good for UK in
that machines operated by CCS have historically placed well on
it, something fewer than ten US universities can claim. That
said, UK is not on the most recent list.
-
It is worth noting that the machines on the Top 500 list have recently
made a turn toward the huge -- with lots of machines now
having more than a million cores! Although many-core
chips (mostly GPUs) are now common in these machines, we're
still on a plateau where scaling up is largely a matter of money
rather than new technology, and it seems there's always more
money for matters of national pride (i.e., countries fighting
for positions at the top of the list). Additionally, when the
price/performance improvements due to new technologies slow,
budget for big machines in general seems to go up to compensate.
Machine cost is not listed, but has definitely gone up sharply
over the last decade. You should notice that the US has a lot
of machines on the list, but hasn't always had the fastest...
although right now Frontier does put the US on top.
-
A nice reference for standard benchmarks is SPEC, the Standard Performance
Evaluation Corporation. The text has always been fond of SPEC,
but it's good to understand that there are many
benchmark suites out there, and how much they really matter to
you depends on how much your application(s) look like them. For
example, the HPL benchmark makes very intensive use of
double-precision floating-point multiply-add, but doesn't even
count integer operations. Many applications are dominated by the
performance of integer, or even character, processing.
-
Here's another interesting tidbit: the US government has a
variety of metrics that they use for determining if a computer
can be exported to a particular country. For example, President
Obama set the export limit as 3.0 TFLOPS on March 16, 2012. It's
getting hard to control spread of computing technology given the
dominance of cluster supercomputers built using largely
commodity parts. In August 2022, the Biden administration imposed new restrictions on the sale
of various high-end processors and GPUs to China and Russia,
and it seems that will be followed by export restrictions on the
technologies and equipment used to make high-end chips. A
document describing how to measure performance for export
control is A PRACTITIONER'S GUIDE TO ADJUSTED PEAK PERFORMANCE from
the U.S. Department of Commerce Bureau of Industry and Security.
The CAPITALIZATION of that title is theirs... I guess this is
important enough to shout about? ;-)
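As promised above, here is a minimal C sketch of the two formulas from
the performance-analysis material: CPU time computed from a tabular
instruction mix (per-class counts and CPIs times the clock period),
and Amdahl's-law speedup. The instruction classes, counts, CPIs, and
clock rate in main() are made-up numbers purely for illustration, not
taken from any real benchmark.

  #include <stdio.h>

  /* CPU time = (sum over instruction classes of count * CPI) * clock period */
  double cpu_time(const long long count[], const double cpi[],
                  int classes, double clock_ns) {
      double cycles = 0.0;
      for (int i = 0; i < classes; i++)
          cycles += (double)count[i] * cpi[i];
      return cycles * clock_ns * 1e-9;   /* nanoseconds -> seconds */
  }

  /* Amdahl's law: speedup = 1 / ((1 - f) + f / s),
     where f is the fraction of run time that is sped up by factor s */
  double amdahl(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main(void) {
      /* made-up instruction mix: ALU, load/store, branch */
      long long count[] = { 500000000LL, 300000000LL, 200000000LL };
      double    cpi[]   = { 1.0,         2.0,         1.5 };
      printf("CPU time: %g s\n",
             cpu_time(count, cpi, 3, 0.5));   /* 0.5 ns clock = 2 GHz */
      /* e.g., 90% of the work parallelized across 64 PEs */
      printf("Amdahl speedup: %g\n", amdahl(0.9, 64.0));
      return 0;
  }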
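And here is the FLOPS sketch promised in the HPL item above. HPL
conventionally credits a run on an N-by-N system with roughly
(2/3)N^3 + 2N^2 floating-point operations, so the reported rate is
just that nominal count divided by the wall-clock time; the problem
size and time below are invented example values.

  #include <stdio.h>

  /* Nominal HPL operation count for solving an N x N dense linear system:
     roughly (2/3)N^3 for the LU factorization plus 2N^2 for the solves. */
  double hpl_flop_count(double n) {
      return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
  }

  int main(void) {
      double n = 100000.0;      /* made-up problem size */
      double seconds = 3600.0;  /* made-up wall-clock time */
      printf("Reported rate: %g GFLOPS\n",
             hpl_flop_count(n) / seconds / 1e9);
      return 0;
  }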
Of course, supercomputing might be fundamentally about
larger-scale use of parallel processing, but parallel
processing is really the key to high performance in any modern
computer system. The amount of parallel processing used
inside a typical cell phone exceeds that used in most
supercomputers less than three decades ago. Looking at
supercomputers gives a glimpse of what's coming to more mundane
systems sooner than you'd expect.
The textbook places emphasis on shared memory multiprocessors
(SMP stuff) and cache coherence issues. We covered a bit of that
here, and coherence basics in memory systems, but it's a small
piece of the whole pie because these systems really don't scale
very large. That said, AMD is pushing 256 cores.
More generally, you should be aware of SIMD (including GPUs and
the not-so-scalable SWAR/vector models) and MIMD, and also the
terms Cluster, Farm, Warehouse Scale Computer, Grid, and Cloud.
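To make the SWAR idea concrete, here is a minimal C sketch (mine, not
from the slides) of the classic trick of treating one 64-bit register
as eight independent byte lanes; swar_add8 is just an illustrative
name I made up.

  #include <stdint.h>
  #include <stdio.h>

  /* SWAR: add eight unsigned bytes at once inside one 64-bit word.
     Clearing the high bit of every byte keeps carries from spilling
     into the neighboring lane; the high bits are patched back with XOR. */
  uint64_t swar_add8(uint64_t a, uint64_t b) {
      const uint64_t H = 0x8080808080808080ULL;  /* high bit of each byte */
      uint64_t sum = (a & ~H) + (b & ~H);        /* lane-safe add of low 7 bits */
      return sum ^ ((a ^ b) & H);                /* restore high bits (mod 256) */
  }

  int main(void) {
      uint64_t a = 0x0102030405060708ULL;
      uint64_t b = 0x10F0100010001000ULL;
      printf("%016llx\n",
             (unsigned long long)swar_add8(a, b));  /* 11f2130415061708 */
      return 0;
  }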
In the discussion of interconnection networks, you should be
aware of Latency, Bandwidth, and Bisection Bandwidth, as well as
some understanding of network topologies including Direct
Connections (the book calls these "fully connected"), Toroidal
Hyper-Meshes (e.g., Rings, Hypercubes), Trees, Fat Trees, and
Flat Neighborhood Networks (FNNs), as well as the hardware building
blocks: Hubs, Switches, and Routers.
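For a concrete feel for those network metrics, here is a minimal C
sketch of the usual first-order model: a message costs
(latency + bytes/bandwidth) seconds, and bisection bandwidth is the
number of links crossing a worst-case cut times the per-link
bandwidth. The 64-node ring vs. hypercube comparison and the 25 GB/s
link figure are made-up textbook-style numbers, not measurements of
any real machine.

  #include <stdio.h>

  /* First-order message cost: startup latency plus bytes / bandwidth. */
  double msg_time(double latency_s, double bytes, double bw_bytes_per_s) {
      return latency_s + bytes / bw_bytes_per_s;
  }

  int main(void) {
      double bw = 25e9;  /* assumed 25 GB/s per link */
      printf("1 MB message: %g s\n", msg_time(2e-6, 1e6, bw));

      /* Bisection bandwidth = links cut by a worst-case bisection * link bw.
         Bisecting a 64-node ring cuts 2 links; bisecting a 64-node
         hypercube (a 6-cube) cuts 64/2 = 32 links. */
      printf("ring bisection bandwidth:      %g GB/s\n", 2  * bw / 1e9);
      printf("hypercube bisection bandwidth: %g GB/s\n", 32 * bw / 1e9);
      return 0;
  }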
The concept of quantum computing as a form of parallel processing
without using parallel hardware was also very briefly introduced.
You will find a lot of information about high-end parallel
processing at aggregate.org. Professor Dietz and the University
of Kentucky have long been leaders in this field, so Dietz has
written quite a few documents that explain all aspects of this
technology. One good, but very old, overview is the Linux
Documentation Project's Parallel Processing HOWTO; a particularly good overview of
network topologies appears in this paper
describing FNNs.
A quick summary of what things look like in Spring 2024:
-
The photo at the top of this page is now slightly out-of-date.
As of the November 2023 Top500 list, Frontier is still the
fastest supercomputer and the only one passing the 1 EFLOPS
mark, but the exact core count went down slightly and the
performance went up slightly to 1.19 EFLOPS.
-
Nearly all desktop/laptop processors are pipelined, superscalar,
SWAR implementations with 2-64 cores on each processor chip.
Intel's Xeon Phi processors were the first with up to about 60
cores per chip and 512-bit SWAR, but they were discontinued a
while back. However, as Intel has had a bit of a struggle with
their fabs, AMD is strongly back in the game and, with the
possibility of 256 cores coming soon (96 shipping in Zen 4), is
arguably leading over Intel. Meanwhile, ARM continues to be very
popular because it is sold as IP that you can use to build your
own custom chips.
-
Nearly all supercomputers are clusters (although many are now
pre-packaged enough to be marketed as massively parallel
computers) and, since Fall 2017, virtually all 500 of the Top500 supercomputers are Linux
systems using PC processors; also note that a USA
machine, Frontier, now tops the list and is the first over 1
Exaflop/s.
-
GPUs are appearing everywhere (although the HW/SW technology for
them is still evolving), and now commonly include very fast
support for the bfloat16 computations used in neural
network AI models (a small sketch of the bfloat16 format appears
just after this list). NVidia GPUs have come to dominate the
high-performance computing market, but AMD GPUs are serious
contenders, with lower cost but limited by things like the fact
that NVidia's CUDA remains the dominant programming environment
for general-purpose GPU computing. What's the fastest GPU?
How about the
GeForce RTX 3090Ti,
which claims "78 RT-TFLOPs, 40 Shader-TFLOPs and 320 Tensor-TFLOPs"
peak performance. Yes, that single card is claiming around 50X
the performance of the 128-node KASY0 cluster from 2003!
-
The slow transition to integrating GPUs on the processor chip
continues, as does the transition from IA32/AMD64 to ARM64.
GPUs are in most Top500 machines; but ARM64 processors, which
first surfaced on the Top500 list several years ago, are not
common. Still, there are new players like Ampere making the 128-core Altra Max -- oh, and guess what
processor is inside Apple's new chips: yup, ARMs. We're also
starting to see RISC-V cores
instead of ARMs in various systems, because you don't even have
to pay to use that IP core...
-
Another developing trend is the use of multi-chip modules,
stacking, and other technologies that allow circuit density to
keep increasing even if Moore's law predictions are not being
met. There's some particularly cool stacking of chips explained
in the video about AMD 3D
V-Cache and Hybrid Bond 3D. There's also Cerebras making 1.2-trillion-transistor wafer-scale chips!
These technologies are also why SC22 was dominated by
high-performance cooling system technologies... for the first
time, literally more plumbing was on display than computer
circuitry, and that also was true at SC23.
-
Clouds are a very popular way to handle applications that need
lots of memory/storage, but not so much processing resource;
there is a particularly strong push for software as a
service with cloud subscriptions rather than software
purchases.
-
IoT (Internet of Things), the
idea that everything should be connected, continues to develop,
with societal issues ranging from simple privacy and
ownership-rights questions to potentially life-threatening things
like "car hacking".
-
Quantum computing has become a very intense research focus, but
it still isn't clear it will ever be practical. In the past few
years, there's been a lot of negative reaction to overblown
claims made earlier. However, even if it has a low probability
of success, the potential payoff is huge, so it is worth
seriously investigating. I have argued that quantum-inspired
computing technologies have already been a big win. In any case,
SC23 saw a variety of new qubit technologies being developed --
quite different from the formerly leading qubit technologies
like IBM's superconducting transmons.
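Since bfloat16 shows up in the GPU item above, here is a minimal C
sketch of what the format actually is: the top 16 bits of an
IEEE-754 single-precision value (same 8-bit exponent, only 7
mantissa bits), which is why converting to it is so cheap in
hardware. The truncating conversion below is just the simplest
choice; real hardware often rounds instead.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* bfloat16 is the top 16 bits of an IEEE-754 binary32 value:
     1 sign bit, the same 8-bit exponent, and a 7-bit mantissa.
     This version simply truncates. */
  uint16_t float_to_bf16(float f) {
      uint32_t bits;
      memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits safely */
      return (uint16_t)(bits >> 16);
  }

  float bf16_to_float(uint16_t h) {
      uint32_t bits = (uint32_t)h << 16;  /* lost mantissa bits become zero */
      float f;
      memcpy(&f, &bits, sizeof f);
      return f;
  }

  int main(void) {
      float x = 3.14159265f;
      uint16_t h = float_to_bf16(x);
      printf("%f -> 0x%04x -> %f\n", x, (unsigned)h, bf16_to_float(h));
      return 0;
  }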
One last note:
Tesla's Full Self Driving Chip is a great example of supercomputing moving into mass-market devices.