References for EE599/699 GPU & Multi-Core Computing
All materials posted here are for personal use only.
Material will be added incrementally throughout the Fall 2020 semester.
Basic MIMD Architecture & Concepts
A little about historically how this has evolved...
-
Repetition Filter Memory in CHoPP
-
A. Klappholz, "Improved design for a stochastically conflict-free
memory/interconnection system," in Conf. Rec. 14th Asilomar Conf. on
Circuits, Systems and Computers, Pacific Grove, CA, USA, 1981, pp. 443-448.
(still looking for a copy of this or another relevant article...)
-
Fetch-&-Add in the NYU Ultracomputer
-
A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, and
M. Snir, "The NYU Ultracomputer -- Designing an MIMD Shared Memory Parallel
Computer," in IEEE Transactions on Computers, vol. 32, no. 2, pp. 175-189,
1983. doi: 10.1109/TC.1983.1676201 (URL,
local copy)
-
"An Overview of the NYU Ultracomputer Project (1986)"
(PDF) is a better, but more obscure, reference
-
Explanation of the "Hot Spot" problem for RP3
-
G. F. Pfister and V. A. Norton, "'Hot spot' contention
and combining in multistage interconnection networks," in IEEE Transactions on
Computers, vol. C-34, no. 10, pp. 943-948, Oct. 1985.
(URL, local copy)
-
Memory consistency models
-
"Shared Memory Consistency Models: A Tutorial"
(PDF) -- Sarita Adve has done quite a few versions of this
sort of description
-
Modern atomic memory access instructions
-
AMD64 atomic instructions
-
Futexes
-
Many short, yet still confusing, descriptions of Futexes are
available and here's probably the best early overview (PDF); the
catch is that various Linux kernels have different
futex() implementations with 4, 5, or 6 arguments
-
Transactional memory
-
Transactional Memory has been a hot idea for quite a while.
Intel's Haswell processors incorporate a hardware implementation
described in chapter 8 of this
PDF (locally, PDF); but there were (still are) problems.
-
Wikipedia has a nice summary of software support for transactional memory.
-
There is a version of software transactional memory implemented in GCC.
-
Replicated/Distributed Shared Memory
-
A very odd one is implemented in AFAPI as Replicated Shared Memory
-
The best known is Treadmarks, out of Rice University
-
One of the latest is DEX: Scaling Applications Beyond Machine Boundaries, which is part of
Popcorn Linux
Shared Memory Programming
From very low level to very high level....
-
Mutex (exclusive lock) vs. Semaphore (signaling mechanism)
-
Don't yet have a great reference for this, but they're everywhere.
Basic Mutex operations are lock(m) and unlock(m),
with many implementations.
Basic Semaphore operations are classically called P and V (wait and signal).
The simplest (busy-waiting, and not actually atomic as written) counting
semaphore would be something like
void p(volatile int *s) { while (*s <= 0) ; --*s; }
and void v(volatile int *s) { ++*s; }.
-
Barrier synchronization
-
There are various atomic counter algorithms; alternatively, here is GPU
SyncBlocks algorithm from my Magic Algorithms page
-
That's basically the same as used in The Aggregate Function API: It's Not Just For PAPERS Anymore
-
Direct use of System V shared memory
-
My System V shared memory version of the Pi computation is shmpi.c -- note that
this version uses raw assembly code to implement a lock, which
has far less overhead than using the System V OS calls (unless
you're counting on the OS to schedule based on who's waiting for
what)
-
POSIX Threads
-
POSIX Threads (pthreads) is now a standard library included in
most C/C++ compilation environments, and linked as the -lpthread
library under Linux GCC; my Pi computation example for pthreads is pthreadspi.c
-
OpenMP (aka, OMP)
-
Here is a nice overview intro to OpenMP/OMP as slides (PDF). OMP pragmas are understood by recent GCC releases
(GOMP is built-in), but must be enabled by giving
-fopenmp on the gcc command line with no other special
options; my Pi computation example for OMP is mppi.c. Normally,
environment variables (e.g., OMP_NUM_THREADS) are used to control
things like how many threads to create
-
UPC (unified parallel C)
-
UPC (Unified Parallel C) is an
extension of the C language, and hence requires a special
compiler. There are several UPC compilers; the fork of GCC
called GUPC must be installed as described at the project
homepage (in my systems, it is installed at
/usr/local/gupc/bin/gupc). My Pi computation example
for UPC is upcpi.upc; compilation
is straightforward, but the executable produced processes some
command line arguments as UPC controls, for example, -n
is used to specify the number of processes to create.
Basic SIMD Architecture & Concepts
Papers describing basic (pronounced "old") SIMD architecture.
Notice that traditional SIMD is often bit-serial and extremely
simple per processing element.
-
Architecture of a massively parallel processor (PDF)
-
This paper describes Ken Batcher's SIMD MPP design at Goodyear Aerospace.
-
DAP -- a distributed array processor (PDF)
-
This paper describes the ICL DAP, another early SIMD machine.
-
Thinking Machines CM-2
(PDF)
-
A (relatively late) version of the "Connection Machine
Model CM-2 Technical Summary, Version 6.0, November 1990."
This includes description of the (CM-200) floating-point
hardware added to the design.
-
Activity Counter Implementation Of Enable Logic
(PDF)
-
This paper describes a clever method for handling
tracking of nested SIMD enable/disable without use of a bit
stack.
Basic SWAR Architecture & Concepts
The next step after big SIMD machines was SIMD Within A Register
(SWAR). This is used in nearly all modern processors.
-
Multimedia Extensions For Microprocessors:
SIMD Within A Register
(HTML,
PDF)
-
One of the first talks on the concepts of SWAR...
originally presented in February 1997 at Purdue University.
The HTML is a little ugly, but this is the original HTML,
and the server it was on supported different server-side
processing....
-
Compiling for SIMD within a Register
(PDF)
-
One of the best generic descriptions of the concepts of SWAR.
The above link is direct from Springer-Verlag but using UK's EZProxy access
Basic GPU Architecture & Concepts
-
The NVIDIA Developer CUDA education site has many nice links,
including this set of slides from Mark Harris
-
Lots of good stuff here. I'm using the above slides from Mark Harris
to introduce CUDA C/C++, starting with the October 30, 2020 lecture
-
An Introduction to Modern GPU Architecture
-
A very nice set of oldish slides from NVIDIA....
-
Introduction to the CUDA Platform
-
Very minimal overview slides from NVIDIA, but points at everything....
GPU Programming Tricks
Our MIMD On GPU work. The 2009 paper giving the details
isn't freely available, but for this course, here's an unofficial copy and here are slides for it. An interesting little bit to look at is
mog.cu, which is a later version of the
MOG interpreter core.
Synchronization across multiple little SIMD engines within a GPU is described
in our Magic Algorithms page
The latest (CUDA 9) CUDA
Warp-Level Primitives are described here.
The atomic primitives are described in this section of the CUDA-C programming guide.
Here are slides from NVIDIA overviewing their use.
Cooperative Groups: Flexible CUDA Thread Programming is an
API for groups within a block.
Mark Harris 2007 slides on reduction optimization
It is useful to note that there is now even better efficiency possible
using
warp shuffle, and lots of optimized functions are now available
using CUB
NVIDIA's developer site on using OpenCL
Here is a nice summary of OpenCL support in GPUs/CPUs (not FPGAs)
Intel's FPGA SDK for OpenCL (remember, Altera is now part of Intel)
OpenACC (and OpenMP for GPUs)
Both these sets of directives (pragmas) allow you to get code
running on a GPU without much fuss, but that doesn't mean
they're simple. Pragmas are part of the C/C++ languages, but
they're not really integrated. The rule is that a program should
still work if compiled ignoring all pragmas, and that's mostly
true for OpenACC and OpenMP programs in C/C++.
That said, both sets of pragmas are supported by GCC. There are
lots of similarities with strikingly unnecessary differences.
For example, what OpenACC calls a "gang" is pretty much what
OpenMP calls a "team" -- although there are lots of differences,
both roughly correspond to what NVIDIA calls a "block". In any
case, tools like nvprof still work with the code they
generate... because it all ends up being kernels to run on
NVIDIA GPUs. Of course, both OpenMP and OpenACC are intended to
run code on Intel and AMD GPUs too, but those targets are
currently less well supported by the free implementations.
Dr. Dobb's Easy GPU Parallelism with OpenACC
OpenACC (yeah, it should
really be OpenAcc, but that's not what they call themselves) and
here's their reference card (which isn't too bad, really)
OpenMP was really designed for shared-memory multi-core processors...
but now includes support similar to OpenACC;
Here is a little summary of the OpenMP 5 support for GPUs.
Graphics and OpenGL
There are lots of overview slides out there.
These slides by Daniel Aliaga at Purdue CS are about as good an
overview as I've found of both history and the basic graphics pipeline.
Learn OpenGL is a website with a nice intro tutorial
What Every CUDA Programmer Should Know About OpenGL
The Open-Source OpenGL Utility Toolkit, better known as freeglut
OpenGL / GLUT Program Sample Code... which isn't explicitly using CUDA