References for EE599/699 GPU & Multi-Core Computing

All materials posted here are for personal use only. Material will be added incrementally throughout the Fall 2020 semester.

Basic MIMD Architecture & Concepts

A little about how this has evolved historically...

Repetition Filter Memory in CHoPP
A. Klappholz, "Improved Design for a Stochastically Conflict-Free Memory/Interconnection System," in Conference Record of the 14th Asilomar Conference on Circuits, Systems and Computers, Pacific Grove, CA, USA, 1981, pp. 443-448. (still looking for a copy of this or another relevant article...)
Fetch-&-Add in the NYU Ultracomputer
A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer -- Designing an MIMD Shared Memory Parallel Computer," in IEEE Transactions on Computers, vol. C-32, no. 2, pp. 175-189, 1983. doi: 10.1109/TC.1983.1676201 (URL, local copy)
"An Overview of the NYU Ultracomputer Project (1986)" (PDF) is a better, but more obscure, reference
Explanation of the "Hot Spot" problem for RP3
G. F. Pfister and V. A. Norton, "'Hot spot' contention and combining in multistage interconnection networks," in IEEE Transactions on Computers, vol. C-34, no. 10, pp. 943-948, Oct. 1985. (URL, local copy)
Memory consistency models
"Shared Memory Consistency Models: A Tutorial" (PDF) -- Sarita Adve has done quite a few versions of this sort of description
Modern atomic memory access instructions
AMD64 atomic instructions
Futexes
Many short, yet still confusing, descriptions of Futexes are available, and here's probably the best early overview (PDF); the catch is that various Linux kernels have different futex() implementations with 4, 5, or 6 arguments.
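To make the calling convention concrete, here is a minimal sketch of mine (not code from the overview above) using the raw 6-argument syscall form: one thread blocks with FUTEX_WAIT while the word holds an expected value, and another changes the word and wakes it with FUTEX_WAKE.

    /* Minimal futex sketch -- an illustration, not code from the overview above.
       Build with something like: gcc futexdemo.c -lpthread */
    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(uint32_t *uaddr, int op, uint32_t val)
    {
        /* timeout, uaddr2, and val3 are unused by FUTEX_WAIT/FUTEX_WAKE here */
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static uint32_t flag = 0;

    static void *waiter(void *arg)
    {
        /* sleep only while flag still holds the expected value 0;
           recheck after every wakeup, since wakeups can be spurious */
        while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
            futex(&flag, FUTEX_WAIT, 0);
        puts("waiter saw the flag");
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        sleep(1);                                 /* let the waiter block */
        __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
        futex(&flag, FUTEX_WAKE, 1);              /* wake at most one waiter */
        pthread_join(t, NULL);
        return 0;
    }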
Transactional memory
Transactional Memory has been a hot idea for quite a while. Intel's Haswell processors incorporate a hardware implementation described in Chapter 8 of this PDF (locally, PDF), but there were (and still are) problems.
Wikipedia has a nice summary of software support for transactional memory.
There is a version of software transactional memory implemented in GCC.
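In practice that looks roughly like the sketch below (a generic example of mine, not taken from the references above): build with -fgnu-tm and wrap the critical updates in a __transaction_atomic block, which GCC's libitm runtime executes atomically with respect to other transactions.

    /* GCC software transactional memory sketch -- a generic illustration.
       Compile with: gcc -fgnu-tm tmdemo.c */
    #include <stdio.h>

    static long from = 100, to = 0;

    void transfer(long amount)
    {
        __transaction_atomic {      /* both updates commit together, or neither does */
            from -= amount;
            to += amount;
        }
    }

    int main(void)
    {
        transfer(40);
        printf("from=%ld to=%ld\n", from, to);
        return 0;
    }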
Replicated/Distributed Shared Memory
A very odd one is implemented in AFAPI as Replicated Shared Memory
The best known is Treadmarks, out of Rice University
One of the latest is DEX: Scaling Applications Beyond Machine Boundaries, which is part of Popcorn Linux

Shared Memory Programming

From very low level to very high level....

Mutex (exclusive lock) vs. Semaphore (signaling mechanism)
Don't yet have a great reference for this, but they're everywhere. Basic Mutex operations are lock(m) and unlock(m), with many implementations. Basic Semaphore operations are classically called P and V (wait and signal). The simplest (busy-waiting) counting semaphore would be something like void p(semaphore *s) { while (*s <= 0) ; --*s; } and void v(semaphore *s) { ++*s; } -- with the understanding that the test-and-decrement and the increment must each be done atomically; a sketch using C11 atomics follows.
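Here is one way the above might look as real (if busy-waiting) code using C11 atomics; this is just my sketch, and the names semaphore, sem_p, and sem_v are made up for the example. A thread calls sem_p(&s) before the protected work and sem_v(&s) after.

    /* Busy-waiting counting semaphore using C11 atomics -- an illustrative
       sketch only.  P spins until a unit is available and claims it with a
       compare-and-swap; V releases a unit with an atomic increment. */
    #include <stdatomic.h>

    typedef atomic_int semaphore;

    void sem_p(semaphore *s)                      /* P (wait) */
    {
        int v;
        do {
            while ((v = atomic_load(s)) <= 0) ;   /* spin until s > 0 */
        } while (!atomic_compare_exchange_weak(s, &v, v - 1));
    }

    void sem_v(semaphore *s)                      /* V (signal) */
    {
        atomic_fetch_add(s, 1);
    }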
Barrier synchronization
There are various atomic counter algorithms (a sketch of one appears after this entry); alternatively, here is the GPU SyncBlocks algorithm from my Magic Algorithms page
That's basically the same as used in The Aggregate Function API: It's Not Just For PAPERS Anymore
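Here is the kind of atomic-counter scheme meant above: a generic sense-reversing barrier sketched with GCC's __atomic builtins and pthreads. It is an illustration, not the SyncBlocks code itself, and NTHREADS is a made-up constant.

    /* Sense-reversing counter barrier -- an illustrative sketch.
       The last thread to arrive resets the counter and flips the shared
       sense, releasing everyone spinning on it.
       Build with something like: gcc barrierdemo.c -lpthread */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 4

    static int count = 0;
    static bool sense = false;

    static void barrier(bool *my_sense)
    {
        *my_sense = !*my_sense;                   /* phase this thread waits for */
        if (__atomic_add_fetch(&count, 1, __ATOMIC_ACQ_REL) == NTHREADS) {
            count = 0;                            /* last arrival resets... */
            __atomic_store_n(&sense, *my_sense, __ATOMIC_RELEASE);  /* ...and releases */
        } else {
            while (__atomic_load_n(&sense, __ATOMIC_ACQUIRE) != *my_sense)
                ;                                 /* spin until released */
        }
    }

    static void *worker(void *arg)
    {
        bool my_sense = false;
        for (int round = 0; round < 3; ++round) {
            barrier(&my_sense);                   /* nobody leaves early */
            if ((long) arg == 0)
                printf("all %d threads passed round %d\n", NTHREADS, round);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; ++i)
            pthread_create(&t[i], NULL, worker, (void *) i);
        for (long i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);
        return 0;
    }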
Direct use of System V shared memory
My System V shared memory version of the Pi computation is shmpi.c -- note that this version uses raw assembly code to implement a lock, which has far less overhead than using the System V OS calls (unless you're counting on the OS to schedule based on who's waiting for what)
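For reference, the bare OS calls that such a program builds on look roughly like this; a minimal sketch of mine, not shmpi.c, with error checking omitted.

    /* Allocate a System V shared memory segment and share it across
       fork()ed processes -- just the raw calls, nothing clever. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* create a private segment big enough for one int */
        int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
        int *shared = (int *) shmat(shmid, NULL, 0);   /* map it into this process */
        *shared = 0;

        if (fork() == 0) {          /* the child sees the same physical memory */
            *shared = 42;
            shmdt(shared);
            exit(0);
        }
        wait(NULL);
        printf("child wrote %d\n", *shared);

        shmdt(shared);                       /* unmap */
        shmctl(shmid, IPC_RMID, NULL);       /* mark the segment for removal */
        return 0;
    }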
POSIX Threads
POSIX Threads (pthreads) is now a standard library included in most C/C++ compilation environments, and linked as the -lpthread library under Linux GCC; my Pi computation example for pthreads is pthreadspi.c
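The general create/join pattern looks like the sketch below; this is my minimal example, not the actual pthreadspi.c, and NTHREADS and STEPS are just placeholders.

    /* Minimal pthreads Pi sketch.  Each thread handles an interleaved share
       of the rectangle-rule intervals and stores a partial sum; main() joins
       the threads and combines the results.
       Build with: gcc pthreadsdemo.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define STEPS    10000000

    static double partial[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long) arg;
        double h = 1.0 / STEPS, sum = 0.0;
        for (long i = id; i < STEPS; i += NTHREADS) {   /* interleaved split */
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        partial[id] = sum * h;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        double pi = 0.0;
        for (long id = 0; id < NTHREADS; ++id)
            pthread_create(&t[id], NULL, worker, (void *) id);
        for (long id = 0; id < NTHREADS; ++id) {
            pthread_join(t[id], NULL);
            pi += partial[id];
        }
        printf("pi is about %.9f\n", pi);
        return 0;
    }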
OpenMP (aka, OMP)
Here is a nice overview intro to OpenMP/OMP as slides (PDF). OMP pragmas are understood by recent GCC releases (GOMP is built-in), but must be enabled by giving -fopenmp on the gcc command line with no other special options; my Pi computation example for OMP is mppi.c. Normally, environment variables (e.g., OMP_NUM_THREADS) are used to control things like how many threads to create.
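The general shape of such a code is just one pragma on the loop; the following is my minimal sketch, not the actual mppi.c. Run it with something like OMP_NUM_THREADS=4 ./a.out to pick the thread count.

    /* Minimal OpenMP Pi sketch: a rectangle-rule estimate with a
       parallel-for reduction.  Compile with: gcc -fopenmp omppi.c */
    #include <stdio.h>

    int main(void)
    {
        const int n = 10000000;
        const double h = 1.0 / n;
        double sum = 0.0;

        /* each thread accumulates a private sum; the reduction combines them */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        printf("pi is about %.9f\n", sum * h);
        return 0;
    }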
UPC (unified parallel C)
UPC (Unified Parallel C) is an extension of the C language, and hence requires a special compiler. There are several UPC compilers; the fork of GCC called GUPC must be installed as described at the project homepage (on my systems, it is installed at /usr/local/gupc/bin/gupc). My Pi computation example for UPC is upcpi.upc; compilation is straightforward, but the executable produced processes some command line arguments as UPC controls; for example, -n is used to specify the number of processes to create.

Basic SIMD Architecture & Concepts

Papers describing basic (pronounced "old") SIMD architecture. Notice that traditional SIMD is often bit-serial and extremely simple per processing element.

Architecture of a massively parallel processor (PDF)
This paper describes Ken Batcher's SIMD MPP design at Goodyear Aerospace.
DAP -- a distributed array processor (PDF)
This paper describes the ICL DAP, another early SIMD machine.
Thinking Machines CM-2 (PDF)
A (relatively late) version of the "Connection Machine Model CM-2 Technical Summary, Version 6.0, November 1990." This includes a description of the (CM-200) floating-point hardware added to the design.
Activity Counter Implementation Of Enable Logic (PDF)
This paper describes a clever method for handling tracking of nested SIMD enable/disable without use of a bit stack.

Basic SWAR Architecture & Concepts

The next step after big SIMD machines was SIMD Within A Register (SWAR). This is used in nearly all modern processors.

Multimedia Extensions For Microprocessors: SIMD Within A Register (HTML, PDF)
One of the first talks on the concepts of SWAR... originally presented in February 1997 at Purdue University. The HTML is a little ugly, but this is the original HTML, and the server it was on supported different server-side processing....
Compiling for SIMD within a Register (PDF)
One of the best generic descriptions of the concepts of SWAR. The above link goes directly to Springer-Verlag, but uses UK's EZProxy access.
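To make the idea concrete, here is the classic masking trick for partitioned addition, sketched in portable C (a generic illustration, not code from either paper above): four 8-bit fields packed in one 32-bit word are added without letting carries spill across field boundaries.

    /* SWAR sketch: per-field (modulo-256) addition of four packed 8-bit values. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t swar_add8(uint32_t x, uint32_t y)
    {
        const uint32_t H = 0x80808080u;          /* top bit of each 8-bit field */
        /* add with top bits masked off so carries cannot cross fields,
           then XOR the top bits back in (their sum without carry out) */
        return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
    }

    int main(void)
    {
        uint32_t a = 0x01FF7F10u, b = 0x01010203u;
        /* prints 02008113: each byte of a and b summed modulo 256 */
        printf("%08x\n", (unsigned) swar_add8(a, b));
        return 0;
    }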

Basic GPU Architecture & Concepts

The NVIDIA Developer CUDA education site has many nice links, including this set of slides from Mark Harris
Lots of good stuff here. I'm using the above slides from Mark Harris to introduce CUDA C/C++, starting with the October 30, 2020 lecture
An Introduction to Modern GPU Architecture
A very nice set of oldish slides from NVIDIA....
Introduction to the CUDA Platform
Very minimal overview slides from NVIDIA, but points at everything....

GPU Programming Tricks

Our MIMD On GPU work. The 2009 paper giving the details isn't freely available, but for this course, here's an unofficial copy and here are slides for it. An interesting little bit to look at is mog.cu, which is a later version of the MOG interpreter core.

Synchronization across multiple little SIMD engines within a GPU is described in our Magic Algorithms page

The latest (CUDA 9) CUDA Warp-Level Primitives are described here.

The atomic primitives are described in this section of the CUDA-C programming guide. Here are slides from NVIDIA overviewing their use.
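The simplest case looks like the sketch below (my example, not NVIDIA's): every thread atomicAdd()s one into a single global counter, and no increments are lost.

    // Minimal CUDA atomics sketch -- an illustration, not NVIDIA's example code.
    // atomicAdd serializes the read-modify-write on the shared counter.
    // Build with: nvcc atomicdemo.cu
    #include <cstdio>

    __global__ void count_kernel(int *counter)
    {
        atomicAdd(counter, 1);
    }

    int main(void)
    {
        int *counter, host = 0;
        cudaMalloc(&counter, sizeof(int));
        cudaMemcpy(counter, &host, sizeof(int), cudaMemcpyHostToDevice);
        count_kernel<<<64, 256>>>(counter);
        cudaMemcpy(&host, counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("count = %d (expect %d)\n", host, 64 * 256);
        cudaFree(counter);
        return 0;
    }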

Cooperative Groups: Flexible CUDA Thread Programming is an API for groups within a block.
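A small sketch of the flavor of the API (based on my reading of that post, so treat the details as assumptions): partition the block into 32-thread tiles, reduce within each tile using tile shuffles, and have each tile leader publish its result.

    // Cooperative Groups sketch -- an illustrative example.
    #include <cooperative_groups.h>
    #include <cstdio>
    namespace cg = cooperative_groups;

    __global__ void tile_sum(const int *in, int *out)
    {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

        int v = in[blockIdx.x * blockDim.x + threadIdx.x];
        for (int offset = tile.size() / 2; offset > 0; offset /= 2)
            v += tile.shfl_down(v, offset);        // tree reduction within the tile
        if (tile.thread_rank() == 0)               // one thread per tile publishes
            atomicAdd(out, v);
    }

    int main(void)
    {
        const int n = 1024;
        int host[n], result = 0;
        int *in, *out;
        for (int i = 0; i < n; ++i) host[i] = 1;
        cudaMalloc(&in, n * sizeof(int));
        cudaMalloc(&out, sizeof(int));
        cudaMemcpy(in, host, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(out, 0, sizeof(int));
        tile_sum<<<n / 256, 256>>>(in, out);
        cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("sum = %d (expect %d)\n", result, n);
        cudaFree(in); cudaFree(out);
        return 0;
    }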

Mark Harris 2007 slides on reduction optimization
It is useful to note that even better efficiency is now possible using warp shuffle, and lots of optimized functions are now available in CUB.
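For instance, the core of a modern reduction looks roughly like this sketch (written in the spirit of those slides, but it is not Mark Harris's or CUB's actual code): each warp reduces its values in registers with __shfl_down_sync, then lane 0 of each warp atomically adds into the global total.

    // Warp-shuffle reduction sketch.  Build with: nvcc shufflesum.cu
    #include <cstdio>

    __global__ void sum_kernel(const float *in, int n, float *total)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;

        // tree reduction within the 32-lane warp, entirely in registers
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);

        if ((threadIdx.x & 31) == 0)      // lane 0 of each warp
            atomicAdd(total, v);
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *in, *total, result = 0.0f;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&total, sizeof(float));
        cudaMemset(total, 0, sizeof(float));

        // fill the input with 1.0f so the expected sum is n
        float *host = new float[n];
        for (int i = 0; i < n; ++i) host[i] = 1.0f;
        cudaMemcpy(in, host, n * sizeof(float), cudaMemcpyHostToDevice);

        sum_kernel<<<(n + 255) / 256, 256>>>(in, n, total);
        cudaMemcpy(&result, total, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %g (expect %d)\n", result, n);

        delete [] host;
        cudaFree(in);
        cudaFree(total);
        return 0;
    }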

NVIDIA's developer site on using OpenCL

Here is a nice summary of OpenCL support in GPUs/CPUs (not FPGAs)

Intel's FPGA SDK for OpenCL (remember, Altera is now part of Intel)

OpenACC (and OpenMP for GPUs)

Both these sets of directives (pragmas) allow you to get code running on a GPU without much fuss, but that doesn't mean they're simple. Pragmas are part of the C/C++ languages, but they're not really integrated. The rule is that a program should still work if compiled ignoring all pragmas, and that's mostly true for OpenACC and OpenMP programs in C/C++.

That said, both sets of pragmas are supported by GCC. There are lots of similarities, along with some strikingly unnecessary differences. For example, what OpenACC calls a "gang" is pretty much what OpenMP calls a "team" -- although there are lots of differences, both roughly correspond to what NVIDIA calls a "block". In any case, tools like nvprof still work with the code they generate... because it all ends up being kernels to run on NVIDIA GPUs. Of course, both OpenMP and OpenACC are also intended to run code on Intel and AMD GPUs, but those targets are currently less well supported by the free implementations.
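To see how closely the two map onto each other, here is the same loop written both ways in one sketch of mine (not from the references in this section; the USE_OPENACC macro is just a switch made up for the example). Compile with an offloading-enabled GCC, e.g. gcc -fopenacc -DUSE_OPENACC or gcc -fopenmp.

    /* The same offloaded dot product, once with OpenACC and once with OpenMP. */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        float sum = 0.0f;
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    #ifdef USE_OPENACC
        /* OpenACC: the compiler picks gangs/workers/vector lanes */
        #pragma acc parallel loop reduction(+:sum) copyin(x, y)
        for (int i = 0; i < N; ++i)
            sum += x[i] * y[i];
    #else
        /* OpenMP: teams roughly correspond to OpenACC gangs (and CUDA blocks) */
        #pragma omp target teams distribute parallel for \
                reduction(+:sum) map(to: x, y) map(tofrom: sum)
        for (int i = 0; i < N; ++i)
            sum += x[i] * y[i];
    #endif

        printf("dot product = %g (expect %d)\n", sum, 2 * N);
        return 0;
    }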

Dr. Dobb's Easy GPU Parallelism with OpenACC

OpenACC (yeah, it should really be OpenAcc, but that's not what they call themselves) and here's their reference card (which isn't too bad, really)

OpenMP was really designed for shared-memory, multi-core processors... but now includes support similar to OpenACC; here is a little summary of the OpenMP 5 support for GPUs.

Graphics and OpenGL

There are lots of overview slides out there. These slides by Daniel Aliaga at Purdue CS are about as good an overview as I've found of both history and the basic graphics pipeline.

Learn OpenGL is a website with a nice intro tutorial

What Every CUDA Programmer Should Know About OpenGL

The Open-Source OpenGL Utility Toolkit, better known as freeglut

OpenGL- GLUT Program Sample Code... which isn't explicitly using CUDA

