All materials posted here are for personal use only. This material is being massively restructured for Fall 2018.
Papers describing basic (pronounced "old") SIMD architecture are linked here. Notice that traditional SIMD is often bit-serial and extremely simple per processing element.
The next step after big SIMD machines was SIMD Within A Register (SWAR). This is used in nearly all modern processors. References are linked here.
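To make the SWAR idea concrete, here is a minimal sketch in plain C that treats one 32-bit register as four independent 8-bit lanes. The function name swar_add8 and the choice of 8-bit lanes are just assumptions for illustration; they are not taken from the linked references.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed into one 32-bit word.
   Each lane wraps around modulo 256 on its own; no carry crosses
   into the neighboring lane. (Hypothetical helper for illustration.) */
uint32_t swar_add8(uint32_t x, uint32_t y)
{
    const uint32_t low7 = 0x7f7f7f7fu;  /* low 7 bits of every lane */
    const uint32_t top1 = 0x80808080u;  /* top bit of every lane */

    /* Sum the low 7 bits of each lane; any carry lands in that
       lane's own top bit, never in the next lane. */
    uint32_t sum = (x & low7) + (y & low7);

    /* Fold in the top bits with XOR, which adds them while
       discarding the carry out of each lane. */
    return sum ^ ((x ^ y) & top1);
}
```

An ordinary 32-bit add would let the carry out of one lane corrupt the lane above it; masking off the top bit of each lane before the add is what keeps the lanes independent.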
Modern GPU Architecture
Inside the Volta GPU Architecture and CUDA 9
Kepler GK110/210 white paper
First CUDA program video and course materials
The first CUDA program there uses host memory mapping; here's a version that doesn't.
Our MIMD On GPU work. The 2009 paper giving the details isn't freely available, but for this course, here's an unofficial copy and here are slides for it. An interesting little bit to look at is mog.cu, which is a later version of the MOG interpreter core.
Synchronization across multiple little SIMD engines within a GPU is described in our Magic Algorithms page
The latest (CUDA 9) CUDA Warp-Level Primitives are described here.
The atomic primitives are described in this section of the CUDA-C programming guide. Here are slides from NVIDIA overviewing their use.
Cooperative Groups: Flexible CUDA Thread Programming is an API for groups within a block.
Mark Harris 2007 slides on reduction optimization
It is useful to note that even better efficiency is now possible using warp shuffle, and lots of optimized functions are now available using CUB.
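As a rough picture of what a warp-shuffle reduction does, here is a plain C simulation of the data movement. It is not GPU code. In CUDA, each step would be something like val += __shfl_down_sync(0xffffffff, val, offset) executed by all 32 lanes at once; the array-based function simulated_warp_sum below is a hypothetical stand-in that mimics that on the host.

```c
#define WARP_SIZE 32

/* Simulate a warp-wide sum using the shuffle-down pattern.
   lanes[] holds one value per lane and is overwritten; only the
   value left in lane 0 is the complete total. */
int simulated_warp_sum(int lanes[WARP_SIZE])
{
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        /* Snapshot the lanes so every lane reads pre-step values,
           as if all lanes executed the shuffle simultaneously. */
        int before[WARP_SIZE];
        for (int lane = 0; lane < WARP_SIZE; ++lane)
            before[lane] = lanes[lane];

        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            int src = lane + offset;
            /* A lane whose source is out of range gets its own
               value back, so its partial sum is not meaningful. */
            int fetched = (src < WARP_SIZE) ? before[src] : before[lane];
            lanes[lane] = before[lane] + fetched;
        }
    }
    return lanes[0];
}
```

The halving offsets (16, 8, 4, 2, 1) finish in five steps with no shared memory traffic, which is where the efficiency gain over the shared-memory versions in the Harris slides comes from.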
NVIDIA's developer site on using OpenCL
Here is a nice summary of OpenCL support in GPUs/CPUs (not FPGAs)
Intel's FPGA SDK for OpenCL (remember, Altera is now part of Intel)
OpenACC and OpenMP are both sets of directives (pragmas) that allow you to get code running on a GPU without much fuss, but that doesn't mean they're simple. Pragmas are part of the C/C++ languages, but they're not really integrated. The rule is that a program should still work if compiled ignoring all pragmas, and that's mostly true for OpenACC and OpenMP programs in C/C++.
That said, both sets of pragmas are supported by GCC. There are lots of similarities with strikingly unnecessary differences. For example, what OpenACC calls a "gang" is pretty much what OpenMP calls a "team" -- although there are lots of differences, both roughly correspond to what NVIDIA calls a "block". In any case, tools like nvprof still work with the code they generate... because it all ends up being kernels to run on NVIDIA GPUs. Of course, both OpenMP and OpenACC are intended to run code on Intel and AMD GPUs too, but those targets are currently less well supported by the free implementations.
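Here is a small sketch of the same loop annotated both ways, to show how similar the two directive sets look and why the code still works when the pragmas are ignored. The function names saxpy_omp and saxpy_acc are made up for this example, and the particular clauses are just one reasonable choice, not the only way to write it.

```c
/* y[i] = a * x[i] + y[i], annotated with OpenMP target directives.
   "teams" here plays roughly the role of OpenACC gangs. */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* The same loop annotated with OpenACC directives.
   copyin/copy play roughly the role of map(to:)/map(tofrom:). */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compiled with no special flags, a compiler that doesn't recognize the pragmas skips them (possibly with a warning) and both functions run as ordinary sequential C. With GCC, -fopenmp or -fopenacc enables the corresponding directives, and whether the loop actually lands on a GPU depends on how the compiler was built for offloading.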
Dr. Dobb's Easy GPU Parallelism with OpenACC
OpenACC (yeah, it should really be OpenAcc, but that's not what they call themselves) and here's their reference card (which isn't too bad, really)
OpenMP was really designed for shared-memory, multi-core, processors... but now includes support similar to OpenACC; here's their 12-page reference card
There are lots of overview slides out there. These slides by Daniel Aliaga at Purdue CS are about as good an overview as I've found of both history and the basic graphics pipeline.
Learn OpenGL is a website with a nice intro tutorial
What Every CUDA Programmer Should Know About OpenGL
Not-yet-updated stuff follows....
This site contains a variety of news, paper links, etc., about use of GPUs (Graphics Processing Units) for General-Purpose computing -- commonly known as GPGPU. Note that general-purpose is a misnomer; it is really about programming GPUs for tasks that are not entirely graphical.
The first paper on ATI's CTM (Close To the Metal) software interface to GPUs (Graphics Processing Units) for general-purpose computing. Referenced directly from ATI's site, which is now part of AMD's site. There are also slides and a full manual at the ATI/AMD site.
We'll be starting with NVIDIA's CUDA environment. The latest version is 4.0. Note that the version numbers are different for the various components of the CUDA system, and do not have any obvious relationship to the Compute Capability levels that are supported. However, version numbers are consistent across the supported platforms.
 GPU Computing