Fall 2019 EE599-002 GPU & Multi-Core Computing (undergrads)
Fall 2019 EE699-002 GPU & Multi-Core Computing (grads)

Time & Place: TR 12:00-12:50 PM; meeting in 253 F. Paul Anderson Tower
Instructor: Professor Hank Dietz
Instructor URL: http://aggregate.org/hankd/
Course URL: http://aggregate.org/GPUMC/
Course 1-page ad: adF19.pdf

GPU & Multi-Core Processor Computing is about the large-scale parallel processing within a modern computer. Multi-core refers to the multiple conventional processors now found within a processor chip. Graphics Processing Units (GPUs) were once all about video output, but have mutated into the dominant general-purpose many-core parallel computing architecture. In this course, you'll not only learn about the key architectural features of both, but also how to use them effectively -- primarily by directly programming them, but we'll also discuss some libraries.

The first exam was in class on Wednesday, October 9, 2019. It covered the multi-core MIMD stuff, as was reviewed in class Monday, October 7, 2019.

The second exam, and presentations from the graduate students, will be in the final exam timeslot, 3:30-6:00PM, Monday, December 16, 2019. It will primarily cover the many-core GPU stuff, as will be reviewed in class Friday, December 13, 2019.

Course Materials

All course materials will be linked here:


Aside from working on your own systems, there are two alternatives. There are multi-core systems with NVIDIA GPUs in Marksbury that can be remotely accessed for this course, but we also might be using the large system managed by CCS. All the projects will be C/C++ based, but you'll be using CUDA, OpenACC, and OpenMP for your projects.

  1. The first "real" project is now posted: Simple As 1, 2, 3, .... It involves writing raw C code using System V shared memory to implement a simple ordering of events across processes. It's due before class, Friday, September 20, 2019.
  2. The next project is now posted: Kentucky's Line Extrusion Orderer. It involves converting a serial GA to run in parallel using OpenMP. The deadline has been extended to before class, Friday, October 11, 2019. Note that the assignment has been updated to include the discussion about having to make your own thread-safe rand() (one per thread).
    My sample solution is ompkleo.c and my Implementor's Notes on it are ompkleo.pdf. The source for the notes is ompkleo.tex, which requires sig-alternate-05-2015.cls and acmcopyright.sty to be built by issuing the command pdflatex ompkleo at least twice.
  3. How Fast Is This? involved observing simple scaling behavior for CUDA, and thus should be run on flint for the run data you report in your implementor's notes. It was due before class, Monday, November 4, 2019.
  4. The next project is now posted: Here, Little Fishy, Fishy?. It was due before class, Monday, November 25, 2019, but has been extended to before class, Monday, December 2, 2019, and will be accepted without penalty until a solution is reviewed in class.
  5. The last project is now posted: Mandelbrot. It's a boring and simple project, but there's not much time left... so, that's how it goes. Besides, everybody should write a Mandelbrot code sometime.... Note that the assignment has been updated to include a sample program and compilation instructions for OpenACC.
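The OpenMP project above notes that each thread needs its own rand(). As a rough illustration only, and not the course's sample solution, here is a minimal C sketch of one way that could look. The names my_rng_t, my_rng_seed, my_rng_next, and count_odd_draws are hypothetical, and the generator is a plain xorshift32 picked for brevity; the assignment may well call for something different. The sketch also assumes that compiling without OpenMP support should still work, in which case the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical per-thread generator state: every thread owns one of these,
   so no generator state is ever shared between threads. */
typedef struct { uint32_t state; } my_rng_t;

/* Seed a generator; xorshift32 must never start from zero. */
static void my_rng_seed(my_rng_t *r, uint32_t seed)
{
    r->state = seed ? seed : 0x9E3779B9u;
}

/* One xorshift32 step (shifts 13, 17, 5); returns the new state. */
static uint32_t my_rng_next(my_rng_t *r)
{
    uint32_t x = r->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    r->state = x;
    return x;
}

/* Draw n values in parallel and count how many are odd.
   Each thread seeds a private generator from its thread number. */
static long count_odd_draws(long n, uint32_t base_seed)
{
    long odd = 0;
    assert(n >= 0);
#pragma omp parallel reduction(+:odd)
    {
        int tid = 0;
        long i;
        my_rng_t rng;            /* private: declared inside the region */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Assumed seeding scheme: offset the base seed per thread. */
        my_rng_seed(&rng, base_seed + 0x9E3779B9u * (uint32_t)(tid + 1));
#pragma omp for
        for (i = 0; i < n; ++i)
            odd += (long)(my_rng_next(&rng) & 1u);
    }
    return odd;
}
```

A call such as count_odd_draws(1000000, 42u) should land somewhere near half a million. Because each thread draws from its own stream, the exact count depends on how many threads run and how the loop iterations are divided among them, so results are only repeatable for a fixed thread count and schedule.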

Course Staff

Professor Hank Dietz is usually in the Davis Marksbury Building; see his home page for complete contact info. He has an "open-door" policy that whenever his door is open and he's not busy with someone else, he's available -- and yup, there really is a slow-update live camera in his office so you can check. Alternatively, you also can email hankd@engr.uky.edu to make an appointment; please use "GPUMC" in the email subject line for anything related to this course.

About The Graphic

The AMD Threadripper processor (actually a multi-chip module) provides 16 cores supporting 32 threads. The NVIDIA Quadro GV100 is a video card that can support up to four 5K displays, but provides an even more impressive level of floating-point performance: over 7TFLOPS for 64-bit values and nearly 120TFLOPS for 16-bit values. In context, it wasn't until 2003 that the fastest supercomputer in Kentucky broke the 1TFLOPS barrier... now, even a low-end laptop GPU can do that!
