Assignment 1: Never Gonna Leave Ya

The goal of this NVIDIA CUDA project is to write a simple reduction routine that operates entirely within the GPU. This is quite different from the reductions in the NVIDIA reduction example, which have the host invoke the same kernel multiple times.

To Begin

Basically, you're going to combine one of the reductions from the reduction benchmark with the barrier synchronization mechanism, SyncBlocks, from our Magic Algorithms page. The idea is simple: there is a cost associated with starting and stopping a kernel, so you should perform the complete reduction within a single kernel call. You will not see a huge speedup from this, because most of the time penalty is really not the kernel start/stop per se, but the loss of local state that implies -- and reduction happens to have no significant local state. Still, having a self-contained reduction routine usable from within the GPU is a huge help.
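The overall kernel structure might look something like the following sketch. The names reduceInside, syncBlocks(), and the use of g_out to hold per-block partial sums are assumptions for illustration, not the SDK's code; syncBlocks() stands in for whatever barrier you build from the SyncBlocks material.

```cuda
// Sketch of a single-kernel reduction: per-block tree reduction,
// then a GPU-wide barrier, then a final pass by block 0.
// syncBlocks() is assumed to be supplied from the SyncBlocks material.
__device__ void syncBlocks(int barno);

__global__ void reduceInside(const float *g_in, float *g_out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // 1. Each block reduces its slice of g_in into shared memory.
    float sum = 0.0f;
    while (i < n) { sum += g_in[i]; i += gridDim.x * blockDim.x; }
    sdata[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];

    // 2. Make the partial sum globally visible, then meet the other
    //    blocks at a GPU-wide barrier instead of returning to the host.
    __threadfence();
    syncBlocks(1);

    // 3. One thread reduces the per-block partial sums.
    if (blockIdx.x == 0 && tid == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += g_out[b];
        g_out[0] = total;
    }
}
```

Note that step 3 replaces the second kernel launch of the multi-pass version; the barrier in step 2 is what makes that safe.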

Now, if you've been playing around with the SDK a bit, you may have noticed that there is already a single-kernel reduction example called threadFenceReduction. The problem is that it relies on atomicInc, which is available on all current NVIDIA GPUs but is not fully portable. It also relies on __threadfence(), which first appeared in CUDA 2.2 but will work on all compute capabilities (because it was a null operation on hardware predating CUDA 2.2).
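For context, the pattern in the SDK demo looks roughly like this simplified sketch (retirementCount is the SDK's name; the rest of the structure here is a paraphrase, not the demo's exact code):

```cuda
// Simplified sketch of the threadFenceReduction retirement pattern
// that this assignment asks you to replace.
__device__ unsigned int retirementCount = 0;

__global__ void reduceSinglePass(float *g_odata, unsigned int numBlocks)
{
    __shared__ bool amLast;
    // ...assume each block has already written its partial sum
    //    to g_odata[blockIdx.x] at this point...
    __threadfence();   // make this block's partial sum visible globally
    if (threadIdx.x == 0) {
        // atomicInc is the non-portable part: the last block to
        // retire sees a ticket of numBlocks - 1.
        unsigned int ticket = atomicInc(&retirementCount, numBlocks);
        amLast = (ticket == numBlocks - 1);
    }
    __syncthreads();
    if (amLast && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < numBlocks; ++b)
            total += g_odata[b];
        g_odata[0] = total;
        retirementCount = 0;  // reset for the next launch
    }
}
```

Note that this is not a barrier: blocks other than the last one simply exit, which is exactly what your SyncBlocks-based version will do differently.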

Functional Requirements

Start with the threadFenceReduction demo. Make a new copy of it as a project called reduceInside.

The object is to replace the atomicInc call, and the references to retirementCount, with a barrier algorithm based on SyncBlocks. There should be very little variation in the execution time of different blocks, so you may simplify the SyncBlocks routine to use just a single count per barrier rather than two. In fact, I'll accept any synchronization mechanism that uses an array in global memory rather than atomic functions for synchronization.
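One possible shape for such a barrier is sketched below. The names barnos[], syncBlocks(), and MAX_BLOCKS are assumptions; the scheme, with block 0 acting as the master that collects arrivals and issues the release, is only one of several acceptable designs.

```cuda
// A sketch of a global-memory barrier with one count per block and
// no atomic functions. Assumes gridDim.x <= MAX_BLOCKS and that all
// blocks are simultaneously resident on the GPU.
#define MAX_BLOCKS 64

__device__ volatile int barnos[MAX_BLOCKS];

__device__ void syncBlocks(int barno)
{
    // Flush this block's pending global writes before signaling.
    __threadfence();
    if (threadIdx.x == 0) {
        if (blockIdx.x == 0) {
            // Block 0 waits for every other block to post this barrier,
            // then releases them by posting its own entry.
            for (int b = 1; b < gridDim.x; ++b)
                while (barnos[b] < barno) /* spin */;
            barnos[0] = barno;
        } else {
            barnos[blockIdx.x] = barno;           // post arrival
            while (barnos[0] < barno) /* spin */; // wait for release
        }
    }
    __syncthreads();  // hold the rest of the block at the barrier
}
```

Using a monotonically increasing barrier number is what lets a single count per barrier suffice: a block can never mistake an arrival at barrier n+1 for a stale arrival at barrier n.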

Although you don't need retirementCount, the barrier synchronization algorithm does use an array called barnos[]. This will require changes to both the kernel code and the calling sequence on the host (to initialize that data structure).
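On the host side, that initialization might look something like this sketch; barnos, its size, and the helper name are assumptions that must match whatever you declare in the kernel code.

```cuda
#include <cuda_runtime.h>

// Must match the device-side declaration in the kernel file.
#define MAX_BLOCKS 64
__device__ volatile int barnos[MAX_BLOCKS];

// Zero the barrier array before each launch so that no block sees a
// stale count from a previous run.
void initBarnos(void)
{
    int zeros[MAX_BLOCKS] = {0};
    cudaMemcpyToSymbol(barnos, zeros, sizeof(zeros));
}
```

Call this before launching the kernel; forgetting to reinitialize between runs is a classic source of hangs with this kind of barrier.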

What You Should Find

Again, we're not really looking for a performance improvement here, but for correct execution despite using a portable barrier mechanism. You will also note that a "bad" choice for the size of a block, or even the number of blocks, can cause the barrier to hang or produce incorrect results. Discuss this a bit in the notes you submit.

Due Dates, Submission Procedure, & Such

You will be submitting source code (reduceInside.cu and reduceInside_kernel.cu), the make file (Makefile), and a simple HTML-formatted "implementor's notes" document (reduceInside.html) which also should summarize your findings.

For full consideration, your project should be submitted no later than October 30, 2012. Submit your tar file through the course submission form, giving your email address, your password, and your section: EE599-005 (undergrad) or EE699-002 (grad).
