Assignment 3: How Fast Is This?

The goal of this first NVIDIA CUDA project is to give you some idea of how fast a GPU is at performing some very simple operations -- and how that performance scales with virtualization.

To Begin

When you log into one of the machines for this course, you'll find a directory NVIDIA_CUDA-9.2_Samples with a variety of subdirectories. The best (and safest) examples to start playing with remotely, and to examine the source code of, are 1_Utilities/deviceQuery and 0_Simple/vectorAdd. They contain the basic ideas for everything you'll need in this first project.

What You Must Do

Take a look at the vectorAdd demo. It is neither a very interesting nor a high-performance code, but it is really easy to understand.

Start by making a new directory called NVIDIA_CUDA-9.2_Samples/0_Simple/scaling and copy the vectorAdd files into it. Change all the references to "vectorAdd" to "scaling", recompile using make, and run the newly-compiled ./scaling. It should behave identically to the original vectorAdd.

Now find C[i] = A[i] + B[i]; in the scaling kernel. Insert for (int j=0; j<1024; ++j) before it so that the code has roughly 1024 times as much work to do per element -- this will make it easier to time. (You might worry that a smart compiler could eliminate the loop, but by default NVIDIA's CUDA compiler will not optimize that code away.)
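Assuming you renamed the kernel along with the files, the result might look like this sketch -- the j loop is the only change from the original vectorAdd kernel:

```cuda
/* Sketch of the modified kernel in scaling.cu.  The j loop simply
   repeats the same element-wise add 1024 times; it changes no
   results, but multiplies the per-element work to make timing easier. */
__global__ void
scaling(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        for (int j = 0; j < 1024; ++j)
            C[i] = A[i] + B[i];
    }
}
```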

Running the deviceQuery demo will tell you a variety of things about the GPU in the system. In particular, it will tell you how many "Multiprocessors" the system contains and how many "CUDA Cores/MP": let's call those numbers MP and CMP, respectively. Inside your scaling.cu file, you'll notice that threadsPerBlock and blocksPerGrid are used to specify how to distribute numElements worth of work across virtual processing elements. Instead of using those parameter computations, I want you to run a set of benchmarks with them set as all combinations of:

Of course, you need some way to get timing from each run. Although you could use the Unix time command, I instead want you to use nvprof. Running nvprof ./scaling instead of ./scaling will give you a pile of profiling output. The number I want you to look at is the "Time" for scaling(). You probably expect that execution time will change linearly with numElements, but it will not. I want you to observe what it actually does.

Although you can change these parameters by recompiling each time, it is easiest to make threadsPerBlock and blocksPerGrid into command-line options to your program's main. Note that atoi() can convert a string that looks like an integer into its integer value.

Due Dates

The recommended due date for this assignment is before class, Friday, November 1, 2019. This submission window will close when class begins on Monday, November 4, 2019. You may submit as many times as you wish, but only the last submission that you make before class begins on Monday, November 4, 2019 will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with a name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf (this follows the same guidelines used for CPE480). Your implementor's notes for this project must include your observations about the execution time as the parameters are changed; your observations can be stated in paragraph form, as a table, or as a graph of an appropriate type. For this particular project, place everything in the scaling directory and submit a tarball of that directory. The CUDA source file in that directory should be named scaling.cu.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).




EE599/699 GPU & Multi-Core Computing