Assignment 4: Here, Little Fishy, Fishy?

There's a silly little "game" -- really a simulation -- called Conway's Game of Life. Basically, it is a cellular automaton that simulates how colonies of cells grow and die over time. The simulated world consists of a simple 2D rectangular array, each element of which can be either live (1) or dead (0). At each timestep, what happens at each point in the array depends on what is in that point and its 8 neighbors. Under Conway's standard rules, a live cell with two or three live neighbors stays live, a dead cell with exactly three live neighbors becomes live, and every other cell is dead at the next timestep.
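For readers who have not seen the rules in code, here is a minimal sequential C sketch of one Life timestep. It stores one 0/1 byte per cell in row-major order. It also assumes the grid wraps around at its edges, which is a choice made for this sketch only. The function name `life_step` is hypothetical. The function reads only `cur` and writes only `next`, so every cell is updated from the same previous generation. That separation is also what lets a GPU version assign one thread per cell.

```c
#include <assert.h>
#include <stddef.h>

/* One Game of Life timestep on a width x height grid of 0/1 cells.
 * Assumption for this sketch: the grid wraps around at the edges (a torus).
 * 'cur' is read-only; the next generation is written to 'next', so every
 * cell sees the same "old" neighborhood. */
void life_step(const unsigned char *cur, unsigned char *next,
               int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int live = 0;
            /* Count the 8 neighbors, wrapping coordinates at the borders. */
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0)
                        continue;
                    int nx = (x + dx + width) % width;
                    int ny = (y + dy + height) % height;
                    live += cur[(size_t)ny * width + nx];
                }
            }
            unsigned char alive = cur[(size_t)y * width + x];
            /* Conway's rules: survive with 2 or 3 neighbors, birth with exactly 3. */
            next[(size_t)y * width + x] =
                (live == 3 || (alive && live == 2)) ? 1 : 0;
        }
    }
}
```

Calling it on a horizontal three-cell "blinker" produces the vertical blinker on the next step.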

The "game" aspect of this is the setting of the initial pattern. Then one simply watches how the pattern evolves over many timesteps. It's a boring game... so we're going to do Sharks and Fishes instead. I've given you a sequential C program that does this: sharky.c. The way the code works is:

Unfortunately, as written, sharky only processes about 10M pixels/s, and I want you to be processing a 4K (8MP) image for about 10,000 generations (time steps). That would take over two hours running the code as it is. You should be able to speed it up a lot by running on a GPU using CUDA.
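Here is a rough check of that estimate. It uses the demo image size of 3840x2160 given below and the quoted rate of about 10M pixels/s:

```latex
t \approx \frac{(3840 \times 2160)\ \text{pixels} \times 10^{4}\ \text{steps}}{10^{7}\ \text{pixels/s}}
  = \frac{8.2944 \times 10^{10}}{10^{7}}\ \text{s}
  \approx 8294\ \text{s} \approx 2.3\ \text{h}
```

That is indeed a bit over two hours.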

To Begin

For the previous project, you started by creating an edited copy of NVIDIA_CUDA-9.2_Samples/0_Simple/vectorAdd. That's still a great way to get started. However, the big problem is how to reorganize the data structure for running on the GPU.
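One piece of vectorAdd that carries over directly is the launch-size arithmetic. vectorAdd rounds the number of blocks up so that every element gets a thread. A 2D version for an image might look like the following C sketch. The helper names `div_up` and `plan_launch` and the 16x16 block shape are assumptions, not anything the assignment requires.

```c
#include <assert.h>

/* Hypothetical helper: ceiling division used to size a CUDA grid so that
 * every pixel gets a thread even when the image size is not a multiple of
 * the block size (the same idea as blocksPerGrid in the vectorAdd sample). */
static unsigned int div_up(unsigned int n, unsigned int block)
{
    assert(block > 0);
    return (n + block - 1) / block;
}

/* Hypothetical 2D launch shape: block_w x block_h threads per block. */
typedef struct {
    unsigned int grid_x, grid_y;   /* number of blocks in x and y   */
    unsigned int block_x, block_y; /* threads per block in x and y  */
} launch2d;

launch2d plan_launch(unsigned int width, unsigned int height,
                     unsigned int block_w, unsigned int block_h)
{
    launch2d l = { div_up(width, block_w), div_up(height, block_h),
                   block_w, block_h };
    return l;
}
```

For the 3840x2160 demo image and 16x16 blocks, this gives a 240x135 grid with no leftover pixels. In the CUDA code, the four numbers would become `dim3` grid and block sizes. Each thread would then compute `x = blockIdx.x * blockDim.x + threadIdx.x` (and likewise `y`), returning early if `x >= width || y >= height` for image sizes that do not divide evenly.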

First, you certainly do not want to be moving the data structure between the host and GPU any more than necessary. I'd expect you to plop it in global memory on the GPU and leave it there until all 10,000 timesteps have completed. However, the data structures in the C program are a bit strange because they are organized to facilitate graphical display, and you don't need to keep them in that format internally. Here are a few suggestions:
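One possible internal layout is to split the interleaved RGB bytes into three separate planes, one per quantity. This is offered as a sketch, not as a description of what sharky.c does. On a GPU, this layout means adjacent threads read adjacent bytes of the same plane, which coalesces well. The channel roles follow the An Example section: red is the scaled shark age and green is the scaled fish age. Treating the third (blue) channel as the random seed is an assumption. The names `sea_planes`, `deinterleave` and `interleave` are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Planar (structure-of-arrays) view of the image.
 * Assumption: red = scaled shark age, green = scaled fish age,
 * blue = random seed; the scaling factors used by sharky.c are
 * NOT undone here. */
typedef struct {
    unsigned char *shark; /* red channel            */
    unsigned char *fish;  /* green channel          */
    unsigned char *seed;  /* blue channel (assumed) */
} sea_planes;

/* Split interleaved RGBRGB... bytes (as stored in a P6 file) into three
 * contiguous planes, each holding npix bytes. */
void deinterleave(const unsigned char *rgb, size_t npix, sea_planes out)
{
    assert(rgb != NULL);
    for (size_t i = 0; i < npix; ++i) {
        out.shark[i] = rgb[3 * i + 0];
        out.fish[i]  = rgb[3 * i + 1];
        out.seed[i]  = rgb[3 * i + 2];
    }
}

/* Inverse: rebuild interleaved RGB for writing the PPM back out. */
void interleave(sea_planes in, size_t npix, unsigned char *rgb)
{
    assert(rgb != NULL);
    for (size_t i = 0; i < npix; ++i) {
        rgb[3 * i + 0] = in.shark[i];
        rgb[3 * i + 1] = in.fish[i];
        rgb[3 * i + 2] = in.seed[i];
    }
}
```

Each plane could then be copied to the GPU once with `cudaMemcpy` and kept there for the whole run. `interleave` would be used only when writing the final image.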

In any case, it's pretty straightforward to get this working on a GPU, and you'll see huge speedups thanks to the massive parallelism available.
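Keeping the data in GPU global memory for all 10,000 timesteps usually comes down to one pattern. You allocate two buffers once and swap their roles after every step, so nothing is copied between steps. The host-side C sketch below shows that loop. `step_fn`, `run_steps` and the toy `age_step` are hypothetical stand-ins for a real kernel launch on two `cudaMalloc`'ed buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-timestep update: reads only 'src', writes only 'dst'.
 * In a CUDA version this would be a kernel launch on two device buffers
 * that were allocated once and never leave the GPU. */
typedef void (*step_fn)(const unsigned char *src, unsigned char *dst,
                        size_t n);

/* Toy step used only to exercise the loop: every cell ages by one. */
void age_step(const unsigned char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(src[i] + 1);
}

/* Run 'steps' timesteps by ping-ponging between buffers a and b.
 * Returns whichever buffer holds the final generation. */
unsigned char *run_steps(unsigned char *a, unsigned char *b, size_t n,
                         int steps, step_fn step)
{
    assert(steps >= 0);
    unsigned char *cur = a, *next = b;
    for (int t = 0; t < steps; ++t) {
        step(cur, next, n);
        unsigned char *tmp = cur; /* swap roles instead of copying data */
        cur = next;
        next = tmp;
    }
    return cur;
}
```

Only the buffer returned by `run_steps` holds the final generation, so that is the one to copy back (once) before writing the output image.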

An Example

The example case I've prepared for you is demo.ppm. It is a 4K still image that looks like:

Keep in mind that each pixel's color channels really mean different things. The red channel is 0 where there isn't a shark; otherwise, it represents the shark's age (scaled, because an unscaled age of 20 would still be a very dim pixel):

Similarly, although with a different scaling factor, the green channel represents the fish age:

Finally, we have the random seeds:

You don't need to use demo.ppm for debugging your code; any P6 (8-bit per color channel, binary) PPM file will do. However, I do expect your project submission to give times for running with demo.ppm as the test image. Just keep in mind that sharky overwrites the image file you give it as an argument, so don't pass it demo.ppm, but rather use a copy of it.
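Since any P6 file will do, a tolerant header reader can help while debugging. The following minimal sketch accepts the P6 magic number, width, height and maxval, skips `#` comments, and reports where the binary RGB bytes begin. The function `parse_p6_header` is hypothetical, not taken from sharky.c. It rejects maxval above 255 because only 8-bit-per-channel files are expected here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip whitespace and '#' comments (which run to the end of the line). */
static size_t skip_ws_comments(const unsigned char *buf, size_t len, size_t i)
{
    while (i < len) {
        if (isspace(buf[i])) {
            ++i;
        } else if (buf[i] == '#') {
            while (i < len && buf[i] != '\n')
                ++i;
        } else {
            break;
        }
    }
    return i;
}

/* Parse an unsigned decimal integer at *i; returns -1 if none is present
 * or if it exceeds an arbitrary sanity limit chosen for this sketch. */
static long parse_uint(const unsigned char *buf, size_t len, size_t *i)
{
    if (*i >= len || !isdigit(buf[*i]))
        return -1;
    long v = 0;
    while (*i < len && isdigit(buf[*i])) {
        v = v * 10 + (buf[*i] - '0');
        ++*i;
        if (v > 1000000L)
            return -1;
    }
    return v;
}

/* Minimal P6 header parser. On success returns 0 and sets width, height,
 * maxval and the byte offset where the binary pixel data starts. */
int parse_p6_header(const unsigned char *buf, size_t len, int *width,
                    int *height, int *maxval, size_t *data_offset)
{
    assert(buf != NULL);
    if (len < 2 || buf[0] != 'P' || buf[1] != '6')
        return -1;
    size_t i = 2;
    long vals[3];
    for (int k = 0; k < 3; ++k) {
        i = skip_ws_comments(buf, len, i);
        vals[k] = parse_uint(buf, len, &i);
        if (vals[k] <= 0)
            return -1;
    }
    /* A single whitespace byte separates maxval from the pixel data. */
    if (i >= len || !isspace(buf[i]))
        return -1;
    if (vals[2] > 255)
        return -1; /* only 8-bit-per-channel files are handled */
    *width = (int)vals[0];
    *height = (int)vals[1];
    *maxval = (int)vals[2];
    *data_offset = i + 1;
    return 0;
}
```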

Incidentally, the execution time here is not very sensitive to the contents of the sea cells (nope, not even by the sea shore ;-) ). Thus, you can quote a rate by dividing the total number of pixels processed (pixels per image times timesteps) by the real time measured over at least 100 timesteps. For example, the demo image is 3840x2160, or a total of 8,294,400 pixels. It took 78 seconds to run 100 timesteps on my laptop, giving a rate of about 10.6MP/s (8,294,400 x 100 / 78 s). You should be able to beat that by a large margin using the GPU. In fact, minor sequential code optimizations alone can yield a speedup of 3X or better without using any parallelism.
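The rate calculation above is just total pixel updates divided by elapsed wall-clock time. Here is a small C sketch that computes it and reads a monotonic clock via POSIX `clock_gettime`. The function names `pixel_rate` and `now_seconds` are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime under strict C modes */
#include <assert.h>
#include <time.h>

/* Processing rate in pixels per second: every timestep touches every
 * pixel once, so the work is width * height * steps pixel updates. */
double pixel_rate(long width, long height, long steps, double seconds)
{
    assert(seconds > 0.0);
    return (double)width * (double)height * (double)steps / seconds;
}

/* Wall-clock seconds from CLOCK_MONOTONIC; bracket the timestep loop with
 * two calls and pass the difference to pixel_rate. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

For the demo numbers above (3840x2160, 100 timesteps, 78 s), `pixel_rate` returns about 10.6e6 pixels/s. When timing CUDA code this way, remember that kernel launches return before the kernel finishes. Call `cudaDeviceSynchronize()` before taking the end time.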

Due Dates

The recommended due date for this assignment is before class, Friday, November 22, 2019. This submission window will close when class begins on Monday, November 25, 2019. You may submit as many times as you wish, but only the last submission that you make before class begins on Monday, November 25, 2019 will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf (following the same guidelines used for CPE480). Your implementor's notes for this project must include your observations about the execution time as the parameters are changed; your observations can be stated in paragraph form, as a table, or as a graph of an appropriate type. For this particular project, place everything in the sharky directory and submit a tarball of that directory. The CUDA source file in that directory should be named sharky.cu. Be sure to discuss in the Implementor's Notes not only how you restructured this program to run on a GPU using CUDA, but also the performance you measured for your code.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line: if the output tar file itself is included in the expansion, tar may try to archive the tarball into itself, producing an ever-growing file (which is not OK). Running tar from the directory above sharky, e.g., tar zcvf sharky.tgz sharky, avoids this.


Sections: EE599-002 (undergrad), EE699-002 (grad)


EE599/699 GPU & Multi-Core Computing