Assignment 4: Here, Little Fishy, Fishy?

There's a silly little "game" -- really a simulation -- called Conway's Game of Life. Basically, it is a cellular automaton that simulates how colonies of cells grow and die over time. The simulated world consists of a simple 2D rectangular array, each element of which can be either live (1) or dead (0). At each timestep, what happens at each point in the array depends on what is in that point and its 8 neighbors:

A live cell surrounded by 0 or 1 live cells dies of loneliness
A live cell surrounded by 2 or 3 live cells survives
A live cell surrounded by 4 or more live cells dies of overcrowding
A dead cell surrounded by exactly 3 live cells is filled by a new cell (i.e., the cells around it reproduce)
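The four rules above collapse into one small function. This is only a sketch for illustration: the name life_next and its arguments are mine, not anything in sharky.c.

```c
#include <assert.h>

/* Next state of one Game of Life cell.
 * alive: 1 if the cell is currently live, 0 if it is dead.
 * live_neighbors: number of live cells among the 8 neighbors (0..8).
 * Returns 1 if the cell is live in the next timestep, otherwise 0. */
int life_next(int alive, int live_neighbors)
{
    assert(live_neighbors >= 0 && live_neighbors <= 8);
    if (alive)
        return live_neighbors == 2 || live_neighbors == 3; /* survives */
    return live_neighbors == 3;                            /* birth */
}
```

Note that the result depends only on the old state of the cell and its neighbors, which is why every cell can be updated independently in the same timestep.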

The "game" aspect of this is the setting of the initial pattern. Then one simply watches how the pattern evolves over many timesteps. It's a boring game... so we're going to do Sharks and Fishes instead. I've given you a sequential C program that does this: sharky.c. The way the code works is:

Using the read-only copy of the sea, sum up how many sharks, adult sharks, fishes, and adult fishes are around the current position. The definition of "around" is the 8 nearest neighbors with wrap-around on the top/bottom and left/right sides.
Update the seed value for this position by computing the 8-bit value (13 * seed) + 1 and storing that into the write-only sea. The current random number is ((seed >> 3) ^ (x + y)) & 31. You must exactly duplicate how the random numbers are computed in order for your runs to match those obtained by others for the same input sea.
If the current cell holds a shark, increment its age. It dies if it is too old (greater than 20), if it starves (at least 6 sharks neighboring and no fish), or if the random number was 0 (random death from natural causes).
If the current cell holds a fish, increment its age. It dies if it is too old (greater than 10), is eaten by a shark (at least 5 neighboring sharks... apparently, not all sharks eat fish), or starves (all 8 neighbors are fish).
Any empty cell gets a baby shark if there are at least 4 sharks neighboring of which at least 3 are adults, and there are fewer than 4 fish neighboring. The empty cell instead gets a baby fish if there are at least 4 fish neighboring of which at least 3 are adults, and there are fewer than 4 sharks neighboring. Babies are counted as being aged 1 at birth... because age 0 is used to represent an empty cell of sea.
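Two pieces of the steps above are easy to sketch in isolation: the wrap-around neighbor count and the seed arithmetic. The function names and the flat age[] array are hypothetical, chosen only for this example; sharky.c remains the authority on details the text does not spell out, such as what age counts as an adult and whether the random number is taken from the seed before or after it is updated. The sketch leaves that last choice to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Count non-empty cells among the 8 neighbors of (x, y), wrapping around
 * the left/right and top/bottom edges. age[] is a hypothetical row-major
 * array of w*h ages, where age 0 means the cell is empty. */
int count_neighbors(const uint8_t *age, int w, int h, int x, int y)
{
    assert(w > 0 && h > 0 && x >= 0 && x < w && y >= 0 && y < h);
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0)
                continue;               /* skip the cell itself */
            int nx = (x + dx + w) % w;  /* wrap left/right */
            int ny = (y + dy + h) % h;  /* wrap top/bottom */
            count += (age[ny * w + nx] != 0);
        }
    }
    return count;
}

/* Seed update from the text: the 8-bit value (13 * seed) + 1. */
uint8_t next_seed(uint8_t seed)
{
    return (uint8_t)(13u * seed + 1u);
}

/* Random number from the text: ((seed >> 3) ^ (x + y)) & 31. */
unsigned random5(uint8_t seed, int x, int y)
{
    return (((unsigned)seed >> 3) ^ (unsigned)(x + y)) & 31u;
}
```

The same counting loop would be run once over shark ages and once over fish ages, with an extra comparison for the adult counts. The cast to uint8_t is what keeps the seed at 8 bits, so 13 * 20 + 1 = 261 wraps to 5.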

Unfortunately, as written, sharky only processes about 10M pixels/s, and I want you to be processing a 4K (8MP) image for about 10,000 generations (time steps). That would take over two hours running the code as it is. You should be able to speed it up a lot by running on a GPU using CUDA.

To Begin

For the previous project, you started by creating an edited copy of NVIDIA_CUDA-9.2_Samples/0_Simple/vectorAdd. That's still a great way to get started. However, the big problem is how to reorganize the data structure for running on the GPU.

To begin, you certainly do not want to be moving the data structure between the host and GPU any more than necessary. I'd expect you to plop it in global memory on the GPU and leave it there until 10,000 timesteps have completed. However, the data structures are a bit strange in the C program to facilitate graphical display -- you don't need to keep them in that format internally. Here are a few suggestions:

The sequential code uses two arrays (one being the mmap file) to implement double-buffering. Each timestep, one of the arrays is read only and the other write only. After a barrier synchronization confirms everybody is done with the reading, you can then swap the notion of which array is which. You also have a variety of choices for how to implement the barrier synchronizations. Alternatively, your parallel code could use more complex synchronizations to ensure the right data is being accessed... but I wouldn't recommend that.
A self-similar layout with "fake zones" could be used to significant advantage (even in the sequential code). Remember that GPUs don't like conditional control flow, and testing for edge cases does a lot of that. Of course, that testing is slow even in sequential code.
Try to arrange the arrays so that each PE is working on a single 32-bit word at a time. With a little cleverness, you can actually implement a SWAR-style packing of everything you need to know about a sea cell. You do have to read-in the image as 24-bit color data per pixel, and you must output your final result in that format, but nothing says that you need to keep it in anything like that format while you're working on it.
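Here is one possible reading of the first two suggestions, sketched in plain C so it can be checked on the host before porting to CUDA. It assumes each sea cell has already been packed into one 32-bit word and that the "fake zones" are a one-cell border around the real sea holding copies of the opposite edges. The names pidx, fill_fake_zones, and swap_seas are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Index into a padded (w + 2) x (h + 2) array; real cells have
 * x in 1..w and y in 1..h, and the outer ring is the fake zone. */
static int pidx(int w, int x, int y)
{
    return y * (w + 2) + x;
}

/* Fill the fake zone with copies of the opposite edges so that reading
 * the 8 neighbors of any real cell needs no wrap-around tests. */
void fill_fake_zones(uint32_t *sea, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 1; y <= h; ++y) {          /* left and right columns */
        sea[pidx(w, 0, y)]     = sea[pidx(w, w, y)];
        sea[pidx(w, w + 1, y)] = sea[pidx(w, 1, y)];
    }
    for (int x = 0; x <= w + 1; ++x) {      /* top and bottom rows, corners included */
        sea[pidx(w, x, 0)]     = sea[pidx(w, x, h)];
        sea[pidx(w, x, h + 1)] = sea[pidx(w, x, 1)];
    }
}

/* Swap which buffer is read-only and which is write-only. */
void swap_seas(uint32_t **read_sea, uint32_t **write_sea)
{
    uint32_t *tmp = *read_sea;
    *read_sea = *write_sea;
    *write_sea = tmp;
}
```

With this layout, a timestep reads only from read_sea, writes only the real cells of write_sea, refreshes the fake zones of write_sea, and then swaps the two pointers. The cost is that the fake zones must be refreshed every timestep, which is a small amount of work compared with the interior.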

In any case, it's pretty straightforward to get this working on a GPU, and you'll see huge speedups thanks to the massive parallelism available.

An Example

The example case I've prepared for you is demo.ppm. It is a 4K still image that looks like:

Keep in mind that each pixel's color channels really mean different things. The red channel is 0 where there isn't a shark, otherwise, it represents the shark's age (scaled, because 20 would still be a very dim pixel):

Similarly, although with a different scaling factor, the green channel represents the fish age:

Finally, we have the random seeds:

You don't need to use demo.ppm for debugging your code; any P6 (8-bit per color channel, binary) PPM file will do. However, I do expect your project submission to give times for running with demo.ppm as the test image. Just keep in mind that sharky overwrites the image file you give it as an argument, so don't pass it demo.ppm, but rather use a copy of it.

Incidentally, the execution time here is not very sensitive to the contents of the sea cells (nope, not even by the sea shore ;-) ). Thus, you can quote a rate by dividing the real time measured over at least 100 timesteps. For example, the demo image is 3840x2160, or a total of 8,294,400 pixels. It took 78 seconds to run 100 timesteps on my laptop, thus giving a rate of about 10.6MP/s for the processing. You should be able to beat that by a large margin using the GPU. In fact, minor sequential code optimizations can get a factor 3X or better without even using any parallelism.
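The rate arithmetic above is simple enough to write down as code. This is just a sketch of the calculation; the function names are mine, and you still have to measure the real time yourself.

```c
#include <assert.h>

/* Processing rate in megapixels per second for a width x height sea
 * run for `steps` timesteps in `seconds` of real time. */
double rate_mp_per_s(long width, long height, long steps, double seconds)
{
    assert(seconds > 0.0);
    return (double)width * (double)height * (double)steps / seconds / 1e6;
}

/* Estimated real time in seconds to run `steps` timesteps at a given rate. */
double seconds_for(long width, long height, long steps, double mp_per_s)
{
    assert(mp_per_s > 0.0);
    return (double)width * (double)height * (double)steps / (mp_per_s * 1e6);
}
```

For the demo image, rate_mp_per_s(3840, 2160, 100, 78.0) comes out to about 10.6, and at that rate 10,000 timesteps would take about 7,800 seconds, which is 130 minutes.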

Due Dates

The recommended due date for this assignment is before class, Friday, November 22, 2019. This submission window will close when class begins on Monday, November 25, 2019. You may submit as many times as you wish, but only the last submission that you make before class begins on Monday, November 25, 2019 will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf (this is following the same guidelines used for CPE480). Your implementor's notes for this project must include your observations about the execution time as the parameters are changed; your observations can be stated in paragraph form, as a table, or as a graph of an appropriate type. For this particular project, place everything in the sharky directory and submit a tarball of that directory. The CUDA source file in that directory should be named sharky.cu. Be sure to discuss not only how you restructured this program to run on a GPU using CUDA, but also the performance you measured for your code in the implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

GPU & Multi-Core Computing