Assignment 2: Kentucky's Line Extrusion Orderer

No doubt, you've heard of the traveling salesperson (salesman?) problem. Well, this is a version of the problem in a world filled with wormholes. Ok. That sounds strange enough... but it really isn't that strange a problem. It's really a rather obvious problem involving fused-deposition (extrusion-based) 3D printers.

Most inexpensive 3D printers build things by extruding lines of melted pastic in thin layers. Thus, every 3D model is computationally "sliced" for manufacture into layers, each of which is then translated into a set of lines within that layer. The catch is that what you get is a set of lines for each layer -- the sequence in which to build them is normally taken from the slice in an outline-first-then-fill-in pattern, which is often quite good, but might not be optimal. Given a set of lines to extrude, your program will simply order them such that the total time the printhead spends traveling -- including gaps between lines -- is minimal. The "wormholes" come from the fact that once you've decided to go to one end of a line, you come out of it at the other end. There is also the minor complication that the printhead speed moving between lines can be faster than the speed when depositing a line, although we'll ignore that issue in this project.

How hard is this problem? Well, ignoring the choice of which of the two endpoints on each line to go to, this is quite obviously a factorial problem. 5! is 120. 10! is 3,628,800, 15! is 1,307,674,368,000, and 20! is 2,432,902,008,176,640,000. Of course, for a particulare sequence of N lines there are 2**N possible flips of line segments, so 20 lines really gives (2**20)*20!, or a mere 2,551,082,656,125,828,464,640,000 possible orders. That's big enough to justify using a Genetic Algorithm (GA)... especially when your program is supposed to allow a set of up to at least 1,024 lines as input.

Your Project

The serial GA code in C is kleo.c. It's a pretty straightforward GA. The only differences are that the cross() and mutate() are a little more complex in order to give somewhat better performance. There are actually several flavors of mutation and also a little trick involving seeding the initial population. The #ifdef VERBOSE stuff is just for the serial version; to get lots of tracing output, compile it with:

cc kleo.c -o kleo -lm -DVERBOSE

The command line arguments are simple enough. The first is the population size, with a maximum of 8192. The second and third arguments determine the relative frequency of crossover vs. mutation. The fourth argument is the number of individuals (steady-state "generations") to execute. Finally, the last two arguments are the X,Y coordinates of the printhead before printing the input set of lines.

The input comes on stdin as a simple text file. Each set of 4 floating-point numbers gives the X,Y endpoints of a line to extrude. For example:

20 20 20 10
10 20 20 20
10 20 10 10
20 10 10 10

If we specify on the command line that the printhead starts at position 0,0, this would then draw a box between 10,10 and 20,20. Assuming the start position is given as 0,0, here is one of the two optimal sequences for these lines:

10 10 10 20
10 20 20 20
20 20 20 10
20 10 10 10

Note that the 10 20 10 10 line in the input got flipped.

Functional Requirements

Your task is simply to get some speedup by re-writing this code to use MPI. Now, how do you parallelize this GA code? There are many ways, including the embarrassingly-parallel "island model" which is quite amenable to distributed-memory implementation with controllably-infrequent communications. I don't want you to do that.

So, how will you parallelize it? Using OpenMP, of course. I want you to use OpenMP to parallelize the code in main() that is commented as "The genetic algorithm loop." Basically, that will require you to figure-out some way to ensure no two processors are working on the same population member at the same time. That will involve some locking and/or a bit of cleverness about picking which population members each will work on at any given point in time (scheduling). You can use any method you wish.

A Little Help

You might be wondering how you get test data for this. Well, you could extract it from "G code" produced by a slicer, but then how would you know a good ordering? Instead, I'm giving you a little C program, linemaker.c. This program takes two command-line arguments. The first sets the random seed, but if that number is negative, it also forces the program to output lines in a perhaps-optimal order. The number of lines to generate is the last argument. For example, ./linemaker 42 10 gives:

876166 592740 128881 283241
128881 283241 951012 146758
951012 146758 999021 614940
999021 614940 800535 523743
800535 523743 815874 643143
815874 643143 599717 474504
599717 474504 233696 529762
233696 529762 389892 169448
389892 169448 580698 161259
580698 161259 313867 976352

while ./linemaker -42 10 outputs the same line segments, but in a different order -- the possibly-optimal order in which it was generated:

815874 643143 599717 474504
389892 169448 580698 161259
876166 592740 128881 283241
800535 523743 815874 643143
128881 283241 951012 146758
951012 146758 999021 614940
313867 976352 580698 161259
233696 529762 389892 169448
800535 523743 999021 614940
233696 529762 599717 474504

Note not only reordering of line segements, but also that some segments are traversed in the opposite direction. It's not a big difference in performance between the two examples above, but the differences get larger with more lines....

A Little More Random Help

Many of you have had problems with rand(), as we discussed in class. You need a good quality one on the host for the initial data, but you also need one that can give a decent random sequence that is different on each PE. Fortunately, the parallel one doesn't need to be as high quality. You can easily make your own that is very fast and good enough. You can use a very simple generator of the form:

rand = (a * rand) % m

If m is 2 to the 32nd power, the modulus is a free operation, since a 32-bit integer value is naturally clipped that way. In such a case, a decent choice of the value for a is ((8 * 16807) + 5). Of course, the bottom few bits don't look very random, so you'll probably want to ignore some of the least significant bits rather than using the rand value directly.

That generator by itself would give the same value sequence for every processing element. Instead, you can use the same seed for all processing elements, but also use the trick that for NPROC virtual processing elements, you start each processing element with their private rand variable initialized as

rand = (b * seed) % m

where b is a to the IPROC power. Each processing element then substitutes a to the NPROC power for a in the original formula. For example, b on different PEs would be 134461, 134461*134461, 134461*134461*134461, etc. Note that we have to avoid the value 0. If NPROC is 64, then the formula for a 32-bit rand in each PE would be rand = (1352027393 * rand);. It's probably better to use (rand >> 8) for your random value rather than rand itself so that the not-very-random low bits are ignored. Simple enough, right?

Some of you also found stdlib's rand_r(), which is intended to be thread safe, but depends on being passed a private pointer to an unsigned int. That works fine, but you need to make sure the private seeds get initialized to different values, which some of you struggled with. I'd recommend something like (getpid()*time(0)) as an inital seed value; these will generally produce unique values for each process each time the code is run.

Due Dates

The recommended due date for this assignment is before class,Friday, October 4, 2019. This submission window will close when class begins on Monday, October 7, 2019. You may submit as many times as you wish, but only the last submission that you make before class begins on Monday, October 7, 2019 will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementors notes" document as a PDF named notes.pdf (this is following the same guidelines used for CPE480). In this project, your implementor's notes must describe the synchronization algorithm you used. The tarball also may include test cases, sample output, a make file, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). For this particular project, name the C source file kleo.c.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Your account is
Your password is

Your section is EE599-002 (undergrad) EE699-002 (grad)


EE599/699 GPU & Multi-Core Computing