Assignment 1: Basic Block Parallelizer

Well, maybe I should call this project "Kcolb Cisab" -- because it's mostly about backwards stuff? Nah. Too silly.

Starting with your basic block optimizer (or the one from my sample solution TGZ file), you're now going to add dead operation elimination and code scheduling.

Most of that is pretty straightforward. The part that isn't involves dealing with stores....

Dead Operation Elimination

Dead operation elimination is a classical backward-flow problem -- however, it is slightly easier to solve using a forward pass and a backward pass. The forward pass incrementally tracks how many times each tuple has been referenced; the backward pass then simply deletes the unreferenced tuples. Notice that the forward portion of the dead code elimination procedure is easily embedded within the incremental (forward) optimization code you have already written.

The part of the backward problem that can be done incrementally, scanning forward, is:

/* set initial reference counts (forward) */
for (t is each tuple generated in forward order) {
  switch (opcode of t) {
  case ST, STX, KILL:
    if (there is an available store with a matching
        subscript value and no ambiguously aliased
        intervening loads or stores) {
      set reference count of that store to 0;
    }
    set reference count of t to 1;
    break;
  case LAB:
    set reference count of t to 1;
    break;
  case SEL:
    if (select is conditional) {
      increment reference count of tested tuple;
    }
    set reference count of t to 1;
    break;
  default:
    increment reference counts of all tuples
     used as arguments to tuple t;
    set reference count of t to 0;
  }
}

That doesn't sound too bad, but that first if statement is a little scary. For example, consider a sequence like:

a[i]=1;
b=a[j];
a[i]=2;

Is the store a[i]=1 dead? Hint: no! If i == j at run time, the load a[j] reads the value written by a[i]=1, so the ambiguously aliased intervening load must keep the earlier store alive.
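To make that concrete, here is a minimal Python sketch of the forward pass. It assumes a hypothetical tuple representation, (opcode, list of operand tuple indices, extra field for a subscript name or constant); your actual IR will differ. It also makes two simplifying choices: a store's operands are counted too (so a live store's value tuple cannot be deleted out from under it), and the alias check is maximally conservative (any intervening load at all blocks the store-killing).

```python
def forward_count(tuples):
    """Set initial reference counts, scanning the tuples in forward order."""
    ref = [0] * len(tuples)
    last_store = {}  # subscript name -> index of most recent store to it

    for i, (op, args, extra) in enumerate(tuples):
        if op in ("ST", "STX", "KILL"):
            prev = last_store.get(extra)
            # Conservative stand-in for the alias check: kill the earlier
            # store only if no load of any kind intervenes.
            if prev is not None and not any(
                    t[0] in ("LD", "LDX") for t in tuples[prev + 1:i]):
                ref[prev] = 0
            last_store[extra] = i
            for a in args:      # keep the stored value alive
                ref[a] += 1
            ref[i] = 1
        elif op == "LAB":
            ref[i] = 1
        elif op == "SEL":
            for a in args:      # a conditional select references its test
                ref[a] += 1
            ref[i] = 1
        else:
            for a in args:      # ordinary op: its arguments are referenced
                ref[a] += 1
            ref[i] = 0
    return ref

# a[i]=1; b=a[j]; a[i]=2 as tuples: the intervening load of a[j] may
# alias a[i], so the first store keeps a nonzero count (it is not dead).
prog = [("CONST", [], 1), ("ST", [0], "a[i]"),
        ("LD", [], "a[j]"), ("ST", [2], "b"),
        ("CONST", [], 2), ("ST", [4], "a[i]")]
print(forward_count(prog))  # [1, 1, 1, 1, 1, 1]
```

Remove the intervening load and the first store's count drops to 0, ready for the backward pass to delete it.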

The full backward pass is:

/* remove dead code and propagate death (backward) */
for (b = each basic block) {
  for (t is each tuple within b, scanning backward) {
    if (t has a reference count of 0) {
      decrement reference counts of all tuples
       referenced as arguments to tuple t;
      delete tuple t;
    }
  }
}
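A minimal Python sketch of this pass, assuming hypothetical (opcode, operand-index list, extra) tuple triples, with the whole list treated as one basic block and the counts coming from the forward pass:

```python
def backward_eliminate(tuples, ref):
    """Delete unreferenced tuples, scanning one block backward."""
    ref = list(ref)                  # don't clobber the caller's counts
    keep = [True] * len(tuples)
    for i in range(len(tuples) - 1, -1, -1):
        if ref[i] == 0:
            keep[i] = False
            for a in tuples[i][1]:   # propagate death to the arguments
                ref[a] -= 1
    return [t for k, t in zip(keep, tuples) if k]

# a[i]=1; a[i]=2 with no intervening load: the forward pass left the
# first store's count at 0, so it and its constant both die.
prog = [("CONST", [], 1), ("ST", [0], "a[i]"),
        ("CONST", [], 2), ("ST", [2], "a[i]")]
print(backward_eliminate(prog, [1, 0, 1, 1]))
```

Note that surviving operand indices still refer to positions in the original list; renumbering them after deletion is left out of the sketch.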

Note that you really don't have to keep tuples around for more than one basic block at a time. All the end-of-block processing can happen as soon as the end of a block is detected.

Scheduling

Although not exactly a backward flow problem (depending on the scheduling technique you employ), code scheduling must be applied after dead code removal in order to get the best results.

Code scheduling can be very simple or very complex -- here, the intent is to make it very simple. Of course, "very simple" can also be read as "generating suboptimal code." A small fraction of your grade on the project will correspond to how well your code scheduling technique performs: generating better code will reap a slightly higher grade.

If you're an undergrad, you have two choices for your target machine: either a VLIW or a pipelined machine. In either case, we will assume that certain operations are not involved in parallel execution, i.e., tuples with opcodes LAB and SEL are output as sequential code. KILL tuples are to be ignored (deleted). Further, your scheduling will operate on each block independently. If you're a grad student, you'll kind-of do both....

Oddly enough, both VLIW and pipelined machines have fixed parallelism, hence, you need to have a "null-operation" opcode which can be used to fill slots not usable by the program code. Since the intermediate code does not have a special operation to serve this purpose, you should simply use const(0). Hence, your final parallel code will generally contain a number of useless (i.e., unreferenced) const(0) operations.

Undergrad Choice 1: VLIW Scheduling

Scheduling for a VLIW is simply a matter of packing at most N opcodes into each instruction for parallel execution. The rules for this packing are very simple:

  1. There are always N operations packed into each instruction. For this project, N should be four (4). However, you may make N be input on the command line -- provided the default value is 4.
  2. No instruction executes before another instruction it directly or indirectly depends on.
  3. The sequence of values being loaded from/stored into each individual memory cell must be unaltered. Note that this is essentially the same concern expressed in the dead code removal algorithm given above.

Further note that there is no implicit order within a single VLIW instruction (e.g., the result of placing ld(a) and st(a,x) within the same VLIW instruction is undefined, hence, such placement is illegal). Of course, stores that could store into the same object cannot be scheduled in the same VLIW instruction.

How do you pick a VLIW schedule? That's up to you.

When you output the code, the tuples within each VLIW instruction should be listed on the same line separated by semicolons. In addition to output of the code, your program should output the total number of useful tuples generated and the total number of VLIW instructions. Of course, it is expected that there typically will be fewer VLIW instructions than there are useful tuples.
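One workable (though far from optimal) choice is greedy top-down list scheduling: place each tuple in the earliest instruction strictly after all of its dependencies that still has a free slot. A Python sketch, again assuming hypothetical (opcode, operand-index list, extra) tuple triples; for brevity it ignores the memory-ordering rule (rule 3), which your real scheduler must also enforce:

```python
def vliw_schedule(tuples, n=4):
    """Pack tuple indices into VLIW instructions of exactly n slots each."""
    instr_of = {}    # tuple index -> instruction it was placed in
    instrs = []      # each instruction is a list of tuple indices
    for i, (op, args, _extra) in enumerate(tuples):
        # Rule 2: strictly after every instruction we depend on.
        earliest = 1 + max((instr_of[a] for a in args), default=-1)
        j = earliest
        while j < len(instrs) and len(instrs[j]) >= n:
            j += 1                   # rule 1: at most n ops per instruction
        if j == len(instrs):
            instrs.append([])
        instr_of[i] = j
        instrs[j].append(i)
    for ins in instrs:               # pad unusable slots with null ops
        while len(ins) < n:
            ins.append("const(0)")
    return instrs
```

Because dependent tuples always land in strictly later instructions, the no-implicit-order rule within an instruction is respected automatically for true dependences.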

Undergrad Choice 2: Pipeline Scheduling

Scheduling for a pipelined machine depends on knowing the pipeline structure. For this project, that structure is given by the following table:

Opcode  Enqueue Time  Latency  Pipeline
ADD     1             2        ADDER
SUB     1             2        ADDER
AND     1             1        -
OR      1             1        -
XOR     1             1        -
GT      1             2        ADDER
GE      1             2        ADDER
EQ      1             1        -
SSL     1             1        -
SSR     1             1        -
CONST   1             1        -
LD      2             4        FETCH
LDX     2             4        FETCH
ST      1             1        -
STX     1             1        -
LAB     -             -        -
SEL     1             1        -
KILL    1             1        -
Hence, there are two separate (independent) pipelines within the processor. The scheduling rules are simply:

  1. A second instruction cannot be scheduled to use a pipeline unless the last instruction scheduled to use that pipeline was scheduled at least the enqueue time earlier. For example, the following code is incorrect because both loads use the FETCH pipeline, whose enqueue time is 2, yet they are scheduled in consecutive cycles:
    ld(a)
    ldx(b,x)

  2. No instruction executes before another instruction it directly or indirectly depends on has completed.
  3. The sequence of values being loaded from/stored into each individual memory cell must be unaltered. Note that this is essentially the same concern expressed in the dead code removal algorithm given above. Make no assumptions about how much of an instruction has completed except that the entire instruction has completed once the latency time has elapsed.
  4. Some attempt must be made to overlap useful instructions rather than insert null operations, but, as with the VLIW scheduling, the choice of algorithm is up to you. In any case, the code you generate must abide by the rules for scheduling the two pipelines described above.
  5. In addition to output of the code, your program should output the total number of operations which would have been needed if no overlap with useful instructions was possible, the total number of tuples generated, and the total number of null operations which were inserted.
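A minimal in-order, single-issue sketch of these rules in Python: each tuple is issued at the earliest cycle satisfying rules 1 and 2, and every skipped cycle becomes a const(0) null operation. Tuples are hypothetical (opcode, operand-index list, extra) triples and the opcode table is abbreviated to the ops used here; because issue order is preserved, the memory rule (rule 3) holds trivially. A real scheduler should also reorder tuples to overlap useful work (rule 4), which this sketch does not attempt.

```python
def pipe_schedule(tuples):
    """Assign each tuple an issue cycle; return (cycles, null ops inserted)."""
    # opcode -> (enqueue time, latency, pipeline or None); a subset of the
    # table above, enough for the example.
    info = {"ADD": (1, 2, "ADDER"), "SUB": (1, 2, "ADDER"),
            "LD": (2, 4, "FETCH"), "LDX": (2, 4, "FETCH"),
            "CONST": (1, 1, None), "ST": (1, 1, None), "STX": (1, 1, None)}
    issue = []   # issue cycle chosen for each tuple
    last = {}    # pipeline -> (issue cycle, enqueue time) of its last user
    cycle = 0    # next free cycle in this single-issue model
    for op, args, _extra in tuples:
        enq, _lat, pipe = info[op]
        ready = cycle
        for a in args:                  # rule 2: operands must have completed
            ready = max(ready, issue[a] + info[tuples[a][0]][1])
        if pipe is not None and pipe in last:
            prev_cycle, prev_enq = last[pipe]
            ready = max(ready, prev_cycle + prev_enq)   # rule 1: enqueue gap
        issue.append(ready)
        if pipe is not None:
            last[pipe] = (ready, enq)
        cycle = ready + 1
    nops = (issue[-1] + 1 - len(tuples)) if issue else 0
    return issue, nops

# Two loads feeding an add feeding a store: the second load waits for the
# FETCH enqueue time, and the add waits out the 4-cycle load latency.
prog = [("LD", [], "a"), ("LD", [], "b"),
        ("ADD", [0, 1], None), ("ST", [2], "c")]
print(pipe_schedule(prog))  # ([0, 2, 6, 8], 5)
```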

Graduate No Choice: Pipelined VLIW

You are to schedule for a VLIW that has three pipelined units like those described above: ADDER0, ADDER1, and FETCH.

Due Dates

The recommended due date for this assignment is before class, March 20, 2018. The actual submission will not close until before class Thursday, March 22, 2018. You may submit as many times as you wish, but only the last submission that you make before the final deadline will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, you will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It also may include test cases, sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Make it very clear in your implementor's notes exactly what each file is and, if you're an undergrad, which project you did: VLIW or pipeline.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

EE599/EE699 Optimizing Compilers