Assignment 4: nItebHa', Doch cha'

In Assignment 3, you defined the instruction encoding, built an assembler, and wrote Verilog code for a pipelined implementation of the loQ Don instruction set architecture, created a test coverage plan, and tested your design. Just as when you went from Assignment 2 to 3, you'll be able to reuse most of that. However, this time you need to make your implementation do two things together -- implement superscalar execution capable of completing more than one instruction per clock under at least some circumstances. Qapla'!

Where To Begin

The biggest problem you have in implementing superscalar execution is that you have to be able to keep the system fed. For this project, you must fetch more than one instruction per clock cycle under at least some circumstances. In particular, each instruction is 16 bits long, but I suggest that you fetch a pair of instructions at a time, with the pair always being from instruction addresses that differ only in the least significant bit. For example, if you want to fetch the instruction at 0x1234, you actually fetch the instructions at 0x1234 and 0x1235. Logically, you're fetching an instruction pair at a time as a single 32-bit "word."

How you process each pair of instructions is entirely up to you, but your design must be able to complete more than one instruction per clock cycle under some circumstances -- which is not the same as saying it can do that every clock cycle. Here are some of the key ideas and requirements to mull over:

Fetching a pair of instructions should be implemented as fetching a 32-bit "instruction parcel" from an apparent address that ignores the low bit of the PC. In other words, it fetches from both (PC & ~1) and (PC | 1) at the same time. In straight-line code, that's not very strange... but keep in mind that a jz or jnz is allowed to jump to either an even or odd address. For example, if the code jumped to 0x1235, you would still fetch from both 0x1234 and 0x1235. In other words, you will be fetching an instruction from 0x1234 that you don't use.
Implementing some type of out-of-order instruction scheduling is worth 5% of the project grade. The out-of-order scheduling can be any type of rearrangement. For example, you might simply let instructions queue waiting for function units such that if there was a sequence of load/store instructions (assuming you have just one load/store data memory), you might let a later ALU instruction execute before one of the load/store instructions. You could do even more complex things, such as implementing register renaming and using that to do more aggressive out-of-order scheduling, but the 5% only requires you to do something out of order....
If you implement out-of-order instruction scheduling, you are not required to handle speculative execution. In other words, you may implement jz and jnz by disabling further instructions being fetched. This can mean that your superscalar design would often be outperformed by a simple speculative pipeline, but many of the fancier schemes for out-of-order execution standardly assume that only one basic block is being processed at a time. Note that you are required to implement speculative execution if you did not implement some form of out-of-order scheduling, and will lose an additional 5% if you implement neither speculation nor out-of-order execution.

Internal Hints

Here are some hints about the internal structure:

How many stages should you use? Well, it's up to you but I'll suggest 3 is a good number to start thinking about. Stage 0 would be instruction fetch and decode, Stage 1 would be register read, and Stage 2 would be ALU and data memory access and register write. You may find it appropriate to add more stages. For example, if you do register renaming, a 5-stage design might be more natural -- adding a register rename stage after instruction fetch and also adding a register "retirement" stage at the end.

I strongly recommend that Stage 0 should decode what it fetched. The loQ Don instruction format doesn't always have the register names that serve similar purposes in the same field; you want to fix as early as possible by converting vertical instruction encoding to horizontal control words. Why? Because then all the interlock/forwarding logic is always looking at the same "fields" of the state passed from one stage to the next. There are also some simplifications, such as the unary neg, negv only uses 1 operand... so you'll want to make the second operand slot be something that doesn't cause an extra dependence, such as a copy of the first operand register number.

Each stage or other entity is an always @(posedge clk) if (!halt)

Take advantage of the owner computes concept. Each stage should own any "variables" it writes; never write the same register in multiple stages. E.g., s0dest is the destination written by stage 0, s1dest is the destination written by stage 1; note s0dest is used by stage 1 and s1dest by stage 2. In a 3-stage pipe, register write is done by s2 -- it writes, so it owns the register file. Note that the register file is most used by s1, which is fine. Also keep in mind that assignment using = takes effect "immediately," leaving timing of accesses in other always clauses a bit unclear; the <= assignment distinguishes before and after.

If the owner needs to know stuff from other stages to update a register, there should be one always that collects that info into a simple signal. In other words, one always should own each inter-stage communication signal. For example, you'll probably want to have a separate "squash" flag that can be tested by stages.

Register 1 is also special: it is the PC value from when the instruction was fetched. In fact, it's not even a 32-bit register, but 16-bit with 16 0's above. Thus, a read from register 1 should simply get the saved PC value passed from stage 0. In any case, memory addresses never exceed 16 bits.

If you make the ALU a separate module with all the mess contained in it, you could easily instantiate it more than once to support more parallel execution (i.e., two ALU instructions in the same clock).

For loading instructions into memory (initialization), build yourself two instruction memories. Initialize the one that is an array of 16-bit words normally, then use a loop to copy it into the 32-bit word memory that is really the instruction memory. This gets around the issues of having 32-bit parcels of 16-bit instructions.

Due Dates

The recommended due date is TBA. Final submissions will be accepted up to TBA.

Submission Procedure

Each team will submit a project tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementors notes" document as a PDF named notes.pdf. (Fairly obviously, the Implementor's Notes should also say who the implementors are -- list all team members as authors.) It also may include test cases, sample output, a make file, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). For this particular project, name the Verilog source file super.v.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Use the submission form below to submit your project as a single submission for your team -- you do not submit as individuals. The last submission before the final deadline is the one that will be graded.

Advanced Computer Architecture.