Assignment 3: A SIK Pipeline

In this project, your team is going to build a pipelined implementation of SIK, the little instruction set design you built multi-cycle implementation of in Assignment 2.

Pipelined Stack Code?

SIK is a stack machine. That is not the kind of model you expect to be pipelining, because there are very frequenlty dependences between adjacent stack instructions thanks to everything operating on the top of the stack. In fact, the changes to the stack pointer itself are effectively serializing dependences! So, how are we going to get any parallelism out of this?

The dependences on stack pointer updates become irrelevant to the pipeline if they happen immediately after fetching an instruction. Why? Well, stack instruction sets generally don't change the stack pointer by unpredictable amounts late in the pipeline. For example, Add always bumps the stack pointer by +1 no matter what the values being added are. The only suspect instruction in SIK is Pop, but the constant that the stack pointer is being adjusted by is built-into the instruction, so even it is available immediately after instruction fetch. Thus, the stack pointer isn't accessed from all stages, but rather the instructions that need it can effectively carry a copy of the value they need with them.
There are 256 registers used to implement the stack. Thus, the first step is to stop thinking of the stack as a stack, but rather as a bunch of registers. This is not difficult. As soon as a stack instruction is fetched, you can (and probably should) convert it into a register-to-register operation. However, this probably should take a clock cycle (pipe stage) because the register numbers are generally derived as offsets from the register number the stack pointer contains... so there is an adder.
Dependences on registers are dependences on registers. Your stack code transformed to work on registers still needs to have all the usual checks for dependences on register use between stages. You will still need to implement either interlock or value forwarding.
Because stack instructions tend to work on the top of the stack values, it is common that you'll get pipeline bubbles. Lots of pipeline bubbles. However, you're going to get around that the same way Intel got around instruction scheduling in their processors. Rather than resorting to out-of-order execution, your implementation will simply be multithreaded. Multithreading means that there is more than one virtual processor in your processor, and the hardware itself directly implements timesharing at the single-clock-cycle level. That probably sounds nasty, but it's really easy to do Intel-style two-way hyperthreading:
- You'll need two copies of all the registers: pc, stack pointer, ... the entire register file. Yup; you'll have 512 registers instead of 256. That's basically a nine-bit register number with the top bit saying which thread the register is associated with. Think of everything as a 2-element array of all the registers you would have in a single-thread processor. The cool thing is that dependences can only happen between references in the same thread, so alternating between threads each clock cycle ensures there will never be a dependence between adjacent pipe stages.
- You will be running two programs at once. This means not all stage contents are affected by things like a delay in a conditional branch (JumpF or JumpT) -- only the thread that executes such an instruction is affected by how that gets resolved. For example, if you speculatively execute an instruction and find that you guessed wrong and that instruction shouldn't be executed, you would convert it into a NOP; however, the instructions in other stages that are from a different thread shouldn't become NOPs. This also brings up the point that if you stall one thread, that doesn't necessarily mean the other thread must stall. You could even let the one active thread schedule instructions in the stalled threads cycles... or not (Intel does not; if one thread is stalled, the other thread runs only every other clock cycle, not every clock cycle). When do you really halt? Well, you halt when both threads have halted.
Start with a single-cycle design, but not the one from EE380.

That's really all that's strange. Just keep in mind that a register-based stack machine is really a general register machine with some bookkeeping figuring out which registers we are using rather than just having constant register numbers in the instruction. As you'll see in the pipelined project, it even can be useful to think of there being a separate state in which the bookkeeping determines which registers will be used....

Single-Cycle Starting Point

Remember EE380? Not really? That's ok... just play along anyway. Back in EE380, we followed a rather neat plan in the textbook that basically recommended that a pipelined design could best be created by initially designing a slow single-cycle implementation. The function units, data paths, and control signals defined for the single-cycle implementation could then be used (with only minor modifications) in the pipelined version. It was mostly just a matter of carving the single-cycle design into appropriate pipeline stages. Well, now is the time we see if that approach really works....

I could ask you to design your own single-cycle implementation from scratch, but I'll save us all some time and give you an overview diagram for such a thing right here. You are not required to use the following, but let's start thinking with:

Of course, that leaves out a lot of details... and even then you're not yet ready to start writing Verilog code: design first, code second.

Setting The Stage(s)

One of the first steps in making a pipelined implementation is figuring-out how many stages there should be and what belongs in each.

It is fairly obvious that the memories (including the register file) will take a little while to access, and we all know ALUs are notoriously slow. Thus, we'd expect a stage for each of those things along any circuit path. However, there is also that funny little issue of decoding the register references from each stack instruction, and that's probably worthy of its own stage. That probably gives you at least the four stages shown above in yellow.

So, what else should you be thinking about?

There's a little concept I like to call Owner Computes. Basically, the idea is that multiple potentially simultaneous writers to an object (register) is a problem, so let's have a single entity responsible for updating each writable object. We'll call that thing the owner. A thing here would either be a module or at least an always block (and you probably at least want an always block for each pipe stage). For example, the PC shouldn't be stored into by multiple stages even though you might be in very different stages when you determine to update it by adding 1 vs. jumping to a target address. Pick one stage to store new values into the PC and let it look at (read) things set by other stages to determine which value should be stored into the PC. The same is true of the register file; it should be owned by what was called stage 4 in the above diagram.
At various times in EE380 and EE480, we've mentioned that most modern processors don't literally execute the instruction set seen by the programmer, but translate it into something more convenient to execute. Well, that's what stage 2 in the above diagram is doing. In effect, you're defining an internal instruction set... and it's probably good for you to be aware that you are doing that, and to document any significant aspects of the internal instruction set that aren't obvious from the SIK documents you've been given in the assignments.
You're probably thinking that the multithreading is going to cause problems. Well, not really. However, your fetch logic needs to flip back and forth between the threads with each clock cycle. You might be tempted to design the processor without mutithreading and then add it, but don't. Multithreading makes this design easier. Think about it. Enough said. (Well, almost enough -- also be careful that you don't update the wrong PC, etc.)
One of the little details in multithreaded pipelining is how you decide which thread gets to go next. Like Intel's hyperthreading, you only have two threads -- so it's obviously alternating when both have work to do. As I've stated in class, the folks at Intel seemed to think it was fine to literally do odd/even clock cycle scheduling: thread 0 fetches an instruction in even clock cycles, and thread 1 in odd cycles. However, various Intel processors never stray from that even/odd pattern even if one thread isn't ready to fetch during its turn; it will not fetch in that cycle rather than run the thread that was ready. It would be desirable to let the thread that isn't blocked use every cycle, but they don't do that. I'll leave it up to you to decide how you pick which thread gets to fetch an instruction during each clock cycle, but I'll repeat my little in-class comment that Intel has some pretty smart engineers....
Ignoring multithreading, there can be register dependences at the register read stage: you can either implement value forwarding or interlock to resolve dependences, but fundamentally it is most desirable to avoid blocking the entire pipe because of a dependence in just one of the two threads. The catch is that this seems difficult to implement, because it would require allowing two instructions, one from each thread, to simultaneously be in the register read cycle. Thus, for this project, it is permissible that a register dependence blocks the entire pipe at the register read stage, not just the thread it occurs within. However, the most interesting observation is that, if you use a 4-stage design like that suggested in the diagram, multithreading can make the entire register dependence problem disappear. Why? With a strictly odd/even thread scheduler, by the time an instruction from one thread is in stage 3, the instruction before it in that thread has already completed execution... so there is nothing left to wait for! Don't forget that the registers used by each thread are inherently independent of those used by the other thread.
As soon as you see that there are conditional jump instructions, JumpF and JumpT, you should immediately be thinking about how to handle fetching the next instruction. Thankfully, they're jumps and not branches, so you don't need to run through an adder to compute the target address. However, to know if the jump is taken or not, you need to know how torf is set. The setting of the torf bit can happen arbitrarily far ahead of time -- if the compiler or assembler code it that way. However, it is very likely that the instruction that sets torf is a Test instruction right in front of the JumpF or JumpT. My diagram above makes it look like the address for the jump target and the torf flag come from stage 2, but I'll tell you right now that they don't; I want you to think about where they come from. Given that torf might not be set until later in the pipe, you have the classic choice of how to handle this: stall until torf is ready or speculatively fetch instructions and squash them if it turns out that your taken/not-taken guess was wrong. However, again remember that you are implementing a multithreaded processor: if you decide to stall, that stall should only block instruction fetch for that thread. The next clock cycle, the other thread can still fetch. Thus, blocking really would need to be implemented by pretending to fetch a NOP. However, think about what you're doing here -- if you're careful about your pipeline design, multithreading can make this problem a lot easier to solve.

In lectures you got a fairly detailed overview of how to go about designing hardware for a complete computer system. The bottom line is that you should start by defining the set of function units, data paths, and control signals you will need. Define the interfaces and signals. Then build the modules themselves. Note also that for this project, you are allowed to use things like the Verilog + operator to build an adder: you need synthesizable Verilog, but you don't have to specify things at any particular level. You are also free to use earlier project work from any of your team members, or the sample solutions for previous projects, as resources in creating this project; if you reuse significant portions, be sure to state that in your implementor's notes. My sample solution for the previous project, as was presented in class, is the following AIK specification and Verilog code.

Let's be completely clear about what I expect: your submission should be a viable four-or-more-stage, two-threaded, pipelined Verilog implementation of the SIK instruction set. For this project, I expect you to treat the instruction memory and data memory as able to sustain one access each per clock cycle; it is up to you if you treat them as a unified memory or as Harvard-style separate instruction and data memories. Of course, all significant design decisions made should be discussed in your Implementor's Notes.

Test Plan

Yes, you still need one. Your project needs to include a test plan (best described in your Implementor's Notes) as well as a testbench implementing the planned test procedure.

I strongly recommend writing a test program that tells you what failed by where it halts... but keep in mind that you will be running two threads interleaved. Thus, you'll need to have some way to intialize the two PCs, and you'll need the code for both programs loaded into memory somewhere. Despite that, the Verilog portion of your testbench can be something very simple, like:

module testbench;
reg reset = 0;
reg clk = 0;
wire halted;
processor PE(halted, reset, clk);
initial begin
  $dumpfile;
  $dumpvars(0, PE);
  #10 reset = 1;
  #10 reset = 0;
  while (!halted) begin
    #10 clk = 1;
    #10 clk = 0;
  end
  $finish;
end
endmodule

This just enables trace generation, intializes everything with a reset, and then keeps toggling the clk until the processor says it has reached a halted state.

Note that my online Verilog WWW interface allows use of $readmem directives, so it is much simpler to use that mechanism to initialize memory for your test cases. Include any such files in your submission as files with names ending in .vmem (to indicate that they are Verilog memory initialization files).

Due Dates

The recommended due date for this assignment is before class, Wednesday, April 5, 2017. However, the submission window will close when class begins on Monday, April 10, 2017. You may submit as many times as you wish, but only the last submission that you make before class begins on Monday, April 10, 2017 will be counted toward your course grade.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

You should submit a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, the tarball should include the Verilog and AIK source code for the project and a semi-formal "implementors notes" document as a PDF named notes.pdf. It also may include test cases (e.g., source SIK code and .VMEM files), sample output, a make file, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't sik.v or the AIK file isn't sik.aik, you should be saying where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Advanced Computer Architecture.