Assignment 3: Quicker Logick

In Assignment 2, you defined the instruction encoding, built an assembler, wrote Verilog code for a multi-cycle implementation of the Logick instruction set architecture, created a test coverage plan, and tested your design. The good news is that most of that stuff can be reused for this project. The bad news is that you still have a pile of Verilog to write... because you'll be implementing a nice, fast, pipelined version.

A Single-Cycle Starting Point

Remember EE380? Not really? That's ok... just play along anyway. Back in EE380, we followed a rather neat plan in the textbook that basically recommended that a pipelined design could best be created by initially designing a slow single-cycle implementation. The function units, data paths, and control signals defined for the single-cycle implementation could then be used (with only minor modifications) in the pipelined version. It was mostly just a matter of carving the single-cycle design into appropriate pipeline stages. Well, now is the time we see if that approach really works....

You really need to design your own single-cycle implementation from scratch for this project. Why? Because there are too many options. For example, there are a number of very different, yet comparably-good, ways that you can handle condition codes, immediate values, etc. Fortunately, it's a pretty simple instruction set in any case.

So, I don't really claim you should take the following diagram too literally.... However, here's a rough diagram showing one way to start thinking about a single-cycle implementation that you can then pipeline:

We have been discussing most of the issues in class. I'm not going to go through everything again here, but some key ideas were:

Although I've shown datapaths, I've deliberately left some things ambiguous, such as which of S, T, or D fields goes to the Data Memory Addr input. These are things you need to decide for yourselves, and will depend on things like how you decide to encode your instructions. Remember that you are free to again redefine the instruction encoding and change the assembler accordingly, so pick field assignments to make your hardware design more straightforward.
Unlike MIPS, there really isn't any instruction that requires the ALU to be used before accessing memory. This makes load (lo) and store (st) pretty simple.
The system call instruction, sy, should just halt. There really isn't any function unit asociated with that because it's just a matter of notifying the control logic that you're halted.
When you think about it, the handling of li and si is a little interesting. Here, I've just drawn a path taking the immediate value to the ALU... but you'll need to think about exactly how that would work.
Unlike MIPS -- but like most processors -- you have condition code registers for comparison results. This greatly complicates things in that the condition code registers can have dependences just like a register in the register file can. In the single-cycle design, I really don't have to show much for this, but don't forget to test for these dependences in the pipelined version.
There are three separate places the next PC value can come from: PC+1, PC+1 plus the offset from a br instruction, or the register specified in a jr instruction. It's not that confusing in a single-cycle design, but in pipelined form, be careful about which you should be using at any point. It is possible that both br and jr instructions could be in the pipe at the same time....

Well, that wasn't so bad now, was it? Well, of course not... because that's not what you have to do. You have to make a pipelined design.

Setting The Stage(s)

One of the first steps in making a pipelined implementation is figuring-out how many stages there should be and what belongs in each.

It is fairly obvious that the memories (including the register file) will take a little while to access, and we all know ALUs are notoriously slow. Thus, we'd expect a stage for each of those things along any circuit path. The most natural result here is a pipeline with 3 or 4 stages (you really shouldn't be ending-up with fewer than 3 stages).

As we have been discussing in class, one of the most useful concepts in creating hardware (or parallel software) is owner computes: the idea that each register/memory should be written into by only one entity, its owner, and that entity should also compute the value that will be written. Thus, a pipeline doesn't really look at all sequential. Instead, a pipeline is a set of independent, parallel-executing, entities that communicate by the owner of each register updating the register value which is read by one or more other entities. For example, the buffer at the end of the instruction fetch stage (let's call this stage 0) will certainly include a register that holds the destination register number ($d), and this "d0" register is owned and written by stage 0. Because register 0 is read only, let's imagine that you set d0 to 0 if the instruction doesn't write a value -- for example, a st instruction doesn't write into a register (just into data memory). It sounds like the value written into d0 depends only on the instruction fetched, but in fact it might depend on what's happening in other stages because an instruction that was speculatively fetched after a br might turn into a null operation, meaning it doesn't write anywhere. Owner computes says stage 0 should be the thing that determines what to write, which will require it to also examine register owned by other stages. Similarly, stage 1 obviously will be reading the value in d0 -- if for no other reason than to copy it into register d1 (which stage 1 owns).

Keep in mind that computing complex formulas also can be isolated into little always, or even combinatorial assign, blocks that exist for the sole purpose of owning that computation's result. For example, you might find it easier to have a separate block that owns and computes the interlock condition that would prevent the instruction fetch and register read stages (stages 0 and 1) from advancing when there is a dependence on an instruction further in the pipeline. For example, in the (very different from your project) sample solution for the SIK stack machine, you'll not only see an always clause per pipe stage, but also assign blocks that own things like the halt signal (which is derived from two separate halt signals because SIK actually uses a superscalar pipeline that isn't truly halted until both halves of the pipe are halted). Even simpler examples of non-stages that might appropriately become separate blocks include the ALU and the main memory.

Not too bad, right? Of course not... but your task is a little harder. The ALU instructions and load and store are all easy, and sy is still just halt, but that leaves some instructions with issues. How can you deal with them?

The br and jr instructions depend on stuff that you probably will not know until it's too late to fetch the next instruction. Both depend on condition codes, which might be set long in advance or might be set by a co instruction that is right ahead of them in the pipe. The br instruction needs to add an offset to the PC+1 value, and that's not really reasonable to do in the same clock cycle (stage) in which you fetch the instruction. Similarly, jr needs to read from a register, and that really can't happen in the same cycle as instruction fetch. Thus, you need to either:
- Pretend to fetch nops until you know the target address and whether it is taken or not
- Speculatively fetch instructions after the br or jr based on the assumption that the jump will not be taken. If it happens that the jump isn't taken, the instructions you started speculatively are correct. If the jump isn't taken, then you'll need to change the PC accordingly and squash any instructions you speculatively began executing.
Remember those wacky little scheduling issues having to do with sequences like ad $u0,$u1,$u2 followed by or $u3,$u4,$u0? There must be time for the pipeline to store the new value of $u0 before the or moves past the register read stage in the pipleline. You may handle this using forwarding, but you don't have to use that approach -- feel free to just implement a simple hardware interlock that prevents the or from advancing to the next pipe stage when there is a conflict. You do have to implement some type of hardware solution for this interlock/value forwarding problem -- you can't just say "don't write code like that" because your assembler doesn't insert the nops that would be needed to make the code work correctly without hardware help.
You might be surprised to realize this, but the condition code register(s) have the exact same issue that the register file has. Thus, you might find it appropriate to do forwarding of the condition code. It will probably only save you a clock cycle or so, but either hardware interlock or forwarding will be necessary. Of course, only co (and cl -- which you are not implementing in this project) store into the condition codes and only br and jr read them. An interesting side issue is that the t and f conditions never really depend on a compare, so you could treat them as not having the usual condition-code dependence at all.

Let's be completely clear about what I expect: your submission should be a viable three-or-more-stage pipelined Verilog implementation of Logick's instruction set. The significant design decisions made should also be discussed in your Implementor's Notes.

Stuff You Can Reuse

Assignment 2 was scary, but that was because you had never done something like this before -- now you have. For this project, you are allowed to reuse any pieces from the Assignment 2 solutions that you, or any of your Assignment 3 teammates. helped create. You also may use any materials I give you here; for example, here are the guts of my sample solution for the Verilog code of the multi-cycle Logick project. Perhaps most importantly, you also are free to not use any of those things; in other words, you can combine any of those materials and make changes as your Assignment 3 team sees fit. For example, if you don't like the way instruction fields are encoded, feel free to re-write the AIK assembler (but make sure your Implementor's Notes documents how instructions are encoded and why).

In general, you are not allowed to use anything from another Assignment 3 team nor from an Assignment 2 team that none of your Assignment 3 team members were on. You can use things done by any of your Assignment 3 team members, including things their teams did on Assignment 2, and things provided as part of this assignment.

As discussed in class, Verilog code that specifies memory accesses somewhat carelessly is very likely to result in a bigger circuit than if we carefully factored things into modules and created single instances of those modules. For example, a Verilog compiler might fail to map Data Memory into a dedicated memory block within an FPGA, instead constructing a memory using thousands of logic cells. Using an instance of a memory module designed to comply with the FPGA-maker's guidelines (e.g., this dual-port RAM with a Single Clock from ALTERA) ensures that the vendor's Verilog toolchain will correctly infer use of the intended hardware modules inside the FPGA. Of course, in this class we are not rendering designs into physical circuits, so these issues of complexity (and timing analysis) are neither obvious nor critical... but you should always be aware of the potential hardware complexity you risk introducing by using a specification style that doesn't explicitly factor-out the desired modules.

Testing

Again, the test coverage plan and testbench from Assignment 2 are probably very close to what you want. However, you do need to seriously think about coverage again. Why? You are not testing the same Verilog code, so there may be some paths that didn't exist before -- and they might not be covered with a testbench that covered your old version.

Due Dates

The recommended due date is Friday, November 3. Final submissions will be accepted up to when class begins on Wednesday, November 8.

Submission Procedure

Each team will submit a project tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementors notes" document as a PDF named notes.pdf. (Fairly obviously, the Implementor's Notes should also say who the implementors are -- list all team members as authors.) It also may include test cases, sample output, a make file, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). For this particular project, name the Verilog source file pipe.v.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Use the submission form below to submit your project as a single submission for your team -- you do not submit as individuals. The last submission before the final deadline is the one that will be graded.

Advanced Computer Architecture.