Assignment 2: Multi-Cycle Tangled

In this project, your team is going to build a multi-cycle implementation of Tangled, the instruction set design you built an assembler for in Assignment 1. However, there are a few important simplifications....

The Simplifications

Tangled is not a particularly simple system. Don't worry -- the parts you'll be building are not that complex either -- but things are significantly more complex than you're likely to be able to handle in your first team project. That's why there is a sequence of three separate team projects. Each one not only has you build (well, ok, design and code using Verilog) something significant, but also guides you to become more sophisticated in your understanding of computer architecture and more experienced at designing relatively complex systems. That all means the first of these team projects is intended to be a relatively gentle first experience in processor design.

The first, rather huge, simplification is that you are not building a pipelined processor, but a simple multi-cycle implementation. However, you're also not implementing everything in this first version of the Tangled processor. So, what's missing?

Diddly Sqat of Qat

The first thing missing is Qat. Each Qat instruction should still have a unique encoding generated by your assembler, and your Verilog code should recognize and decode all Qat instructions. That includes dealing with the fetching of the second instruction word for the 32-bit instructions. However, you will not be implementing the Qat coprocessor yet. Thus, all the Qat instructions should simply be handled by executing a trap with a "missing coprocessor" error -- which simply means you will treat all the instructions in the table of Qat instructions in the Tangled instruction set reference as if they were 16-bit or 32-bit sys instructions. Obviously, there's not much to test here.
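To make the idea concrete, here is a minimal sketch of a decoder that classifies every Qat instruction as a trap while still noting whether a second word must be fetched. The encoding is purely an assumption for illustration (top four bits of the first word mark Qat, one bit marks the 32-bit forms), and the module and signal names are hypothetical; your own encoding will differ.

```verilog
// Hypothetical encoding, for illustration only; not the real Tangled encoding.
`define WORD    [15:0]
`define QATTOP  4'hf      // assumed: ir[15:12] == 4'hf marks a Qat instruction
`define LONGBIT 11        // assumed: ir[11] set means a second word follows

module qatdecode(istrap, islong, ir);
output istrap, islong;
input `WORD ir;
// Every Qat instruction is treated like sys: it traps
assign istrap = (ir[15:12] == `QATTOP);
// 32-bit forms still need their second instruction word fetched
assign islong = istrap && ir[`LONGBIT];
endmodule
```

The point of keeping islong separate is that the fetch logic for the second word can be written and tested now, even though nothing Qat-specific happens with it yet.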

This handling of Qat instructions probably seems like cheating, and it is, but it's a cheat with a long and proud history. What? Well, once upon a time, microprocessors didn't have floating-point units because they literally couldn't fit that much circuitry on the chip; that's why floating-point was often implemented in a second, coprocessor, chip. For example, the Intel 8086 had the 8087. However, the 8087 chips were not available when the 8086 was first released, and even when they were shipping in quantity, they weren't cheap. So, a lot of people started using 8086 processors without an 8087. The cool thing is, their software could still pretend they had an 8087. Trapping each 8087 instruction allowed the OS to be invoked, and it was a relatively simple matter for the OS to execute integer 8086 code to simulate the 8087 instruction that caused the trap. This trick actually goes even further back. For example, the IBM 360 line consisted of many compatible computer models -- except the cheaper ones weren't really fully compatible in hardware. They simply trapped and had the OS emulate the instructions they didn't implement in hardware. Of course, you are not going to write OS code to handle the trapping for Qat instructions... you'll simply add the Qat coprocessor in a later project.

The Magic of Floating-Point

You will fully implement the floating-point instructions, but the Verilog implementation and test values are essentially given to you here.

I'm not gonna lie to you: floating-point arithmetic isn't easy to implement, and efficient implementation is even harder. It would be harder still for you right now, because we haven't yet talked about the details of how floating-point works. Thus, I'm giving you this Verilog implementation called floaty (which I originally wrote for the TACKY processor). You should integrate that code in your project -- although you should omit the module testbench it contains. Also note it needs the VMEM 0 initialization to implement a lookup table for reciprocal. So, you're not really implementing most floating-point operations using your own Verilog code, but your assembler will still have to encode the floating-point instructions and you have to write Verilog code that will execute each floating-point instruction.

There's also the matter of how to test them, because we haven't yet really discussed how our 16-bit floats are represented. We will later, and here are some details we'll discuss then, but you don't need to understand that yet.

Here's how you should handle each of the floating-point instructions:

addf should be processed by the floaty code for module fadd(r, a, b);. The recommended test case is that adding 0x3f80 + 0x4000 should give 0x4040. That's really 1.0 + 2.0 is 3.0.
float should be processed by the floaty code for module i2f(f, i);. The recommended test case is that converting 0x0001 should produce 0x3f80. That's really (float)1 is 1.0.
int should be processed by the floaty code for module f2i(i, f);. The recommended test case is that converting 0x3f80 should produce 0x0001. That's really (int)1.0 is 1.
mulf should be processed by the floaty code for module fmul(r, a, b);. The recommended test case is that multiplying 0x4000 * 0x4000 should give 0x4080. That's really 2.0 * 2.0 is 4.0.
negf is simple enough for you to implement directly: the result is the same as the input, except bit 15 is flipped. The recommended test case is that negating 0x3f80 should produce 0xbf80. That's really -(1.0) is -1.0.
recip should be processed by the floaty code for module frecip(r, a);. The recommended test case is that the reciprocal of 0x4000 should be 0x3f00. That's really 1/2.0 is 0.5.
sltf should be processed by the floaty code for module fslt(torf, a, b); -- but there's a catch. That module produces a single-bit result, but you need a 16-bit integer result. The integer result is supposed to be 0 or 1, so the single-bit value computed is bit 0, and bits 15:1 are always 0. The recommended test case is that 0x3f80 < 0x4000 should give 0x0001. That's really 1.0 < 2.0 is 1.
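The two pieces in the list above that you write yourself are small. A minimal sketch, with hypothetical module names (negf and torf2word are not part of floaty), might look like this:

```verilog
`define WORD [15:0]

// negf: flip the sign bit (bit 15), pass the other 15 bits through
module negf(r, a);
output `WORD r;
input `WORD a;
assign r = {~a[15], a[14:0]};
endmodule

// sltf glue: widen a 1-bit true/false into a 16-bit integer 0 or 1.
// In the real design, torf would come from floaty's fslt(torf, a, b).
module torf2word(r, torf);
output `WORD r;
input torf;
assign r = {15'b0, torf};
endmodule
```

With the recommended test values, negf maps 0x3f80 to 0xbf80, and a true comparison result widens to 0x0001.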

That's all there is to the floating-point support.

Top Down

I said it in class, but let me repeat it here: you're going to be building a fairly complex collection of stuff. You'll never get it all working unless you're pretty methodical about the development process... which I'm strongly recommending should be mostly top down.

Before doing anything, look at the instruction set. Think about what kind of hardware structures you're going to need to implement each type of instruction -- and group similar instructions together. Remember those high-level processor architecture diagrams in EE380? Well, you want to think a bit about what one of those would look like for your Tangled processor. In fact, your multi-cycle design will probably look a lot like the Simple Processor Architecture from CPE380, although there will be various simplifications (e.g., you don't need an MFC line because you can assume your memory completes an access in one cycle; you also can have as many busses as you want). Remember how we built up that design in EE380 by going through the instruction set and incrementally adding whatever was needed to implement each instruction? Think about this project the same way.

Am I saying you need to draw one of those diagrams right at the start of the project? Not at all. What I'm saying is that you should always have in the back of your head roughly what the big picture is expected to look like. As you think about each instruction, think about what hardware will be involved in executing it and what types of control signals and datapaths will be needed. What things seem hard to do (the fancy title for this is identifying technological risk factors)? Make little notes to yourself. Discuss these things in your team. Make the big or confusing decisions as a team -- and document the non-obvious things in your Implementor's Notes.

Instruction Encoding And The Assembler

The Tangled/Qat ISA should be familiar by now. You need to define the mapping of the ISA into instruction bit patterns -- you know, like you just did in the previous project. The catch is that now you might want to rethink the encoding to try to simplify your Verilog implementation. You should discuss that within your team -- and each of the members of your team probably had somewhat different insights about encoding the instructions, so together you'll probably do better than any one would individually.

Group things together by how they work, and try to make the encoding reflect those groupings. As we discussed in class, a good encoding can make it a lot easier to decode instructions. I strongly recommend devising a scheme whereby each instruction type has a magic number associated with it, and that number can easily be encoded into the opcode field(s) and extracted from them for the Verilog implementation to use as state numbers. You can use whatever instruction encoding is most convenient for you to implement in this project, which might or might not be the same field arrangements or values any of your team members used in the previous assignment -- or the one that I used in my assembler sample solution. You'll notice that my sample solution understands multiple forms to simplify the synthesized pseudo-instructions, and your assembler does not need to implement them as cleverly. Of course, since the synthesized instructions are really sequences of Tangled instructions, there are no additional things for your Verilog code to implement.
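As a rough illustration of the magic-number idea, the sketch below loads the opcode field directly into the state register after fetch, so no separate opcode-to-state translation is needed. Everything here is an assumption made up for the example: the 4-bit opcode position, the three opcode values, the Start state number, and the module name control. It also shows no datapath at all, only the state sequencing.

```verilog
// Hypothetical opcode field and values, for illustration only.
`define STATE [4:0]
`define OPadd 5'h00       // assumed magic numbers, also used as state numbers
`define OPnot 5'h01
`define OPsys 5'h02
`define Start 5'h1f       // fetch state, chosen not to collide with any opcode

module control(halted, s, ir, reset, clk);
output reg halted;
output reg `STATE s;
input [15:0] ir;
input reset, clk;

always @(posedge clk or posedge reset) begin
  if (reset) begin
    halted <= 0;
    s <= `Start;
  end else case (s)
    // After fetch, the opcode field itself becomes the next state number
    `Start: s <= {1'b0, ir[15:12]};
    `OPadd: s <= `Start;    // datapath work for add would go here
    `OPnot: s <= `Start;    // datapath work for not would go here
    default: halted <= 1;   // sys and anything unimplemented: halt
  endcase
end
endmodule
```

Because the state numbers are the opcode values, adding an instruction means adding one `define and one case arm, which fits the incremental development style recommended here.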

Now you're probably getting nervous about the encoding choices. Don't be. Unlike the real world, in this class you can always change your mind if you later discover your instruction encoding was awkward. It should also be understood that many different encodings are comparably good, so don't be nervous if you hear that somebody else did things very differently... you really can both be equally right. Still nervous? Explaining any nervousness-inducing decisions you made in your Implementor's Notes should help you feel better. ;-)

The Verilog Hardware Design

I bet a lot of you are scared of this. You should be; it could be a huge mess. The trick is to never let it become a huge mess by sticking to that top down structured design discipline.

This design problem is not entirely new for you, but the design work you did in EE380 skipped a lot of implementation details that you cannot skip here. Still, think about things as you were told to in EE380. Step through what each instruction needs to do and logically build up that big picture of the implementation architecture. Think about what function units, data paths, and control signals you will need. Do this before writing Verilog definitions of any piece. In fact, write it up in your Implementor's Notes before you write Verilog code.

When you think you're nearly ready to start writing Verilog code, recall that in lecture I showed you several sample solutions for previous semesters' projects, and they're all linked into the home page for this course. Don't try to copy and edit that Verilog code; Tangled is (very deliberately) too different. However, nothing you're doing requires a solution that is more complex than in any previous semester -- although your solution will be significantly longer than most, partly because you have a fair number of instructions, but mostly because of those modules from floaty. If you think your solution needs to be significantly more complex, you're not yet ready to start writing Verilog code: design first, code second.

Structuring Your Verilog Code

As I stated in class, I strongly suggest that you think in terms of writing definitions of control signals and dummy top-level modules (with their output and input specifications). I very much like the idea of having an abstracted list of control signal definitions using `define. By consistently using things like `WORD or `ADDR instead of writing both as [15:0], the Verilog hardware description becomes just a little more abstract; you no longer have to ask yourself if something that says [15:0] is a 16-bit word, memory address, or if it is something else that just happens to also be 16 bits. The same benefit happens by using something like `OPadd instead of 4'b0110, but you also get three more benefits:

Directly deriving the control signals and state numbers from the instruction opcode can greatly simplify things. Of course, the opcode structure for Tangled is slightly complicated by the fact that there isn't space to decode all instructions looking only at a fixed opcode field, but you can create a virtual opcode from the instruction fields -- as I did in the IDIOT example, where I made a virtual 5-bit opcode value to distinguish OPjz, OPsys, and OPsz. These virtual opcodes can then also be used as state numbers, etc., saving at least one translation.
If you decide to make the ALU a separate module (which I didn't in most sample solutions for previous projects, but it is a valid option and can actually simplify your Verilog code), you know that the module implementing the ALU will understand the same control signal the same way as any module that instantiates an ALU.
Knowing the complete set of ALU operations and their encoding becomes a fairly detailed specification of what your ALU must implement. This little header of `defines is really both a design specification and a part of the design implementation.
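Here is a minimal sketch of what such a header of `defines, and a separate ALU module built on it, could look like. The operation names and their 2-bit values are hypothetical and deliberately tiny; a real Tangled ALU needs more operations and your own encoding.

```verilog
// Sizes
`define WORD  [15:0]
`define ALUOP [1:0]

// ALU operations (hypothetical names and values, for illustration only)
`define ALUadd 2'b00
`define ALUand 2'b01
`define ALUor  2'b10
`define ALUxor 2'b11

module alu(r, op, a, b);
output reg `WORD r;
input `ALUOP op;
input `WORD a, b;
always @* begin
  case (op)
    `ALUadd: r = a + b;
    `ALUand: r = a & b;
    `ALUor:  r = a | b;
    default: r = a ^ b;   // `ALUxor
  endcase
end
endmodule
```

Any module that instantiates alu can pass `ALUadd rather than a bare bit pattern, so both sides agree on the encoding by construction.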

In summary, in lectures you got a fairly detailed overview of how to go about designing hardware for a complete computer system. The bottom line is that you should start by defining the set of function units, data paths, and control signals you will need. Define the interfaces and signals. Then build the modules themselves. Note also that for this project, you are allowed to use things like the Verilog + operator to build an integer adder: you need synthesizable Verilog, but you don't have to specify things at any particular level.

How Many Modules Should There Be?

Well, it isn't too difficult to build the entire processor as a single module -- as I often have done. However, that makes the Verilog code harder to test and debug. You don't want to wait until everything is written to start testing and debugging the pieces. It also makes it much harder to reuse pieces of it in the next project, which will be a pipelined implementation. Worse still, if we were rendering the design to an FPGA or ASIC, it is quite possible that a single-module version of the Verilog code will generate unnecessarily complex hardware. This can happen by the compiler failing to factor function units (e.g., creating multiple ALUs when one would suffice) or, even more often, by implementing memories at the gate level because the Verilog compiler failed to recognize that your memory could be implemented using a standard memory structure (RAM block). Still, how many modules you make is entirely up to you.
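For instance, a memory kept in its own module can be tested alone and reused later. The following is only a sketch under assumptions: the module name, port order, and 64K-word size are hypothetical, and it models the one-cycle memory this project lets you assume, with a combinational read and a clocked write.

```verilog
`define WORD    [15:0]
`define MEMSIZE [65535:0]

// Word-addressed memory: combinational read, write on the clock edge
module memory(rdata, addr, wdata, we, clk);
output `WORD rdata;
input `WORD addr, wdata;
input we, clk;
reg `WORD m `MEMSIZE;
assign rdata = m[addr];
always @(posedge clk) if (we) m[addr] <= wdata;
endmodule
```

Keeping the array and its access logic in one small module also makes it easier to swap in a different memory organization for the pipelined project.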

If you do decide to implement the entire processor and memory inside one module, make sure you still do incremental testing. You can test everything after you add code for each instruction, rather than testing only after all instructions have been implemented.

Test Plan

As we have been discussing in class, testing a complex piece of hardware is a lot more difficult than simply enumerating all input values and comparing circuit outputs to those of an oracle (correct reference) computation. Your project needs to include a test plan (best described in your Implementor's Notes) as well as a testbench implementing the planned test procedure.

In class, we distinguished testing correctness of the design from testing correct operation of an implementation of the design. For this project, you do not need to worry about implementation test issues: i.e., your test plan does not need to target identification of faults caused by faulty manufacture, timing issues, etc. Neither do you need to "design for testability" in this project -- for example, you don't need to insert scan access paths for internal state that would otherwise be unobservable in the circuit implementation. What you need to do is develop a test plan that will give good certainty that your design itself is logically, functionally, correct.

In class, we discussed the Covered test coverage tool, the metrics it collects, and what should be considered acceptable coverage values. Fundamentally, the most important type of coverage for this project is that every circuit path (every Verilog statement) should be used in some test case. You need not use the Covered tool, nor its version embedded in this course's Verilog WWW form interface, to perform the coverage analysis, but you should provide some explanation in your Implementor's Notes of how your suite of test cases covers approximately 100% of all statements (lines of Verilog). You may (should) assume that built-in Verilog structures and operators, such as +, are operating correctly without exhaustively testing them... but implementations of operations like rotation probably require some test cases (because rotation isn't a Verilog operator).
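As an example of the kind of hand-built operation that deserves its own test cases, here is a sketch of a rotate-left module together with a small check that compares it against rotating one bit at a time. Both modules are hypothetical illustrations; this is not a claim about which operations your Tangled implementation needs.

```verilog
`define WORD [15:0]

// Rotate a left by n bit positions (hypothetical module, for illustration)
module rol(r, a, n);
output `WORD r;
input `WORD a;
input [3:0] n;
wire [31:0] twice = {a, a} << n;   // doubled value shifted left
assign r = twice[31:16];           // top half is the rotated word
endmodule

// Checks rol against rotating one bit at a time, for every shift count
module rol_test;
reg `WORD a, want;
reg [3:0] n;
wire `WORD r;
integer i, j, bad;
rol uut(r, a, n);
initial begin
  bad = 0;
  a = 16'h8001;
  for (i = 0; i < 16; i = i + 1) begin
    n = i;
    want = a;
    for (j = 0; j < i; j = j + 1) want = {want[14:0], want[15]};
    #1 if (r !== want) bad = bad + 1;
  end
  $display("%0d mismatches", bad);
end
endmodule
```

The check uses a different method (repeated one-bit rotation) than the module under test, which is what makes it a meaningful oracle rather than the same expression written twice.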

The testbench you create to implement your test plan should look a lot like the testbench you wrote for Assignment 0, except:

Rather than writing Verilog code to compare output with an oracle, you should use the little trick we discussed in lecture of writing a single assembly language test program that tests all instructions and simply falls into a trap instruction and halts early if any test fails. The nice thing is that, using such a test program, you only need to know the final PC value (or number of clock cycles simulated) to know what test failed -- you don't need to examine internal state such as registers and memory. If you think about it, the truth is that your assembly language code actually embeds an oracle: the check that an add instruction supposed to compute 1 + 2 really does produce the constant 3 is done with Tangled instructions, rather than Verilog test code.
Before you put everything together, it is perfectly reasonable for you to be testing the pieces separately. This is particularly true if, for example, you build a separate module for the ALU or memories. In fact, the floaty Verilog code I've given you was written and tested as independent modules -- and you can even see test code for it in floaty's module testbench;. Combinatorial logic modules could even be exhaustively tested, much as you did in Assignment 0. Remember that testing is about removing technological risk; the need for testing is directly proportional to how uncertain you are about what you're doing. Again, how and what you test is all up to you (except for Qat and floating-point, which are to be handled as described above). Be sure to document your test plan in your Implementor's Notes, and you can submit any additional files involved in implementing the tests as part of your submission tarball.

If you think about it, that basically means the Verilog portion of your final testbench can be something very simple, like:

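module testbench;
reg reset = 0;
reg clk = 0;
wire halted;
processor PE(halted, reset, clk);
initial begin
  $dumpfile;          // enable trace generation
  $dumpvars(0, PE);   // dump every signal inside PE
  #10 reset = 1;      // pulse reset to initialize the processor
  #10 reset = 0;
  while (!halted) begin
    #10 clk = 1;      // one full clock cycle per loop iteration
    #10 clk = 0;
  end
  $finish;
end
endmodule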

This just enables trace generation, initializes everything with a reset, and then keeps toggling the clk until the processor says it has reached a halted state. Remember that the traces aren't really intended for you to read (although you could), but primarily are there to be examined by Covered to determine test coverage.

Note that my online Verilog WWW interface allows use of $readmem directives, so it is much simpler to use that mechanism to initialize memory for your test cases. Include any such files in your submission as files with names ending in .vmem (to indicate that they are Verilog memory initialization files).

Due Dates

The recommended due date is before class, Friday, October 16, 2020. By that time, you should definitely have at least submitted something that includes your assembler specification (tangled.aik) and Implementor's Notes including an overview of the structure of your intended design. That overview could be in the form of a diagram, or it could be a list of top-level modules, but it is important in that it ensures you are on the right track. Final submissions will be accepted up to just before class on Monday, October 19, 2020.

Note that you can ensure that you get at least half credit for this project by simply submitting a tar of an "implementor's notes" document explaining that your project doesn't work because you have not done it yet. Given that, perhaps you should start by immediately making and submitting your implementor's notes document? (I would!)

Submission Procedure

For each project, your team (NOT each person individually) will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It also may include test cases, sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't tangled.v or the AIK file isn't tangled.aik, you should say where these things are in your implementor's notes.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Advanced Computer Architecture.