Assignment 3: A Faster IDIOT

In Assignment 2, you defined the instruction encoding, built an assembler, wrote Verilog code for a multi-cycle implementation of the IDIOT instruction set architecture, created a test coverage plan, and tested your design. The good news is that most of that stuff can be reused for this project. The bad news is that you still have a pile of Verilog to write... because you'll be implementing a nice, fast, pipelined version of IDIOT.

A Single-Cycle Starting Point

Remember EE380? Not really? That's ok... just play along anyway. Back in EE380, we followed a rather neat plan in the textbook that basically recommended that a pipelined design could best be created by initially designing a slow single-cycle implementation. The function units, data paths, and control signals defined for the single-cycle implementation could then be used (with only minor modifications) in the pipelined version. It was mostly just a matter of carving the single-cycle design into appropriate pipeline stages. Well, now is the time we see if that approach really works....

I could ask you to design your own single-cycle implementation from scratch, but I'll save us all some time and give you an overview diagram for such a thing right here:

You should recognize this from my little walk-through in class on March 25, but I have re-drawn it much neater, as I said I would. It's also color coded. Registers and other memory units are green, things that compute values are blue, and multiplexors are red. I'm not going to go through the wholen lecture again here, but the key ideas were:

All the ALU instructions -- and that's 12 out of a total of 18 instructions (counting the floating-point operations you don't yet need to worry about) -- get handled straightforwardly going PC to Instruction Memory to read Register File to ALU to write Register File, with PC getting PC + 1.
Load (ld) and store (st) are just like ALU instructions except that they substitute the Data Memory for the ALU (and store shouldn't write into a register).
The system call instruction, sys, should just halt. There really isn't any function unit asociated with that because it's just a matter of notifying the control logic that you're halted.
The sz operation basically increments the PC by 2 if the destination register holds the value zero. That's why there is a +2 ALU. Incidentally, note that +2 is really just an incrementer that skips the first bit position... so a clever implementation might combine the +1 and +2 ALUs.
The jz instruction stuffs the source register's value into the PC if the destination register holds the value zero.
I saved the worst for last: li isn't really executed in one cycle at all. It could be -- but that would require fetching at least two words from instruction memory each clock cycle and gluing-together the two words that make-up an li. Instead, my single-cycle design really just takes a single cycle per instruction word, so li takes two cycles. That's not too hard to do, but remember that the second word is just data, without an opcode or destination register field. When I say "remember," I mean you literally need to put some info from the first word of an li into a little register so you can use it to override looking at the opcode field in the second word. Basically, the li register needs to note if the next word is the second word of an li instructions and, if it is, it needs to remember the destination register that the second cycle will write the immediate value into.

Well, that wasn't so bad now, was it? Well, of course not... because that's not what you have to do. You have to make a pipelined design.

Setting The Stage(s)

One of the first steps in making a pipelined implementation is figuring-out how many stages there should be and what belongs in each. You may have noticed that the function units are not all that is colored in the above diagram -- well, the other colors are a hint about how to pipeline.

It is fairly obvious that the memories will take a little while to access, and we all know ALUs are notoriously slow. Thus, we'd expect a stage for each of those things along any circuit path. Roughly speaking, that means the black wires are active during the first stage as the instruction is fetched, the magenta wires are active as the register file reads the two register values (which requires a register file capable of two reads in one clock cycle), the gold/brown wires are active during the third cycle when the main ALU for data memory is used, and the cyan wires are active while writing into the destination register. This fairly obviously leads to a 4-stage pipeline: Instruction Fetch to Register Read to ALU / Memory access to Register Write.

See? Now that wasn't so bad, was it? Of course not... but your task is a little harder. The ALU instructions and load and store are all easy, and sys is still just halt, but that leaves three instructions with issues. How can you deal with them?

The sz instruction required a +2 incrementer in the single-cycle design, but here we have the problem that we don't know if we are supposed to skip until rather late in the pipe. As drawn, you'd only know by the end of stage 2, but the machine needs to know what address belongs in the PC next at the end of stage 0. Fortunately, there's an elegant solution to this problem: simply always load the next instruction and, if you later find-out that sz should have skipped it, just squash that instruction in the pipeline. In other words, turn the instruction into a side-effect-free null operation. This is basically equivalent to speculative execution predicting that the next instruction isn't skipped. Notice that this isn't really correct if the skipped instruction is an li, but that wasn't handled correctly by doing +2 in our single-cycle design either! It is fairly easy to handle handle squashing an li in the pipelined version; the +2 ALU in the single-cycle design should really be a +2/+3.
Now that you know how to handle the sz instruction, it should be easy to see how the same type of "squashing" should work for the jz instruction. Simply speculate that the jump will not be taken and squash the instructions fetched if it turns out that the jz should be taken. In fact, it's not too much harder to guess taken instead of not-taken... neither would it be hard to make a BHB (branch history buffer) to guide your predictions, but I don't expect you to do that.
Ah, but there's still li. Basically, the trick is to keep a little state info to tell the machine if the previous instruction was one that is followed by a word of data. I think it is easiest to use a pair of consecutive clock cycles for each li, but to save the li info at the end of the first clock cycle and forward the first clock's data as a "squashed" null operation. Thus, the second cycle really becomes the load immediate instruction -- with the opcode and destination register name both coming from the instruction squashed at the end of the first cycle.
Remember those wacky little scheduling issues having to do with sequences like add $u0,$u1 followed by or $u2,$u0? There must be time for the pipeline to store the new value of $u0 before the or moves past the register read stage in the pipleline. You may handle this using forwarding, but you don't have to use that approach -- feel free to just implement a simple hardware interlock that prevents the or from advancing to the next pipe stage when there is a conflict. You do have to implement some type of hardware solution for this interlock/value forwarding problem -- you can't just say "don't write code like that" because your assembler doesn't insert the nops that would be needed to make the code work correctly without hardware help.

Let's be completely clear about what I expect: your submission should be a four-or-more-stage pipelined Verilog implementation of IDIOT's non-floating-point instruction set. I expect you to implement both sz and jz in a way where there is some circumstance (e.g., non-zero) in which the instruction does not cause a pipeline bubble. I also expect you to correctly handle li operations, but have no problem at all with each li taking two pipeline stages (the first being turned into a null operation).

Stuff You Can Reuse

Assignment 2 was scary, but that was because you had never done something like this before -- now you have. For this project, you are allowed to reuse any pieces from the Assignment 2 solutions that you, or any of your Assignment 3 teammates. helped create. You also may use any materials I give you here. Perhaps most importantly, you also are free to not use any of those things; in other words, you can combine any of those materials and make changes as your Assignment 3 team sees fit. For example, if you don't like the way instruction fields are encoded, feel free to re-write the AIK assembler (but make sure your Implementor's Notes documents how instructions are encoded and why).

In general, you are not allowed to use anything from another Assignment 3 team nor from an Assignment 2 team that none of your Assignment 3 team members were on. You can use things done by any of your Assignment 3 team members, including things their teams did on Assignment 2, and things provided as part of this assignment. So, let's start the giving....

Instruction Encoding And AIK Specification

Here's my sample Assignment 2 solution for an AIK specification:

; AIK definition of IDIOT assembler, 20160321 by H. Dietz
add  $.d,$.s := 0b0000:4 .d:6 .s:6
and  $.d,$.s := 0b0100:4 .d:6 .s:6
any  $.d,$.s := 0b0111:4 .d:6 .s:6
dup  $.d,$.s := 0b1000:4 .d:6 .s:6
jz   $.d,$.s := 0b1110:4 .d:6 .s:6
ld   $.d,$.s := 0b1100:4 .d:6 .s:6
li   $.d,.v  := 0b1111:4 .d:6  0:6 .v:16
or   $.d,$.s := 0b0101:4 .d:6 .s:6
sz   $.d     := 0b1110:4 .d:6  1:6
shr  $.d,$.s := 0b1001:4 .d:6 .s:6
st   $.d,$.s := 0b1101:4 .d:6 .s:6
sys          := 0b1110:4  0:6  0:6
xor  $.d,$.s := 0b0110:4 .d:6 .s:6
addf $.d,$.s := 0b0010:4 .d:6 .s:6
f2i  $.d,$.s := 0b1010:4 .d:6 .s:6
i2f  $.d,$.s := 0b1011:4 .d:6 .s:6
invf $.d,$.s := 0b0001:4 .d:6 .s:6
mulf $.d,$.s := 0b0011:4 .d:6 .s:6
.const {zero	one	sign	all	sp	fp	ra	rv
	u0	u1	u2	u3	u4	u5	u6	u7
        u8  	u9	u10	u11	u12	u13	u14	u15
        u16  	u17	u18	u19	u20	u21	u22	u23
        u24 	u25	u26	u27	u28	u29	u30	u31
        u32 	u33	u34	u35	u36	u37	u38	u39
        u40 	u41	u42	u43	u44	u45	u46	u47
        u48 	u49	u50	u51	u52	u53	u54	u55 }
.segment .text 16 0x10000 0 .VMEM
.segment .data 16 0x10000 0x8000 .VMEM ; default data start at 0x8000
.const 0 .lowfirst ; was wrong endian for li

It will encode instructions the way I discussed in class, which you are free to use or not use. There are several "clever" opcode grouping decisions embodied in it. All the ALU instructions have at least one of the top two opcode bits as 0 -- i.e., all the non-ALU instructions match the opcode pattern 4'b11xx. Within the ALU instructions, you'll see 4'b00xx is used for arithmetic, 4'b01xx for logical, and 4'b10xx for single-operand transformations. Similarly, load and store encode as 4'b110x. The control flow operations are all 4'b1110 and li (which is almost a control flow operation because it is two words long) is encoded by 4'b1111. The selection of how to encode the argument register numbers is obvious in the above AIK specification... with the key property being consistency of the hardware's use of the values in each of the fields.

Verilog Implementations

Although I strongly encourage you to make separate modules at least for the main memory and ALU, here's my scary-simple multi-cycle Verilog design for Assignment 2:

// basic sizes of things
`define WORD	[15:0]
`define Opcode	[15:12]
`define Dest	[11:6]
`define Src	[5:0]
`define STATE	[4:0]
`define REGSIZE [63:0]
`define MEMSIZE [65535:0]

// opcode values, also state numbers
`define OPadd	4'b0000
`define OPinvf	4'b0001
`define OPaddf	4'b0010
`define OPmulf	4'b0011
`define OPand	4'b0100
`define OPor	4'b0101
`define OPxor	4'b0110
`define OPany	4'b0111
`define OPdup	4'b1000
`define OPshr	4'b1001
`define OPf2i	4'b1010
`define OPi2f	4'b1011
`define OPld	4'b1100
`define OPst	4'b1101
`define OPjzsz	4'b1110
`define OPli	4'b1111

// state numbers only
`define OPjz	`OPjzsz
`define OPsys	5'b10000
`define OPsz	5'b10001
`define Start	5'b11111
`define Start1	5'b11110

// source field values for sys and sz
`define SRCsys	6'b000000
`define SRCsz	6'b000001

module processor(halt, reset, clk);
output reg halt;
input reset, clk;

reg `WORD regfile `REGSIZE;
reg `WORD mainmem `MEMSIZE;
reg `WORD pc = 0;
reg `WORD ir;
reg `STATE s = `Start;
integer a;

always @(reset) begin
  halt = 0;
  pc = 0;
  s = `Start;
  $readmemh0(regfile);
  $readmemh1(mainmem);
end

always @(posedge clk) begin
  case (s)
    `Start: begin ir <= mainmem[pc]; s <= `Start1; end
    `Start1: begin
             pc <= pc + 1;            // bump pc
	     case (ir `Opcode)
	     `OPjzsz:
                case (ir `Src)	      // use Src as extended opcode
                `SRCsys: s <= `OPsys; // sys call
                `SRCsz: s <= `OPsz;   // sz
                default: s <= `OPjz;  // jz
	     endcase
             default: s <= ir `Opcode; // most instructions, state # is opcode
	     endcase
	    end

    `OPadd: begin regfile[ir `Dest] <= regfile[ir `Dest] + regfile[ir `Src]; s <= `Start; end
    `OPand: begin regfile[ir `Dest] <= regfile[ir `Dest] & regfile[ir `Src]; s <= `Start; end
    `OPany: begin regfile[ir `Dest] <= |regfile[ir `Src]; s <= `Start; end
    `OPdup: begin regfile[ir `Dest] <= regfile[ir `Src]; s <= `Start; end
    `OPjz: begin if (regfile[ir `Dest] == 0) pc <= regfile[ir `Src]; s <= `Start; end
    `OPld: begin regfile[ir `Dest] <= mainmem[regfile[ir `Src]]; s <= `Start; end
    `OPli: begin regfile[ir `Dest] <= mainmem[pc]; pc <= pc + 1; s <= `Start; end
    `OPor: begin regfile[ir `Dest] <= regfile[ir `Dest] | regfile[ir `Src]; s <= `Start; end
    `OPsz: begin if (regfile[ir `Dest] == 0) pc <= pc + 1; s <= `Start; end
    `OPshr: begin regfile[ir `Dest] <= regfile[ir `Src] >> 1; s <= `Start; end
    `OPst: begin mainmem[regfile[ir `Src]] <= regfile[ir `Dest]; s <= `Start; end
    `OPxor: begin regfile[ir `Dest] <= regfile[ir `Dest] ^ regfile[ir `Src]; s <= `Start; end

    default: halt <= 1;
  endcase
end
endmodule

As discussed in class, the above Verilog code is very likely to result in a bigger circuit than if we carefully factored things into modules and created single instances of those modules. For example, a Verilog compiler processing the above might fail to map mainmem into a dedicated memory block within an FPGA, instead constructing a memory using thousands of logic cells. Using an instance of a memory module designed to comply with the FPGA-maker's guidelines (e.g., this dual-port RAM with a Single Clock from ALTERA) ensures that the vendor's Verilog toolchain will correctly infer use of the intended hardware modules inside the FPGA. Of course, in this class we are not rendering designs into physical circuits, so these issues of complexity (and timing analysis) are neither obvious nor critical... but you should always be aware of the potential hardware complexity you risk introducing by using a specification style that doesn't explicitly factor-out the desired modules.

So, does the above Verilog code work? Mostly... which means "no." What's wrong? Well, the same thing that is wrong with the abstract single-cycle diagram up above: a sz will not correctly skip over a li instruction. In fact, it is a pain to fix that in the above Verilog code... so consider this a good example of where documenting a bug might actually make more sense than trying to fix it. How could one fix it? Basically, you'd have to actually fetch the skipped instruction and look at the opcode field, skipping a second word if opcde field was `OPli. I'll leave the actual Verilog code as an exercise for the reader (that means you ;-) ).

Ok, so that was roughly an Assignment 2 solution... what about the single-cycle design? Well, it doesn't have to be much different. The biggest difference is chaining everything together as one state, whereas the above took three states. The main trick is to absorb `Start and `Start1. That's easily done, except for the little detail that you need to override the state choice if you're fetching the second word of an li.

Testing

Again, the test coverage plan and testbench from Assignment 2 are probably very close to what you want. However, you do need to seriously think about coverage again. Why? You are not testing the same Verilog code, so there may be some paths that didn't exist before -- and they might not be covered with a testbench that covered your old version.

In general, you'd need to write a lot of Verilog code to implement a testbench for a complex circuit. However, as I reminded you in class, here you are building a processor -- so it can actually execute much of its own testing. All we need to do is execute a test program and have code examine the results. This approach has the happy benefit of also being a viable way to test that the actual hardware correctly implements the design: you would just run the code on the actual hardware.

A testbench to execute whatever program until completion is:

module testbench;
reg reset = 0;
reg clk = 0;
wire halted;
processor PE(halted, reset, clk);
initial begin
  $dumpfile;
  $dumpvars(0, PE);
  #10 reset = 1;
  #10 reset = 0;
  while (!halted) begin
    #10 clk = 1;
    #10 clk = 0;
  end
  $finish;
end
endmodule

This just enables trace generation, intializes everything with a reset, and then keeps toggling the clk until the processor says it has reached a halted state.

Next, you need some code to run on it. For example:

  .data
test:
  .word 0x601
  .text
main:
  li   $u0,0x1003
  add  $u0,$one ; u0=0x1004
  li   $u1,0x4321
  li   $u2,0x1234
  and  $u1,$u2 ; u1=0x0220
  li   $u3,0x4321
  or   $u2,$u3 ; u2=0x5335
  li   $u4,0x1234
  xor  $u3,$u4 ; u3=0x5115
  any  $u4,$u4 ; u4=1
  any  $u5,$zero ; u5=0
  dup  $u6,$sign ; u6=0x8000
  li   $u7,0x8421
  shr  $u7,$u7 ; u7=0x4210
  li   $u8,test ; u8=0x8000
  li   $u9,0x0042 ; u9=0x0042
  st   $u9,$u8 ; test=42
  ld   $u10,$u8 ; u10=0x0042
  dup  $u11,$zero ; u11=0
  li   $u12,taken
  jz   $u11,$u12
  dup  $u11,$one ; should not happen
taken:
  dup  $u12,$one
  li   $u13,nottaken
  jz   $u12,$u13
  dup  $u12,$all ; u12=0xffff
nottaken:
  sz   $u13
  dup  $u13,$zero ; u13=0x0000
  dup  $u14,$zero ; u14=0x0000
  sz   $u14
  dup  $u14,$one
  dup  $u15,$zero ; u15=0x0000
  sys
good:
  dup  $u15,$all ; should not execute

This code tests quite a lot (although not sz skipping a li!), but we'd have to look at a lot of register contents to see if everything is right. That's a tad awkward because the registers are hidden away inside the processor. However, they are not hidden from code being simulated on the processor, so we could simply add code to check the register values... but then how would I know it succeeded? The answer is a very simple trick: look at where the PC was when it halted (i.e., hopefully when it executed the sys instruction). In the above sequence, it should stop with the PC pointing at good -- if it stopped anywhere else, something did not work right.

Ah, but that doesn't test the values, does it? Nope. Here's how we test. For example, u3 should end-up being 0x5335, so we simply compare and execute a sys instruction if the value is not as desired. Executing the sys instruction will halt the machine, leaving the PC pointing to the failed test case! The sequence would look like:

  li   $u55,0x5335
  xor  $u55,$u3     ;compare for equality
  sz   $u55
  sys               ;if the buck stopped here, bad!

Of course, you can write IDIOT processor code that tests anything -- not just values in registers.

Due Dates

The recommended due date is Wednesday, April 13. Final submissions will be accepted up to when class begins on Monday, April 18 (extended from 11:59PM on Sunday, April 17 as announced in class).

Submission Procedure

Each team will submit a project tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementors notes" document as a PDF named notes.pdf. (Fairly obviously, the Implementor's Notes should also say who the implementors are -- list all team members as authors.) It also may include test cases, sample output, a make file, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). For this particular project, name the Verilog source file pipe.v.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok).

Use the submission form below to submit your project as a single submission for your team -- you do not submit as individuals. The last submission before the final deadline is the one that will be graded.

Advanced Computer Architecture.