Assignment 4: nItebHa', Doch cha'
In Assignment 3, you defined the instruction encoding, built an
assembler, and wrote Verilog code for a pipelined implementation
of the loQ Don instruction set architecture, created a test
coverage plan, and tested your design. Just as when you went
from Assignment 2 to 3, you'll be able to reuse most of that.
However, this time you need to make your implementation do two
things together -- implement superscalar execution
capable of completing more than one instruction per
clock under at least some circumstances. Qapla'!
Where To Begin
The biggest problem you have in implementing superscalar
execution is that you have to be able to keep the system fed.
For this project, you must fetch more than one
instruction per clock cycle under at least some
circumstances. In particular, each instruction is 16 bits long,
but I suggest that you fetch a pair of instructions at a
time, with the pair always being from instruction
addresses that differ only in the least significant bit. For
example, if you want to fetch the instruction at 0x1234, you
actually fetch the instructions at 0x1234 and 0x1235. Logically,
you're fetching an instruction pair at a time as a single 32-bit
"word."
How you process each pair of instructions is entirely up to you,
but your design must be able to complete more than one instruction
per clock cycle under some circumstances -- which is not the same
as saying it can do that every clock cycle. Here are some of the
key ideas and requirements to mull over:
-
Fetching a pair of instructions should be implemented as
fetching a 32-bit "instruction parcel" from an apparent address
that ignores the low bit of the PC. In other words, it fetches
from both (PC & ~1) and (PC | 1) at the same time. In
straight-line code, that's not very strange... but keep in mind
that a jz or jnz is allowed to jump to either
an even or odd address. For example, if the code jumped to
0x1235, you would still fetch from both 0x1234 and 0x1235. In
other words, you will be fetching an instruction from 0x1234
that you don't use.
-
Implementing some type of out-of-order instruction
scheduling is worth 5% of the project grade. The
out-of-order scheduling can be any type of rearrangement. For
example, you might simply let instructions queue waiting for
function units such that if there was a sequence of load/store
instructions (assuming you have just one load/store data
memory), you might let a later ALU instruction execute before
one of the load/store instructions. You could do even more
complex things, such as implementing register renaming and using
that to do more aggressive out-of-order scheduling, but the 5%
only requires you to do something out of order....
-
If you implement out-of-order instruction
scheduling, you are not required to handle speculative
execution. In other words, you may implement
jz and jnz by disabling further instructions
being fetched. This can mean that your superscalar design would
often be outperformed by a simple speculative pipeline, but many
of the fancier schemes for out-of-order execution standardly
assume that only one basic block is being processed at a time.
Note that you are required to implement speculative
execution if you did not implement some form of out-of-order
scheduling, and will lose an additional 5% if you
implement neither speculation nor out-of-order execution.
Internal Hints
Here are some hints about the internal structure:
How many stages should you use? Well, it's up to you but I'll
suggest 3 is a good number to start thinking about. Stage 0
would be instruction fetch and decode, Stage 1 would be register
read, and Stage 2 would be ALU and data memory access and
register write. You may find it appropriate to add more stages.
For example, if you do register renaming, a 5-stage design might
be more natural -- adding a register rename stage after
instruction fetch and also adding a register "retirement" stage
at the end.
I strongly recommend that Stage 0 should decode what it
fetched. The loQ Don instruction format doesn't always
have the register names that serve similar purposes in the same
field; you want to fix as early as possible by converting
vertical instruction encoding to horizontal control words. Why?
Because then all the interlock/forwarding logic is always
looking at the same "fields" of the state passed from one stage
to the next. There are also some simplifications, such as the
unary neg, negv only uses 1 operand... so you'll want to make
the second operand slot be something that doesn't cause an extra
dependence, such as a copy of the first operand register number.
Each stage or other entity is an always @(posedge clk) if
(!halt)
Take advantage of the owner computes concept.
Each stage should own any "variables" it writes; never write the
same register in multiple stages. E.g., s0dest is the
destination written by stage 0, s1dest is the destination
written by stage 1; note s0dest is used by stage 1 and s1dest by
stage 2. In a 3-stage pipe, register write is done by s2 -- it
writes, so it owns the register file. Note that the register
file is most used by s1, which is fine. Also keep in mind that
assignment using = takes effect "immediately," leaving timing
of accesses in other always clauses a bit unclear;
the <= assignment distinguishes before and after.
If the owner needs to know stuff from other stages to update a
register, there should be one always that collects that
info into a simple signal. In other words, one always
should own each inter-stage communication signal. For example,
you'll probably want to have a separate "squash" flag that can
be tested by stages.
Register 0 shouldn't ever be written, so using 0 to mean "don't
write" leads to a clean implementation.
Register 1 is also special: it is the PC value from when the
instruction was fetched. In fact, it's not even a 32-bit
register, but 16-bit with 16 0's above. Thus, a read from
register 1 should simply get the saved PC value passed from
stage 0. In any case, memory addresses never exceed 16 bits.
If you make the ALU a separate module with all the mess contained in
it, you could easily instantiate it more than once to support more
parallel execution (i.e., two ALU instructions in the same clock).
For loading instructions into memory (initialization), build
yourself two instruction memories. Initialize the one that is an
array of 16-bit words normally, then use a loop to copy it into the
32-bit word memory that is really the instruction memory. This gets
around the issues of having 32-bit parcels of 16-bit instructions.
Due Dates
The recommended due date is TBA. Final submissions will be
accepted up to TBA.
Submission Procedure
Each team will submit a project tarball (i.e., a file
with the name ending in .tar or .tgz) that
contains all things relevant to your work on the project.
Minimally, each project tarball includes the source code for the
project and a semi-formal "implementors
notes" document as a PDF named
notes.pdf.
(Fairly obviously, the Implementor's Notes should also
say who the implementors are -- list all team members as
authors.)
It also may include test
cases, sample output, a make file, etc., but should not include
any files that are built by your Makefile (e.g., no binary
executables). For this particular project, name the Verilog
source file super.v.
Submit your tarball below. The file can be
either an ordinary .tar file created using tar cvf
file.tar yourprojectfiles or a compressed
.tgz file file created using tar zcvf
file.tgz yourprojectfiles. Be careful
about using * as a shorthand in listing
yourprojectfiles on the command line, because
if the output tar file is listed in the expansion, the result
can be an infinite file (which is not ok).
Use the submission form below to submit your project
as a single submission for your team --
you do not submit as individuals.
The last submission before the final deadline is the
one that will be graded.
Advanced Computer Architecture.