Assignment 4: Pipelined Tangled + Qat

In this project, your team is going to build a pipelined implementation of Tangled including the Qat coprocessor instructions, which means you'll also be building an ALU for Qat.

A Single-Cycle Starting Point

Just as you started with the concept of a single-cycle architectural model for Tangled, it makes sense to start with a single-cycle model for Tangled including Qat:

That diagram is not to be taken too literally. For example, the imm4 field from the had instruction somehow needs to get to the Qat ALU, but I didn't draw a path for it. In fact, it's quite far off: the Qat register file and ALU both actually need three inputs, and some Qat instructions write into two registers, not just one. Use the above only as a rough guide to help organize your thoughts, and that is how you should start: thinking about the overall structure, not writing Verilog code nor trying to modify somebody's Assignment 3. Think top down.

Although I've drawn things as if there were two stages of Qat ALUs, there's really only about one ALU's worth of delay. The Qat ALU is very wide, but the operations it performs are no more than a few gate delays deep. In fact, the only Qat operation that is at all slow is next, and even it is probably faster than Tangled's 16-bit integer add instruction.

Note that the 256 registers in the Qat register file are completely separate from the Tangled register file, and each register is 256 bits long. There is no way for Qat instructions to directly access data memory, nor even to directly transfer a value from a Tangled register into a Qat register.

The Qat Instructions

Some Qat instructions are harder to deal with than others. So, let's consider the issues in groups.

The 32-bit Qat Instructions

Before even worrying about how to execute the Qat instructions, we have the annoying fact that some of them cannot be handled in a single cycle no matter what: the ones encoded using 32 bits. They will take two clock cycles because you can't really do two instruction fetches in one clock (I'm not allowing you to create a second instruction fetch mechanism).

Thus, the following instructions can't really move past instruction fetch in one cycle. Instead, you'll need to do something like fetching the first instruction word but advancing a NOP through the pipe instead of the just-fetched instruction word. The next clock cycle, you simply fetch the next instruction word and, having completed the full instruction fetch, the instruction can then move to the next pipe stage as usual -- as long as nothing else has caused the pipeline to be blocked.
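For example, here is a minimal sketch of one way fetch could handle this. Every name in it (clk, inst, s1_inst, pc, NOPWORD, the two_word decode test) is a hypothetical stand-in for whatever your own pipeline uses, not part of any required interface:

    // Hypothetical fragment: hold a two-word Qat instruction in fetch for
    // one extra cycle while a NOP advances into the rest of the pipe.
    reg [15:0] ir0;        // saved first word of a two-word instruction
    reg        pending2;   // 1 while waiting for the second word
    always @(posedge clk) begin
      if (!pending2 && two_word) begin
        ir0      <= inst;     // remember word 1 of the instruction
        pending2 <= 1;
        s1_inst  <= NOPWORD;  // a NOP goes down the pipe this cycle
      end else begin
        pending2 <= 0;
        s1_inst  <= inst;     // when pending2 was set, {ir0, inst} is the
      end                     // full 32-bit instruction for decode
      pc <= pc + 1;           // either way, fetch the next word
    end

Of course, this has to cooperate with whatever stall and squash logic you already have for branches and dependences.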

The two-word instructions are:

Instruction        Bits  Description                           Functionality
and @a, @b, @c      32   AND                                   @a = @b & @c
ccnot @a, @b, @c    32   Controlled-controlled NOT (Toffoli)   @a ^= (@b & @c)
cnot @a, @b         32   Controlled NOT                        @a ^= @b
cswap @a, @b, @c    32   Controlled swap (Fredkin)             where (@c) swap(@a, @b)
or @a, @b, @c       32   OR                                    @a = @b | @c
swap @a, @b         32   Swap                                  swap(@a, @b)
xor @a, @b, @c      32   eXclusive OR                          @a = @b ^ @c

The SIMD Bitwise Gate Operations

Most of the Qat instructions are SIMD-parallel bitwise operations on 256-bit vectors, and can be handled completely within the "Qat ALU" in the above single-cycle diagram. These operations implement the specified functionality on 256-bit vectors:

Instruction        Bits  Description                            Functionality
and @a, @b, @c      32   AND                                    @a = @b & @c
ccnot @a, @b, @c    32   Controlled-controlled NOT (Toffoli)    @a ^= (@b & @c)
cnot @a, @b         32   Controlled NOT                         @a ^= @b
cswap @a, @b, @c    32   Controlled swap (Fredkin)              where (@c) swap(@a, @b)
had @a, imm4        16   HADamard initializer for entanglement  @a = imm4
not @a              16   NOT (Pauli-X)                          @a = ~@a
or @a, @b, @c       32   OR                                     @a = @b | @c
one @a              16   ONE                                    @a = 1
swap @a, @b         32   Swap                                   swap(@a, @b)
xor @a, @b, @c      32   eXclusive OR                           @a = @b ^ @c
zero @a             16   ZERO                                   @a = 0

There is really very little for you to do in implementing the above operations. Verilog has no problem handling operations on 256-bit vector registers declared as reg [255:0]. For example, given wire [255:0] a, b, c;, the Verilog code assign a=b&c; would implement the computation specified by the Qat and instruction.
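To make that concrete, here is a minimal combinational sketch covering the single-result members of this group. The module name, port names, and opcode values are made up for illustration (they are not the real Qat encoding); swap, cswap, and had are handled separately below:

    // A rough sketch of the single-result SIMD bitwise operations.
    module qalu(output reg [255:0] out,
                input [2:0] op, input [255:0] a, b, c);
      always @(*) begin
        case (op)
          3'h0: out = b & c;         // and
          3'h1: out = b | c;         // or
          3'h2: out = b ^ c;         // xor
          3'h3: out = a ^ b;         // cnot
          3'h4: out = a ^ (b & c);   // ccnot
          3'h5: out = ~a;            // not
          3'h6: out = {256{1'b1}};   // one: every entanglement channel holds 1
          default: out = 256'd0;     // zero
        endcase
      end
    endmodule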

Note that most Qat operations write only a single register. However, that's not true for the swaps. Think of swap and cswap as allowing two registers to exchange values. However, cswap is really a pair of bitwise multiplexors that allow the operands to be swapped or not for each bit position separately. In other words, cswap does a=(c?b:a) and simultaneously b=(c?a:b).
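A sketch of the two write-back values, using made-up wire names; treating swap as cswap with an all-ones control keeps both instructions in one piece of logic:

    // Hypothetical names: qa, qb, qc hold @a, @b, @c; is_swap is assumed
    // to come from decode, distinguishing swap from cswap.
    wire [255:0] qa, qb, qc;
    wire         is_swap;                             // 1 for swap, 0 for cswap
    wire [255:0] ctl   = is_swap ? {256{1'b1}} : qc;  // swap: always exchange
    wire [255:0] new_a = (ctl & qb) | (~ctl & qa);    // per bit: ctl ? b : a
    wire [255:0] new_b = (ctl & qa) | (~ctl & qb);    // per bit: ctl ? a : b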

The had instruction is also a little special. The values to be loaded into a Qat register by a had instruction are not obvious bit patterns, but have the general form of a repeated pattern that looks like {{k{1'b1}}, {k{1'b0}}} where k is 2^imm4. Notice that's apparently in the reverse order from how the pattern bits were described in lecture -- but it's really not; the catch is that we like to have the high bits first in Verilog, so the higher-numbered entanglement channels come first. For example, had @5,0 would initialize @5 to the 2-bit Verilog value {{1{1'b1}}, {1{1'b0}}} repeated 128 times to fill the 256-bit register. Thus, @5 would get {128{{{1{1'b1}}, {1{1'b0}}}}}. With 256-bit Qat registers, the biggest imm4 value you need to deal with is 7, which would essentially become {{128{1'b1}},{128{1'b0}}}.
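One simple way to generate that pattern is to notice that entanglement channel i of the result is just bit imm4 of the channel number i. Here is a hedged sketch, with a made-up module name:

    // Generates the had pattern: entanglement channel i gets bit imm4 of i.
    // For imm4 = 0, channels alternate 0 and 1; for imm4 = 7, channels
    // 128..255 are 1 and 0..127 are 0, matching the examples above.
    module qhad(output reg [255:0] out, input [3:0] imm4);
      integer i;
      always @(*)
        for (i = 0; i < 256; i = i + 1)
          out[i] = (i >> imm4) & 1;
    endmodule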

The Tricky Qat Instructions

The remaining two Qat instructions are the tricky ones that need to be dealt with in that second Qat ALU -- the meas/next one:

Instruction   Bits  Description                            Functionality
meas $d, @a    16   Measure value of entanglement channel  $d = @a[$d]
next $d, @a    16   Entanglement channel of next 1         $d = next($d, @a)

Why are they tricky? Well, they aren't exactly Qat instructions. In fact, they each imply a "mix" of Qat and Tangled execution actions. Thus, they effectively couple the two otherwise independent instruction execution pipelines for Tangled and Qat (one could say their pipeline implementations are tangled).

The meas instruction is actually shockingly simple to implement: it's a 1-of-256 bitwise multiplexor. All it does is to return the one-bit value extracted from a Qat register. For example, meas $1,@42 where register $1 initially has the value 64 would mean that $1 would be set to either 1 or 0 depending on the value of bit 64 in @42. Easy. The measurement operation in a real quantum computer is very much like meas, except that here you get to say which entanglement channel you want to measure rather than getting one selected at random... oh, and meas doesn't change the contents of the Qat register you measure.
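In Verilog, that multiplexor is essentially a single variable bit select. A sketch with made-up signal names:

    // Hypothetical names: qa is the value read from @a, dval is the current
    // value of $d.  The low 8 bits of $d pick one of the 256 channels.
    wire [255:0] qa;
    wire [15:0]  dval;
    wire [15:0]  meas_result = {15'b0, qa[dval[7:0]]};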

The next instruction is harder to implement. In fairness, what it does has no direct analog in quantum computers; it's a very sophisticated way to sample aggregate properties of the entangled superposed state. For example, if register $8 initially has the value 17, the next $8,@123 instruction would return the number of the next entanglement channel after position 17 that held a 1. If none of the entanglement channels numbered 18 or higher held a 1, the result would be 0. That generic example is a bit confusing, so let's make it even more concrete by supposing that register @123 was initialized by executing had @123,7. That would mean entanglement channels 0 to 127 of register @123 are all 0 and channels 128 to 255 are all 1. Thus, the first entanglement channel after 17 to hold a 1 would be channel 128, and that's the result that would go into register $8.

As described above, the next instruction sounds pretty difficult to implement in one clock cycle: the obvious algorithm would be to loop through the bits until you find a 1, and that will not give us the performance we need. Instead, let's decompose next into two sub-problems:

  1. Zero the first $d+1 entanglement channels. You can use a barrel shifter to literally remove the lowest $d+1 entanglement channels from the value. Thus, if $d is 0, you'd remove 1 entanglement channel. This can even be implemented using the Verilog shift operator. However, we don't really want those bits removed: we want them replaced by $d+1 zero bits. If you think about it, this substitution can be done in a single line of Verilog code.
  2. Count the number of trailing 0-valued bits. This is essentially the same problem as counting the leading 0-valued bits, as you've done to implement floating-point normalization. I expect you to create your own log-gate-delay algorithm that counts the trailing 0 bits (a sketch combining both steps follows this list). Just remember that, if you find no 1 bits, the answer is forced to be 0. Why 0? Because there is no way for the next 1 bit to be in position 0 when you start looking after position 0. This does mean you'd have to check entanglement position 0 separately, but meas could be used for that (in the factoring example later in this assignment, we don't actually care about the first two entanglement channels).
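Here is a rough combined sketch of both steps as a Verilog function you could drop into your processor module and refine. The function name and argument names are made up, and it assumes only the low 8 bits of $d matter:

    // next: zero channels 0..$d with one shift-based line, then do a
    // log-depth trailing-zero count by successive halving.  If no 1 bit
    // remains, the result is forced to 0.
    function [7:0] next_pos;
      input [255:0] q;       // value of the Qat register @a
      input [7:0]   d;       // current (low 8 bits of the) value of $d
      reg   [255:0] m;
      begin
        m = (q >> (d + 1)) << (d + 1);   // step 1: clear channels 0..d
        next_pos = 0;
        if (m != 0) begin                // step 2: locate the lowest 1 bit
          if (m[127:0] == 0) begin next_pos[7] = 1; m = m >> 128; end
          if (m[ 63:0] == 0) begin next_pos[6] = 1; m = m >> 64;  end
          if (m[ 31:0] == 0) begin next_pos[5] = 1; m = m >> 32;  end
          if (m[ 15:0] == 0) begin next_pos[4] = 1; m = m >> 16;  end
          if (m[  7:0] == 0) begin next_pos[3] = 1; m = m >> 8;   end
          if (m[  3:0] == 0) begin next_pos[2] = 1; m = m >> 4;   end
          if (m[  1:0] == 0) begin next_pos[1] = 1; m = m >> 2;   end
          if (m[    0] == 0)       next_pos[0] = 1;
        end
      end
    endfunction

Applied to the worked example above (had @123,7 followed by next with $d holding 17), this returns 128.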

Thus, for 256-bit Qat registers, the total delay in executing a next operation can be on the order of log2(256) = 8 gate delays... which means it's probably faster than Tangled's 16-bit integer add instruction! There's no need for additional pipe stages or a slower clock.

The Tangled+Qat Pipeline

The pipeline structure for Tangled with Qat should thus be a lot like the pipelined structure for Tangled without Qat. What's different?

Of course, the issues you dealt with involving PC update, consistent processing of the system call instruction, etc., are still here. The Tangled instruction set didn't go away: we've just added a coprocessor.

To again be clear about what I expect: your submission should be a viable four-or-more-stage pipelined Verilog implementation of the Tangled instruction set implementing both integer and float instructions, with float NaN support, and implementing the Qat instructions with 256-bit @ registers. All significant design decisions made should be discussed in your Implementor's Notes.

Stuff You Can Reuse

The Tangled ISA should be familiar by now, and now the Qat coprocessor instructions too. Assignment 2 was scary mostly because you had never done something like that before. Assignment 3 was actually much more difficult, but it probably didn't feel that way because you were building on what you had experienced before -- and, because the team membership was shuffled, you were able to indirectly draw on the experience of multiple teams from Assignment 2. Well, Assignment 4 is literally having you implement a processor with features that nobody has ever implemented before. The expected result should actually be a publishable research contribution. However, relax. Assignment 4 builds on Assignment 3 in a way that should allow you to leverage even more understanding, so this originally unthinkably complicated processor design problem breaks into a bunch of subproblems, each of which you know how to solve. This is how engineering really works....

Of course, the same rules apply for reuse of previous materials as applied before. For example, I expect you'll be able to review the floating-point implementations from your previous teams and pick or synthesize the best from them. Just be sure to still start this project top down: do NOT just start with "the best" Assignment 3 and try to add things. It might seem simpler to go that route, but it certainly won't be easy debugging Verilog code that wasn't driven by first creating a good overall, integrated design. Create the design first, then write or reuse/modify Verilog code to implement that design.

Testing

Again, the test coverage plan and testbench from Assignment 3 are probably very close to what you want. However, you do need to seriously think about coverage again because you've added new instructions, and they also change the types of pipeline issues that could happen.

Of course, there are lots of simple ways to test the Qat instructions. What I'd recommend is testing at least the had and next algorithms in isolation before merging them into your processor design; they are a bit tricky. In fact, you might want to test versions of them with shorter-than-256-bit vectors first to ease debugging. However, once you have a reasonable set of Verilog algorithms for each of the Qat instructions, you can test the instructions by doing things as simple as making all 256 bits of each vector the same (using zero and one) and using meas to read the result.
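For example, here is a tiny standalone testbench sketch for the next logic, checking the worked example from earlier (had @123,7, then next after channel 17). It assumes the hypothetical next_pos function sketched above has been pasted or `include'd into the module:

    // Quick sanity check of next in isolation, before pipeline integration.
    module tb_next;
      // ... definition of next_pos goes here (or `include it) ...
      reg [255:0] q;
      initial begin
        q = {{128{1'b1}}, {128{1'b0}}};   // the value had @123,7 produces
        if (next_pos(q, 8'd17)  !== 8'd128) $display("FAIL: next after 17");
        if (next_pos(q, 8'd200) !== 8'd201) $display("FAIL: next after 200");
        if (next_pos(q, 8'd255) !== 8'd0)   $display("FAIL: next past the end");
        $display("next tests done");
      end
    endmodule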

However, wouldn't it be nice to have a little computation that seems really quantum-ish? Ok. So, here's an actual Tangled Qat instruction sequence to compute the prime factors of 15:

;	Tangled factoring of 15
;
;	20201110 by H. Dietz
;
;	pint a = pint_mk(4, 15);
;	pint b = pint_h(4, 0x0f);
;	pint c = pint_h(4, 0xf0);
;	pint d = pint_mul(b, c);
;	pint e = pint_eq(d, a);
;
;	The entanglement channels in e that are 1 are the factors
;	(really corresponding to b values), so just read positions...
;
	had	@0,3
	had	@1,5
	and	@2,@0,@1
	had	@3,4
	and	@4,@0,@3
	had	@5,2
	and	@6,@5,@1
	and	@7,@4,@6
	and	@8,@5,@3
	had	@9,1
	and	@10,@9,@1
	and	@11,@8,@10
	and	@12,@9,@3
	had	@13,0
	and	@14,@13,@1
	and	@15,@12,@14
	xor	@16,@8,@10
	and	@17,@15,@16
	or	@18,@11,@17
	xor	@19,@4,@6
	and	@20,@18,@19
	or	@21,@7,@20
	and	@22,@2,@21
	had	@23,6
	and	@24,@0,@23
	and	@25,@22,@24
	xor	@26,@2,@21
	and	@27,@5,@23
	and	@28,@26,@27
	xor	@29,@18,@19
	and	@30,@9,@23
	and	@31,@29,@30
	xor	@32,@15,@16
	and	@33,@13,@23
	and	@34,@32,@33
	xor	@35,@29,@30
	and	@36,@34,@35
	or	@37,@31,@36
	xor	@38,@26,@27
	and	@39,@37,@38
	or	@40,@28,@39
	xor	@41,@22,@24
	and	@42,@40,@41
	or	@43,@25,@42
	had	@44,7
	and	@45,@0,@44
	and	@46,@43,@45
	xor	@47,@40,@41
	and	@48,@5,@44
	and	@49,@47,@48
	xor	@50,@37,@38
	and	@51,@9,@44
	and	@52,@50,@51
	xor	@53,@34,@35
	and	@54,@13,@44
	and	@55,@53,@54
	xor	@56,@50,@51
	and	@57,@55,@56
	or	@58,@52,@57
	xor	@59,@47,@48
	and	@60,@58,@59
	or	@61,@49,@60
	xor	@62,@43,@45
	and	@63,@61,@62
	or	@64,@46,@63
	xor	@65,@61,@62
	xor	@66,@58,@59
	xor	@67,@55,@56
	xor	@68,@53,@54
	xor	@69,@32,@33
	and	@70,@13,@3
	xor	@71,@12,@14
	and	@72,@70,@71
	and	@73,@69,@72
	and	@74,@68,@73
	or	@75,@74,@74
	not	@75
	or	@76,@67,@75
	or	@77,@66,@76
	or	@78,@65,@77
	or	@79,@64,@78
	or	@80,@79,@79
	not	@80
;
;	@80 is e, so make $0, $2 the prime factors
;
	lex	$0,31	; want first factor >=(16*2)
	next	$0,@80	; should be 5*(16*3)
	copy	$1,$0	; want first factor >$0
	next	$1,@80	; should be 3*(16*5)
	lex	$2,15	; mask to actual values
	and	$0,$2	; 5
	and	$1,$2	; 3

Yes, it only takes 80 gate-level instructions to factor 15!

In case you're wondering, the above instruction sequence was created by modifying the just-in-time parallel bit pattern compiler (which you saw demonstrated in the Nov. 9, 2020 class lecture) to spit out a Tangled/Qat instruction sequence rather than execute the computation directly. The register allocation is rather sloppy, but that's because it tries to keep every intermediate value around as long as possible -- and Qat does have 256 registers, so it's not a tight fit. It is a little bit of a tight fit in that this uses 8-way entanglement, and that's the maximum that your Tangled/Qat implementation will directly support. Limited by the 256-bit vectors, the maximum value your processor could factor this way would be 15^2, or 225.

Just to be clear, I do not expect you to incorporate any design for testability features in your Verilog design. However, I do expect you to test that all the features that were implemented in Assignment 3, such as the floating point, still work.

Due Dates

I know that with COVID-19 disrupting everyone's life, it is difficult to stick to a schedule. I've scaled things back to allow for that: my original plan was to have you executing up to two instructions every clock cycle as a superscalar, but this project only asks for strictly in-order pipelined execution. The latest due date I can allow is the official time of the final exam, which is apparently 3:30PM on Friday, December 4, 2020.

Note that we will NOT be having a final exam in that timeslot. As with the midterm, you'll have the freedom to select any time within a window spanning several days for taking the final.

Submission Procedure

For each project, your team (NOT each person individually) will be submitting a tarball (i.e., a file with the name ending in .tar or .tgz) that contains all things relevant to your work on the project. Minimally, each project tarball includes the source code for the project and a semi-formal "implementor's notes" document as a PDF named notes.pdf. It may also include test cases, sample output, a Makefile, etc., but should not include any files that are built by your Makefile (e.g., no binary executables). Be sure to make it obvious which files are which; for example, if the Verilog source file isn't tangled.v or the AIK file isn't tangled.aik, you should be saying where these things are in your implementor's notes. Don't forget to include any VMEM files and test assembly code.

Submit your tarball below. The file can be either an ordinary .tar file created using tar cvf file.tar yourprojectfiles or a compressed .tgz file created using tar zcvf file.tgz yourprojectfiles. Be careful about using * as a shorthand in listing yourprojectfiles on the command line, because if the output tar file is listed in the expansion, the result can be an infinite file (which is not ok). Also note that zip is not compatible with tar.



EE480 Advanced Computer Architecture.