KySMet Reference: EE480 Advanced Computer Architecture

Instruction set design is hard. Prof. Dietz has designed dozens of instruction sets in the three decades he's been a professor, and it still isn't easy for him to get things right. Thus, rather than giving you complete freedom to design your own instruction set, we're going to walk through the design logic for a reasonably well-crafted one that he built specifically for Spring 2018 EE480. However, this design is not complete -- each student must devise their own encoding of the instructions implement their own assembler.

KySMet Overview

KySMet is a slightly strained acronym for KentuckY Simd Machine with --et added to mean "diminutive" -- a wimpy version of such a machine. SIMD is a nested acronym meaning Single Instruction stream, Multiple Data stream, which means performing the same instruction simultaneously on many data. Of course, KySMet is a misspelling of Kismet. Kismet means destiny or fate, but it comes from the Arabic word "qisma" which means portion, lot, or fragment... which seems appropriate enough for an instruction set in which one processing element is only a fragment of the final machine you are destined to build. Yeah, I spent too much time thinking about the name.... ;-)

The machine has sixteen 16-bit registers, 16-bit datapaths, and 16-bit addresses, and each address in memory holds one 16-bit word. It can operate only on 16-bit integers. The odd thing is the memory structure: there is one instruction memory, but there is a separate data memory for each processing element. Of course, it's not that odd because it still works for having just one processing element.

This instruction set is complete enough that I hope to be giving you a compiler (including full C source code) that translates programs written in a little SIMD dialect of C into KySMet code. It's not going to be a particularly smart compiler (ok, it's really dumb), but it will show you how KySMet can be used for complete programs.

The KySMet Instruction Set

The KySMet instruction set is quite straightforward, a general-register model encoding most instructions as a single 16-bit word:

Instruction Description Functionality

add $d, $s, $t ADD int $d = $s + $t

allen enable ALL processing elements en = (en | 1)

and $d, $s, $t bitwise AND integers $d = $s & $t

call addr CALL addr if (active) { stack(pc+1); pc = addr }

gor $d, $s Global bitwise OR $d = gor($s)

jump addr JUMP addr pc = addr

jumpf $d, addr JUMP False addr if ($d==0) en = (en & ~1); if (none_active) pc = addr

left $d, $s copy from LEFT processing element $d = PE[(IPROC+1)%NPROC].$s

li8 $d, i8 Load Immediate 8-bit integer $d = signed_extend(i8)

lnot $d, $s Logical NOT (zero becomes 1, non-zero becomes 0) $d = ! $s

load $d, $s LOAD $d = memory[$s]

lu8 $d, i8 Load Upper immediate 8-bit integer $d = ($d & 0x00ff) | (i8 << 8))

mul $d, $s, $t MULtiply int $d = $s * $t

neg $d, $s NEGate int $d = - $s

or $d, $s, $t bitwise OR integers $d = $s | $t

popen POP ENable state en = (en >> 1)

pushen PUSH ENable state en = ((en << 1) | (en & 1))

ret RETurn if (active) pc = unstack()

right $d, $s copy from RIGHT processing element $d = PE[(IPROC+NPROC-1)%NPROC].$s

sll $d, $s, $t Shift Left Logical $d = $s << ($t & 15)

slt $d, $s, $t Set Less Than int $d = ($s < $t)

sra $d, $s, $t Shift Right Arithmetic $d = $s >> ($t & 15)

store $d, $s STORE memory[$s] = $d

trap TRAP to operating system (end execution) halt the user program

xor $d, $s, $t bitwise eXclusive OR $d = $s ^ $t

Instruction	Description	Functionality
`add $d, $s, $t`	ADD int	`$d = $s + $t`
`allen`	enable ALL processing elements	`en = (en \| 1)`
`and $d, $s, $t`	bitwise AND integers	`$d = $s & $t`
`call addr`	CALL addr	`if (active) { stack(pc+1); pc = addr }`
`gor $d, $s`	Global bitwise OR	`$d = gor($s)`
`jump addr`	JUMP addr	`pc = addr`
`jumpf $d, addr`	JUMP False addr	`if ($d==0) en = (en & ~1); if (none_active) pc = addr`
`left $d, $s`	copy from LEFT processing element	`$d = PE[(IPROC+1)%NPROC].$s`
`li8 $d, i8`	Load Immediate 8-bit integer	`$d = signed_extend(i8)`
`lnot $d, $s`	Logical NOT (zero becomes 1, non-zero becomes 0)	`$d = ! $s`
`load $d, $s`	LOAD	`$d = memory[$s]`
`lu8 $d, i8`	Load Upper immediate 8-bit integer	`$d = ($d & 0x00ff) \| (i8 << 8))`
`mul $d, $s, $t`	MULtiply int	`$d = $s * $t`
`neg $d, $s`	NEGate int	`$d = - $s`
`or $d, $s, $t`	bitwise OR integers	`$d = $s \| $t`
`popen`	POP ENable state	`en = (en >> 1)`
`pushen`	PUSH ENable state	`en = ((en << 1) \| (en & 1))`
`ret`	RETurn	`if (active) pc = unstack()`
`right $d, $s`	copy from RIGHT processing element	`$d = PE[(IPROC+NPROC-1)%NPROC].$s`
`sll $d, $s, $t`	Shift Left Logical	`$d = $s << ($t & 15)`
`slt $d, $s, $t`	Set Less Than int	`$d = ($s < $t)`
`sra $d, $s, $t`	Shift Right Arithmetic	`$d = $s >> ($t & 15`)
`store $d, $s`	STORE	`memory[$s] = $d`
`trap`	TRAP to operating system (end execution)	`halt the user program`
`xor $d, $s, $t`	bitwise eXclusive OR	`$d = $s ^ $t`

Determining how to encode the above instructions as bit patterns is a key part of your project. However, there are a few rules:

You do not need to use .alias and other "fancy" features of AIK to build your assembler. Writing a separate pattern for each instruction type is fine.
Most instructions are one 16-bit word long. The exceptions are call, jump, and jumpf, which each have to be two words long because the addr operand to them is itself 16 bits long. It is up to you how you encode them in two 16-bit words; good options including having the 16-bit addr be the second word or having the addr split into two 8-bit pieces so that you can still have an opcode in each word along with 8 bits of the addr.
As for MIPS, $ signifies a register number. Unlike MIPS, Logick has 16 registers numbered 0 through 15. It takes a 4-bit field to hold a register number. A little math quickly reveals that some instructions name three registers, so 12 of the 16 instruction word bits will be needed for them, leaving just 4 bits for the opcode. There are also li8 and lu8 instructions that name one register and include an 8-bit immediate value, again leaving only 4 bits for opcode. Thus, your opcode field can handle 16 instruction types -- but there are 25 different types listed here. The key is to reuse some of the instruction bits that would hold operands to instead act as extensions of the opcode, much like MIPS uses the funct field to extend the opcode.
There are no variable-length instructions per se, but loading a constant takes either one or two instructions. If the immediate value is in [-128..127], then a single li8 can load the value. If not, the value v would be loaded using a two-instruction sequence: first li8 to load (v&0xff), then lu8 to overwrite the upper 8 bits with (v>>8). It isn't reasonable to expect a compiler to know which form to use, so your AIK specification must define an li $d, i16 pseudo-instruction that loads the i16 value using a single li8 whenever possible, but uses the li8, lu8 sequence otherwise.

The KySMet Registers

The KySMet processor actually has three different types of registers, but only one of them is random access. The other two are essentially shift registers.

The KySMet General Registers

There are 16 general-purpose registers, some of which have special purposes -- a lot like MIPS. They all have names as well as numbers. Perhaps the best way to give both is the following specification (formatted as an AIK specification):

.const {zero	IPROC	NPROC	sp	fp	rv	u0	u1
	u2	u3	u4	u5	u6	u7	u8	u9 }

Registers $u0 through $ub (aka, registers $4 through $15) are "user" registers to be used in any way the programmer sees fit. However, it is expected that the assembler or compiler would use registers starting at $u9 for "internal" things and starting at $u0 for normal coding. The first six registers have special meanings:

Register Number Register Name Read/Write? Use

$0 $zero Read Only ZERO; constant 0x0000

$1 $NPROC Read Only NPROC; number of processing elements

$2 $IPROC Read Only IPROC; number of this processing element, unique in [0..NPROC-1]

$3 $sp Read/Write the Stack Pointer

$4 $fp Read/Write the Frame Pointer

$5 $rv Read/Write the Return Value

Register Number	Register Name	Read/Write?	Use
`$0`	`$zero`	Read Only	ZERO; constant 0x0000
`$1`	`$NPROC`	Read Only	NPROC; number of processing elements
`$2`	`$IPROC`	Read Only	IPROC; number of this processing element, unique in [0..NPROC-1]
`$3`	`$sp`	Read/Write	the Stack Pointer
`$4`	`$fp`	Read/Write	the Frame Pointer
`$5`	`$rv`	Read/Write	the Return Value

Note that use of $0, $1, or $2 as the destination for a result is illegal -- they are read only registers. Thus, if you wish, you could use those bit patterns for other things. In other words, add $0, ..., ... is illegal, so you could use that bit pattern for something else, such as neg $d, $s. You'll have to be slightly clever to cram all these instruction formats into a structure that can't always reserve more than 4 bits for the opcode, but there are lots of ways to solve this problem. Simpler ways are better. :-)

The KySMet Enable Stack

It is useful for SIMD processing elements to have the ability to temporarily disable themselves for the duration of a construct. Since constructs can nest, there is actually an enable stack which is here called en. The enable stack should be at least 32 bits deep, initialized to all 1s. To be precise, there is one enable stack per processing element.

The current enable state is en[0]; if it is a 0, then most instructions are essentially treated as null operations. The exceptions are allen, call, jump, jumpf, popen, pushen, ret, and trap, which are always executed even if the processor is disabled. When a new construct is entered, pushen basically makes en = {en[30:0], en[0]}, pushing a copy of the current enable state on the stack. At the end of a construct, popen pops the enable stack: en = {en[31], en[31:1]}. The jumpf instruction makes en[0] = 0 if the register tested is false, while allen essentially makes en[0] = 1.

The KySMet Call Stack

One of the little details about SIMD code is that you really can't have processors disagree about what code to execute next. Thus, processing elements are not allowed to jump to a locally-computed address: all target addresses, addr, are constants. The catch is, then how do we do subroutine or function calls? Many SIMD machines (e.g., most GPUs) simply didn't allow it. However, we'll allow it using a hardware call stack, retaddr, containing 64 bits -- enough for four-level nested calls.

When a call instruction is executed, the return address is pushed into retaddr: retaddr = {retaddr[47:0], pc}. Similarly, the ret instruction pops an address: pc = retaddr[15:0]; retaddr = {retaddr[63:48], retaddr[63:16]}. Actually, it doesn't really matter what the top 16 bits of retaddr are left as after a ret because they are no longer valid.

You have noticed that there is no way for multiple processing elements to be executing different instructions simultaneously because there is no way to jump to an address that is computed to be different on different processing elements -- not even ret can cause such a "divergence of control flow." Thus, there could be only one call stack rather than one per processing element; however, there is nothing wrong with having one per processing element.

KySMet Data

The KySMet instruction set deals only with 16-bit int data which is generally treated as being signed.

In a SIMD machine with n processing elements, it is not uncommon to have n+2 memories. There is one instruction memory, n data memories (one local to each processing element), and one data memory for the control unit. Data structures that are "plural" have independent versions in each of the local memories while "singular" data structures exist only in the control unit data memory. Well, we don't have a separate control unit data memory -- instead, we simply replicate the "singular" data in each of the local data memories.

Normally, SIMD machines have a variety of instructions that allow communication between processors. Often this is done by allowing processing elements to directly access registers of adjacent processing elements. Here, we provide three communication mechanisms:

Global OR, gor, simply bitwise-ORs together the values taken from all processing elements. The source registers are used even if the processing elements that hold them are disabled, but the destination registers are only updated if enabled. Sometimes, that's exactly what you want. For example, you can broadcast a value from one processing element to all others by simply making all others put forth the value 0. Of course, with just one processing element, gor just copies the value from register $s to $d.
Left, left, copies the value from the processing element to the left (with wrap-around at the end). For example, processing element 5 would get the value from processing element 6. As for gor, it doesn't matter if the source processing element is enabled or not.
Right, right, copies the value from the processing element to the right (with wrap-around at the end). For example, processing element 5 would get the value from processing element 4. As for left, it doesn't matter if the source processing element is enabled or not.

Obviously, in the case of having just a single processing element, gor, left, and right all do exactly the same thing.

Also be sure to force .lowfirst to be 0 so that bits are packed into 16-bit words starting with the MSB working down.

Advanced Computer Architecture.