Spring 2021 CPE380 Assignment 2

Sample solution.

  1. Given this processor hardware design, add control states to the following to implement an average instruction (as decoded by the when below), such that avg $rd,$rs,$rt gives rd the average of the other two register values. You're probably used to averaging by adding and then dividing by 2, but that doesn't work here because the add can go out of range. Instead, use the algorithm rd=((rs&rt)+((rs^rt)>>1)), which is a trick from Prof. Dietz's MAGIC algorithms page.

    You can test your code with:

    MEM[0]=op(1)+rd(8)+rs(9)+rt(10)
    MEM[4]=0
    $8=42
    $9=32
    $10=64
    

    Register $8 should end up holding the value 48, which is 0x00000030.
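
    Before wiring the trick into control states, it can be checked in software. The following is a quick sketch (the function name avg is mine, not part of the assignment): the AND picks out the bits the two values share, the shifted XOR contributes half of the differing bits, and the sum is the floor of the true average with no intermediate overflow.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Overflow-safe average: (rs & rt) is the shared bits, ((rs ^ rt) >> 1)
       is half the differing bits; their sum is floor((rs + rt) / 2) computed
       without ever forming the possibly-out-of-range sum rs + rt. */
    static uint32_t avg(uint32_t rs, uint32_t rt) {
        return (rs & rt) + ((rs ^ rt) >> 1);
    }

    int main(void) {
        /* the assignment's test case: avg(32, 64) = 48 = 0x00000030 */
        assert(avg(32u, 64u) == 48u);
        /* near the top of the range, a naive (rs + rt) / 2 would wrap */
        assert(avg(0xFFFFFFF0u, 0xFFFFFFF2u) == 0xFFFFFFF1u);
        printf("0x%08X\n", avg(32u, 64u));
        return 0;
    }
    ```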

  2. A particular program expressed in a particular ISA executes 100 ALU instructions, 8 Loads, 4 Stores, and 2 Branches. A simple, non-pipelined, implementation of that ISA takes 8 CPI for each ALU instruction, 10 CPI for each load, 20 CPI for each Store, and 20 CPI for each Branch. The original clock period is 4ns. How many clock cycles would the program take to execute? How many microseconds would the program take to execute?
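
     The arithmetic can be checked with a short C program (a sketch; the variable names are mine): total cycles is the sum of count times CPI over the instruction classes, and total time is cycles times the 4ns clock period.

     ```c
     #include <assert.h>
     #include <stdio.h>

     int main(void) {
         /* cycles = sum over instruction classes of (count * CPI) */
         int cycles = 100*8 + 8*10 + 4*20 + 2*20;   /* 800 + 80 + 80 + 40 */
         double us = cycles * 4.0 / 1000.0;         /* 4ns period; 1000ns per microsecond */
         assert(cycles == 1000);
         printf("%d cycles, %.0f microseconds\n", cycles, us);
         return 0;
     }
     ```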
  3. For this question, check all that apply. Given the circumstances described in question 2 above, which of the following changes by itself would yield at least 2X speedup?
    Using a bigger battery and fan, the clock can run at 1GHz
    a 1GHz clock means a 1ns period, so the 1000 cycles take 1000ns instead of 4000ns: a 4X speedup
    Adding a cache memory reduces both Load and Store to 8 CPI
    not even close -- ALU instructions account for 800 of the 1000 cycles and are unaffected
    An improved design reduces the CPI for ALU instructions from 8 to 4
    speeds up ALU instructions by 2X, but nothing else is sped up: 600 cycles total, only about 1.67X
    A new compiler reduces the number of ALU instructions from 100 to 30
    30*8=240, which gives a total of 440 cycles: 1000/440 is about 2.27X, so this qualifies
    A multi-cycle ALU makes ALU instructions take 10 CPI, but allows a 3ns clock period
    3ns wouldn't be 2X even if the number of cycles stayed the same, and the cycle count actually grows to 1200
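    The five options above can all be checked the same way (a sketch; the one-letter variable names are mine): compute each option's execution time in ns and compare against the 4000ns baseline.

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* baseline: 100 ALU @8, 8 Load @10, 4 Store @20, 2 Branch @20, 4ns clock */
        double base = (100*8 + 8*10 + 4*20 + 2*20) * 4.0;   /* 1000 cycles * 4ns */

        double a = (100*8  + 8*10 + 4*20 + 2*20) * 1.0;     /* 1GHz -> 1ns period */
        double b = (100*8  + 8*8  + 4*8  + 2*20) * 4.0;     /* cache: Load and Store at 8 CPI */
        double c = (100*4  + 8*10 + 4*20 + 2*20) * 4.0;     /* ALU CPI 8 -> 4 */
        double d = ( 30*8  + 8*10 + 4*20 + 2*20) * 4.0;     /* 30 ALU instructions */
        double e = (100*10 + 8*10 + 4*20 + 2*20) * 3.0;     /* ALU 10 CPI, 3ns clock */

        assert(base / a >= 2.0);   /* 4000/1000ns = 4X: yes */
        assert(base / b <  2.0);   /* 936 cycles: about 1.07X */
        assert(base / c <  2.0);   /* 600 cycles: about 1.67X */
        assert(base / d >= 2.0);   /* 440 cycles: about 2.27X: yes */
        assert(base / e <  2.0);   /* 1200 cycles * 3ns = 3600ns: about 1.11X */
        printf("speedups: %.2f %.2f %.2f %.2f %.2f\n",
               base/a, base/b, base/c, base/d, base/e);
        return 0;
    }
    ```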
  4. For this question, check all that apply. Which of the following statements about performance is/are true?
    In computing, more FLOPS is a good thing
    Maximizing throughput is the same as minimizing execution time
    no -- throughput can be maximized by, for example, running the shortest jobs first, which does not minimize any individual job's execution time
    Many processors now have a performance register that counts clock ticks
    The time the processor spends executing your program's instructions is called the system time
    it's called the user time; system time is the time spent executing OS code on your behalf
    For any two processors using the same ISA, the one with the faster clock rate will complete a given program in less time
    of course not! time = instruction count * CPI * clock period, and the processor with the faster clock could have a higher CPI
  5. A synthetic benchmark is a program constructed to have performance very similar to that of the real application program it models. Which is more common: that the performance of the synthetic benchmark overestimates performance of the actual or underestimates it? Why?
  6. You have written a program that currently takes 1 hour (60 minutes) to run on a high-end desktop computer: reading the data from the input file takes 10 minutes, the computation takes 45 minutes, and writing the output file takes 5 minutes. On a large enough parallel supercomputer, you can speed-up the computation a lot... but not the file I/O. According to Amdahl's law, what is the maximum speedup you could hope to get by using a parallel supercomputer?
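    A sketch of the Amdahl's law arithmetic (variable names are mine): the 15 minutes of file I/O is the serial fraction, so even if the 45 minutes of computation shrank to zero, the run would still take 15 minutes, bounding the speedup at 60/15.

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        double io      = 10.0 + 5.0;   /* serial minutes: input + output files */
        double compute = 45.0;         /* parallelizable minutes */
        double total   = io + compute; /* 60 minutes */
        /* Amdahl's law: as compute -> 0 on enough processors, time -> io,
           so the maximum speedup is total/io */
        double max_speedup = total / io;
        assert(max_speedup == 4.0);
        printf("max speedup: %.0fX\n", max_speedup);
        return 0;
    }
    ```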


CPE380 Computer Organization and Design.