Spring 2021 CPE380 Assignment 2

Sample solution.

  1. Given this processor hardware design, add control states to the following to implement an average instruction (as decoded by the when below), such that avg $rd,$rs,$rt gives rd the average of the other two register values. You're probably used to averaging by adding and then dividing by 2, but that doesn't work here because the add can go out of range. Instead, use the algorithm rd=((rs&rt)+((rs^rt)>>1)), which is a trick from Prof. Dietz's MAGIC algorithms page.

    You can test your code with:

    MEM[0]=op(1)+rd(8)+rs(9)+rt(10)
    MEM[4]=0
    $8=42
    $9=32
    $10=64
    

    Register $8 should end up holding the value 48, which is 0x00000030.
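
    Before wiring the trick into control states, it can be checked in software. The following is a quick sketch (the function name avg is mine, not part of the assignment): the AND picks out the bits the two values share, the shifted XOR contributes half of the differing bits, and the sum is the floor of the true average with no intermediate overflow.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Overflow-safe average: (rs & rt) is the shared bits, ((rs ^ rt) >> 1)
       is half the differing bits; their sum is floor((rs + rt) / 2) computed
       without ever forming the possibly-out-of-range sum rs + rt. */
    static uint32_t avg(uint32_t rs, uint32_t rt) {
        return (rs & rt) + ((rs ^ rt) >> 1);
    }

    int main(void) {
        /* the assignment's test case: avg(32, 64) = 48 = 0x00000030 */
        assert(avg(32u, 64u) == 48u);
        /* near the top of the range, a naive (rs + rt) / 2 would wrap */
        assert(avg(0xFFFFFFF0u, 0xFFFFFFF2u) == 0xFFFFFFF1u);
        printf("0x%08X\n", avg(32u, 64u));
        return 0;
    }
    ```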

  2. A particular program expressed in a particular ISA executes 100 ALU instructions, 8 Loads, 4 Stores, and 2 Branches. A simple, non-pipelined, implementation of that ISA takes 8 CPI for each ALU instruction, 10 CPI for each load, 20 CPI for each Store, and 20 CPI for each Branch. The original clock period is 4ns. How many clock cycles would the program take to execute? How many microseconds would the program take to execute?
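
     The arithmetic can be checked with a short C program (a sketch; the variable names are mine): total cycles is the sum of count times CPI over the instruction classes, and total time is cycles times the 4ns clock period.

     ```c
     #include <assert.h>
     #include <stdio.h>

     int main(void) {
         /* cycles = sum over instruction classes of (count * CPI) */
         int cycles = 100*8 + 8*10 + 4*20 + 2*20;   /* 800 + 80 + 80 + 40 */
         double us = cycles * 4.0 / 1000.0;         /* 4ns period; 1000ns per microsecond */
         assert(cycles == 1000);
         printf("%d cycles, %.0f microseconds\n", cycles, us);
         return 0;
     }
     ```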
  3. For this question, check all that apply. Given the circumstances described in question 2 above, which of the following changes by itself would yield at least 2X speedup?
    Using a bigger battery and fan, the clock can run at 1GHz
    a 1GHz clock means a 1ns period, so the 1000 cycles take 1000ns instead of 4000ns: a 4X speedup
    Adding a cache memory reduces both Load and Store to 8 CPI
    not even close -- ALU instructions account for 800 of the 1000 cycles and are unaffected
    An improved design reduces the CPI for ALU instructions from 8 to 4
    speeds up ALU instructions by 2X, but nothing else is sped up: 600 cycles total, only about 1.67X
    A new compiler reduces the number of ALU instructions from 100 to 30
    30*8=240, which gives a total of 440 cycles: 1000/440 is about 2.27X, so this qualifies
    A multi-cycle ALU makes ALU instructions take 10 CPI, but allows a 3ns clock period
    3ns wouldn't be 2X even if the number of cycles stayed the same, and the cycle count actually grows to 1200
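    The five options above can all be checked the same way (a sketch; the one-letter variable names are mine): compute each option's execution time in ns and compare against the 4000ns baseline.

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* baseline: 100 ALU @8, 8 Load @10, 4 Store @20, 2 Branch @20, 4ns clock */
        double base = (100*8 + 8*10 + 4*20 + 2*20) * 4.0;   /* 1000 cycles * 4ns */

        double a = (100*8  + 8*10 + 4*20 + 2*20) * 1.0;     /* 1GHz -> 1ns period */
        double b = (100*8  + 8*8  + 4*8  + 2*20) * 4.0;     /* cache: Load and Store at 8 CPI */
        double c = (100*4  + 8*10 + 4*20 + 2*20) * 4.0;     /* ALU CPI 8 -> 4 */
        double d = ( 30*8  + 8*10 + 4*20 + 2*20) * 4.0;     /* 30 ALU instructions */
        double e = (100*10 + 8*10 + 4*20 + 2*20) * 3.0;     /* ALU 10 CPI, 3ns clock */

        assert(base / a >= 2.0);   /* 4000/1000ns = 4X: yes */
        assert(base / b <  2.0);   /* 936 cycles: about 1.07X */
        assert(base / c <  2.0);   /* 600 cycles: about 1.67X */
        assert(base / d >= 2.0);   /* 440 cycles: about 2.27X: yes */
        assert(base / e <  2.0);   /* 1200 cycles * 3ns = 3600ns: about 1.11X */
        printf("speedups: %.2f %.2f %.2f %.2f %.2f\n",
               base/a, base/b, base/c, base/d, base/e);
        return 0;
    }
    ```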
  4. For this question, check all that apply. Which of the following statements about performance is/are true?
    In computing, more FLOPS is a good thing
    Maximizing throughput is the same as minimizing execution time
    no -- throughput can be maximized by, for example, running the shortest jobs first, which does not minimize any individual job's execution time
    Many processors now have a performance register that counts clock ticks
    The time the processor spends executing your program's instructions is called the system time
    it's called the user time; system time is the time spent executing OS code on your behalf
    For any two processors using the same ISA, the one with the faster clock rate will complete a given program in less time
    of course not! time = instruction count * CPI * clock period, and the processor with the faster clock could have a higher CPI
  5. A synthetic benchmark is a program constructed to have performance very similar to that of the real application program it models. Which is more common: that the performance of the synthetic benchmark overestimates performance of the actual or underestimates it? Why?
  6. You have written a program that currently takes 1 hour (60 minutes) to run on a high-end desktop computer: reading the data from the input file takes 10 minutes, the computation takes 45 minutes, and writing the output file takes 5 minutes. On a large enough parallel supercomputer, you can speed-up the computation a lot... but not the file I/O. According to Amdahl's law, what is the maximum speedup you could hope to get by using a parallel supercomputer?
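    A sketch of the Amdahl's law arithmetic (variable names are mine): the 15 minutes of file I/O is the serial fraction, so even if the 45 minutes of computation shrank to zero, the run would still take 15 minutes, bounding the speedup at 60/15.

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        double io      = 10.0 + 5.0;   /* serial minutes: input + output files */
        double compute = 45.0;         /* parallelizable minutes */
        double total   = io + compute; /* 60 minutes */
        /* Amdahl's law: as compute -> 0 on enough processors, time -> io,
           so the maximum speedup is total/io */
        double max_speedup = total / io;
        assert(max_speedup == 4.0);
        printf("max speedup: %.0fX\n", max_speedup);
        return 0;
    }
    ```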


CPE380 Computer Organization and Design.