EE380 Assignment 4 Solution

This is the practice homework with solutions.


  1. In a typical desktop PC processor, which one of the following four memory subsystems is not found on the processor chip?
    TLB
    L1 Cache
    L2 Cache
    Registers
    All four of the above typically are on the processor chip
    Registers, write buffers, TLBs, and caches are all typically on chip now; main memory isn't, but even the classical NorthBridge functions (such as the memory controller) have been brought on chip in designs using AMD's HyperTransport and Intel's QuickPath
  2. Which one of the following four statements about the memory hierarchy is false?
    For comparable cache size, a direct mapped cache is easier to build (simpler logic) than a set associative cache
    mostly because of replacement policy complications...
    Temporal Locality refers to an object being likely to be referenced again soon after being referenced once
    Modern processors often have separate caches for instructions and data
    using split caches to emulate a Harvard architecture simplifies the pipeline structure, because instruction fetch and data access no longer compete for one memory port...
    Larger cache line sizes take better advantage of Spatial Locality
    All four of the above statements are true
  3. Consider the following two MIPS subset implementations:


    Which of the following four statements about how pipelining changes the architecture is false?
    The ALU used for operations like add and xor could be the same circuit in both implementations
    The ALU used to add 4 to the PC could be the same circuit in both implementations
    The Instruction Memory module could be the same circuit in both implementations
    The Data Memory module could be the same circuit in both implementations
    None of the above four statements is false; in fact, all of the modules can be the same circuits in both implementations because pipelining only adds buffers, changes/adds some datapaths, and modifies the control logic
    this is precisely why we bother with a single-cycle implementation
  4. Pipelined designs generally achieve higher performance than similar single-cycle designs by allowing a higher clock rate, but the clock rate with a 5-stage pipeline is generally somewhat less than 5X the speed of the single-cycle design it was derived from (e.g., compare the two MIPS implementations given in question 3). Give one reason why the clock rate is less than 5X.
    e.g., the pipeline registers add latch delay to every stage, and the stages are not perfectly balanced, so the clock period is set by the slowest stage
  5. The first time a modern processor executes a particular branch instruction, it must compute the target address by adding the offset encoded within the branch instruction to the PC value. However, if the same instruction is executed again soon enough, the processor does not have to recompute the target address. Which hardware structure implements this feature?
    BLT
    Bacon, Lettuce, and Tomato?
    TLB
    Translation Lookaside Buffer -- cache for page table address translation entries
    BTB
    Branch Target Buffer -- cache for branch target addresses
    Data Cache
    Instruction Cache
    close but not quite; the instruction may be in cache, but we'd still have to fetch it, extract the offset field, and add that to the PC -- which usually takes about 1 extra clock cycle
  6. Consider executing each of the following code sequences on the pipelined MIPS implementation given below (which does not incorporate value forwarding):

    Incidentally, both code sequences produce the same final results. Which of the following statements best describes the execution times you would expect to observe?
    (A)  addi $t1,$t0,4
         lw   $t2,0($t0)
         xor  $t2,$t2,$t3 # depends on lw $t2
    
    (B)  lw   $t2,0($t0)
         addi $t1,$t0,4
         xor  $t2,$t2,$t3 # depends on lw $t2
    

    (A) would be faster than (B)
    (B) would be faster than (A)
    a bigger gap between the lw and xor means a smaller bubble
    (A) would take the same number of clock cycles as (B)
    Which is faster depends on the values being added and xored
    NO! Dependence interlocks have nothing to do with actual data values, but are a property of the pattern in which things are referenced
  7. Consider executing each of the following code sequences on the pipelined MIPS implementation given below:

    Also consider executing them on this design with value forwarding logic and datapaths added. Which of the following statements best describes how the forwarding logic would alter the execution times?
    (A)  lw   $t1,4($t0)
         sw   $t1,16($t2) # depends on lw $t1
         beq  $t1,$t3,lab # depends on lw $t1
    
    (B)  lw   $t1,4($t0)
         sw   $t2,16($t3)
         beq  $t0,$t3,lab
    

    Neither (A) nor (B) is affected by forwarding
    (A) is not affected, (B) would be faster using forwarding
    (A) would be faster using forwarding, (B) is not affected
    No dependences in (B) means forwarding isn't needed -- nor does it help
    Both (A) and (B) would be faster using forwarding
    The execution time improvements due to forwarding depend on the values in the registers, not on the instructions being executed; thus, it is impossible to say how execution times for (A) and (B) are affected
    NO! Forwarding reduces pipeline bubble size associated with dependences carrying values between instructions, so only the pattern of register references matters
  8. The Intel Pentium 4 has gone through several revisions; the following diagram shows the internals of the version known as Prescott. According to the diagram, which of the following techniques is not used in this design?

    Branch Target Buffer
    Set-associative cache
    8-way means a set size of 8
    Separate L2 caches for code and data
    I only see one L2 Cache with both instruction and data access paths
    Superscalar execution of integer arithmetic
    I see three ALUs in the integer portion of this design
    Instruction scheduling with register renaming
    The "Allocator/Register Renamer" is pretty obvious, as are the "Scheduler uop Queues"
  9. Suppose that a simple system has a single cache with an access time of 2 clock cycles. Cache misses are satisfied with an average memory latency of 1000 clock cycles. Assuming a cache hit ratio of 0.9 (90%), how long does the average reference take? Show the formula that would give the answer.
  10. Given the declarations int a[N][N]; int i, j;, a C compiler would allocate N*N words in memory for a such that a[i][j] is (i*N)+j words after the memory location that holds a[0][0]. Given that N is large, which of the following two loop nests is likely to execute faster and why:
    (1) for (i=0; i<N; ++i) for (j=0; j<N; ++j) a[i][j] = 0;
    (2) for (j=0; j<N; ++j) for (i=0; i<N; ++i) a[i][j] = 0;
    


EE380 Computer Organization and Design.