EE380 Memories
The book does a very good job of explaining memory structures;
read it. Emphasis here should be on cache and TLB concepts,
because these are the things that are truly hardware... you'll
see more about demand-paged virtual memory if you take an OS
course. Here's a quick outline of what you should have a basic
understanding of:
- Memory Hierarchy levels and characteristics: Registers, L1 Cache,
  L2 Cache, Split I/D Caches, Main Memory, Disk (including RAID --
  Redundant Array of Inexpensive Disks)
- Spatial vs. Temporal Locality
- Cache Concepts:
  - Hit, Miss, Hit Ratio (together with the miss penalty, these give
    the average access time: hit time + miss ratio * miss penalty)
  - Cache Organization: Line Size; Associativity (Direct Mapped,
    Set Associative, Fully Associative); Replacement Policy
    (Random, Least-Recently-Used (LRU), and the fact that there are others)
  - What happens on a read miss or on a write miss?
  - The concept of "coherence" -- keeping entries in multiple caches
    that reference the same object consistent, so that all caches see
    the same value for that object; invalidation and update protocols
    (a toy sketch of both appears after this outline)
- Logical vs. Physical Addresses
- Page Tables; L1 and L2 Translation Lookaside Buffers (TLBs)
- Placement of TLBs (before the L1 cache, so that caches are physically mapped)
- Concept of Virtual Memory (pages not resident in memory, but on disk)
- Current parameters: ~32B line size; 4KB page size; at least
  hundreds of KB of cache total, in 2-3 levels; at least hundreds of
  MB of main memory (under 1us access time); at least hundreds of GB
  of disk (~10ms access time). Why are TLBs getting to be a problem?
  (The address-splitting and TLB-reach arithmetic are sketched after
  this outline.)
- What this means for data access pattern speeds: the potential
  benefit of a[i][j] vs. a[j][i], and of an array of struct vs.
  multiple arrays (see the sketch after the timing results below)
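Two sketches follow before the access-order experiment. First, the
invalidation-vs.-update distinction from the coherence item above, as
a toy C model: two private "caches" each hold a copy of the same
object, and a write by one keeps the other consistent in one of the
two ways. Everything here (the struct, the function names) is made up
for illustration; no real protocol such as MESI is modeled.

#include <stdio.h>

/* Toy model of one cached object held in two caches, each with a
   valid bit. Purely illustrative; not a real coherence protocol. */
struct toy_cache { int value; int valid; };

struct toy_cache c0 = { 42, 1 }, c1 = { 42, 1 };

/* Invalidation protocol: a write marks the other copy invalid, so
   the other cache must miss and re-fetch to see the new value. */
void
write_invalidate(struct toy_cache *writer, struct toy_cache *other, int v)
{
	writer->value = v;
	other->valid = 0;
}

/* Update protocol: a write pushes the new value into the other
   copy, so it stays both valid and current. */
void
write_update(struct toy_cache *writer, struct toy_cache *other, int v)
{
	writer->value = v;
	other->value = v;
}

int
main(void)
{
	write_invalidate(&c0, &c1, 7);
	printf("invalidate: c1.valid=%d (must re-fetch to see 7)\n", c1.valid);
	c1.valid = 1;	/* pretend the re-fetch happened */
	write_update(&c1, &c0, 9);
	printf("update: c0.value=%d, c0.valid=%d\n", c0.value, c0.valid);
	return 0;
}

The usual tradeoff: invalidation sends less traffic per write, while
update keeps other readers from taking misses.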
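Second, the "current parameters" arithmetic. This sketch splits an
address into its paging and direct-mapped cache fields and computes
TLB reach. The 64KB cache size and 64-entry TLB are assumed values
chosen only to make the numbers concrete; they are not given above.

#include <stdio.h>

/* Line and page sizes are from the outline above; the cache size
   and TLB entry count are hypothetical, for illustration only. */
#define LINE_SIZE	32			/* bytes per line (~32B above) */
#define PAGE_SIZE	4096			/* bytes per page (4KB above) */
#define CACHE_SIZE	(64 * 1024)		/* assumed cache size */
#define NUM_LINES	(CACHE_SIZE / LINE_SIZE)
#define TLB_ENTRIES	64			/* assumed TLB size */

int
main(void)
{
	unsigned va = 0x12345678;	/* an arbitrary example address */

	/* Paging view: the low 12 bits are the page offset, the rest is
	   the virtual page number the TLB/page table must translate. */
	printf("VPN=0x%x, page offset=0x%x\n",
	       va / PAGE_SIZE, va % PAGE_SIZE);

	/* Direct-mapped cache view: tag, line index, byte-in-line. */
	printf("tag=0x%x, index=0x%x, line offset=0x%x\n",
	       va / (LINE_SIZE * NUM_LINES),
	       (va / LINE_SIZE) % NUM_LINES,
	       va % LINE_SIZE);

	/* "TLB reach": the TLB maps only TLB_ENTRIES * PAGE_SIZE bytes
	   at once -- here 64 * 4KB = 256KB. */
	printf("TLB reach = %u KB\n",
	       (unsigned)(TLB_ENTRIES * PAGE_SIZE) / 1024);
	return 0;
}

A 256KB reach is tiny next to hundreds of MB of main memory, and
every access outside it pays for a page table walk -- that mismatch
is why TLBs are getting to be a problem.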
So, how much does memory access order matter?
Consider the following program:
volatile double a[N][N];	/* volatile keeps the stores from being optimized away;
				   N is set at compile time, e.g., cc -DN=4096 (assumed flags) */

int
main(void)
{
	register int i, j;

a:	for (i=0; i<N; ++i) {
b:		for (j=0; j<N; ++j) {
			a[i][j] = 0;
		}
	}
	return 0;
}
Let's compile four versions of this for N=4096.
The first version, x0, is as shown above.
The second version, x1, swaps lines a: and b:.
The third and fourth versions, x2 and x3,
are like the first two versions, except that the loops
run backwards, using for (i=N-1; i>=0; --i) { and
for (j=N-1; j>=0; --j) {.
On an Athlon 64 3200+ processor, the user times are:
Program    User Time (seconds)
x0         0.136
x1         1.652
x2         0.148
x3         1.632
In summary, the best memory access order was 12X faster than
the worst one!
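The experiment above covers a[i][j] vs. a[j][i]; the "array of struct
vs. multiple arrays" point from the outline works the same way. Here
is a minimal sketch, with a made-up struct point type: scanning one
field of an array of 32-byte structs drags a whole ~32B line through
the cache for each useful 8-byte value, while parallel arrays pack
four useful values into every line.

#include <stdio.h>

#define N 4096

/* Hypothetical record type, invented purely for illustration;
   four doubles make each element 32 bytes -- one whole line. */
struct point { double x, y, z, w; };

/* Array of struct: every ~32B line scanned for x also carries
   the unwanted y, z, and w fields. */
volatile struct point aos[N];

/* Multiple ("parallel") arrays: a line holds four useful x values,
   so the same scan touches 4X fewer lines. */
volatile double xs[N], ys[N], zs[N], ws[N];

int
main(void)
{
	double sa = 0, ss = 0;
	register int i;

	for (i = 0; i < N; ++i)
		sa += aos[i].x;		/* 32-byte stride: one x per line */
	for (i = 0; i < N; ++i)
		ss += xs[i];		/* 8-byte stride: four x per line */

	printf("%g %g\n", sa, ss);
	return 0;
}

If the loop instead touched every field of each element, the array of
struct would be the better layout -- which way the benefit goes
depends on the access pattern.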
Reference: Patterson & Hennessy, Computer Organization and Design.