EE380 Memories
The book does a very good job of explaining memory structures;
read it. Emphasis here should be on cache and TLB concepts,
because these are the things that are truly hardware... you'll
see more about demand-paged virtual memory if you take an OS
course. Here's a quick outline of what you should have a basic
understanding of:
- Memory Hierarchy levels and characteristics: Registers, L1 Cache,
  L2 Cache, Split I/D Caches, Main Memory, Disk (including RAID --
  Redundant Array of Inexpensive Disks)
- Spatial vs. Temporal Locality
- Cache Concepts:
  - Hit, Miss, Hit Ratio (together with the miss penalty, these give
    the average access time: hit time + miss ratio * miss penalty)
  - Cache Organization: Line Size; Associativity (Direct Mapped,
    Set Associative, Fully Associative); Replacement Policy
    (Random, Least-Recently-Used (LRU), and the fact that there are others)
  - What happens on a read miss or on a write miss?
  - The concept of "coherence" -- keeping entries in multiple caches
    that reference the same object consistent, so that all caches see
    the same value for that object; invalidation and update protocols
    (a toy sketch of both appears after this outline)
- Logical vs. Physical Addresses
- Page Tables; L1 and L2 Translation Lookaside Buffers (TLBs)
- Placement of TLBs (before the L1 cache, so that caches are physically mapped)
- Concept of Virtual Memory (pages not resident in memory, but on disk)
- Current parameters: ~32B line size; 4KB page size; at least
  hundreds of KB of cache total, in 2-3 levels; at least hundreds of
  MB of main memory (under 1us access time); at least hundreds of GB
  of disk (~10ms access time). Why are TLBs getting to be a problem?
  (The address-splitting and TLB-reach arithmetic are sketched after
  this outline.)
- What this means for data access pattern speeds: the potential
  benefit of a[i][j] vs. a[j][i], and of an array of struct vs.
  multiple arrays (see the sketch after the timing results below)
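Two sketches follow before the access-order experiment. First, the
invalidation-vs.-update distinction from the coherence item above, as
a toy C model: two private "caches" each hold a copy of the same
object, and a write by one keeps the other consistent in one of the
two ways. Everything here (the struct, the function names) is made up
for illustration; no real protocol such as MESI is modeled.

#include <stdio.h>

/* Toy model of one cached object held in two caches, each with a
   valid bit. Purely illustrative; not a real coherence protocol. */
struct toy_cache { int value; int valid; };

struct toy_cache c0 = { 42, 1 }, c1 = { 42, 1 };

/* Invalidation protocol: a write marks the other copy invalid, so
   the other cache must miss and re-fetch to see the new value. */
void
write_invalidate(struct toy_cache *writer, struct toy_cache *other, int v)
{
	writer->value = v;
	other->valid = 0;
}

/* Update protocol: a write pushes the new value into the other
   copy, so it stays both valid and current. */
void
write_update(struct toy_cache *writer, struct toy_cache *other, int v)
{
	writer->value = v;
	other->value = v;
}

int
main(void)
{
	write_invalidate(&c0, &c1, 7);
	printf("invalidate: c1.valid=%d (must re-fetch to see 7)\n", c1.valid);
	c1.valid = 1;	/* pretend the re-fetch happened */
	write_update(&c1, &c0, 9);
	printf("update: c0.value=%d, c0.valid=%d\n", c0.value, c0.valid);
	return 0;
}

The usual tradeoff: invalidation sends less traffic per write, while
update keeps other readers from taking misses.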
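Second, the "current parameters" arithmetic. This sketch splits an
address into its paging and direct-mapped cache fields and computes
TLB reach. The 64KB cache size and 64-entry TLB are assumed values
chosen only to make the numbers concrete; they are not given above.

#include <stdio.h>

/* Line and page sizes are from the outline above; the cache size
   and TLB entry count are hypothetical, for illustration only. */
#define LINE_SIZE	32			/* bytes per line (~32B above) */
#define PAGE_SIZE	4096			/* bytes per page (4KB above) */
#define CACHE_SIZE	(64 * 1024)		/* assumed cache size */
#define NUM_LINES	(CACHE_SIZE / LINE_SIZE)
#define TLB_ENTRIES	64			/* assumed TLB size */

int
main(void)
{
	unsigned va = 0x12345678;	/* an arbitrary example address */

	/* Paging view: the low 12 bits are the page offset, the rest is
	   the virtual page number the TLB/page table must translate. */
	printf("VPN=0x%x, page offset=0x%x\n",
	       va / PAGE_SIZE, va % PAGE_SIZE);

	/* Direct-mapped cache view: tag, line index, byte-in-line. */
	printf("tag=0x%x, index=0x%x, line offset=0x%x\n",
	       va / (LINE_SIZE * NUM_LINES),
	       (va / LINE_SIZE) % NUM_LINES,
	       va % LINE_SIZE);

	/* "TLB reach": the TLB maps only TLB_ENTRIES * PAGE_SIZE bytes
	   at once -- here 64 * 4KB = 256KB. */
	printf("TLB reach = %u KB\n",
	       (unsigned)(TLB_ENTRIES * PAGE_SIZE) / 1024);
	return 0;
}

A 256KB reach is tiny next to hundreds of MB of main memory, and
every access outside it pays for a page table walk -- that mismatch
is why TLBs are getting to be a problem.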
So, how much does memory access order matter?
Consider the following program:
volatile double a[N][N];	/* volatile keeps the stores from being optimized away;
				   N is set at compile time, e.g., cc -DN=4096 (assumed flags) */

int
main(void)
{
	register int i, j;

a:	for (i=0; i<N; ++i) {
b:		for (j=0; j<N; ++j) {
			a[i][j] = 0;
		}
	}
	return 0;
}
Let's compile four versions of this for N=4096.
The first version, x0, is as shown above.
The second version, x1, swaps lines a: and b:.
The third and fourth versions, x2 and x3,
are like the first two versions, except that the loops
run backwards, using for (i=N-1; i>=0; --i) { and
for (j=N-1; j>=0; --j) {.
On an Athlon 64 3200+ processor, the user times are:
Program    User Time (seconds)
x0         0.136
x1         1.652
x2         0.148
x3         1.632
In summary, the best memory access order was 12X faster than
the worst one!
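The experiment above covers a[i][j] vs. a[j][i]; the "array of struct
vs. multiple arrays" point from the outline works the same way. Here
is a minimal sketch, with a made-up struct point type: scanning one
field of an array of 32-byte structs drags a whole ~32B line through
the cache for each useful 8-byte value, while parallel arrays pack
four useful values into every line.

#include <stdio.h>

#define N 4096

/* Hypothetical record type, invented purely for illustration;
   four doubles make each element 32 bytes -- one whole line. */
struct point { double x, y, z, w; };

/* Array of struct: every ~32B line scanned for x also carries
   the unwanted y, z, and w fields. */
volatile struct point aos[N];

/* Multiple ("parallel") arrays: a line holds four useful x values,
   so the same scan touches 4X fewer lines. */
volatile double xs[N], ys[N], zs[N], ws[N];

int
main(void)
{
	double sa = 0, ss = 0;
	register int i;

	for (i = 0; i < N; ++i)
		sa += aos[i].x;		/* 32-byte stride: one x per line */
	for (i = 0; i < N; ++i)
		ss += xs[i];		/* 8-byte stride: four x per line */

	printf("%g %g\n", sa, ss);
	return 0;
}

If the loop instead touched every field of each element, the array of
struct would be the better layout -- which way the benefit goes
depends on the access pattern.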
Reference: Patterson & Hennessy, Computer Organization and Design.