References: EE380 Memories (Chapter 7)
The book does a very good job of explaining memory structures;
read it. Emphasis here should be on cache and TLB concepts,
because these are the things that are truly hardware... you'll
see more about demand-paged virtual memory if you take an OS
course. Here's a quick outline for what you should have a basic
understanding of:
-
Memory Hierarchy levels and characteristics:
Registers, L1 Cache, L2 Cache, Split I/D Caches,
Main Memory, Disk
-
Spatial vs. Temporal Locality
-
Cache Concepts
-
Hit, Miss, Hit Ratio
-
Cache Organization: Line Size; Associativity (Direct Mapped,
Set Associative, Fully Associative); Replacement Policy
(Random, Least-Recently-Used (LRU), the fact that there are others)
-
What happens for a read miss or for a write miss?
-
The concept of "coherence" -- keeping entries in multiple caches
that reference the same object consistent so that all see the same
value for that object; invalidation and update protocols
-
Logical vs. Physical Addresses
-
Page Tables; L1 and L2 Translation Lookaside Buffers (TLBs)
-
Placement of TLBs (before L1 cache, so caches are physically mapped)
-
Concept of Virtual Memory (pages not resident in memory, but
on disk)
-
Current parameters:
~32B line size,
4KB page size,
at tens tens of KB of cache,
at least hundreds of MB of main memory (under 1us access time),
at least hundreds of GB of disk (~10ms access time);
why are TLBs getting to be a problem?
-
What this means for data access pattern speeds:
potential benefit of a[i][j] vs. a[j][i],
potential benefit for array of struct vs. multiple arrays
So, how much does memory access order matter?
Consider the following program:
volatile double a[N][N];
main()
{
register int i, j;
a: for (i=0; i<N; ++i) {
b: for (j=0; j<N; ++j) {
a[i][j] = 0;
}
}
}
Let's compile four versions of this for N=4096.
The first version, x0, is as shown above.
The second version, x1, swaps lines a: and b:.
The third and fourth versions, x2 and x3,
are like the first two versions, except in that the loops
run backwards, using for (i=N-1; i>=0; --i) { and
for (j=N-1; j>=0; --j) {.
On an Athlon 64 3200+ processor, the user times are:
| Program |
User Time (seconds) |
| x0 |
0.136 |
| x1 |
1.652 |
| x2 |
0.148 |
| x3 |
1.632 |
In summary, the best memory access order was 12X faster than
the worst one!
Computer Organization and Design.