# oscarbonilla.com

## Fun with Caches

with one comment

A recent thread about latency numbers on Hacker News reminded me that I had meant to write a few blog posts about caches that I never did get around to. So I took a trip down to the cellar (what I call my “Drafts” folder), sniffed this post and decided it’s ready.

I first saw the technique used for generating these latency numbers in exercise 5.2 on page 476 of Computer Architecture: A Quantitative Approach[1] by Hennessy and Patterson.

The basic idea is that you write a program to walk a contiguous region of memory using different strides and time how long accessing the memory takes. The idea for this exercise comes from the Ph.D. dissertation of Rafael Héctor Saavedra-Barrera, where he describes the following approach:

Assume that a machine has a cache capable of holding D 4-byte words, a line size of b words, and an associativity a. The number of sets in the cache is given by D/ab. We also assume that the replacement algorithm is LRU, and that the lowest available address bits are used to select the cache set.

Each of our experiments consists of computing many times a simple floating-point function on a subset of elements taken from a one-dimensional array of N 4-byte elements. This subset is given by the following sequence: 1, s + 1, 2s + 1, …, N – s + 1. Thus, each experiment is characterized by a particular value of N and s. The stride s allows us to change the rate at which misses are generated, by controlling the number of consecutive accesses to the same cache line, cache set, etc. The magnitude of s varies from 1 to N/2 in powers of two.

He goes on to note

Depending on the magnitudes of N and s in a particular experiment, with respect to the size of the cache (D), the line size (b), and the associativity (a), there are four possible regimes of operations; each of these is characterized by the rate at which misses occur in the cache. A summary of the characteristics of the four regimes is given in table 5.1.

And table 5.1 helpfully summarizes the regimes.

Size of ArrayStrideFrequency of MissesTime per Iteration
1 ≤ N ≤ D1 ≤ s ≤ N/2no missesT
D < N1 ≤ s < bone miss every b/s elementsT + Ms/b
D < Nb ≤ s < N/aone miss every elementT + M
D < NN/a ≤ s ≤ N/2no missesT

I thought it would be fun to try it, so I wrote a program to do that. If you want to download it, go to github, but the guts of it is this function:

```void sequential_access(u32 cache_max) { u32 register i, j, stride; u32 steps, csize, limit; double sec0, sec;   for (csize=CACHE_MIN; csize <= cache_max; csize*=2) { for (stride=1; stride <= csize/2; stride=stride*2) { sec = 0.0; limit = csize - stride + 1; steps = 0; do { sec0 = timestamp(); for (i=SAMPLE*stride; i > 0; i--) { for (j=1; j <= limit; j+=stride) { buffer[j] = buffer[j] + 1; } } steps++; sec += timestamp() - sec0; } while (sec < 1.0);   printf("%lu\t%lu\t%.1lf\n", stride * sizeof(u32), csize * sizeof(u32), sec*1e9/(steps*SAMPLE*stride*((limit-1)/stride+1))); fflush(stdout); } printf("\n\n"); fflush(stdout); } }```

As you can see, it’s a very simple program that times a loop accessing a cache in different strides. Just like the exercise in Hennessy and Patterson and just like the description in Saavedra-Barrera’s dissertation.

I ran this program on my machine after rebooting in single-user mode like Hennessy and Patterson suggest so that virtual addresses track physical addresses a little closer, and with a little help from gnuplot, I got this:

Sequential Access

You can tell a lot by just glancing at that graph. You can see what the cacheline size is, what the sizes of the L1 and L2 caches are, what the page size is, the associativity of the cache, the TLB size.  Here is the data used to create that graph and here is the script.

One interesting tidbit is that modern processor architectures cache aggressively, so you need to randomize the accesses to memory to get accurate timings.

Random Access

Another thing that might be of mild interest to people is that I wrote an EFI version of the program that bypasses the Operating System (but not the EFI VM). There are instructions about how to run it in the ‘efi’ subdirectory.

1. the second edition []

Written by ob

June 5th, 2012 at 5:32 pm

## Another useless C99 tidbit

without comments

From page 18 of the C99 standard:

All occurrences in a source ﬁle of the following sequences of three characters (called trigraph sequences) are replaced with the corresponding single character.

```??= #    ??( [     ??/ \
??) ]      ??' ^     ??< {
??! |      ??> }    ??- ~
```

No other trigraph sequences exist. Each ? that does not begin one of the trigraphs listed above is not changed.

Ok, so take the following C program:

```??=include <stdio.h> int main(int argc, char *argv??(??)) ??< printf("hello world\n"); ??>```

And compile it with the `-trigraphs` switch to gcc:

```dirac src \$ gcc -trigraphs -o trigraphs trigraphs.c dirac src \$ ./trigraphs hello world```

Combined with this you could seriously obfuscate your C code.

Written by ob

April 13th, 2008 at 11:50 pm

Posted in C

Tagged with ,