Before we talk about cache-friendly data structures, let’s briefly recap what a cache is. In general computing terms, a cache is a (usually) transient store of data that allows a program to access that data quickly. A cache is typically deployed in front of some slower storage medium. Some examples of caches are:
- On-disk cache to speed up HDD access
- Kernel page-cache to reduce accesses to storage
- Web application cache of mostly read data
- Multi-level CPU cache to avoid fetching data from main memory
In each of these examples, the cache provides faster access than that obtained by going directly to the source of the data. Caches can help to reduce response time in many parts of a system. Of course, with a cache comes the problem of invalidation and the possibility of stale data, but that is a topic for another time.
In this article, we are primarily interested in the multi-level CPU cache present in all modern processors. We will be looking at the large caches available to multi-core server-class processors.
The Cache Hierarchy
Server-class CPUs contain varying levels of cache. A common configuration is to have three levels of cache: level 1 (L1), level 2 (L2), and level 3 (L3, also known as the last-level cache, or LLC). The L1 cache is closest to the processor and has the lowest access times, while the L3 cache is farthest from the processor, closest to main memory, and has the highest access times of the three caches.
With greater speed comes greater expense, and so the fastest caches are the smallest. Intel’s Comet Lake CPUs contain 64 KB of L1 cache and 256 KB of L2 cache per core, and up to 20 MB of L3 cache, shared between all cores.
The L1 cache is divided into two banks, one for instructions (L1I), and one for data (L1D).
Data Access Patterns
Memory is transferred between main memory and the caches in units called cache lines. These 64-byte chunks of data are the atomic unit of transfer. If our program wants to alter the data at memory address 4096, then the CPU will arrange for the 64 bytes at addresses 4096–4159 to be present in the L1 cache of the requesting core.
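The address arithmetic is simple: with 64-byte lines, the line holding a given byte address starts at that address rounded down to a multiple of 64. A small sketch (class and method names are illustrative):

```java
// Illustrative helper: which 64-byte cache line does a byte address fall on?
public class CacheLine {
    static final long LINE_SIZE = 64;

    // First byte of the line containing `address`: clear the low 6 bits.
    static long lineStart(long address) {
        return address & ~(LINE_SIZE - 1);
    }

    // Last byte of the same line.
    static long lineEnd(long address) {
        return lineStart(address) + LINE_SIZE - 1;
    }
}
```

So an access to address 4096 (or 4100, or any address up to 4159) pulls in the line covering 4096–4159.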
Processors employ different strategies to try to improve memory access speed; one of these is pre-fetching, where the CPU will load cache lines adjacent to those being accessed, in the hope that they will be required in the near future.
Our programs can take advantage of this pre-fetching by attempting to make predictable, sequential memory accesses. CPUs are also good at recognising regular “strides” in access patterns (e.g. accessing every 256th byte). We can see that to take advantage of these effects, our data needs to be laid out in memory in a way that is sympathetic to the pre-fetcher, giving us better cache locality.
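The effect of a sympathetic layout can be seen by traversing the same 2D array in two different orders. This is only a sketch (the timing is crude; a real measurement would use a harness such as JMH): row-order walks memory sequentially, while column-order jumps between rows on every access.

```java
// Sketch: sequential vs strided traversal of a 2D array.
// In Java, a long[][] is an array of row arrays; elements within a row are
// contiguous, so row-order traversal is the pre-fetcher-friendly order.
public class TraversalOrder {
    static final int N = 2048;

    static long sumRowOrder(long[][] data) {
        long sum = 0;
        for (int row = 0; row < N; row++) {
            for (int col = 0; col < N; col++) {
                sum += data[row][col]; // sequential: adjacent elements share cache lines
            }
        }
        return sum;
    }

    static long sumColumnOrder(long[][] data) {
        long sum = 0;
        for (int col = 0; col < N; col++) {
            for (int row = 0; row < N; row++) {
                sum += data[row][col]; // strided: each access lands in a different row array
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] data = new long[N][N];
        for (int row = 0; row < N; row++) {
            for (int col = 0; col < N; col++) {
                data[row][col] = row + col;
            }
        }
        // Crude timing only; both sums are identical, but the row-order
        // traversal typically completes several times faster.
        long t0 = System.nanoTime();
        long a = sumRowOrder(data);
        long t1 = System.nanoTime();
        long b = sumColumnOrder(data);
        long t2 = System.nanoTime();
        System.out.printf("row-order:    %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("column-order: %d ms%n", (t2 - t1) / 1_000_000);
        if (a != b) throw new AssertionError();
    }
}
```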
Example: A HashMap
The HashMap is a standard data structure in use in many programs, with implementations in almost all language standard libraries. A map associates keys with values. In this example, we will look at a map that maps 64-bit integer keys to 64-bit integer values.
HashMaps must deal with hash collisions, where more than one key hashes to the same slot in the data structure. One approach to this, known as separate chaining, is to maintain a chain of entries at a given slot:
The problem here is that performing lookups or inserts involves first finding the correct index in the HashMap, then iterating through a list of entries to determine whether the key is present. Each of these operations requires memory accesses to unpredictable addresses.
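A minimal sketch of such a chained map makes the pointer-chasing visible (class and field names here are illustrative, not from any particular library). Each entry is a separate heap allocation, so walking a chain hops between unrelated addresses:

```java
// Illustrative chained hash map for long keys and values.
// Each Entry lives at its own heap address, so following `next` pointers
// during a lookup touches unpredictable cache lines.
public class ChainedMap {
    static final class Entry {
        final long key;
        long value;
        Entry next; // collision chain: each hop is a potential cache miss
        Entry(long key, long value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry[] buckets;
    private final int mask;

    ChainedMap(int capacityPowerOfTwo) {
        buckets = new Entry[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    void put(long key, long value) {
        int index = Long.hashCode(key) & mask;
        for (Entry e = buckets[index]; e != null; e = e.next) {
            if (e.key == key) { e.value = value; return; }
        }
        buckets[index] = new Entry(key, value, buckets[index]);
    }

    Long get(long key) {
        int index = Long.hashCode(key) & mask;
        for (Entry e = buckets[index]; e != null; e = e.next) { // pointer-chasing
            if (e.key == key) return e.value;
        }
        return null; // not present
    }
}
```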
A more cache-friendly method is to use open-address hashing, where only one entry is stored per array index. When a collision occurs, the next available slot is used for the incoming key (a strategy known as linear probing). Since only a single entry is stored per index, we can use primitive arrays to store the keys and values:
When performing a lookup, adjacent key entries are scanned to find the matching slot. Since the primitive array will be laid out contiguously in memory, it is likely that multiple keys will already be in the L1 data cache after the first comparison. This makes scanning for values where collisions occur much faster.
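As a sketch, the open-addressed variant might look like the following (illustrative names; key 0 is reserved as the "empty" marker, and there is no resizing, so the table must be kept below capacity):

```java
// Illustrative open-addressing hash map with linear probing.
// Keys and values live in two parallel long[] arrays, so probing adjacent
// slots scans contiguous memory rather than chasing pointers.
public class OpenAddressMap {
    private static final long EMPTY = 0L; // key 0 is reserved as "empty"
    private final long[] keys;
    private final long[] values;
    private final int mask;

    OpenAddressMap(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        values = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    void put(long key, long value) {
        int index = Long.hashCode(key) & mask;
        while (keys[index] != EMPTY && keys[index] != key) {
            index = (index + 1) & mask; // probe the next slot: adjacent memory
        }
        keys[index] = key;
        values[index] = value;
    }

    long get(long key, long missingValue) {
        int index = Long.hashCode(key) & mask;
        while (keys[index] != EMPTY) {
            if (keys[index] == key) return values[index];
            index = (index + 1) & mask; // likely already in the L1 cache
        }
        return missingValue;
    }
}
```

The probe loop reads consecutive elements of `keys`, which is exactly the sequential pattern the pre-fetcher rewards.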
Now we can consider further optimisation. Since our use-case is a map of two 64-bit values, we can model the entire data structure as a single array, where keys and values are interleaved:
In this case, accessing a key slot will cause a 64-byte cache line to be loaded into the L1 data cache. The corresponding value will be on the same line (in most cases), so retrieval is incredibly cheap. In the case of a collision, the adjacent keys and values are also likely to be already in the L1 cache, or being pre-fetched by the CPU.
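The interleaved layout changes only the storage scheme of the previous sketch: one `long[]` holding `[key0, value0, key1, value1, …]`, so a key and its value almost always share a cache line (again, names are illustrative and key 0 is reserved as "empty"):

```java
// Illustrative interleaved open-addressing map: keys at even indices,
// values at the following odd indices of a single long[].
public class InterleavedMap {
    private static final long EMPTY = 0L; // key 0 is reserved as "empty"
    private final long[] entries;         // [key0, value0, key1, value1, ...]
    private final int mask;               // mask over the number of slots

    InterleavedMap(int slotsPowerOfTwo) {
        entries = new long[slotsPowerOfTwo * 2];
        mask = slotsPowerOfTwo - 1;
    }

    void put(long key, long value) {
        int slot = Long.hashCode(key) & mask;
        while (entries[slot * 2] != EMPTY && entries[slot * 2] != key) {
            slot = (slot + 1) & mask; // linear probe over adjacent pairs
        }
        entries[slot * 2] = key;
        entries[slot * 2 + 1] = value; // usually on the same cache line as the key
    }

    long get(long key, long missingValue) {
        int slot = Long.hashCode(key) & mask;
        while (entries[slot * 2] != EMPTY) {
            if (entries[slot * 2] == key) {
                return entries[slot * 2 + 1]; // value is adjacent to the key
            }
            slot = (slot + 1) & mask;
        }
        return missingValue;
    }
}
```

With 64-byte lines and 16-byte key/value pairs, a single line fetch brings in four adjacent pairs, so a short probe sequence after a collision often touches no new memory at all. Agrona's `Long2LongHashMap` uses this kind of interleaved layout in practice.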
Cache-friendly data structures exploit spatial locality in memory to take advantage of modern CPU features such as pre-fetching. By increasing locality and reducing the effects of pointer-chasing, these structures can outperform standard solutions by orders of magnitude. One term for this approach is “Data-Oriented Design”: the process of optimising the memory layout of data structures.