The idea behind cache-oblivious algorithms is efficient usage of processor caches and reduction of memory bandwidth requirements. Both are important for single-threaded algorithms, but they are especially crucial for parallel algorithms, because available memory bandwidth is usually shared between hardware threads and frequently becomes a bottleneck for scalability. The name may be somewhat misleading: these algorithms are not oblivious to the presence of caches (quite the opposite), they are oblivious to the particular cache hierarchy and cache parameters, and make efficient use of whatever hierarchy and parameters happen to be present.
In order to understand what it's all about, let's consider a typical cache structure in modern computers:
Yes, in some sense memory is no more than a cache for slow disks (while disks are caches for the network, and so on). Each lower level is more expensive and thus has lower capacity; however, it's also faster and, in particular, has significantly lower access latency. An L1 cache typically has a latency of a few processor cycles, and by and large each subsequent level has roughly an order of magnitude larger access latency.
There is another dimension to the structure: some caches are private to cores, while others are shared between several cores. Modern processors usually have an L1 cache (and potentially an L2 cache) per core, while the L3 cache is usually shared between all cores in a processor.
So, without loss of generality let's consider a bit simpler structure:
Here I use the terms "cache" and "memory" in a generalized sense: they may represent the real L3 cache and memory, or the L1 cache and the L2 cache, or memory and disk, and so on (in reality the scheme applies several times recursively). So, we assume that "cache" is fast and private, while "memory" is slow and shared, and hereinafter we will analyze and optimize accesses to slow shared memory.
Now let's consider the following problem. We have an array of N elements, and want to apply a pairwise predicate to each possible pair of elements:
bool predicate(item_t const* i1, item_t const* i2)
{
    return ((i1->x + i2->y) * i2->x == (i1->y + i2->x) * i2->y);
}

std::vector<item_t> data (N);
The straightforward way to accomplish the task is to use doubly nested loops, and the straightforward way to parallelize it is to put "#pragma omp parallel for" over the outer loop:
size_t kernel(item_t const* begin1, item_t const* end1,
              item_t const* begin2, item_t const* end2)
{
    size_t count = 0;
    for (item_t const* pos1 = begin1; pos1 != end1; pos1 += 1)
        for (item_t const* pos2 = begin2; pos2 != end2; pos2 += 1)
            if (predicate(pos1, pos2))
                count += 1;
    return count;
}
size_t calculate_straightforward_parallel(item_t const* begin, size_t count,
                                          unsigned thread_count)
{
    size_t res = 0;
#   pragma omp parallel for reduction(+:res) num_threads(thread_count)
    for (int i = 0; i < (int)count; i += 1)
        res += kernel(begin + i, begin + i + 1, begin, begin + count);
    return res;
}
Let's analyze the number of memory transfers MT(N) between memory and cache. It's quite evident that MT(N) = Θ(N^2/B), where B is the size of the block used for transfers (the cache line size, measured in elements); that is, we are reloading the whole array on each iteration of the outer loop (I assume that the data does not fit into cache entirely). So, is that good or not? If we compare it to the computational complexity O(N^2), we may conclude that it is not all that bad. After all, we are doing O(1/B) memory transfers per unit of computation; that is, we perfectly exploit spatial locality - once a cache line is loaded, we use all the data contained in it.
Let's take a look at the performance graph on a 4-processor x 4-core AMD machine (dataset size is 150K elements):
As can be seen, scalability beyond 3 threads is mediocre, and there is no scalability beyond 9 threads. What is the reason? The reason is that all threads constantly access memory, which is a shared resource (read: a potential bottleneck). Indeed, each thread must do a memory transfer every B computations.
So, what have we missed during the design of our algorithm? We've missed the possibility of data reuse in cache based on temporal locality. That is, once we've loaded some data into cache, we want to do as much computation on it as possible (ideally, all of it). And here cache-oblivious algorithms kick into action.
Cache-oblivious algorithms work by recursively dividing a problem's dataset into smaller parts and then doing as much computation on each part as possible. Eventually a subproblem's dataset fits into cache, and we can do a significant amount of computation on it without accessing memory:
Note that we don't need to know the exact cache hierarchy/parameters for this to work. We just recursively divide the dataset, and it inevitably eventually fits into the L3 cache, then into L2, and then into L1. Also note that when we are done with a part that fits into L1 and need to load a new part into L1, the load is satisfied from L2 rather than from memory (and the same holds for L2). So such recursive division automatically exploits whatever cache hierarchy we have.
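Here is a sketch of such recursive division applied to our pairwise problem. The exact layout of item_t and the leaf cutoff constant are assumptions (the recursion does not depend on the cutoff's precise value, it only has to be small enough that two leaf blocks fit into the innermost cache); the kernel is the same dense doubly nested loop as before:

```cpp
#include <cstddef>
#include <vector>

// The article does not show item_t's exact definition; it only needs
// the x and y fields used by the predicate.
struct item_t { long x, y; };

bool predicate(item_t const* i1, item_t const* i2)
{
    return ((i1->x + i2->y) * i2->x == (i1->y + i2->x) * i2->y);
}

// Dense kernel: count pairs from [begin1,end1) x [begin2,end2)
// that satisfy the predicate.
size_t kernel(item_t const* begin1, item_t const* end1,
              item_t const* begin2, item_t const* end2)
{
    size_t count = 0;
    for (item_t const* pos1 = begin1; pos1 != end1; pos1 += 1)
        for (item_t const* pos2 = begin2; pos2 != end2; pos2 += 1)
            if (predicate(pos1, pos2))
                count += 1;
    return count;
}

// Cache-oblivious recursion: repeatedly split the larger of the two
// ranges in half until both are small, then run the dense kernel on
// the leaf. Each half is still paired with the whole other range, so
// every pair of elements is visited exactly once.
size_t calculate_recursive(item_t const* begin1, item_t const* end1,
                           item_t const* begin2, item_t const* end2)
{
    size_t const cutoff = 1024;  // leaf size: an assumed small constant
    size_t n1 = end1 - begin1;
    size_t n2 = end2 - begin2;
    if (n1 <= cutoff && n2 <= cutoff)
        return kernel(begin1, end1, begin2, end2);
    if (n1 >= n2)
    {
        item_t const* mid = begin1 + n1 / 2;
        return calculate_recursive(begin1, mid, begin2, end2)
             + calculate_recursive(mid, end1, begin2, end2);
    }
    else
    {
        item_t const* mid = begin2 + n2 / 2;
        return calculate_recursive(begin1, end1, begin2, mid)
             + calculate_recursive(begin1, end1, mid, end2);
    }
}
```

In a leaf call both blocks fit into the innermost cache, so the dense kernel runs without touching memory; parallelism can then be layered on top, for example by turning the two recursive calls into OpenMP tasks.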