So, how to apply cache-oblivious recursive division to our pairwise predicate problem?
We divide our dataset into 2 equal parts, and then recursively proceed with 4 subproblems: application of a predicate to elements from parts (I, I), (I, II), (II, I) and (II, II):
And so on recursively. Note that "upper" and "lower" datasets on the picture are different in the general case (consider the case when we recursively proceed with parts (I, II)).
Generally we do no need to do recursive decomposition down to the bitter end of 1 element partitions. We may stop the recursion once subproblem dataset size guaranteedly fits into the lowest level cache that we want to account for. And then we can reuse our straightforward computational kernel which is potentially highly optimized (with vector SSE instructions, for example).
Let's put it all together into the code; for parallelization I use Cilk Plus technology here (note that recursive decomposition plays very nicely with parallelization - we can execute each subproblem in parallel):
size_t calculate_recursive_parallel(item_t const* begin1, size_t count1,
if (count1 + count2 > 256)
size_t res1 = _Cilk_spawn calculate_recursive_parallel
size_t res2 = _Cilk_spawn calculate_recursive_parallel
size_t res3 = _Cilk_spawn calculate_recursive_parallel
size_t res4 = _Cilk_spawn calculate_recursive_parallel
return res1 + res2 + res3 + res4;
return kernel(begin1, begin1 + count1, begin2, begin2 + count2);
}Yes, so many words but the code is that simple. Now let's analyze number of memory transfers MT for this algorithm. Main recursive relationship is MT(N)=4*MT(N/2), and we stop recursion when data fits into cache, that is, MT(C/2)=C/B (where C is cache size expressed in cache lines, B is cache line size, and C is divided by 2 because we need to fit 2 arrays into cache) (note that we stop the recursion only in the analysis of MT, computations proceed further, but it's irrelevant here because we don't do any memory accesses starting from that point). My math shows that the solution is MT(N)=O(N2/CB). That is, we reduce amount of memory access by C (which is actually quite large constant, for 64B cache line size and 64KB L1 cache C=1K cache lines, and for 6MB L3 cache C=96K cache lines) as compared to the straightforward algorithm. So now we are doing one memory access per only CB computations.
Let's take a look at the performance graph in the same setup:
As can be seen, cache-oblivious version is not only faster on single thread, it also shows perfect linear scalability (because memory access bottleneck is eliminated).
I've showed how to construct the cache-oblivious algorithm for the simple each-to-each O(N2) problem. It's not always that easy, however, the general pattern of recursive decomposition always holds. For example, below is a hint of how you may approach matrix multiplication in a cache-oblivious way:
Note, for it to work optimally you need to layout data in a recursive way as well so that it follows recursive algorithm structure:
Hope now you get the overall idea. You may also watch the following MIT OpenCourseware lectures on the topic: