The rule of thumb says: prefer the highest possible level for parallelization. This tends to minimize synchronization overheads, that can easily dominate in execution time of improperly parallelized program. Path classes represent the natural partitioning of work for parallelization. Due to the arithmetical equation (1) we can calculate number of paths in each class, and length of paths in each class is also known. So we can determine position of each class in the output array, and fire off generation of paths for each class as a separate and completely independent parallel task.

However, analysis shows that this is insufficient partitioning. For example, for the point (12,12) we have following path classes:

12-0-12 paths: 2704156

11-1-11 paths: 16224936

10-2-10 paths: 42678636

9-3-9 paths: 64664600

8-4-8 paths: 62355150

7-5-7 paths: 39907296

6-6-6 paths: 17153136

5-7-5 paths: 4900896

4-8-4 paths: 900900

3-9-3 paths: 100100

2-10-2 paths: 6006

1-11-1 paths: 156

0-12-0 paths: 1

That's only 13 tasks with highly uneven work per task.

Another rule of thumb says: task granularity must be between ~2 microseconds and ~100 microseconds. Provided finer-grained partitioning (even smaller tasks), task scheduling overheads dominate in execution time. Provided coarser-grained partitioning (even larger tasks), program may not yield enough parallelism to load all available threads. Important consequence: partitioning (task granularity) should not be driven by input data (otherwise one looses control over task granularity).(for details see Granularity and Parallel Performance chapter from Intel Guide for Developing Multithreaded Applications)

So I employed another level of partitioning: each path class is further divided into smaller tasks. Size of tasks does not depend on input data this time, instead it's roughly constant in absolute size to satisfy the aforementioned constraints. I choose size of the generated output as task size metric, i.e., for example, each task generates ~1MB of output data (this is quite accurate characteristic of execution time for this problem) . Here is the pseudo-code of partitioning:

void partition_path_class(task_t task)


size_t total_output_memory = 0;

size_t prev_entry_index = 0;

std::vector<dispatch_entry> const& entries = get_dispatch_entries(task.x, task.y, task.d);

// iterate over all dispatch entries and split them into ranges,

// so that each range generates ~partitioning_threshold of output.

// fire off each range as separate task

for (size_t i = 0; i != entries.size(); i += 1)


size_t output_memory = calculate_output_memory(task, entries[i]);

total_output_memory += output_memory;

if (total_output_memory >= partitioning_threshold || i == entries.size() - 1)


// spawn a task to process entries [prev_entry_index, i]

spawn_task(prev_entry_index, i);

prev_entry_index = i + 1;




When parallelizing a program there are always some obligatory things to check. Namely, loсk contention, heap contention and excessive data sharing (including false sharing). Each of which, if not resolved, can easily prevent scaling. Analysis shows that my program in not amenable to all 3 things: there is not loсks, there is no dynamic memory allocation during work, and there is no data sharing.

Next page: Fighting The Memory Bandwidth Problem