Speeding Up

The implementation is quite good. However, we can do better.

The recursive generation procedure generate_recursive() contains a lot of unpredictable branching and does very little work per function call. Both are bad for performance: unpredictable branches stall the deep pipelines of modern microprocessors, while the tiny amount of work per call makes function call overhead visible.

The second insight is that we can dispatch several moves at once by means of a helper dispatch table. For example, if we have 2 “x” moves and 2 “y” moves and want to dispatch 3 moves at once, we have the following variants:

xxy, xyx, xyy, yxx, yxy, yyx
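The variants above can be enumerated programmatically: they are exactly the distinct 3-move prefixes of the permutations of the multiset {x, x, y, y}. Here is a small sketch of that enumeration; the helper name variants is hypothetical, introduced only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>

// distinct k-move prefixes over a multiset of available moves,
// e.g. variants("xxyy", 3) yields xxy, xyx, xyy, yxx, yxy, yyx
std::set<std::string> variants(std::string moves, size_t k)
{
    std::sort(moves.begin(), moves.end());
    std::set<std::string> out;
    do
    {
        out.insert(moves.substr(0, k));
    } while (std::next_permutation(moves.begin(), moves.end()));
    return out;
}
```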
So what we need is a helper table which is indexed by the number of available x, y and d moves and contains all possible permutations for that combination. This way we can dispatch up to N moves at a time (every dispatch but the last handles exactly N moves, while the last one handles the remaining moves). This eliminates all unpredictable branching, increases the amount of work per function call, and expresses the code as array processing (and modern processors are very good at array processing). Here is the code:

// count of moves dispatched at a time
size_t const dispatch_count = 8;

typedef uint64_t dispatch_t;

// dispatch table entry contains number of "consumed" x, y, and d moves
// and a path part which is a combination of "consumed" moves
struct dispatch_entry
{
    size_t x;
    size_t y;
    size_t d;
    dispatch_t path_part;
};
// recursive enumeration of all possible permutations
// by means of a helper dispatch table
void generate_recursive2(size_t x, size_t y, size_t d, std::vector<dispatch_t>& path, std::vector<dispatch_t>*& path_pos)
{
    bool const is_last_dispatch_hop = (x + y + d <= dispatch_count);

    std::vector<dispatch_entry> const& entries = get_dispatch_entries(x, y, d);

    for (size_t i = 0; i != entries.size(); i += 1)
    {
        dispatch_entry const& ent = entries[i];

        // append the dispatched moves to the current path
        path.push_back(ent.path_part);

        if (is_last_dispatch_hop)
            *path_pos++ = path;
        else
            generate_recursive2(x - ent.x, y - ent.y, d - ent.d, path, path_pos);

        path.pop_back();
    }
}
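The code above relies on get_dispatch_entries() to return the precomputed permutations for a given move budget. The original does not show how the table is built, so here is one possible sketch under two assumptions of mine: path_part packs moves at 2 bits each (0 = x, 1 = y, 2 = d), and the entry length is capped at dispatch_count moves (exposed as a parameter n here to keep the sketch easy to exercise).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint64_t dispatch_t;
size_t const dispatch_count = 8;

struct dispatch_entry
{
    size_t x;
    size_t y;
    size_t d;
    dispatch_t path_part;
};

// assumed encoding: 2 bits per move, 0 = x, 1 = y, 2 = d
static void build_entries(size_t x, size_t y, size_t d, size_t left,
                          dispatch_entry cur, std::vector<dispatch_entry>& out)
{
    if (left == 0)
    {
        out.push_back(cur);
        return;
    }
    // try each move kind that is still available
    if (x != 0)
    {
        dispatch_entry e = cur;
        e.x += 1;
        e.path_part = (e.path_part << 2) | 0;
        build_entries(x - 1, y, d, left - 1, e, out);
    }
    if (y != 0)
    {
        dispatch_entry e = cur;
        e.y += 1;
        e.path_part = (e.path_part << 2) | 1;
        build_entries(x, y - 1, d, left - 1, e, out);
    }
    if (d != 0)
    {
        dispatch_entry e = cur;
        e.d += 1;
        e.path_part = (e.path_part << 2) | 2;
        build_entries(x, y, d - 1, left - 1, e, out);
    }
}

// all permutations of min(n, x + y + d) moves drawn from the budget
std::vector<dispatch_entry> get_dispatch_entries(size_t x, size_t y, size_t d,
                                                 size_t n = dispatch_count)
{
    std::vector<dispatch_entry> out;
    build_entries(x, y, d, std::min(n, x + y + d), dispatch_entry{0, 0, 0, 0}, out);
    return out;
}
```

In practice these vectors would be memoized per (x, y, d) triple, so the table is built once and the hot path only reads it.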

Next page: Parallelization