[This is my write-up for Intel Threading Challenge 2009. You may download the code for Microsoft Visual C++ at the bottom of the page]

Radix sort is a sorting algorithm that sorts integers by processing individual digits. Because integers can represent strings of characters and specially formatted floating point numbers, radix sort is not limited to integers. Most digital computers internally represent all of their data as electronic representations of binary numbers, so processing the digits of integer representations by groups of binary digit representations is most convenient. Two classifications of radix sorts are least significant digit (LSD) radix sorts and most significant digit (MSD) radix sorts. LSD radix sorts process the integer representations starting from the least significant digit and move towards the most significant digit. MSD radix sorts work the other way around. MSD sorting algorithm has particular application to parallel computing, as each of the subdivisions can be sorted independently of the rest.

Radix sort is not a comparison-based sort, so theoretical limit of O(NlgN) is not applicable. Computational complexity of radix sort is O(NK), where N is the number of values and K is the number of subdivisions. This complexity holds for worst, best and mean cases. Space complexity is O(NK).

typedef std::vector<std::vector<byte> > data_t;size_t const byte_values = 256;typedef unsigned char byte;void radix_sort(data_t& data, size_t position = 0){ // recursion stop conditions if (data.size() <= 1 || position == data.size()) return; std::vector<data_t> radix (byte_values); // radix split for (size_t i = 0; i != data.size(); ++i) { size_t idx = data[i][position]; radix[idx].push_back(data[i]); } size_t out_pos = 0; for (size_t i = 0; i != byte_values; ++i) { // recursive sort of lesser significant digits radix_sort(radix[i], position + 1); // copyback for (size_t j = 0; j != radix[i].size(); ++j, ++out_pos) { data[out_pos] = radix[i][j]; } }}

Parallelization

I use 2 types of parallelization. First type is the parallelization of the radix split (intra-radix parallelization), this parallelization is especially useful for initial radix split (most significant digit). Input data is split into several parts (fork), each processor picks up a part and makes radix split (parallel processing). When all parts have split partial radix arrays are aggregated (join) and directed to the next level of recursion. This parallelization may help also with sorting of not-so-randomly distributed data.

Second type is the parallelization on inter-radix level. Processor completely sorts whole array on lower levels of recursion. This parallelization helps mitigate overheads of thread synchronization.

When size of the input array reaches some threshold, thread switches to single-threaded mode, i.e. no further sub-tasks are split (this also helps mitigate synchronization overheads).

Here is pseudo-code of the parallel algorithm:

{

// partial results

size_t position;

//...

};

{

data_t input;

//...

void execute()

{

for (size_t i = 0; i != input.size(); ++i)

{

size_t idx = input[i][desc.position];

}

{

for (size_t i = 0; i != byte_values; ++i)

{

// aggregate partial results

data_t result;

for (size_t j = 0; j != thread_count; ++j)

{

}

}

}

}

};

Scheduling

I've implemented custom task-based scheduler on top of the Win32 threading API. In main part it's similar to classical Cilk-style work-stealing scheduler, though I've made some improvements on it. In particular I've added system-topology awareness, hyper-threading awareness, affinity-awareness, batch-spawn capability and manual task-depth control. All worker threads are strictly binded to EUs (execution units), stealing conducted based on the “distance” between EUs, i.e. worker thread tries to steal from neighbor threads first, then from threads running on different NUMA node (system-topology awareness). This allows to efficiently reuse data in shared L3 cache of the processors.

The scheduler is able to support affinity of tasks. Though I didn't have enough time to exploit the feature.

When thread completes radix split it submits up to 96 (number of printable characters in US-ASCII) sub-tasks, scheduler allows to submit all the tasks in single enqueue operation. This reduces synchronization overheads to some degree.

Regarding Threading Building Blocks. Another possibility would be to use TBB's task scheduler. Usage of the TBB would not affect main logic of the program in any way, because it supports exactly the same task concept. On one hand TBB would allow to reduce amount of written code (no need to implement scheduler manually). On the other hand TBB's scheduler is not system-topology aware, not HT aware, does not provide batch spawn capability, and does not provide manual control over task depths (not relevant w/o HT awareness) (TBB's scheduler is affinity aware to some degree, i.e. it supports task affinities however does not supports thread affinities). Also TBB's scheduler has somehow bigger task spawn/consume overheads: some 600 cycles, while my scheduler some 200 cycles (on my hardware). Since the contest is about raw performance I've decided to implement own scheduler.

Avoiding copyback. Naïve radix sort implementation makes K (number of digits) copies of the whole data set in the copyback phase. In order to eliminate those copies I use following optimization. On start I allocate array for the sorted data:

struct output_cell

{

int count_;

uint32_t* data_;

};

size_t const output_size = 96*128*128;

output_cell* g_output = new output_cell [output_size];

3 most significant digits of the value determine index in that array:

size_t output_index(uint64_t val)

{

byte* v = (byte*)&val;

return ((size_t)v) | ((size_t)v << 7) | (((size_t)v - 32) << 14);

}

4 least significant digits of the value are stored in the inner array:

void store_result(uint64_t val, size_t position)

{

size_t idx = output_index(val);

uint32_t v = (uint32_t)(val >> 32);

g_output[idx].data_[position] = v;

}

This way all copies of the data in the copyback phase are eliminated, sorted data are placed directly to the final destination.

Counting sort. When values reduced to 2 bytes (by 5 previous radix splits) I use counting sort (which is a special case of the radix sort with special intermediate representation of the values). Counting sort has the same computational complexity as the radix sort, however has lower space complexity and can be implemented more efficiently. Since I expect very few values will be sorted with counting sort at a time (i.e. counter array will be very sparse), I add bitmask to optimize search over counter array.

Pseudo-code of the counting sort:

void counting_sort(uint16_t* begin, uint16_t* end, uint32_t* output, uint32_t prefix)

{

uint32_t counter [256*256] = {};

for (uint16_t* pos = begin; pos != end; pos += 1)

{

uint16_t v = pos;

counter[v] += 1;

}

{

do

{

uint32_t val = prefix;

val |= v;

output = val;

output += 1;

}

while (--counter[v]);

}

}

bitmask_t::get_and_reset_bit() operation is implemented with the BSF instruction (_BitScanForward64() intrinsic). Bitmask optimization reduces computational complexity of the counting sort from 65536*N to 2*N.

Counting sort is not parallelized in my implementation. Since input data is uniformly distributed, I expect this to not affect performance. Though this is a possible further optimization which will allow better handling of not-so-randomly distributed data.

Template code generation. I heavily use C++ template programming in order to allow efficient code generation. Value is represented by the following class:

template<size_t digits_t> struct data_layout;

template<> struct data_layout<7> {typedef uint64_t value_t;};

template<> struct data_layout<6> {typedef uint64_t value_t;};

template<> struct data_layout<5> {typedef uint64_t value_t;};

template<> struct data_layout<4> {typedef uint32_t value_t;};

template<> struct data_layout<3> {typedef uint32_t value_t;};

template<> struct data_layout<2> {typedef uint16_t value_t;};

template<size_t digits_t>

struct value

{

typedef typename data_layout<digits_t>::value_t value_t;

value_t val;

value& operator = (value<digits_t + 1> const& r)

{

val = (value_t)(r.val >> (8 * (sizeof(r) - sizeof(*this))));

return *this;

}

char prefix() const

{

return ((char*)&val)[sizeof(*this) - digits_t];

}

};

All functions and classes related to radix sorting are also template parametrized by number of digits, and act accordingly to particular value layout, location of the radix prefix in the value, etc.