Radix Sort
[This is my write-up for Intel Threading Challenge 2009. You may download the code for Microsoft Visual C++ at the bottom of the page]
Radix sort is a sorting algorithm that sorts integers by processing individual digits. Because integers can represent strings of characters and specially formatted floating point numbers, radix sort is not limited to integers. Since digital computers internally represent all of their data as binary numbers, processing the digits of integer representations in groups of binary digits is most convenient. Radix sorts fall into two classes: least significant digit (LSD) radix sorts and most significant digit (MSD) radix sorts. An LSD radix sort processes the integer representations starting from the least significant digit and moves towards the most significant digit; an MSD radix sort works the other way around. The MSD variant is particularly well suited to parallel computing, because after a split each of the subdivisions can be sorted independently of the rest.
Radix sort is not a comparison-based sort, so the theoretical lower bound of O(NlgN) does not apply. The computational complexity of radix sort is O(NK), where N is the number of values and K is the number of digits in a value. This complexity holds for the worst, best and mean cases. The space complexity of the naïve implementation below is also O(NK).
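As a small illustration (my own example, not from the contest data), consider an MSD pass over three-digit decimal numbers: the input { 329, 457, 657, 839, 436, 720, 355 } splits on the leading digit into buckets 3: { 329, 355 }, 4: { 457, 436 }, 6: { 657 }, 7: { 720 }, 8: { 839 }. Concatenating the buckets in digit order and recursively sorting each one on the second digit yields the sorted output, and each bucket can be handled by a different processor.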
Single-Threaded Implementation
A naïve single-threaded implementation of MSD radix sort is quite straightforward:
#include <cstddef>
#include <vector>

typedef unsigned char byte;
typedef std::vector<std::vector<byte> > data_t;
size_t const byte_values = 256;

void radix_sort(data_t& data, size_t position = 0)
{
    // recursion stop conditions: nothing to sort, or all digits processed
    if (data.size() <= 1 || position == data[0].size())
        return;
    std::vector<data_t> radix (byte_values);
    // radix split: distribute values into buckets by the current digit
    for (size_t i = 0; i != data.size(); ++i)
    {
        size_t idx = data[i][position];
        radix[idx].push_back(data[i]);
    }
    size_t out_pos = 0;
    for (size_t i = 0; i != byte_values; ++i)
    {
        // recursive sort of less significant digits
        radix_sort(radix[i], position + 1);
        // copyback: buckets are concatenated in digit order
        for (size_t j = 0; j != radix[i].size(); ++j, ++out_pos)
            data[out_pos] = radix[i][j];
    }
}
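A minimal usage sketch (my own test driver, not part of the contest code), assuming the radix_sort above and fixed-length keys:

#include <cstdio>

int main()
{
    // three 2-byte keys, one inner vector per key
    byte const raw [3][2] = { {2, 1}, {1, 2}, {1, 1} };
    data_t data;
    for (size_t i = 0; i != 3; ++i)
        data.push_back(std::vector<byte>(raw[i], raw[i] + 2));
    radix_sort(data);
    for (size_t i = 0; i != data.size(); ++i)
        printf("%d %d\n", data[i][0], data[i][1]); // prints: 1 1 / 1 2 / 2 1
    return 0;
}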
Parallelization
I use two types of parallelization. The first is parallelization of the radix split itself (intra-radix parallelization), which is especially useful for the initial radix split (the most significant digit). The input data is split into several parts (fork), each processor picks up a part and performs a radix split on it (parallel processing), and when all parts have been split, the partial radix arrays are aggregated (join) and passed to the next level of recursion. This parallelization also helps with sorting of not-so-randomly distributed data.
The second type is parallelization at the inter-radix level: at lower levels of recursion a processor completely sorts a whole bucket on its own. This helps mitigate the overheads of thread synchronization.
Parallelization is guided at run time: threads prefer inter-radix parallelization, but if some threads run out of work they help other threads at the intra-radix level.
When the size of an input array falls below a certain threshold, the thread switches to single-threaded mode, i.e. no further sub-tasks are spawned (this also helps mitigate synchronization overheads).
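As a sketch of the idea (the threshold value and the names single_threaded_threshold and spawn_subtask() are illustrative, not the actual code):

size_t const single_threaded_threshold = 10000; // assumed value

void spawn_subtask(data_t& bucket, size_t position); // hypothetical: hand the bucket to the scheduler

void process_bucket(data_t& bucket, size_t position)
{
    if (bucket.size() < single_threaded_threshold)
        radix_sort(bucket, position);    // sequential mode: no sub-tasks at all
    else
        spawn_subtask(bucket, position); // parallel mode
}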
Here is pseudo-code of the parallel algorithm:
struct radix_desc
{
    // partial results: one array of buckets per worker thread
    std::vector<data_t> radix [thread_count];
    // number of partial-split tasks still in flight
    size_t radix_pending_count;
    size_t position;
    //...
};

struct radix_task
{
    data_t input;
    radix_desc& desc;
    //...
    void execute()
    {
        // partial radix split into this thread's private buckets
        for (size_t i = 0; i != input.size(); ++i)
        {
            size_t idx = input[i][desc.position];
            desc.radix[thread_id][idx].push_back(input[i]);
        }
        // the last task to complete the split spawns the next level
        if (0 == atomic_decrement(desc.radix_pending_count))
        {
            // spawn sub-tasks
            for (size_t i = 0; i != byte_values; ++i)
            {
                // aggregate partial results from all threads
                data_t result;
                for (size_t j = 0; j != thread_count; ++j)
                {
                    result.insert(result.end(),
                        desc.radix[j][i].begin(), desc.radix[j][i].end());
                }
                radix_desc* child = new radix_desc (...);
                spawn_some_subtasks(*child, result);
            }
        }
    }
};
Scheduling
I've implemented a custom task-based scheduler on top of the Win32 threading API. In its main part it's similar to the classical Cilk-style work-stealing scheduler, though I've made some improvements to it. In particular I've added system-topology awareness, hyper-threading awareness, affinity awareness, a batch-spawn capability and manual task-depth control. All worker threads are strictly bound to EUs (execution units), and stealing is conducted based on the "distance" between EUs: a worker thread tries to steal from neighbouring threads first, and only then from threads running on a different NUMA node (system-topology awareness). This allows efficient reuse of data in the shared L3 cache of the processors.
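A minimal sketch of the distance-ordered victim selection (the topology constants and try_steal_from() are hypothetical placeholders, not the actual scheduler code):

#include <cstddef>

size_t const threads_per_node = 4;                  // assumed topology
size_t const node_count = 2;
size_t const thread_count = threads_per_node * node_count;

bool try_steal_from(size_t victim); // hypothetical: try to pop from victim's deque

// distance between execution units: 0 = self, 1 = same NUMA node, 2 = remote node
size_t distance(size_t a, size_t b)
{
    if (a == b) return 0;
    return (a / threads_per_node == b / threads_per_node) ? 1 : 2;
}

// a worker that ran out of work tries the nearest victims first,
// so stolen data is more likely to still sit in a shared L3 cache
bool steal(size_t self)
{
    for (size_t d = 1; d <= 2; ++d)
        for (size_t v = 0; v != thread_count; ++v)
            if (v != self && distance(self, v) == d && try_steal_from(v))
                return true;
    return false;
}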
Sibling HT threads share a single work-stealing deque (HT awareness), which keeps them as close to each other as possible in terms of working sets. The resources of a single core (L1D cache, L1 DTLB, etc.) cannot accommodate two distinct radix sorts; HT awareness allows HT sibling threads to work on a single radix sort, so to say. Assume the first HT thread completes a radix split and spawns a bunch of sub-tasks. Then it picks up some sub-task to process, while the second HT thread picks up another sub-task, and the data for that sub-task is already in the L1D cache (as well as in the L1 DTLB) of the core.
The scheduler is also able to support affinity of tasks, though I didn't have enough time to exploit this feature.
When a thread completes a radix split it submits up to 96 (the number of printable characters in US-ASCII) sub-tasks; the scheduler allows all of them to be submitted in a single enqueue operation. This reduces synchronization overheads to some degree.
When a thread submits new tasks to the scheduler it explicitly passes a so-called task depth as a parameter. Task depth relates to the task's level in the work DAG. When a thread pops a task from its own work-stealing deque it picks the task with the highest available level (the smallest piece of work); when a thread steals from a remote work-stealing deque it picks the task with the lowest available level (the biggest piece of work). This reduces the number of steal operations. A sketch of both the batch-spawn and the depth policy is below.
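Here is a sketch of the two policies on a single per-thread structure (my reconstruction; all synchronization is omitted, and the types and names are hypothetical):

#include <deque>
#include <vector>

struct task; // opaque task type

// one FIFO of tasks per depth level
struct work_deque
{
    std::vector<std::deque<task*> > by_depth;

    // batch spawn: the whole level goes in with one enqueue operation
    void spawn_batch(task** tasks, size_t count, size_t depth)
    {
        if (by_depth.size() <= depth)
            by_depth.resize(depth + 1);
        by_depth[depth].insert(by_depth[depth].end(), tasks, tasks + count);
    }

    // owner pops the deepest task available: the smallest piece of work
    task* pop_own()
    {
        for (size_t d = by_depth.size(); d != 0; --d)
            if (!by_depth[d - 1].empty())
            {
                task* t = by_depth[d - 1].back();
                by_depth[d - 1].pop_back();
                return t;
            }
        return 0;
    }

    // thief takes the shallowest task: the biggest piece of work,
    // which minimizes the number of subsequent steals
    task* steal()
    {
        for (size_t d = 0; d != by_depth.size(); ++d)
            if (!by_depth[d].empty())
            {
                task* t = by_depth[d].front();
                by_depth[d].pop_front();
                return t;
            }
        return 0;
    }
};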
Regarding Threading Building Blocks: another possibility would be to use the TBB task scheduler. Using TBB would not affect the main logic of the program in any way, because it supports exactly the same task concept. On one hand, TBB would reduce the amount of code to write (no need to implement the scheduler manually). On the other hand, TBB's scheduler is not system-topology aware, is not HT aware, does not provide a batch-spawn capability, and does not provide manual control over task depths (not relevant without HT awareness); it is affinity aware to some degree, i.e. it supports task affinities, but it does not support thread affinities. TBB's scheduler also has somewhat higher task spawn/consume overheads: some 600 cycles versus some 200 cycles for my scheduler (on my hardware). Since the contest is about raw performance, I've decided to implement my own scheduler.
Single-threaded Optimizations
Avoiding copyback. The naïve radix sort implementation makes K (the number of digits) copies of the whole data set in the copyback phase. In order to eliminate those copies I use the following optimization. At the start I allocate an array for the sorted data:
struct output_cell
{
    int count_;      // number of values stored in data_
    uint32_t* data_; // the 4 least significant digits of each value
};
size_t const output_size = 96*128*128;
output_cell* g_output = new output_cell [output_size];
The 3 most significant digits of a value determine the index into that array:
size_t output_index(uint64_t val)
{
    byte* v = (byte*)&val;
    return ((size_t)v[3]) | ((size_t)v[2] << 7) | (((size_t)v[1] - 32) << 14);
}
The 4 least significant digits of a value are stored in the inner array:
void store_result(uint64_t val, size_t position)
{
    size_t idx = output_index(val);
    uint32_t v = (uint32_t)(val >> 32);
    g_output[idx].data_[position] = v;
}
This way all copies of the data in the copyback phase are eliminated; sorted data is placed directly into its final destination.
Counting sort. When values are reduced to 2 bytes (by the 5 previous radix splits) I use counting sort (which is a special case of radix sort with a special intermediate representation of the values). Counting sort has the same computational complexity as radix sort, but lower space complexity, and it can be implemented more efficiently. Since I expect very few values to be sorted by a single counting sort invocation (i.e. the counter array will be very sparse), I add a bitmask to optimize the search over the counter array.
Pseudo-code of the counting sort:
void counting_sort(uint16_t* begin, uint16_t* end, uint32_t* output, uint32_t prefix)
{
    uint32_t counter [256*256] = {};
    bitmask_t bitmask;
    // counting pass: one counter per possible 16-bit value
    for (uint16_t* pos = begin; pos != end; pos += 1)
    {
        uint16_t v = pos[0];
        counter[v] += 1;
        bitmask.set_bit(v);
    }
    // output pass: visit only the values that actually occurred
    for (uint16_t v; bitmask.get_and_reset_bit(v);)
    {
        do
        {
            uint32_t val = prefix;
            val |= v;
            output[0] = val;
            output += 1;
        }
        while (--counter[v]);
    }
}
The bitmask_t::get_and_reset_bit() operation is implemented with the BSF instruction (the _BitScanForward64() intrinsic). The bitmask optimization reduces the cost of the output pass: instead of scanning all 65536 counters on every invocation, only the occupied counters are visited, so a counting sort invocation costs about 2*N operations instead of N + 65536.
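A minimal sketch of how such a bitmask can look (this is my reconstruction, not the contest code):

#include <cstdint>
#include <cstring>
#include <intrin.h>

// one bit per possible 16-bit value; a cursor remembers where the last
// scan stopped, since bits are consumed in ascending order
struct bitmask_t
{
    uint64_t words_ [65536 / 64];
    size_t cursor_;

    bitmask_t() : cursor_(0) { memset(words_, 0, sizeof(words_)); }

    void set_bit(uint16_t v)
    {
        words_[v / 64] |= (uint64_t)1 << (v % 64);
        if (cursor_ > v / 64)
            cursor_ = v / 64;
    }

    // find the lowest set bit, reset it, return its index in v;
    // returns false when no bits remain
    bool get_and_reset_bit(uint16_t& v)
    {
        for (; cursor_ != 65536 / 64; ++cursor_)
        {
            if (words_[cursor_])
            {
                unsigned long bit;
                _BitScanForward64(&bit, words_[cursor_]);
                words_[cursor_] &= words_[cursor_] - 1; // clear the lowest set bit
                v = (uint16_t)(cursor_ * 64 + bit);
                return true;
            }
        }
        return false;
    }
};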
Counting sort is not parallelized in my implementation. Since the input data is uniformly distributed, I expect this not to affect performance. It is, though, a possible further optimization that would allow better handling of not-so-randomly distributed data.
Template code generation. I make heavy use of C++ template programming in order to allow efficient code generation. A value is represented by the following class:
template<size_t digits_t> struct data_layout;
template<> struct data_layout<7> {typedef uint64_t value_t;};
template<> struct data_layout<6> {typedef uint64_t value_t;};
template<> struct data_layout<5> {typedef uint64_t value_t;};
template<> struct data_layout<4> {typedef uint32_t value_t;};
template<> struct data_layout<3> {typedef uint32_t value_t;};
template<> struct data_layout<2> {typedef uint16_t value_t;};
template<size_t digits_t>
struct value
{
    typedef typename data_layout<digits_t>::value_t value_t;
    value_t val;
    // conversion from a value with one more digit:
    // re-layout into the (possibly narrower) type for digits_t digits
    value& operator = (value<digits_t + 1> const& r)
    {
        val = (value_t)(r.val >> (8 * (sizeof(r) - sizeof(*this))));
        return *this;
    }
    // the most significant remaining digit: the next radix to split on
    char prefix() const
    {
        return ((char*)&val)[sizeof(*this) - digits_t];
    }
};
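A small usage sketch of this class (the values are hypothetical; it assumes the little-endian byte layout used throughout the rest of the code):

#include <cstdint>

int main()
{
    value<5> v5;
    v5.val = 0x0807060504000000ull; // 5 digits left: bytes 3..7 = 04,05,06,07,08
    char p5 = v5.prefix();          // 0x04: the digit the current split is on
    value<4> v4;
    v4 = v5;                        // re-layout into uint32_t, processed digit dropped
    char p4 = v4.prefix();          // 0x05: the next digit to split on
    (void)p5; (void)p4;
    return 0;
}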
All functions and classes related to radix sorting are also template-parametrized by the number of digits, and act according to the particular value layout, the location of the radix prefix in the value, etc.
The radix task is also template-parametrized by the boolean parameters is_single_threaded and is_parent_single_threaded. When is_single_threaded==true, a task allocates its sub-tasks on the stack and executes them directly. When is_parent_single_threaded==true, a task avoids atomic counting of pending siblings: since the parent allocates its sub-tasks on the stack, they will all have completed by the time the parent completes. A sketch of this technique is below.
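Here is a sketch of how is_parent_single_threaded can eliminate the atomic (my reconstruction; the real task code is more involved):

#include <intrin.h>

// pending-sibling counter selected at compile time; the names are illustrative
template<bool is_parent_single_threaded>
struct pending_counter
{
    long volatile count_;
    // siblings may run on different threads: must decrement atomically
    long decrement() { return _InterlockedDecrement(&count_); }
};

template<>
struct pending_counter<true>
{
    long count_;
    // the parent executes its stack-allocated sub-tasks directly,
    // so a plain decrement is enough
    long decrement() { return --count_; }
};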
Memory allocation. Efficient memory allocation is crucial for both single-threaded and multi-threaded performance of the implementation (the standard Windows allocator uses a single mutex, which significantly reduces scalability). I implement a distributed region memory allocator: there is a pool of 2 MB pages per NUMA node; a thread privatizes a page from that pool and then uses region allocation within the page. When the page is exhausted the thread privatizes another page, and so on. No memory is freed to the OS during the radix sort, though some memory is reused internally. I also implement a simple caching memory allocator for objects of a particular size, based on a per-thread LIFO freelist: when an object is freed it's pushed onto the freelist, and when an object must be allocated it's popped from the freelist. A sketch of the region allocation is below.
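A minimal sketch of the per-thread region (bump-pointer) allocation; here VirtualAlloc stands in for the per-NUMA-node page pool, which is more involved in the real code:

#include <windows.h>

size_t const page_size = 2 * 1024 * 1024; // 2 MB pages, as in the real allocator

// per-thread region state: allocation is a pointer bump, no synchronization
struct region_allocator
{
    char* pos_;
    char* end_;

    region_allocator() : pos_(0), end_(0) {}

    void* alloc(size_t size)
    {
        size = (size + 15) & ~(size_t)15; // keep 16-byte alignment
        if (pos_ + size > end_)
        {
            // the real code privatizes a page from a per-NUMA-node pool;
            // this sketch just asks the OS directly
            pos_ = (char*)VirtualAlloc(0, page_size,
                MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
            end_ = pos_ + page_size;
        }
        void* res = pos_;
        pos_ += size;
        return res;
    }
};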
Tools
I considered the Microsoft Visual C++ (MSVC) and Intel C++ (ICC) compilers. In 32-bit mode ICC showed an impressive 30% speedup over MSVC (even more with profile-guided optimizations). However, in 64-bit mode ICC showed a wicked 20% slowdown (with the maximum possible optimizations turned on, including /QxHost, /Qunroll, etc.); profile-guided optimizations improved the situation somewhat, but ICC still lagged behind MSVC. I didn't have time to investigate the problem, so I decided to use MSVC for the final submission.
As a profiler I used AMD CodeAnalyst; it's a simple profiler that makes it easy to capture and analyze a profile of the program. Profiling was crucial for the single-threaded optimizations. It also allowed me to verify that the profile of the multi-threaded version is essentially identical to that of the single-threaded version, and that the overheads of synchronization and scheduling are no greater than several percent; all of this is a good sign of successful parallelization. Another option would be Intel PTU; it's somewhat more complicated, but it can capture processor performance events, which is crucial for single-threaded optimization (for example, it would answer what causes excessive pipeline stalls: L1D cache misses or L1 DTLB misses).
Another great tool I used is Windows Task Manager. It allowed me to track virtual memory consumption, CPU utilization, working set size and the number of page faults. The goal was to keep virtual memory consumption within the expected bounds (~1.5x the input data size in my case), CPU utilization at 100% during the parallel phase, and page faults at 0 (i.e. working set == virtual memory).