I've parallelized multi-list population on a per-list basis, i.e. each list is populated by exactly one thread. Parallelizing population of a single list is basically impossible because of the sequential dependencies in the linear congruential pseudo-random number generation algorithm.
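To illustrate the dependency, here is a minimal sketch of a single LCG step. The constants are the well-known Numerical Recipes ones, chosen purely for illustration; the actual generator in the benchmark may use different parameters.

```cpp
#include <cstdint>

// One step of a linear congruential generator: each output is a function
// of the previous output, so a naive generation loop for a single stream
// is inherently sequential -- value i must be computed before value i+1.
// Separate lists, however, can each own an independently seeded stream.
uint32_t lcg_next(uint32_t x) {
    return 1664525u * x + 1013904223u; // modulo 2^32 via unsigned wraparound
}
```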
Parallelization is implemented with OpenMP technology:
#pragma omp parallel for schedule(guided, 1) num_threads(fill_thread_count)
for (size_t i = 0; i < list_count; i += 1)
fill_list(lists, rnds[i], elem_count, i);
The number of threads employed in population is determined as:
min(number_of_lists, number_of_hardware_threads, number_of_numa_nodes * 4)
There is no sense in employing more threads than there are hardware threads or lists. The NUMA-related component of the formula deals with memory bandwidth: 4 threads are perfectly able to exhaust the memory bandwidth available to a node.
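The formula translates directly into code. In this sketch the NUMA node count is passed in as a parameter; in a real program it would come from a platform API such as libnuma.

```cpp
#include <algorithm>
#include <cstddef>

// Thread count for the population phase, per the formula above:
// min(number_of_lists, number_of_hardware_threads, numa_nodes * 4).
size_t fill_thread_count(size_t list_count, size_t hw_threads, size_t numa_nodes) {
    return std::min({list_count, hw_threads, numa_nodes * 4});
}
```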
KMP_AFFINITY=scatter is set because we need to occupy all NUMA nodes (we need their memory bandwidth). Scatter affinity distributes threads as far from each other as possible (refer to Intel OpenMP Thread Affinity Interface for details).
Problem analysis reveals quite tight data dependencies between processing of list elements. Namely, processing of the i-th element in a list depends on processing of elements 0..i-1 in all other lists. Thus, it's impossible to process lists one by one independently of each other (which would be the ideal variant from a parallelization point of view). However, what can be processed in parallel is the "i-th slice of a multi-list", i.e. the i-th elements of all lists.
So, I've employed a phased algorithm for parallelization. Namely, each worker thread is assigned a batch of lists. During each phase a thread iterates over its batch, and for each list finds the first non-ranked element and ranks it. All worker threads synchronize on a barrier [[url]] at the end of each phase. If during a phase a worker thread detects that some list has run out of elements, the thread schedules a rebalance procedure for the end of the phase. The rebalance procedure redistributes the remaining lists evenly among worker threads.
Here is pseudo-code that's executed by worker threads:
bool list_removed = false;
for (size_t b = 0; b != my_batch.size(); b += 1) {
    size_t const list_idx = my_batch[b];
    // actual processing is omitted for brevity
    list_removed |= handle_list(list_idx);
}
if (barrier.wait(thread_idx, list_removed))
    rebalance(); // main thread only; other threads are parked on the barrier
The rebalance procedure is executed only by the main thread (thread_idx == 0), and only if at least one thread signaled exhaustion of one or more lists. During rebalancing all other threads are parked on the barrier. When the main thread finishes rebalancing, it releases the barrier. This algorithm ensures mutual exclusion for accesses to the batches.
Multi-threading is implemented by means of OpenMP; however, it's used only to start the threads:
#pragma omp parallel for schedule(static, 1) num_threads(rank_thread_count)
for (int thread_idx = 0; thread_idx < rank_thread_count; thread_idx += 1)
rank(..., thread_idx, ...);
KMP_AFFINITY=compact is set because of the frequent synchronization between threads. Compact affinity places threads closer to each other, so that synchronization is less costly (refer to Intel OpenMP Thread Affinity Interface for details).
Next page: Performance Results