For performance testing I used Intel MTL service, which features 64 hardware threads (4 NUMA nodes x 8 cores x 2 HT threads). As input data I used "30000 30000 0 1103515245 12345 9999991", i.e. 30'000 lists each with 30'000 elements (900'000'000 elements total).
Below is the performance graph for the population phase only ('ideal' line represents perfect linear scaling, times are in milliseconds):
Population scales nearly perfectly up to 8 threads (i.e. 2 threads per NUMA node). Maximum performance is achieved with 16 threads. Beyond that point memory subsystem is completely saturated and performance degrades due to thread over-subscription.
Below is the performance graph for the ranking phase only ('ideal' line represents perfect linear scaling, times are in milliseconds):
Ranking scales reasonably good up to 4 threads. Then performance starts degrade due to frequent synchronization between threads (at the end of each phase). The best performance is achieved with 32 threads.
Below is a table with complete execution times for different LIST_COUNT and ELEM_COUNT parameters. Execution times are captured for executions with optimal thread number for each phase (thread number that yield minimal execution time) (execution times are in milliseconds):
For example, for LIST_COUNT=100'000 and ELEM_COUNT=10'000 execution time is 4397 ms, which means that time per single element (insertion and ranking) is below 5 ns.