Here is multi-list interface synopsis:
concurrent_multilist(size_t list_count, size_t list_size_hint);
void insert(size_t list_idx, T value);
bool erase(size_t list_idx, T value);
bool find(size_t list_idx, T value);
bool peek(size_t list_idx, size_t position, T* value);
size_t list_count() const;
size_t list_size(size_t list_idx) const;
I've added list_size_hint parameter to the constructor (approved change).
I've added list_count() and list_size(size_t list_idx) methods (approved change).
The only non officially approved change is a slight extension of the peek() method – if the argument 'value' is equal to 0, then the method prefetches the requested element w/o actually returning it (this way a user can declare that she may want the element in the near future). However in this case the method returns 'false', so for unaware user it looks like invalid argument handling and does not affect semantics. Note that the extension produces no visible action (state of the list is not affected, and no information returned to a user).
My ranking algorithm uses only insert() and peek() methods, so performance of find() and erase() methods can be sacrificed (however they still need to be correct and thread-safe). What we want to achieve for insert() and peek() methods is virtually zero synchronization overheads, i.e. no atomic RMW operations (LOCK prefixed instructions), no #StoreLoad style memory fences (MFENCE instruction), no write-sharing. Let's call that property atomic-free for short.
Another design consideration is that population phase is evidently memory bandwidth bound, so multi-list must employ the tightest memory layout possible. And the tightest layout possible is evidently an array of values.
Find() method imposes the same requirements (read-only scanning) and implemented in the same way as peek(), so it's not further considered.
Here is a graphical scheme of the multi-list structure layout:
The key to achievement of atomic-free property for fast-paths is the asymmetric reader-writer mutex algorithm (which is invented by me, and based on the prior work by David Dice et al). In short, the asymmetric reader-writer mutex algorithm provides virtually zero overhead synchronization for a plurality of readers at the expense of significantly more expensive writer access. In the scheme it's represented by 'arw_mutex' blocks, part of it contained within list, and another is scattered within per-thread TLS blocks (this distribution is the key to scalability). Algorithm details are a way too involved for this write-up, please refer to the links.
As you may guess, erase() method plays the role of a writer in the asymmetric reader-writer protocol; insert(), peek() and find() are all play the role of readers. Thus, erase() operations are executed completely exclusively, so they do not need any additional synchronization with other methods.
Insert() and peek() methods make heavy use of thread-local data. Insert() method caches insert position (pointer to current superblock) in per-thread/list storage, and uses per-thread caching memory allocator. While peek() method caches current scan position in per-thread/list storage, this is required for efficient scanning of lists.
Insert() and peek() operations are synchronized with each by means of careful memory operation ordering. Namely, insert() method first writes the values, and then moves 'last' pointer of the superblock:
tail_superblock->last = value;
tail_superblock->last += 1; // store-release
While, peek() method first loads 'tail' pointer and then reads the value:
for (T* pos = current_superblock->first;
pos != current_superblock->last; // load-acquire
pos += 1)
T value = pos;
This weak form of synchronization has basically no cost on IA-32/Intel64 architectures because of the implicit release/acquire fences associated with each store/load.
Next page: Parallelization