
Case Study: MultiLane - a concurrent blocking multiset


This is about the MultiLane: a concurrent blocking multiset paper. Go and read it first; it's just 4 pages including references and an appendix.

OK, I did not expect you to actually do that anyway :)
It describes a way to improve the scalability of a producer-consumer system by means of partitioning. Basically, it presents a way to wrap any producer-consumer queue in order to get a queue with the same properties but better scalability. Here is how it does it:


There are basically no restrictions on the underlying queue; it can be any sort of queue: bounded or unbounded, supporting one or several producers/consumers, etc. The resulting queue shares the same properties.

The input demultiplexer and the output multiplexer are no more than atomic variables, and tickets are obtained with an atomic fetch-and-add:

/* The wrapper holds an array of underlying queues ("lanes")
   plus two ticket counters. */
typedef struct {
    atomic_uint         enqueue_cursor;
    atomic_uint         dequeue_cursor;
    unsigned            count;    /* number of lanes */
    underlying_queue_t* queues;
} multilane_t;

void multilane_enqueue(multilane_t* m, void* data)
{
    /* Relaxed ordering suffices: the ticket only selects a lane. */
    unsigned ticket = atomic_fetch_add_explicit(&m->enqueue_cursor, 1, memory_order_relaxed);
    underlying_queue_t* queue = &m->queues[ticket % m->count];
    underlying_enqueue(queue, data);
}

void* multilane_dequeue(multilane_t* m)
{
    unsigned ticket = atomic_fetch_add_explicit(&m->dequeue_cursor, 1, memory_order_relaxed);
    underlying_queue_t* queue = &m->queues[ticket % m->count];
    return underlying_dequeue(queue);
}
By the way, Intel TBB concurrent queue uses exactly the same trick.

And note that the technique is quite general, so that one can use it, for example, to balance workload between external web-services:



I have mixed feelings about the technique. On one hand, it uses partitioning (decentralization), which is a good thing and generally the right way to approach the design of concurrent systems. Moreover, it actually improves performance. Below is a performance report of a synthetic microbenchmark running on an UltraSPARC T2 machine (8 cores, 16 pipelines, 64 hardware threads):

Underlying queue        4 producers/16 consumers  16 producers/4 consumers  16 producers/16 consumers
                        Baseline / MultiLane      Baseline / MultiLane      Baseline / MultiLane
ArrayBlockingQueue      509 / 653                 1002 / 881                1017 / 11338
LinkedBlockingQueue     1312 / 4245               1183 / 4008               1123 / 10604
LinkedTransferQueue     6288 / 6737               6601 / 4425               5424 / 11376
ConcurrentLinkedQueue   2061 / 6404               3066 / 6326               1542 / 12134
SynchronousQueue        783 / 4945                671 / 5137                1177 / 9747


So what's wrong with it?

Provided one has a heavily contended mutable data structure, it's a false route to try to patch the hotspot in place. There is no way to make a centralized, heavily contended data structure scalable. Period. In some sense it resembles the elimination-backoff stack algorithm. Basically, if one has that level of contention (there are always a few threads trying to do something with the data structure), sorry, the architecture is just incompatible with parallelism. Yes, of course, there are some ways to improve the situation somewhat locally, and they may even show good results... if you compare them to the previous bad-bad results. But impartially they are still bad.

The only way to solve the problem is to escalate it to the architecture level. Let's see what we can do.

The authors mention resource pools. A better, truly distributed way to create a pool is to establish per-thread pools, so that a thread pushes and pops resources to its own pool. Only if its own pool turns out to be empty on pop does a thread randomly query the pools of other threads:
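To make the idea concrete, here is a minimal sketch (the names pool_push/pool_pop and the fixed capacities are mine, not from the paper). The logic is shown single-threaded for clarity; in a real implementation the steal path must be synchronized, e.g. with a per-pool lock or a lock-free stack:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_THREADS 4   /* assumed thread count for the sketch */
#define POOL_CAP    64

/* Hypothetical per-thread resource pool: a simple LIFO stack. */
typedef struct {
    void* items[POOL_CAP];
    int   size;
} pool_t;

static pool_t pools[MAX_THREADS];

/* A thread pushes resources only to its own pool. */
void pool_push(int tid, void* res)
{
    pool_t* p = &pools[tid];
    if (p->size < POOL_CAP)
        p->items[p->size++] = res;
}

/* Pop from the thread's own pool; if it is empty, query the other
   threads' pools, starting from a random victim. */
void* pool_pop(int tid)
{
    pool_t* p = &pools[tid];
    if (p->size > 0)
        return p->items[--p->size];
    int start = rand() % MAX_THREADS;
    for (int i = 0; i < MAX_THREADS; i++) {
        int victim = (start + i) % MAX_THREADS;
        if (victim == tid)
            continue;
        /* In a multi-threaded version this access must be synchronized. */
        pool_t* v = &pools[victim];
        if (v->size > 0)
            return v->items[--v->size];
    }
    return NULL; /* all pools exhausted */
}
```

The common case (push/pop on the own pool) touches only thread-local data; cross-thread traffic happens only on the rare empty-pool path.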


An alternative design is to use per-thread pools along with a centralized one that is used solely for resource "surplus". That is, per-thread pools are strictly single-threaded; when a per-thread pool reaches a threshold, some resources are offloaded to the centralized pool; when a thread runs out of resources, it queries the centralized pool:
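A minimal sketch of this variant (the names and the capacities are hypothetical, and surplus is offloaded one resource at a time rather than in batches, which a real implementation would likely prefer). The centralized pool is the only shared structure, so it is the only one that needs synchronization:

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_CAP   8    /* assumed per-thread threshold */
#define CENTRAL_CAP 64

/* Strictly single-threaded local pool. */
typedef struct {
    void* items[LOCAL_CAP];
    int   size;
} local_pool_t;

/* Centralized surplus pool; in a real system it is shared between
   threads and must be protected by a lock or be lock-free. */
typedef struct {
    void* items[CENTRAL_CAP];
    int   size;
} central_pool_t;

static central_pool_t central;

/* Release: keep the resource locally; offload surplus to the
   centralized pool once the local threshold is reached. */
void pool_release(local_pool_t* local, void* res)
{
    if (local->size < LOCAL_CAP)
        local->items[local->size++] = res;
    else if (central.size < CENTRAL_CAP)
        central.items[central.size++] = res;
    /* else: destroy the resource */
}

/* Acquire: from the local pool first, fall back to the central one. */
void* pool_acquire(local_pool_t* local)
{
    if (local->size > 0)
        return local->items[--local->size];
    if (central.size > 0)
        return central.items[--central.size];
    return NULL; /* caller must allocate a fresh resource */
}
```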




As for producer-consumer scenarios, we can introduce a notion of persistent binding between producers and consumers. That is, each producer has an associated SPMC queue (which is already better than an MPMC queue) and always pushes messages to it; each consumer has one or several associated producers and checks their queues first; if they are empty, it falls back to random/sequential checking of the other queues.
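The lookup order can be sketched as follows (a hypothetical sketch: produce/consume and the ring-buffer queue are mine, the SPMC queue is modeled as a plain unsynchronized ring buffer, and each consumer is bound to a single "home" producer):

```c
#include <assert.h>
#include <stddef.h>

#define NPRODUCERS 4
#define QCAP       16

/* Stand-in for an SPMC queue; a real one needs synchronization
   between its single producer and multiple consumers. */
typedef struct {
    void*    buf[QCAP];
    unsigned head, tail;
} spmc_queue_t;

static spmc_queue_t queues[NPRODUCERS];

/* Each producer always pushes to its own queue. */
void produce(int producer, void* msg)
{
    spmc_queue_t* q = &queues[producer];
    q->buf[q->tail++ % QCAP] = msg; /* no overflow check in this sketch */
}

/* A consumer checks its home producer's queue first and only falls
   back to scanning the other queues when it is empty. */
void* consume(int home_producer)
{
    spmc_queue_t* q = &queues[home_producer];
    if (q->head != q->tail)
        return q->buf[q->head++ % QCAP];
    for (int i = 0; i < NPRODUCERS; i++) {
        if (i == home_producer)
            continue;
        q = &queues[i];
        if (q->head != q->tail)
            return q->buf[q->head++ % QCAP];
    }
    return NULL; /* nothing available anywhere */
}
```

In the common case a message flows from a producer to its bound consumer through a single queue with no cross-pair contention.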

However, it's frequently a bad idea to strictly divide threads into producers and consumers; it's usually better to have just worker threads and let them do whatever is the most important job at the moment, that is, let them produce, consume, or whatever. Then we can apply the same approach with per-thread queues as I suggested for resource pooling: when a thread acts as a producer it pushes a message to its own queue; when it switches to consuming, it pops a message from the same queue.

There is one more interesting point in the paper that I want to focus your attention on:

Interestingly, we see that LinkedTransferQueue is faster than its MultiLane counterpart with 16 producer threads and 4 consumer threads. Further investigation showed that many messages were simultaneously inflight under the MultiLane form, and that garbage collection activity dominated the measurement interval. This behavior arises because the underlying collection is unbounded and the underlying implementation allocates "container" nodes for each message...

Never ever use unbounded queues in producer-consumer scenarios. The question deserves a separate article, but long story short: a bounded queue behaves the same as an unbounded one while everything goes as intended, and saves you from severe performance degradation or OOM otherwise.
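The point is backpressure: when a bounded queue fills up, the producer is forced to wait or shed load instead of letting messages pile up without limit. A minimal sketch (names and the deliberately tiny capacity are mine; a blocking implementation would wait instead of returning false):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 4u /* deliberately small power-of-2 bound for illustration */

typedef struct {
    void*    buf[QCAP];
    unsigned head, tail;
} bounded_queue_t;

/* Returns false when the queue is full: the producer must slow down
   (or block, or drop) instead of growing the queue indefinitely.
   This backpressure is exactly what an unbounded queue cannot give. */
bool bounded_enqueue(bounded_queue_t* q, void* msg)
{
    if (q->tail - q->head == QCAP)
        return false; /* full: apply backpressure */
    q->buf[q->tail++ % QCAP] = msg;
    return true;
}

void* bounded_dequeue(bounded_queue_t* q)
{
    if (q->head == q->tail)
        return NULL; /* empty */
    return q->buf[q->head++ % QCAP];
}
```

While consumers keep up, a bounded queue never fills and is indistinguishable from an unbounded one; only when they fall behind does the bound kick in.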

The bottom line.
The MultiLane technique can be used to somewhat reduce contention on a centralized data structure, and from this point of view it's a useful trick in your concurrent toolbox. However, in an ideal world it's better to approach the problem at the architectural level.

You may also read some of my comments regarding details of the approach in Dave's blog.