Priority Queues
I'm going to share my thoughts on concurrent priority queues (CPQ) in general and tbb::concurrent_priority_queue (Intel Threading Building Blocks) in particular.
For information on tbb::concurrent_priority_queue check out the blog post. First of all, it's not all that "concurrent" at all. It uses mutual exclusion, it uses blocking, and in the end at most 1 thread does useful work at any given moment in time. There are nice pictures in the blog that show that it does not scale.
I think there are some ways to improve the implementation (and most likely TBB developers do not consider the first implementation as a final destination point). For example, producers do not actually need to wait (block) for the operation to complete, the only source of failure is a memory allocation error, so a producer can ensure there is enough space reserved in the queue, and immediately return once he offloaded the operation.
Another possible improvement is to increase batching. Namely, if a "handler" thread notices more work when he is about to retire, he can process more work up to to some predefined limit, I think it will improve locality somehow.
Hey, here I assume that you have skimmed through the implementation. Don't get upset if you don't, I believe it's not a way to go anyway.
Another potential route to go is concurrent ordered skip-lists. Skip-lists are randomized, so no need for global repair/rebalance operations that are required for heaps/trees. Enqueues are naturally distributed (=scalable) over the whole skip-list; however dequeue operations are all happen at list head (=!scalable) (can be made lockfree, though, but that's not a panacea). A possible improvement for skip-list based CPQ would be to embed a concurrent queue into each list node; the queue will contain elements with equal priority. So, for example, if a head node contains a queue with, say, 10 element, 9 subsequent CPQ dequeue operations will just pop an element from the queue (no need to do anything with skip-list itself). I would consider such design as the best choice (from available choices, not quite good anyway) for general CPQ for now.
General priority queues inherently non-scalable because of claimed properties - all producers and consumers must achieve strong global consensus on what is the highest priority element. Fear such things like death in a concurrent environment. Rare CPQ achieves at least 2x speedup under load.
So, if one *really* needs a CPQ, I would suggest 3 approaches: either give up on total consistency (strong priorities), or use specialized CPQ design, or both.
By giving up on total consistency I mean a distributed design (on a par with the distributed work-stealing Cilk-like scheduler). That is, each thread has it's own CPQ, a thread enqueues and dequeues elements from the own queue, and if it is empty he tries to steal a high priority element (or a batch of elements) from another thread. Such design is naturally distributed (=scalable). However, of course, threads will work on not the globally highest priority elements. As to whether it is a problem for a particuar system or not depends.
By specialized CPQ design I mean the following. A lot of systems are Ok with only, say, 5 or even 3 priorities - low, normal, high. So, you can just setup 3 concurrent queues - one for each priority level, and threads will poll the queues starting from high prio queue to low prio queue (checking few empty high prio queues is not a big deal performance-wise). Such design allows a lot of flexibility. For example, if you have single-producer/single-consumer (SPSC) scenario you can setup SPSC queues (which are a way faster and more scalable that general MPMC queues) (similarly for multi-producer/single-consumer and single-producer/multi-consumer scenarios). Or you can use array-based queues, or linked-list based queues, or whatever queues are the most suitable for you in a particular situation.
Or, you are able to prevent starvation of low prio elements, that is, if there is a constant flow of high and normal prio elements, you still may want to process some low prio elements in between (it's a good idea for most systems, your OS definitely does this for threads). Separate queues allow you to do this, while monolithic CPQ does not.
Moreover, if you have priorities (-inf..+inf) but expect to handle elements with priorities -1, 0, +1 most of the time, you can setup 3 concurrent queues for -1, 0, +1 and 2 CPQs for (-inf..-1) and (+1..+inf) (which are expected to be mostly idle).
As I've said, you can combine both approaches - a distributed setup of quantized queues.
I believe it's the way to go. By the way, most modern OS employ exactly such a design for thread scheduling not without a reason. Don't expect spicing up your system with a magical general "concurrent" priority queue to fix all your problems.