Creation of efficient scalable concurrent data structures is a kind of black magic. There is no recipes for that. What we should do is carefully analyze usage patterns for a data structure, and then try to satisfy all user requirements and scalability prerequisites with all possible means.
So what are the scalability prerequisites?
First, no mutexes on fast-paths ever (for slow-paths they are Ok, and even recommended because of the usage simplicity). There are several problems with mutexes. Mutexes limit and sacrifice concurrency to provide simplicity (anti-threads, sort of). Then, they provoke write-sharing (cache-coherence traffic) on every operation (even otherwise read-only). So, they just does not scale, forget about them.
Second, logically read-only operations must be implemented as a physically read-only operations. So, what does it mean? During logically read-only operation one should not do any single write to a shared memory location. Note that writes may be hidden inside of some component, in particular most of the reader-writer mutexes do writes to internal state in read_lock()/read_unlock(), such writes are no less harmful. Writes to a shared state provoke cache-coherence traffic, large amounts of which quickly brings any concurrent system to it's knees.
Due to specifics of implementation of cache-coherence in modern concurrent hardware (see MOESI protocol), reads to a shared state have 100% scalability (i.e. any number of threads can read from a memory location simultaneously); while writes to a shared state have zero scalability (i.e. at most 1 thread can write to a memory location at any given moment in time).
Third, no writes to a centralized shared state on fast-paths. Writes to a shared state are generally unavoidable for most concurrent data structures. However we can distinguish 4 kinds of a shared state for our needs:
1. Mostly private state. A statistics counter held in thread-local storage is a good example. Such counter is frequently written by an owner thread, and very infrequently read by some other thread. This kind of a shared state generally is of no danger for scalability.
2. Mostly read-only state. That's a state with a very high read-to-write ratio (some real-world data-structures actually have read-to-write ratio of 10^7 and higher). Such state also is of no danger for scalability.
3. Decentralized shared state. That's a shared state which is frequently written to, but is physically distributed. A good example is a hash map with an array of independent buckets. When threads concurrently work with such data structure their activity is physically scattered across the array. And consequently collisions during accesses to a memory location are less frequent. This kind of shared state may or may not represent a danger for scalability depending on distribution factor, number of threads, access patterns and other details.
4. Centralized shared state. That's a shared state which is frequently written to, and is physically centralized. A typical example is a counter of elements in a container, which is mutated on every insert and remove operation. That's a scalability killer number one, there is no way to make it scalable. A typical mistake is to maintain such a state with atomic RMW (read-modify-write) operations (InterlockedXXX(), __builtin_sync_XXX(), atomic_XXX()), and think that since there is no mutexes, they should be scalable. It does not work that way, just say no to a centralized shared state.
Fourth, be aware of false sharing. Due to performance reasons cache-coherence protocols work with whole cache lines, rather than with separate bytes, words or C-language variables. I.e if two variables are contained within a single cache line, for the hardware they look like a single variable with all implications on scalability. So everything said above must be actually extended from distinct memory locations to cache lines. Size of a cache-line is architecture dependent, there are/was architectures with cache line sizes from 16 bytes to 4 kilobytes. However for modern Intel x86 processors (IA-32, Intel 64) cache-line size is fixed to 64 bytes, i.e. 16 consecutive words for IA-32 and 8 consecutive words for Intel 64.
Fifth, atomic RMW operations have some fixed associated costs. For modern Intel x86 processors cost of a single atomic RMW operation (LOCK prefixed instruction) is some 40 cycles (depends on a particular model, and steadily decreases). The cost comes mostly from frequently-unneeded embed full memory fence (so when Intel will separate atomicity and memory ordering, you may cross out this rule). However, the cost is fixed and does not affect scalability, so is far less important than above-outlined scalability-affecting points. Of course, all well-known single-threaded optimization rules are still applicable too. So we may include into the fifth point also algorithmic optimality and general implementation quality.
If we summarize we get the following scalability mantra:
The most important aggregate metric for a concurrent data structure is a mean number of cache line transfers between threads per operation.
All possible means must be employed to reduce the value as much as possible.
That's “why” and “what” need to be achieved. Compliance with the above guidelines ensures linear scalability if possible, and just maximum possible scalability otherwise. It's worth noting that some data structures inherently can't be implemented with the linear scalability, for example producer-consumer queue with the strict FIFO ordering requirement (it's 'strict FIFO ordering' part that is problematic, because it inherently requires communication between threads on every enqueue operation).
Move on to False-sharing