Ok, so what is the most important thing regarding synchronization algorithm's performance and scalability? I frequently hear the answer that it's a number of atomic RMW (read-modify-write) instructions (like Compare-And-Swap or Fetch-And-Add) per operation. It's dead wrong. The most important thing is amount of write sharing per operation. Numerical measure of write sharing is number cache-line transfers per operation, and the ideal value is 0. If there is 0 cache-line transfers per operations amortized (we are perfectly Ok with amortization here), then the algorithm is perfectly scalable. Anything other than 0, even 1, is a serious problem for scalability.
To not sound proofless, let's look at some graphs. The first graph is the scalability of plain write operations (x86 MOV instruction) in two different setups. In the first "private" setup each thread writes to a dedicated variable situated in a separate cache-line. In the second "shared" setup each thread writes to a dedicated variable, but all the variables are situated in a single cache-line:
The second graph is the same for atomic RMW operations (x86 LOCK XADD instruction):
Don't you notice any regularity yet? Let's also that a look at the same graph for plain loads (x86 MOV instruction):
And not let's put it all together onto a single graph:
So, what conclusions can we make?
First, if there is write sharing system ungracefully degrades, the more threads we add the slower it becomes.
Second, if there is no write sharing system linearly scales. Yes, atomic RMW operations are slower than plain stores and loads, but they do scale linearly in itself (by the way, cost of atomic RMW operations becomes smaller and smaller with each new processor generation, there are no fundamental reasons why they must be any slower and similar non-atomic read-modify-write sequence).
Third, loads are always scalable. Several threads are able to read a memory location simultaneously. Read-only accesses are your best friends in a concurrent environment.
If I would be asked to choose a single thing that you will get away, I undoubtedly would choose exactly this - the negative effect of write sharing on scalability.
Move on to Your Arsenal