Introduction
There are 3 levels in every software system: high-level architecture, mid-level design and low-level implementation. You know, sometimes I hear statements like "My system is not scalable, but I only need to spice it up with cool lockfree containers, and it will become perfectly scalable". Sorry, it does not work that way. Scalability starts with good high-level architecture. It's trivially easy to kill scalability on the lowest level with excessive amounts of write sharing, but that does not diminish the value of good high-level architecture. In the end, all levels are equally important.
In the context of scalability, a good analogy for a software system is an enterprise. Assume that you need to design workrooms and production processes that will allow lots of personnel to work efficiently (read: concurrently). Will you require all tasks to pass through a single man (logging)? How many toilets do you need, and where will you place them (memory management)? Do you want to maintain a certain ratio between programmers, testers and managers, or will you leave it to chance (various stages of processing)? The principles are all the same - excessive communication, dependencies, waits and centralization kill concurrency.
Another good method is to think about a multicore processor as if it were a distributed cluster, so that each communication between threads results in a network transfer (and on the physical level it actually works roughly that way). So do you want processing of each transaction in your server to cause 20 transfers over the network? Or will you try to make do with 2? Of course, things are not that bad - communications inside a multicore processor are cheaper - but thinking this way will drive you in the right direction and allow you to build forward scalable solutions.
So, the key aspects of scalable architectures are Decentralization and Independence.
Decentralization is important because centralized systems inherently do not scale. For example, consider the thread scheduler inside an OS. While we have a single-core processor, a centralized scheduler works perfectly - it makes optimal decisions without any additional overheads. Now consider that we have a quad-core processor: each scheduling decision now requires some amount of communication (in the form of physical cache-line transfers between cores), and there is also some amount of idling due to mutual exclusion; but the system overall works acceptably because the scheduler is activated once per, let's say, 5 ms (20 ms time slice / 4 cores). And now consider that we have 8 processors each with 32 cores. A centralized scheduler can seriously hurt the system's scalability - communication is more costly (because components are more physically distributed), and the scheduler is activated every 80 µs (20 ms time slice / 256 cores), so there is inevitable contention and idling.
What all modern OSes do is employ distributed thread schedulers. That is, each hardware thread has its own queue of runnable threads and other required state. However, this has 2 downsides: (1) sub-optimal decisions and (2) increased complexity.
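To make the idea concrete, here is a minimal sketch of such a scheduler in C++. It is not any particular OS's implementation - Task, Worker and run_one are made-up names for this example - but it shows the essential structure: each worker owns its run queue, services it without touching other cores, and resorts to stealing from a sibling only when its own queue runs dry.

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <vector>

    using Task = std::function<void()>;

    struct Worker {
        std::mutex mtx;          // protects only this worker's queue
        std::deque<Task> queue;  // local run queue

        void push(Task t) {
            std::lock_guard<std::mutex> lock(mtx);
            queue.push_back(std::move(t));
        }

        std::optional<Task> pop() {
            std::lock_guard<std::mutex> lock(mtx);
            if (queue.empty()) return std::nullopt;
            Task t = std::move(queue.front());
            queue.pop_front();
            return t;
        }
    };

    // Run one task on worker `self`: prefer the local queue (no
    // cross-core communication in the common case), and fall back to
    // stealing from a sibling only when there is nothing local to do.
    bool run_one(std::vector<Worker>& workers, size_t self) {
        if (auto t = workers[self].pop()) { (*t)(); return true; }
        for (size_t i = 0; i != workers.size(); i++) {
            if (i == self) continue;
            if (auto t = workers[i].pop()) { (*t)(); return true; }
        }
        return false;  // nothing runnable anywhere
    }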
Sub-optimal decisions are inherent to distributed systems. Since each thread is now guided by only a part of the information (otherwise it would be a centralized system), it can't always make optimal decisions. In the context of thread scheduling, sub-optimality shows up as non-strict priorities. For example, suppose the highest-priority thread in the local queue of the current processor has priority 5, while another processor has a runnable thread with priority 10. The optimal decision would be to execute the thread with priority 10, but a decentralized scheduler, seeing only its local queue, will execute the thread with priority 5.
I think it's evident that it's much more difficult to create a distributed scheduler (as opposed to a centralized one): you need to cope with state distribution and state aggregation, use more involved algorithms, tolerate sub-optimal decisions, etc.
Independence is important because each additional dependency has associated costs (some communication between threads must happen in order to satisfy a dependency) and limits available concurrency (a computation can't start until the dependency is satisfied). For example, consider a network router: it has a centralized routing table which is updated periodically (every several seconds), and a group of worker threads which route network packets according to the routing table. A naive approach would be to use a reader-writer mutex to protect the table during updates. However, that would introduce dependencies between the worker threads and the updater thread (the updater needs to wait until workers are finished with the table, and workers need to wait until the updater is finished with the table).
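Here is a minimal sketch of the naive design (assuming C++17 for std::shared_mutex; RoutingTable and Packet are placeholder types for illustration):

    #include <map>
    #include <shared_mutex>
    #include <string>

    struct Packet { std::string dst; };
    using RoutingTable = std::map<std::string, int>;  // destination -> port

    std::shared_mutex table_mutex;
    RoutingTable table;

    int route(const Packet& p) {
        std::shared_lock lock(table_mutex);  // workers block while an update runs
        auto it = table.find(p.dst);
        return it != table.end() ? it->second : -1;
    }

    void update(const RoutingTable& fresh) {
        std::unique_lock lock(table_mutex);  // updater blocks until all workers leave
        table = fresh;                       // mutate in place under exclusive lock
    }

Note that even in the absence of updates, every packet acquires the shared lock, so the mutex state itself becomes a point of communication between all worker threads.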
We can break the dependencies by means of MVCC (multi-version concurrency control). Namely, instead of waiting for the workers, the updater thread creates a new version of the routing table, applies the necessary changes to it and publishes it (this pattern can be implemented with Differential Reference Counting). From this moment on, workers use the updated version for handling new requests.
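Here is a sketch of the same router reworked this way. Instead of Differential Reference Counting, this simplified stand-in publishes versions via C++20's std::atomic<std::shared_ptr<T>>; the structure of the solution is the same - readers pin the current version, while the updater builds a new one off to the side and publishes it without ever waiting for readers:

    #include <atomic>
    #include <map>
    #include <memory>
    #include <string>

    using RoutingTable = std::map<std::string, int>;  // destination -> port

    std::atomic<std::shared_ptr<const RoutingTable>> current_table{
        std::make_shared<const RoutingTable>()};

    int route(const std::string& dst) {
        // Grab a reference to whatever version is current; old versions
        // stay alive as long as some worker still holds them.
        std::shared_ptr<const RoutingTable> t = current_table.load();
        auto it = t->find(dst);
        return it != t->end() ? it->second : -1;
    }

    void update(RoutingTable fresh) {
        // Build the new version separately, then publish it atomically.
        // No worker ever waits; each one simply picks up the new version
        // on its next lookup.
        current_table.store(std::make_shared<const RoutingTable>(std::move(fresh)));
    }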
As you can see, concurrency is now improved - waits between the workers and the updater thread are eliminated; synchronization overheads are also smaller, because there is no need for costly mutual exclusion algorithms.
Move on to General Recipe