General Recipe
The general recipe for scalable architecture looks roughly as follows.
First, you need to create enough threads. Well, without that you can't possibly exploit hardware concurrency. Generally, you need K*P threads, where P – number of processors (hardware threads), and K – application specific coefficient. Ideal value of K is 1 (number of threads == number of processors). Less values of K will inevitably result in under-subscription of processors, that is, some hardware threads will be unutilized. Larger values of K will result in over-subscription of processors, however, modest over-subscription is not very harmful (OS is perfectly able to dispatch several threads per processor w/o significant overheads). But modest over-subscription can help to mitigate episodic blocking of threads, which can be unpractical to eliminate from complexity point of view. So, reasonable value of K is, let's say, 1..4.
However, what you should not do is to make number of threads depending on data set size, or transaction rate, or subsystem/plugin count. Because then you are loosing control over it, and you get either under-subscription or over-subscription.
Also keep in mind that a thread is not an application-level abstraction, it's an abstraction of a processor (thing that can execute code, any code). So don't create threads to do particular functions – any thread in your program can execute any function (however there can be exception from the rule – for example, frequently it's convenient to have a dedicated GUI thread). Don't create threads only to track timers. Don't create threads only to do IO. Threads should generally be one level below application logic.
Second, you need work distribution/balancing mechanisms. It makes no sense to have P threads, when work is distributed very unevenly between them, and some threads are mostly idle. This is also not an application level. You need some work distribution/balancing machinery, and it must be dynamic and feedback based. Why? Because you don't how many threads there will be, data set is generally unknown, application logic and details are in constant change, hardware is changing, plus there can other processes that contend for CPU and IO. The only way handle that is some reactive feedback based mechanism.
Third, don't extensively use mutexes and other explicit or implicit forms of mutual exclusion. Think of mutexes as anti-threads :) They are not means for concurrency, they are means for suppressing concurrency. I hope it's evident that if most of your processing happens under a mutex, then only 1 thread at a time can do useful work.
Fourth, eliminate mutable shared state to the extent possible. It does not scale now, it won't scale tomorrow, and there is no way to make it scalable. Mutation of shared state causes cache-coherence traffic, which is slow (as compared to local computations), and cache-coherence interconnects have limited bandwidth, so they can become a bottleneck in a system.
In short, each core must be supplied with own work and own data and work on it independently.