Compiler vs. Hardware
As I said before, each memory model property exists on two levels: the compiler level and the hardware level.
For example, let's consider visibility in the following C/C++ program:
g_var = 0;
for (...)
{
    ...
    g_var = ...;
    ...
}
Hardware visibility guarantees are not relevant for us yet, because a C/C++ compiler can transform the program into:
register = 0;
for (...)
{
    ...
    register = ...;
    ...
}
g_var = register;
Now our stores to 'g_var' won't be visible to other threads until the loop ends. So first we need to ensure that our high-level language memory accesses are compiled to the proper machine-code memory accesses; only then do the hardware guarantees come into play.
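A minimal C++ sketch of the compiler-level fix, assuming 'g_var' only has to publish progress to another thread. The names g_progress, worker and run_and_observe are hypothetical, not from any library. Declaring the variable as std::atomic forbids the compiler from treating it as a thread-private value. As far as I know, mainstream compilers then emit a real store on every iteration, although the standard does not strictly promise that each individual store is preserved.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared progress counter; plays the role of g_var in the text.
std::atomic<int> g_progress{0};

// Stores the current iteration number on every pass through the loop.
// Relaxed ordering is enough here: only the visibility of this one
// variable matters, not ordering with respect to other data.
void worker(int iterations) {
    for (int i = 1; i <= iterations; ++i) {
        // ... per-iteration work would go here ...
        g_progress.store(i, std::memory_order_relaxed);
    }
}

// Runs the worker while a second thread polls the counter.
// Returns the last value the polling thread observed.
int run_and_observe(int iterations) {
    assert(iterations > 0);
    g_progress.store(0, std::memory_order_relaxed);
    int last_seen = 0;
    std::thread poller([&] {
        // Poll until the final value shows up.
        while (last_seen < iterations) {
            last_seen = g_progress.load(std::memory_order_relaxed);
            std::this_thread::yield();
        }
    });
    std::thread producer(worker, iterations);
    producer.join();
    poller.join();
    return last_seen;
}
```

The poller may skip intermediate values; the sketch only shows that the stores are real memory accesses that another thread can eventually observe.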
The same goes for ordering: if a compiler reorders 2 memory accesses during compilation, the hardware can't help, because there is no way to restore the broken ordering. So first we need to ensure ordering of memory accesses on the compiler level, and then ensure ordering on the hardware level.
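Here is a hedged C++ sketch of ordering enforced on both levels at once, using the classic "write data, then set a flag" pattern. The names g_payload, g_ready, producer, consumer and pass_message are hypothetical. The release store and the acquire load constrain the compiler (it may not move the payload accesses across them), and the compiler in turn emits whatever instructions or fences the target hardware needs.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int g_payload = 0;                 // plain, non-atomic data
std::atomic<bool> g_ready{false};  // flag that publishes the payload

void producer(int value) {
    g_payload = value;                               // (1) write the data
    g_ready.store(true, std::memory_order_release);  // (2) publish; (1) may not sink below
}

int consumer() {
    // (3) wait for the flag; the read in (4) may not be hoisted above it
    while (!g_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return g_payload;                                // (4) read the data
}

// Sends one value from a producer thread to a consumer thread.
int pass_message(int value) {
    g_payload = 0;
    g_ready.store(false, std::memory_order_relaxed);
    int received = -1;
    std::thread c([&] { received = consumer(); });
    std::thread p(producer, value);
    p.join();
    c.join();
    assert(received == value);
    return received;
}
```

If g_ready were a plain bool, the program would contain a data race, and neither the compiler nor the hardware would owe us any particular ordering.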
And the same goes for atomicity: if a C/C++ program contains a load of a 64-bit variable, then in 32-bit mode it will most likely compile to 2 separate 32-bit loads, and there is no way the hardware can restore atomicity, because the accesses are already non-atomic.
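A small C++ sketch of the atomicity point, assuming a 64-bit value shared between one writer and one reader. The names g_value and count_torn_reads are hypothetical. With std::atomic<std::uint64_t> the implementation must make every load and store indivisible, whether by a single wide instruction or, on targets without one, by an internal lock, so the reader should never see a half-updated value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<std::uint64_t> g_value{0};

// The writer alternates between two bit patterns whose 32-bit halves differ.
// A torn access would show up as a mix of the two (e.g. 0x00000000FFFFFFFF).
// Returns the number of such mixed values the reader observed.
int count_torn_reads(int iterations) {
    assert(iterations > 0);
    constexpr std::uint64_t kAllZeros = 0x0000000000000000ULL;
    constexpr std::uint64_t kAllOnes  = 0xFFFFFFFFFFFFFFFFULL;
    g_value.store(kAllZeros, std::memory_order_relaxed);

    std::atomic<bool> stop{false};
    int torn = 0;
    std::thread reader([&] {
        while (!stop.load(std::memory_order_relaxed)) {
            std::uint64_t v = g_value.load(std::memory_order_relaxed);
            if (v != kAllZeros && v != kAllOnes) {
                ++torn;  // saw halves from two different stores
            }
        }
    });

    for (int i = 0; i < iterations; ++i) {
        g_value.store((i & 1) ? kAllZeros : kAllOnes, std::memory_order_relaxed);
    }
    stop.store(true, std::memory_order_relaxed);
    reader.join();
    return torn;
}
```

With a plain std::uint64_t instead, the same program would be a data race, and on a 32-bit target torn values would be a realistic outcome.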
However, if your language provides a solid abstract memory model (like C1x and C++0x, since published as C11 and C++11, or Java, or CLI/.NET), then you work on that level only. For example, in Java/CLI loads and stores of object references are atomic, and you basically do not care what hardware the program will run on: the compiler will ensure "end-to-end" atomicity by whatever means it finds most efficient. Or if you use C1x atomic_store_explicit(..., memory_order_release), you also do not care about the underlying mechanics: the compiler has to ensure the claimed guarantees on all levels.
One last note about visibility and abstract memory models. As far as I know, none of the C1x/C++0x/Java/CLI memory models guarantees visibility formally; that is, most multi-threaded programs are perfectly allowed to hang and make no forward progress, because changes just won't propagate between threads. It's a very involved question related to cooperative scheduling, fairness and variations in hardware; it's very difficult to specify formally, so it's farmed out to QOI (Quality Of Implementation). Don't be afraid of that: all sane practical language implementations do guarantee best-effort visibility.