Per-processor Data

Data sharing is bad and it kills scalability, so a considerable number of scalable synchronization algorithms rely on per-thread data. For example, most scalable memory allocators keep per-thread caches of memory blocks. Another good example is task schedulers, which keep per-thread task pools. However, sometimes it may be beneficial to keep per-processor rather than per-thread data (here by a processor I mean any notion of a hardware thread: a core of a multicore processor or a HyperThread). Why? Because the number of processors may be significantly smaller than the number of threads. For example, consider a machine with 16 hardware threads and a program that creates, let's say, 4 software threads per hardware thread (quite reasonable for masking occasional IO blocking), that is, 64 software threads. Our imaginary scalable memory allocator then needs to maintain 48 fewer memory caches (16 instead of 64) if it uses per-processor data. Another advantage is that the number of processors usually can be considered constant, which greatly simplifies algorithm design. Threads, on the other hand, come and go, so algorithms that use per-thread data need dynamic thread registration/deregistration as well as some form of offloading (what if a thread terminates while not all of its memory blocks are freed?).

Since a processor executes only one software thread at a time, contention and cache coherence are not a problem. Two threads running successively on a single processor do not contend with each other and are able to reuse the allocator's data in the processor's caches. However, the problem is that a thread can be rescheduled to another processor in the middle of an operation on per-processor data, so generally you need all the same synchronization as for plain global data (contrast this with per-thread data, which allows eliminating synchronization entirely); the sketch below illustrates the point.
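
For illustration, here is a minimal sketch (not from any particular allocator; get_current_processor() and the fixed array size are assumptions) of a per-processor free list that still uses an atomic CAS, precisely because the calling thread may migrate mid-operation:

#include <windows.h>

unsigned get_current_processor(); // one of the implementations discussed below

struct free_list_t
{
    // pad to a cache line so that slots of different processors do not false-share
    __declspec(align(64)) void* volatile head;
};

free_list_t free_lists[64]; // in real code, sized by the number of processors

void push_block(void* block)
{
    free_list_t& list = free_lists[get_current_processor()];
    for (;;)
    {
        void* old_head = list.head;
        *(void**)block = old_head; // link the block into the list
        // CAS rather than a plain store: we may have been rescheduled to another
        // processor, so another thread can be operating on this slot concurrently
        if (InterlockedCompareExchangePointer(&list.head, block, old_head) == old_head)
            break;
    }
}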

Per-processor data is not so widely used in user space, but OS kernels use per-processor (or per-CPU, in Linux kernel terms) data on a routine basis. Kernel code is able to mask and unmask interrupts (in order to prevent rescheduling to another processor) and to obtain the current CPU number cheaply. That makes per-CPU data almost ideal (at least for resource pooling).
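
For reference, this is roughly what the pattern looks like in the Linux kernel (a hedged sketch: the per-CPU API names are real, the block_cache structure is made up for illustration):

#include <linux/percpu.h>
#include <linux/smp.h>

struct block_cache
{
    void* blocks[64];
    int count;
};

static DEFINE_PER_CPU(struct block_cache, block_caches);

static void* alloc_from_cpu_cache(void)
{
    void* p = NULL;
    /* get_cpu() disables preemption, so the thread cannot migrate
       to another CPU while it manipulates this CPU's cache */
    struct block_cache* c = &per_cpu(block_caches, get_cpu());
    if (c->count > 0)
        p = c->blocks[--c->count];
    put_cpu(); /* re-enable preemption */
    return p;
}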

Masking interrupts is an impermissible luxury for user-space code, but at least we may try to learn how to obtain the current CPU number efficiently. So what is available for that?

On Windows there is GetCurrentProcessorNumber(), however it is available only since Vista. On Linux there is vgetcpu(), however I am not sure whether it has made it into the mainline kernel or not. sched_getcpu() used to be quite heavyweight, but it has become cheaper nowadays. The IA-32/Intel64 instruction set contains the CPUID instruction, which returns, among other things, an APIC ID that can be mapped to a CPU number (it seems that cores and HyperThreading siblings have the same APIC ID, though).
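
For example, a sketch of the CPUID variant (CPUID leaf 1 returns the initial APIC ID in bits 31..24 of EBX; mapping APIC IDs to consecutive processor numbers is left out):

#include <intrin.h>

unsigned get_apic_id()
{
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return ((unsigned)regs[1] >> 24) & 0xFF; // initial APIC ID is in bits 31..24 of EBX
}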

The latest x86 processors feature the RDTSCP instruction, which returns the processor number as well (Intel introduced it in Core i7, AMD somewhat earlier). You can determine its presence with the CPUID instruction. RDTSCP seems to be quite heavyweight (as is CPUID), but we can try to amortize accesses to it in the following way:

#include <intrin.h>

struct current_processor_t
{
    unsigned number;     // cached processor number
    unsigned timestamp;  // tick count at the moment of the last update
    unsigned period;     // how many calls remain before a forced update
};

__declspec(thread) current_processor_t current_processor = {0, 0, 1};

void update_current_processor();

unsigned get_current_processor()
{
    current_processor_t& cp = current_processor;
    if (0 == --cp.period
        || cp.timestamp != *(unsigned*)0x7FFE0000) // compare with the tick count in KUSER_SHARED_DATA
        update_current_processor();
    return cp.number;
}

__declspec(noinline) void update_current_processor()
{
    current_processor_t& cp = current_processor;
    cp.timestamp = *(unsigned*)0x7FFE0000; // remember the current tick count
    cp.period = 20; // subject to tweaking
    __rdtscp(&cp.number); // the AUX output of RDTSCP holds the current processor number
}

Because of the amortization it works quite fast and gives reasonable precision. However, on Windows a better solution is the so-called "red pill" trick based on the SIDT instruction (which happens to be available in user space). SIDT returns the value of the interrupt descriptor table register (IDTR); the value itself is irrelevant for us, however, the IDT is different on each processor (at least on Windows). So we just need to map it to a processor number, and the mapping can be trivially constructed by probing:

#include <windows.h>
#include <intrin.h>
#include <malloc.h>

struct idt_proc_cache_t
{
    unsigned cached_idt;   // IDT base seen on the last call
    unsigned cached_proc;  // processor number it mapped to
};

#pragma pack(push, 1)
struct idt_desc_t
{
    unsigned short size;
    unsigned base;  // 32-bit code; on x64 the base is 64-bit
};
#pragma pack(pop)

unsigned idt_count;
unsigned* idt_table;
__declspec(thread) idt_proc_cache_t idt_proc_cache;

unsigned get_current_processor()
{
    // obtain the IDT base
    idt_desc_t idt;
    __sidt(&idt);
    // compare it with the cached value,
    // if they are equal, then return the cached processor number
    idt_proc_cache_t& cache = idt_proc_cache;
    if (cache.cached_idt != idt.base)
    {
        // otherwise map the current IDT base to a processor number,
        // and cache the result
        for (unsigned i = 0; i != idt_count; ++i)
        {
            if (idt_table[i] == idt.base)
            {
                cache.cached_idt = idt.base;
                cache.cached_proc = i;
                break;
            }
        }
    }
    return cache.cached_proc;
}

DWORD __stdcall init_idt_mapping_impl(void*);

int init_idt_mapping()
{
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    idt_count = info.dwNumberOfProcessors;
    idt_table = (unsigned*)_aligned_malloc(idt_count * sizeof(unsigned), 128);
    if (idt_table == 0)
        return -1;
    // probe the IDT bases from a dedicated thread so that we do not
    // disturb the affinity of the calling thread
    HANDLE th = CreateThread(0, 0, init_idt_mapping_impl, 0, 0, 0);
    WaitForSingleObject(th, INFINITE);
    CloseHandle(th);
    return 0;
}

DWORD __stdcall init_idt_mapping_impl(void*)
{
    for (unsigned i = 0; i != idt_count; i += 1)
    {
        // pin the thread to processor i and record that processor's IDT base
        SetThreadAffinityMask(GetCurrentThread(), 1u << i);
        idt_desc_t idt;
        __sidt(&idt);
        idt_table[i] = idt.base;
    }
    return 0;
}
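
Hypothetical usage (not from the original article): build the mapping once at startup, then query the current processor as needed:

#include <stdio.h>

int main()
{
    if (init_idt_mapping() != 0)
        return 1;
    printf("running on processor %u\n", get_current_processor());
    return 0;
}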

It seems that on pre-Vista Windows the current processor number can be obtained with LSL(59h)>>14 on 32-bit Windows and LSL(53h)>>14 on 64-bit Windows (I can't test it right now):

__declspec(naked) unsigned get_current_proc_32()
{
    __asm
    {
        mov ecx, 59h   ; segment selector 59h
        lsl eax, ecx   ; load its segment limit
        shr eax, 0Eh   ; >> 14 gives the processor number
        retn
    }
}
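
Since the MSVC x64 compiler does not support inline assembly, the 64-bit variant would have to live in a separate MASM file; a sketch under the same untested assumption about selector 53h:

; get_current_proc_64.asm, assembled with ml64.exe (untested sketch)
.code
get_current_proc_64 proc
    mov ecx, 53h   ; segment selector 53h
    lsl eax, ecx   ; load its segment limit
    shr eax, 0Eh   ; >> 14 gives the processor number
    ret
get_current_proc_64 endp
end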

Note that if you are in doubt about how to implement the function, you can always just return a random number. It does not compromise correctness (remember, in user space this is no more than an optimization: a thread can be rescheduled to another CPU right after it has obtained the number), and it gives you a randomized distributed data structure, which is not that bad after all. And for testing you may implement it as always returning 0; that provides a sort of worst-case behavior (which is good for testing).
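
A trivial sketch of both fallbacks (processor_count is an assumed global holding the number of processors):

#include <stdlib.h>

unsigned processor_count; // assumed to be initialized elsewhere

unsigned get_current_processor_random()
{
    // correct but randomly distributed: a thread simply lands in a random slot
    return (unsigned)rand() % processor_count;
}

unsigned get_current_processor_worst_case()
{
    // funnels everything through a single slot - approximates worst-case contention
    return 0;
}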

Here you can see an example usage of the technique: Distributed Reader-Writer Mutex