QuickThread (C++/Fortran)

QuickThread* is a runtime library and programming paradigm for writing multithreaded applications in 32-bit and 64-bit environments using C++, Fortran, or a mixture of the two languages.

QuickThread* is affinity-capable, supporting thread affinity, data-binding affinity, and NUMA-aware scheduling and allocation.

QuickThread* is a tasking system built on thread pools, providing exceptional control over task scheduling with respect to cache levels, core placement, and thread availability.

The design goal of QuickThread* is to provide a minimal-overhead mechanism for distributing work in a multithreaded environment.

Conceptual programming technique

Consider an idealized system with eight hardware threads (T0-T7) running on two processors, each processor having four cores, a three-level cache hierarchy, and its own memory system. Within each processor, the cores form two pairs, each pair sharing one of two L2 caches, and all four cores share a common L3 cache. Each processor has direct access to its local RAM (M0) and one-hop access to the local RAM of the other processor (M1, and conversely). This model can be expanded to include additional processor packages and memory systems, and additional memory hop levels (M2, M3).
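This topology can be inspected on a real machine. The following illustrative sketch uses the hwloc library (unrelated to QuickThread; assumes hwloc 2.x, compiled with -lhwloc):

    // Illustrative sketch (not QuickThread): enumerate packages, NUMA nodes,
    // caches, cores, and hardware threads with hwloc 2.x.
    #include <hwloc.h>
    #include <cstdio>

    int main() {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);
        std::printf("packages:         %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
        std::printf("NUMA nodes:       %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
        std::printf("L3 caches:        %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
        std::printf("L2 caches:        %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L2CACHE));
        std::printf("cores:            %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
        std::printf("hardware threads: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
        hwloc_topology_destroy(topo);
        return 0;
    }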

In this idealized system, each thread has independent data distributed among the various cache and memory levels, and the programming goal is to keep as many of each thread's data (and instruction) accesses as close to its L1 cache as possible. When the programmer has the means to steer the execution of the application toward this idealized behavior, the application achieves maximum performance. In practice, the commonly used threading tools do not give the programmer that means of control.

One technique employed by most threading tools, which provides a limited measure of this control, was the shift in programming practice from using one software thread per task to using a pool of threads, typically one software thread per hardware thread, with a task scheduler that dispatches tasks to available threads from the pool. This exchanged a costly operating-system thread context switch for a comparatively cheap task context switch.
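As an illustration of the thread-pool tasking model, a minimal pool might look like the following sketch (generic C++17, not QuickThread's implementation; later sketches reuse this ThreadPool):

    // Minimal thread-pool sketch: tasks are en-queued as function objects
    // and executed by a fixed set of worker threads.
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
    public:
        explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~ThreadPool() {
            {
                std::lock_guard<std::mutex> lock(m_);
                done_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        // En-queue a task: far cheaper than creating a thread per task.
        void submit(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> lock(m_);
                tasks_.push(std::move(task));
            }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lock(m_);
                    cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                    if (done_ && tasks_.empty()) return;
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();  // the "task context switch" is just a function call
            }
        }
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> tasks_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };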

Additionally, when using the thread-pool tasking technique, the programmer can use thread affinity to pin a software thread to a specific hardware thread (or set of hardware threads). With pinning, when the operating system interrupts an application thread, or context-switches to another application or system task, there is a higher probability that, on resumption, some portion of the previously cached application data is still present in that thread's cache.
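For example, on Linux a software thread can be pinned to one hardware thread with pthread_setaffinity_np (Windows offers SetThreadAffinityMask); this sketch is independent of QuickThread:

    // Illustrative sketch (not QuickThread): pin the calling thread to one
    // hardware thread on Linux. glibc exposes pthread_setaffinity_np when
    // built with g++ (which defines _GNU_SOURCE).
    #include <pthread.h>
    #include <sched.h>

    bool pin_current_thread(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        // On success the thread subsequently runs only on hardware thread
        // `cpu`, so data it cached there is more likely to still be warm
        // when the thread resumes.
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }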

The remaining control technique needed to approach the idealized system is a means of choreographing not only task scheduling but also task placement, interaction with other tasks, and data placement. QuickThread* offers this level of control.

QuickThread* offers the programmer the means to:

    • Allocate data objects from a particular NUMA node (e.g. the node with the most available RAM or the least estimated computational load); see the sketch following this list.

    • Direct execution of tasks or task slices for data objects allocated with placement, so that they are restricted to, or prefer, threads within the NUMA node of that data object.

    • Use hot-in-cache scheduling to restrict tasks or task slices to, or give them preference for, threads sharing a specific cache level with the current thread (the thread issuing the task en-queue).

    • Use not-in-cache scheduling to restrict tasks or task slices to, or give them preference for, threads sharing a specific cache level on the processor with the most idle hardware threads at that cache level.

    • Use opportunistic-in-cache scheduling, whereby a loop is conditionally split into multiple task slices only when, and to the extent that, threads sharing a specific cache level with the current thread are available (otherwise the loop runs as a single task, or a diminished number of tasks, on the current thread).

    • Include the current thread in the task slice-up (by calling the task directly as a function on the current thread) or exclude it.

    • Slice up a task and distribute it to primary thread slices, one per requested cache level.

    • Slice up a primary thread slice into secondary thread slices within the cache level of that primary slice.

    • Schedule opportunistically, as threads become available, to reduce unnecessary thread-scheduling calls.
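As a sketch of the first capability above, a buffer can be placed on a chosen NUMA node with libnuma on Linux (link with -lnuma). QuickThread supplies its own NUMA-aware allocation; this merely illustrates the underlying idea:

    // Illustrative sketch (not QuickThread's allocator): place a buffer on
    // a chosen NUMA node with libnuma.
    #include <numa.h>
    #include <cstddef>

    double* alloc_on_node(std::size_t count, int node) {
        if (numa_available() < 0) return nullptr;  // kernel lacks NUMA support
        // Pages of the returned buffer are allocated on `node`; tasks later
        // scheduled to threads on that node access the data locally.
        return static_cast<double*>(numa_alloc_onnode(count * sizeof(double), node));
    }

    // Release with: numa_free(ptr, count * sizeof(double));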

With QuickThread* the programmer exerts this level of control by including a single placement directive on parallel_for and the other parallel directives.
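The call style might look like the following hypothetical sketch. The placement token L2$ and this exact parallel_for signature are illustrative only (the QuickThread documentation defines the actual tokens, and $ in identifiers relies on a compiler extension):

    // Hypothetical usage sketch of a placement directive on parallel_for.
    void axpy_slice(int iBegin, int iEnd, double a, double* x, double* y) {
        for (int i = iBegin; i < iEnd; ++i)
            y[i] += a * x[i];
    }

    void axpy(int n, double a, double* x, double* y) {
        // Without placement: slice [0, n) across available compute threads.
        parallel_for(axpy_slice, 0, n, a, x, y);
        // With placement: prefer threads sharing the issuing thread's L2
        // cache (hot-in-cache scheduling).
        parallel_for(L2$, axpy_slice, 0, n, a, x, y);
    }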

The conceptual programming technique for QuickThread* is a messaging system whereby you throw objects and arguments at functions (C++) or subroutines (Fortran). These throw requests are placed into a queue (one of several queues). The queued routines, when run, can themselves throw (en-queue) additional routines and arguments, perform work, or both.
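The pattern can be illustrated with the ThreadPool sketch above (not the QuickThread API): capturing the arguments in a lambda turns a function call into a queued work request, and a running request may en-queue further requests. Completion synchronization is omitted for brevity:

    // Illustrative sketch: "throw" a function and its arguments at a queue.
    void process(ThreadPool& pool, double* data, int begin, int end) {
        if (end - begin > 1024) {
            int mid = begin + (end - begin) / 2;
            // Throw the upper half at the queue as a new work request...
            pool.submit([&pool, data, mid, end] { process(pool, data, mid, end); });
            end = mid;  // ...and keep the lower half on the current thread.
        }
        for (int i = begin; i < end; ++i)
            data[i] *= 2.0;
    }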

The general queuing technique is neither strictly LIFO nor FIFO. QuickThread en-queues work requests in a manner that defaults to being hot-in-cache friendly (the programmer can optionally use cache-directed en-queuing of work requests).

When thread affinity is not used, the application programmer can select between a compute-class queue and an I/O-class queue. A third queue, the compute-class overflow queue, may be used in rare circumstances (e.g. so that en-queue requests from other threads are not blocked while the primary compute-class queue is being given additional free nodes by the first thread to cause the overflow).
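Conceptually, the two queue classes behave like two separate pools: one sized to the hardware for compute tasks, and a small one whose threads are allowed to block on I/O. The following sketch reuses the ThreadPool above (QuickThread manages both classes inside one runtime; save_results is a hypothetical blocking routine):

    // Illustrative sketch: two pools standing in for the compute-class and
    // I/O-class queues.
    ThreadPool compute_pool(std::thread::hardware_concurrency());
    ThreadPool io_pool(2);  // I/O threads may block; keep them few

    void save_results(const double* data, int n);  // hypothetical; may block on disk

    void finish(double* data, int n) {
        compute_pool.submit([data, n] {
            // ...compute-class work on data...
            io_pool.submit([data, n] { save_results(data, n); });  // I/O class
        });
    }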

Affinity, when enabled, will, at the programmer's direction, pin compute-class threads to one or more execution cores. The programmer then has a choice between affinity-directed and non-affinity-directed en-queuing of compute-class tasks. I/O-class threads are not affinity-pinned.

For further information refer to http://www.quickthreadprogramming.com/