So we have to choose between work-stealing and work-requesting. Work-stealing has a serious advantage over work-requesting due to it's asynchronous nature: a thief thread is able to get some work, even if a victim thread is busy processing a user task or even de-scheduled by an OS. With work-requesting a thief thread is able to get some work only if a victim thread condescends to send it (which it is just unable to do if it is de-scheduled by an OS). However, in our case it's not a problem, because tasks are small and limited in size, plus we are not going to over-subscribe processors with threads.
There are also 2 problems with work-stealing due to it's asynchronous nature. First, it inherently incurs some observable per-task overhead, because every pop operation from a thread's work deque must be synchronized with a potential asynchronous steal operation from another thread. Stealing is rare, but one still has to pay that price on every pop operation. The price is at least a single store-load style memory fence (MFENCE instruction for x86 architecture), or a single atomic RMW operation (LOСK prefixed instruction on x86). Here is an illustration of the problem:
Work-requesting is free of the problem. Task deque is completely local to a thread and requires no synchronization.
The second problem has similar nature and relates to a join phase of parallel algorithms (game state backtracking in our case). Traditional handling of task completion involves decrement of a pending child counter of a parent task. Due to asynchronous nature of work-stealing, the decrement has to be synchronized with other potential concurrent decrement operations. Here is an illustration of the problem:
Work-requesting is free of the problem. During execution of work-requesting protocol a victim thread can mark a parent task with a 'has_stolen_children' flag, and synchronization will be used only for such tasks. While great bulk of tasks will proceed without any synchronization.
Due to afore-mentioned problems I've chosen a work-requesting protocol for task scheduling. Basically, it allows us to cut 2 atomic RMW operations per task.
Move on to Work-Distribution and Work-Balancing