Thread Design Considerations

Kernel vs User-Level Threads

Kernel-level threads: OS kernel is multithreaded; kernel maintains thread data structures and performs scheduling, synchronization, and resource management.

User-level threads: A user-level library linked with the application provides thread management and runtime support. Different processes may use entirely different threading libraries with different scheduling policies.

User-to-kernel thread mappings: one-to-one, many-to-one, many-to-many.

Thread Data Structures

Single CPU, Single-Threaded Process

A process is described by its address space (virtual-to-physical mappings), stack, and registers. All state lives in the process control block (PCB). System calls execute in the context of a kernel thread.

Many-to-One Model

Multiple user-level threads map to one kernel-level thread. The user-level library maintains its own user-level thread data structures to track resource use and make scheduling/synchronization decisions.

Splitting the PCB

To support multiple kernel-level threads per process without replicating the entire PCB:

  • PCB retains virtual address mappings and process-wide state

  • Kernel thread structures hold per-thread execution state (stack pointer, registers)

From the user-level library’s perspective, kernel-level threads act as virtual CPUs. Unix systems provide setjmp/longjmp for saving and restoring user-level thread context.

Scaling to Multiple Processes and CPUs

Relationships that must be maintained:

  • User-level library tracks all user-level threads for a given process

  • Each process tracks its kernel-level threads; each kernel thread knows its address space

  • CPU data structure tracks current thread, thread affinity, and dispatch/interrupt procedures

  • Context switching between kernel threads of different processes requires full PCB save/restore (address mappings invalidated)

  • Context switching between kernel threads of the same process only requires saving/restoring the lightweight per-thread state

Hard and Light Process State

The PCB is split into:

  • Hard process state: relevant to all threads in the process (virtual address mappings, credentials)

  • Light process state: specific to a particular kernel-level thread and the user-level thread currently mapped to it (signals, syscall arguments)

Benefits of Multiple Data Structures

Single monolithic PCB has problems: large private copies per thread, expensive context switches (full save/restore), inflexible updates affecting multiple OS services.

Multiple smaller data structures connected by pointers provide:

  • Scalability: smaller structures, shared where possible

  • Lower overhead: no redundant copies

  • Better performance: only changed state is saved/restored on context switch

  • Flexibility: modifications impact only relevant data elements

Modern OSes use this approach universally.

SunOS/Solaris Threading Model

Based on the Eykholt paper (Beyond Multiprocessing: Multithreading the SunOS Kernel) and the Stein & Shah paper (Implementing Lightweight Threads).

Solaris 2.0 supported multiprocessor systems with a multi-threaded kernel. Both many-to-many and one-to-one mappings were available. Each kernel-level thread executing a user-level thread has a lightweight process (LWP) data structure; LWPs serve as virtual CPUs for the user-level library.

User-Level Thread Structures (Stein & Shah)

  • Thread ID is an index into a table of pointers (not a direct pointer), enabling meaningful error reporting for corrupt threads

  • Thread data structure contains: registers, signal mask, priority, stack pointer, thread-local storage (compiler-allocated per-thread private variables)

  • Stack size is set by library defaults or user specification; structures are laid out contiguously for locality

  • Red zone: unallocated virtual address region between thread stacks; attempts to write cause a fault, making stack overflow debugging easier (fault is attributed to the offending thread)

Kernel-Level Data Structures

Process structure: kernel-level thread list, virtual-to-physical mappings, user credentials, signal handlers.

Lightweight process (LWP): per-thread user-level registers, syscall arguments, resource usage. Similar to user-level thread data but visible to the kernel scheduler. Aggregate process resource usage requires walking all LWPs. LWP data is swappable (can be paged out under memory pressure).

Kernel thread structure: kernel registers, stack pointer, scheduling class, pointers to LWP/address-space/CPU. This data is not swappable — always memory-resident (needed by scheduler even when thread is inactive).

CPU structure: current thread, dispatch procedures, interrupt handlers. On SPARC, a dedicated register always points to the current thread for fast access.

Thread Management Interactions

Kernel and user-level library lack mutual visibility, causing coordination challenges.

set_concurrency System Call

A process can request a specific number of kernel-level threads via set_concurrency. Example scenario:

  • Process has 4 user-level threads, actual concurrency level is 2

  • Kernel initially assigns 1 kernel thread; process requests 2 via set_concurrency

  • If both kernel threads block on I/O, process stalls despite having runnable user-level threads

  • Solution: kernel notifies user-level library before blocking; library requests additional kernel threads

  • When I/O completes and a kernel thread becomes idle, kernel can reclaim it

The pthread_setconcurrency function sets the concurrency level; passing 0 lets the implementation decide.

Visibility and Scheduling Problems

The kernel cannot see user-level thread state (mutexes, wait queues, scheduling decisions). The user-level library cannot see kernel scheduling decisions.

Priority inversion example: User-level thread holding a lock is preempted at the kernel level. Other user-level threads needing the lock are starved until the kernel reschedules the lock holder. This motivates one-to-one models to eliminate the visibility gap.

The user-level library scheduler is invoked on: explicit yield, timer expiration, synchronization operations (lock/unlock), and blocked threads becoming runnable.

Bound threads: A user-level thread permanently associated with a kernel-level thread (analogous to CPU thread pinning). In a one-to-one model, all threads are bound.

Multi-CPU Issues

On multi-CPU systems, a user-level library running on one CPU may need to affect scheduling on another CPU (e.g., preempting a low-priority thread to run a newly-unblocked high-priority thread). This requires sending inter-processor signals since registers on one CPU cannot be directly modified from another.

Adaptive Mutexes

On multi-CPU systems, if the mutex owner is running on another CPU and the critical section is short, the requesting thread spins instead of blocking — avoids the overhead of context switching and queue management. For long critical sections, the thread blocks normally.

Adaptive mutexes dynamically choose spin vs. block based on:

  • Whether the mutex owner is currently running on another CPU

  • Expected critical section length

Only useful on multi-CPU systems. Requires tracking the mutex owner.

Thread Destruction and Reuse

Exited threads are placed on a death row rather than immediately destroyed. A reaper thread periodically performs garbage collection. If a new thread is requested before reaping, the data structures and stack are reused, avoiding allocation overhead.

Interrupts and Signals

Interrupts

  • Generated externally by I/O devices, timers, or other CPUs

  • Platform-dependent (determined by hardware configuration)

  • Always asynchronous — not in direct response to a CPU action

  • Identified by a unique number; looked up in the interrupt handler table to find the handler’s starting address

  • Can be masked via a per-CPU interrupt mask (bitmask; disabled interrupts remain pending)

Signals

  • Generated by software running on the CPU or by CPU hardware faults

  • OS-dependent (identical hardware with different OS will have different signals)

  • Can be synchronous (e.g., SIGSEGV from illegal memory access, SIGFPE from divide-by-zero) or asynchronous (e.g., kill from another process, timer expiration)

  • Identified by a unique number; looked up in a per-process signal handler table

  • Can be masked via a per-process signal mask

  • OS provides default actions (terminate, core dump, stop, continue, ignore); processes can install custom handlers for most signals (some, like SIGKILL, cannot be caught)

Common POSIX signals: SIGINT (terminal interrupt), SIGURG (high-bandwidth socket data), SIGTTOU (background write attempt), SIGXFSZ (file size limit exceeded).

Why Disable Interrupts/Signals

Interrupt/signal handlers execute on the interrupted thread’s stack. If the handler needs a mutex already held by the interrupted thread, deadlock occurs.

Solution: use masks to dynamically disable specific interrupts/signals before acquiring a mutex. Re-enable after releasing the mutex. Pending disabled events are delivered once re-enabled, but the handler executes only once regardless of how many instances occurred while masked.

  • Interrupt masks: per-CPU

  • Signal masks: per-execution-context (thread)

On multi-core systems, interrupts can be routed to a single designated CPU, avoiding perturbation on other cores.

Signal Types

  • One-shot signals: handler called at least once regardless of how many instances occurred; custom handler must be re-registered after each invocation (otherwise reverts to OS default)

  • Real-time signals (Linux): if raised n times, handler is guaranteed to be called n times (queuing behavior)

Interrupts as Threads (SunOS/Solaris)

To avoid deadlocks in interrupt handlers that need locks, Solaris allows interrupt handlers to become full threads:

  • If the handler contains no locking, it runs on the interrupted thread’s stack (fast path)

  • If the handler may block, it runs as a separate thread with its own stack and context

  • Threads for interrupt handlers are pre-created and pre-initialized to eliminate dynamic creation cost from the interrupt fast path

Top half / Bottom half (Linux terminology):

  • Top half: minimal, non-blocking processing; runs immediately on interrupt

  • Bottom half: arbitrary complexity; runs as a schedulable thread that can block

Performance tradeoff: handling interrupts as threads adds ~40 instructions per interrupt, but saves ~12 instructions per mutex lock/unlock (no need to toggle interrupt masks). Since mutex operations vastly outnumber interrupts, this is a net win. This exemplifies optimizing for the common case.

Threads and Signal Handling (Solaris)

Each user-level thread has a signal mask (user-level, cheap to update). Each kernel-level thread/LWP has a signal mask (kernel-level, requires syscall to update).

The threading library wraps all signal handlers. When a signal arrives, the library handler checks masks and routes the signal:

  • Case 1 — Both user-level and kernel-level masks enabled: signal delivered directly to the current thread

  • Case 2 — Kernel mask enabled, current user-level thread’s mask disabled, but another runnable user-level thread has it enabled: library scheduler switches to that thread

  • Case 3 — Same as Case 2, but the capable thread is running on another CPU: library sends a directed signal to that kernel-level thread

  • Case 4 — All user-level threads have the signal disabled: library issues a syscall to disable the kernel-level mask, reissues the signal; process repeats across all kernel threads until all kernel masks reflect the disabled state. When any user-level thread later re-enables its mask, a syscall updates the kernel-level mask.

This approach optimizes the common case: mask updates (frequent) are cheap user-level operations; actual signal delivery (rare) incurs the extra complexity.

Linux Threading Model

Linux uses the task as the kernel-level execution context, represented by a task_struct (~1.7 KB).

  • PID field: actually a task ID (per-thread); misleading name for historical reasons

  • TGID (task group ID): the PID of the first task created for the process; identifies the process as a whole

  • Task list: links all tasks (threads) belonging to a single process

  • Process state is represented via pointers to separate structures (memory management, file management, etc.), enabling easy sharing between tasks

``clone()`` system call creates new tasks. A sharing flags bitmap controls what is shared:

  • All flags set → new thread (shared address space, files, etc.)

  • All flags cleared → fork() behavior (separate copy of everything)

  • Selective flags → partial sharing (e.g., shared files only)

fork() is internally implemented via clone() with flags cleared. For multithreaded processes, fork() creates a single-threaded child replicating only the calling thread’s visible state.

NPTL (Native POSIX Threads Library): current Linux threading implementation using a one-to-one model (one kernel task per user-level thread). Replaced the earlier LinuxThreads library (many-to-many model). Made viable by:

  • Cheaper kernel traps (fast user-to-kernel transitions)

  • More available memory (no pressure to minimize kernel threads)

  • Sufficient ID range for most workloads

For extreme scale (exascale computing, heterogeneous platforms), user-level library support and custom thread management policies remain relevant.

Linux Kernel Quiz

  • Kernel thread structure: kthread_worker (in kthread.h)

  • Process descriptor contained within it: task_struct

  • Minimum threads to boot: 20 (max_threads variable, set in fork.c)