Thread Design Considerations¶

Kernel vs User-Level Threads¶

Kernel-level threads: OS kernel is multithreaded; kernel maintains thread data structures and performs scheduling, synchronization, and resource management.

User-level threads: A user-level library linked with the application provides thread management and runtime support. Different processes may use entirely different threading libraries with different scheduling policies.

User-to-kernel thread mappings: one-to-one, many-to-one, many-to-many.

Thread Data Structures¶

Single CPU, Single-Threaded Process¶

A process is described by its address space (virtual-to-physical mappings), stack, and registers. All state lives in the process control block (PCB). System calls execute in the context of a kernel thread.

Many-to-One Model¶

Multiple user-level threads map to one kernel-level thread. The user-level library maintains its own user-level thread data structures to track resource use and make scheduling/synchronization decisions.

Splitting the PCB¶

To support multiple kernel-level threads per process without replicating the entire PCB:

PCB retains virtual address mappings and process-wide state
Kernel thread structures hold per-thread execution state (stack pointer, registers)

From the user-level library’s perspective, kernel-level threads act as virtual CPUs. Unix systems provide setjmp/longjmp for saving and restoring user-level thread context.

Scaling to Multiple Processes and CPUs¶

Relationships that must be maintained:

User-level library tracks all user-level threads for a given process
Each process tracks its kernel-level threads; each kernel thread knows its address space
CPU data structure tracks current thread, thread affinity, and dispatch/interrupt procedures
Context switching between kernel threads of different processes requires full PCB save/restore (address mappings invalidated)
Context switching between kernel threads of the same process only requires saving/restoring the lightweight per-thread state

Hard and Light Process State¶

The PCB is split into:

Hard process state: relevant to all threads in the process (virtual address mappings, credentials)
Light process state: specific to a particular kernel-level thread and the user-level thread currently mapped to it (signals, syscall arguments)

Benefits of Multiple Data Structures¶

Single monolithic PCB has problems: large private copies per thread, expensive context switches (full save/restore), inflexible updates affecting multiple OS services.

Multiple smaller data structures connected by pointers provide:

Scalability: smaller structures, shared where possible
Lower overhead: no redundant copies
Better performance: only changed state is saved/restored on context switch
Flexibility: modifications impact only relevant data elements

Modern OSes use this approach universally.

SunOS/Solaris Threading Model¶

Based on the Eykholt paper (Beyond Multiprocessing: Multithreading the SunOS Kernel) and the Stein & Shah paper (Implementing Lightweight Threads).

Solaris 2.0 supported multiprocessor systems with a multi-threaded kernel. Both many-to-many and one-to-one mappings were available. Each kernel-level thread executing a user-level thread has a lightweight process (LWP) data structure; LWPs serve as virtual CPUs for the user-level library.

User-Level Thread Structures (Stein & Shah)¶

Thread ID is an index into a table of pointers (not a direct pointer), enabling meaningful error reporting for corrupt threads
Thread data structure contains: registers, signal mask, priority, stack pointer, thread-local storage (compiler-allocated per-thread private variables)
Stack size is set by library defaults or user specification; structures are laid out contiguously for locality
Red zone: unallocated virtual address region between thread stacks; attempts to write cause a fault, making stack overflow debugging easier (fault is attributed to the offending thread)

Kernel-Level Data Structures¶

Process structure: kernel-level thread list, virtual-to-physical mappings, user credentials, signal handlers.

Lightweight process (LWP): per-thread user-level registers, syscall arguments, resource usage. Similar to user-level thread data but visible to the kernel scheduler. Aggregate process resource usage requires walking all LWPs. LWP data is swappable (can be paged out under memory pressure).

Kernel thread structure: kernel registers, stack pointer, scheduling class, pointers to LWP/address-space/CPU. This data is not swappable — always memory-resident (needed by scheduler even when thread is inactive).

CPU structure: current thread, dispatch procedures, interrupt handlers. On SPARC, a dedicated register always points to the current thread for fast access.

Thread Management Interactions¶

Kernel and user-level library lack mutual visibility, causing coordination challenges.

`set_concurrency` System Call¶

A process can request a specific number of kernel-level threads via set_concurrency. Example scenario:

Process has 4 user-level threads, actual concurrency level is 2
Kernel initially assigns 1 kernel thread; process requests 2 via set_concurrency
If both kernel threads block on I/O, process stalls despite having runnable user-level threads
Solution: kernel notifies user-level library before blocking; library requests additional kernel threads
When I/O completes and a kernel thread becomes idle, kernel can reclaim it

The pthread_setconcurrency function sets the concurrency level; passing 0 lets the implementation decide.

Visibility and Scheduling Problems¶

The kernel cannot see user-level thread state (mutexes, wait queues, scheduling decisions). The user-level library cannot see kernel scheduling decisions.

Priority inversion example: User-level thread holding a lock is preempted at the kernel level. Other user-level threads needing the lock are starved until the kernel reschedules the lock holder. This motivates one-to-one models to eliminate the visibility gap.

The user-level library scheduler is invoked on: explicit yield, timer expiration, synchronization operations (lock/unlock), and blocked threads becoming runnable.

Bound threads: A user-level thread permanently associated with a kernel-level thread (analogous to CPU thread pinning). In a one-to-one model, all threads are bound.

Multi-CPU Issues¶

On multi-CPU systems, a user-level library running on one CPU may need to affect scheduling on another CPU (e.g., preempting a low-priority thread to run a newly-unblocked high-priority thread). This requires sending inter-processor signals since registers on one CPU cannot be directly modified from another.

Adaptive Mutexes¶

On multi-CPU systems, if the mutex owner is running on another CPU and the critical section is short, the requesting thread spins instead of blocking — avoids the overhead of context switching and queue management. For long critical sections, the thread blocks normally.

Adaptive mutexes dynamically choose spin vs. block based on:

Whether the mutex owner is currently running on another CPU
Expected critical section length

Only useful on multi-CPU systems. Requires tracking the mutex owner.

Thread Destruction and Reuse¶

Exited threads are placed on a death row rather than immediately destroyed. A reaper thread periodically performs garbage collection. If a new thread is requested before reaping, the data structures and stack are reused, avoiding allocation overhead.

Interrupts and Signals¶

Interrupts¶

Generated externally by I/O devices, timers, or other CPUs
Platform-dependent (determined by hardware configuration)
Always asynchronous — not in direct response to a CPU action
Identified by a unique number; looked up in the interrupt handler table to find the handler’s starting address
Can be masked via a per-CPU interrupt mask (bitmask; disabled interrupts remain pending)

Signals¶

Generated by software running on the CPU or by CPU hardware faults
OS-dependent (identical hardware with different OS will have different signals)
Can be synchronous (e.g., SIGSEGV from illegal memory access, SIGFPE from divide-by-zero) or asynchronous (e.g., kill from another process, timer expiration)
Identified by a unique number; looked up in a per-process signal handler table
Can be masked via a per-process signal mask
OS provides default actions (terminate, core dump, stop, continue, ignore); processes can install custom handlers for most signals (some, like SIGKILL, cannot be caught)

Common POSIX signals: SIGINT (terminal interrupt), SIGURG (high-bandwidth socket data), SIGTTOU (background write attempt), SIGXFSZ (file size limit exceeded).

Why Disable Interrupts/Signals¶

Interrupt/signal handlers execute on the interrupted thread’s stack. If the handler needs a mutex already held by the interrupted thread, deadlock occurs.

Solution: use masks to dynamically disable specific interrupts/signals before acquiring a mutex. Re-enable after releasing the mutex. Pending disabled events are delivered once re-enabled, but the handler executes only once regardless of how many instances occurred while masked.

Interrupt masks: per-CPU
Signal masks: per-execution-context (thread)

On multi-core systems, interrupts can be routed to a single designated CPU, avoiding perturbation on other cores.

Signal Types¶

One-shot signals: handler called at least once regardless of how many instances occurred; custom handler must be re-registered after each invocation (otherwise reverts to OS default)
Real-time signals (Linux): if raised n times, handler is guaranteed to be called n times (queuing behavior)

Interrupts as Threads (SunOS/Solaris)¶

To avoid deadlocks in interrupt handlers that need locks, Solaris allows interrupt handlers to become full threads:

If the handler contains no locking, it runs on the interrupted thread’s stack (fast path)
If the handler may block, it runs as a separate thread with its own stack and context
Threads for interrupt handlers are pre-created and pre-initialized to eliminate dynamic creation cost from the interrupt fast path

Top half / Bottom half (Linux terminology):

Top half: minimal, non-blocking processing; runs immediately on interrupt
Bottom half: arbitrary complexity; runs as a schedulable thread that can block

Performance tradeoff: handling interrupts as threads adds ~40 instructions per interrupt, but saves ~12 instructions per mutex lock/unlock (no need to toggle interrupt masks). Since mutex operations vastly outnumber interrupts, this is a net win. This exemplifies optimizing for the common case.

Threads and Signal Handling (Solaris)¶

Each user-level thread has a signal mask (user-level, cheap to update). Each kernel-level thread/LWP has a signal mask (kernel-level, requires syscall to update).

The threading library wraps all signal handlers. When a signal arrives, the library handler checks masks and routes the signal:

Case 1 — Both user-level and kernel-level masks enabled: signal delivered directly to the current thread
Case 2 — Kernel mask enabled, current user-level thread’s mask disabled, but another runnable user-level thread has it enabled: library scheduler switches to that thread
Case 3 — Same as Case 2, but the capable thread is running on another CPU: library sends a directed signal to that kernel-level thread
Case 4 — All user-level threads have the signal disabled: library issues a syscall to disable the kernel-level mask, reissues the signal; process repeats across all kernel threads until all kernel masks reflect the disabled state. When any user-level thread later re-enables its mask, a syscall updates the kernel-level mask.

This approach optimizes the common case: mask updates (frequent) are cheap user-level operations; actual signal delivery (rare) incurs the extra complexity.

Linux Threading Model¶

Linux uses the task as the kernel-level execution context, represented by a task_struct (~1.7 KB).

PID field: actually a task ID (per-thread); misleading name for historical reasons
TGID (task group ID): the PID of the first task created for the process; identifies the process as a whole
Task list: links all tasks (threads) belonging to a single process
Process state is represented via pointers to separate structures (memory management, file management, etc.), enabling easy sharing between tasks

``clone()`` system call creates new tasks. A sharing flags bitmap controls what is shared:

All flags set → new thread (shared address space, files, etc.)
All flags cleared → fork() behavior (separate copy of everything)
Selective flags → partial sharing (e.g., shared files only)

fork() is internally implemented via clone() with flags cleared. For multithreaded processes, fork() creates a single-threaded child replicating only the calling thread’s visible state.

NPTL (Native POSIX Threads Library): current Linux threading implementation using a one-to-one model (one kernel task per user-level thread). Replaced the earlier LinuxThreads library (many-to-many model). Made viable by:

Cheaper kernel traps (fast user-to-kernel transitions)
More available memory (no pressure to minimize kernel threads)
Sufficient ID range for most workloads

For extreme scale (exascale computing, heterogeneous platforms), user-level library support and custom thread management policies remain relevant.

Linux Kernel Quiz¶

Kernel thread structure: kthread_worker (in kthread.h)
Process descriptor contained within it: task_struct
Minimum threads to boot: 20 (max_threads variable, set in fork.c)