CPU Scheduling¶
Scheduling Overview¶
The CPU scheduler decides how and when processes and threads access shared CPUs. The term task refers interchangeably to processes or threads.
The scheduler selects a task from the ready queue (runqueue) and dispatches it onto a CPU via context switch, setting the program counter to the next instruction.
When the scheduler runs:
CPU becomes idle (e.g., running thread blocks on I/O)
New task becomes ready (I/O completion, interrupt, or creation)
Timeslice expires for the currently running task
Key design questions:
Which task to select? — determined by the scheduling policy/algorithm
How to select efficiently? — determined by the runqueue data structure
The scheduling algorithm and runqueue design are tightly coupled; certain algorithms demand specific data structures.
Run-to-Completion Scheduling¶
Assumes a task runs on the CPU until it finishes (no preemption). Initial simplifying assumptions: known execution times, single CPU.
Comparison metrics: throughput, average completion time, average wait time, CPU utilization.
First-Come First-Serve (FCFS)¶
Tasks scheduled in arrival order. Runqueue is a simple FIFO queue — enqueue at tail, dequeue from head, O(1) operations.
Example (T1=1s, T2=10s, T3=1s, arriving in order):
Throughput: 3/12 = 0.25 tasks/s
Avg completion time: (1+11+12)/3 = 8s
Avg wait time: (0+1+11)/3 = 4s
FCFS is simple but yields poor wait times when long tasks precede short ones.
Shortest Job First (SJF)¶
Tasks scheduled in order of execution time. Runqueue can be an ordered queue or a tree (ordered by execution time; leftmost node = shortest job).
Same example, SJF order (T1, T3, T2):
Throughput: 3/12 = 0.25 tasks/s (unchanged)
Avg completion time: (1+2+12)/3 = 5s (vs. 8s FCFS)
Avg wait time: (0+1+2)/3 = 1s (vs. 4s FCFS)
Preemptive Scheduling¶
SJF with Preemption¶
Tasks can arrive at different times; a newly arrived shorter job preempts the running task. The scheduler must re-evaluate on every new arrival.
Estimating execution time: since true execution time is unknown, use heuristics based on past behavior — last execution time, or a windowed average over recent runs.
Priority Scheduling¶
Tasks have different priority levels. The scheduler always runs the highest-priority ready task, preempting lower-priority ones as needed.
Runqueue options:
Per-priority queues — scheduler picks from the highest non-empty queue
Ordered tree — ordered by priority
Starvation is a danger: low-priority tasks may never run. Solution: priority aging — priority increases as a function of time spent waiting in the runqueue, guaranteeing eventual execution.
Priority Inversion¶
Occurs when a high-priority thread (T1) is blocked waiting for a lock held by a low-priority thread (T3), while a medium-priority thread (T2) runs instead — effectively inverting the priority order.
Solution: priority inheritance — temporarily boost the lock owner’s priority to that of the highest-priority waiter. This is why mutex data structures track the current owner. The owner’s priority is restored upon lock release.
Round Robin Scheduling¶
Tasks take turns on the CPU in FIFO order, but may yield (e.g., for I/O) before completion. Generalizes with priorities: higher-priority arrivals preempt; same-priority tasks round-robin.
Timesharing and Timeslices¶
A timeslice (time quantum) is the maximum uninterrupted time given to a task. Tasks may run less than the timeslice if they block on I/O, synchronize, or are preempted by higher-priority tasks.
Round robin with timeslice=1s (T1=1s, T2=10s, T3=1s):
Avg wait time: (0+1+2)/3 = 1s
Avg completion time: (1+12+3)/3 = 5.33s
Close to SJF performance without needing to know execution times. Short tasks finish sooner and the system is more responsive.
Overhead: context switching (interrupt → scheduler → context switch) is pure overhead. Timeslice must be significantly larger than context-switch time to minimize this cost.
Timeslice Length¶
CPU-bound tasks prefer longer timeslices:
Fewer context switches → lower overhead
Better throughput and completion time
Wait time is less important for non-interactive work
I/O-bound tasks prefer shorter timeslices:
Issue I/O requests sooner → better CPU and device utilization
More responsive to user interaction
Example — single CPU, 10 I/O-bound + 1 CPU-bound task:
I/O-bound: issue I/O every 1ms, I/O takes 10ms; context switch = 0.1ms
Timeslice=1ms → CPU utilization 91%
Timeslice=10ms → CPU utilization 95% (CPU-bound task runs full 10ms uninterrupted)
Larger timeslice favors CPU utilization; smaller timeslice favors I/O device utilization.
Runqueue Data Structures¶
The runqueue need not be a literal queue. To assign different timeslice values to I/O-bound vs. CPU-bound tasks, two approaches exist:
Single queue with per-task metadata for the scheduler to apply different policies
Separate queues for different task types, each with its own policy
Multi-Level Feedback Queue (MLFQ)¶
A multi-queue structure with different timeslice values per level:
Top level (shortest timeslice, e.g., 8ms) — most I/O-intensive tasks, highest scheduling priority
Middle level (medium timeslice, e.g., 16ms) — mixed tasks
Bottom level (longest/infinite timeslice, FCFS) — CPU-intensive tasks, lowest priority
Feedback mechanism:
New tasks enter the top level (assumed I/O-intensive)
If a task yields before timeslice expires → stays at current level
If a task uses its entire timeslice → pushed down one level
If a task at a lower level starts yielding early repeatedly → promoted up
This adaptively classifies tasks without prior knowledge. Fernando Corbatò received the Turing Award for this design.
Linux O(1) Scheduler¶
Named for constant-time task selection and insertion regardless of total task count.
Design:
Preemptive, priority-based with 140 priority levels (0=highest, 139=lowest)
Priorities 0–99: real-time class; 100–139: timesharing class
Default user process priority: 120, adjustable via
nicevalues (−20 to +19)Lower-priority tasks get smaller timeslices; higher-priority tasks get larger timeslices
Feedback: based on sleep time — longer sleep → interactive → priority boosted (−5); shorter sleep → compute-intensive → priority lowered (+5).
Runqueue: two arrays of 140 task queues — Active and Expired:
Scheduler selects from Active using hardware find-first-set-bit instructions → O(1)
Tasks yielding before timeslice expires remain in Active
Tasks consuming full timeslice move to Expired
When Active is empty, swap Active/Expired pointers (also acts as anti-starvation aging)
History: introduced in Linux 2.5 by Ingo Molnár. Replaced due to poor interactive task performance (jitter) as workloads shifted to latency-sensitive applications (Skype, streaming, gaming).
Linux CFS (Completely Fair Scheduler)¶
Default scheduler since Linux 2.6.23 (non-real-time tasks). Also by Ingo Molnár.
Key idea: uses a Red-Black Tree as the runqueue, ordered by virtual runtime (vruntime) tracked at nanosecond granularity.
Leftmost node = task with least CPU time → scheduled next
Periodically, the running task’s vruntime is updated and compared to the leftmost node; preempted if its vruntime exceeds the leftmost
Priority handling via vruntime rate:
Lower-priority tasks → vruntime progresses faster → lose CPU sooner
Higher-priority tasks → vruntime progresses slower → keep CPU longer
Complexity: task selection is O(1) (leftmost node); insertion is O(log n).
Why O(1) was replaced: once tasks moved to the Expired list, they waited until all Active tasks consumed their timeslices — causing unpredictable latency for interactive tasks.
Scheduling on Multiprocessors¶
Cache Affinity¶
In SMP (shared memory multiprocessor) and multicore systems, the OS sees each core as a schedulable CPU.
A thread’s performance depends on cache state. Re-scheduling a thread on the same CPU where it previously ran exploits hot caches (cache affinity). Scheduling it elsewhere forces cold-cache operation.
Approach: hierarchical scheduler with a load balancer distributing tasks across CPUs, plus per-CPU runqueues that keep tasks on the same CPU.
NUMA-Aware Scheduling¶
On NUMA (Non-Uniform Memory Access) platforms, memory access latency depends on CPU-to-memory-node proximity (e.g., Intel QPI interconnect). The scheduler should bind tasks to CPUs closest to the memory node holding their state.
Hyperthreading (SMT)¶
Hyperthreading (simultaneous multithreading / SMT) provides multiple hardware execution contexts (register sets) on a single CPU. Context switching between hardware threads takes order of cycles (vs. hundreds of cycles for memory access), effectively hiding memory latency.
Hardware today: typically 2 hardware threads per core (up to 8 on high-end servers)
Can be enabled/disabled at boot; OS sees each hardware thread as a virtual CPU
Co-scheduling decisions (from Fedorova et al., “Chip Multithreaded Processors Need a New OS Scheduler”):
Two CPU-bound threads → contend for pipeline → each degrades by ~2×, memory idle
Two memory-bound threads → wasted CPU cycles during memory stalls
Mix of CPU + memory-bound → best utilization; CPU-bound runs while memory-bound waits on memory
Hardware Counters and CPI¶
To determine if a thread is CPU-bound vs. memory-bound, use hardware performance counters (cache misses, IPC, power usage, etc.) accessed via tools like oprofile or perf.
Fedorova proposed Cycles Per Instruction (CPI) as a scheduling metric:
Memory-bound threads → high CPI (many stall cycles per instruction)
CPU-bound threads → CPI ≈ 1
Experiment: 4 cores × 4 SMT threads, synthetic workload with CPI values {1, 6, 11, 16}. Mixing diverse CPI values per core → high aggregate IPC. Grouping similar CPI → low IPC due to pipeline contention or wasted cycles.
Practical limitation: real-world benchmarks show CPI values clustered together — insufficient spread for CPI-based scheduling to be effective.
Key takeaways:
Schedulers should consider resource contention (pipeline, cache, memory controller), not just load balancing
Last-level cache contention is a particularly important performance factor in multicore/SMT systems
Hardware counters are valuable for informing resource-aware scheduling decisions