CPU Scheduling

Scheduling Overview

The CPU scheduler decides how and when processes and threads access shared CPUs. The term task refers interchangeably to processes or threads.

The scheduler selects a task from the ready queue (runqueue) and dispatches it onto a CPU via context switch, setting the program counter to the next instruction.

When the scheduler runs:

  • CPU becomes idle (e.g., running thread blocks on I/O)

  • New task becomes ready (I/O completion, interrupt, or creation)

  • Timeslice expires for the currently running task

Key design questions:

  • Which task to select? — determined by the scheduling policy/algorithm

  • How to select efficiently? — determined by the runqueue data structure

The scheduling algorithm and runqueue design are tightly coupled; certain algorithms demand specific data structures.

Run-to-Completion Scheduling

Assumes a task runs on the CPU until it finishes (no preemption). Initial simplifying assumptions: known execution times, single CPU.

Comparison metrics: throughput, average completion time, average wait time, CPU utilization.

First-Come First-Serve (FCFS)

Tasks scheduled in arrival order. Runqueue is a simple FIFO queue — enqueue at tail, dequeue from head, O(1) operations.

Example (T1=1s, T2=10s, T3=1s, arriving in order):

  • Throughput: 3/12 = 0.25 tasks/s

  • Avg completion time: (1+11+12)/3 = 8s

  • Avg wait time: (0+1+11)/3 = 4s

FCFS is simple but yields poor wait times when long tasks precede short ones.

Shortest Job First (SJF)

Tasks scheduled in order of execution time. Runqueue can be an ordered queue or a tree (ordered by execution time; leftmost node = shortest job).

Same example, SJF order (T1, T3, T2):

  • Throughput: 3/12 = 0.25 tasks/s (unchanged)

  • Avg completion time: (1+2+12)/3 = 5s (vs. 8s FCFS)

  • Avg wait time: (0+1+2)/3 = 1s (vs. 4s FCFS)

Preemptive Scheduling

SJF with Preemption

Tasks can arrive at different times; a newly arrived shorter job preempts the running task. The scheduler must re-evaluate on every new arrival.

Estimating execution time: since true execution time is unknown, use heuristics based on past behavior — last execution time, or a windowed average over recent runs.

Priority Scheduling

Tasks have different priority levels. The scheduler always runs the highest-priority ready task, preempting lower-priority ones as needed.

Runqueue options:

  • Per-priority queues — scheduler picks from the highest non-empty queue

  • Ordered tree — ordered by priority

Starvation is a danger: low-priority tasks may never run. Solution: priority aging — priority increases as a function of time spent waiting in the runqueue, guaranteeing eventual execution.

Priority Inversion

Occurs when a high-priority thread (T1) is blocked waiting for a lock held by a low-priority thread (T3), while a medium-priority thread (T2) runs instead — effectively inverting the priority order.

Solution: priority inheritance — temporarily boost the lock owner’s priority to that of the highest-priority waiter. This is why mutex data structures track the current owner. The owner’s priority is restored upon lock release.

Round Robin Scheduling

Tasks take turns on the CPU in FIFO order, but may yield (e.g., for I/O) before completion. Generalizes with priorities: higher-priority arrivals preempt; same-priority tasks round-robin.

Timesharing and Timeslices

A timeslice (time quantum) is the maximum uninterrupted time given to a task. Tasks may run less than the timeslice if they block on I/O, synchronize, or are preempted by higher-priority tasks.

Round robin with timeslice=1s (T1=1s, T2=10s, T3=1s):

  • Avg wait time: (0+1+2)/3 = 1s

  • Avg completion time: (1+12+3)/3 = 5.33s

Close to SJF performance without needing to know execution times. Short tasks finish sooner and the system is more responsive.

Overhead: context switching (interrupt → scheduler → context switch) is pure overhead. Timeslice must be significantly larger than context-switch time to minimize this cost.

Timeslice Length

CPU-bound tasks prefer longer timeslices:

  • Fewer context switches → lower overhead

  • Better throughput and completion time

  • Wait time is less important for non-interactive work

I/O-bound tasks prefer shorter timeslices:

  • Issue I/O requests sooner → better CPU and device utilization

  • More responsive to user interaction

Example — single CPU, 10 I/O-bound + 1 CPU-bound task:

  • I/O-bound: issue I/O every 1ms, I/O takes 10ms; context switch = 0.1ms

  • Timeslice=1ms → CPU utilization 91%

  • Timeslice=10ms → CPU utilization 95% (CPU-bound task runs full 10ms uninterrupted)

Larger timeslice favors CPU utilization; smaller timeslice favors I/O device utilization.

Runqueue Data Structures

The runqueue need not be a literal queue. To assign different timeslice values to I/O-bound vs. CPU-bound tasks, two approaches exist:

  1. Single queue with per-task metadata for the scheduler to apply different policies

  2. Separate queues for different task types, each with its own policy

Multi-Level Feedback Queue (MLFQ)

A multi-queue structure with different timeslice values per level:

  • Top level (shortest timeslice, e.g., 8ms) — most I/O-intensive tasks, highest scheduling priority

  • Middle level (medium timeslice, e.g., 16ms) — mixed tasks

  • Bottom level (longest/infinite timeslice, FCFS) — CPU-intensive tasks, lowest priority

Feedback mechanism:

  • New tasks enter the top level (assumed I/O-intensive)

  • If a task yields before timeslice expires → stays at current level

  • If a task uses its entire timeslice → pushed down one level

  • If a task at a lower level starts yielding early repeatedly → promoted up

This adaptively classifies tasks without prior knowledge. Fernando Corbatò received the Turing Award for this design.

Linux O(1) Scheduler

Named for constant-time task selection and insertion regardless of total task count.

Design:

  • Preemptive, priority-based with 140 priority levels (0=highest, 139=lowest)

  • Priorities 0–99: real-time class; 100–139: timesharing class

  • Default user process priority: 120, adjustable via nice values (−20 to +19)

  • Lower-priority tasks get smaller timeslices; higher-priority tasks get larger timeslices

Feedback: based on sleep time — longer sleep → interactive → priority boosted (−5); shorter sleep → compute-intensive → priority lowered (+5).

Runqueue: two arrays of 140 task queuesActive and Expired:

  • Scheduler selects from Active using hardware find-first-set-bit instructions → O(1)

  • Tasks yielding before timeslice expires remain in Active

  • Tasks consuming full timeslice move to Expired

  • When Active is empty, swap Active/Expired pointers (also acts as anti-starvation aging)

History: introduced in Linux 2.5 by Ingo Molnár. Replaced due to poor interactive task performance (jitter) as workloads shifted to latency-sensitive applications (Skype, streaming, gaming).

Linux CFS (Completely Fair Scheduler)

Default scheduler since Linux 2.6.23 (non-real-time tasks). Also by Ingo Molnár.

Key idea: uses a Red-Black Tree as the runqueue, ordered by virtual runtime (vruntime) tracked at nanosecond granularity.

  • Leftmost node = task with least CPU time → scheduled next

  • Periodically, the running task’s vruntime is updated and compared to the leftmost node; preempted if its vruntime exceeds the leftmost

Priority handling via vruntime rate:

  • Lower-priority tasks → vruntime progresses faster → lose CPU sooner

  • Higher-priority tasks → vruntime progresses slower → keep CPU longer

Complexity: task selection is O(1) (leftmost node); insertion is O(log n).

Why O(1) was replaced: once tasks moved to the Expired list, they waited until all Active tasks consumed their timeslices — causing unpredictable latency for interactive tasks.

Scheduling on Multiprocessors

Cache Affinity

In SMP (shared memory multiprocessor) and multicore systems, the OS sees each core as a schedulable CPU.

A thread’s performance depends on cache state. Re-scheduling a thread on the same CPU where it previously ran exploits hot caches (cache affinity). Scheduling it elsewhere forces cold-cache operation.

Approach: hierarchical scheduler with a load balancer distributing tasks across CPUs, plus per-CPU runqueues that keep tasks on the same CPU.

NUMA-Aware Scheduling

On NUMA (Non-Uniform Memory Access) platforms, memory access latency depends on CPU-to-memory-node proximity (e.g., Intel QPI interconnect). The scheduler should bind tasks to CPUs closest to the memory node holding their state.

Hyperthreading (SMT)

Hyperthreading (simultaneous multithreading / SMT) provides multiple hardware execution contexts (register sets) on a single CPU. Context switching between hardware threads takes order of cycles (vs. hundreds of cycles for memory access), effectively hiding memory latency.

  • Hardware today: typically 2 hardware threads per core (up to 8 on high-end servers)

  • Can be enabled/disabled at boot; OS sees each hardware thread as a virtual CPU

Co-scheduling decisions (from Fedorova et al., “Chip Multithreaded Processors Need a New OS Scheduler”):

  • Two CPU-bound threads → contend for pipeline → each degrades by ~2×, memory idle

  • Two memory-bound threads → wasted CPU cycles during memory stalls

  • Mix of CPU + memory-bound → best utilization; CPU-bound runs while memory-bound waits on memory

Hardware Counters and CPI

To determine if a thread is CPU-bound vs. memory-bound, use hardware performance counters (cache misses, IPC, power usage, etc.) accessed via tools like oprofile or perf.

Fedorova proposed Cycles Per Instruction (CPI) as a scheduling metric:

  • Memory-bound threads → high CPI (many stall cycles per instruction)

  • CPU-bound threads → CPI ≈ 1

Experiment: 4 cores × 4 SMT threads, synthetic workload with CPI values {1, 6, 11, 16}. Mixing diverse CPI values per core → high aggregate IPC. Grouping similar CPI → low IPC due to pipeline contention or wasted cycles.

Practical limitation: real-world benchmarks show CPI values clustered together — insufficient spread for CPI-based scheduling to be effective.

Key takeaways:

  • Schedulers should consider resource contention (pipeline, cache, memory controller), not just load balancing

  • Last-level cache contention is a particularly important performance factor in multicore/SMT systems

  • Hardware counters are valuable for informing resource-aware scheduling decisions