Thread Performance Considerations¶

Based on Vivek Pai’s paper: Flash: An Efficient and Portable Web Server.

Choosing a Threading Model¶

Boss-worker vs Pipeline example (6 workers, 11 toy orders):

Boss-worker: 360 ms total execution time, 196 ms average completion time per order
Pipeline (6 stages × 20 ms): 320 ms total execution time, 220 ms average completion time per order

Conclusion: the “better” model depends on the metric. Pipeline wins on total execution time; boss-worker wins on average per-order completion time. Different workloads and worker counts can reverse these conclusions.

Performance Metrics¶

A metric is a measurable, quantifiable property that captures system behavior and enables comparison between implementations.

Key metrics for concurrent systems:

Execution time: total time to complete a workload
Throughput: tasks completed per unit time (processes/sec, requests/sec)
Response time: time from request submission to completion
Wait time: time before a job begins execution
CPU utilization: fraction of time the CPU is doing useful work
Platform efficiency: throughput relative to resource cost (throughput/dollar, performance/watt)
SLA violations: percentage of requests exceeding service-level guarantees (e.g., response time > 3s, accuracy below threshold)
Percentile values: 95th/99th percentile response time (more robust than averages for detecting outliers)

Evaluating whether “threads are useful” always requires specifying the metric, workload, and system configuration. The answer is always context-dependent.

Multi-Process vs Multi-Threaded¶

Web Server Processing Steps¶

A simple web server: accept connection → read/parse HTTP request → find file → compute header → send header + file (or error). Steps are either CPU-bound (parsing, header computation) or I/O-bound (network accept/send, disk read). Whether I/O blocks depends on system state (data may already be cached).

Multi-Process Model¶

Spawn multiple instances of the same single-threaded process.

Pros: simple to implement
Cons: high memory overhead (separate address space per process), expensive context switches, costly IPC and synchronization, difficulty sharing socket ports

Multi-Threaded Model¶

Multiple threads within a single address space, each processing a request (or using boss-worker pattern).

Pros: shared address space (lower memory, no IPC needed), cheap context switching, user-level synchronization
Cons: synchronization complexity (explicit locking for shared state), requires OS thread support

Event-Driven Model¶

Architecture¶

Single process, single thread of control. An event dispatcher loops continuously, receiving events and invoking registered handlers. Events include: incoming network connections, received HTTP requests, disk read completions, send completions.

Handlers run to completion — if a blocking operation is needed, the handler initiates it and returns control to the dispatcher immediately. The dispatcher operates as a state machine.

Achieving Concurrency¶

Concurrency is achieved by interleaving processing of multiple requests within a single thread. While one request waits on I/O, the thread processes other requests. No context switching overhead — just jumping between handlers.

Why It Works¶

On a single CPU, the event-driven model hides I/O latency (like threads) without context-switching overhead. Processing continues as long as there is no wait; when a wait occurs, the thread switches to another request.

On multiple CPUs, each CPU can host one event-driven process handling many concurrent requests — less overhead than context-switching among many threads/processes per CPU. Requires mechanisms to route events to the correct CPU.

Implementation¶

Events map to file descriptors (both sockets and files use the same abstraction).

Event notification mechanisms:

select(): scans a range of file descriptors, returns the first with input — O(n) scan, wasteful for sparse activity
poll(): similar to select, same scanning problem
epoll() (Linux): eliminates scanning overhead — used by modern high-performance servers

Benefits: no context switching, smaller memory footprint, no synchronization needed. Minor cache pollution from jumping between handlers is much less than full context switches.

Helper Threads/Processes (AMPED Model)¶

Problem: a blocking I/O call in a handler blocks the entire event-driven process.

Solution 1 — Asynchronous I/O: kernel handles I/O in background; caller thread continues. The OS kernel (being multi-threaded) or the device itself performs the operation. Fits naturally with event-driven model (events from async completions). Not universally available for all device types.

Solution 2 — Helper entities: when a handler needs blocking I/O, it delegates to a helper (communicating via pipes or sockets, which are file descriptors monitorable by select/poll/epoll). The helper blocks; the main event loop continues.

AMPED (Asymmetric Multi-Process Event-Driven): helpers are processes (for portability to non-multithreaded kernels). AMTED variant uses helper threads instead.

Benefits of AMPED over multi-process/multi-threaded models:

Smaller memory footprint: helpers exist only for concurrent blocking I/O operations (not for every concurrent request)
Main event loop avoids context switches and synchronization

Limitation: best suited for server-type applications; event routing on multi-CPU systems adds complexity.

The Flash Web Server¶

Flash is an AMPED web server:

Helper processes handle blocking disk reads only
Communication from helpers via pipes
Uses mmap() to map files; mincore() checks if pages are memory-resident before deciding whether to use a local handler or dispatch to a helper
The mincore() check adds minor overhead but prevents blocking the main process

Application-Level Optimizations¶

These optimizations are applicable to any web server:

Data caching: cache frequently requested files in memory
Computation caching: cache directory lookup results (pathname resolution) and pre-computed HTTP response headers — avoid redundant computation on repeated requests for the same file
Network optimizations: align data structures for zero-copy DMA; use scatter-gather DMA so headers and file data can be sent from separate memory locations without copying

Apache Web Server¶

Apache (at the time of the paper) used a multi-process model where each process was internally a multi-threaded boss-worker with dynamic thread pool management. Total process count was also dynamically adjusted based on load (outstanding connections, pending requests, CPU usage). Each request flows through configurable modules (security, dynamic content, HTTP processing).

Experimental Methodology¶

Comparison Points¶

All implementations (except Apache) used the same Flash optimizations:

MP (Multi-Process): single-threaded processes, Flash optimizations
MT (Multi-Threaded): boss-worker model, Flash optimizations
SPED (Single-Process Event-Driven): basic event-driven, no helpers
Zeus: SPED with two processes (to handle blocking I/O)
Apache: open-source, multi-process, no Flash optimizations
Flash (AMPED): event-driven with asymmetric helper processes

Workloads¶

Synthetic: all requests for the same file (best-case analysis)
CS trace (Rice University CS department): large working set, does not fit in memory → I/O-bound
Owlnet trace (student web pages): small working set, mostly fits in memory → cache-friendly
Synthetic workload generator: for best/worst case and what-if analysis

Metrics¶

Bandwidth: (N × file_size) / time — measured as a function of file size
Connection rate: client connections serviced per unit time

Results¶

Best Case (Synthetic — Single File)¶

All requests for one cached file; no disk I/O needed.

SPED best (no context switching, no helpers)
Flash slightly lower (mincore() check overhead, but no helpers invoked)
MT/MP slower due to context switching and synchronization
Apache worst (lacks Flash optimizations)

Owlnet Trace (Small, Mostly Cached)¶

Similar to best case since most data fits in cache. Occasional blocking I/O needed.

Flash slightly better than SPED (helpers handle rare blocking I/O; SPED blocks entirely)
MT/MP and Apache remain slower

CS Trace (Large, I/O-Bound)¶

Working set exceeds memory; frequent disk I/O required.

SPED performance drops dramatically (no async I/O support; blocking stalls entire process)
MT better than MP (smaller memory footprint → more cache space → less I/O; cheaper synchronization and context switching)
Flash best overall: smallest memory footprint → most memory for file caching → fewest blocking I/O calls; no synchronization overhead; helpers handle only concurrent I/O-bound requests

Impact of Optimizations¶

Flash without optimizations performs comparably to Apache. Adding optimizations incrementally:

Path caching (directory lookup): significant improvement
+ File caching (mmap): further improvement
+ Header caching: highest connection rate

This confirms the optimizations are crucial and explains Apache’s relatively lower performance (lacking them).

Summary of Flash Results¶

Data in cache: SPED ≈ AMPED > MT > MP (event-driven avoids context switching; SPED slightly better than AMPED due to no mincore check)
Disk-bound workload: AMPED > MT > MP > SPED (AMPED has helpers for non-blocking I/O; smallest memory footprint; SPED blocks without async I/O)
Many modern high-performance servers use event-driven models combined with asynchronous I/O

Designing Experiments¶

Principles¶

Define comparison points: isolate what you’re comparing (software vs software → same hardware; hardware vs hardware → same software)
Choose representative workloads: use real traces for reproducibility; supplement with synthetic loads for best/worst case analysis
Select stakeholder-relevant metrics: client-facing (response time) vs operator-facing (throughput, cost) vs system-facing (utilization)
Use standard domain metrics: enables broader audience to understand and relate to results

Experimental Design Guidelines¶

Identify impactful factors: system resources (CPUs, memory, threads, buffer sizes) and workload parameters (request rate, file size, access pattern)
Use realistic ranges: parameter values must reflect real-world deployments
Best/worst case exceptions: unrealistic workloads are acceptable only when demonstrating bounds or system limitations
Compare apples to apples: vary one factor at a time; never change both workload and resources simultaneously
Establish baselines: compare against state-of-the-art or common practice; demonstrate improvement
Run multiple trials: compute averages; report variability
Draw explicit conclusions: state what the results support; don’t just show graphs