KWIC Exercise¶

Introduction¶

This exercise, first described by David Parnas, asks you to design four different architectures for the same problem: producing a Key Word in Context (KWIC) index. It illustrates how architectural choices affect maintainability, performance, and resilience to change.

KWIC Index Definition¶

A KWIC index system:

Accepts a sequence of text lines (e.g., titles) as input
Generates all circular shifts of each line — removing the first word and appending it to the end, repeated for each word
Outputs all circular shifts of all lines in alphabetical order by the key word

A line of n words produces n circular shifts (including the original). Stop words (e.g., “and”, “of”, “the”) are typically excluded from indexing. Output is formatted so key words align in a column for easy scanning.

Example: “Gone With the Wind” produces shifts:

Gone With the Wind
With the Wind Gone
the Wind Gone With
Wind Gone With the

Four Architectural Solutions¶

Shared Data Decomposition¶

Principle: Decompose into components based on functions they compute; all components share access to common in-memory data structures.

Components:

Master Controller (“System”) — orchestrates the overall process
Input — reads and parses lines into shared data structures (Lines, Words, Index)
Circular Shifter — generates all circular shifts, stored in shared memory
Alphabetizer / Sorter — sorts shifts alphabetically
Output — writes the final result

Communication: Subroutine calls (solid arrows) for control flow; direct memory access (dashed arrows) for data.

Characteristics:

(+) Simple, efficient — no function call overhead for data access
(−) Fragile — any change to the shared data representation breaks all components

Pipe and Filter¶

Principle: Decompose into a linear chain of independent filters connected by pipes; each filter transforms its input stream and passes results to the next. No shared data storage.

Pipeline: Input → Circular Shifter → Alphabetizer → Output

Characteristics:

(+) Intuitive; each filter is an independent, reusable unit
(+) Easy to reorder or replace filters
(−) Cannot support interactive use (batch-only, streaming model)
(−) Not space-efficient — no persistent shared storage; data may be duplicated across filters

Abstract Data Type (ADT)¶

Principle: Decompose based on important data structures; hide representations behind abstract interfaces with operations. Precursor to object-oriented design.

Components:

Lines — ADT storing parsed input lines
Characters — ADT for character-level access
Circular Shifts — ADT with an operation to compute shifts (shifts as a data structure, not just a verb)
Alphabetized Shifts — ADT for sorted results
Input / Output — I/O components
Master Controller — invokes other components

Communication: Need-to-know basis — Output queries Alphabetized Shifts, which queries Circular Shifts, which queries Lines.

Characteristics:

(+) Excellent maintainability and reuse — hidden representations can change independently
(+) Easy to add operations (e.g., deletion) to individual ADTs
(−) Function call overhead through abstract interfaces may reduce performance

Implicit Invocation¶

Principle: Coordinate components via registration-broadcast (publish-subscribe). Clients register interest in state changes of servers; servers broadcast events without knowing client identities.

Components: Same as ADT, but interaction is event-driven:

When data changes, the component announces the change
All registered components are called back automatically
Servers don’t know client identities → loose coupling

Characteristics:

(+) Maintainability — changing a component’s representation only requires changes in one place
(+) Facilitates reuse
(−) Harder to reason about control flow and debug (implicit, not explicit invocation)
(−) Performance overhead similar to ADT

Comparative Evaluation¶

Resilience to Change¶

Potential enhancements a customer might request:

GUI interface for browsing the index
Interactive search over sorted keywords
Data persistence (add/remove titles without full reprocessing)
Changed input formats (e.g., bibliographic databases)
Reuse of components in other applications
Algorithm changes (shift on read vs. shift after read, incremental sort)
New functionality (stop word filtering, deletions)
External storage (database on disk)
Changed data representation (different in-memory library)

Key insight: The best architecture depends on which changes you anticipate:

Reusability → Pipe and Filter (independent, pluggable filters)
Data representation change → Shared Data is worst (shared assumptions break everything)
Interactive deletion → ADT is best (just add a delete operation to the Line ADT)

Lessons¶

Multiple architectural styles can solve the same design problem
Each style has distinct advantages and disadvantages
Choose the style whose strengths match your system’s most important requirements and anticipated evolution