Datacenter Technologies¶
Internet Services¶
An internet service is any service accessible via a web interface (weather, email, banking, search). End users send web requests via browsers and receive responses.
Multi-Tier Architecture¶
Most internet services decompose into three tiers:
Presentation — interfaces with end users; serves static content (web page layout)
Business logic — processes dynamic, user-specific content
Database — data storage and management
These tiers may run as separate processes on separate machines, or be combined (e.g., Apache httpd serving static content + PHP modules generating dynamic content in one process). Middleware provides cross-cutting services: messaging, configuration management, security, accounting, persistence, and recovery.
Inter-process communication between tiers uses RPC/RMI and shared-memory optimizations when processes co-locate on the same machine.
Internet Service Architectures¶
Scale-Out¶
For high or variable request rates, services deploy multiple processes across multiple machines — scale-out architecture. A front-end load balancer routes incoming requests to available backend nodes (analogous to the boss-worker threading pattern).
Functionally Homogeneous¶
Every node can process any request type and execute any processing step.
Pros: simple front-end (round-robin or basic CPU-load balancing); easy to scale — just add more identical nodes
Cons: limited caching benefit — front-end doesn’t exploit data locality; requests for the same data may land on different nodes
Functionally Heterogeneous¶
Nodes are specialized for certain request types or processing steps (e.g., eBay: browsing servers vs bidding servers; partitioning by file name range).
Pros: specialization improves caching and locality
Cons: complex front-end (must parse requests to route); harder management (must profile workload to decide what to add); workload shifts cause hotspots on specialized nodes while others idle
Scale-Out Limitations¶
Even homogeneous scale-out has limits:
Management complexity grows — a single manager/controller cannot handle infinite resources
Physical capacity — datacenter space, power
Software stack constraints — if you must run your own stack, each machine requires setup and management
These limitations motivated the emergence of cloud computing.
Cloud Computing¶
The Animoto Example¶
Animoto (compute-intensive image→video service) ran on ~50 Amazon EC2 instances. After launching on Facebook in April 2008:
750,000 new users in 3 days
Resources grew from 50 → 400 (Tuesday) → 3,400 instances (Friday)
Two orders of magnitude scale-up in under a week — impossible with self-owned infrastructure
Requirements¶
The ideal: capacity scales elastically and instantly with demand; cost proportional to usage.
Cloud computing distills to:
On-demand, elastic resources and services
Fine-grained, usage-based pricing (not per-physical-server)
Professionally managed and hosted
Accessible via APIs for remote provisioning and control
What Cloud Computing Provides¶
Shared resource pool:
Infrastructure (IaaS) — virtual machines, storage, networking (e.g., Amazon EC2)
Software services (SaaS) — email, databases, processing runtimes (e.g., Gmail)
Platform (PaaS) — development/execution environments with OS, libraries, tools (e.g., Google App Engine)
APIs: web-based, language libraries (Java, Python), CLIs.
Billing: typically discrete step-function pricing (tiny/medium/large/XL instances), not true per-cycle usage.
Management stacks: OpenStack (open source), VMware vSphere.
Why It Works¶
Law of Large Numbers — average resource demand across many customers is roughly constant, even as individual peaks shift in time
Economies of Scale — hardware costs amortized across many tenants
Cloud Computing Vision¶
John McCarthy (1961): “Computing may some day be organized as a public utility, just as the telephone system is a public utility.”
Cloud computing aims to make IT resources a fungible utility via virtualization. Remaining limitations:
Hardware dependencies that virtualization cannot fully mask
API lock-in across providers (no universal standard)
Privacy and security concerns (data leaves your premises)
Latency from geographic distance
Cloud Deployment Models¶
Defined by NIST (2011):
Public — infrastructure owned by provider, open to third-party tenants (e.g., AWS)
Private — infrastructure and tenants owned by the same organization; cloud technology used internally for elastic provisioning
Hybrid — private cloud + public cloud for failover, spike handling, or testing
Community — public cloud restricted to a specific user community
Cloud Service Models¶
Model |
What the provider manages |
Example |
|---|---|---|
IaaS |
Hardware + virtualization |
Amazon EC2 |
PaaS |
|
Google App Engine |
SaaS |
|
Gmail |
On-premise: you manage everything. Moving up the stack, the provider manages more layers.
IaaS resources are virtualized and typically shared with other tenants. Exception: Amazon high-performance and GPU instances run on dedicated physical hardware due to performance virtualization challenges.
Cloud Technology Requirements¶
Fungible resources — easily repurposed across customers and hardware generations via virtualization
Elastic resource management — dynamic allocation/deallocation at scale (thousands of nodes); platforms: Mesos, YARN
Failure handling — with N components each having failure probability p, system failure probability = 1 − (1−p)^N. At scale, failures are inevitable (p=0.03, N=10 → 26% chance; N=100 → 95%). Software must incorporate timeouts, retries, checkpointing, and graceful recovery.
Multi-tenancy — performance isolation so one tenant cannot monopolize resources
Security and privacy — isolation of tenant data; protection of provider infrastructure from tenant vulnerabilities
Cloud Enabling Technologies¶
Virtualization — fungible, dynamically repurposable resources
Resource scheduling — Mesos, YARN for infrastructure provisioning
Big data processing — Hadoop MapReduce, Spark
Distributed storage — append-only distributed file systems, NoSQL databases, distributed in-memory stores
Software-defined infrastructure — software-defined networking, storage slicing for tenant isolation
Monitoring — real-time log processing and anomaly detection (Flume, CloudWatch, LogInsight)
Cloud as a Big Data Engine¶
Cloud empowers access to virtually infinite compute and storage resources for data-intensive applications.
A cloud big data platform typically includes:
Distributed storage — stores and replicates data across many nodes (e.g., HDFS)
Data processing framework — parallel computation over stored data (e.g., MapReduce, Spark); scheduler co-locates computation with data to minimize transfers
In-memory caching — avoids repeated disk access (e.g., Tachyon/Alluxio)
Query language front-end — SQL-like interfaces over distributed data (e.g., Hive, Spark SQL)
Analytics libraries — machine learning, graph processing (e.g., Mahout, MLlib, GraphX)
Streaming — continuous data ingestion and real-time output (e.g., Kafka, Spark Streaming)
Example Big Data Stacks¶
Hadoop stack:
HDFS (storage) → MapReduce/YARN (processing) → HBase (table-structured access) → Hive (SQL front-end) → Mahout (ML), R, Pig (analytics) + ZooKeeper (coordination), Kafka (streaming)
Berkeley Data Analytics Stack (BDAS):
Storage → Tachyon (in-memory FS) → Spark (processing) → Spark SQL, Spark Streaming, GraphX, MLlib (front-ends/libraries) + Mesos (resource manager enabling elastic sharing across Hadoop, Spark, MPI partitions)
Both are open source; many proprietary alternatives exist (e.g., LexisNexis HPCC predates Hadoop).