Datacenter Technologies¶

Internet Services¶

An internet service is any service accessible via a web interface (weather, email, banking, search). End users send web requests via browsers and receive responses.

Multi-Tier Architecture¶

Most internet services decompose into three tiers:

Presentation — interfaces with end users; serves static content (web page layout)
Business logic — processes dynamic, user-specific content
Database — data storage and management

These tiers may run as separate processes on separate machines, or be combined (e.g., Apache httpd serving static content + PHP modules generating dynamic content in one process). Middleware provides cross-cutting services: messaging, configuration management, security, accounting, persistence, and recovery.

Inter-process communication between tiers uses RPC/RMI and shared-memory optimizations when processes co-locate on the same machine.

Internet Service Architectures¶

Scale-Out¶

For high or variable request rates, services deploy multiple processes across multiple machines — scale-out architecture. A front-end load balancer routes incoming requests to available backend nodes (analogous to the boss-worker threading pattern).

Functionally Homogeneous¶

Every node can process any request type and execute any processing step.

Pros: simple front-end (round-robin or basic CPU-load balancing); easy to scale — just add more identical nodes
Cons: limited caching benefit — front-end doesn’t exploit data locality; requests for the same data may land on different nodes

Functionally Heterogeneous¶

Nodes are specialized for certain request types or processing steps (e.g., eBay: browsing servers vs bidding servers; partitioning by file name range).

Pros: specialization improves caching and locality
Cons: complex front-end (must parse requests to route); harder management (must profile workload to decide what to add); workload shifts cause hotspots on specialized nodes while others idle

Scale-Out Limitations¶

Even homogeneous scale-out has limits:

Management complexity grows — a single manager/controller cannot handle infinite resources
Physical capacity — datacenter space, power
Software stack constraints — if you must run your own stack, each machine requires setup and management

These limitations motivated the emergence of cloud computing.

Cloud Computing¶

The Animoto Example¶

Animoto (compute-intensive image→video service) ran on ~50 Amazon EC2 instances. After launching on Facebook in April 2008:

750,000 new users in 3 days
Resources grew from 50 → 400 (Tuesday) → 3,400 instances (Friday)
Two orders of magnitude scale-up in under a week — impossible with self-owned infrastructure

Requirements¶

The ideal: capacity scales elastically and instantly with demand; cost proportional to usage.

Cloud computing distills to:

On-demand, elastic resources and services
Fine-grained, usage-based pricing (not per-physical-server)
Professionally managed and hosted
Accessible via APIs for remote provisioning and control

What Cloud Computing Provides¶

Shared resource pool:

Infrastructure (IaaS) — virtual machines, storage, networking (e.g., Amazon EC2)
Software services (SaaS) — email, databases, processing runtimes (e.g., Gmail)
Platform (PaaS) — development/execution environments with OS, libraries, tools (e.g., Google App Engine)

APIs: web-based, language libraries (Java, Python), CLIs.

Billing: typically discrete step-function pricing (tiny/medium/large/XL instances), not true per-cycle usage.

Management stacks: OpenStack (open source), VMware vSphere.

Why It Works¶

Law of Large Numbers — average resource demand across many customers is roughly constant, even as individual peaks shift in time
Economies of Scale — hardware costs amortized across many tenants

Cloud Computing Vision¶

John McCarthy (1961): “Computing may some day be organized as a public utility, just as the telephone system is a public utility.”

Cloud computing aims to make IT resources a fungible utility via virtualization. Remaining limitations:

Hardware dependencies that virtualization cannot fully mask
API lock-in across providers (no universal standard)
Privacy and security concerns (data leaves your premises)
Latency from geographic distance

Cloud Deployment Models¶

Defined by NIST (2011):

Public — infrastructure owned by provider, open to third-party tenants (e.g., AWS)
Private — infrastructure and tenants owned by the same organization; cloud technology used internally for elastic provisioning
Hybrid — private cloud + public cloud for failover, spike handling, or testing
Community — public cloud restricted to a specific user community

Cloud Service Models¶

Model	What the provider manages	Example
IaaS	Hardware + virtualization	Amazon EC2
PaaS	OS, middleware, runtime	Google App Engine
SaaS	application, data	Gmail

On-premise: you manage everything. Moving up the stack, the provider manages more layers.

IaaS resources are virtualized and typically shared with other tenants. Exception: Amazon high-performance and GPU instances run on dedicated physical hardware due to performance virtualization challenges.

Cloud Technology Requirements¶

Fungible resources — easily repurposed across customers and hardware generations via virtualization
Elastic resource management — dynamic allocation/deallocation at scale (thousands of nodes); platforms: Mesos, YARN
Failure handling — with N components each having failure probability p, system failure probability = 1 − (1−p)^N. At scale, failures are inevitable (p=0.03, N=10 → 26% chance; N=100 → 95%). Software must incorporate timeouts, retries, checkpointing, and graceful recovery.
Multi-tenancy — performance isolation so one tenant cannot monopolize resources
Security and privacy — isolation of tenant data; protection of provider infrastructure from tenant vulnerabilities

Cloud Enabling Technologies¶

Virtualization — fungible, dynamically repurposable resources
Resource scheduling — Mesos, YARN for infrastructure provisioning
Big data processing — Hadoop MapReduce, Spark
Distributed storage — append-only distributed file systems, NoSQL databases, distributed in-memory stores
Software-defined infrastructure — software-defined networking, storage slicing for tenant isolation
Monitoring — real-time log processing and anomaly detection (Flume, CloudWatch, LogInsight)

Cloud as a Big Data Engine¶

Cloud empowers access to virtually infinite compute and storage resources for data-intensive applications.

A cloud big data platform typically includes:

Distributed storage — stores and replicates data across many nodes (e.g., HDFS)
Data processing framework — parallel computation over stored data (e.g., MapReduce, Spark); scheduler co-locates computation with data to minimize transfers
In-memory caching — avoids repeated disk access (e.g., Tachyon/Alluxio)
Query language front-end — SQL-like interfaces over distributed data (e.g., Hive, Spark SQL)
Analytics libraries — machine learning, graph processing (e.g., Mahout, MLlib, GraphX)
Streaming — continuous data ingestion and real-time output (e.g., Kafka, Spark Streaming)

Example Big Data Stacks¶

Hadoop stack:

HDFS (storage) → MapReduce/YARN (processing) → HBase (table-structured access) → Hive (SQL front-end) → Mahout (ML), R, Pig (analytics) + ZooKeeper (coordination), Kafka (streaming)

Berkeley Data Analytics Stack (BDAS):

Storage → Tachyon (in-memory FS) → Spark (processing) → Spark SQL, Spark Streaming, GraphX, MLlib (front-ends/libraries) + Mesos (resource manager enabling elastic sharing across Hadoop, Spark, MPI partitions)

Both are open source; many proprietary alternatives exist (e.g., LexisNexis HPCC predates Hadoop).