Datacenter Technologies

Internet Services

An internet service is any service accessible via a web interface (weather, email, banking, search). End users send web requests via browsers and receive responses.

Multi-Tier Architecture

Most internet services decompose into three tiers:

  • Presentation — interfaces with end users; serves static content (web page layout)

  • Business logic — processes dynamic, user-specific content

  • Database — data storage and management

These tiers may run as separate processes on separate machines, or be combined (e.g., Apache httpd serving static content + PHP modules generating dynamic content in one process). Middleware provides cross-cutting services: messaging, configuration management, security, accounting, persistence, and recovery.

Tiers communicate via RPC/RMI, with shared-memory optimizations when the communicating processes are co-located on the same machine.

Internet Service Architectures

Scale-Out

For high or variable request rates, services deploy multiple processes across multiple machines — scale-out architecture. A front-end load balancer routes incoming requests to available backend nodes (analogous to the boss-worker threading pattern).

Functionally Homogeneous

Every node can process any request type and execute any processing step.

  • Pros: simple front-end (round-robin or basic CPU-load balancing); easy to scale — just add more identical nodes

  • Cons: limited caching benefit — front-end doesn’t exploit data locality; requests for the same data may land on different nodes
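
The trade-off above can be seen in a minimal sketch of a round-robin front-end. The class name, node names, and request strings are illustrative assumptions, not part of any real load balancer's API.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Front-end for functionally homogeneous nodes: any node can serve any request."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one backend node is required")
        self._next_node = cycle(list(nodes))

    def route(self, request):
        # The request content is ignored; routing depends only on arrival order.
        return next(self._next_node)


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])

# Four requests for the same item land on different nodes,
# so no single node's cache sees all of them.
assignments = [balancer.route("GET /item/42") for _ in range(4)]
print(assignments)  # ['node-a', 'node-b', 'node-c', 'node-a']
```

Scaling out in this sketch only means passing a longer node list, but the front-end never exploits which node already has the data cached.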

Functionally Heterogeneous

Nodes are specialized for certain request types or processing steps (e.g., eBay: browsing servers vs bidding servers; partitioning by file name range).

  • Pros: specialization improves caching and locality

  • Cons: complex front-end (must parse requests to route); harder management (must profile workload to decide what to add); workload shifts cause hotspots on specialized nodes while others idle
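
As a rough illustration of the points above, the following sketch routes requests by file name range. The range boundaries, node names, and file names are hypothetical; a real front-end would derive its partitioning from workload profiling.

```python
import bisect
from collections import Counter

# Assumed partitioning by first letter of the file name (inclusive upper bounds).
UPPER_BOUNDS = ["h", "p", "z"]
NODES = ["node-a-h", "node-i-p", "node-q-z"]


def route(file_name):
    """Parse the request (here: the first letter) and pick the specialized node."""
    key = file_name[:1].lower()
    index = bisect.bisect_left(UPPER_BOUNDS, key)
    # Anything beyond the last bound falls back to the last node.
    return NODES[min(index, len(NODES) - 1)]


balanced = ["apple.jpg", "melon.jpg", "tulip.jpg"]
skewed = ["alpha.jpg", "beta.jpg", "cat.jpg", "dog.jpg"]

print(Counter(route(name) for name in balanced))  # one request per node
print(Counter(route(name) for name in skewed))    # all four hit node-a-h
```

Each node sees only its own key range, which helps caching, but the skewed workload shows how a shift in demand turns one specialized node into a hotspot while the others sit idle.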

Scale-Out Limitations

Even homogeneous scale-out has limits:

  • Management complexity grows — a single manager/controller cannot handle infinite resources

  • Physical capacity — datacenter space, power

  • Software stack constraints — if you must run your own stack, each machine requires setup and management

These limitations motivated the emergence of cloud computing.

Cloud Computing

The Animoto Example

Animoto (compute-intensive image→video service) ran on ~50 Amazon EC2 instances. After launching on Facebook in April 2008:

  • 750,000 new users in 3 days

  • Resources grew from 50 → 400 (Tuesday) → 3,400 instances (Friday)

  • Nearly two orders of magnitude of scale-up (about 68×) in under a week, which would have been impossible with self-owned infrastructure

Requirements

The ideal: capacity scales elastically and instantly with demand; cost proportional to usage.

Cloud computing distills to:

  • On-demand, elastic resources and services

  • Fine-grained, usage-based pricing (not per-physical-server)

  • Professionally managed and hosted

  • Accessible via APIs for remote provisioning and control
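
To make the elasticity and usage-based cost ideas concrete, here is a small sketch comparing elastic provisioning with provisioning for the peak. The load trace, the per-instance capacity, and the sizing rule are made-up assumptions for illustration, not measurements.

```python
import math


def instances_needed(load, capacity_per_instance, min_instances=1):
    """Smallest instance count that covers the load (hypothetical sizing rule)."""
    return max(min_instances, math.ceil(load / capacity_per_instance))


# Made-up load trace in requests/second, one sample per billing period.
load_trace = [100, 800, 6800, 900]
CAPACITY = 2  # assumed requests/second that one instance can serve

elastic = [instances_needed(load, CAPACITY) for load in load_trace]
elastic_cost = sum(elastic)                   # instance-periods actually used
static_cost = max(elastic) * len(load_trace)  # paying for peak capacity in every period

print(elastic)       # [50, 400, 3400, 450]
print(elastic_cost)  # 4300
print(static_cost)   # 13600
```

Under these assumptions, paying only for what is used costs about a third of owning enough capacity for the peak.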

What Cloud Computing Provides

Shared resource pool:

  • Infrastructure (IaaS) — virtual machines, storage, networking (e.g., Amazon EC2)

  • Software services (SaaS) — email, databases, processing runtimes (e.g., Gmail)

  • Platform (PaaS) — development/execution environments with OS, libraries, tools (e.g., Google App Engine)

APIs: web-based, language libraries (Java, Python), CLIs.

Billing: typically discrete step-function pricing (tiny/medium/large/XL instances), not true per-cycle usage.
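
A short sketch of what step-function pricing means in practice. The size names follow the text; the vCPU counts and prices are invented for illustration and are not any provider's real price list.

```python
# Hypothetical instance sizes: (name, vCPUs, price in cents per hour), smallest first.
SIZES = [("tiny", 1, 2), ("medium", 2, 4), ("large", 4, 8), ("xl", 8, 16)]


def smallest_fitting_size(vcpus_needed):
    """Return the cheapest size that covers the request; you pay for the whole step."""
    for name, vcpus, cents_per_hour in SIZES:
        if vcpus >= vcpus_needed:
            return name, vcpus, cents_per_hour
    raise ValueError("no single instance size is large enough")


print(smallest_fitting_size(3))  # ('large', 4, 8): billed for 4 vCPUs while needing 3
print(smallest_fitting_size(5))  # ('xl', 8, 16): billed for 8 vCPUs while needing 5
```

The gap between what is needed and what is billed is the difference between step-function pricing and true per-cycle usage pricing.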

Management stacks: OpenStack (open source), VMware vSphere.

Why It Works

  • Law of Large Numbers — average resource demand across many customers is roughly constant, even as individual peaks shift in time

  • Economies of Scale — hardware costs amortized across many tenants
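
The first point can be illustrated with an idealized, deterministic sketch: each tenant has one busy hour per day, and the busy hours are spread evenly across tenants. The demand numbers are assumptions chosen to make the effect easy to see; real demand is random and the smoothing is only approximate.

```python
HOURS = 24


def tenant_demand(peak_hour, peak=10, base=1):
    """Hourly demand for one tenant: `base` units all day, `peak` units in one hour."""
    return [peak if hour == peak_hour else base for hour in range(HOURS)]


def peak_to_mean(demand):
    """How much capacity the peak requires relative to average demand."""
    return max(demand) / (sum(demand) / len(demand))


# 24 tenants whose peaks fall in 24 different hours.
tenants = [tenant_demand(peak_hour=hour) for hour in range(HOURS)]
aggregate = [sum(tenant[hour] for tenant in tenants) for hour in range(HOURS)]

print(round(peak_to_mean(tenants[0]), 2))  # 7.27: one tenant must provision ~7x its average
print(peak_to_mean(aggregate))             # 1.0: the pooled demand is flat
```

A provider serving all tenants from one pool can size for the flat aggregate, while each tenant alone would have to size for its own peak.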

Cloud Computing Vision

John McCarthy (1961): “Computing may some day be organized as a public utility, just as the telephone system is a public utility.”

Cloud computing aims to make IT resources a fungible utility via virtualization. Remaining limitations:

  • Hardware dependencies that virtualization cannot fully mask

  • API lock-in across providers (no universal standard)

  • Privacy and security concerns (data leaves your premises)

  • Latency from geographic distance

Cloud Deployment Models

Defined by NIST (2011):

  • Public — infrastructure owned by provider, open to third-party tenants (e.g., AWS)

  • Private — infrastructure and tenants owned by the same organization; cloud technology used internally for elastic provisioning

  • Hybrid — private cloud + public cloud for failover, spike handling, or testing

  • Community — public cloud restricted to a specific user community

Cloud Service Models

Model | What the provider manages | Example
----- | ------------------------- | -------
IaaS | Hardware + virtualization | Amazon EC2
PaaS | IaaS layers + OS, middleware, runtime | Google App Engine
SaaS | PaaS layers + application, data | Gmail

On-premise: you manage everything. Moving up the stack, the provider manages more layers.

IaaS resources are virtualized and typically shared with other tenants. Exception: Amazon's high-performance and GPU instances run on dedicated physical hardware, because virtualizing those resources without a performance penalty is difficult.

Cloud Technology Requirements

  • Fungible resources — easily repurposed across customers and hardware generations via virtualization

  • Elastic resource management — dynamic allocation/deallocation at scale (thousands of nodes); platforms: Mesos, YARN

  • Failure handling — with N components each having failure probability p, system failure probability = 1 − (1−p)^N. At scale, failures are inevitable (p=0.03, N=10 → 26% chance; N=100 → 95%). Software must incorporate timeouts, retries, checkpointing, and graceful recovery.

  • Multi-tenancy — performance isolation so one tenant cannot monopolize resources

  • Security and privacy — isolation of tenant data; protection of provider infrastructure from tenant vulnerabilities
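
The failure-handling arithmetic above can be checked with a few lines of code. The function name is illustrative; the formula and the values p = 0.03, N = 10 and N = 100 come from the text, and the formula assumes component failures are independent.

```python
def system_failure_probability(p, n):
    """Probability that at least one of n independent components fails,
    when each fails with probability p: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n


for n in (10, 100):
    print(n, round(system_failure_probability(0.03, n), 2))
# 10 0.26
# 100 0.95
```

Even a modest per-component failure probability makes some failure nearly certain at a hundred components, which is why recovery logic has to be part of the software rather than an afterthought.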

Cloud Enabling Technologies

  • Virtualization — fungible, dynamically repurposable resources

  • Resource scheduling — Mesos, YARN for infrastructure provisioning

  • Big data processing — Hadoop MapReduce, Spark

  • Distributed storage — append-only distributed file systems, NoSQL databases, distributed in-memory stores

  • Software-defined infrastructure — software-defined networking, storage slicing for tenant isolation

  • Monitoring — real-time log processing and anomaly detection (Flume, CloudWatch, LogInsight)

Cloud as a Big Data Engine

The cloud gives data-intensive applications access to virtually unlimited compute and storage resources.

A cloud big data platform typically includes:

  1. Distributed storage — stores and replicates data across many nodes (e.g., HDFS)

  2. Data processing framework — parallel computation over stored data (e.g., MapReduce, Spark); scheduler co-locates computation with data to minimize transfers

  3. In-memory caching — avoids repeated disk access (e.g., Tachyon/Alluxio)

  4. Query language front-end — SQL-like interfaces over distributed data (e.g., Hive, Spark SQL)

  5. Analytics libraries — machine learning, graph processing (e.g., Mahout, MLlib, GraphX)

  6. Streaming — continuous data ingestion and real-time output (e.g., Kafka, Spark Streaming)
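
To give a feel for the data processing framework in item 2, here is a single-process sketch of the MapReduce programming model applied to word counting. It only mimics the map, shuffle, and reduce phases in plain Python; it is not Hadoop's or Spark's API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real framework each document (or block) would be mapped on the node storing it.
    pairs = [pair for document in documents for pair in map_phase(document)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["the cloud", "The data cloud"]))
# {'the': 2, 'cloud': 2, 'data': 1}
```

Because each map call touches only its own document and each reduce call only its own key, a framework can run them in parallel on the nodes that already hold the data.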

Example Big Data Stacks

Hadoop stack:

  • HDFS (storage) → MapReduce/YARN (processing) → HBase (table-structured access) → Hive (SQL front-end) → Mahout (ML), R, Pig (analytics) + ZooKeeper (coordination), Kafka (streaming)

Berkeley Data Analytics Stack (BDAS):

  • Storage → Tachyon (in-memory FS) → Spark (processing) → Spark SQL, Spark Streaming, GraphX, MLlib (front-ends/libraries) + Mesos (resource manager enabling elastic sharing across Hadoop, Spark, MPI partitions)

Both are open source; many proprietary alternatives exist (e.g., LexisNexis HPCC predates Hadoop).