HDFS (Hadoop Distributed File System)

Summary: A distributed, shared-nothing filesystem designed to run on commodity hardware, providing high-throughput access to large datasets.

Sources: raw/chapter10

Last updated: 2026-04-18


HDFS is an open-source implementation of the Google File System (GFS). It is based on the shared-nothing principle, where nodes communicate over a network rather than sharing physical disks (source: chapter10, p. 398).

Architecture

  • NameNode: A central server that maintains the metadata and keeps track of where file blocks are stored (source: chapter10, p. 398).
  • DataNodes: Daemon processes, one running on each machine in the cluster, each exposing a network service that allows access to the blocks stored on that machine's local disks (source: chapter10, p. 398).
  • Replication: To tolerate machine and disk failures, file blocks are replicated across multiple nodes (source: chapter10, p. 398).
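The division of labor above can be sketched in a few lines. This is an illustrative model, not the real HDFS API: the class and method names (`NameNode`, `add_file`, `locate`), the replication-factor constant, and the round-robin placement are all hypothetical simplifications (real HDFS places replicas rack-aware, and 3 is only the common default).

```python
REPLICATION_FACTOR = 3  # hypothetical; 3 is a common HDFS default

class NameNode:
    """Sketch of the central metadata server: it maps each file to its
    block IDs, and each block ID to the DataNodes holding replicas."""

    def __init__(self):
        self.file_blocks = {}      # file path -> list of block IDs
        self.block_locations = {}  # block ID  -> list of DataNode hosts

    def add_file(self, path, block_ids, datanodes):
        self.file_blocks[path] = block_ids
        for i, block in enumerate(block_ids):
            # Spread replicas round-robin across DataNodes (simplified;
            # real placement also considers racks and free space).
            self.block_locations[block] = [
                datanodes[(i + r) % len(datanodes)]
                for r in range(REPLICATION_FACTOR)
            ]

    def locate(self, path):
        """A client first asks the NameNode where each block lives,
        then reads the block data directly from the DataNodes."""
        return {b: self.block_locations[b] for b in self.file_blocks[path]}

nn = NameNode()
nn.add_file("/logs/app.log", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
locations = nn.locate("/logs/app.log")
```

Note that the NameNode serves only metadata; file contents never flow through it, which is what lets the design scale block reads across many DataNodes.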

Data Locality

A key principle of HDFS and MapReduce is putting the computation near the data. By running the mapper on the same machine that stores a replica of the input file, the system saves network bandwidth and reduces latency (source: chapter10, p. 400).
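The scheduling decision can be sketched as follows. This is a hypothetical simplification, not MapReduce's actual scheduler: the function name and the fallback policy are assumptions, and real schedulers also consider rack-locality as an intermediate preference.

```python
def pick_node_for_mapper(replica_nodes, free_nodes):
    """Prefer running the mapper on a node that already stores a
    replica of the input block (data-local), so the input needs no
    network transfer; otherwise fall back to any free node."""
    for node in replica_nodes:
        if node in free_nodes:
            return node  # data-local assignment
    # No replica holder has a free task slot: run elsewhere and
    # read the block over the network instead.
    return min(free_nodes)

# The input block's replicas live on dn2 and dn3; dn3 has a free slot.
chosen = pick_node_for_mapper(["dn2", "dn3"], {"dn3", "dn5"})
```

If no replica holder is free, the scheduler trades the locality benefit for parallelism rather than leaving the task waiting.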