HDFS (Hadoop Distributed File System)
Summary: A distributed, shared-nothing filesystem designed to run on commodity hardware, providing high-throughput access to large datasets.
Sources: raw/chapter10
Last updated: 2026-04-18
HDFS is an open-source implementation of the Google File System (GFS). It is based on the shared-nothing principle, where nodes communicate over a network rather than sharing physical disks (source: chapter10, p. 398).
Architecture
- NameNode: A central server that maintains the metadata and keeps track of where file blocks are stored (source: chapter10, p. 398).
- DataNodes: Daemon processes, one running on each machine in the cluster, each exposing a network service that allows access to the blocks stored on that machine's local disks.
- Replication: To tolerate machine and disk failures, file blocks are replicated across multiple nodes (source: chapter10, p. 398).
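The division of labor above can be sketched in miniature: a metadata server tracks which nodes hold each block, and replicas are placed on several distinct nodes. This is a minimal, hypothetical model for illustration only; the class and method names (`NameNode`, `add_block`, `locate`) are invented here and are not the real HDFS API.

```python
import random

class NameNode:
    """Toy model of the HDFS NameNode: holds only metadata,
    i.e. a mapping from block IDs to the nodes storing replicas.
    (Illustrative sketch; not the actual HDFS implementation.)"""

    def __init__(self, nodes, replication=3):
        self.nodes = list(nodes)
        self.replication = replication
        self.block_map = {}  # block_id -> list of node names holding a replica

    def add_block(self, block_id):
        # Place each replica on a distinct node to tolerate machine failures.
        replicas = random.sample(self.nodes, self.replication)
        self.block_map[block_id] = replicas
        return replicas

    def locate(self, block_id):
        # Clients ask the NameNode where a block lives,
        # then fetch the data directly from a DataNode.
        return self.block_map[block_id]
```

Note that the actual block data never passes through the metadata server; clients contact the DataNodes directly, which is what makes the shared-nothing design scale.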
Data Locality
A key principle of HDFS and MapReduce is putting the computation near the data. By running the mapper on the same machine that stores a replica of the input file, the system saves network bandwidth and reduces latency (source: chapter10, p. 400).
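The scheduling preference described above can be sketched as a simple selection rule: given the nodes that hold a replica of a block and the nodes with free capacity, prefer a free node that already has the data, and only fall back to shipping the data over the network when none is available. The function below is a hypothetical illustration, not the real MapReduce scheduler.

```python
def pick_mapper_node(block_replicas, free_nodes):
    """Choose a node to run a mapper on.

    Prefer a free node that already stores a replica of the input
    block (data locality saves network bandwidth); otherwise fall
    back to any free node, which requires copying the block over
    the network. Illustrative sketch only.
    """
    for node in block_replicas:
        if node in free_nodes:
            return node  # local read: no network transfer needed
    return next(iter(free_nodes))  # remote read as a last resort
```

For example, if a block's replicas live on nodes "a" and "b" and only "b" and "c" are free, the mapper is scheduled on "b" so it can read its input from local disk.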