HDFS (Hadoop Distributed File System)

Summary: A distributed, shared-nothing filesystem designed to run on commodity hardware, providing high-throughput access to large datasets.

Sources: raw/chapter10

Last updated: 2026-04-18


HDFS is an open-source implementation of the Google File System (GFS). It is based on the shared-nothing principle, where nodes communicate over a network rather than sharing physical disks (source: chapter10, p. 398).

Architecture

  • NameNode: A central server that maintains the metadata and keeps track of where file blocks are stored (source: chapter10, p. 398).
  • DataNodes: Daemon processes, one running on each machine in the cluster, each exposing a network service that allows access to the blocks stored on that machine's local disks (source: chapter10, p. 398).
  • Replication: To tolerate machine and disk failures, file blocks are replicated across multiple nodes (source: chapter10, p. 398).
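The division of labor above can be sketched in a few lines. This is an illustrative model, not the real HDFS API: the class and method names (`NameNode`, `add_file`, `locate`), the replication-factor constant, and the round-robin placement are all hypothetical simplifications (real HDFS places replicas rack-aware, and 3 is only the common default).

```python
REPLICATION_FACTOR = 3  # hypothetical; 3 is a common HDFS default

class NameNode:
    """Sketch of the central metadata server: it maps each file to its
    block IDs, and each block ID to the DataNodes holding replicas."""

    def __init__(self):
        self.file_blocks = {}      # file path -> list of block IDs
        self.block_locations = {}  # block ID  -> list of DataNode hosts

    def add_file(self, path, block_ids, datanodes):
        self.file_blocks[path] = block_ids
        for i, block in enumerate(block_ids):
            # Spread replicas round-robin across DataNodes (simplified;
            # real placement also considers racks and free space).
            self.block_locations[block] = [
                datanodes[(i + r) % len(datanodes)]
                for r in range(REPLICATION_FACTOR)
            ]

    def locate(self, path):
        """A client first asks the NameNode where each block lives,
        then reads the block data directly from the DataNodes."""
        return {b: self.block_locations[b] for b in self.file_blocks[path]}

nn = NameNode()
nn.add_file("/logs/app.log", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
locations = nn.locate("/logs/app.log")
```

Note that the NameNode serves only metadata; file contents never flow through it, which is what lets the design scale block reads across many DataNodes.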

Data Locality

A key principle of HDFS and MapReduce is putting the computation near the data. By running the mapper on the same machine that stores a replica of the input file, the system saves network bandwidth and reduces latency (source: chapter10, p. 400).
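The scheduling decision can be sketched as follows. This is a hypothetical simplification, not MapReduce's actual scheduler: the function name and the fallback policy are assumptions, and real schedulers also consider rack-locality as an intermediate preference.

```python
def pick_node_for_mapper(replica_nodes, free_nodes):
    """Prefer running the mapper on a node that already stores a
    replica of the input block (data-local), so the input needs no
    network transfer; otherwise fall back to any free node."""
    for node in replica_nodes:
        if node in free_nodes:
            return node  # data-local assignment
    # No replica holder has a free task slot: run elsewhere and
    # read the block over the network instead.
    return min(free_nodes)

# The input block's replicas live on dn2 and dn3; dn3 has a free slot.
chosen = pick_node_for_mapper(["dn2", "dn3"], {"dn3", "dn5"})
```

If no replica holder is free, the scheduler trades the locality benefit for parallelism rather than leaving the task waiting.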