LLM Wiki: Designing Data-Intensive Applications

Welcome to the knowledge base for “Designing Data-Intensive Applications.”

Chapters

Foundations of Data Systems

Reliability

Scalability

Maintainability

Data Modeling

  • relational-model: Tabular data organization with strong join support.
  • document-model: Self-contained documents (JSON) with high data locality.
  • graph-models: Optimized for highly interconnected many-to-many data.
  • nosql: A movement toward non-relational, scalable, and flexible datastores.
  • impedance-mismatch: The friction between object-oriented code and relational tables.
  • schema-on-read: Implicit schemas interpreted at read time.
  • normalization: Removing duplication by storing each fact in one place and referring to it by ID (see the sketch after this list).
  • data-locality: Storing related data together on disk for performance.
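
A minimal sketch of the normalization idea, with hypothetical user and region records: the denormalized form duplicates a human-readable string in every row, while the normalized form stores an ID that resolves to a single authoritative entry at read time.

    # Denormalized: the region name is copied into every user record,
    # so renaming the region means rewriting many rows.
    users_denormalized = [
        {"id": 1, "name": "Alice", "region": "Greater Seattle Area"},
        {"id": 2, "name": "Bob", "region": "Greater Seattle Area"},
    ]

    # Normalized: each fact lives in exactly one place; users hold only an ID.
    regions = {100: "Greater Seattle Area"}
    users = [
        {"id": 1, "name": "Alice", "region_id": 100},
        {"id": 2, "name": "Bob", "region_id": 100},
    ]

    def user_with_region(user):
        # Resolve the ID at read time -- the join that the relational model makes cheap.
        return {**user, "region": regions[user["region_id"]]}

    print(user_with_region(users[0]))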

Querying

  • query-languages: Declarative vs. imperative ways to interact with data (contrast sketched after this list).
  • mapreduce: A programming model for bulk data processing.
  • cypher: A declarative query language for property graphs.
  • sparql: A query language for triple-stores (RDF).
  • datalog: A rule-based query language, much older than Cypher or SPARQL, whose approach underpins later graph query languages.
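
A minimal sketch of the declarative/imperative contrast (data and names are illustrative): the imperative version dictates how to traverse the data, while the declarative version states only what is wanted and leaves the execution strategy to the engine.

    animals = [
        {"name": "Great white", "family": "Sharks"},
        {"name": "Dolphin", "family": "Cetaceans"},
        {"name": "Hammerhead", "family": "Sharks"},
    ]

    # Imperative: spell out the loop, the iteration order, and the accumulator.
    def get_sharks(animals):
        sharks = []
        for animal in animals:
            if animal["family"] == "Sharks":
                sharks.append(animal)
        return sharks

    # Declarative in spirit: state the condition, not the traversal --
    # compare SQL: SELECT * FROM animals WHERE family = 'Sharks';
    sharks = [a for a in animals if a["family"] == "Sharks"]

    assert get_sharks(animals) == sharks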

Storage and Retrieval

  • hash-indexes: An in-memory hash map from each key to its byte offset in an append-only log (first sketch after this list).
  • sstables: Sorted string tables for efficient merging and indexing.
  • lsm-trees: Log-structured merge-trees for high write throughput.
  • b-trees: The standard update-in-place indexing structure.
  • compaction: Background space reclamation for log-structured stores.
  • bloom-filters: A probabilistic, memory-efficient test for key existence: false positives are possible, false negatives are not (second sketch after this list).
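
A minimal sketch of a hash index over an append-only log (file handling simplified; the key,value line format is a toy choice): writes append a record and store its byte offset in an in-memory dict, and reads seek straight to the latest offset.

    import os, tempfile

    class LogWithHashIndex:
        """Append-only log file plus an in-memory hash map of key -> byte offset."""

        def __init__(self, path):
            self.f = open(path, "a+b")      # writes always append; reads can seek
            self.index = {}                 # key -> offset of the latest record

        def set(self, key, value):
            self.f.seek(0, os.SEEK_END)
            self.index[key] = self.f.tell()             # where this record starts
            self.f.write(f"{key},{value}\n".encode())   # toy format: no commas in keys
            self.f.flush()

        def get(self, key):
            self.f.seek(self.index[key])    # one seek, one read
            _, value = self.f.readline().rstrip(b"\n").split(b",", 1)
            return value.decode()

    db = LogWithHashIndex(os.path.join(tempfile.mkdtemp(), "log.db"))
    db.set("42", "hello")
    db.set("42", "world")   # the old record is now garbage for compaction to reclaim
    print(db.get("42"))     # -> world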
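
A companion sketch of a Bloom filter (bit count and hashing scheme are arbitrary illustrative choices): k hash functions set k bits per key, so a membership test can return a false positive but never a false negative; this is why LSM-tree stores consult one before searching SSTables for a key that may not exist.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=5):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            # Derive k bit positions from one SHA-256 digest (illustrative scheme).
            digest = hashlib.sha256(key.encode()).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i : 4 * i + 4], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False means definitely absent; True means only possibly present.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("handbag")
    print(bf.might_contain("handbag"))     # True
    print(bf.might_contain("handcuffs"))   # almost certainly False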

Analytics (OLAP)

Encoding and Evolution

  • encoding: Translating in-memory data structures to byte sequences.
  • compatibility: Essential for evolvability in distributed systems.
  • thrift: Facebook's binary encoding with numbered field tags (the tag idea is sketched after this list).
  • protocol-buffers: Google's binary encoding, likewise built on numbered field tags.
  • avro: Binary encoding with writer’s and reader’s schemas.
  • rest: Design philosophy for web services.
  • rpc: Remote procedure call model and its leaky abstractions.
  • message-brokers: Asynchronous communication via intermediaries.
  • actor-model: Logic encapsulated in message-passing actors.
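
A minimal sketch of the field-tag idea shared by Thrift and Protocol Buffers (the wire layout below is invented for illustration and is not either library's actual format): each field is encoded as a (tag, length, payload) triple, so a reader built against an older schema can skip tags it does not recognize, which is the heart of forward compatibility.

    import struct

    def encode_record(fields):
        """fields: {tag_number: bytes}. Emit (tag, length, payload) triples."""
        out = b""
        for tag, payload in fields.items():
            out += struct.pack(">BI", tag, len(payload)) + payload
        return out

    def decode_record(data, known_tags):
        """Decode known tags; silently skip unknown ones (forward compatibility)."""
        fields, i = {}, 0
        while i < len(data):
            tag, length = struct.unpack_from(">BI", data, i)
            i += 5
            if tag in known_tags:
                fields[known_tags[tag]] = data[i : i + length]
            i += length
        return fields

    # The writer's schema has a new field 3; an old reader knows only tags 1 and 2.
    blob = encode_record({1: b"Martin", 2: b"martin@example.com", 3: b"new field"})
    print(decode_record(blob, {1: "user_name", 2: "email"}))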

Distributed Data

Replication

Consistency Guarantees

Conflict Resolution

Partitioning

Transactions

The Trouble with Distributed Systems

  • partial-failures: Non-deterministic failures in which some parts of the system are broken while others work fine.
  • unreliable-networks: Communication channels that can lose, delay, or reorder messages.
  • unreliable-clocks: Local clocks that can drift or jump, making them unreliable for ordering.
  • process-pauses: Temporary interruptions in program execution (e.g., GC pauses).
  • fencing-tokens: Monotonically increasing tokens that let storage reject writes from nodes whose leases have expired (see the sketch after this list).
  • byzantine-faults: Nodes that lie or act maliciously (Byzantine Generals Problem).
  • system-models: Formal assumptions about timing (synchronous/asynchronous) and failures.
  • safety-and-liveness: Formal properties for reasoning about distributed algorithm correctness.
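
A minimal sketch of fencing tokens (class names are illustrative; no real lock service shown): the lock service issues a monotonically increasing token with each lease, and storage rejects any write carrying a token older than the newest one it has seen, so a client that paused past its lease expiry cannot corrupt data.

    class LockService:
        """Issues leases tagged with monotonically increasing fencing tokens."""
        def __init__(self):
            self.token = 0

        def acquire(self):
            self.token += 1
            return self.token

    class Storage:
        """Rejects writes whose token is older than the newest token seen."""
        def __init__(self):
            self.max_token = 0
            self.value = None

        def write(self, token, value):
            if token < self.max_token:
                raise PermissionError(f"stale fencing token {token}")
            self.max_token = token
            self.value = value

    lock, storage = LockService(), Storage()
    t1 = lock.acquire()                  # client 1 gets token 1, then pauses (say, GC)
    t2 = lock.acquire()                  # lease expires; client 2 gets token 2
    storage.write(t2, "client 2 data")
    try:
        storage.write(t1, "client 1 data")   # the paused client's write is fenced off
    except PermissionError as e:
        print(e)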

Consistency and Consensus

Distributed Consistency

  • linearizability: Making a distributed system appear as if there is only one copy of data.
  • cap-theorem: The trade-off between consistency and availability during a network partition.
  • causality: Ordering events based on their “happened-before” relationship.
  • lamport-timestamps: Generating a total order of events consistent with causality (see the sketch after this list).
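
A minimal sketch of Lamport timestamps (message passing reduced to function calls): every node keeps a counter, stamps events with (counter, node_id), and on receipt sets its counter to the maximum of its own and the incoming one; comparing the pairs lexicographically yields a total order consistent with causality.

    class LamportNode:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counter = 0

        def local_event(self):
            self.counter += 1
            return (self.counter, self.node_id)    # timestamps compare as tuples

        def receive(self, timestamp):
            counter, _ = timestamp
            self.counter = max(self.counter, counter) + 1  # adopt and advance
            return (self.counter, self.node_id)

    a, b = LamportNode(1), LamportNode(2)
    t1 = a.local_event()   # (1, 1): a sends this with a message
    t2 = b.receive(t1)     # (2, 2): causally after t1, and t1 < t2 as tuples
    t3 = a.local_event()   # (2, 1): concurrent with t2, yet still totally ordered
    print(sorted([t3, t2, t1]))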

Consensus and Coordination

  • consensus: Getting multiple nodes to agree on a value despite failures.
  • total-order-broadcast: Protocol for delivering the same messages in the same order to all nodes.
  • atomic-commit: Ensuring all nodes in a transaction either commit or abort.
  • two-phase-commit: The classic algorithm for distributed atomic commit (see the sketch after this list).
  • zookeeper: Coordination and configuration service for distributed systems.
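
A minimal sketch of two-phase commit (in-memory objects stand in for nodes, and crash recovery is omitted): in phase 1 the coordinator asks every participant to prepare and collects yes/no votes; only a unanimous yes leads to a phase 2 commit, and any no aborts everywhere.

    class Participant:
        def __init__(self, name, can_commit=True):
            self.name = name
            self.can_commit = can_commit
            self.state = "init"

        def prepare(self):
            # Phase 1: durably promise to commit if later told to (vote yes/no).
            self.state = "prepared" if self.can_commit else "aborted"
            return self.can_commit

        def commit(self):
            self.state = "committed"

        def abort(self):
            self.state = "aborted"

    def two_phase_commit(participants):
        votes = [p.prepare() for p in participants]   # phase 1: collect every vote
        if all(votes):
            for p in participants:                    # phase 2: unanimous yes
                p.commit()
            return "committed"
        for p in participants:                        # any no: abort everywhere
            p.abort()
        return "aborted"

    nodes = [Participant("db1"), Participant("db2", can_commit=False)]
    print(two_phase_commit(nodes))                 # -> aborted
    print([(p.name, p.state) for p in nodes])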

Derived Data

Batch Processing

  • batch-processing: Offline processing of large, bounded datasets.
  • unix-philosophy: Composable tools and uniform interfaces.
  • hdfs: Distributed filesystem for large-scale data storage.
  • mapreduce: The foundational distributed batch processing model (see the sketch after this list).
  • dataflow-engines: Evolution of batch processing using DAGs (Spark, Flink).
  • joins: Distributed join strategies (map-side vs. reduce-side).
  • reduce-side-joins: General-purpose sort-merge joins.
  • map-side-joins: Join optimizations that avoid the shuffle when one input is small or both are identically partitioned.
  • materialization: Writing intermediate state to disk for fault tolerance.
  • pregel: Iterative graph processing model.
  • throughput: Primary performance metric for batch systems.
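
A minimal in-memory sketch of the MapReduce model (one process; real frameworks distribute the same three phases across machines): a mapper emits key-value pairs, the shuffle brings equal keys together by sorting, and a reducer folds each group. A reduce-side sort-merge join has the same shape, with the join key in place of the word.

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        # Map: emit (word, 1) for every word in an input record.
        for word in line.lower().split():
            yield (word, 1)

    def reducer(word, counts):
        # Reduce: fold all values that share a key.
        return (word, sum(counts))

    def mapreduce(lines):
        pairs = [kv for line in lines for kv in mapper(line)]
        pairs.sort(key=itemgetter(0))        # shuffle: sorting groups equal keys
        return [reducer(key, (v for _, v in group))
                for key, group in groupby(pairs, key=itemgetter(0))]

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(mapreduce(docs))   # [('brown', 1), ('dog', 1), ('fox', 2), ...]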

Stream Processing

The Future of Data Systems