LLM Wiki: Designing Data-Intensive Applications
Welcome to the knowledge base for “Designing Data-Intensive Applications.”
Chapters
- chapter1: Reliable, Scalable, and Maintainable Applications.
- chapter2: Data Models and Query Languages.
- chapter3: Storage and Retrieval.
- chapter4: Encoding and Evolution.
- chapter5: Replication.
- chapter6: Partitioning.
- chapter7: Transactions.
- chapter8: The Trouble with Distributed Systems.
- chapter9: Consistency and Consensus.
- chapter10: Batch Processing.
- chapter11: Stream Processing.
- chapter12: The Future of Data Systems.
Foundations of Data Systems
Reliability
- reliability: Systems should work correctly even in the face of adversity.
- fault-tolerance: Anticipating and coping with faults.
- write-ahead-log: Crash recovery for update-in-place structures.
Scalability
- scalability: Coping with increased load.
- load-parameters: Numbers describing the current load (e.g., requests per second, read/write ratio).
- latency-and-response-time: Measuring performance; latency is time spent waiting for service, while response time is what the client actually observes.
- percentiles: Understanding the response time distribution via p50/p95/p99 (see the sketch after this list).
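A minimal sketch of the percentile computation, assuming a small in-memory list of response times (the function name and sample values are illustrative only):

```python
def percentile(response_times_ms, p):
    """Return the p-th percentile (0-100) of a list of response times."""
    ordered = sorted(response_times_ms)
    # Index of the value below which roughly p% of observations fall.
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

samples = [30, 32, 35, 40, 42, 50, 55, 120, 450, 1500]  # milliseconds
print(percentile(samples, 50))   # median (p50): the typical request
print(percentile(samples, 95))   # p95: 1 in 20 requests is slower than this
print(percentile(samples, 99))   # p99: tail latency that hits heavy users
```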
Maintainability
- maintainability: Working productively over time.
- operability: Making systems easy to run.
- simplicity: Managing complexity through abstraction.
- evolvability: Making change easy.
Data Modeling
- relational-model: Tabular data organization with strong join support.
- document-model: Self-contained documents (JSON) with high data locality.
- graph-models: Optimized for highly interconnected many-to-many data.
- nosql: A movement toward non-relational, scalable, and flexible datastores.
- impedance-mismatch: The friction between object-oriented code and relational tables.
- schema-on-read: Implicit schemas interpreted at read time.
- normalization: Removing duplication by storing data in one place and using IDs.
- data-locality: Storing related data together on disk for performance.
Querying
- query-languages: Declarative vs. imperative ways to interact with data (see the sketch after this list).
- mapreduce: A programming model for bulk data processing.
- cypher: A declarative query language for property graphs.
- sparql: A query language for triple-stores (RDF).
- datalog: A foundational, rule-based query language for graphs.
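To make the declarative/imperative contrast concrete outside any particular query language, here is an illustrative Python sketch (loosely modeled on the book's sharks example; the data and function names are hypothetical):

```python
# Imperative: spell out *how* to find the result, step by step.
def sharks_imperative(animals):
    results = []
    for animal in animals:
        if animal["family"] == "Sharks":
            results.append(animal["name"])
    return results

# Declarative: state *what* you want; the engine picks the execution plan.
# SQL equivalent: SELECT name FROM animals WHERE family = 'Sharks'
def sharks_declarative(animals):
    return [a["name"] for a in animals if a["family"] == "Sharks"]

animals = [{"name": "Great White", "family": "Sharks"},
           {"name": "Dolphin", "family": "Mammals"}]
print(sharks_imperative(animals) == sharks_declarative(animals))  # True
```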
Storage and Retrieval
- hash-indexes: Simple in-memory hash map from keys to byte offsets in a log file (see the sketch after this list).
- sstables: Sorted string tables for efficient merging and indexing.
- lsm-trees: Log-structured merge-trees for high write throughput.
- b-trees: The standard update-in-place indexing structure.
- compaction: Background space reclamation for log-structured stores.
- bloom-filters: Memory-efficient probabilistic structure for testing whether a key might exist (false positives possible, false negatives not).
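A minimal Bitcask-style sketch of a hash index over an append-only log, assuming keys contain no commas and values contain no newlines (class and file names are illustrative; real stores add compaction, checksums, and index rebuilding on startup):

```python
class LogStore:
    """Append-only log file with an in-memory hash index."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # key -> byte offset of the latest value
        open(path, "ab").close()  # create the log file if missing

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # the newest write wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)        # one disk seek, thanks to the index
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

store = LogStore("data.log")
store.set("42", "San Francisco")
print(store.get("42"))
```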
Analytics (OLAP)
- oltp: Online transaction processing for user-facing applications.
- olap: Online analytic processing for business intelligence.
- data-warehousing: Centralized storage for analytics.
- column-oriented-storage: Efficient storage for scanning millions of records (see the sketch after this list).
- materialized-views: Cached query results for analytics.
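An illustrative contrast of row- versus column-oriented layouts, using plain Python lists in place of on-disk files:

```python
# Row-oriented: each record's values are stored together.
rows = [
    {"date": "2024-01-01", "product": "apple",  "quantity": 3},
    {"date": "2024-01-01", "product": "banana", "quantity": 5},
    {"date": "2024-01-02", "product": "apple",  "quantity": 2},
]

# Column-oriented: each column's values are stored together, so an
# aggregate over one column never has to read the other columns.
columns = {
    "date":     ["2024-01-01", "2024-01-01", "2024-01-02"],
    "product":  ["apple", "banana", "apple"],
    "quantity": [3, 5, 2],
}

# SELECT SUM(quantity) FROM sales -- touches one column, not whole rows.
print(sum(columns["quantity"]))
```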
Encoding and Evolution
- encoding: Translating in-memory data structures to byte sequences.
- compatibility: Essential for evolvability in distributed systems.
- thrift: Binary encoding with numeric field tags (see the sketch after this list).
- protocol-buffers: Popular binary format by Google.
- avro: Binary encoding with writer’s and reader’s schemas.
- rest: Design philosophy for web services.
- rpc: Remote procedure call model and its leaky abstractions.
- message-brokers: Asynchronous communication via intermediaries.
- actor-model: Logic encapsulated in message-passing actors.
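A hedged sketch of field-tagged binary encoding in the spirit of Thrift and Protocol Buffers, not either library's actual wire format (the tag numbers and helper names are invented for illustration):

```python
import struct

# Each field is encoded as (tag, length, bytes). Readers skip tags they
# don't know, so new fields can be added without breaking old readers.
FIELD_TAGS = {"user_name": 1, "favorite_number": 2}  # illustrative schema

def encode_record(record):
    out = b""
    for name, value in record.items():
        payload = str(value).encode("utf-8")
        out += struct.pack(">BI", FIELD_TAGS[name], len(payload)) + payload
    return out

def decode_record(data, known_tags):
    fields, i = {}, 0
    while i < len(data):
        tag, length = struct.unpack_from(">BI", data, i)
        i += 5                                 # header is 1 + 4 bytes
        if tag in known_tags:                  # unknown tags are skipped:
            fields[known_tags[tag]] = data[i:i + length].decode("utf-8")
        i += length                            # forward compatibility
    return fields

blob = encode_record({"user_name": "Martin", "favorite_number": 1337})
print(decode_record(blob, {1: "user_name", 2: "favorite_number"}))
```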
Distributed Data
Replication
- replication: Keeping a copy of the same data on several different nodes.
- single-leader-replication: One node handles all writes, others follow.
- multi-leader-replication: More than one node can accept writes.
- leaderless-replication: Any node can accept writes (Dynamo-style).
- asynchronous-replication: Leader returns success without waiting for followers.
- replication-lag: The delay between a write on the leader and its arrival at a follower.
- quorum: Minimum numbers of nodes (w for writes, r for reads, with w + r > n) that must acknowledge an operation for it to succeed (see the sketch after this list).
- read-repair: Fixing stale data during a read.
- anti-entropy: Background process for keeping replicas in sync.
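A minimal sketch of quorum reads and writes with read repair, assuming n = 3 replicas and w = r = 2 (the in-memory dicts stand in for real nodes):

```python
import random

N, W, R = 3, 2, 2  # with w + r > n, read and write quorums must overlap

replicas = [{"value": None, "version": 0} for _ in range(N)]

def quorum_write(value, version):
    acks = 0
    for node in random.sample(replicas, W):   # any w nodes may acknowledge
        node["value"], node["version"] = value, version
        acks += 1
    return acks >= W

def quorum_read():
    responses = random.sample(replicas, R)
    latest = max(responses, key=lambda n: n["version"])
    for node in responses:                    # read repair: fix stale copies
        if node["version"] < latest["version"]:
            node.update(latest)
    return latest["value"]

quorum_write("v1", version=1)
print(quorum_read())  # sees v1, because the r nodes overlap the w nodes
```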
Consistency Guarantees
- eventual-consistency: Replicas converge to the same state eventually, with no bound on the replication lag.
- read-after-write-consistency: Ensuring users see their own updates.
- monotonic-reads: Preventing users from seeing data move backward in time.
- consistent-prefix-reads: Maintaining causality in sequences of writes.
Conflict Resolution
- failover: Handling the failure of a leader node.
- conflict-resolution: Reconciling concurrent updates.
- last-write-wins: Resolving conflicts by keeping only the write with the latest timestamp, silently discarding concurrent writes.
- version-vectors: Tracking causal dependencies between writes (see the sketch after this list).
- crdts: Data structures that automatically resolve conflicts.
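A small sketch of version-vector comparison, assuming each vector is a map from replica id to a per-replica counter:

```python
def compare(vv_a, vv_b):
    """Decide how two version vectors relate causally."""
    keys = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(k, 0) > vv_b.get(k, 0) for k in keys)
    b_ahead = any(vv_b.get(k, 0) > vv_a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"    # neither happened before the other: conflict
    if a_ahead:
        return "a supersedes b"
    if b_ahead:
        return "b supersedes a"
    return "equal"

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))  # a supersedes b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))  # concurrent: merge
```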
Partitioning
- partitioning: Breaking a large dataset into smaller subsets.
- key-range-partitioning: Assigning a continuous range of keys to each partition.
- hash-partitioning: Distributing keys by their hash value.
- consistent-hashing: Approach to hash partitioning that minimizes rebalancing work (see the sketch after this list).
- rebalancing: Moving partitions between nodes in a cluster.
- hot-spots: Partitions with disproportionately high load.
- local-index: Secondary index partitioned with the data (scatter/gather).
- global-index: Secondary index partitioned by term (term-partitioned).
- service-discovery: Finding the correct node for a request.
- gossip-protocol: Decentralized information sharing.
- split-brain: Scenario where two nodes simultaneously believe they are the leader.
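An illustrative consistent-hashing ring, with no virtual nodes for brevity (MD5 is used only as a convenient stand-in for a partitioning hash):

```python
import bisect, hashlib

def ring_hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hashing ring: each key belongs to the first
    node found clockwise from the key's position on the ring."""

    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:123"))
# Adding a node moves only the keys in one segment of the ring, which is
# why rebalancing stays cheap compared to hash-mod-N partitioning.
```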
Transactions
- transactions: An abstraction layer for grouping reads and writes.
- acid: Safety guarantees (Atomicity, Consistency, Isolation, Durability).
- read-committed: Basic isolation level (no dirty reads or dirty writes).
- snapshot-isolation: Consistent reads from a point-in-time snapshot.
- multi-version-concurrency-control: Implementing isolation using multiple versions of objects.
- lost-updates: Concurrency anomaly in read-modify-write cycles (see the sketch after this list).
- write-skew: Race condition where transactions update different objects based on a shared premise.
- serializability: Strongest isolation level, guaranteeing serial execution.
- two-phase-locking: Pessimistic concurrency control via shared/exclusive locks.
- serializable-snapshot-isolation: Optimistic concurrency control with snapshot isolation.
- distributed-transactions: Transactions spanning multiple nodes or systems.
- xa: Standard interface for 2PC across heterogeneous systems.
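A sketch of the lost-update anomaly and its pessimistic fix, using a thread lock as a stand-in for a database's row lock or atomic increment:

```python
import threading

counter = {"value": 0}
lock = threading.Lock()

def unsafe_increment():
    v = counter["value"]       # read...
    counter["value"] = v + 1   # ...modify-write: another thread's write in
                               # between is silently overwritten (lost update)

def safe_increment():
    with lock:                 # atomic read-modify-write, as a database
        counter["value"] += 1  # does with a row lock or atomic operation

threads = [threading.Thread(target=safe_increment) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(counter["value"])        # always 100; without the lock it can be less
```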
The Trouble with Distributed Systems
- partial-failures: Non-deterministic failures where some parts of the system are broken.
- unreliable-networks: Communication channels that can lose, delay, or reorder messages.
- unreliable-clocks: Local clocks that can drift or jump, making them unreliable for ordering.
- process-pauses: Temporary interruptions in program execution (e.g., GC pauses).
- fencing-tokens: Monotonically increasing tokens that prevent nodes with expired leases from corrupting data (see the sketch after this list).
- byzantine-faults: Nodes that lie or act maliciously (Byzantine Generals Problem).
- system-models: Formal assumptions about timing (synchronous/asynchronous) and failures.
- safety-and-liveness: Formal properties for reasoning about distributed algorithm correctness.
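A minimal sketch of fencing-token checking on the storage side (class name and token values are illustrative):

```python
class FencedStorage:
    """Storage service that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token_seen = -1
        self.value = None

    def write(self, token, value):
        if token < self.highest_token_seen:
            raise PermissionError(f"stale fencing token {token}: rejected")
        self.highest_token_seen = token
        self.value = value

storage = FencedStorage()
storage.write(token=33, value="from client 1")   # client 1 holds the lease
storage.write(token=34, value="from client 2")   # lease expired, reissued
try:
    storage.write(token=33, value="late write")  # client 1 paused (e.g., GC)
except PermissionError as e:                     # and retries after waking
    print(e)
```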
Consistency and Consensus
Distributed Consistency
- linearizability: Making a distributed system appear as if there is only one copy of data.
- cap-theorem: The trade-off between consistency and availability during a network partition.
- causality: Ordering events based on their “happened-before” relationship.
- lamport-timestamps: Generating a total order of events consistent with causality (see the sketch after this list).
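A small sketch of Lamport timestamps, where (counter, node id) pairs give a total order consistent with causality (though they cannot detect concurrency):

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, timestamp):
        remote_counter, _ = timestamp
        # Take the maximum of the counters, so causality is preserved.
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()        # (1, 'A')
t2 = b.receive(t1)   # (2, 'B'), ordered after t1
print(t1 < t2)       # True: tuple comparison yields the total order
```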
Consensus and Coordination
- consensus: Getting multiple nodes to agree on a value despite failures.
- total-order-broadcast: Protocol for delivering the same messages in the same order to all nodes.
- atomic-commit: Ensuring all nodes in a transaction either commit or abort.
- two-phase-commit: Classic algorithm for distributed atomic commit (see the sketch after this list).
- zookeeper: Coordination and configuration service for distributed systems.
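An illustrative two-phase-commit coordinator, with in-process participants standing in for networked nodes (a real implementation also needs timeouts and crash recovery from the coordinator log):

```python
def two_phase_commit(coordinator_log, participants):
    """Every participant must vote yes before anyone is allowed to commit."""
    # Phase 1: prepare -- participants promise they *can* commit.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)  # the decision is made durable first
    # Phase 2: participants must obey the decision, even across crashes.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

class Participant:
    def __init__(self, can_commit): self.can_commit = can_commit
    def prepare(self): return self.can_commit   # vote yes or no
    def commit(self):  print("committed")
    def abort(self):   print("aborted")

print(two_phase_commit([], [Participant(True), Participant(False)]))  # abort
```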
Derived Data
Batch Processing
- batch-processing: Offline processing of large, bounded datasets.
- unix-philosophy: Composable tools and uniform interfaces.
- hdfs: Distributed filesystem for large-scale data storage.
- mapreduce: The foundational distributed batch processing model (see the word-count sketch after this list).
- dataflow-engines: Evolution of batch processing using DAGs (Spark, Flink).
- joins: Distributed join strategies (map-side vs. reduce-side).
- reduce-side-joins: General-purpose sort-merge joins.
- map-side-joins: Performance optimizations for joins.
- materialization: Writing intermediate state to disk for fault tolerance.
- pregel: Iterative graph processing model.
- throughput: Primary performance metric for batch systems.
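A single-process sketch of the MapReduce word-count pattern; sorting the mapped pairs stands in for the framework's distributed sort-merge shuffle:

```python
from itertools import groupby
from operator import itemgetter

def mapper(document):
    for word in document.split():
        yield (word.lower(), 1)          # emit key-value pairs

def reducer(word, counts):
    yield (word, sum(counts))            # combine all values for one key

def map_reduce(documents):
    # Map phase, then shuffle: sorting puts equal keys next to each other.
    pairs = sorted(kv for doc in documents for kv in mapper(doc))
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results

print(map_reduce(["the quick brown fox", "the lazy dog", "the fox"]))
```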
Stream Processing
- stream-processing: Continuous processing of unbounded event streams.
- event-streams: Sequences of immutable events representing things that happened.
- log-based-message-brokers: Message brokers that use append-only logs for durability and replay.
- change-data-capture: Extracting database changes into a stream.
- event-sourcing: Architecture where application state is stored as a sequence of events.
- complex-event-processing: Pattern matching across event streams.
- windowing: Dividing unbounded streams into time-based buckets (see the sketch after this list).
- stream-joins: Techniques for joining continuous streams.
- exactly-once-semantics: Guaranteeing that the effect of an event is reflected only once.
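A minimal sketch of tumbling (fixed, non-overlapping) windows over a batch of timestamped events; the event data is illustrative:

```python
def tumbling_windows(events, window_size_s):
    """Assign each event to a fixed, non-overlapping time bucket."""
    windows = {}
    for timestamp, value in events:
        window_start = timestamp - (timestamp % window_size_s)
        windows.setdefault(window_start, []).append(value)
    return windows

events = [(1000, "a"), (1030, "b"), (1065, "c"), (1130, "d")]  # (epoch_s, payload)
for start, batch in sorted(tumbling_windows(events, 60).items()):
    print(f"window [{start}, {start + 60}): {len(batch)} events")
```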
The Future of Data Systems
- unbundling-databases: Decomposing database functions into specialized, composed services.
- data-integration: Harmonizing disparate data systems via derived dataflows.
- end-to-end-argument: The necessity of application-level correctness measures.
- timeliness-and-integrity: Distinguishing between eventual consistency and data corruption.
- idempotence: Enabling safe retries in distributed and asynchronous systems (see the sketch after this list).
- lambda-architecture: Combining batch and stream layers for robust data processing.
- data-ethics: Navigating the social and moral impact of data technologies.
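A small sketch of idempotent event processing via deduplication, assuming event ids are unique (in a real system the processed-id set would live in durable storage):

```python
processed_ids = set()

def process_once(event_id, apply_effect):
    """Make at-least-once delivery behave like exactly-once processing."""
    if event_id in processed_ids:
        return "duplicate: skipped"      # a retry of a message we handled
    apply_effect()
    processed_ids.add(event_id)
    return "processed"

print(process_once("evt-1", lambda: print("charge card")))  # processed
print(process_once("evt-1", lambda: print("charge card")))  # safe retry
```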