LLM Wiki: Designing Data-Intensive Applications

Welcome to the knowledge base for “Designing Data-Intensive Applications.”

Chapters

Foundations of Data Systems

Reliability

Scalability

Maintainability

Data Modeling

  • relational-model: Tabular data organization with strong join support.
  • document-model: Self-contained documents (JSON) with high data locality.
  • graph-models: Optimized for highly interconnected many-to-many data.
  • nosql: A movement toward non-relational, scalable, and flexible datastores.
  • impedance-mismatch: The friction between object-oriented code and relational tables.
  • schema-on-read: Implicit schemas interpreted at read time.
  • normalization: Removing duplication by storing each fact in one place and referring to it by ID (see the sketch after this list).
  • data-locality: Storing related data together on disk for performance.
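
A minimal sketch of the normalization idea, with hypothetical user and region records: the denormalized form duplicates a human-readable string in every row, while the normalized form stores an ID that resolves to a single authoritative entry at read time.

    # Denormalized: the region name is copied into every user record,
    # so renaming the region means rewriting many rows.
    users_denormalized = [
        {"id": 1, "name": "Alice", "region": "Greater Seattle Area"},
        {"id": 2, "name": "Bob", "region": "Greater Seattle Area"},
    ]

    # Normalized: each fact lives in exactly one place; users hold only an ID.
    regions = {100: "Greater Seattle Area"}
    users = [
        {"id": 1, "name": "Alice", "region_id": 100},
        {"id": 2, "name": "Bob", "region_id": 100},
    ]

    def user_with_region(user):
        # Resolve the ID at read time -- the join that the relational model makes cheap.
        return {**user, "region": regions[user["region_id"]]}

    print(user_with_region(users[0]))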

Querying

  • query-languages: Declarative vs. imperative ways to interact with data (contrast sketched after this list).
  • mapreduce: A programming model for bulk data processing.
  • cypher: A declarative query language for property graphs.
  • sparql: A query language for triple-stores (RDF).
  • datalog: A rule-based query language, much older than Cypher or SPARQL, whose approach underpins later graph query languages.
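
A minimal sketch of the declarative/imperative contrast (data and names are illustrative): the imperative version dictates how to traverse the data, while the declarative version states only what is wanted and leaves the execution strategy to the engine.

    animals = [
        {"name": "Great white", "family": "Sharks"},
        {"name": "Dolphin", "family": "Cetaceans"},
        {"name": "Hammerhead", "family": "Sharks"},
    ]

    # Imperative: spell out the loop, the iteration order, and the accumulator.
    def get_sharks(animals):
        sharks = []
        for animal in animals:
            if animal["family"] == "Sharks":
                sharks.append(animal)
        return sharks

    # Declarative in spirit: state the condition, not the traversal --
    # compare SQL: SELECT * FROM animals WHERE family = 'Sharks';
    sharks = [a for a in animals if a["family"] == "Sharks"]

    assert get_sharks(animals) == sharks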

Storage and Retrieval

  • hash-indexes: An in-memory hash map from each key to its byte offset in an append-only log (first sketch after this list).
  • sstables: Sorted string tables for efficient merging and indexing.
  • lsm-trees: Log-structured merge-trees for high write throughput.
  • b-trees: The standard update-in-place indexing structure.
  • compaction: Background space reclamation for log-structured stores.
  • bloom-filters: A probabilistic, memory-efficient test for key existence: false positives are possible, false negatives are not (second sketch after this list).
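
A minimal sketch of a hash index over an append-only log (file handling simplified; the key,value line format is a toy choice): writes append a record and store its byte offset in an in-memory dict, and reads seek straight to the latest offset.

    import os, tempfile

    class LogWithHashIndex:
        """Append-only log file plus an in-memory hash map of key -> byte offset."""

        def __init__(self, path):
            self.f = open(path, "a+b")      # writes always append; reads can seek
            self.index = {}                 # key -> offset of the latest record

        def set(self, key, value):
            self.f.seek(0, os.SEEK_END)
            self.index[key] = self.f.tell()             # where this record starts
            self.f.write(f"{key},{value}\n".encode())   # toy format: no commas in keys
            self.f.flush()

        def get(self, key):
            self.f.seek(self.index[key])    # one seek, one read
            _, value = self.f.readline().rstrip(b"\n").split(b",", 1)
            return value.decode()

    db = LogWithHashIndex(os.path.join(tempfile.mkdtemp(), "log.db"))
    db.set("42", "hello")
    db.set("42", "world")   # the old record is now garbage for compaction to reclaim
    print(db.get("42"))     # -> world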
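
A companion sketch of a Bloom filter (bit count and hashing scheme are arbitrary illustrative choices): k hash functions set k bits per key, so a membership test can return a false positive but never a false negative; this is why LSM-tree stores consult one before searching SSTables for a key that may not exist.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=5):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            # Derive k bit positions from one SHA-256 digest (illustrative scheme).
            digest = hashlib.sha256(key.encode()).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i : 4 * i + 4], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False means definitely absent; True means only possibly present.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("handbag")
    print(bf.might_contain("handbag"))     # True
    print(bf.might_contain("handcuffs"))   # almost certainly False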

Analytics (OLAP)

Encoding and Evolution

  • encoding: Translating in-memory data structures to byte sequences.
  • compatibility: Essential for evolvability in distributed systems.
  • thrift: Facebook's binary encoding with numbered field tags (the tag idea is sketched after this list).
  • protocol-buffers: Google's binary encoding, likewise built on numbered field tags.
  • avro: Binary encoding with writer’s and reader’s schemas.
  • rest: Design philosophy for web services.
  • rpc: Remote procedure call model and its leaky abstractions.
  • message-brokers: Asynchronous communication via intermediaries.
  • actor-model: Logic encapsulated in message-passing actors.
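
A minimal sketch of the field-tag idea shared by Thrift and Protocol Buffers (the wire layout below is invented for illustration and is not either library's actual format): each field is encoded as a (tag, length, payload) triple, so a reader built against an older schema can skip tags it does not recognize, which is the heart of forward compatibility.

    import struct

    def encode_record(fields):
        """fields: {tag_number: bytes}. Emit (tag, length, payload) triples."""
        out = b""
        for tag, payload in fields.items():
            out += struct.pack(">BI", tag, len(payload)) + payload
        return out

    def decode_record(data, known_tags):
        """Decode known tags; silently skip unknown ones (forward compatibility)."""
        fields, i = {}, 0
        while i < len(data):
            tag, length = struct.unpack_from(">BI", data, i)
            i += 5
            if tag in known_tags:
                fields[known_tags[tag]] = data[i : i + length]
            i += length
        return fields

    # The writer's schema has a new field 3; an old reader knows only tags 1 and 2.
    blob = encode_record({1: b"Martin", 2: b"martin@example.com", 3: b"new field"})
    print(decode_record(blob, {1: "user_name", 2: "email"}))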

Distributed Data

Replication

Consistency Guarantees

Conflict Resolution

Partitioning

Transactions

The Trouble with Distributed Systems

  • partial-failures: Non-deterministic failures in which some parts of the system are broken while others work fine.
  • unreliable-networks: Communication channels that can lose, delay, or reorder messages.
  • unreliable-clocks: Local clocks that can drift or jump, making them unreliable for ordering.
  • process-pauses: Temporary interruptions in program execution (e.g., GC pauses).
  • fencing-tokens: Monotonically increasing tokens that let storage reject writes from nodes whose leases have expired (see the sketch after this list).
  • byzantine-faults: Nodes that lie or act maliciously (Byzantine Generals Problem).
  • system-models: Formal assumptions about timing (synchronous/asynchronous) and failures.
  • safety-and-liveness: Formal properties for reasoning about distributed algorithm correctness.
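
A minimal sketch of fencing tokens (class names are illustrative; no real lock service shown): the lock service issues a monotonically increasing token with each lease, and storage rejects any write carrying a token older than the newest one it has seen, so a client that paused past its lease expiry cannot corrupt data.

    class LockService:
        """Issues leases tagged with monotonically increasing fencing tokens."""
        def __init__(self):
            self.token = 0

        def acquire(self):
            self.token += 1
            return self.token

    class Storage:
        """Rejects writes whose token is older than the newest token seen."""
        def __init__(self):
            self.max_token = 0
            self.value = None

        def write(self, token, value):
            if token < self.max_token:
                raise PermissionError(f"stale fencing token {token}")
            self.max_token = token
            self.value = value

    lock, storage = LockService(), Storage()
    t1 = lock.acquire()                  # client 1 gets token 1, then pauses (say, GC)
    t2 = lock.acquire()                  # lease expires; client 2 gets token 2
    storage.write(t2, "client 2 data")
    try:
        storage.write(t1, "client 1 data")   # the paused client's write is fenced off
    except PermissionError as e:
        print(e)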

Consistency and Consensus

Distributed Consistency

  • linearizability: Making a distributed system appear as if there is only one copy of data.
  • cap-theorem: The trade-off between consistency and availability during a network partition.
  • causality: Ordering events based on their “happened-before” relationship.
  • lamport-timestamps: Generating a total order of events consistent with causality (see the sketch after this list).
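
A minimal sketch of Lamport timestamps (message passing reduced to function calls): every node keeps a counter, stamps events with (counter, node_id), and on receipt sets its counter to the maximum of its own and the incoming one; comparing the pairs lexicographically yields a total order consistent with causality.

    class LamportNode:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counter = 0

        def local_event(self):
            self.counter += 1
            return (self.counter, self.node_id)    # timestamps compare as tuples

        def receive(self, timestamp):
            counter, _ = timestamp
            self.counter = max(self.counter, counter) + 1  # adopt and advance
            return (self.counter, self.node_id)

    a, b = LamportNode(1), LamportNode(2)
    t1 = a.local_event()   # (1, 1): a sends this with a message
    t2 = b.receive(t1)     # (2, 2): causally after t1, and t1 < t2 as tuples
    t3 = a.local_event()   # (2, 1): concurrent with t2, yet still totally ordered
    print(sorted([t3, t2, t1]))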

Consensus and Coordination

  • consensus: Getting multiple nodes to agree on a value despite failures.
  • total-order-broadcast: Protocol for delivering the same messages in the same order to all nodes.
  • atomic-commit: Ensuring all nodes in a transaction either commit or abort.
  • two-phase-commit: The classic algorithm for distributed atomic commit (see the sketch after this list).
  • zookeeper: Coordination and configuration service for distributed systems.
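
A minimal sketch of two-phase commit (in-memory objects stand in for nodes, and crash recovery is omitted): in phase 1 the coordinator asks every participant to prepare and collects yes/no votes; only a unanimous yes leads to a phase 2 commit, and any no aborts everywhere.

    class Participant:
        def __init__(self, name, can_commit=True):
            self.name = name
            self.can_commit = can_commit
            self.state = "init"

        def prepare(self):
            # Phase 1: durably promise to commit if later told to (vote yes/no).
            self.state = "prepared" if self.can_commit else "aborted"
            return self.can_commit

        def commit(self):
            self.state = "committed"

        def abort(self):
            self.state = "aborted"

    def two_phase_commit(participants):
        votes = [p.prepare() for p in participants]   # phase 1: collect every vote
        if all(votes):
            for p in participants:                    # phase 2: unanimous yes
                p.commit()
            return "committed"
        for p in participants:                        # any no: abort everywhere
            p.abort()
        return "aborted"

    nodes = [Participant("db1"), Participant("db2", can_commit=False)]
    print(two_phase_commit(nodes))                 # -> aborted
    print([(p.name, p.state) for p in nodes])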

Derived Data

Batch Processing

  • batch-processing: Offline processing of large, bounded datasets.
  • unix-philosophy: Composable tools and uniform interfaces.
  • hdfs: Distributed filesystem for large-scale data storage.
  • mapreduce: The foundational distributed batch processing model (see the sketch after this list).
  • dataflow-engines: Evolution of batch processing using DAGs (Spark, Flink).
  • joins: Distributed join strategies (map-side vs. reduce-side).
  • reduce-side-joins: General-purpose sort-merge joins.
  • map-side-joins: Join optimizations that avoid the shuffle when one input is small or both are identically partitioned.
  • materialization: Writing intermediate state to disk for fault tolerance.
  • pregel: Iterative graph processing model.
  • throughput: Primary performance metric for batch systems.
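
A minimal in-memory sketch of the MapReduce model (one process; real frameworks distribute the same three phases across machines): a mapper emits key-value pairs, the shuffle brings equal keys together by sorting, and a reducer folds each group. A reduce-side sort-merge join has the same shape, with the join key in place of the word.

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        # Map: emit (word, 1) for every word in an input record.
        for word in line.lower().split():
            yield (word, 1)

    def reducer(word, counts):
        # Reduce: fold all values that share a key.
        return (word, sum(counts))

    def mapreduce(lines):
        pairs = [kv for line in lines for kv in mapper(line)]
        pairs.sort(key=itemgetter(0))        # shuffle: sorting groups equal keys
        return [reducer(key, (v for _, v in group))
                for key, group in groupby(pairs, key=itemgetter(0))]

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(mapreduce(docs))   # [('brown', 1), ('dog', 1), ('fox', 2), ...]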

Stream Processing

The Future of Data Systems