Encoding
Summary: The process of translating in-memory data structures into a byte sequence for storage or network transmission.
Sources: chapter4
Last updated: 2026-04-15
Programs work with data in two representations:
- In-Memory: Objects, structs, hash tables, and trees optimized for the CPU (often using pointers).
- On-Disk/On-Wire: Self-contained byte sequences (e.g., JSON documents) for transmission or storage.
The process of translating from in-memory to byte sequences is called encoding (also known as serialization or marshalling). The reverse is decoding (parsing, deserialization, or unmarshalling).
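As a minimal sketch of the round trip described above, the following uses Python's standard json module. The record contents and variable names are illustrative assumptions, not taken from the source.

```python
import json

# In-memory representation: a dict holding nested Python objects.
record = {"user": "alice", "tags": ["admin", "dev"], "active": True}

# Encoding (serialization): in-memory structure -> self-contained bytes.
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization): bytes -> a new in-memory structure.
decoded = json.loads(encoded.decode("utf-8"))
```

The decoded value is equal to the original record but is a separate object; only the byte sequence would actually travel over the network or sit on disk.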
Types of Encoding
Language-Specific Formats
Many languages ship with built-in encoding support (e.g., Java’s java.io.Serializable, Python’s pickle). These formats tie the data to a single language, often perform poorly, and make no forward- or backward-compatibility guarantees.
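To illustrate the language coupling, here is a small sketch using Python's pickle on a standard-library date value (the value itself is an arbitrary choice). The encoded bytes name the Python module that defines the class, so a reader in another language would have no direct way to interpret them.

```python
import pickle
from datetime import date

updated = date(2026, 4, 15)

# pickle produces bytes that reference the Python class by module and name.
blob = pickle.dumps(updated)
refers_to_python_module = b"datetime" in blob

# Decoding works, but only inside a Python process that has the same class.
restored = pickle.loads(blob)
```

Only unpickle data from a trusted source: decoding can execute arbitrary code, as the Python documentation warns.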
Textual Formats (JSON, XML, CSV)
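Widely supported and human-readable, but verbose and ambiguous about data types. XML and CSV cannot distinguish a number from a string of digits without an external schema, and JSON does not distinguish integers from floating-point numbers. This makes them less efficient and less precise for large datasets.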
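The type ambiguity can be sketched with Python's standard csv and json modules. The sample data is made up, and parse_int=float is used here only to imitate a parser that stores every number as an IEEE 754 double (as JavaScript does); Python's own json module keeps integers exact by default.

```python
import csv
import io
import json

# CSV carries no type information: every field comes back as a string.
rows = list(csv.reader(io.StringIO("id,count\n7,42\n")))
count_field = rows[1][1]  # a str, not an int

# JSON has a single number type. An integer above 2**53 cannot be
# represented exactly once a parser treats it as a double.
big = 2**53 + 1
encoded = json.dumps({"id": big})
as_double = json.loads(encoded, parse_int=float)["id"]
lost_precision = int(as_double) != big
```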
Binary Formats (Thrift, Protocol Buffers, Avro)
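More compact and faster to parse than textual formats. They use a schema to define the structure of the data, and they support schema evolution, so that code written against an older or newer schema version can still read the data.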
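The compactness argument can be sketched with a hand-rolled binary layout built on Python's struct module. This is not the actual wire format of Thrift, Protocol Buffers, or Avro; the two-field "schema" (a 32-bit user id followed by a length-prefixed name) and the function names are hypothetical. The point is only that when both sides agree on a schema, field names do not need to be repeated in every record.

```python
import json
import struct

# Hypothetical schema: uint32 user_id, uint8 name length, then UTF-8 name bytes.
HEADER = "<IB"  # little-endian, no padding

def encode_user(user_id: int, name: str) -> bytes:
    name_bytes = name.encode("utf-8")
    return struct.pack(HEADER, user_id, len(name_bytes)) + name_bytes

def decode_user(data: bytes) -> tuple[int, str]:
    user_id, name_len = struct.unpack_from(HEADER, data, 0)
    start = struct.calcsize(HEADER)
    return user_id, data[start:start + name_len].decode("utf-8")

binary = encode_user(1337, "Martin")
text = json.dumps({"user_id": 1337, "name": "Martin"}).encode("utf-8")
```

Here the binary record is considerably shorter than the JSON one because the JSON document spells out the field names and punctuation. Real schema-based formats add field tags or schema resolution on top of this idea so that schemas can evolve.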