Avro

Summary: A binary encoding format that uses a writer’s and a reader’s schema for flexible and efficient data representation.

Sources: chapter4

Last updated: 2026-04-15


Apache Avro was started in 2009 as a subproject of Hadoop. Unlike Thrift and Protocol Buffers, it does not use field tag numbers in its schemas or in its encoded data.

Schema-Based Encoding

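Avro’s binary format contains no field names, tags, or type annotations. It simply concatenates the values in the order the fields appear in the schema, so the bytes can only be decoded correctly with the exact schema that wrote them. For the same record, this makes the Avro encoding more compact than the Thrift and Protocol Buffers encodings.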
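To make the "just concatenate the values" idea concrete, here is a minimal sketch in plain Python rather than a real Avro library. It covers only two types (string and long), and the schema representation, field names, and helper names are assumptions made for illustration. Integers are written as zigzag-encoded variable-length values and strings as a length followed by UTF-8 bytes.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer: zigzag first, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its byte length followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


# Hypothetical schema: an ordered list of (field name, type) pairs.
# Field order is the only thing that links bytes to fields.
SCHEMA = [("userName", "string"), ("favoriteNumber", "long")]

ENCODERS = {"string": encode_string, "long": zigzag_varint}


def encode_record(schema, record) -> bytes:
    """Concatenate the encoded values in schema order; no names or tags are written."""
    return b"".join(ENCODERS[ftype](record[name]) for name, ftype in schema)


encoded = encode_record(SCHEMA, {"userName": "Martin", "favoriteNumber": 1337})
print(encoded.hex(" "))  # 0c 4d 61 72 74 69 6e f2 14
```

The nine output bytes contain nothing but a length, the string, and the number. A decoder that walked them with a different field order or different types would silently produce garbage.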

Writer’s and Reader’s Schema

The key to Avro is the distinction between two schemas:

  1. Writer’s Schema: The schema used to encode the data.
  2. Reader’s Schema: The schema the application expects when reading the data.

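Avro resolves differences between the two schemas during decoding by looking at them side by side and translating the data from the writer’s schema into the reader’s schema. Fields are matched by name, so their order may differ. A field that appears only in the writer’s schema is ignored, and a field that appears only in the reader’s schema is filled in with its default value.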
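The resolution step can be sketched in a few lines of Python. This is a simplified illustration, not the Avro library's implementation: it assumes the record has already been decoded with the writer's schema into a dict, it ignores type conversions, and the function name resolve and the field definitions are hypothetical.

```python
def resolve(writer_record: dict, reader_fields: list) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema."""
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            # Present in both schemas: matched by name, order is irrelevant.
            result[name] = writer_record[name]
        elif "default" in field:
            # Only in the reader's schema: fall back to the declared default.
            result[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields that exist only in the writer's record are never copied, so they are ignored.
    return result


# Hypothetical reader's schema, written as a list of field definitions.
reader_fields = [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "interests", "type": {"type": "array", "items": "string"}, "default": []},
]

# Record as decoded with the writer's schema: it has userID but no interests.
writer_record = {"userID": 42, "userName": "Martin", "favoriteNumber": 1337}

resolved = resolve(writer_record, reader_fields)
print(resolved)
```

Here userID is dropped because the reader does not know it, and interests is filled in from its default because the writer never wrote it.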

Schema Evolution Rules

  • Forward Compatibility: You can have a newer version of the schema as writer and an older version as reader.
  • Backward Compatibility: You can have a newer version as reader and an older version as writer.

Key Rules

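  • To maintain compatibility, only add or remove fields that have a default value. Adding a field without a default breaks backward compatibility (a new reader cannot fill it in when reading old data); removing a field without a default breaks forward compatibility (an old reader cannot fill it in when reading new data).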
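  • Changing a field’s name is backward-compatible but NOT forward-compatible. The reader’s schema can list aliases for a field, so a new reader can match the old name in the writer’s schema; an old reader has no way to recognize the new name.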
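The rules above can be expressed as a small check. The following is a rough sketch under simplifying assumptions: it only looks at field names, aliases, and defaults, not at type compatibility, and the function name can_read and the example schemas are hypothetical.

```python
def can_read(writer_fields: list, reader_fields: list) -> bool:
    """True if every reader field is either written by the writer or has a default."""
    writer_names = {f["name"] for f in writer_fields}
    for field in reader_fields:
        # A reader field matches a writer field by its name or by one of its aliases.
        candidates = {field["name"], *field.get("aliases", [])}
        if not (candidates & writer_names) and "default" not in field:
            return False
    return True


v1 = [{"name": "userName", "type": "string"}]

# v2 adds a field WITH a default: compatible in both directions.
v2 = v1 + [{"name": "favoriteNumber", "type": ["null", "long"], "default": None}]
print(can_read(v1, v2))  # backward (new reader, old writer): True
print(can_read(v2, v1))  # forward (old reader, new writer): True

# v3 adds a field WITHOUT a default: a new reader cannot read old data.
v3 = v1 + [{"name": "email", "type": "string"}]
print(can_read(v1, v3))  # backward: False
print(can_read(v3, v1))  # forward: True

# A rename with an alias works only for the new reader.
renamed = [{"name": "name", "aliases": ["userName"], "type": "string"}]
print(can_read(v1, renamed))  # backward: True
print(can_read(renamed, v1))  # forward: False
```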

Advantages

  • No Tag Numbers: Avoids the need for manual tag management.
  • Dynamic Generation: Friendly for dynamically generated schemas (e.g., from database tables).
  • Self-Describing: Binary files (Object Container Files) embed the writer’s schema, making them easy to share.
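As an illustration of the dynamic-generation point, the sketch below builds an Avro record schema from a description of a database table. The type mapping, the (column name, SQL type, nullable) tuple format, and the function name table_to_avro_schema are all assumptions made for the example; a real exporter would read this information from the database catalog.

```python
import json

# Hypothetical mapping from SQL column types to Avro primitive types.
SQL_TO_AVRO = {
    "integer": "int",
    "bigint": "long",
    "text": "string",
    "boolean": "boolean",
    "double precision": "double",
}


def table_to_avro_schema(table_name: str, columns: list) -> dict:
    """Turn a table into a record schema: one field per column, no tag numbers to assign."""
    fields = []
    for col_name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # Nullable columns become a union with null and default to null.
            fields.append({"name": col_name, "type": ["null", avro_type], "default": None})
        else:
            fields.append({"name": col_name, "type": avro_type})
    return {"type": "record", "name": table_name, "fields": fields}


schema = table_to_avro_schema(
    "users",
    [
        ("id", "bigint", False),
        ("user_name", "text", False),
        ("favorite_number", "bigint", True),
    ],
)
print(json.dumps(schema, indent=2))
```

If a column is later added or dropped, the schema can simply be regenerated. Because fields are matched by name, the new schema remains readable against old data, as long as the changed fields carry defaults.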