Introduction
Automerge is a library that allows people to work together collaboratively without central coordination or a reliable connection. It is a specific implementation of a conflict-free replicated data type (or CRDT).
This document describes the storage format used when serializing Automerge documents and changes for storage or transfer.
We strongly encourage people to use a library based on the reference implementation automerge-rs (which is available as a C shared library or a WebAssembly module for ease of integration). That said, this document should let you get started building your own, or at least understanding how Automerge works.
The storage format is designed for compactness and speed of parsing. Automerge stores the full history of changes to the document: this is a large amount of data but in practice it is very repetitive and amenable to compression.
In addition to parsing the storage format, an Automerge library must resolve conflicts between concurrent operations in a consistent way. This document does not yet discuss how to do that (pull requests welcome :D), but the reference implementation should serve as a guide.
Terminology and Conventions
Concepts
Document
An Automerge document is a collaboratively editable JSON-like structure. The serialized form of a document contains the complete history of changes and operations that collaborators have applied.
Automerge documents have a root value that is a map from string keys to arbitrary values.
Change
A change is a group of operations that modify a document, analogous to a "commit" in a version control system like git.
Each change is made by an actor, and has a (possibly empty) set of predecessor changes. Changes have an optional wall-clock timestamp, to keep track of when a change was committed, and an optional message to describe meaningful changes.
A change is identified by its change hash which is the [SHA256] hash of the binary representation of the change.
Actor
An actor applies a linear sequence of operations to a document and commits them in a sequence of changes. Each actor has an actor ID that uniquely identifies it. An actor ID is an arbitrary sequence of bytes, which should be generated in a way that will not collide with other actors (the reference library generates 128-bit random identifiers).
There is a small amount of per-actor overhead, so if you have one process that edits a document several times sequentially, it is preferable to re-use the same actor ID for each change.
Operation
An operation is an individual mutation made by an actor to a document. For example, setting a key to a value or inserting a character into some text.
An operation has an action which identifies what it does, and various other fields depending on that action. For example, a "set" operation has to identify the object being modified, the key in that object, and the new value.
Each operation is identified by an operation ID. An operation ID is a pair of (actor ID, counter), where the counter is a unique always-incrementing value per actor.
Object
An object represents a collaboratively editable value in an Automerge document. There are three kinds of object:
- map, which maps string keys to values
- list, which is an ordered list of values
- text, which is a collaboratively editable UTF-8 string
Each object is created by an operation with an action of "makeMap", "makeList" or "makeText", and is identified by its object ID. The object ID is the operation ID of the operation that created the object.
Each document has a root map, which is identified by the object ID with a null actor ID and null counter.
Value
Automerge objects are dynamically typed, and can contain any of the following kinds of value:
- map, list, text – the collaboratively editable objects
- null – a typed null
- bool – either true or false
- float – a 64-bit IEEE754 float
- int – a 64-bit signed integer
- uint – a 64-bit unsigned integer
- string – a UTF-8 encoded string (possibly containing U+0000)
- bytes – an arbitrary sequence of bytes
- timestamp – a 64-bit signed integer representing milliseconds since the unix epoch
- counter – a 64-bit signed integer that collaborators increment or decrement (instead of overwriting)
File structure
An Automerge file consists of one or more length delimited chunks. Implementations must attempt to read chunks until the end of the file.
Chunks
Field | Byte Length | Description
---|---|---
Magic bytes | 4 | The sequence 0x85 0x6f 0x4a 0x83
Checksum | 4 | Validates the integrity of the chunk
Chunk type | 1 | The type of this chunk
Chunk length | Variable (64-bit uLEB) | The length of the following chunk bytes
Chunk contents | Variable | The actual bytes for the chunk
If the first four bytes are not exactly the magic bytes implementations MUST abort.
The checksum is the first four bytes of the [SHA256] hash of the concatenation of the chunk type, chunk length and chunk contents fields. Implementations MUST abort reading if the checksum does not match.
Chunk Type
The chunk type is one of:

Value | Type | Description
---|---|---
0 | Document | Contains a graph of related changes
1 | Change | Contains a single change and its operations
2 | Compressed change | Contains a single change, deflate compressed
Document Chunk
The fields in a document chunk, in order, are:
Field | Type | Description
---|---|---
Actors | Array of Actor IDs | The actor IDs in sorted order
Heads | Array of Change Hashes | The hashes of the heads of the change graph in sorted order
Change columns metadata | Column Metadata | Description of the change columns
Operation columns metadata | Column Metadata | Description of the operation columns
Change columns | Column Data | The actual bytes for the change columns
Operation columns | Column Data | The actual bytes for the operation columns
Heads index | Heads Index | A lookup from change hash to change
A document contains a set of changes that represent the history of a collaboratively edited document. A document always contains a complete history of changes: for each change in the document, all the changes that were made to the document before that change was made are also included.
Document chunks use a columnar storage format for both changes and operations that assumes that the values of various fields are similar across adjacent changes and operations to optimize for high compression ratios and fast decoding.
Most fields are of arbitrary length, so parsing the document must proceed in order; for example it is not possible to know the length of the column fields until the column metadata has been parsed.
By implication, a document with no changes has the contents 0x00 0x00 0x00 0x00, as the counts of actors, heads, change columns and operation columns are all zero. With the chunk header, this gives a file consisting of the following bytes: 0x85 0x6f 0x4a 0x83 0xb8 0x1a 0x95 0x44 0x00 0x04 0x00 0x00 0x00 0x00.
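To make this worked example concrete, here is a minimal sketch (in Python, standard library only) of assembling and checksumming a chunk; the helper names are illustrative and not part of any Automerge API.

```python
import hashlib

MAGIC = bytes([0x85, 0x6F, 0x4A, 0x83])

def encode_uleb(n: int) -> bytes:
    """Encode an unsigned integer as a uLEB (see the Simple types section)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def make_chunk(chunk_type: int, contents: bytes) -> bytes:
    """Assemble a chunk: magic bytes, checksum, chunk type, uLEB length, contents."""
    body = bytes([chunk_type]) + encode_uleb(len(contents)) + contents
    # The checksum is the first four bytes of SHA256(type + length + contents).
    checksum = hashlib.sha256(body).digest()[:4]
    return MAGIC + checksum + body

# An empty document: the counts of actors, heads, change columns and
# operation columns are all zero, so the contents are four zero bytes.
empty_doc = make_chunk(0x00, bytes(4))
# Per the worked example above this should print 856f4a83b81a9544000400000000.
print(empty_doc.hex())
```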
Change Chunk
The fields in a change chunk, in order, are:
Field | Type | Description
---|---|---
Dependencies | Array of Change Hashes | The set of changes that this change depends on
Actor length | 64-bit uLEB | The length of the actor ID in bytes
Actor | bytes | The actor ID
Sequence number | 64-bit uLEB | The sequence number
Start op | 64-bit uLEB | The counter of the first op in this change
Time | 64-bit LEB | The time this change was created in milliseconds since the unix epoch
Message length | 64-bit uLEB | The length of the message in bytes
Message | UTF-8 encoded string | The message associated with this change
Other actors | Array of Actor IDs | Other actor IDs in this change
Operation columns metadata | Column Metadata | Description of the operation columns
Operation columns | Column Data | The actual bytes for the operation columns
Extra bytes | bytes | All data remaining in the chunk
A change chunk just contains a single change, its metadata and operations. It does not include any dependent changes, so you can only apply the change to a document that already contains those dependent changes.
Change chunks use a columnar storage format that assumes that the values of various fields are similar across adjacent operations to optimize for high compression ratios and fast decoding.
The extra bytes must be retained when processing changes. If future versions of automerge add new metadata to changes, this will allow old clients to collaborate with new clients without limiting which features the new clients can use.
Compressed Change Chunk
A compressed change chunk is a change chunk where the contents have been compressed using [DEFLATE]. The checksum of a compressed chunk is calculated as though the chunk were not compressed (with type = 1, the length of the uncompressed data, and the original uncompressed data).
For example, suppose you have a change with 372 bytes of data (excluding the header) which compresses to 323 bytes. If the original change chunk header consisted of the following bytes: 0x85 0x6f 0x4a 0x83 0x80 0xb7 0x5d 0x54 0x01 0xf4 0x02, then the compressed change chunk header would consist of: 0x85 0x6f 0x4a 0x83 0x80 0xb7 0x5d 0x54 0x02 0xc3 0x02
To decode a compressed change chunk first decompress the chunk contents, and then construct a new chunk with the same magic bytes and checksum, but with type = 1 and the length of the decompressed data. Then parse that as a change chunk.
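A sketch of this decoding step, assuming the chunk contents are a raw DEFLATE stream (no zlib or gzip wrapper) and reusing encode_uleb from the earlier sketch:

```python
import zlib

def inflate_change_chunk(magic: bytes, checksum: bytes, compressed: bytes) -> bytes:
    """Rebuild a plain change chunk (type 1) from a compressed change chunk (type 2)."""
    # wbits=-15 tells zlib to expect a raw DEFLATE stream.
    contents = zlib.decompress(compressed, wbits=-15)
    # The checksum is carried over unchanged: it was computed over the
    # uncompressed type, length and contents in the first place.
    return magic + checksum + bytes([0x01]) + encode_uleb(len(contents)) + contents
```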
Simple types
uLEB
uLEB is an unsigned little endian base 128 value. This is a variable length encoding used throughout.
To encode a uLEB, represent the number in binary and pad it with leading zeros so that its length is a multiple of 7. Take each group of 7 bits from least significant to most significant and output each group as a byte: the most significant bit of every byte is 1, except for the last byte, where it is 0.
- Unsigned ints 0 - 127 are stored as one byte: 0b00000000 - 0b01111111
- Unsigned ints 128 - 16383 are stored as two bytes: 0b10000000 0b00000001 - 0b11111111 0b01111111

etc.
To decode a uLEB, read bytes up to and including the first byte whose most significant bit is 0. Take the lower 7 bits of each byte (the last byte contains the most significant bits, so you need to concatenate them in the opposite order to which the bytes appear on disk).
Although uLEB encoding can support numbers of arbitrary bitsize, unsigned integers in Automerge must not exceed 64 bits. Implementations should fail to parse documents with uLEBs that decode to a value too large to be represented in a 64-bit unsigned integer.
Implementations must generate the shortest possible uLEB encodings, and should reject documents with overly long encodings. For example, using the decoding rules above, the bytes 0b10000000 0b00000000 would be decoded as 0; but this is overly long: 0 can be represented in just one byte as 0b00000000, so it should be rejected.
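As a sketch, the encoding and decoding rules above (including the overlong and 64-bit checks) might look like this:

```python
def encode_uleb(n: int) -> bytes:
    """Encode an unsigned integer as the shortest possible uLEB."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def decode_uleb(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a uLEB starting at offset; return (value, new_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            # A final zero byte carries no information unless it is the only
            # byte, so such an encoding is overly long and must be rejected.
            if byte == 0 and shift > 7:
                raise ValueError("overly long uLEB encoding")
            if result >= 1 << 64:
                raise ValueError("uLEB value does not fit in 64 bits")
            return result, offset

assert encode_uleb(0) == bytes([0b00000000])
assert encode_uleb(128) == bytes([0b10000000, 0b00000001])
assert decode_uleb(bytes([0b11111111, 0b01111111])) == (16383, 2)
```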
LEB
LEB is the signed variant of the little endian base 128 encoding.
To encode an LEB, represent the number in two's complement and sign-extend it so that its length is a multiple of seven bits. If the number is negative the padding will be 1-bits, and if the number is positive the padding will be 0-bits.
- 0 is represented as one byte: 0b00000000
- Ints from 1 to 63 are represented as one byte: 0b00000001 - 0b00111111
- Ints from -1 to -64 are represented as one byte: 0b01111111 - 0b01000000
- Ints from 64 to 8191 are represented as two bytes: 0b11000000 0b00000000 - 0b11111111 0b00111111
- Ints from -65 to -8192 are represented as two bytes: 0b10111111 0b01111111 - 0b10000000 0b01000000

etc.
To decode an LEB, read bytes up to and including the first byte whose most significant bit is 0. Take the lower 7 bits of each byte (the last byte contains the most significant bits, so you need to concatenate them in the opposite order to which the bytes appear on disk). If the sign bit of your number is 1 (the second most significant bit of the last byte in encoded form) then you have a negative number and you can take two's complement to get its absolute value; otherwise you have a positive number (or 0).
Although LEB encoding can support numbers of arbitrary bitsize, signed integers in Automerge must not exceed 64 bits. Implementations should fail to parse documents with LEBs that decode to a value too large to be represented in a 64-bit signed integer.
Implementations must generate the shortest possible LEB for a given integer, and should reject documents with overly long encodings. For example, using the decoding rules above, the bytes 0b11111111 0b01111111 would be decoded as -1; but this is overly long: -1 can be represented in just one byte as 0b01111111, so it should be rejected.
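A matching sketch for signed LEBs, following the same conventions as the uLEB sketch above:

```python
def encode_leb(n: int) -> bytes:
    """Encode a signed integer as the shortest possible LEB."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7  # Python's shift is arithmetic, so the sign is preserved
        # Stop once the remaining bits are pure sign extension and the sign
        # bit of this byte already agrees with the sign of the number.
        done = (n == 0 and not byte & 0x40) or (n == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

def decode_leb(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a signed LEB starting at offset; return (value, new_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            if byte & 0x40:            # sign bit set: sign-extend the result
                result -= 1 << shift
            return result, offset

assert encode_leb(-1) == bytes([0b01111111])
assert encode_leb(64) == bytes([0b11000000, 0b00000000])
assert decode_leb(bytes([0b10111111, 0b01111111])) == (-65, 2)
```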
Change Hash
A change hash is the 32-byte [SHA256] hash of the concatenation of the chunk type (0x01), chunk length and chunk contents fields of a change represented as a change chunk.
The first four bytes of the change hash are used as a checksum when a change chunk is serialized.
Action
The actions of the reference data model are encoded in the storage format as a byte as follows:
Byte | Action | Description
---|---|---
0x00 | makeMap | Creates a new map object
0x01 | set | Sets a key of a map, overwrites an item in a list, inserts an item in a list, or edits text
0x02 | makeList | Creates a new list object
0x03 | delete | Unsets a key of a map, or removes an item from a list (reducing its length)
0x04 | makeText | Creates a new text object
0x05 | increment | Increments a counter stored in a map or a list
Future versions of automerge may add new actions, and implementations must preserve operations containing actions they don’t support when processing changes for forward compatibility.
Column Specification
Column specifications are a 32-bit uLEB interpreted as a bitfield:
- The least significant three bits encode the column type
- The 4th least significant bit is 1 if the column is [DEFLATE] compressed and 0 otherwise
- The remaining bits are the column ID
If the deflate bit is set then the column data must first be decompressed using DEFLATE before proceeding with decoding the values.
The DEFLATE bit is only permitted in document chunks; implementations must abort if they find compressed columns in change chunks.
The ID defines the purpose of the column for either Change Columns or Operation Columns, and implementations must preserve columns that they do not understand.
The column type specifies how the data in the column is encoded. The possible types are:
Value | Description
---|---
0 | Group Column
1 | Actor Column
2 | uLEB Column
3 | Delta Column
4 | Boolean Column
5 | String Column
6 | Value Metadata Column
7 | Value Column
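A sketch of unpacking a column specification into its parts (the example values are taken from the change column table later in this document):

```python
def parse_column_spec(spec: int) -> tuple[int, bool, int]:
    """Split a column specification into (column_id, deflate, column_type)."""
    column_type = spec & 0b0111      # least significant three bits
    deflate = bool(spec & 0b1000)    # 4th least significant bit
    column_id = spec >> 4            # remaining bits
    return column_id, deflate, column_type

assert parse_column_spec(1) == (0, False, 1)    # "actor" change column: ID 0, actor column
assert parse_column_spec(67) == (4, False, 3)   # "dependencies index": ID 4, delta column
```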
Compound types
Array of Actor IDs
The actor ID array consists of a 64-bit uLEB giving the count of actor ids, followed by each actor ID as a length-prefixed byte array.
Each item in the array consists of a 64-bit uLEB giving the length in bytes, and then that number of bytes.
For example an array consisting of the single actor ID [0xab, 0xcd, 0xef] would be encoded as: 0x01 0x03 0xab 0xcd 0xef.
Implementations must store actor ids lexicographically, and should error when reading a document with actor ids in the wrong order.
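A sketch of encoding this array, reusing encode_uleb from the sketch above and enforcing the ordering requirement:

```python
def encode_actor_array(actors: list[bytes]) -> bytes:
    """Encode a count-prefixed array of length-prefixed actor IDs."""
    if actors != sorted(actors):
        raise ValueError("actor IDs must be stored in lexicographic order")
    out = bytearray(encode_uleb(len(actors)))
    for actor in actors:
        out += encode_uleb(len(actor)) + actor
    return bytes(out)

# The worked example above: a single actor ID [0xab, 0xcd, 0xef].
assert encode_actor_array([bytes([0xAB, 0xCD, 0xEF])]) == bytes([0x01, 0x03, 0xAB, 0xCD, 0xEF])
```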
Array of Change Hashes
The heads array consists of a 64-bit uLEB N giving the count of heads, followed by N change hashes each exactly 32-bytes long.
For example an array consisting of the heads f986a4318d1f1cc0e2e10e421e7a9a4cd0b70a89dae98bc1d76d789c2bf7904c and 4355a46b19d348dc2f57c046f8ef63d4538ebb936000f3c9ee954a27460dd865 would be represented as 0x02 0xf9 0x86 ..{28 bytes elided}.. 0x90 0x4c 0x43 0x55 ..{28 bytes elided}.. 0xd8 0x65
Heads Index
The heads index provides a lookup table from the change hash to the change in a document. Very old automerge documents may be missing this field.
The index consists of N 64-bit uLEBs (one per head in the Heads array of the document chunk); each uLEB gives the index of that head's change in the columnar change storage.
In a well-formed document, the change hash of the change indicated will match the change hash in the heads array, but implementations may choose not to validate this when parsing documents, to avoid having to recompute every change hash.
Column Metadata
The column metadata consists of a 64-bit uLEB N giving the number of columns, followed by N pairs describing each column in the chunk's column data.
Field | Description
---|---
Column Specification | A 32-bit uLEB encoded Column Specification
Column Length | A 64-bit uLEB giving the length (in bytes) of the column data block
The column specifications must be unique and sorted. Implementations must not include both an uncompressed and a compressed column with the same ID and type, and the columns should be ordered as though the deflate bit were set to 0.
A column that contains only null values, or is otherwise empty, should be omitted from the chunk. In this case there will be no column specification in the column metadata and no data in the column data.
In the case that there are no changes or operations at all, the column metadata will be encoded as 0x00 to indicate that there are no columns at all, and there will be no column data in the chunk.
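A sketch of parsing the column metadata into (specification, length) pairs, reusing decode_uleb and parse_column_spec from the sketches above:

```python
def parse_column_metadata(data: bytes, offset: int = 0):
    """Return ([(specification, length), ...], new_offset)."""
    count, offset = decode_uleb(data, offset)
    columns, last_key = [], None
    for _ in range(count):
        spec, offset = decode_uleb(data, offset)
        length, offset = decode_uleb(data, offset)
        # Specifications must be unique and sorted, ordered as though the
        # deflate bit were 0.
        key = spec & ~0b1000
        if last_key is not None and key <= last_key:
            raise ValueError("column specifications are duplicated or out of order")
        last_key = key
        columns.append((spec, length))
    return columns, offset
```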
Column Data
Columns are stored one after the other with no separators or length indicators. The columns are stored in the order they appear in the column metadata, and each can be decoded according to its column specification.
All columns must have the same number of items (or the same number of arrays of items for grouped columns), though as they are compressed differently they may have vastly different byte counts.
For future compatibility it is important that programs which edit Automerge documents maintain all columns, even those that they don’t understand the meaning of. When new changes or operations are added to a document with an unknown column a null should be added following the encoding rules of its specification.
Change Columns
The currently defined columns for changes in a document chunk are:
Name | Specification | ID | Type | Description
---|---|---|---|---
actor | 1 | 0 | Actor Column | The actor that made the change
sequence number | 3 | 0 | Delta Column | The sequence number for each change
maxOp | 19 | 1 | Delta Column | The largest counter that occurs in each change
time | 35 | 2 | Delta Column | The (optional) wallclock time at which each change was made
message | 53 | 3 | String Column | The (optional) commit message for each change
dependencies group | 64 | 4 | Group Column | The number of dependencies for each change
dependencies index | 67 | 4 | Grouped Delta Column | The indices of the changes this change depends on
extra metadata | 86 | 5 | Value Metadata Column | The metadata for any extra data for this change
extra data | 87 | 5 | Value Column | Any extra data for this change
Each value in the dependencies index column is an index into the changes that are stored in the document's columns. Implementations MUST abort if an index is out of bounds.

The sequence number of a change should be 1 if it is the first change by a given actor. Each subsequent change must have a sequence number exactly 1 higher than the previous change by the same actor. Implementations MUST abort if there are missing changes for a given actor ID.

The maxOp field of the change refers to the largest counter component of an operation ID in the set of operations in this change. For a given actor ID this must always increase. Implementations MUST abort if the maxOp of a change is not larger than the maxOp of every change from that actor with a smaller seq.
After decoding all the columns, and de-referencing indices into other columns, you will have an array of changes, where each change conceptually has the following fields:
Field | Type | Mapping
---|---|---
actor ID | array of bytes | The ID of the actor that made the change
seq | 64-bit uint | The sequence number of the change
ops | array of operations | The operations for this change (take all operations with counter greater than the previous change's maxOp and less than or equal to this change's maxOp)
deps | array of changes | The changes this change depends on (look up each index in the dependencies index in this document's change columns)
time | 64-bit int | The (optional) wallclock time of the change
message | utf-8 string | The (optional) message of the change
extra data | any | The (optional) extra data (parse the extra data column according to the extra metadata column)
Operation Columns
The currently defined columns for operations are:
Field | Specification | ID | Type | Description
---|---|---|---|---
object actor ID | 1 | 0 | Actor Column | The actor index of the object ID each operation targets
object counter | 2 | 0 | uLEB Column | The counter of the object ID each operation targets
key actor ID | 17 | 1 | Actor Column | The actor of the operation ID of the key of each operation
key counter | 19 | 1 | Delta Column | The counter of the operation ID of the key of each operation
key string | 21 | 1 | String Column | The string key each operation targets
actor ID | 33 | 2 | Actor Column | The actor of each operation's ID
counter | 35 | 2 | Delta Column | The counter of each operation's ID
insert | 52 | 3 | Boolean Column | Whether or not this is an insert operation
action | 66 | 4 | uLEB Column | The Action of each operation
value metadata | 86 | 5 | Value Metadata Column | The metadata for the value of this operation
value | 87 | 5 | Value Column | The value of this operation
predecessor group | 112 | 7 | Group Column | The group for the predecessors of this operation (only in change chunks)
predecessor actor IDs | 113 | 7 | Grouped Actor Column | The actor ID of each predecessor's operation ID
predecessor counters | 115 | 7 | Grouped Delta Column | The counter of each predecessor's operation ID
successor group | 128 | 8 | Group Column | The group for the successors of this operation (only in document chunks)
successor actor IDs | 129 | 8 | Grouped Actor Column | The actor ID of each successor's operation ID
successor counters | 131 | 8 | Grouped Delta Column | The counter of each successor's operation ID
Warning: The JavaScript implementation includes a child column; is this required?
We determine the key that the operation refers to as follows (a sketch of this lookup follows the list):

- If the key string is not null then this is the key of the operation (when modifying a map).
- Otherwise we use the pair (lookup_actor(key actor ID), key counter) as the key of the operation (when modifying a list).
- If the key string is null and either of key actor or key counter is null, implementations MUST abort.
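Here lookup_actor is assumed to be whatever maps an actor index to an actor ID; a minimal sketch:

```python
def operation_key(key_str, key_actor, key_counter, lookup_actor):
    """Resolve the key of an operation from the key string / key actor / key counter columns."""
    if key_str is not None:
        return key_str                                  # a map key
    if key_actor is None or key_counter is None:
        raise ValueError("malformed operation key")     # implementations MUST abort
    return (lookup_actor(key_actor), key_counter)       # a list element, i.e. an operation ID
```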
Operations are stored with their predecessors in change chunks and with successors in document chunks. For more information see the section on implementation concerns.
After decoding all the columns, and de-referencing indices into other columns, you will have an array of operations, where each operation conceptually has the following fields:
Field | Type | Mapping to columns
---|---|---
Object | Object ID | The object modified by this operation (column 0)
Key | String or Operation ID | The position in that object to modify (column 1)
ID | Operation ID | The ID of this operation, and thus the object ID of any object it creates (column 2)
Insert | boolean | Whether this operation inserts a new element into a list or text object (column 3)
Action | Action | The action this operation takes
Value | primitive value | The value inserted by this operation (if needed)
Successors | Operations | Future operations that affect the object created by this operation (if any)
Column Types
Run Length Encoding
Many columns use run length encoding to compress repeated values. Such columns are encoded as repeated pairs of the form (length, value), where length is a signed LEB:

- If length is positive, then value is a single instance of the value, which occurs length times.
- If length is 0, then this pair represents a null value, and value is the uLEB encoding of the number of times null occurs.
- If length is negative, then value is a literal run and the absolute value of length is the number of items in the literal run. That is to say, there is no compression.
For example if you were trying to compress the array of uLEBs [0,0,0,null,null,1,2,3] you would encode it as 0x03 0x00 0x00 0x02 0x7d 0x01 0x02 0x03.
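A sketch of decoding an RLE column, reusing decode_uleb and decode_leb from the sketches above; read_value is whatever function reads one raw value for the column in question, and None stands for null:

```python
def decode_rle(data: bytes, read_value):
    offset, values = 0, []
    while offset < len(data):
        length, offset = decode_leb(data, offset)
        if length > 0:                      # a run of one repeated value
            value, offset = read_value(data, offset)
            values.extend([value] * length)
        elif length == 0:                   # a run of nulls; the count follows as a uLEB
            count, offset = decode_uleb(data, offset)
            values.extend([None] * count)
        else:                               # a literal run of -length values
            for _ in range(-length):
                value, offset = read_value(data, offset)
                values.append(value)
    return values

# The worked example above: [0, 0, 0, null, null, 1, 2, 3]
data = bytes([0x03, 0x00, 0x00, 0x02, 0x7D, 0x01, 0x02, 0x03])
assert decode_rle(data, decode_uleb) == [0, 0, 0, None, None, 1, 2, 3]
```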
Group Column
Some fields in automerge have multiple values per change or operation. An example of this is the dependencies index of a change. The group column (denoted by column type 0) defines how many values should be read from each grouped column when parsing each change or operation.
Grouping affects all columns whose specification has the same ID, so a group column will usually be followed by one or more columns with the same ID but different types. It is possible to have a group column with no matching grouped columns if the grouped column is completely empty, as it will then be omitted.
The group column is a run length encoded list of 64-bit uLEBs that specifies how many items should be read from the subsequent grouped columns per change or operation. Implementations MUST abort if they cannot read the correct number of values from each of the grouped columns.
For example if you had five changes in a document with [0,1,2,2,2] dependencies each, the group column would be encoded as 0x7e 0x00 0x01 0x03 0x02, and the dependencies index column would contain seven values.
Note that it is not possible for two columns in a group to have the same type as it would not be possible to have a deterministic ordering for the column specifications. Implementations MUST abort if they encounter two column specifications with the same type and column ID.
Implementations MUST abort if they encounter multiple group column specifications with the same ID.
Actor Column
An actor column (denoted by column type 1) uses run length encoding to compress a list of uLEBs that represent an index into an array of actor ids.
In a document chunk the index is the position of the actor id in the array of actor IDs.
In a change chunk, index 0 represents the actor ID of the change, and indices 1 and above are assigned to the other actor IDs in the order they appear.
uLEB Column
A uLEB column (denoted by column type 2) uses run length encoding to compress a list of 64-bit uLEBs.
It is used (instead of a delta column) when there is no expectation that delta compression would help reduce the storage requirement, or if the column may contain null values.
Delta Column
A delta column (denoted by column type 3) stores a list of 64-bit uLEBs by encoding the difference between each value and its predecessor as a signed LEB, and then run length encoding those deltas.
The sequence is assumed to start from zero, so if you wanted to encode the list [3,4,5,6,9,7,8] you would first calculate the list of deltas [+3,+1,+1,+1,+3,-2,+1], and then run length encode the resulting signed LEBs to get the bytes 0x7f 0x03 0x03 0x01 0x7d 0x03 0x7e 0x01.
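A sketch of decoding a delta column by RLE-decoding the signed deltas and taking a running sum, reusing decode_rle and decode_leb from the sketches above (how nulls interact with the running sum is an assumption here; see the warning below):

```python
def decode_delta_column(data: bytes):
    values, total = [], 0
    for delta in decode_rle(data, decode_leb):
        if delta is None:
            values.append(None)   # assumption: nulls pass through without moving the sum
        else:
            total += delta
            values.append(total)
    return values

# The worked example above: [3, 4, 5, 6, 9, 7, 8]
assert decode_delta_column(bytes([0x7F, 0x03, 0x03, 0x01, 0x7D, 0x03, 0x7E, 0x01])) == [3, 4, 5, 6, 9, 7, 8]
```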
Warning: How should applications handle a decoded delta which takes the running value below zero?
Boolean Column
A boolean column (denoted by column type 4) encodes a list of booleans. The column contains a sequence of 64-bit uLEB integers which represent the lengths of alternating runs of false and true. The initial value of the column is always false.
For example if you wanted to encode the list [true, true, false, false, false], you would end up with a list of lengths of [0,2,3], which would be encoded as 0x00 0x02 0x03.
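A sketch of decoding a boolean column, reusing decode_uleb from the sketch above:

```python
def decode_boolean_column(data: bytes):
    offset, current, values = 0, False, []
    while offset < len(data):
        length, offset = decode_uleb(data, offset)
        values.extend([current] * length)   # a run of the current value
        current = not current               # runs alternate, starting from false
    return values

# The worked example above: [true, true, false, false, false]
assert decode_boolean_column(bytes([0x00, 0x02, 0x03])) == [True, True, False, False, False]
```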
String Column
A string column (denoted by column type 5) uses run length encoding to compress a list of length-prefixed UTF-8 strings. Each string is encoded as a 64-bit uLEB followed by that many literal bytes.
For example, if you wanted to encode the list ["a", "", null, "boo", "boo"] you would end up with 0x7e 0x01 0x61 0x00 0x00 0x01 0x02 0x03 0x62 0x6f 0x6f.
Value Metadata Column
The value metadata column (denoted by column type 6) is always paired with a value column with the same ID. The metadata column is a run length encoded list of 64-bit LEBs that defines the type and length of each value in the value column.
These integers are laid out like so:
- The lower four bits encode the type of the value
- The higher bits encode the length of the value

The type code may be one of:
Value | Type | Representation of value
---|---|---
0 | Null | Not present (length = 0)
1 | False | Not present (length = 0)
2 | True | Not present (length = 0)
3 | Unsigned integer | 64-bit uLEB in value column (length = 1..10)
4 | Signed integer | 64-bit LEB in value column (length = 1..10)
5 | IEEE754 float | 64-bit IEEE754 float in value column (length = 8)
6 | UTF8 string | UTF-8 string in value column (length = 0..2^60)
7 | Bytes | Arbitrary bytes in value column (length = 0..2^60)
8 | Counter | 64-bit LEB in value column (length = 1..10)
9 | Timestamp | 64-bit LEB in value column (length = 1..10)
If the type tag is none of these values it may be a value produced by a future version of Automerge. In this case implementations MUST read and store the type code and length bytes when reading, and write them back in the same position when writing.
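A sketch of splitting a value metadata integer into its type code and length:

```python
def parse_value_metadata(meta: int) -> tuple[int, int]:
    """Return (type_code, length_in_bytes) for one value metadata integer."""
    type_code = meta & 0x0F   # lower four bits
    length = meta >> 4        # remaining bits
    return type_code, length

# Type 5 is an IEEE754 float, which always occupies 8 bytes in the raw
# value column, so its metadata integer is (8 << 4) | 5 = 133.
assert parse_value_metadata(133) == (5, 8)
```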
If the bytes in a UTF8 string value (type 6) are not valid utf-8, then implementations should replace them by the unicode replacement character (U+FFFD).
Warning: Replacing invalid utf-8 seems like it might be a bad idea? Should check this. I think it's what the javascript implementation does though.
Value Column
The value column (denoted by column type 7) contains raw values. The type and length of each value in the column is determined by the value metadata column with the same column ID.
Note that raw value columns which do not contain values may be omitted. If implementations encounter a lone value metadata column they must assume that it is accompanied by an empty raw value column.
Implementations must abort if they encounter a raw value column not preceded by a metadata column with the same ID. Implementations must also abort if they encounter more than one metadata column with the same column ID, or more than one raw value column with the same ID.
Unknown columns
When reading the column metadata applications may encounter column specifications which they are not expecting. These column specifications may be produced by future versions of the application. If an implementation encounters an unknown column whilst reading data it MUST retain this data when writing that data back to storage.
This is possible because every column type has some concept of a null value. When inserting new rows into a collection of rows stored in the columnar storage format, applications MUST write a null value into columns which they do not recognise for the new rows they are inserting.
Warning: What should the null value be for boolean or delta columns?
Implementation concerns
Below are some notes that may help implementors build compatible automerge implementations. They are likely not yet complete, and any differences between what is written here and the reference implementation should be resolved in favor of that.
Operations in document chunks
Ordering of operations
Operations are grouped by the object that they manipulate. Objects are then sorted by their IDs. Thus operations are ordered using the following procedure:
Warning: Is this required? If so, should implementations abort if the operations are not in this order?
- First sort by object ID, such that any operations for the same object are consecutive. The null objectId (i.e. the root object) is sorted before all non-null objectIds. Non-null objectIds are sorted by Lamport timestamp.
- For each object:
  - If the object is a map, sort the operations within that object lexicographically by key, so that all operations for the same key are consecutive. This sort order MUST be based on the UTF-8 byte sequence of the key.
  - If the object is a list or text, sort the operations within that object by the operation ID of the element they target. This is determined as follows:
    - For insert operations the target element is the operation ID of the inserting operation
    - For set or delete operations the target is the operation ID in the key field
- Among the operations for the same key (for maps) or the same list element (for lists/text), sort the operations by their opId, using Lamport timestamp ordering. For list elements, note that the operation that inserted the element will always have an opId that is lower than the opId of any operation that updates or deletes that list element, and therefore the insertion operation will always be the first operation for a given list element.
Warning: The JavaScript implementation currently does not do this sorting correctly, since it sorts keys by JavaScript string comparison, which differs from UTF-8 lexicographic ordering for characters beyond the basic multilingual plane.
Successors and omitting deletes
The document storage format does not encode a predecessors field. Instead this information is encoded in the successors field, which can be used to reconstruct the predecessors field from the reference data model.
Delete operations do not carry any information other than the object ID and key they are deleting. As such they are encoded in the document by appending the operation ID of the delete operation to the successors of the operation creating the data to be deleted.
Implementations MUST abort if they encounter explicitly encoded delete operations in a document chunk.
Calculating predecessors
Operations in the document format are not stored in the order they were generated, as they are in the change data model. Furthermore, operations in the document format have a successor rather than a predecessor field. The following procedure specifies how to map from document operations to the change operations.
First expand operations:

- Add an empty predecessor list to every document operation
- For each operation in the document operation rows:
  - For each operation ID in the successors list of the document operation, look up the target operation in the document operations:
    - If an operation is found, add the current operation ID to the target operation's predecessor list
    - If no operation is found, insert a new delete operation into the document with its ID set to the target operation ID, the object and key set to the same value as the current operation, and the predecessors set to the current operation.
Second, match up changes. For each document operation:

- Sort all the changes for the same actor as the operation ID by ascending maxOp
- Add the document operation to the first change which has maxOp >= counter, where counter is the counter component of the operation ID. Implementations MUST abort if no matching change is found.

Finally, for each change, sort the operations within the change by the Lamport timestamp of their operation IDs.
Lamport Timestamps
Operation IDs are Lamport timestamps, which imposes a total ordering (a comparison sketch follows the list). To compare two Lamport timestamps:

- If the counter components are different then whichever timestamp has the larger counter is the larger
- If the counter components are the same but the actor IDs are different then the actor ID which is lexicographically larger is considered the larger timestamp
- Otherwise the two timestamps are equal
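A sketch of this comparison, modelling an operation ID as a (counter, actor ID bytes) pair:

```python
def lamport_key(op_id: tuple[int, bytes]):
    """Sort key implementing the total order on operation IDs."""
    counter, actor = op_id
    return (counter, actor)   # counter first, then lexicographic actor ID

ops = [(2, b"\x01"), (1, b"\xff"), (2, b"\x00")]
assert sorted(ops, key=lamport_key) == [(1, b"\xff"), (2, b"\x00"), (2, b"\x01")]
```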
Hash verification
The dependencies in the document model are expressed as integer offsets, but in the reference data model dependencies are expressed as the hashes of the ancestor changes. To map to the hash based representation, perform a topological traversal of the dependency graph; for each change, serialize the change as a Change Chunk and calculate its hash as described in Change Hash, then replace every index referring to that change with the calculated hash.
Once this procedure is complete, take the heads of the dependency graph and compare their hashes with the head hashes field in the document chunk. If the hashes don't match, implementations MUST abort.