Introduction
Automerge is a library that allows people to work together collaboratively without central coordination or a reliable connection. It is a specific implementation of a conflict-free replicated data type (or CRDT).
This document describes the storage format used when serializing Automerge documents and changes for storage or transfer.
We strongly encourage people to use a library based on the reference implementation automerge-rs (which is available as a C shared library or a WebAssembly module for ease of integration). That said, this document should let you get started building your own, or at least understanding how Automerge works.
The storage format is designed for compactness and speed of parsing. Automerge stores the full history of changes to the document: this is a large amount of data but in practice it is very repetitive and amenable to compression.
In addition to parsing the storage format, an Automerge library must resolve conflicts between concurrent operations in a consistent way. This document does not yet discuss how to do that (pull requests welcome :D), but the reference implementation should serve as a guide.
Terminology and Conventions
Concepts
Document
An Automerge document is a collaboratively editable JSON-like structure. The serialized form of a document contains the complete history of changes and operations that collaborators have applied.
Automerge documents have a root value that is a map from string keys to arbitrary values.
Change
A change is a group of operations that modify a document, analogous to a "commit" in a version control system like git.
Each change is made by an actor, and has a (possibly empty) set of predecessor changes. Changes have an optional wall-clock timestamp, to keep track of when a change was committed, and an optional message to describe meaningful changes.
A change is identified by its change hash which is the [SHA256] hash of the binary representation of the change.
Actor
An actor applies a linear sequence of operations to a document and commits them in a sequence of changes. Each actor has an actor ID that uniquely identifies it. An actor ID is an arbitrary sequence of bytes, which should be generated in a way that will not collide with other actors (the reference library generates 128-bit random identifiers).
There is a small amount of per-actor overhead, so if you have one process that edits a document several times sequentially, it is preferable to re-use the same actor ID for each change.
Operation
An operation is an individual mutation made by an actor to a document. For example, setting a key to a value or inserting a character into some text.
An operation has an action which identifies what it does, and various other fields depending on that action. For example, a "set" operation has to identify the object being modified, the key in that object, and the new value.
Each operation is identified by an operation ID. An operation ID is a pair of (actor ID, counter), where the counter is a unique always-incrementing value per actor.
Object
An object represents a collaboratively editable value in an Automerge document. There are three kinds of object:
- map, which maps string keys to values
- list, which is an ordered list of values
- text, which is a collaboratively editable UTF-8 string
Each object is created by an operation with an action of "makeMap", "makeList" or "makeText", and is identified by its object ID. The object ID is the operation ID of the operation that created the object.
Each document has a root map, which is identified by the object ID with a null actor ID and null counter.
Value
Automerge objects are dynamically typed, and can contain any of the following kinds of value:
- map, list, text – the collaboratively editable objects
- null – a typed null
- bool – either true or false
- float – a 64-bit IEEE754 float
- int – a 64-bit signed integer
- uint – a 64-bit unsigned integer
- string – a UTF-8 encoded string (possibly containing U+0000)
- bytes – an arbitrary sequence of bytes
- timestamp – a 64-bit signed integer representing milliseconds since the unix epoch
- counter – a 64-bit signed integer that collaborators increment or decrement (instead of overwriting)
File structure
An Automerge file consists of one or more length delimited chunks. Implementations must attempt to read chunks until the end of the file.
Chunks
Field | Byte Length | Description
---|---|---
Magic bytes | 4 | The sequence 0x85 0x6f 0x4a 0x83
Checksum | 4 | Validates the integrity of the chunk
Chunk type | 1 | The type of this chunk
Chunk length | Variable (64-bit uLEB) | The length of the following chunk bytes
Chunk contents | Variable | The actual bytes for the chunk
If the first four bytes are not exactly the magic bytes implementations MUST abort.
The checksum is the first four bytes of the [SHA256] hash of the concatenation of the chunk type, chunk length and chunk contents fields. Implementations MUST abort reading if the checksum does not match.
Chunk Type
The chunk type is one of:

Value | Type | Description
---|---|---
0 | Document | Contains a graph of related changes
1 | Change | Contains a single change and its operations
2 | Compressed change | Contains a single change, deflate compressed
Document Chunk
The fields in a document chunk, in order, are:
Field | Type | Description
---|---|---
Actors | Array of Actor IDs | The actor IDs in sorted order
Heads | Array of Change Hashes | The hashes of the heads of the change graph in sorted order
Change columns metadata | Column Metadata | Description of the change columns
Operation columns metadata | Column Metadata | Description of the operation columns
Change columns | Column Data | The actual bytes for the change columns
Operation columns | Column Data | The actual bytes for the operation columns
Heads index | Heads Index | A lookup from change hash to change
A document contains a set of changes that represent the history of a collaboratively edited document. A document always contains a complete history of changes: for each change in the document, all the changes that were made to the document before that change was made are also included.
Document chunks use a columnar storage format for both changes and operations that assumes that the values of various fields are similar across adjacent changes and operations to optimize for high compression ratios and fast decoding.
Most fields are of arbitrary length, so parsing the document must proceed in order; for example it is not possible to know the length of the column fields until the column metadata has been parsed.
By implication, a document with no changes has the contents 0x00 0x00 0x00 0x00, as the counts of actors, heads, change columns and operation columns are all zero. With the chunk header, this gives a file consisting of the following bytes: 0x85 0x6f 0x4a 0x83 0xb8 0x1a 0x95 0x44 0x00 0x04 0x00 0x00 0x00 0x00.
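To make this worked example concrete, here is a minimal sketch (in Python, standard library only) of assembling and checksumming a chunk; the helper names are illustrative and not part of any Automerge API.

```python
import hashlib

MAGIC = bytes([0x85, 0x6F, 0x4A, 0x83])

def encode_uleb(n: int) -> bytes:
    """Encode an unsigned integer as a uLEB (see the Simple types section)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def make_chunk(chunk_type: int, contents: bytes) -> bytes:
    """Assemble a chunk: magic bytes, checksum, chunk type, uLEB length, contents."""
    body = bytes([chunk_type]) + encode_uleb(len(contents)) + contents
    # The checksum is the first four bytes of SHA256(type + length + contents).
    checksum = hashlib.sha256(body).digest()[:4]
    return MAGIC + checksum + body

# An empty document: the counts of actors, heads, change columns and
# operation columns are all zero, so the contents are four zero bytes.
empty_doc = make_chunk(0x00, bytes(4))
# Per the worked example above this should print 856f4a83b81a9544000400000000.
print(empty_doc.hex())
```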
Change Chunk
The fields in a change chunk, in order, are:
Field | Type | Description
---|---|---
Dependencies | Array of Change Hashes | The set of changes that this change depends on
Actor length | 64-bit uLEB | The length of the actor ID in bytes
Actor | bytes | The actor ID
Sequence number | 64-bit uLEB | The sequence number
Start op | 64-bit uLEB | The counter of the first op in this change
Time | 64-bit LEB | The time this change was created in milliseconds since the unix epoch
Message length | 64-bit uLEB | The length of the message in bytes
Message | UTF-8 encoded string | The message associated with this change
Other actors | Array of Actor IDs | Other actor IDs in this change
Operation columns metadata | Column Metadata | Description of the operation columns
Operation columns | Column Data | The actual bytes for the operation columns
Extra bytes | bytes | All data remaining in the chunk
A change chunk just contains a single change, its metadata and operations. It does not include any dependent changes, so you can only apply the change to a document that already contains those dependent changes.
Change chunks use a columnar storage format that assumes that the values of various fields are similar across adjacent operations to optimize for high compression ratios and fast decoding.
The extra bytes must be retained when processing changes. If future versions of automerge add new metadata to changes, this will allow old clients to collaborate with new clients without limiting which features the new clients can use.
Compressed Change Chunk
A compressed change chunk is a change chunk where the contents have been compressed using [DEFLATE]. The checksum of a compressed chunk is calculated as though the chunk were not compressed (with type = 1, the length of the uncompressed data, and the original uncompressed data).
For example, suppose you have a change with 372 bytes of data (excluding the header) which compresses to 323 bytes. If the original change chunk header consisted of the following bytes: 0x85 0x6f 0x4a 0x83 0x80 0xb7 0x5d 0x54 0x01 0xf4 0x02, then the compressed change chunk header would consist of: 0x85 0x6f 0x4a 0x83 0x80 0xb7 0x5d 0x54 0x02 0xc3 0x02
To decode a compressed change chunk first decompress the chunk contents, and then construct a new chunk with the same magic bytes and checksum, but with type = 1 and the length of the decompressed data. Then parse that as a change chunk.
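A sketch of this decoding step, assuming the chunk contents are a raw DEFLATE stream (no zlib or gzip wrapper) and reusing encode_uleb from the earlier sketch:

```python
import zlib

def inflate_change_chunk(magic: bytes, checksum: bytes, compressed: bytes) -> bytes:
    """Rebuild a plain change chunk (type 1) from a compressed change chunk (type 2)."""
    # wbits=-15 tells zlib to expect a raw DEFLATE stream.
    contents = zlib.decompress(compressed, wbits=-15)
    # The checksum is carried over unchanged: it was computed over the
    # uncompressed type, length and contents in the first place.
    return magic + checksum + bytes([0x01]) + encode_uleb(len(contents)) + contents
```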
Simple types
uLEB
uLEB is an unsigned little endian base 128 value. This is a variable length encoding used throughout.
To encode a uLEB, represent the number in binary and pad it with leading zeros so that its length is a multiple of 7. Take each group of 7 bits from least significant to most significant and output each group as a byte: the most significant bit of every byte is 1, except for the last byte, where it is 0.
- Unsigned ints 0 - 127 are stored as one byte: 0b00000000 - 0b01111111
- Unsigned ints 128 - 16383 are stored as two bytes: 0b10000000 0b00000001 - 0b11111111 0b01111111

etc.
To decode a uLEB, read bytes up to and including the first byte whose most significant bit is 0. Take the lower 7 bits of each byte (the last byte contains the most significant bits, so you need to concatenate them in the opposite order to which the bytes appear on disk).
Although uLEB encoding can support numbers of arbitrary bitsize, unsigned integers in Automerge must not exceed 64 bits. Implementations should fail to parse documents with uLEBs that decode to a value too large to be represented in a 64-bit unsigned integer.
Implementations must generate the shortest possible uLEB encodings, and should reject documents with overly long encodings. For example, using the decoding rules above, the bytes 0b10000000 0b00000000 would be decoded as 0; but this is overly long: 0 can be represented in just one byte as 0b00000000, so it should be rejected.
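As a sketch, the encoding and decoding rules above (including the overlong and 64-bit checks) might look like this:

```python
def encode_uleb(n: int) -> bytes:
    """Encode an unsigned integer as the shortest possible uLEB."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def decode_uleb(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a uLEB starting at offset; return (value, new_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            # A final zero byte carries no information unless it is the only
            # byte, so such an encoding is overly long and must be rejected.
            if byte == 0 and shift > 7:
                raise ValueError("overly long uLEB encoding")
            if result >= 1 << 64:
                raise ValueError("uLEB value does not fit in 64 bits")
            return result, offset

assert encode_uleb(0) == bytes([0b00000000])
assert encode_uleb(128) == bytes([0b10000000, 0b00000001])
assert decode_uleb(bytes([0b11111111, 0b01111111])) == (16383, 2)
```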
LEB
LEB is the signed variant of the little endian base 128 encoding.
To encode an LEB, represent the number in two's complement and sign-extend it so that its length is a multiple of seven bits. If the number is negative the padding will be 1-bits, and if the number is positive the padding will be 0-bits.
- 0 is represented as one byte: 0b00000000
- Ints from 1 to 63 are represented as one byte: 0b00000001 - 0b00111111
- Ints from -1 to -64 are represented as one byte: 0b01111111 - 0b01000000
- Ints from 64 to 8191 are represented as two bytes: 0b11000000 0b00000000 - 0b11111111 0b00111111
- Ints from -65 to -8192 are represented as two bytes: 0b10111111 0b01111111 - 0b10000000 0b01000000

etc.
To decode an LEB, read bytes up to and including the first byte whose most significant bit is 0. Take the lower 7 bits of each byte (the last byte contains the most significant bits, so you need to concatenate them in the opposite order to which the bytes appear on disk). If the sign bit of your number is 1 (the second most significant bit of the last byte in encoded form) then you have a negative number and you can take two's complement to get its absolute value; otherwise you have a positive number (or 0).
Although LEB encoding can support numbers of arbitrary bitsize, signed integers in Automerge must not exceed 64 bits. Implementations should fail to parse documents with LEBs that decode to a value too large to be represented in a 64-bit signed integer.
Implementations must generate the shortest possible LEB for a given integer, and should reject documents with overly long encodings. For example, using the decoding rules above, the bytes 0b11111111 0b01111111 would be decoded as -1; but this is overly long: -1 can be represented in just one byte as 0b01111111, so it should be rejected.
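A matching sketch for signed LEBs, following the same conventions as the uLEB sketch above:

```python
def encode_leb(n: int) -> bytes:
    """Encode a signed integer as the shortest possible LEB."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7  # Python's shift is arithmetic, so the sign is preserved
        # Stop once the remaining bits are pure sign extension and the sign
        # bit of this byte already agrees with the sign of the number.
        done = (n == 0 and not byte & 0x40) or (n == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

def decode_leb(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a signed LEB starting at offset; return (value, new_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            if byte & 0x40:            # sign bit set: sign-extend the result
                result -= 1 << shift
            return result, offset

assert encode_leb(-1) == bytes([0b01111111])
assert encode_leb(64) == bytes([0b11000000, 0b00000000])
assert decode_leb(bytes([0b10111111, 0b01111111])) == (-65, 2)
```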
Change Hash
A change hash is the 32-byte [SHA256] hash of the concatenation of the chunk type (0x01), chunk length and chunk contents fields of a change represented as a change chunk.
The first four bytes of the change hash are used as a checksum when a change chunk is serialized.
Action
The actions of the reference data model are encoded in the storage format as a byte as follows:
Byte | Action | Description
---|---|---
0x00 | makeMap | Creates a new map object
0x01 | set | Sets a key of a map, overwrites an item in a list, inserts an item in a list, or edits text
0x02 | makeList | Creates a new list object
0x03 | delete | Unsets a key of a map, or removes an item from a list (reducing its length)
0x04 | makeText | Creates a new text object
0x05 | increment | Increments a counter stored in a map or a list
Future versions of automerge may add new actions, and implementations must preserve operations containing actions they don’t support when processing changes for forward compatibility.
Column Specification
Column specifications are a 32-bit uLEB interpreted as a bitfield:
- The least significant three bits encode the column type
- The 4th least significant bit is 1 if the column is [DEFLATE] compressed and 0 otherwise
- The remaining bits are the column ID
If the deflate bit is set then the column data must first be decompressed using DEFLATE before proceeding with decoding the values.
The DEFLATE bit is only permitted in document chunks; implementations must abort if they find compressed columns in change chunks.
The ID defines the purpose of the column for either Change Columns or Operation Columns, and implementations must preserve columns that they do not understand.
The column type specifies how the data in the column is encoded. The possible types are:
Value | Description
---|---
0 | Group Column
1 | Actor Column
2 | uLEB Column
3 | Delta Column
4 | Boolean Column
5 | String Column
6 | Value Metadata Column
7 | Value Column
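A sketch of unpacking a column specification into its parts (the example values are taken from the change column table later in this document):

```python
def parse_column_spec(spec: int) -> tuple[int, bool, int]:
    """Split a column specification into (column_id, deflate, column_type)."""
    column_type = spec & 0b0111      # least significant three bits
    deflate = bool(spec & 0b1000)    # 4th least significant bit
    column_id = spec >> 4            # remaining bits
    return column_id, deflate, column_type

assert parse_column_spec(1) == (0, False, 1)    # "actor" change column: ID 0, actor column
assert parse_column_spec(67) == (4, False, 3)   # "dependencies index": ID 4, delta column
```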
Compound types
Array of Actor IDs
The actor ID array consists of a 64-bit uLEB giving the count of actor ids, followed by each actor ID as a length-prefixed byte array.
Each item in the array consists of a 64-bit uLEB giving the length in bytes, and then that number of bytes.
For example an array consisting of the single actor ID [0xab, 0xcd, 0xef] would be encoded as: 0x01 0x03 0xab 0xcd 0xef.
Implementations must store actor ids lexicographically, and should error when reading a document with actor ids in the wrong order.
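A sketch of encoding this array, reusing encode_uleb from the sketch above and enforcing the ordering requirement:

```python
def encode_actor_array(actors: list[bytes]) -> bytes:
    """Encode a count-prefixed array of length-prefixed actor IDs."""
    if actors != sorted(actors):
        raise ValueError("actor IDs must be stored in lexicographic order")
    out = bytearray(encode_uleb(len(actors)))
    for actor in actors:
        out += encode_uleb(len(actor)) + actor
    return bytes(out)

# The worked example above: a single actor ID [0xab, 0xcd, 0xef].
assert encode_actor_array([bytes([0xAB, 0xCD, 0xEF])]) == bytes([0x01, 0x03, 0xAB, 0xCD, 0xEF])
```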
Array of Change Hashes
The heads array consists of a 64-bit uLEB N giving the count of heads, followed by N change hashes each exactly 32-bytes long.
For example an array consisting of the heads f986a4318d1f1cc0e2e10e421e7a9a4cd0b70a89dae98bc1d76d789c2bf7904c and 4355a46b19d348dc2f57c046f8ef63d4538ebb936000f3c9ee954a27460dd865 would be represented as 0x02 0xf9 0x86 ..{28 bytes elided}.. 0x90 0x4c 0x43 0x55 ..{28 bytes elided}.. 0xd8 0x65
Heads Index
The heads index provides a lookup table from the change hash to the change in a document. Very old automerge documents may be missing this field.
The index consists of N 64-bit uLEBs (one per head in the Heads array of the document chunk); each uLEB gives the index of that head's change in the columnar change storage.
In a well-formed document, the change hash of the change indicated will match the change hash in the heads array, but implementations may choose not to validate this when parsing documents, to avoid having to recompute every change hash.
Column Metadata
The column metadata consists of a 64-bit uLEB N giving the number of columns, followed by N pairs describing each column in the chunk's column data.
Field | Description
---|---
Column Specification | A 32-bit uLEB encoded Column Specification
Column Length | A 64-bit uLEB giving the length (in bytes) of the column data block
The column specifications must be unique and sorted. Implementations must not include both an uncompressed and a compressed column with the same ID and type, and the columns should be ordered as though the deflate bit were set to 0.
A column that contains only null values, or is otherwise empty, should be omitted from the chunk. In this case there will be no column specification in the column metadata and no data in the column data.
In the case that there are no changes or operations at all, the column metadata will be encoded as 0x00 to indicate that there are no columns at all, and there will be no column data in the chunk.
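A sketch of parsing the column metadata into (specification, length) pairs, reusing decode_uleb and parse_column_spec from the sketches above:

```python
def parse_column_metadata(data: bytes, offset: int = 0):
    """Return ([(specification, length), ...], new_offset)."""
    count, offset = decode_uleb(data, offset)
    columns, last_key = [], None
    for _ in range(count):
        spec, offset = decode_uleb(data, offset)
        length, offset = decode_uleb(data, offset)
        # Specifications must be unique and sorted, ordered as though the
        # deflate bit were 0.
        key = spec & ~0b1000
        if last_key is not None and key <= last_key:
            raise ValueError("column specifications are duplicated or out of order")
        last_key = key
        columns.append((spec, length))
    return columns, offset
```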
Column Data
Columns are stored one after the other with no separators or length indicators. The columns are stored in the order they appear in the column metadata, and each can be decoded according to its column specification.
All columns must have the same number of items (or the same number of arrays of items for grouped columns), though as they are compressed differently they may have vastly different byte counts.
For future compatibility it is important that programs which edit Automerge documents maintain all columns, even those that they don’t understand the meaning of. When new changes or operations are added to a document with an unknown column a null should be added following the encoding rules of its specification.
Change Columns
The currently defined columns for changes in a document chunk are:
Name | Specification | ID | Type | Description
---|---|---|---|---
actor | 1 | 0 | Actor Column | The actor that made the change
sequence number | 3 | 0 | Delta Column | The sequence number for each change
maxOp | 19 | 1 | Delta Column | The largest counter that occurs in each change
time | 35 | 2 | Delta Column | The (optional) wallclock time at which each change was made
message | 53 | 3 | String Column | The (optional) commit message for each change
dependencies group | 64 | 4 | Group Column | The number of dependencies for each change
dependencies index | 67 | 4 | Grouped Delta Column | The indices of the changes this change depends on
extra metadata | 86 | 5 | Value Metadata Column | The metadata for any extra data for this change
extra data | 87 | 5 | Value Column | Any extra data for this change
Each value in the dependencies index column is an index into the changes that are stored in the document's columns. Implementations MUST abort if an index is out of bounds.

The sequence number of a change should be 1 if it is the first change by a given actor. Each subsequent change must have a sequence number exactly 1 higher than the previous change by the same actor. Implementations MUST abort if there are missing changes for a given actor ID.

The maxOp field of the change refers to the largest counter component of an operation ID in the set of operations in this change. For a given actor ID this must always increase. Implementations MUST abort if the maxOp of a change is not larger than the maxOp of every change from that actor with a smaller seq.
After decoding all the columns, and de-referencing indices into other columns, you will have an array of changes, where each change conceptually has the following fields:
Field | Type | Mapping
---|---|---
actor ID | array of bytes | The ID of the actor that made the change
seq | 64-bit uint | The sequence number of the change
ops | array of operations | The operations for this change (take all operations with counter greater than the previous change's maxOp and less than or equal to this change's maxOp)
deps | array of changes | The changes this change depends on (look up each index in the dependencies index in this document's change columns)
time | 64-bit int | The (optional) wallclock time of the change
message | utf-8 string | The (optional) message of the change
extra data | any | The (optional) extra data (parse the extra data column according to the extra metadata column)
Operation Columns
The currently defined columns for operations are:
Field | Specification | ID | Type | Description
---|---|---|---|---
object actor ID | 1 | 0 | Actor Column | The actor index of the object ID each operation targets
object counter | 2 | 0 | uLEB Column | The counter of the object ID each operation targets
key actor ID | 17 | 1 | Actor Column | The actor of the operation ID of the key of each operation
key counter | 19 | 1 | Delta Column | The counter of the operation ID of the key of each operation
key string | 21 | 1 | String Column | The string key each operation targets
actor ID | 33 | 2 | Actor Column | The actor of each operation's ID
counter | 35 | 2 | Delta Column | The counter of each operation's ID
insert | 52 | 3 | Boolean Column | Whether or not this is an insert operation
action | 66 | 4 | uLEB Column | The Action of each operation
value metadata | 86 | 5 | Value Metadata Column | The metadata for the value of this operation
value | 87 | 5 | Value Column | The value of this operation
predecessor group | 112 | 7 | Group Column | The group for the predecessors of this operation (only in change chunks)
predecessor actor IDs | 113 | 7 | Grouped Actor Column | The actor ID of each predecessor's operation ID
predecessor counters | 115 | 7 | Grouped Delta Column | The counter of each predecessor's operation ID
successor group | 128 | 8 | Group Column | The group for the successors of this operation (only in document chunks)
successor actor IDs | 129 | 8 | Grouped Actor Column | The actor ID of each successor's operation ID
successor counters | 131 | 8 | Grouped Delta Column | The counter of each successor's operation ID
Warning: The JavaScript implementation includes a child column; is this required?
We determine the key that the operation refers to as follows (a sketch of this lookup follows the list):

- If the key string is not null then this is the key of the operation (when modifying a map).
- Otherwise we use the pair (lookup_actor(key actor ID), key counter) as the key of the operation (when modifying a list).
- If the key string is null and either of key actor or key counter is null, implementations MUST abort.
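Here lookup_actor is assumed to be whatever maps an actor index to an actor ID; a minimal sketch:

```python
def operation_key(key_str, key_actor, key_counter, lookup_actor):
    """Resolve the key of an operation from the key string / key actor / key counter columns."""
    if key_str is not None:
        return key_str                                  # a map key
    if key_actor is None or key_counter is None:
        raise ValueError("malformed operation key")     # implementations MUST abort
    return (lookup_actor(key_actor), key_counter)       # a list element, i.e. an operation ID
```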
Operations are stored with their predecessors in change chunks and with successors in document chunks. For more information see the section on implementation concerns.
After decoding all the columns, and de-referencing indices into other columns, you will have an array of operations, where each operation conceptually has the following fields:
Field | Type | Mapping to columns
---|---|---
Object | Object ID | The object modified by this operation (column 0)
Key | String or Operation ID | The position in that object to modify (column 1)
ID | Operation ID | The ID of this operation, and thus the object ID of any object it creates (column 2)
Insert | boolean | Whether this operation inserts a new element into a list or text object (column 3)
Action | Action | The action this operation takes
Value | primitive value | The value inserted by this operation (if needed)
Successors | Operations | Future operations that affect the object created by this operation (if any)
Column Types
Run Length Encoding
Many columns use run length encoding to compress repeated values. Such columns are encoded as repeated pairs of the form (length, value), where length is a signed LEB:

- If length is positive, then value is a single instance of the value, which occurs length times.
- If length is 0, then this pair represents a null value, and value is the uLEB encoding of the number of times null occurs.
- If length is negative, then value is a literal run and the absolute value of length is the number of items in the literal run. That is to say, there is no compression.
For example if you were trying to compress the array of uLEBs [0,0,0,null,null,1,2,3] you would encode it as 0x03 0x00 0x00 0x02 0x7d 0x01 0x02 0x03.
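A sketch of decoding an RLE column, reusing decode_uleb and decode_leb from the sketches above; read_value is whatever function reads one raw value for the column in question, and None stands for null:

```python
def decode_rle(data: bytes, read_value):
    offset, values = 0, []
    while offset < len(data):
        length, offset = decode_leb(data, offset)
        if length > 0:                      # a run of one repeated value
            value, offset = read_value(data, offset)
            values.extend([value] * length)
        elif length == 0:                   # a run of nulls; the count follows as a uLEB
            count, offset = decode_uleb(data, offset)
            values.extend([None] * count)
        else:                               # a literal run of -length values
            for _ in range(-length):
                value, offset = read_value(data, offset)
                values.append(value)
    return values

# The worked example above: [0, 0, 0, null, null, 1, 2, 3]
data = bytes([0x03, 0x00, 0x00, 0x02, 0x7D, 0x01, 0x02, 0x03])
assert decode_rle(data, decode_uleb) == [0, 0, 0, None, None, 1, 2, 3]
```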
Group Column
Some fields in automerge have multiple values per change or operation. An example of this is the dependencies index of a change. The group column (denoted by column type 0) defines how many values should be read from each grouped column when parsing each change or operation.
Grouping affects all columns whose specification has the same ID, so a group column will usually be followed by one or more columns with the same ID but different types. It is possible to have a group column with no matching grouped columns if the grouped column is completely empty, as it will then be omitted.
The group column is a run length encoded list of 64-bit uLEBs that specifies how many items should be read from the subsequent grouped columns per change or operation. Implementations MUST abort if they cannot read the correct number of values from each of the grouped columns.
For example if you had five changes in a document with [0,1,2,2,2] dependencies each, the group column would be encoded as 0x7e 0x00 0x01 0x03 0x02, and the dependencies index column would contain seven values.
Note that it is not possible for two columns in a group to have the same type as it would not be possible to have a deterministic ordering for the column specifications. Implementations MUST abort if they encounter two column specifications with the same type and column ID.
Implementations MUST abort if they encounter multiple group column specifications with the same ID.
Actor Column
An actor column (denoted by column type 1) uses run length encoding to compress a list of uLEBs that represent an index into an array of actor ids.
In a document chunk the index is the position of the actor id in the array of actor IDs.
In a change chunk, index 0 represents the actor ID of the change, and indices 1 and above are assigned to the other actor IDs in the order they appear.
uLEB Column
A uLEB column (denoted by column type 2) uses run length encoding to compress a list of 64-bit uLEBs.
It is used (instead of a delta column) when there is no expectation that delta compression would help reduce the storage requirement, or if the column may contain null values.
Delta Column
A delta column (denoted by column type 3) stores a list of 64-bit uLEBs by encoding the difference between each value and its predecessor as a signed LEB, and then run length encoding those deltas.
The sequence is assumed to start from zero, so if you wanted to encode the list [3,4,5,6,9,7,8] you would first calculate the list of deltas [+3,+1,+1,+1,+3,-2,+1], and then run length encode the resulting signed LEBs to get the bytes 0x7f 0x03 0x03 0x01 0x7d 0x03 0x7e 0x01.
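A sketch of decoding a delta column by RLE-decoding the signed deltas and taking a running sum, reusing decode_rle and decode_leb from the sketches above (how nulls interact with the running sum is an assumption here; see the warning below):

```python
def decode_delta_column(data: bytes):
    values, total = [], 0
    for delta in decode_rle(data, decode_leb):
        if delta is None:
            values.append(None)   # assumption: nulls pass through without moving the sum
        else:
            total += delta
            values.append(total)
    return values

# The worked example above: [3, 4, 5, 6, 9, 7, 8]
assert decode_delta_column(bytes([0x7F, 0x03, 0x03, 0x01, 0x7D, 0x03, 0x7E, 0x01])) == [3, 4, 5, 6, 9, 7, 8]
```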
Warning: How should applications handle a decoded delta which takes the running value below zero?
Boolean Column
A boolean column (denoted by column type 4) encodes a list of booleans. The column contains a sequence of 64-bit uLEB integers which represent the lengths of alternating runs of false and true. The initial value of the column is always false.
For example if you wanted to encode the list [true, true, false, false, false], you would end up with a list of lengths of [0,2,3], which would be encoded as 0x00 0x02 0x03.
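A sketch of decoding a boolean column, reusing decode_uleb from the sketch above:

```python
def decode_boolean_column(data: bytes):
    offset, current, values = 0, False, []
    while offset < len(data):
        length, offset = decode_uleb(data, offset)
        values.extend([current] * length)   # a run of the current value
        current = not current               # runs alternate, starting from false
    return values

# The worked example above: [true, true, false, false, false]
assert decode_boolean_column(bytes([0x00, 0x02, 0x03])) == [True, True, False, False, False]
```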
String Column
A string column (denoted by column type 5) uses run length encoding to compress a list of length-prefixed UTF-8 strings. Each string is encoded as a 64-bit uLEB followed by that many literal bytes.
For example, if you wanted to encode the list ["a", "", null, "boo", "boo"] you would end up with 0x7e 0x01 0x61 0x00 0x00 0x01 0x02 0x03 0x62 0x6f 0x6f.
Value Metadata Column
The value metadata column (denoted by column type 6) is always paired with a value column with the same ID. The metadata column is a run length encoded list of 64-bit LEBs that defines the type and length of each value in the value column.
These integers are laid out like so:
- The lower four bits encode the type of the value
- The higher bits encode the length of the value

The type code may be one of:
Value | Type | Representation of value
---|---|---
0 | Null | Not present (length = 0)
1 | False | Not present (length = 0)
2 | True | Not present (length = 0)
3 | Unsigned integer | 64-bit uLEB in value column (length = 1..10)
4 | Signed integer | 64-bit LEB in value column (length = 1..10)
5 | IEEE754 float | 64-bit IEEE754 float in value column (length = 8)
6 | UTF8 string | UTF-8 string in value column (length = 0..2^60)
7 | Bytes | Arbitrary bytes in value column (length = 0..2^60)
8 | Counter | 64-bit LEB in value column (length = 1..10)
9 | Timestamp | 64-bit LEB in value column (length = 1..10)
If the type tag is none of these values it may be a value produced by a future version of Automerge. In this case implementations MUST read and store the type code and length bytes when reading, and write them back in the same position when writing.
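A sketch of splitting a value metadata integer into its type code and length:

```python
def parse_value_metadata(meta: int) -> tuple[int, int]:
    """Return (type_code, length_in_bytes) for one value metadata integer."""
    type_code = meta & 0x0F   # lower four bits
    length = meta >> 4        # remaining bits
    return type_code, length

# Type 5 is an IEEE754 float, which always occupies 8 bytes in the raw
# value column, so its metadata integer is (8 << 4) | 5 = 133.
assert parse_value_metadata(133) == (5, 8)
```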
If the bytes in a UTF8 string value (type 6) are not valid utf-8, then implementations should replace them by the unicode replacement character (U+FFFD).
Warning: Replacing invalid utf-8 seems like it might be a bad idea? Should check this. I think it's what the javascript implementation does though.
Value Column
The value column (denoted by column type 7) contains raw values. The type and length of each value in the column is determined by the value metadata column with the same column ID.
Note that raw value columns which do not contain values may be omitted. If implementations encounter a lone value metadata column they must assume that it is accompanied by an empty raw value column.
Implementations must abort if they encounter a raw value column not preceded by a metadata column with the same ID. Implementations must also abort if they encounter more than one metadata column with the same column ID, or more than one raw value column with the same ID.
Unknown columns
When reading the column metadata applications may encounter column specifications which they are not expecting. These column specifications may be produced by future versions of the application. If an implementation encounters an unknown column whilst reading data it MUST retain this data when writing that data back to storage.
This is possible because every column type has some concept of a null value. When inserting new rows into a collection of rows stored in the columnar storage format, applications MUST write a null value into columns which they do not recognise for the new rows they are inserting.
Warning: What should the null value be for boolean or delta columns?
Implementation concerns
Below are some notes that may help implementors build compatible automerge implementations. They are likely not yet complete, and any differences between what is written here and the reference implementation should be resolved in favor of that.
Operations in document chunks
Ordering of operations
Operations are grouped by the object that they manipulate. Objects are then sorted by their IDs. Thus operations are ordered using the following procedure:
Warning: Is this required? If so, should implementations abort if the operations are not in this order?
- First sort by object ID, such that any operations for the same object are consecutive. The null objectId (i.e. the root object) is sorted before all non-null objectIds. Non-null objectIds are sorted by Lamport timestamp.
- For each object:
  - If the object is a map, sort the operations within that object lexicographically by key, so that all operations for the same key are consecutive. This sort order MUST be based on the UTF-8 byte sequence of the key.
  - If the object is a list or text, sort the operations within that object by the operation ID of the element they target. This is determined as follows:
    - For insert operations the target element is the operation ID of the inserting operation
    - For set or delete operations the target is the operation ID in the key field
- Among the operations for the same key (for maps) or the same list element (for lists/text), sort the operations by their opId, using Lamport timestamp ordering. For list elements, note that the operation that inserted the element will always have an opId that is lower than the opId of any operation that updates or deletes that list element, and therefore the insertion operation will always be the first operation for a given list element.
Warning: The JavaScript implementation currently does not do this sorting correctly, since it sorts keys by JavaScript string comparison, which differs from UTF-8 lexicographic ordering for characters beyond the basic multilingual plane.
Successors and omitting deletes
The document storage format does not encode a predecessors field. Instead this information is encoded in the successors field, which can be used to reconstruct the predecessors field from the reference data model.
Delete operations do not carry any information other than the object ID and key they are deleting. As such they are encoded in the document by appending the operation ID of the delete operation to the successors of the operation creating the data to be deleted.
Implementations MUST abort if they encounter explicitly encoded delete operations in a document chunk.
Calculating predecessors
Operations in the document format are not stored in the order they were generated, as they are in the change data model. Furthermore, operations in the document format have a successor rather than a predecessor field. The following procedure specifies how to map from document operations to the change operations.
First expand operations:

- Add an empty predecessor list to every document operation
- For each operation in the document operation rows:
  - For each operation ID in the successors list of the document operation, look up the target operation in the document operations:
    - If an operation is found, add the current operation ID to the target operation's predecessor list
    - If no operation is found, insert a new delete operation into the document with its ID set to the target operation ID, the object and key set to the same value as the current operation, and the predecessors set to the current operation.
Second, match up changes. For each document operation:

- Sort all the changes for the same actor as the operation ID by ascending maxOp
- Add the document operation to the first change which has maxOp >= counter, where counter is the counter component of the operation ID. Implementations MUST abort if no matching change is found.

Finally, for each change, sort the operations within the change by the Lamport timestamp of their operation IDs.
Lamport Timestamps
Operation IDs are Lamport timestamps, which imposes a total ordering (a comparison sketch follows the list). To compare two Lamport timestamps:

- If the counter components are different then whichever timestamp has the larger counter is the larger
- If the counter components are the same but the actor IDs are different then the actor ID which is lexicographically larger is considered the larger timestamp
- Otherwise the two timestamps are equal
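A sketch of this comparison, modelling an operation ID as a (counter, actor ID bytes) pair:

```python
def lamport_key(op_id: tuple[int, bytes]):
    """Sort key implementing the total order on operation IDs."""
    counter, actor = op_id
    return (counter, actor)   # counter first, then lexicographic actor ID

ops = [(2, b"\x01"), (1, b"\xff"), (2, b"\x00")]
assert sorted(ops, key=lamport_key) == [(1, b"\xff"), (2, b"\x00"), (2, b"\x01")]
```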
Hash verification
The dependencies in the document model are expressed as integer offsets, but in the reference data model dependencies are expressed as the hashes of the ancestor changes. To map to the hash based representation, perform a topological traversal of the dependency graph; for each change, serialize the change as a Change Chunk and calculate its hash as described in Change Hash, then replace every index referring to that change with the calculated hash.
Once this procedure is complete, take the heads of the dependency graph and compare their hashes with the head hashes field in the document chunk. If the hashes don't match, implementations MUST abort.