Proposal for well defined Streams in Ceramic
Motivation
We want Ceramic streams to be a well defined primitive of the Ceramic protocol.
Additionally we want to be able to use streams for various kinds of use cases.
These use cases can vary wildly in how they expect streams to behave.
We need to design Ceramic in such a way that it can accommodate this variability.
We have identified that different kinds of streams can vary in how they aggregate events; however, I believe there are two other important ways in which stream types can vary.
First, in how they preserve old events in a stream: it is reasonable that a stream could delete/forget old events. Second, in how streams are grouped together, i.e. by model, by controller, or in some other application-specific way.
Currently streams represent a single use case of Models and Model Instance Documents with specific conflict resolution and aggregation rules.
Additionally it is baked into the protocol that all events for a stream must be preserved.
However if we want to maximize the utility of Ceramic we need to enable other kinds of streams.
A few possible stream kinds are:
- Last write wins
- First write wins
- Swap/Stomp operations
- CRDTs
- JSON Patch (i.e. MIDs)
There are potentially many others.
The design of Ceramic should treat stream kinds as opaque, so that it can accommodate any of these stream types.
This means we need to be careful about what requirements we place on a stream.
For example today a stream’s init event must always be preserved so we can track information about the stream like its controller.
However if we must always preserve an init event then we are potentially storing and replicating data within that event that is no longer useful.
Additionally, it is not known to Ceramic which groupings of streams are useful.
For example grouping MIDs by model is useful, but other groupings are useful for other types of streams.
We need a mechanism that allows the stream type to dictate grouping of streams.
In summary we need mechanisms for different kinds of streams to vary in these ways:
- how streams are aggregated (this includes conflict resolution),
- how streams are preserved,
- how streams are grouped.
Each of these questions has a potentially unbounded number of different behaviors.
Therefore we should design a system that allows for describing these behaviors separate from the stream type, otherwise we will have an ever growing stream type enumeration and specialized code paths for each.
The stream type enumeration must remain opaque to Ceramic.
The Proposal
Streams are aggregated according to the stream type enumeration.
This is basically what was concluded in Protocol Features to enable custom aggregators.
Streams are grouped along a set of dimensions provided in their init events, and no payload data can exist in init events.
This decouples the existence and creation of streams from the data within streams.
This enables streams to have different preservation modes as no data for the stream lives within the init events.
Stream data is preserved according to an enumeration of preservation modes.
I propose we implement only a single mode initially (i.e. always preserve) but can add additional modes as needed.
For example a last write wins mode can forget all previous events.
This leads us to a precise definition for a stream.
A stream is:
- controlled by authenticated users,
- identified by its unique values for each grouping dimension,
- a partial order of a set of events.
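To make the identity rule concrete, here is a minimal sketch of deriving a stream identity from its ordered dimension values. The hashing scheme is purely illustrative (a real implementation would likely identify a stream by the CID of its init event); the function name and use of `DefaultHasher` are assumptions for the example, not part of the proposal.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: derive a stream identity from its ordered dimensions.
/// A real implementation would identify the stream by its init event's CID.
fn stream_identity(dimensions: &[(String, Vec<u8>)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (key, value) in dimensions {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let a = stream_identity(&[("model".to_string(), b"some-model-id".to_vec())]);
    let b = stream_identity(&[("model".to_string(), b"some-model-id".to_vec())]);
    let c = stream_identity(&[("model".to_string(), b"other-model-id".to_vec())]);
    assert_eq!(a, b); // same dimension values => same stream
    assert_ne!(a, c); // different values => different streams
    println!("ok");
}
```

The key property shown is that two init events with identical dimension values denote the same stream, while any differing value yields a distinct stream.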
Using this definition of a stream we can build various different aggregation systems on top of the Ceramic protocol that vary in how data is aggregated, grouped and preserved.
The Specification
What follows is a detailed specification of how these mechanisms will work.
What is important here is that I am precise in what each of these mechanisms is and that it correctly enables the variability we need.
Isomorphic mechanisms can be discussed later.
Event Types
Simplified types for an Event in this proposed protocol.
```rust
/// Top level event type
enum Event {
    Time(TimeEvent), // No changes here
    Signed(SignedEvent),
    Unsigned(InitPayload),
}

/// A signed event, no changes here
enum SignedEvent {
    Init(InitPayload),
    Data(DataPayload),
}

/// An init payload.
/// Note there is no payload data in an init event.
/// Note sep, model, unique, shouldIndex etc. are absent.
/// With this proposal the expectation is that those fields are specific to how
/// MIDs function and therefore would be part of the dimensions.
/// More on this in a bit.
struct InitPayload {
    stream_type: StreamType, // Opaque stream type
    preservation_mode: PreservationMode,
    controllers: Vec<String>,
    /// An ordered set of key value pairs defining the uniqueness of a stream.
    dimensions: Vec<(String, Bytes)>,
}

/// A data event belongs to a stream defined by the `id` in a partial order defined by its `prev`.
struct DataPayload {
    id: Cid,        // CID of the init event
    prev: Vec<Cid>, // multi-prev; may even be empty if no order is needed
    data: Ipld,     // Arbitrary payload
}

/// Stream type enumeration; the variants are opaque to Ceramic.
type StreamType = u64;

/// How events are preserved.
/// I propose we only build the Always variant to start but can add more as needed.
/// A few hypothetical examples are given.
enum PreservationMode {
    Always,
    Last,               // Keep only the most recent event per stream. TODO: how do we determine most recent? Can the rules for _recent_ vary?
    Expiring(Duration), // Keep events until they are older than the duration
}
```
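As a usage sketch, the following self-contained example constructs an init payload and a data payload using simplified stand-ins for the types above (`Cid` and `Ipld` are stubbed with `String` to keep the example runnable; the DID and CID strings are illustrative placeholders, not real identifiers).

```rust
// Simplified stand-ins: `Cid` and `Ipld` are stubbed as strings here
// purely to keep the example self-contained.
type Cid = String;
type Ipld = String;
type StreamType = u64;

#[derive(Debug)]
enum PreservationMode {
    Always,
}

#[derive(Debug)]
struct InitPayload {
    stream_type: StreamType,
    preservation_mode: PreservationMode,
    controllers: Vec<String>,
    dimensions: Vec<(String, Vec<u8>)>,
}

#[derive(Debug)]
struct DataPayload {
    id: Cid,
    prev: Vec<Cid>,
    data: Ipld,
}

fn main() {
    // The init event carries identity only: controllers and dimensions, no payload data.
    let init = InitPayload {
        stream_type: 3, // opaque to Ceramic
        preservation_mode: PreservationMode::Always,
        controllers: vec!["did:example:alice".to_string()],
        dimensions: vec![("model".to_string(), b"example-model-id".to_vec())],
    };
    // The data event references the init event by CID and orders itself via `prev`.
    let data = DataPayload {
        id: "example-init-cid".to_string(),
        prev: vec!["example-init-cid".to_string()],
        data: r#"{"name":"alice"}"#.to_string(),
    };
    assert!(init.dimensions.iter().any(|(k, _)| k == "model"));
    assert_eq!(data.prev.len(), 1);
    println!("ok");
}
```

Note how the separation falls out of the types: everything identifying the stream lives in `InitPayload`, and everything mutable lives in `DataPayload`.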
These types are simplified only in that they are not backwards compatible.
In order to make them backwards compatible we need to define a mapping from old structures to new ones, which is possible (see below).
There are various techniques we can use, but the key is that we can distinguish between old and new events based on their structure (e.g. a `type` field, or a missing `data` field) and then map old to new.
This way only the parsing code needs to be aware of the old way of doing things and the rest of the system can operate using these new types.
This will likely be similar to the RawEvent → Event mapping that already exists.
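A sketch of such a mapping follows. The shape of `OldInitPayload` (field names `sep`, `model`, `data`) is assumed for illustration, as is reserving stream type `0` for the legacy MID case; only the overall pattern is the point: legacy header fields become dimensions, and only the parser sees the old shape.

```rust
// Hypothetical shape of today's init payload (field names assumed, not exact):
// it carries `sep`, `model`, and possibly inline payload data.
struct OldInitPayload {
    sep: String,       // names the separator dimension, e.g. "model"
    model: Vec<u8>,    // the model stream id
    data: Option<String>, // inline payload data, if any
}

struct NewInitPayload {
    stream_type: u64,
    controllers: Vec<String>,
    dimensions: Vec<(String, Vec<u8>)>,
}

/// Map an old init payload into the new shape: `sep`/`model` become a dimension.
/// Only parsing code needs this; the rest of the system sees the new type.
fn map_old_init(old: &OldInitPayload, controllers: Vec<String>) -> NewInitPayload {
    NewInitPayload {
        stream_type: 0, // assumption: reserve 0 for the legacy MID stream type
        controllers,
        dimensions: vec![(old.sep.clone(), old.model.clone())],
    }
}

fn main() {
    let old = OldInitPayload {
        sep: "model".to_string(),
        model: b"example-model-id".to_vec(),
        data: None,
    };
    let new = map_old_init(&old, vec!["did:example:alice".to_string()]);
    assert_eq!(new.dimensions[0].0, "model");
    assert_eq!(new.dimensions[0].1, b"example-model-id".to_vec());
    println!("ok");
}
```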
Stream Type
The stream type is an opaque enumeration represented as an integer.
Ceramic need only preserve the stream type information and expose it to consumers of the API.
However it may be beneficial for aggregation layers to define a few aspects associated with each stream type:
- The aggregation method used
- The schema of data that may exist in the payload of a data event
Ceramic itself will not use this information nor validate it, but it is valuable to aggregators.
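One way an aggregation layer could hold this metadata is a registry keyed by the opaque stream type integer. This is an aggregator-side sketch only: the struct fields, the string identifiers, and the stream type value `3` are all illustrative assumptions, and Ceramic itself would never consult such a registry.

```rust
use std::collections::HashMap;

type StreamType = u64;

/// What an aggregation layer might associate with each stream type.
/// Ceramic never consults this; it is aggregator-side metadata only.
struct StreamTypeInfo {
    aggregation_method: &'static str, // e.g. "json-patch", "crdt", "last-write-wins"
    payload_schema: &'static str,     // e.g. an identifier for the data event schema
}

fn main() {
    // Hypothetical registry an aggregator could maintain.
    let mut registry: HashMap<StreamType, StreamTypeInfo> = HashMap::new();
    registry.insert(
        3,
        StreamTypeInfo {
            aggregation_method: "json-patch",
            payload_schema: "mid-content-v1",
        },
    );
    // The aggregator looks up how to fold data events for a given stream type.
    let info = registry.get(&3).expect("known stream type");
    assert_eq!(info.aggregation_method, "json-patch");
    println!("ok");
}
```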
Dimensions
It will be helpful to expand on this concept of dimensions.
A stream is uniquely identified by the values of each of its dimensions.
Ceramic can leverage these dimensions in two important ways to expose behavior to applications.
- Applications can describe interests as a predicate over the dimensions.
- Applications can subscribe to a feed of events over a predicate of dimensions.
This gives applications control over how data can be shared/synchronized across the network and how aggregation layers can consume the data.
The concept of dimensions replaces the concepts of `sep` and `model` from ComposeDB.
ComposeDB could be implemented using the following dimensions:
- model - the stream id of the model
- should_index - boolean value
- unique - arbitrary bytes
- context - arbitrary bytes
However, reimplementing ComposeDB this way is not needed, as this proposal is backwards compatible.
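As an illustration of the list above, a helper could encode a MID's ComposeDB-era metadata as dimensions. The function name and byte encodings (a single `0`/`1` byte for the boolean) are assumptions made for the example.

```rust
// Illustrative encoding of a MID's ComposeDB-era metadata as dimensions.
// The single-byte boolean encoding is an assumption for this sketch.
fn composedb_dimensions(
    model_stream_id: &[u8],
    should_index: bool,
    unique: &[u8],
    context: &[u8],
) -> Vec<(String, Vec<u8>)> {
    vec![
        ("model".to_string(), model_stream_id.to_vec()),
        ("should_index".to_string(), vec![should_index as u8]),
        ("unique".to_string(), unique.to_vec()),
        ("context".to_string(), context.to_vec()),
    ]
}

fn main() {
    let dims = composedb_dimensions(b"example-model-id", true, b"nonce", b"ctx");
    assert_eq!(dims.len(), 4);
    assert_eq!(dims[1], ("should_index".to_string(), vec![1u8]));
    println!("ok");
}
```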
Predicates
Predicates over dimensions can be defined with the following operations:
- Byte equality accounting for variable length byte sequences
- Byte comparison greater than or less than
- Any logical operation over boolean values
For example `model >= xxx and model < yyy`, where `model` is one of the dimension keys.
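A minimal predicate language covering the three operations above can be sketched as follows. The `Predicate` enum and `eval` function are illustrative names, not a proposed API; note that Rust's slice comparison is already lexicographic over variable-length byte sequences, which matches the equality and ordering semantics listed.

```rust
/// A minimal predicate language over dimensions, matching the three
/// operations listed above: byte equality, byte ordering, and logic.
enum Predicate {
    Eq(String, Vec<u8>),
    Gte(String, Vec<u8>),
    Lt(String, Vec<u8>),
    And(Box<Predicate>, Box<Predicate>),
}

fn eval(p: &Predicate, dims: &[(String, Vec<u8>)]) -> bool {
    // Look up a dimension value by key.
    let get = |key: &str| dims.iter().find(|(k, _)| k == key).map(|(_, v)| v);
    match p {
        // Slice comparison is lexicographic and handles variable lengths.
        Predicate::Eq(k, v) => get(k) == Some(v),
        Predicate::Gte(k, v) => get(k).map_or(false, |d| d.as_slice() >= v.as_slice()),
        Predicate::Lt(k, v) => get(k).map_or(false, |d| d.as_slice() < v.as_slice()),
        Predicate::And(a, b) => eval(a, dims) && eval(b, dims),
    }
}

fn main() {
    // model >= xxx and model < yyy
    let p = Predicate::And(
        Box::new(Predicate::Gte("model".to_string(), b"xxx".to_vec())),
        Box::new(Predicate::Lt("model".to_string(), b"yyy".to_vec())),
    );
    let in_range = vec![("model".to_string(), b"xya".to_vec())];
    let out_of_range = vec![("model".to_string(), b"zzz".to_vec())];
    assert!(eval(&p, &in_range));
    assert!(!eval(&p, &out_of_range));
    println!("ok");
}
```

A range predicate like this is exactly what an application would hand to Ceramic to describe its interests or to scope a feed subscription.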
Feed API
Somewhat tangential to this proposal is the idea that we can have two kinds of feed APIs:
- Raw Feed
- Conclusion Feed
The raw feed is what we have today, in that it returns the events directly (in their car files) without any extra metadata.
The conclusion feed differs in that it returns the events with various conclusions about them.
Specifically, payloads from data events can be returned from the conclusion feed with a copy of their dimensions.
Other conclusions could be `before` and `after` markers that bound the time in which the data event was created.
Also note it is this conclusion feed that allows for seamless backwards compatibility because there is a natural transformation of the raw events into the conclusion events.
Therefore we can map both old and new events into the same conclusion events.
A possible conclusion event could be as follows.
```rust
/// Top level conclusion event type
enum ConclusionEvent {
    Init(ConclusionInit), // better names needed
    Data(ConclusionData),
    Time(TimeEvent),
}

struct ConclusionInit {
    stream_type: StreamType, // Opaque stream type
    preservation_mode: PreservationMode,
    controllers: Vec<String>,
    /// An ordered set of key value pairs defining the uniqueness of a stream.
    dimensions: Vec<(String, Bytes)>,
}

struct ConclusionData {
    init: ConclusionInit, // Copy of the init dimensions etc.
    before: CeramicTime,
    after: CeramicTime,
    data: Ipld, // Arbitrary payload
}
```
Also note that existing init events that do contain data can be mapped into a ConclusionInit event followed by a ConclusionData event.
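That one-to-two mapping for legacy data-bearing init events can be sketched as below. The types are simplified stand-ins for the conclusion types above (timestamps omitted, `Ipld` stubbed as `String`), and the function name is illustrative.

```rust
// Simplified stand-ins for the conclusion types above: the time markers are
// omitted and `Ipld` is stubbed as `String` to keep the example self-contained.
#[derive(Clone)]
struct ConclusionInit {
    stream_type: u64,
    controllers: Vec<String>,
    dimensions: Vec<(String, Vec<u8>)>,
}

struct ConclusionData {
    init: ConclusionInit,
    data: String,
}

enum ConclusionEvent {
    Init(ConclusionInit),
    Data(ConclusionData),
}

/// A legacy init event that carries inline data maps to two conclusion
/// events: the init itself, then a data event holding the inline payload.
fn map_legacy_init(init: ConclusionInit, inline_data: Option<String>) -> Vec<ConclusionEvent> {
    let mut out = vec![ConclusionEvent::Init(init.clone())];
    if let Some(data) = inline_data {
        out.push(ConclusionEvent::Data(ConclusionData { init, data }));
    }
    out
}

fn main() {
    let init = ConclusionInit {
        stream_type: 0,
        controllers: vec!["did:example:alice".to_string()],
        dimensions: vec![("model".to_string(), b"example-model-id".to_vec())],
    };
    // A data-bearing legacy init yields two conclusion events...
    assert_eq!(map_legacy_init(init.clone(), Some(r#"{"a":1}"#.to_string())).len(), 2);
    // ...while a payload-free init yields only one.
    assert_eq!(map_legacy_init(init, None).len(), 1);
    println!("ok");
}
```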
Summary
It seems possible to build Ceramic according to this proposal in a backwards compatible way, enabling generic stream types that account not only for different aggregation methods but also for variation in how streams are grouped and how stream data is preserved over time.
I am looking for feedback on this general design: specifically, whether there are other ways in which stream types can vary, and whether I have adequately addressed the ways we have already anticipated.