Context
The Ceramic core protocol is used to distribute and validate events in event streams, but the protocol layer has no knowledge of the data within those events and imposes no rules on its structure. Currently almost all streams in Ceramic belong to either the Model or ModelInstanceDocument (MID) StreamType (Streamtypes | Ceramic Improvement Proposals), but the concept of StreamType operates at a higher level than the core of the Ceramic protocol. We’ve increasingly been thinking of this as the “aggregation” layer, which exists above the protocol/event streaming layer but below the indexing and querying layer.
Both Models and MIDs use event payloads containing json-patch objects that describe transformations to JSON documents. MIDs also perform schema validation using the jsonSchema specified in their corresponding Model. We’d like to generalize these ideas and enable a more diverse ecosystem of aggregators beyond the two we have today. For example, one possible future aggregator might look similar to MIDs in that it also operates on JSON data and also uses Models for schema validation, but its event payloads might be CRDT operations rather than json-patch. This would enable more collaborative multi-writer documents. Another aggregator might use JSON documents but make each event payload the full current version of the document, rather than a transformation over previous state. Yet another aggregator might not use JSON documents at all, but instead something like Protocol Buffers.
Problem Statement
The problem is that an aggregation library configured with some number of different aggregators would need a way to dispatch the incoming events it receives from the ceramic-one event feed and route each one to the proper aggregator. In other words, it needs to know that this event is for a ModelInstanceDocument and should go to the ModelInstanceDocument aggregator, while this other event is for a CRDTDocument and needs to go to its aggregator instead.
There are also interesting intersections between the different formats that data in event payloads can take and the way that event streams are partitioned, discovered, and synchronized at the core protocol layer. Today this partitioning behavior is controlled via the separator key in event headers, which is always the “model” that the stream belongs to.
We don’t want the core protocol to have to know anything about the available aggregators, but we also want to make sure that it is possible to implement aggregators in a simple and straightforward way, and to avoid issues where two different aggregators cannot both co-exist in the same application without interfering with each other.
The goal of this post is to start a discussion to help us decide on what the set of core primitives are that the core protocol must expose in order to enable aggregators to do their job effectively.
Initial Proposal
Add a new field to the header of Ceramic init and data events, customMetadata, that contains an IPLD object. Users of the Ceramic protocol can write whatever they want into customMetadata. Protocol implementations will not do anything with the data inside that object other than pass it back to the consumer. Any stream that uses the standard aggregation library and any of the generally-accepted aggregators (Model and ModelInstanceDocument to start) will reserve a streamType (or maybe aggregatorType or aggregatorId?) field in the customMetadata. The streamType field can contain an integer corresponding to an entry in the streamtypes table: CIPs/tables/streamtypes.csv at main · ceramicnetwork/CIPs · GitHub. The value of streamType can be used by the aggregation library when applying events to look up the appropriate aggregator to send each event to. When registering an aggregator with the aggregation library, the aggregator must declare which streamtype it corresponds to.
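To make the routing concrete, here is a minimal TypeScript sketch of how an aggregation library might use the proposed field. All names here (StreamEvent, AggregationLibrary, dispatch, etc.) are hypothetical illustrations, not actual ceramic-one APIs:

```typescript
// The proposed header shape: the protocol treats customMetadata as an
// opaque IPLD object and simply hands it back to the consumer.
interface EventHeader {
  model: Uint8Array; // separator key: the Model the stream belongs to
  customMetadata?: {
    streamType?: number; // reserved by the aggregation layer
    [key: string]: unknown; // users may write anything else here
  };
}

interface StreamEvent {
  header: EventHeader;
  payload: unknown; // json-patch, CRDT op, full document, etc.
}

type Aggregator = (event: StreamEvent) => void;

// The aggregation library only needs streamType to route each event to
// the aggregator registered for it.
class AggregationLibrary {
  private aggregators = new Map<number, Aggregator>();

  register(streamType: number, aggregator: Aggregator): void {
    this.aggregators.set(streamType, aggregator);
  }

  dispatch(event: StreamEvent): void {
    const streamType = event.header.customMetadata?.streamType;
    if (streamType === undefined) {
      throw new Error("event carries no streamType; cannot route it");
    }
    const aggregator = this.aggregators.get(streamType);
    if (aggregator === undefined) {
      throw new Error(`no aggregator registered for streamType ${streamType}`);
    }
    aggregator(event);
  }
}
```

The core protocol never reads customMetadata; only the aggregation layer interprets the reserved streamType field.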
There is still a question around how the registry of streamtypes/aggregators is maintained, and especially how to prevent conflicts like two aggregators both choosing the same streamtype id. For this I propose we follow a similar technique to how multicodec works, where reserving a new streamtype id number is cheap and easy, but does need to be done in a centralized place to prevent conflicts: GitHub - multiformats/multicodec: Compact self-describing codecs. Save space by using predefined multicodec tables.
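The multicodec-style central table is the real safeguard against id collisions, but within a single application the library can also fail fast on local conflicts at registration time. A hypothetical sketch (AggregatorRegistry and its methods are invented names):

```typescript
// Guard against two aggregators claiming the same streamtype id in one
// application. The central streamtypes table prevents ecosystem-wide
// conflicts; this just surfaces local misconfiguration early.
class AggregatorRegistry<A> {
  private byId = new Map<number, A>();

  reserve(streamType: number, aggregator: A): void {
    if (this.byId.has(streamType)) {
      throw new Error(
        `streamType ${streamType} is already reserved; ` +
          `pick an unused id from the central streamtypes table`
      );
    }
    this.byId.set(streamType, aggregator);
  }

  lookup(streamType: number): A | undefined {
    return this.byId.get(streamType);
  }
}
```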
Alternate possibilities
In the proposal laid out above, streamtype/aggregator implementations define multiple different behaviors:
- The format of the data payload itself. For example: the payload is a json-patch diff to the existing JSON document state, the payload is the entire new JSON document, the payload is a CRDT operation describing a transformation to a JSON document, the payload is a binary file like an image, etc.
- The schema validation behavior. For example the ModelInstanceDocument aggregator today will load the corresponding Model by StreamID from the ModelInstanceDocument metadata, extract the jsonSchema from the Model content, and then validate the updated ModelInstanceDocument content against the Model’s jsonSchema
- Conflict resolution rules. For example with ModelInstanceDocument today if there is a fork in the stream’s history the branch with the earlier anchor timestamp wins. In the future a CRDT-based streamtype might have logic to merge both branches of history instead of keeping one and pruning the other.
- Possibly other aggregator-specific logic that we haven’t thought of yet. For example I could imagine a streamtype that has events containing notifications about changes to state on some blockchain, and maybe the streamtype logic would validate the data in the events against an RPC endpoint for the corresponding blockchain before considering the event valid, and would throw the event out if its data doesn’t match what can be observed on the blockchain.
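The behaviors above could be captured in a single aggregator interface. A minimal TypeScript sketch, with invented names and signatures (none of this is an existing Ceramic API):

```typescript
// Illustrative only: one interface covering the behaviors listed above.
interface DocumentState {
  content: unknown;
}

interface Aggregator {
  // Payload format: apply one event payload (json-patch diff, CRDT op,
  // full replacement document, ...) to the existing state.
  applyPayload(state: DocumentState | null, payload: unknown): DocumentState;

  // Schema validation: e.g. check the new state against the jsonSchema of
  // the corresponding Model.
  validate(state: DocumentState): Promise<void>;

  // Conflict resolution: keep one branch of a forked history
  // (earliest-anchor-wins, as MIDs do today) or merge both (CRDT-style).
  resolveConflict(branchA: DocumentState, branchB: DocumentState): DocumentState;
}

// A trivial example: an aggregator whose payload is always the complete
// new document, and which keeps the first branch on conflict.
const fullDocumentAggregator: Aggregator = {
  applyPayload: (_state, payload) => ({ content: payload }),
  validate: async () => {
    /* no schema validation */
  },
  resolveConflict: (branchA, _branchB) => branchA,
};
```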
All of these different behaviors could potentially be decoupled. For instance, instead of a single streamType field in the customMetadata for events, we could have separate payloadType, schemaValidationMode, and conflictResolutionMode fields. This might indeed be more flexible and powerful and is worth serious consideration. It does, however, increase the complexity a bit, and is further away from how things work today. For now I went with proposing a single streamType field in the customMetadata, as it seems like the smallest and simplest change to enable writing an aggregation library that can support being configured with different aggregators with differing logic, by giving that library just enough information to route events to aggregators. The aggregator implementations can then have whatever logic they want and enforce whatever rules they want on the shape and structure of the event data they are aggregating.
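For comparison, the two metadata shapes might look like the following. The field values are invented straw-man examples, not a defined vocabulary:

```typescript
// The single-field proposal: one integer from the central streamtypes table.
type SingleFieldMetadata = {
  streamType: number;
};

// The decoupled alternative: each behavior selected independently.
type DecoupledMetadata = {
  payloadType: string; // e.g. "json-patch", "json-full", "crdt-op"
  schemaValidationMode: string; // e.g. "model-json-schema", "none"
  conflictResolutionMode: string; // e.g. "earliest-anchor-wins", "crdt-merge"
};

// What a ModelInstanceDocument-like stream might declare under the
// decoupled scheme.
const midLikeMetadata: DecoupledMetadata = {
  payloadType: "json-patch",
  schemaValidationMode: "model-json-schema",
  conflictResolutionMode: "earliest-anchor-wins",
};
```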