Protocol level stream forking

jthor · January 7, 2024, 4:42pm

One of the most beloved features of git is the ability to create a fork of a repository where the history of past contributors is maintained. This is great because it allows new project maintainers to easily build alternative products and still give credit to old contributors. It also allows a forked project to merge back changes from the original repository (this is common for forks of the Linux kernel, for example). Forks in Ceramic streams could similarly allow communities to create alternative versions of documents without asking the original author for permission, opening up new ways to collaborate.

Ceramic doesn’t currently have any protocol-level features to fork a stream; instead, this would need to be handled on the application layer. This approach has several drawbacks. When creating a fork, the application (and maybe even users) would need to keep track of and persist the stream that was forked. There is also no standard way of forking, which could result in several incompatible implementations.

What follows is a proposal for how to represent stream forks in the protocol.

Technical background

There are two main fields that give the protocol information about which stream is being interacted with. Both of these fields are required in Data Events, but not allowed in Init Events.

id - contains the CID of the InitEvent of the given stream
prev - contains the CID of the previous event in the given stream
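
The field rules above can be sketched as a small validity check. This is only an illustration: events are plain Python dicts, the CIDs are placeholder strings, and `is_valid_init_event` and `is_valid_data_event` are hypothetical names rather than part of any Ceramic library.

```python
def is_valid_init_event(event: dict) -> bool:
    # Init Events may carry neither `id` nor `prev`.
    return "id" not in event and "prev" not in event


def is_valid_data_event(event: dict) -> bool:
    # Data Events must carry both `id` and `prev`.
    return "id" in event and "prev" in event


# Placeholder events; "cid-init" stands in for the CID of the InitEvent.
init_event = {"data": {"title": "hello"}}
data_event = {"id": "cid-init", "prev": "cid-init", "data": {"title": "hi"}}
```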

Introducing Fork Events

A ForkEvent is simply an InitEvent (that is, an event that creates a new stream) which contains a prev field. This field must be set to the CID of an event in the stream that is being forked. This could be the most recent event, or a historical event (if the fork is based on some past state of the stream). Note that the ForkEvent does not contain an id field.

The streamId of the newly created forked stream is based on the CID of the ForkEvent. To some extent the ForkEvent can be seen as an InitEvent for the fork. Any DataEvent that is added to the fork would therefore use the CID of the ForkEvent as its id.
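
A rough sketch of how these pieces could fit together, under loud assumptions: `fake_cid` is a stand-in that hashes canonical JSON with SHA-256 (real CIDs are computed differently), events are plain dicts, and the ForkEvent is assumed to be allowed to carry a data payload like any other InitEvent. None of the names below come from an existing Ceramic API.

```python
import hashlib
import json


def fake_cid(event: dict) -> str:
    # Stand-in for a real CID: SHA-256 over the canonical JSON encoding.
    encoded = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_fork_event(event: dict) -> bool:
    # A ForkEvent creates a stream (no `id`) but has a `prev`
    # pointing at an event in the stream being forked.
    return "id" not in event and "prev" in event


# The original stream: an InitEvent followed by one DataEvent.
original_init = {"data": {"text": "v1"}}
original_update = {
    "id": fake_cid(original_init),
    "prev": fake_cid(original_init),
    "data": {"text": "v2"},
}

# Fork from the latest event of the original stream.
# Assumption: the ForkEvent may carry data, like an InitEvent.
fork_event = {"prev": fake_cid(original_update), "data": {"text": "v2-fork"}}

# DataEvents in the fork use the CID of the ForkEvent as their `id`.
fork_update = {
    "id": fake_cid(fork_event),
    "prev": fake_cid(fork_event),
    "data": {"text": "v3-fork"},
}
```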

Merging streams

If one or more forks of a stream have been created, the original author of a stream might be interested in merging the changes from one of the forks back into the original stream. This could be achieved simply by including the CID of an event from the fork in the prev field of a new DataEvent. Currently this approach would be problematic because it’s only possible to reference a single previous event. However, this is being addressed by CIP-145.

DataEvents currently always need to include some data. It could make sense to also introduce a separate MergeEvent that is the same as a DataEvent, but without a data field. However, retaining the ability to merge streams with a DataEvent probably also makes sense.
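
To make the idea concrete, here is a minimal sketch of a data-less MergeEvent. It assumes that prev can hold a list of CIDs; that shape is an illustration only and is not taken from CIP-145. `make_merge_event` and `is_merge_event` are hypothetical helpers, and the CIDs are placeholder strings.

```python
def make_merge_event(stream_id: str, own_tip: str, fork_tip: str) -> dict:
    # Assumption: `prev` is a list so it can reference more than one event.
    # The proposed MergeEvent has no `data` field.
    return {"id": stream_id, "prev": [own_tip, fork_tip]}


def is_merge_event(event: dict) -> bool:
    # Recognizes only the data-less variant; a DataEvent with several
    # `prev` entries would also merge, but still carries `data`.
    prev = event.get("prev")
    return (
        "id" in event
        and isinstance(prev, list)
        and len(prev) > 1
        and "data" not in event
    )


merge = make_merge_event("cid-init", "cid-original-tip", "cid-fork-tip")
```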

Synchronizing forks

As with git, synchronizing a forked stream should not stop at the InitEvent of the fork, i.e. the ForkEvent. Rather, it should synchronize the full history back to the InitEvent of the stream that was initially forked. This means that a node that only cares about a specific fork also maintains the history of the stream that was forked, up to the point at which the fork happened.
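
As a sketch of what "full history" means here, the walk below follows prev links from the tip of a fork until it reaches an event with no prev, i.e. the InitEvent of the originally forked stream. It assumes a single prev per event and uses an in-memory dict keyed by placeholder CIDs; `history` is a hypothetical helper.

```python
def history(tip: str, events: dict) -> list:
    # Follow `prev` links from `tip` until an event without `prev`
    # (the original InitEvent) is reached.
    chain = []
    cid = tip
    while cid is not None:
        chain.append(cid)
        cid = events[cid].get("prev")
    return chain


events = {
    "init": {},                           # InitEvent of the original stream
    "a": {"id": "init", "prev": "init"},  # DataEvent in the original stream
    "b": {"id": "init", "prev": "a"},     # added after the fork point
    "fork": {"prev": "a"},                # ForkEvent based on "a"
    "f1": {"id": "fork", "prev": "fork"}, # DataEvent in the fork
}

fork_history = history("f1", events)
```

Note that "b" is not part of the result: the fork only needs the original stream's history up to the point where the fork happened.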

m0ar · January 8, 2024, 9:43am

Thanks for this suggestion! It’s a primitive that would be very valuable for us, as we have requirements for forking and merging history. I have been pondering how to approximate this, but reached the same concerns around persistence and compatibility.

The killer feature of this is that with a single PID, one can trustlessly resolve history back through forks into dependent work. This would be very hard for us to achieve with a special-purpose implementation.

jthor · January 8, 2024, 10:24am

Yes, it is definitely the intention that forks should enable this. Updating the first post.

spencer · January 8, 2024, 9:54pm

Makes sense and does seem pretty useful. I think the tricky part will be to figure out how this plays with Recon, e.g. if a node was configured to sync a range that contained the fork but not the original stream.

AaronDGoldman · January 9, 2024, 5:31pm

If you pulled a stream that had an event whose prev you don’t sync, you could still send requests for those events to the node that told you about the event you are subscribed to. If you are syncing a model, then you are subscribed to both, but sharding the model by controller across multiple nodes could certainly cause your node to subscribe to one but not the other.
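
A toy sketch of that fallback, with both stores modeled as in-memory dicts keyed by placeholder CIDs and a single prev per event. `fetch_missing_history` is a hypothetical helper, and the dict lookup on `peer` stands in for a network request to the node that announced the event; nothing here reflects how Recon actually works.

```python
def fetch_missing_history(tip: str, local: dict, peer: dict) -> list:
    # Follow `prev` links from `tip`, copying any event we lack from `peer`.
    fetched = []
    cid = tip
    while cid is not None:
        if cid not in local:
            local[cid] = peer[cid]  # stands in for a request to the peer
            fetched.append(cid)
        cid = local[cid].get("prev")
    return fetched


peer = {
    "init": {},
    "a": {"id": "init", "prev": "init"},
    "fork": {"prev": "a"},
    "f1": {"id": "fork", "prev": "fork"},
}
# The local node synced only the fork, not the original stream.
local = {"fork": peer["fork"], "f1": peer["f1"]}

fetched = fetch_missing_history("f1", local, peer)
```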