Data deletion and privacy on Ceramic

While Ceramic serves as an excellent infrastructure for public data, there seems to be a strong desire to also use it for more sensitive data. Two main topics usually arise in conversations about sensitive data: the ability to delete data, and the ability to let only certain actors read it. This forum post is an attempt at describing possible future directions for the Ceramic protocol when it comes to storing sensitive data.

Data deletion

One of the core principles of Ceramic event streams is that they should be independently verifiable without trusting any third party. This makes deletion of data hard, because removing an event from an event log would break the integrity of that log. There is, however, a small modification to the event log data structure that could fix this problem. Currently the data structure of an event log looks as depicted in Figure 1: the payload sits inside the same logical object as the Data Event.

Figure 1: the current data structure of an event log. Here an arrow represents a hash pointer, e.g. Data Events to the right contain the hash of the previous Data Event.

It would be fairly easy to separate the payload from the Data Event envelope by having the envelope contain only the hash (e.g. the CID) of the payload.


Figure 2: separating payloads from the Data Event envelope. Now the envelope only contains the hash of the payload.
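
To make the separation concrete, here is a minimal TypeScript sketch of the two-object layout. The field names are hypothetical, for illustration only, and do not reflect the actual Ceramic wire format.

```ts
import type { CID } from 'multiformats/cid'

// Hypothetical shape of a separated Data Event (Figure 2).
interface DataEventEnvelope {
  prev: CID             // hash pointer to the previous event in the log
  payload: CID          // hash pointer to the payload, stored as a separate object
  signature: Uint8Array // controller's signature over the envelope
}

// The payload lives in its own content-addressed object, so it can be
// deleted later without breaking the prev-pointer chain of envelopes.
type Payload = Record<string, unknown>
```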

This simple change would allow payload data to be deleted while keeping the integrity of the event log intact. We could introduce a new type of event called a “Tombstone Event” (see Figure 3) that specifies which payloads should be deleted.


Figure 3: a tombstone event that specifies which payloads should be deleted.

A more drastic type of Tombstone Event that deletes all events in a stream could also be added. It would include only the Init Event hash pointer and signify that all other events for the given stream should be deleted.


Figure 4: A tombstone event that revokes all data in a stream.
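
Sketching both tombstone variants in the same hypothetical TypeScript shapes as above (again, field names are illustrative, not a spec):

```ts
import type { CID } from 'multiformats/cid'

// Figure 3: lists specific payload CIDs that nodes should delete,
// while the envelopes that reference them stay in the log.
interface PayloadTombstoneEvent {
  prev: CID
  deletePayloads: CID[]
  signature: Uint8Array
}

// Figure 4: points only at the Init Event and signals that every
// other event in the stream should be deleted.
interface StreamTombstoneEvent {
  init: CID
  signature: Uint8Array
}

type TombstoneEvent = PayloadTombstoneEvent | StreamTombstoneEvent
```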

Content encryption

So far we’ve only talked about public data and the deletion of this data. Another step toward making Ceramic more private is to encrypt the payload of events. Some applications built on Ceramic already do this; however, separating the payload as described above gives developers more flexibility.


Figure 5: Encrypted payloads

Worth noting is that different methods of encryption provide different levels of security. For example, encrypting with a symmetric key and a strong cipher can provide security even against an attacker with a quantum computer, but with the drawback that key exchange between multiple parties is difficult.

Using asymmetric cryptography makes it easier to share keys between multiple parties, but most asymmetric cryptography used in production today will break with the introduction of quantum computers, so it is not a good alternative if quantum attackers are part of your threat model.
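
As a concrete illustration of the symmetric case, here is a minimal sketch using Node’s built-in crypto module with AES-256-GCM. As noted above, the hard part is not this code but securely distributing the key to every authorized reader.

```ts
import { randomBytes, createCipheriv, createDecipheriv } from 'node:crypto'

// 256-bit symmetric keys are generally considered to hold up even against
// quantum attackers, since Grover's algorithm only halves the effective
// key length.
const key = randomBytes(32)

function encryptPayload(plaintext: Uint8Array, key: Buffer) {
  const iv = randomBytes(12) // must be unique per encryption
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()])
  return { iv, ciphertext, tag: cipher.getAuthTag() }
}

function decryptPayload(
  enc: { iv: Buffer; ciphertext: Buffer; tag: Buffer },
  key: Buffer,
): Buffer {
  const decipher = createDecipheriv('aes-256-gcm', key, enc.iv)
  decipher.setAuthTag(enc.tag) // decryption fails if the data was tampered with
  return Buffer.concat([decipher.update(enc.ciphertext), decipher.final()])
}
```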

Access control

Currently, all data in Ceramic is publicly accessible. There are some great advantages to this, namely verifiability and decentralization. However, in some cases you might not want to expose this data to the public. It could be possible to introduce a change to the Ceramic protocol in the future that allows a stream creator to specify that only certain DIDs are allowed to read the data. This could be coupled with the separated payload data as outlined above to retain public verifiability of metadata while keeping payload data private. Note, however, that the stream creator would have to specifically select a node which they trust to store this data and to give access only to authorized DIDs.
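
A rough sketch of what such a read policy could look like, with enforcement living on the trusted node. The shape is hypothetical; nothing like this exists in the protocol today.

```ts
// Hypothetical read policy attached to a stream by its creator.
interface StreamReadPolicy {
  controller: string // e.g. 'did:key:z6Mk...'
  readers: string[]  // DIDs allowed to fetch the payload objects
}

// The trusted node would run a check like this before serving a payload;
// the envelopes (metadata) remain public and verifiable regardless.
function canReadPayload(policy: StreamReadPolicy, requester: string): boolean {
  return requester === policy.controller || policy.readers.includes(requester)
}
```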

6 Likes

How would this proposal affect the performance and scalability of the underlying network infrastructure, specifically IPFS? It seems that by separating each event into two objects, we would be doubling the load and the number of calls to IPFS. If this is true, will it be a problem?

2 Likes

After Recon has been implemented we can eventually move away from using IPFS bitswap and just synchronize the event data directly.

The only real overhead this adds is 34 bytes for the CID and the work to validate this CID, but this is constant and shouldn’t be much of an issue.
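
For reference, the validation work amounts to re-hashing the payload bytes and comparing against the envelope’s hash pointer. A sketch using the multiformats and @ipld/dag-cbor packages, assuming the payload is dag-cbor encoded:

```ts
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import * as dagCbor from '@ipld/dag-cbor'

// Recompute the payload's CID and compare it to the hash pointer stored
// in the envelope; this is the constant per-event overhead mentioned above.
async function validatePayload(expected: CID, payloadBytes: Uint8Array): Promise<boolean> {
  const digest = await sha256.digest(payloadBytes)
  const computed = CID.create(1, dagCbor.code, digest)
  return computed.equals(expected)
}
```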

4 Likes

+1 I would love to throw in some measurements regarding IPFS performance, as we have been struggling with this recently.
https://blog.cloudflare.com/ipfs-measurements/

1 Like

Hi Joel. Could you elaborate on “move away from using IPFS bitswap and just synchronize the event data directly”? Does that mean Ceramic would rely more on p2p sync among Ceramic nodes rather than fetching from the IPFS network?

Love this. Privacy would be huge!

1 Like

Yes, Recon is being implemented as a separate libp2p sub-protocol for exchanging eventIds. It would be a fairly straightforward change to add data sync on top of this.

1 Like

I see. Does this mean that IPFS would act more like a data persistence layer, so that it won’t become a performance bottleneck?

I think so. Ultimately Ceramic uses the IPLD data model; in theory, data could be exported to IPFS, Filecoin, or other networks as well.

2 Likes

Super interesting @jthor! We at denoted would really benefit from having ACL and deletion built-in. It is something our on-chain knowledge management editor needs!

We have already solved encryption with Lit protocol to keep private content private, but as you point out this could be more flexible. We would welcome a native solution, as I think this could help bring even more devs on board since you get more batteries included. The only caveat is how this would play along with ACL. We are facing the challenge of building ACL with both private & public content and have considered using an MPC wallet powered by PKP. We could make the MPC wallet, which is itself jointly owned, the owner of both the stream and the symmetric encryption key. I’m interested to hear what solutions you have been thinking about. Are there any other good alternatives in your opinion?

As we are using ComposeDB on top of Ceramic, we have to simulate deletions by updating the document with an empty payload and setting a flag that indicates its deletion. This is a rather naive solution, and we are aware that the initial data is persisted immutably in the stream (event log). So being able to use Tombstone Events to mark payloads for deletion would be great.
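
For context, the soft delete described above could look roughly like this with the ComposeDB client. The Note model, its fields, and the generated updateNote mutation are our own illustration, not a standard:

```ts
import { ComposeClient } from '@composedb/client'

// Hypothetical soft delete: blank the content and raise a deleted flag.
// The old content remains in the stream's event log, which is exactly
// the limitation tombstone events would address.
async function softDelete(compose: ComposeClient, id: string) {
  return compose.executeQuery(
    `mutation SoftDelete($input: UpdateNoteInput!) {
       updateNote(input: $input) { document { id } }
     }`,
    { input: { id, content: { content: null, deleted: true } } },
  )
}
```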

Would all of this functionality be available in ComposeDB as well?

CC: @erci.eth

1 Like

That’s an interesting change, because my impression was that the whole append-only log with user-signed hash links was a data structure used to work around the limitation of data being immutable on IPFS when using content addressing. Does this mean streams would be used for more general event streams rather than for modeling a single mutable data record?


I think for now you are on the right path in terms of adding an encryption layer using Lit. There’s also https://threshold.network/, which might be coming out with a threshold encryption network soon and might be worth considering.

Updating a doc in ComposeDB to null is ok if you are fine with the history of that document remaining available, but a tombstone event is definitely preferable in some cases.

If/when there’s an effort to implement tombstone events and some of the other things I mentioned in the OP, it could definitely be added to CDB as well.

Regarding using an ACL, we’ve been leaning more heavily towards object-capabilities (OCAPs) since they are better suited for distributed systems.
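
For readers unfamiliar with the distinction: instead of the resource keeping a list of who may access it (an ACL), an object-capability is a signed, delegatable grant that the holder presents along with a request. A rough sketch of such a grant (the shape is illustrative, not Ceramic’s actual format):

```ts
// Illustrative object-capability: a signed grant the holder can present
// (and further delegate) without the resource maintaining an ACL.
interface Capability {
  issuer: string        // DID granting the right
  audience: string      // DID the right is delegated to
  resource: string      // e.g. a stream ID
  actions: string[]     // e.g. ['read']
  expiry?: number       // unix timestamp
  signature: Uint8Array // issuer's signature over the fields above
}
```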

1 Like

Yeah, I think it would be great if you could do whatever you want with an event stream. It could represent one document as in CDB or multiple documents.

If multiple nodes are storing or indexing the data, how should we coordinate the deletion?

So the tombstone event would be created and published by the user and then synced across multiple nodes the same way as any other event.

1 Like

Are we worried about the abuse potential, where bad actors could try to manipulate data history? A good use case for this would be medical records, where privacy is crucial but a bad actor should not have the ability to alter medical history. Another example would be a reputation use case, where actors could erase negative marks on their reputation by altering history.

For critical data applications where history is important, application developers using Ceramic might need additional layers of verification. Which criteria in the Ceramic ecosystem would be used to select the trusted nodes that store private data?

Should we consider balancing a controller’s right to manage their data against the Ceramic ecosystem’s need for reliability and transparency?