How to extend a schema or migrate data to a new model?

Hi, I would like to know if there is any way to add a field to a model. I learned from the documentation that models are immutable.

I found that some of the models have multiple commits (e.g. https://cerscan.com/mainnet/stream/kjzl6cwe1jw14936q0quh7drz7a97gw8yw3aoiflwmgsdlf4prnokwywfhhadfn). How do I commit an update to a model?

If it’s impossible to extend a model, how do I migrate the data from an existing model to a new model? If I duplicate the data from the existing model into a new model, the author will be the admin DID instead of the original user, right?

What is the best practice to deal with this situation? Thanks in advance!

Congrats, you’ve hit on one of the most challenging problems surrounding composable data!

Background

On the one hand, developers building on traditional web2 databases are used to being able to rapidly iterate on their database schema design, letting their data structures evolve as their app evolves.
On the other hand, if a developer is building on a schema created by someone else, they need to know that that schema won’t be updated out from under them and break their application unexpectedly.

Models in Ceramic fall into an interesting grey zone somewhere in between a traditional web2 database schema and something more like a data standard. In some ways they can even be thought of as similar to a smart contract on a traditional blockchain (a data contract?): they are deployed once by one developer, are immutable from that point on, and yet anyone can use them and compose them with other Models.

The problem of immutable data Models is made even more challenging by the fact that data in Ceramic is generally written with a key that the end user controls and that the application does not have access to. This is important for providing the “user-controlled” data experience that Ceramic promises, but it creates a challenge: developers are not able to update their users’ data, which means they cannot run an automatic schema migration across all of their users’ data as a one-off job the way web2 developers can.

So Models are less flexible than a web2 database schema because they can be shared between multiple applications (which requires schemas to be immutable), and because users control their data and are the only ones who can update it. These restrictions mean developers should always test out their schemas on the Clay Testnet first and only deploy them to Mainnet once they are reasonably confident the schemas are stable (similar to how a smart contract is developed).

Possible Solutions

All of that said, even with plenty of forethought and up-front testing, there will still be times when developers need to evolve their schemas after they are already deployed on Mainnet and have user data associated with them. So what should a developer do in that case?

There are two main approaches here, each with their own tradeoffs.

To make these approaches concrete, let’s use the example of a UserProfile Model. Let’s say the initial version has the following fields:

type UserProfile {
  name: string
  age: number
  picture: Byte[]
}

Now let’s say the dev decides they want to add an emailAddress field to profiles. There are two options:

Option 1: An Extension Model

This would mean creating a new Model containing the new fields you want to add. In this example it could be a UserProfileEmail Model:

type UserProfileEmail {
  email: string
}

Now whenever a user creates or updates their profile, their name, age, and picture get written to the UserProfile Model while their email gets written to the UserProfileEmail Model. When reading a profile, the app would need to query both Models and combine the results to build up a unified profile object. When writing to a profile, the application would need logic to understand which fields get written to which Model.
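For illustration, here is a minimal TypeScript sketch of that split/combine logic. The writeUserProfile, writeUserProfileEmail, loadUserProfile, and loadUserProfileEmail helpers are hypothetical placeholders for whatever your actual Ceramic/ComposeDB client calls look like; the only point here is the field-splitting on write and the merging on read.

// Shapes matching the two Models above (picture typed as raw bytes).
interface UserProfile {
  name: string;
  age: number;
  picture: Uint8Array;
}

interface UserProfileEmail {
  email: string;
}

// The unified object the rest of the app works with.
type FullProfile = UserProfile & UserProfileEmail;

// Hypothetical persistence helpers -- replace with your real client calls.
declare function writeUserProfile(p: UserProfile): Promise<void>;
declare function writeUserProfileEmail(p: UserProfileEmail): Promise<void>;
declare function loadUserProfile(did: string): Promise<UserProfile | null>;
declare function loadUserProfileEmail(did: string): Promise<UserProfileEmail | null>;

// Split the unified profile into its component Models and write each one.
async function saveFullProfile(profile: FullProfile): Promise<void> {
  const { email, ...base } = profile;
  await writeUserProfile(base);           // name, age, picture
  await writeUserProfileEmail({ email }); // email only
}

// Read both Models and recombine them into one object for the app.
async function loadFullProfile(did: string): Promise<FullProfile | null> {
  const base = await loadUserProfile(did);
  if (base === null) return null;
  const extension = await loadUserProfileEmail(did);
  return { ...base, email: extension?.email ?? "" };
}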

Pros

  • Other existing apps that are using `UserProfile` can continue to do so without issue. If they don’t care about email, they don’t need to sync the `UserProfileEmail` Model, and all data stays consistent between apps.

Cons

  • Developer complexity: The app needs to understand that not all the fields in a user’s profile live in the same Model. It needs code to split a profile into its component Models, perform the necessary writes to each Model, handle any errors that come up, recombine a profile from the component Models at read time, etc.

  • Lack of atomicity: If multiple Models make up one conceptual object, and some of the fields in that object get updated together, the updates to the underlying Models are independent and it’s possible for one of them to succeed while the other fails. There is no transaction system to atomically update multiple Models in an all-or-nothing fashion. Depending on an application’s consistency requirements, this can cause real data consistency issues; the sketch after this list shows what such a partial failure looks like.
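To make the atomicity concern concrete, here is a hedged sketch reusing the hypothetical types and helpers from the earlier example. The first write may land while the second one throws, leaving the two Models temporarily inconsistent; the recovery strategy (retry queue, reconcile on next read, etc.) is left to the application.

// There is no cross-Model transaction, so a combined save can half-succeed.
// Reuses the hypothetical FullProfile type and write helpers sketched above.
async function saveFullProfileBestEffort(profile: FullProfile): Promise<void> {
  await writeUserProfile({
    name: profile.name,
    age: profile.age,
    picture: profile.picture,
  });
  try {
    await writeUserProfileEmail({ email: profile.email });
  } catch (err) {
    // The base profile was already written, so the email Model now lags behind.
    // The app needs its own recovery strategy, e.g. retry later or reconcile
    // the two Models the next time the profile is loaded.
    console.warn("UserProfileEmail write failed; profile is partially updated", err);
  }
}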

Option 2: A Replacement Model

This would mean creating a new Model that replaces the old one and adds whatever changes are desired. For the example we’ve been using, this could be a UserProfileWithEmail Model that supersedes the previous UserProfile Model:

type UserProfileWithEmail {
  name: string
  age: number
  picture: Byte[]
  email: string
}

The updated app would always create new profiles using the new Model. When loading existing profiles, it would first check for a UserProfileWithEmail instance and, if found, just use that. If it doesn’t find an instance of the new Model for this user, it could fall back on trying to load an old UserProfile Model. If one is found, it could then port the data over to the new format and write out a new UserProfileWithEmail instance. Note that this migration would need to happen as part of an active session with the user logged in, so that the new profile can be written with the user’s key; there’s no way to mass-migrate all existing profiles, since the application backend doesn’t have access to the keys that end users use to write their profiles. A sketch of this fallback-and-migrate flow follows below.
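Here is a minimal TypeScript sketch of that flow. The loader and writer helpers are again hypothetical placeholders for your real client calls, and loadProfile is assumed to run inside an authenticated user session so the migrated document is signed with the user’s own key.

// Old and new Model shapes, matching the schemas above.
interface UserProfileV1 {
  name: string;
  age: number;
  picture: Uint8Array;
}

interface UserProfileWithEmail extends UserProfileV1 {
  email: string;
}

// Hypothetical client helpers -- substitute your real ComposeDB/Ceramic calls.
declare function loadUserProfileWithEmail(did: string): Promise<UserProfileWithEmail | null>;
declare function loadLegacyUserProfile(did: string): Promise<UserProfileV1 | null>;
declare function writeUserProfileWithEmail(p: UserProfileWithEmail): Promise<void>;

// Load a profile, preferring the new Model and lazily migrating old data.
// Must run while the user is logged in: only they can sign the new document.
async function loadProfile(did: string): Promise<UserProfileWithEmail | null> {
  // 1. Prefer the new Model if this user already has an instance of it.
  const current = await loadUserProfileWithEmail(did);
  if (current !== null) return current;

  // 2. Fall back to the old Model.
  const legacy = await loadLegacyUserProfile(did);
  if (legacy === null) return null;

  // 3. Port the old data into the new shape and write it out as the new Model.
  const migrated: UserProfileWithEmail = { ...legacy, email: "" };
  await writeUserProfileWithEmail(migrated);
  return migrated;
}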

Pros

  • Simpler for Models that are only used in a single application. If a single application is creating, updating, and using these Models, the application code is likely simpler with this approach: the logic to fall back, load an old version of the profile, and upgrade it to the new Model structure can be fairly self-contained at the edge where the profile is loaded, and the rest of the application can always assume it has access to the newest Model version.
  • Maintains atomicity of updates for all fields. By keeping a single Model containing all the desired fields in one place, all the fields in the object can be updated at once in a transactional manner where either the full update is visible or none of it is.

Cons

  • Fragmentation of data across applications. If two apps are both using the UserProfile Model, and then one of them updates to the UserProfileWithEmail Model, they are no longer sharing data with each other. Coordinating multiple independent applications across the ecosystem to all upgrade to the new Model might be a very difficult problem.
  • Poor end user experience. As a result of the fragmentation mentioned in the above bullet, end user experience might degrade. At the beginning when both apps are using the same UserProfile Model, the end user has the benefit of only having to create their profile once and having it show up in both applications. If the user later updates the profile via one app, the update will be visible in both apps. Once one of the applications updates to the UserProfileWithEmail Model, however, now the two apps have two totally independent profiles, and updates made via one app will not be visible to the other any more. It likely won’t be clear to the end user why the profiles that previously were kept in sync no longer stay in sync.

Conclusion

Unfortunately there are no easy answers to these problems. For data that is only used within a single application, there are many more possibilities for how schema evolution can be handled. But when data is controlled by the end user and shared between multiple applications, this gets a lot trickier. Note that these challenges and restrictions aren’t unique to Ceramic - any decentralized data system in which end users sign their own writes with their own keys and in which data can be shared between applications will have to wrestle with these same issues.

This is an area where I’m hoping the broader Ceramic community may contribute new ideas, tooling, libraries, etc. to make these problems at least a little easier to work around. One interesting area of research here is the idea of automatic schema transformations via something like Cambria lenses (see “Project Cambria: Translate your data with lenses”).

Hopefully this gives enough context to help you reason about the tradeoffs and make a decision that makes sense for your use case. If you have any additional questions in this area please let us know!

Appreciate the explanation and suggestions!