Setting up HA Ceramic cluster plus node snapshots

There isn’t currently a reliable way to do an HA setup. For HA to work well, the data on your primary and secondary nodes needs to be kept in sync, and the pubsub mechanism we use today for data synchronization isn’t very reliable. You can still attempt it: run two nodes that are both configured to index the same data models, and as long as both nodes are up and well connected, they will replicate each other’s writes and stay synchronized. The problem is that pubsub does not provide reliable message delivery, so if either node goes offline or loses connectivity it can miss messages and your nodes can diverge.
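For illustration, the two-node setup would mean pointing both daemons at the same set of indexed models. The fragment below is a sketch based on the js-ceramic `daemon.config.json` format; the exact field names, the database URL, and the model stream IDs are placeholders, not a verified config:

```json
{
  "indexing": {
    "db": "postgres://user:pass@localhost:5432/ceramic",
    "models": [
      "kjzl6hvfrbw6c...modelStreamId1",
      "kjzl6hvfrbw6c...modelStreamId2"
    ]
  }
}
```

With identical model lists on both nodes, each node will index writes it sees from the other, which is what keeps them loosely synchronized while pubsub delivery holds up.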

We are getting ready to launch our new data synchronization protocol, Recon (see CIP-124: Recon (tip synchronization protocol) for more info), which will radically improve the situation. Recon provides reliable data synchronization that can recover even after dropped messages or a temporary node outage. Once Recon launches, it will become easy to run multiple nodes configured to synchronize the same set of data and put them behind a load balancer or a similar HA/DR setup.

A backup/restore strategy for disaster recovery should be entirely possible today. Ceramic stores data persistently in three places: the IPFS block store, the Ceramic state store (LevelDB, by default in ~/.ceramic/statestore), and the ComposeDB index in Postgres. Snapshot the ComposeDB Postgres database first, the Ceramic state store second, and the IPFS block store last. Taking the snapshots in that order ensures that in your backup the lower-level services always have all the data expected by the higher-level services, since each lower layer is captured at the same time as or later than the layer above it.
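The ordering above can be sketched as a small script. This is a minimal sketch, not a supported tool: the statestore default path comes from the answer above, while the Postgres database name, the IPFS repo path, and the use of `pg_dump` plus `cp -a` are assumptions you would adapt to your deployment.

```python
import os
from datetime import datetime, timezone

# Assumed locations -- adjust to your deployment.
STATESTORE_DIR = os.path.expanduser("~/.ceramic/statestore")  # default from the answer above
IPFS_REPO_DIR = os.path.expanduser("~/.ipfs")                 # assumption
PG_DATABASE = "ceramic"                                       # assumed ComposeDB index DB name


def snapshot_commands(backup_root: str) -> list[list[str]]:
    """Build the snapshot commands in dependency order:

    1. ComposeDB Postgres index (highest level) first,
    2. Ceramic state store (LevelDB) second,
    3. IPFS block store (lowest level) last.

    Because each lower layer is captured at the same time or later than
    the layer above it, the restored lower layers always contain all the
    data the higher layers expect.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = os.path.join(backup_root, stamp)
    return [
        ["pg_dump", "--format=custom", f"--file={dest}/composedb.dump", PG_DATABASE],
        ["cp", "-a", STATESTORE_DIR, f"{dest}/statestore"],
        ["cp", "-a", IPFS_REPO_DIR, f"{dest}/ipfs"],
    ]


# To actually run the backup, create dest and execute each command in order,
# e.g. subprocess.run(cmd, check=True), ideally with the daemon quiesced.
```

For restore, reverse the order: the IPFS block store and state store first, then the ComposeDB index, so higher layers never come up ahead of the data beneath them.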