Setting up an HA Ceramic cluster plus node snapshots

Current State
Ceramic ComposeDB environment up & running. Moving to production shortly.

Target Environment
Node setup in Docker Compose (NOT related to ComposeDB) with Docker networking. Note: does NOT use Swarm, does NOT use k8s.

Problem Description
To move into production, we need a redundant/resilient Ceramic node cluster (for obvious reasons; if you need me to elaborate, please ask). What is the process for this? Specifically, I need to know how to:

  1. Set up cluster nodes in an active-active (i.e., all nodes active) configuration
  2. Set up snapshot agents that are “Ceramic node”-aware. We could snapshot the entire node, but not only is that overkill, it may also interfere with any pending/in-flight syncs/writes occurring on that node.

Notes
I see this on the roadmap (ComposeDB HA cluster · Issue #13 · ceramicstudio/Roadmap · GitHub), but it’s been there since July and there’s no update. Without this feature, we’d have to move away from Ceramic, since “pray and hope” isn’t a good strategy given that Ceramic user profiles house actual financial transaction order data.

If this feature isn’t possible, is there a workaround?

There isn’t really a reliable way to do an HA setup currently, because for this to work well you want to ensure that the data between your primary and secondary nodes is kept in sync, and the pubsub mechanism we use today for data synchronization isn’t very reliable. You could still do it: run two nodes both set to index the same data models, and as long as both nodes are up and well connected, they will replicate each other’s writes and stay synchronized. The problem is that pubsub does not provide reliable message delivery, so if either node is offline or disconnected in any way, it can miss messages and your nodes can diverge.
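
As an illustration only, here is a minimal sketch of that two-node arrangement using the ComposeDB CLI. The node names, ports, and composite filename are assumptions, and the flags are from the ComposeDB CLI docs as I recall them, so verify with `composedb composite:deploy --help`:

```bash
# Deploy the same composite (i.e., the same set of data models) to both
# nodes so that each node indexes, and therefore replicates, the same data.
# DID_PRIVATE_KEY must belong to an admin DID configured on each node;
# my-composite.json, ceramic-a, and ceramic-b are placeholder names.
export DID_PRIVATE_KEY=<admin-did-private-key>
composedb composite:deploy my-composite.json --ceramic-url=http://ceramic-a:7007
composedb composite:deploy my-composite.json --ceramic-url=http://ceramic-b:7007
```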

We are getting ready to launch our new data synchronization protocol, Recon (see CIP-124: Recon (tip synchronization protocol) for more info), which will radically improve the situation here. Recon will provide reliable data synchronization that can recover even in the face of dropped messages or a temporary node outage. Once Recon is launched, it will become very easy to run multiple nodes configured to synchronize the same set of data and put them behind a load balancer or a similar HA/DR setup.
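
Once Recon lands, the front end could be as simple as a reverse proxy over the nodes’ HTTP APIs. A hypothetical nginx sketch (node names are made up; 7007 is the default Ceramic API port):

```bash
# Hypothetical nginx config fronting two Ceramic nodes.
cat > /etc/nginx/conf.d/ceramic.conf <<'EOF'
upstream ceramic {
    server ceramic-a:7007;
    server ceramic-b:7007 backup;  # failover-only; drop "backup" for round robin
}
server {
    listen 80;
    location / {
        proxy_pass http://ceramic;
    }
}
EOF
```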

As far as a backup/restore strategy for disaster recovery goes, that should be totally possible today. There are three places where Ceramic stores data persistently: the IPFS block store, the Ceramic state store (LevelDB, by default in ~/.ceramic/statestore), and the ComposeDB index in Postgres. You would want to snapshot the ComposeDB Postgres database first, the Ceramic state store second, and the IPFS block store last. That ordering ensures you get a backup where the lower-level services always have all the data expected by the higher-level services.
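
As a sketch of that ordering, assuming the default paths mentioned above and a `$POSTGRES_URL` connection string (both are assumptions to adapt to your layout; pausing writes or using filesystem-level snapshots will give more consistent captures):

```bash
#!/usr/bin/env bash
set -euo pipefail

STAMP=$(date +%Y%m%dT%H%M%S)
DEST="/backups/ceramic-$STAMP"
mkdir -p "$DEST"

# 1. ComposeDB index first (the highest-level service).
pg_dump --format=custom --file="$DEST/composedb.dump" "$POSTGRES_URL"

# 2. Ceramic state store second (LevelDB; default location).
tar -czf "$DEST/statestore.tar.gz" -C "$HOME/.ceramic" statestore

# 3. IPFS block store last (default location; confirm for your setup).
tar -czf "$DEST/ipfs-repo.tar.gz" -C "$HOME" .ipfs
```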

Thanks @spencer

Re: HA/Recon: That’s great to hear. Release date? Testing completion? What’s the test coverage?

Re: Snapshotting: Is there a standard procedure? Postgres has its own snapshot agent, and I want to make sure that using it doesn’t violate some underlying modification the team has made to how Ceramic interacts with it.

Re: data location: Where is the data stored for a) the IPFS block store and b) the ComposeDB index? By default (esp. when in containerized form)?

If you plan on publishing a Ceramic containerization process, that would really solve many of these issues. Do you also have plans for a k8s operator (didn’t see one on the roadmap)? I can file a feature request, but I’d like to understand what other elements you’re considering before I push for it.

Thanks.

Re: HA/Recon: That’s great to hear. Release date? Testing completion? What’s the test coverage?

We are in the final phases of testing and stabilization right now. We’re hoping to release a beta within the next few weeks, though exactly when it releases will depend on how testing goes and how long it takes to ensure things are stable and working well.

Re: Snapshotting: Is there a standard procedure? Postgres has its own snapshot agent, and I want to make sure that using it doesn’t violate some underlying modification the team has made to how Ceramic interacts with it.

We generally use vanilla, out-of-the-box Postgres. There are no custom changes to Postgres and we use it in pretty standard ways, so any normal backup system for Postgres should be fine.

Re: data location: Where is the data stored for a) the IPFS block store and b) the ComposeDB index? By default (esp. when in containerized form)?

For IPFS, it will depend on how you’re using IPFS. If you’re using our go-ipfs-daemon Docker image (GitHub - ceramicnetwork/go-ipfs-daemon: this repo builds a Docker image that runs a js-ceramic compatible version of go-ipfs) as we recommend, I believe data is stored in ~/.ipfs by default, though it would be good to confirm that yourself, as I’m not 100% certain of it.
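
If you would rather not depend on the image default, one option is to pin the repo to a known host path with a bind mount. A sketch, assuming the image follows the upstream go-ipfs convention of keeping the repo at /data/ipfs (worth verifying for this image):

```bash
# Run the Ceramic-compatible go-ipfs image with its repo on a known host path.
docker run -d --name ceramic-ipfs \
  -v /srv/ceramic/ipfs:/data/ipfs \
  ceramicnetwork/go-ipfs-daemon

# Check where the repo actually lives (IPFS_PATH may be unset if the
# image relies on the default location).
docker exec ceramic-ipfs printenv IPFS_PATH
```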

For the ComposeDB index, it’s up to you to run your own Postgres node and configure where it stores things. I don’t know if we have any custom container images that would decide the storage location for you; maybe @3ben or @mohsin can add more clarity on that.

Thanks @spencer

I’d like to hear your opinion on the below, to understand whether something about the way Ceramic is set up would cause such a solution to falter or fail.

We use Ceph (or Gluster, etc.) to write the Ceramic ComposeDB index (essentially the Postgres data files) to a network volume. All nodes point to this volume. We instantiate Postgres in write-through mode (i.e., no caching), accepting the performance penalty (our duty cycle isn’t at 100% at the moment, so we should see minimal impact). Then we have the load balancer point to one node only and fail over to the others when that one degrades (i.e., no round robin).

  1. Would this work?
  2. Do we need to do the same with the IPFS block store and the Ceramic state store (i.e., put them on the network volumes as well)?

Thanks.

Each Ceramic/ComposeDB node needs its own independent storage stack. You can’t share the same Postgres database between nodes, or they are likely to clobber each other’s writes and cause data inconsistencies.
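
In Docker Compose terms, that means every node gets its own Postgres service and its own named volumes. A hypothetical sketch (image tags and in-container paths are assumptions to verify against the images you actually run):

```bash
cat > docker-compose.yml <<'EOF'
services:
  ceramic-a:
    image: ceramicnetwork/js-ceramic
    volumes:
      - ceramic-a-data:/root/.ceramic        # state store for node A only
  postgres-a:
    image: postgres:15
    volumes:
      - pg-a-data:/var/lib/postgresql/data   # index DB for node A only
  ceramic-b:
    image: ceramicnetwork/js-ceramic
    volumes:
      - ceramic-b-data:/root/.ceramic
  postgres-b:
    image: postgres:15
    volumes:
      - pg-b-data:/var/lib/postgresql/data
volumes:
  ceramic-a-data:
  pg-a-data:
  ceramic-b-data:
  pg-b-data:
EOF
```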

If ONLY one node (and always the same node) is ever doing writes until it degrades and another node picks up (as described above), how could that cause clobbering?

I guess if you’re very confident in your mutual-exclusion guarantees then it should be fine. You’d also need to make sure that a node never did any reads while it was secondary before it became primary, because if it did, it could have old versions of documents cached in memory.