Proposal for Flat Recon

Co-authored with @AaronDGoldman

Related to CIP-124

Motivation

Recon is set reconciliation of key-value pairs: a filter-then-divide-and-conquer protocol. As it currently exists, any values you wish to filter on MUST be encoded in the Event ID.

This has a few adverse effects:

  • The size of the Event ID grows with the number of filters encoded.
  • The order of the filters matters. An application that would like to apply a different filter order would need a new Event ID.
  • The choice of filters is limited and inflexible.

Proposal

Flat Recon proposes moving the tags to filter on from keys to values, and simplifying the key to be the (abbreviated) hash of the value.

Under Flat Recon, each side begins the conversation by sending its Interests. Each side then calculates its associative hash only on values that survive the filters corresponding to these Interests. Now we can express ranges and perform divide-and-conquer in terms of the (much shorter) keys.
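The filter-then-hash step might be sketched as follows. This is a toy model, not the spec: `interest_matches`, exact-match filter semantics, and 96-bit integer sums are all illustrative assumptions.

```python
import hashlib

def key_of(value: bytes) -> bytes:
    # Abbreviated key: a prefix of the value's SHA-256 (12 bytes here).
    return hashlib.sha256(value).digest()[:12]

def interest_matches(tags: dict, interest: dict) -> bool:
    # Toy filter: every tag named in the Interest must match exactly.
    return all(tags.get(k) == v for k, v in interest.items())

def range_hash(values, interest, left=b"", right=b"\xff" * 12) -> int:
    # Associative hash over the keys of values that survive the filter
    # and fall inside the (left, right) fencepost range. Integer
    # addition keeps the hash associative and order-independent.
    total = 0
    for tags, payload in values:
        if not interest_matches(tags, interest):
            continue
        k = key_of(payload.encode())
        if left < k < right:
            total = (total + int.from_bytes(k, "big")) % (1 << 96)
    return total
```

Because the sum is order-independent, two peers holding the same filtered data compute the same range hash regardless of insertion order, which is what lets divide-and-conquer converge.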

Example value:

[
  {
    "model": "Model_ID",
    "controller": "DID_1",
    "after": "Timestamp_1"
  },
  "payload"
]

This makes Recon for Models more analogous to the SQL query:

SELECT SUM(hash)
FROM events  -- illustrative table name
WHERE model = <Model_ID>
  AND controller IN (<DID_1>, <DID_2>)
  AND after > <Timestamp>
  AND <left_fencepost> < hash
  AND hash < <right_fencepost>

model, controller, and after become filters and we use the left and right fenceposts to narrow down the range.

For specifying Interests, we want a more restricted language than SQL. Since both the tags and values are strings, queries for Interests would likely be restricted to the operations =, <, <=, >, >=, in, and not in.
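One way to picture that restricted language: an Interest is a conjunction of `(tag, operator, argument)` clauses, with only the operators above allowed. The operator set comes from the proposal; the clause representation here is an assumption.

```python
# Restricted Interest operators over string tags.
OPS = {
    "=":      lambda tag, arg: tag == arg,
    "<":      lambda tag, arg: tag < arg,
    "<=":     lambda tag, arg: tag <= arg,
    ">":      lambda tag, arg: tag > arg,
    ">=":     lambda tag, arg: tag >= arg,
    "in":     lambda tag, arg: tag in arg,
    "not in": lambda tag, arg: tag not in arg,
}

def matches(tags: dict, interest) -> bool:
    # An Interest is a conjunction (AND) of (tag_name, op, argument) clauses.
    return all(OPS[op](tags.get(name, ""), arg) for name, op, arg in interest)
```

For example, `matches({"controller": "DID_1"}, [("controller", "in", {"DID_1", "DID_2"})])` holds, mirroring the `controller in (DID_1, DID_2)` clause above.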

Once Recon finds a key that is on one side but not the other, it will send the value that contains the tags and the payload.

With Flat Recon, we no longer care about the order the filters are in because the SQL engine can choose the query plan. This enables us to make the wire protocol independent of the query plan.

One option for the keys under this proposal would be the first 96 bits (12 bytes) of the SHA-256 of the payload. For the range sums, we could use the first 12 bytes of the sum of the SHA-256 of the values.
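That concrete choice can be sketched directly (a minimal sketch; the function names are illustrative, and summing modulo 2^256 before truncating is an assumption about how overflow is handled):

```python
import hashlib

def key(payload: bytes) -> bytes:
    # Key: first 96 bits (12 bytes) of SHA-256 of the payload.
    return hashlib.sha256(payload).digest()[:12]

def range_sum(values) -> bytes:
    # Range sum: first 12 bytes of the (wrapping) sum of the full
    # SHA-256 digests of the values.
    total = sum(int.from_bytes(hashlib.sha256(v).digest(), "big")
                for v in values) % (1 << 256)
    return total.to_bytes(32, "big")[:12]
```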

This also lets us separate all the Ceramic logic from Recon logic.

The API to Recon now becomes:

Recon.Add([ { "model": "Model_ID", "controller": "DID_1" }, "payload" ])
Recon.Subscribe("controller in (DID_1, DID_2)")
// Streams new events matching the Interests you're subscribed to
Stream<Events> <- Recon.Listen()
// Gives you everything currently stored matching the specified Interest
Set<Events> <- Recon.Query("controller in (DID_1, DID_2)")
// Not relevant to Flat Recon
Recon.Connect(peer)

This seems like a great improvement for Recon from a usability perspective. It seems like it could also make it easier to introduce new filters in the future.

One key aspect that I think was left out of the OP is that two nodes talking Recon still need to sort their databases in the same order; otherwise range hashes would be effectively random and would not converge even when the nodes hold the same data.

Would also love to hear @nathanielc chime in here!


Here are some initial thoughts. I have many more but we can dig deeper into each in follow up replies.

  1. The ergonomics of the API for the end user are clearly improved.
  2. This design makes a trade-off between ease of use and performance of the server implementation.
    a. This approach assumes a relational engine backs the implementation of Recon. Do we like that?
    b. Arbitrary filters mean either slow queries that need full table scans for each range request, or maintaining an index for each possible filter predicate.
    c. This approach makes it difficult to implement an in-memory cache of hashes for ranges, as we now need to maintain that cache indexed by predicate.

Some branch points for further discussion:

2a.

It seems fine that Recon assumes a relational engine in its implementation, but that assumption should likely be made explicit.

2b.

One solution to arbitrary filters is to limit the filters to a discrete set. For example, you can filter on model and controller but not event height. Or you can only filter on controller if you also filter on model (i.e. force a hierarchical sort). If a peer tries to talk Recon with an unsupported predicate, the node would refuse to talk. This actually seems to hurt the ergonomics of the API, since it would appear more expressive than it really is: users will be confused/surprised that they cannot combine arbitrary filters.
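A node enforcing such a discrete set might look like this toy validator (the specific supported combinations are assumptions for illustration, not a proposal):

```python
# Tag combinations this node is willing to sync on. "controller" is
# only allowed alongside "model", forcing a hierarchical sort.
SUPPORTED = {
    frozenset({"model"}),
    frozenset({"model", "controller"}),
}

def accept_interest(interest_tags: set) -> bool:
    # Refuse to talk Recon for any unsupported predicate combination.
    return frozenset(interest_tags) in SUPPORTED
```

Here `accept_interest({"controller"})` would be refused even though the Interest language can express it, which is exactly the ergonomic surprise described above.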

If we instead allow arbitrary filters, I expect this will destroy sync performance, as it will practically mean full table scans for the first layers of the Recon conversation.

2c.

We have discussed building an in-memory LRU cache of hash ranges, updated on write. In short, a B-Tree that maintains the hash sum on all nodes, not just leaves, and then prunes the tree to a certain depth based on usage.
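As a toy illustration of the core idea, sums at interior nodes so range queries can reuse cached subtree sums instead of rescanning rows, here is a static binary tree over integer keys (not the pruned LRU B-Tree described above; all names and the integer keys are illustrative):

```python
class Node:
    def __init__(self, keys):
        # `keys` must be sorted. Every node, not just leaves, stores
        # the associative hash sum of its whole subtree.
        self.lo, self.hi = keys[0], keys[-1]
        self.sum = sum(keys) % (1 << 96)
        if len(keys) > 1:
            mid = len(keys) // 2
            self.left = Node(keys[:mid])
            self.right = Node(keys[mid:])
        else:
            self.left = self.right = None

    def range_sum(self, lo, hi):
        # Sum of keys k with lo <= k <= hi, reusing interior sums
        # whenever a subtree falls entirely inside the range.
        if hi < self.lo or lo > self.hi:
            return 0
        if lo <= self.lo and self.hi <= hi:
            return self.sum
        return (self.left.range_sum(lo, hi) +
                self.right.range_sum(lo, hi)) % (1 << 96)
```

The sticking point the next paragraph raises: this structure caches sums for one sort order and one (implicit) filter; under Flat Recon, each distinct predicate would need its own such tree or the cache never hits.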

However, building such an in-memory cache would not be as straightforward in a Flat Recon model: either it will never get a cache hit, or we will have to maintain different versions of the cache keyed by predicate.

This comes back to point 2a because it means we basically need an in-memory relational engine to build such a cache.

Conclusion

Generally, I expect that Flat Recon will introduce a significant performance burden on the server implementation, to the point of becoming unusable at scale. However, I would like to empirically test some of these concerns at defined scales before actually concluding one way or the other.