RFC: Query Filtering and Ordering (New Feature)

msena · May 3, 2023, 12:45pm

Treat this proposal as a first draft that will evolve based on your feedback, not a final version to rubber stamp. Please let us know your thoughts as comments below!

Problem

ComposeDB currently supports loading lists of documents by their model type, their controlling account, or their related documents. However, ComposeDB lacks the ability to filter and/or order the returned documents by their fields. This has been a heavily requested feature by the community on this forum and in direct conversations.

Solution

This RFC introduces a possible solution to add simple filtering and ordering primitives to GraphQL queries as arguments on GraphQL connections, in addition to the existing pagination arguments. As a requirement for filtering and ordering, this proposal also defines how users specify the fields they want indexed.

Approach

This RFC is written with the following in mind:

MVP – Provide an intentionally minimum set of functionalities; let the community ask for more
UX – Queries should be easy to setup and use
Performance – Queries should perform efficiently
Security – Feature should be secure and introduce no vulnerabilities to the node

Example Composite

For discussion in this RFC, we will be using the following composite:

enum PublicationStatus {
  DRAFT
  PUBLISHED
  ARCHIVED
}

type Image {
  src: URI!
  alt: String
}

type Post @createModel(...) {
  text: String!
  status: PublicationStatus!
  publishedAt: Date
  image: Image!
  tags: [String!]
}

Indexing

To properly support query filtering and sorting, we need to define a format for indexed fields. In our approach, fields can be set to be indexed using the @createIndices directive in the composite, as seen below:

enum PublicationStatus {
  DRAFT
  PUBLISHED
  ARCHIVED
}

type Image {
  src: URI!
  alt: String
}

type Post @createModel(...) {
  text: String!
  status: PublicationStatus!
  publishedAt: Date
  image: Image!
  tags: [String!]
}

@createIndices(model: "Post",
               indices: [
		         {"fields":["status"]},
                 {"fields":["publishedAt"]}
			   ])

We would have the following definition for the indices field in the directive:

{
  "fields": [array of field names, with nested fields specified using dot notations],
}

Supported Indices

The following types of fields will be supported for creating indices:

Scalar (including Enum) - Top Level
Scalar - On nested objects, such as alt in image

The following types fields will NOT be supported for creating indices:

Arrays - Arrays will not be indexed

For indices, we will support both single value indices and multiple value (e.g. multi-column) indices.

Design Considerations

We chose this approach to specifying indices separately in the composite file (vs directly in the models) because it separates indices from the models.

Benefits

Multiple apps can reuse the same models, but each app can set their own indices based on their particular query patterns. This is great for model reuse and composability.

Tradeoffs

Users could create too many indices which would negatively impact the performance of their ComposeDB deployment.
Devs who deploy a Model to their app will need to remember to add indices themselves to support whatever queries they want to perform - the indices won’t automatically come with installing the Model.
This is a slightly different approach than developers might be used to with popular GraphQL data tooling such as Prisma or Hasura, which uses an @index directive directly in the GraphQL model.

Filtering

This section describes value filters, object filters, logic filters, and contains an example.

Value filters

Value filters apply to individual fields in a document. They are object containing a set of supported keys, but a single key must be used per filter.

The following input describes the simple boolean value filter, allowing to filter by a specific value or is the field value is null:

input BooleanValueFilterInput {
  isNull: Boolean
  equalTo: Boolean
}

For other scalars, the GraphQL runtime would generate the following input object for all supported scalars. The Float scalar is used here as example, and would be replaced by the specific scalar (Int, String, Date, URI…) in every instance:

input FloatValueFilterInput {
  isNull: Boolean
  equalTo: Float
  notEqualTo: Float
  in: [Float!]
  notIn: [Float!]
  lessThan: Float
  lessThanOrEqualTo: Float
  greaterThan: Float
  greaterThanOrEqualTo: Float
}

Enums present in the schema also get value filters generated:

enum PublicationStatus {
  DRAFT
  PUBLISHED
  ARCHIVED
}

input PublicationStatusValueFilterInput {
  isNull: Boolean
  equalTo: PublicationStatus
  notEqualTo: PublicationStatus
  in: [PublicationStatus!]
  notIn: [PublicationStatus!]
}

Object filters

Object filters contain the mapping of indexed fields to value filters in a document. This is the simple filtering we would support when this is released.

For example, our Post document described above would have the following object filter generated:

input PostObjectFilterInput {
  status: PublicationStatusValueFilterInput
  publishedAt: DateValueFilterInput
}

Logical filters

At a higher level, it is possible to use logical conditions to refine the filtering on a document, for example based on our PostObjectFilterInput above:

input PostFilterInput {
  doc: PostObjectFilterInput
  and: [PostFilterInput!]
  or: [PostFilterInput!]
  not: PostFilterInput
}

As with value filters, only one key/value pair must be present in the filter object.

Example Filter

The filter being recursive, conditions can be nested, for example in a GraphQL JSON input:

// Filter posts that are drafts or published in April 2023
{
  "filter": {
    "and": [
		  "or": [
		    { "doc": { "status": { "equalTo": "DRAFT" } } },
		    {
	        "and": [
			    { "doc": { "status": { "equalTo": "PUBLISHED" } } },
		        { "doc": { "publishedAt": { "greaterThanOrEqualTo": "2023-04-01" } } },
		        { "doc": { "publishedAt": { "lessThanOrEqualTo": "2023-04-30" } } }
		      ]
	        }
		  ]
	  ]
  }
}

Ordering

Ordering can be specified on an indexed field, in ascending or descending order, with null values last:

enum SortOrder {
  ASC
  DESC
}

As with filtering, an input type is automatically generated for each document having indexed fields.

For example, our Post document described above as:

type Post @createModel(...) {
  text: String! @string(...)
  status: PublicationStatus!
  publishedAt: Date
  image: Image!
  tags: [String!]
}

@createIndices(model: "Post",
               indices: [
                   {"fields":["status"]},
                   {"fields":["publishedAt"]}
		       ])

Would have the following object generated:

input PostOrderByInput {
  status: SortOrder
  publishedAt: SortOrder
}

Querying

filter and orderBy arguments can be provided in all GraphQL connections, in addition to the first, last, after and before arguments used for pagination already supported by ComposeDB:

query {
  postIndex(
    # Existing pagination arguments
    first: Int,
    last: Int,
    after: String,
    before: String,
    # Added filtering argument
    filter: PostFilterInput,
    # Added ordering argument
    orderBy: PostOrderByInput
  )
}

Downsides, Limitations, and Tradeoffs

This proposal provides a set of operators which would work with ComposeDB, but may not include all operators that users would want. We would seek to implement the operators that would cover the majority of the ComposeDB query use cases.
This approach requires users to learn our GraphQL operators, and how to combine them to achieve the results they want.
Queries cannot be prepared ahead of time, so could be rejected at runtime if malformed or poorly performing.

Possible Future Features

Custom Queries

In this approach, developers specify the fields they want indexed and then write custom functions for querying data based on those indices. This approach offers greater query customization and power, however it is more complex and requires developers to interact with the specified query language. Note, for this example we use SQL, but we could support any query language.

Given a composite:

type Ball {
  red: Number!
  green: Number!
  blue: Number!
  position: Position!
}

@createIndices(model: "Ball", 
               indices: [
			     {"fields":["red"]},
				 {"fields":["green"]},
				 {"fields":["blue"]}
			  ])

Users could then write raw queries in some query language that is decided upon by the ComposeDB team (SQL, Cypher, or custom):

input BallQuery {
  desiredRed: Number!
  neededGreen: Number!
  wantedBlue: Number!
}

query {
  @query(def: "SELECT * FROM Ball WHERE red = $desiredRed AND blue = $wantedBlue AND green = $neededGreen")
  getBallsByColor(input: BallQuery!): [Ball!]
  
  @query(def: "SELECT * FROM Ball WHERE red > 0")
  getRedBalls(): [Ball!]
}

The raw queries would be prepared on the node, and the input would be passed to the prepared statement. An example usage would be:

query GetBallsByColorQuery($input: BallQuery) {
  getBallsByColor(input: $input) {
    position
  }
}

Note that in this system users would need to learn the query language specified. Additionally, queries would be evaluated when creating composites, and composite creation could fail due to queries being malformed or poorly performing.

Inspiration

For this proposal, we have been inspired by the following data products:

Feedback

Please comment below. Based on your feedback, we will start finalizing designs and begin implementation. Here’s some of the things we’re interested in learning:

How satisfied are you with this implementation?
Does this meet all of your filtering / ordering needs and use cases?
Which ones does it meet? Where does it fall short?
How would you improve it?
Do you need more complex filtering capabilities like joins and aggregations? What do you want to use them for?
Are your query needs so specific that they can only be satisfied by custom resolvers (Custom Queries)? If so, which query language would you prefer to use (e.g. SQL, Cypher, MongoDB, custom)?

msena · May 3, 2023, 2:00pm

Draft of your requested filtering / ordering feature is ready for review @Sami @andreiv @0xEE3CA4dd4CeB341691 @erci.eth @IvanH @kiba @Larisabrownb @dysbulic @cristiana @0xE243BB905378f7aB70 @0xA25532B1287dEe6501 @leo

0xEE3CA4dd4CeB341691 · May 7, 2023, 1:07am

Hi Michael. Super excited to see this going!

Liked the idea of separating indexing from model. Does that means I could index composite created by others?
query/filter seems quite flexible. But is there a way to query without defining filter? eg. GraphQL API - The Graph Docs for thegraph, you could just query without defining filters and it seems all fields are indexed by default?
Understanding that composeDB usecase is quite different from thegraph, I think there maybe some advantages of offering similar APIs as devs are in general familiar with thegraph
for now I don’t think complex filtering like joins and aggregations are must-haves. Also they are probably better suited for a SQL index rather than composeDB. But I think SQL index will be an option in addition to composeDB?

erci.eth · May 7, 2023, 6:55pm

Hey there!

First off, thank you for sharing this RFC with, we’ve been waiting We at denoted (@phi_2 and I) reviewed it together, and we have a few questions and comments:

In this example, what does the community field refer to? Seems like a typo

input PostObjectFilterInput {
  community: CommunityObjectFilterInput
  status: PublicationStatusValueFilterInput
  publishedAt: DateValueFilterInput
}

Can you update an existing model by adding, modifying, or removing indices after it has been deployed?
Is filtering on embedded shapes supported?
You mentioned that poorly performing queries might be rejected at runtime. Could you elaborate on what “poorly performing” means?
Are there any limitations when using custom queries?

Our main concerns revolve around relationships and joining data:

We’d most likely want to reference relations in filtered queries, is that possible? Perhaps with custom queries?
We’d be interested in basic aggregations like the total number of objects for a specific model (i.e COUNT)

We’re also curious about the implementation details! Will there be a separate table created for each indexed model? Understanding the data structure and performance implications will help us plan our own data modelling and querying strategies more effectively.

Side note, have you considered giving consumers of the protocol more control over indexing with like a CDC mechanism? This would be great for us since it’d allow all kinds of indexing and make it possible for us to solve our problems without relying entirely on your implementation. Just a thought!

Thanks again for sharing this RFC with us. We’re excited to see how it evolves and look forward to contributing to the discussion. Keep up the great work!

dbcfd · May 8, 2023, 4:18pm

It is a typo. There was some information about querying across models, but that is no longer part of this RFC.

Part of the idea of this proposal is that indices are part of your composite, and won’t change an existing deployed model or composite.

Filtering on nested types will be supported, however we’re still discussing on querying and filtering when you reference another model in your model. Looks like this is a requirement you are interested in.

This is another area of discussion, on how to deal with highly nested or complex queries. One possibility is a hard limit configured on the node, another one is runtime analysis and rejecting queries that would impact the node.

Can you elaborate on what you mean by CDC mechanism? Are you wanting something similar to a commit log?

dbcfd · May 8, 2023, 4:21pm

Yes, you would be able to specify your own indices on top of models/composites that others create.

So the filters described above were what ComposeDB generates “under the hood”, to allow you to query. You do not actually have to define the filters.

spencer · May 8, 2023, 6:44pm

To elaborate on what @dbcfd said above:

I removed this reference from the original post - thanks for catching it!

In this proposal, indices are specified at the composite layer (which is app-specific) instead of the model layer (which is shared between apps). This allows apps to specify what indices are needed for their app. It should indeed be possible to add new indices to existing data my modifying your composite, although there may be some challenges around doing so in a zero-downtime way as the existing synced data will need to be re-indexed.

ComposeDB already supports “count” queries, so this will be supported as well when using filters on fields.

Note that Custom Queries are listed as a Future Improvement and are not a core part of this proposal. The design for Custom Queries is not finalized yet and we don’t know what kind of limitations they will have. One of the things we hope to learn with this RFC is how many of our user’s use cases can be satisfied by this simple filtering support, vs how many require more complex Custom Queries.

Exactly, for now this RFC proposal does not support any kind of “joins” for filters on related models. Such support would require the Custom Queries support, which as mentioned has not been fully designed yet and is not a part of this initial proposal.

Yes, currently ComposeDB relies on a Postgres database in the backend, with one Table for each indexed Model.

Yes, in the long term we eventually want to allow users to interact with the Ceramic Event Streaming protocol directly, without having to go through higher-level APIs like ComposeDB. This will enable more advanced users to build their own database implementations on top of the raw event streams.

spencer · May 8, 2023, 6:50pm

@0xEE3CA4dd4CeB341691 note that ComposeDB uses Postgres under the hood. It is also possible to connect directly to the Postgres database that ComposeDB uses and do queries there directly for more complex queries than are currently supported in ComposeDB APIs

erci.eth · May 9, 2023, 7:20am

Thank you for the answers @spencer & @dbcfd!

What @spencer is referring to with Ceramic Event Streaming sounds interesting!

0xEE3CA4dd4CeB341691 · May 9, 2023, 3:13pm

Thanks for the answer Spencer! I am considering the use cases of both when devs running their own node and using a public nodes. It also requires addition api layer for direct DB so that’s tricky for public nodes

msena · May 11, 2023, 10:22am

Any more feedback? We’re starting to finalize designs in the next week and want to be sure we’re considering everyone’s voice!

leo · May 11, 2023, 1:33pm

This design sounds really exciting, love this RFC! Some comments on my side:

Multiple apps can reuse the same models, but each app can set their own indices based on their particular query patterns. This is great for model reuse and composability.

This goal makes a lot of sense and I really like it, however, from my understanding, only the fields added as indexes would be filterable. It makes sense as the most frequent queries should run fast, but you might on certain occasions need to create some queries that don’t require an index as not often queried, and it seems that it would not be possible with the current solution (and that is possible with Hasura for example). I am wondering if an approach where all the fields are filterable and devs are responsible to manage their indexes for their apps makes sense.
A benefit of this approach would be faster development time. On local when creating your app, you could query all the fields by default and only think about adding indexes when deploying to production which would lead to a better DX. I didn’t get from this proposal if adding a new index and restarting the node require some time to re-sync the data or if it’s instant.

Queries cannot be prepared ahead of time, so could be rejected at runtime if malformed or poorly performing.

Maybe I am wrong there, but adding persisted queries capabilities to the node sounds possible? You could specify a folder of queries when deploying the node, and the queries would be hashed, the client would then send the hash to the server to only whitelist these queries or just for better performance (as less payload is sent to the server, see Automatic persisted queries - Apollo GraphQL Docs).
Devs can also use tooling like Home – GraphQL Code Generator to validate that their queries are valid as part of the CI/CD of their app, some guides would need to be created for this.

As a more general question, I would love to know the technical reason why fields are stored as JSON in Postgres vs using a more standard approach (columns). When I did hit the filtering limitation of the current ComposeDB version I thought about using Hasura on top of it for the time being, but having fields stored as JSON makes it more complex and you lose a lot of benefits of Hasura.

paul · May 11, 2023, 4:53pm

leo:

This design sounds really exciting, love this RFC! Some comments on my side:

Multiple apps can reuse the same models, but each app can set their own indices based on their particular query patterns. This is great for model reuse and composability.

This goal makes a lot of sense and I really like it, however, from my understanding, only the fields added as indexes would be filterable. It makes sense as the most frequent queries should run fast, but you might on certain occasions need to create some queries that don’t require an index as not often queried, and it seems that it would not be possible with the current solution (and that is possible with Hasura for example). I am wondering if an approach where all the fields are filterable and devs are responsible to manage their indexes for their apps makes sense.
A benefit of this approach would be faster development time. On local when creating your app, you could query all the fields by default and only think about adding indexes when deploying to production which would lead to a better DX. I didn’t get from this proposal if adding a new index and restarting the node require some time to re-sync the data or if it’s instant.

You’ll need to specify the indices in the composite so that the GraphQL schema used at runtime can be generated with the expected filters, so one thing you could do is having a first version of your composite with indices on all fields, and refining the indices once your more settled on your app’s needs.

Adding indices wouldn’t require to re-sync the data, but it might take some time to build the indices in the underlying database. If you do it during development with a limited number of documents I expect it would be pretty fast though, it’s mostly a constraint when adding indices in production.

Yes I’d love to support persisted queries, that would require deeper architectural changes to how ComposeDB works now but it should be possible at some point.

Documents in ComposeDB are structured and stored as JSON over the network and IPLD, the local database is basically used as a cache. We considered using MongoDB at some point to store documents, so the fact we’re using SQLite or Postgres should be seen more as an implementation detail for how ComposeDB provides its functionalities.

msena · May 22, 2023, 2:01pm

We’ve started implementing this feature and will report back here when it’s ready for beta testing!

pickleback · May 23, 2023, 5:37pm

0xEE3CA4dd4CeB341691 · July 8, 2023, 11:10pm

Do we have a ball park estimate of when would this be launched?

0xEE3CA4dd4CeB341691 · July 8, 2023, 11:12pm

msena · July 11, 2023, 5:01pm

It goes without saying that all time estimates are subject to change, but the ComposeDB team has approx. 2 weeks left of development and testing then will be aiming to release sometime in the week(s) following that.

mhkarimi78 · July 24, 2023, 10:25am

we are waiting

0xEE3CA4dd4CeB341691 · July 25, 2023, 4:11am

Don’t think this is currently ready. but looks like good progress has been made https://github.com/ceramicstudio/js-composedb/pull/128