I’m Howie, founder of Protico (protico.io), the market-leading Web3 Chat provider.
Our team has been using Ceramic to build a Web3 chat network for over a year. Recently, together with our global users, we detected and confirmed a serious issue that the Ceramic team may want to investigate carefully.
We have found that, when loading streams by streamID, more and more of the basicProfile and socialConnection data we used to retrieve now comes back empty (result content: {}).
However, when we search on Cerscan to double-check, we can still see the relevant data there.
Additional information about the debugging process:
When we load the stream, we see a CACAO-expired error. We therefore re-ran the load with sync=1; afterwards, the anchor status came back as "failed", "pending", or "not requested".
On Cerscan, more commit IDs are found than in the real-time data we get from the testnet.
For example, this stream on Cerscan [https://cerscan.com/testnet-clay/stream/k2t6wyfsu4pg2gxysufkn3mzaa7x1usywuowllihdvq68r72jjh6mgi0a6cp3t]. However, when we run glaze stream:commits k2t6wyfsu4pg2gxysufkn3mzaa7x1usywuowllihdvq68r72jjh6mgi0a6cp3t on our own testnet-clay Ceramic node, we only find 2 commit IDs: [ 'kjzl6cwd29gy9khjh4eqs9vex4ob17209m93tkzfmpm4e9ea2yi73pc20d6aj28', 'k6zn3rc3iwgui3dxnvhz7ia3rppnqtodtwsxz1zqsk373ousxfx3efk6ydsp3uxoiv2z7p1f6vder4os75qqaoghznuubywyy4k800ha7pu4irs3pzg2wb7' ]
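The check above boils down to diffing the commit IDs the explorer shows against what our own node returns for the same stream. A small sketch (the function name and the commit IDs in the usage note are placeholders, not real identifiers):

```typescript
// Sketch of the diagnostic described above: compare the commit IDs visible
// on an explorer (e.g. Cerscan) against the ones a local node returns for
// the same stream, to find commits the local node is missing.

function missingCommits(explorerCommits: string[], localCommits: string[]): string[] {
  const local = new Set(localCommits);
  return explorerCommits.filter((cid) => !local.has(cid));
}

// Placeholder IDs: if the explorer shows three commits but the node only
// has two, the third one is what the node is missing.
const missing = missingCommits(["cid-a", "cid-b", "cid-c"], ["cid-a", "cid-b"]);
```

In our case the local node returns fewer commits than Cerscan shows, which is what prompted this report.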
This is a serious issue, especially since our user base is growing rapidly and some of their data has simply gone missing on the testnet. We are building a chat network, which means most of the data is confidential and important to the users. We need to find out what went wrong as quickly as possible; otherwise it would be a huge loss and risk for the whole Ceramic ecosystem.
Hi Howie, I’m a product manager for Ceramic. My sincere apologies for any stream issues - we take data loss seriously. You and Bofu have an ambitious vision for chat and we want to do our part to support you.
We recently had an outage over New Year's, from 09:38 UTC on 29 Dec 2022 until 13:45 UTC on 3 Jan 2023, due to an edge case involving the deletion of the DNS record for the Ceramic Anchor Service on mainnet. I've notified our engineers internally and they will reply here ASAP.
Thanks for replying to our report.
It’s great to know that the issue is being handled by someone. Thank you.
However, as with the other issue report on the forum, we have actually been encountering this issue for at least one month, since late November. This might be something you can check from your side as well.
In any case, our team would be happy to work with you to figure out what went wrong together.
Hi @howi.eth, I just looked through the logs from our CAS (Ceramic Anchor Service) for the clay testnet, and I see no logs about the commit with the expired CACAO at all (the commit with IPFS CID bagcqceraabwjj4ov7mkkvkvhszlyxzx5o6dsbct72lyaog5kqfbuulal3qjq), or for the genesis commit for that stream (CID bafyreihibk6wseudmmry2yo6vapvdfdq62s62xifznglkg77tkhxkfnyde). It seems like commits from your node aren’t making it to our anchoring service.
How are you running your node? Can you share your Ceramic config file (defaults to being stored in ~/.ceramic/daemon.config.json) and the CLI command you use to launch Ceramic?
Can you also check your Ceramic node logs for any errors or warnings related to connecting to the anchoring service?
Note that it is important for streams to be anchored in a timely manner; otherwise there will be data loss when the CACAOs granting write permissions for the commits in those streams expire. It looks like your CACAOs use a 1-week expiration timeout, so I would expect streams to work for about a week after they are created and then become corrupted. If we can get your node anchoring properly, those data-loss issues should stop.
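To make the timing concrete, here is a simplified sketch of the expiry rule (illustrative only, not Ceramic's actual validation code; real CACAOs carry ISO-8601 issued-at/expiry fields and are validated by the did/CACAO libraries):

```typescript
// Simplified illustration of the CACAO expiry behaviour described above.
// Not Ceramic's actual validation logic; types and names are ours.

const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

interface Cacao {
  issuedAt: number;  // Unix ms timestamp when the session was created
  expiresAt: number; // issuedAt + session lifetime (1 week here)
}

function makeCacao(issuedAt: number, lifetimeMs: number = ONE_WEEK_MS): Cacao {
  return { issuedAt, expiresAt: issuedAt + lifetimeMs };
}

// A commit signed with a CACAO stays valid if it was anchored before the
// CACAO expired; otherwise it is only valid until the expiry time.
function commitIsValid(cacao: Cacao, anchoredAt: number | null, now: number): boolean {
  if (anchoredAt !== null && anchoredAt <= cacao.expiresAt) return true;
  return now <= cacao.expiresAt;
}
```

So a stream written on day 0 but never anchored becomes unreadable roughly a week later, which matches the behaviour reported above.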
Hi @howi.eth, please disregard my previous message; I was looking at the wrong time range in the logs. Your commits are indeed making it to our anchor service. The problem is something else; I have a lead I am following up on now and will add more details soon.
Okay, the actual issue is a bit weird, but it also ultimately results in your update not getting anchored, causing it to eventually become invalidated when the CACAO times out after a week.
The reason why it's not getting anchored is interesting, though. This is apparently an old stream, and it previously had an anchor commit for the original genesis commit from when the stream was created. That original anchor commit is from July 2022 and was performed against the Ropsten Ethereum testnet, which is now decommissioned. These days our anchors for the Clay testnet happen on Gnosis Chain. Because that anchor commit is on a no-longer-existing blockchain, it cannot be successfully applied on any Ceramic node. That is why the recent update commit from Jan 2nd did not build on that anchor commit, but built on the stream's genesis commit instead. So far so good.
The problem is that the Ceramic node backing our CAS still has a copy of that old, no-longer-valid anchor commit. The CAS relies on a Ceramic node to perform conflict resolution and determine which commit to anchor when there are multiple conflicting commits for the same stream. But anchor commits always win conflict resolution over data commits. So the CAS Ceramic node keeps seeing the old anchor commit as taking priority over your new data commit, and refuses to anchor it.
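Roughly, the conflict-resolution rule at play looks like this (an illustrative sketch, not the actual js-ceramic implementation; the real ordering rules are more involved than a single tie-breaker):

```typescript
// Illustrative sketch of the conflict-resolution rule described above:
// when two candidate logs (tips) exist for the same stream, a log that
// contains an anchor commit wins over a log of plain data commits.
// Types and the tie-breaker are ours, not the actual js-ceramic code.

interface LogCandidate {
  tipCid: string;      // CID of the latest commit in this log
  isAnchored: boolean; // does this log contain an anchor commit?
  length: number;      // number of commits, used as a tie-breaker here
}

function pickWinner(a: LogCandidate, b: LogCandidate): LogCandidate {
  // Anchored logs take priority over purely data-commit logs.
  if (a.isAnchored !== b.isAnchored) return a.isAnchored ? a : b;
  // Simplified tie-breaker for illustration only.
  return a.length >= b.length ? a : b;
}
```

Since the CAS node still holds the stale Ropsten anchor commit, it keeps picking the old anchored log and never anchors the new data commit.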
This is a weird edge case left over from the migration of the Clay testnet from Ropsten to Gnosis Chain. The good news is that this issue can't affect mainnet; it's specific to the Clay testnet and dev-unstable Ceramic networks.
I am discussing with the rest of the team the best way to proceed to unbreak this, so that your writes to streams like this can go through without issue. It may well involve wiping the data from our hosted Ceramic nodes so the CAS stops being confused about the right history for these streams. That's something we'd need to give the community a bit of warning about, so it would most likely take at least a week to happen.
Sorry about the confusion and the impact from this bug! Hopefully we can get your test environment back up and running shortly.
Do you have any way to confirm that this issue is exclusively affecting users who have been using the system for a while, since before early August? Are any user accounts who signed up for the first time in just the last few months seeing issues as well?
Thanks for checking the report. It’s good to know someone is on this serious issue.
After checking our log records, we did find that the issue also affects user accounts that signed up for the first time after August.
Here are some examples: did:pkh:eip155:1:0x71cdfcc1d96f0cdfe0ea8ec6b19cf5bcfe4d4e16
Signup date: 09.06.2022
There are 2 IDX streams for this user on Cerscan
did:pkh:eip155:1:0x1d690d5b01225384bb00a5bd42dbc473a901059b
Signup date: 09.14.2022
The user's socialConnection record is missing on Cerscan
did:pkh:eip155:1:0x4a11638d730a8f5ab916ad9267030bef4659f5e1
Signup date: 10.11.2022
There are 2 IDX streams for this user on Cerscan
did:pkh:eip155:1:0x51ada5ae86db089879cfec972c2e59ab1e777b7a
Signup date: 11.24.2022
We can't find any data on Cerscan, but we can get the data through the Ceramic API
We hope this information helps.
We also noticed that you plan to suspend the testnet and ask everyone to move to mainnet by 1/18.
However, we suspect there are more issues in other projects than the ones we detected in ours. So please make sure you have a solid backup plan, so that early developers like us, who have built a lot on the network, can make a smooth transition.
Thank you for sharing the additional accounts you’ve seen issues with, I will try to investigate them further later this week.
For now I just wanted to clarify: we have no plans to decommission our Clay testnet. The testnet will remain online and operational for the foreseeable future. What is happening tomorrow is that we are clearing the saved data from the testnet nodes that we operate. Other nodes on the testnet network will be unaffected, and the nodes we run will continue to operate normally after the data-clear event; it's just that older streams will no longer be stored on those nodes. I hope that clarifies things and reduces your concerns about developers being able to develop and test before going to mainnet.
Okay, it looks like there are multiple different things going on here:
did:pkh:eip155:1:0x71cdfcc1d96f0cdfe0ea8ec6b19cf5bcfe4d4e16
Signup date: 09.06.2022
There are 2 IDX streams for this user on Cerscan
The two IDX streams have different capitalizations for the controller. In an older version we changed our libraries that manage DIDs to always lower-case them, precisely to prevent duplication caused by the same DID showing up with different capitalizations. So the duplicate is probably one stream from before we started lower-casing and one from after. This issue shouldn't happen for new data going forward.
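To illustrate: the same Ethereum account can be rendered mixed-case (e.g. checksummed) or lower-case, producing two string-distinct did:pkh values and hence two IDX streams. A minimal normalization sketch (`normalizeDidPkh` is a hypothetical helper, not an API of the dids/pkh libraries; the mixed-case rendering below is made up for illustration):

```typescript
// Hypothetical sketch of did:pkh normalization: lower-case the address
// segment so that differently capitalized renderings of one account
// collapse to a single controller DID. Not a dids/pkh library API.

function normalizeDidPkh(did: string): string {
  const parts = did.split(":");
  // did:pkh:eip155:<chainId>:<address> -> lower-case the address segment
  if (parts.length === 5 && parts[1] === "pkh" && parts[2] === "eip155") {
    parts[4] = parts[4].toLowerCase();
  }
  return parts.join(":");
}

// Two capitalizations of one account collapse to a single controller:
const a = normalizeDidPkh("did:pkh:eip155:1:0x71CdfCc1D96f0CDfe0eA8Ec6b19Cf5bcfE4d4E16");
const b = normalizeDidPkh("did:pkh:eip155:1:0x71cdfcc1d96f0cdfe0ea8ec6b19cf5bcfe4d4e16");
```

With normalization like this applied at DID creation time, the duplicate-IDX situation above cannot arise for new data.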
did:pkh:eip155:1:0x1d690d5b01225384bb00a5bd42dbc473a901059b
Signup date: 09.14.2022
The user's socialConnection record is missing on Cerscan
It's true that this user's IDX does not have a link to the socialConnection document. I can't really tell why: I don't see any reference to commits that tried to set it in the first place, and without logs from the original Ceramic node where the updates were performed it's hard for me to tell more. You could try looking through your node's logs for any errors updating the IDX stream (k2t6wyfsu4pg2hwycctvjqf700wub6cckcge5ikqhv2mu4g5afrh6hdn84s8ao). Also, if the user logs in again and retries linking this social connection, I'd expect it to succeed.
did:pkh:eip155:1:0x4a11638d730a8f5ab916ad9267030bef4659f5e1
Signup date: 10.11.2022
There are 2 IDX streams for this user on Cerscan
This is interesting. At first glance it looks the same as the first issue: the controller is capitalized in one of the IDX streams but not the other. What's odd is that both streams were anchored at the same time (October 11th 2022, 11:13am GMT), meaning they were created around the same time, which makes it strange that two documents with different capitalization formats for the controller would be created. Maybe the user's browser had an older cached version of the authentication library, and on reload it got the newer version with the new logic? @zfer I wonder if you have any ideas how this could have happened?
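One way to surface duplicates like this across an index is to group IDX records by the lower-cased controller DID (illustrative sketch; the record shape and function name are hypothetical, not a Ceramic API):

```typescript
// Hypothetical sketch: group IDX stream records by lower-cased controller
// DID to find accounts that ended up with more than one IDX stream due to
// mixed-capitalization controllers.

interface IdxRecord {
  streamId: string;
  controller: string; // did:pkh controller as stored on the stream
}

function findDuplicateControllers(records: IdxRecord[]): Map<string, string[]> {
  const byController = new Map<string, string[]>();
  for (const r of records) {
    const key = r.controller.toLowerCase();
    const list = byController.get(key) ?? [];
    list.push(r.streamId);
    byController.set(key, list);
  }
  // Keep only controllers that map to more than one IDX stream.
  return new Map([...byController].filter(([, ids]) => ids.length > 1));
}
```

Running something like this over an index dump would show how widespread the duplicate-controller problem is, beyond the individual accounts reported here.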
did:pkh:eip155:1:0x51ada5ae86db089879cfec972c2e59ab1e777b7a
Signup date: 11.24.2022
We can't find any data on Cerscan, but we can get the data through the Ceramic API
I'm definitely not sure about the last case yet, but it is very odd, especially because different capitalizations would usually imply different wallet/provider clients in use at the same time. Maybe something implementation-related here.
I was recently looking at a did:pkh normalization issue, i.e. both formats can appear in the dids/pkh libraries. I tested it and didn't think there was any issue (I thought it was normalized elsewhere), but maybe I missed something. We already have a story this cycle for normalizing did:pkh(s) in those libraries. That may help here, but it's not clear yet.
Thank you, @zfer and @spencer, for checking this issue.
We believe this is an urgent and serious issue that should be on your radar.
Our project is deeply built on Ceramic, so we would be happy to work with you to resolve it and bring better reliability to the network.
Hi @zfer ,
Sorry for the late reply; I somehow missed your response.
You were right: we do use did-session on the client side in our project.
Do you have any ideas so far on how we can recover the missing data streams?
Hi @zfer @spencer,
It's been a while since your last reply.
I wonder if you have had time to look into the issue I reported and get it fixed?
The missing data is crucial for all our users, and I believe it can be brought back with some work on your end.
Please let me know if you need any further information.
Thanks
Sorry, I didn’t realize this was still an issue for you. The Clay testnet has much lower data reliability guarantees than mainnet and is mostly intended for development purposes. I had assumed you could just wipe the data and reset your app, and that your concern was more about whether this would happen again after you moved to mainnet. If you’re trying to provide production-level data persistence guarantees, then I definitely encourage you to move to mainnet instead of relying on the testnet.
As for what we can do about your testnet data: is it acceptable to recover the data in the existing streams, but require creating new streams in order to update them again? Or are you okay with losing some of the existing writes if the current streams become usable again for both reads and writes?
Also, for data going forward (and possibly similar issues), we very recently made a fix in the latest @didtools/pkh-ethereum to normalize Ethereum addresses (lowercase/checksummed addresses).