At 14:54 UTC on Saturday, July 18th, 2020, we discovered that a small number of IRCCloud users were seeing logs from other users. This affected certain users on our “hathersage” and “stonehaven” connection servers between 12:53 UTC on Friday, July 17th, and around 16:00 UTC on Saturday, July 18th.
To resolve this issue, we have had to delete some users’ logs from this period. All users affected by this issue will receive an email with more details of exactly which logs were affected.
This incident was triggered by a configuration error on our part, and we’re very sorry for this failure to maintain your data integrity. We take seriously the trust placed in us to safeguard your data privacy and security, and we are committed to restoring that trust.
How this happened
This issue was related to our ongoing work to move our connection servers to a more decentralised architecture, which, as we’ve mentioned before, will allow us more flexibility on where we can host these servers.
Part of this work involves connection and buffer (channel/private message) metadata being stored in a separate, local database for each connection server, for performance and reliability reasons. This means that connection and buffer IDs are generated independently on each server, rather than by a single central database. These IDs are namespaced with the ID of the server they were generated on, and each server stores the offset of the IDs it’s currently issuing.
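As a rough illustration of the scheme described above, here is a minimal sketch in Python. It assumes an ID is simply a pair of server ID and a per-server sequence number; the class name IdAllocator, its fields, and the pair representation are hypothetical and are not taken from our actual implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class IdAllocator:
    """Issues IDs local to one connection server (hypothetical sketch)."""

    server_id: int   # namespace: the server that generated the ID
    offset: int = 0  # last sequence number issued in this namespace

    def next_id(self) -> Tuple[int, int]:
        # Advance the stored offset, then pair it with the server's namespace.
        self.offset += 1
        return (self.server_id, self.offset)


# Two servers generate IDs independently, with no central database.
server_a = IdAllocator(server_id=1)
server_b = IdAllocator(server_id=2)
first_a = server_a.next_id()
first_b = server_b.next_id()
print(first_a, first_b)  # (1, 1) (2, 1)
```

Under this assumption, IDs stay unique only as long as no two allocators share a server ID while disagreeing about the offset.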
We had originally started to migrate users to hathersage on March 22nd, in the final step of a rolling upgrade of all of our connection servers. However, this migration was aborted when the server suffered a hardware failure on April 18th. As part of the rollback, we moved these users back to another server (stonehaven).
When we finally received replacement hardware for hathersage, we restarted the migration at 12:53 UTC on July 17th. This was the first time we had placed users back on a freshly configured server which reused a server ID, and our server deployment procedure did not prompt us to configure the offset. The considerable amount of time which had elapsed since the previous migration, due to the hardware failure and other factors, also meant that this process was not fresh in our minds.
Because of this, hathersage started to issue connection and buffer IDs which duplicated those which already existed on stonehaven. These duplicated IDs meant that logs from multiple users’ buffers were being combined in our log data store. When one of those users fetched a backlog, the logs from both buffers were returned.
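The failure mode can be reproduced in a few lines. This is a simplified sketch, not our real data store: it assumes logs are keyed only by buffer ID, that an ID is a pair of server ID and sequence number, and that an unconfigured offset starts counting from zero. The server ID 7, the user names, and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical log store, keyed only by buffer ID.
log_store = defaultdict(list)


def append_log(buffer_id, user, line):
    log_store[buffer_id].append((user, line))


def fetch_backlog(buffer_id):
    return list(log_store[buffer_id])


SERVER_ID = 7  # assumed value; the replacement server reuses the same ID

# First migration: the original server issued sequence number 3 to a buffer
# that was later moved, with its ID unchanged, to another server.
old_buffer = (SERVER_ID, 3)
append_log(old_buffer, "alice", "hello from alice")

# Replacement server: same server ID, but the offset was never configured,
# so counting restarts from zero and old sequence numbers are reissued.
offset = 0
for _ in range(3):
    offset += 1
new_buffer = (SERVER_ID, offset)
append_log(new_buffer, "bob", "hello from bob")

# Both buffers share one key, so either user's backlog contains both lines.
backlog = fetch_backlog(new_buffer)
print(backlog)
```

Because the store has no notion of who owns a buffer, nothing in this sketch can tell the two buffers apart once their IDs match.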
The fix
The immediate fix involved correctly setting the offset, and this was completed at 15:45 UTC on Saturday, July 18th. At 16:07 UTC, we disabled backlog fetching for all users on hathersage and stonehaven to prevent incorrect logs from being displayed to users.
We then deleted all new buffers and connections created on hathersage during the 27-hour window since the start of the issue. This was completed at 16:28 UTC and backlog fetching was restored on hathersage.
Since the reused IDs had been in use on stonehaven for much longer, deleting all affected logs would have caused considerable data loss for those users. Instead, we began a more involved process of purging those logs of any data leaked from other users. This was completed at 22:03 UTC and backlog fetching was subsequently restored on stonehaven.
Summary of the impact to users
hathersage: Users who created new buffers and connections between 12:53 UTC on Friday July 17th, and around 16:00 UTC on Saturday July 18th may have seen logs from stonehaven accounts in those buffers. They may also have had those newly created logs exposed back to the stonehaven accounts. Those connections and buffers have now been deleted from accounts on hathersage.
stonehaven: Some users who were initially moved during the aborted migration earlier this year may have had some of their logs exposed to users now on hathersage. They may also have seen newly created logs from the affected users on hathersage in their own accounts.
All log entries that were inadvertently shared between users have now been deleted from log storage.
We will be emailing both sets of users with details on their affected logs.
How we’ll avoid this in future
We will require a manual confirmation step to set the offsets on a connection server before it can be brought into use.
We will implement additional checks when backlogs are fetched to ensure that stored logs cannot be leaked to the wrong users in the case of a buffer ID collision.
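The two measures above could look roughly like the following. This is only a sketch under stated assumptions: that a server object can refuse activation until an operator has explicitly confirmed its offset, and that every stored log entry records the user who owns it. All class, field, and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


class OffsetNotConfirmed(Exception):
    """Raised when a server is activated before its offset is confirmed."""


@dataclass
class ConnectionServer:
    name: str
    server_id: int
    confirmed_offset: Optional[int] = None  # None means "never confirmed"
    active: bool = False

    def confirm_offset(self, offset: int) -> None:
        # Manual step: an operator supplies the offset explicitly.
        self.confirmed_offset = offset

    def activate(self) -> None:
        if self.confirmed_offset is None:
            raise OffsetNotConfirmed(f"{self.name}: offset not confirmed")
        self.active = True


@dataclass(frozen=True)
class LogEntry:
    buffer_id: Tuple[int, int]
    owner_id: int  # assumed field: the user the entry was written for
    text: str


def fetch_backlog(entries: List[LogEntry], buffer_id: Tuple[int, int],
                  requesting_user_id: int) -> List[LogEntry]:
    # Return only entries that match the buffer AND belong to the requester,
    # so a buffer ID collision alone cannot expose another user's logs.
    return [
        e for e in entries
        if e.buffer_id == buffer_id and e.owner_id == requesting_user_id
    ]


entries = [
    LogEntry((7, 3), owner_id=100, text="hello from alice"),
    LogEntry((7, 3), owner_id=200, text="hello from bob"),
]
visible = fetch_backlog(entries, (7, 3), requesting_user_id=200)
print([e.text for e in visible])  # ['hello from bob']
```

The ownership check is deliberately independent of ID allocation, so it still holds if a collision is introduced by some other route.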