At 14:54 UTC on Saturday, July 18th, 2020, we discovered that a small number of IRCCloud users were seeing logs from other users. This affected certain users on our “hathersage” and “stonehaven” connection servers between 12:53 UTC on Friday, July 17th and around 16:00 UTC on Saturday, July 18th.
In order to resolve this issue, we have had to delete some logs for users during this period. All users affected by this issue will receive an email with more details of exactly which logs were affected.
This incident was triggered by a misconfiguration on our part, and we’re very sorry for this failure to maintain your data integrity. We take seriously the trust placed in us to safeguard your data privacy and security, and we are committed to restoring that trust.
How this happened
This issue was related to our ongoing work to move our connection servers to a more decentralised architecture, which, as we’ve mentioned before, will allow us more flexibility on where we can host these servers.
Part of this work involves connection and buffer (channel/private message) metadata being stored in a separate, local database for each connection server, for performance and reliability reasons. This means that connection and buffer IDs are generated independently on each server, rather than by a single central database. These IDs are namespaced with the ID of the server they were generated on, and each server stores the offset of the IDs it’s currently issuing.
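To make the scheme concrete, here is a minimal illustrative sketch of per-server ID generation. It is not the actual IRCCloud implementation: the `IdAllocator` class, the tuple layout of the IDs, and the field names are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class IdAllocator:
    """Hypothetical per-server ID allocator (names and layout are assumptions)."""

    server_id: int   # namespace: which connection server issued the ID
    offset: int = 0  # last sequence number issued by this server

    def next_id(self) -> Tuple[int, int]:
        # Advance the locally stored offset and return a namespaced ID.
        self.offset += 1
        return (self.server_id, self.offset)


# Two servers issue IDs independently; the server ID keeps them distinct.
server_a = IdAllocator(server_id=1)
server_b = IdAllocator(server_id=2)
ids = [server_a.next_id(), server_a.next_id(), server_b.next_id()]
```

In a scheme like this, uniqueness depends on two things: no two live servers sharing a server ID, and a server never restarting its offset below a value it has already issued.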
We had originally started to migrate users to hathersage on March 22nd, in the final step of a rolling upgrade of all of our connection servers. However, this migration was aborted when the server suffered a hardware failure on April 18th. As part of the rollback, we moved these users back to another server (stonehaven).
When we finally received replacement hardware for hathersage, we restarted the migration at 12:53 UTC on July 17th. This was the first time we had placed users back on a freshly configured server which reused a server ID, and our server deployment procedure did not prompt us to configure the offset. The considerable amount of time which had elapsed since the previous migration, due to the hardware failure and other factors, also meant that this process was not fresh in our minds.
Because of this, hathersage started to issue connection and buffer IDs which duplicated those which already existed on stonehaven. These duplicated IDs meant that logs from multiple users’ buffers were being combined together in our log data store. When one user fetched a backlog, the logs from both buffers were returned.
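The failure mode can be reproduced in a few lines. This is a simplified sketch under stated assumptions: the log store is modelled as a dictionary keyed only by buffer ID, and `SERVER_ID`, `append_log`, and `fetch_backlog` are hypothetical names, not the real storage API.

```python
from collections import defaultdict

# Hypothetical shared log store keyed only by buffer ID (an assumption).
log_store = defaultdict(list)


def append_log(buffer_id, line):
    log_store[buffer_id].append(line)


def fetch_backlog(buffer_id):
    return list(log_store[buffer_id])


SERVER_ID = 7  # hypothetical server ID, reused by the rebuilt server

# A buffer created during the first migration keeps its ID after its
# owner is moved to another server.
old_buffer = (SERVER_ID, 1)
append_log(old_buffer, "alice: message in an existing buffer")

# The rebuilt server starts from an unset (zero) offset, so the first ID
# it issues duplicates one that is already in use.
new_offset = 0
new_offset += 1
new_buffer = (SERVER_ID, new_offset)
append_log(new_buffer, "bob: message in a new buffer")

# Both users' lines now sit under the same key, so either user's
# backlog fetch returns both.
backlog = fetch_backlog(new_buffer)
```

Here `backlog` contains lines from both users, because nothing in the store distinguishes the two buffers once their IDs match.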
The immediate fix involved correctly setting the offset, and this was completed at 15:45 UTC on Saturday, July 18th. At 16:07 UTC, we disabled backlog fetching for all users on hathersage and stonehaven to prevent incorrect logs from being displayed to users.
We then deleted all new buffers and connections created on hathersage during the 27-hour window since the start of the issue. This was completed at 16:28 UTC and backlog fetching was restored on hathersage.
Since the reused IDs had been in use on stonehaven for much longer, deleting all affected logs would have caused considerable data loss for those users. Instead, we began a more involved process of purging those logs of any data leaked from other users. This was completed at 22:03 UTC and backlog fetching was subsequently restored on stonehaven.
Summary of the impact to users
hathersage: Users who created new buffers and connections between 12:53 UTC on Friday July 17th, and around 16:00 UTC on Saturday July 18th may have seen logs from stonehaven accounts in those buffers. They may also have had those newly created logs exposed back to the stonehaven accounts. Those connections and buffers have now been deleted from accounts on hathersage.
stonehaven: Some users who were initially moved during the aborted migration earlier this year may have had some of their logs exposed to users now on hathersage. They may also have seen newly created logs from the affected users on hathersage in their own accounts.
All log entries that were inadvertently shared between users have now been deleted from log storage.
We will be emailing both sets of users with details on their affected logs.
How we’ll avoid this in future
We will require a manual confirmation step to set the offsets on a connection server before it can be brought into use.
We will implement additional checks when backlogs are fetched to ensure that stored logs cannot be leaked to the wrong users in the case of a buffer ID collision.
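The two measures above could look something like the following. This is a minimal sketch under assumed names (`ConnectionServer`, `confirm_offset`, `OffsetNotConfirmed`, `fetch_backlog`, and the per-line owner field are all hypothetical), not a description of the real deployment tooling or log store.

```python
class OffsetNotConfirmed(Exception):
    """Raised when a server is asked for an ID before its offset is set."""


class ConnectionServer:
    """Hypothetical server that refuses to issue IDs until an operator
    has explicitly confirmed the starting offset."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.offset = None  # deliberately unset; no silent default of zero

    def confirm_offset(self, offset):
        if offset < 0:
            raise ValueError("offset must be non-negative")
        self.offset = offset

    def next_id(self):
        if self.offset is None:
            raise OffsetNotConfirmed(
                f"offset for server {self.server_id} has not been confirmed"
            )
        self.offset += 1
        return (self.server_id, self.offset)


def fetch_backlog(store, buffer_id, user_id):
    """Return only the lines recorded for the requesting user.

    `store` is assumed to map buffer IDs to lists of (owner, line) pairs,
    so a buffer ID collision cannot expose another user's lines.
    """
    return [line for owner, line in store.get(buffer_id, []) if owner == user_id]
```

The first guard turns a forgotten configuration step into a hard failure at startup rather than a silent collision. The second is defence in depth: even if two buffers end up sharing an ID, a fetch returns only the requesting user's lines.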
The IRCCloud service experienced an extended period of downtime between around 22:40 UTC on July 7th, 2020 and 19:10 UTC on July 8th, 2020 due to a fault with our internet service provider. This was the second time such an outage has occurred this year, and we are as frustrated as you are with this unacceptable situation.
This outage affected seven of the eight servers we use to handle our outgoing IRC connections. As we explained in our report on the previous outage, these outbound connection servers have an unusual networking configuration, so they are hosted by a specialist ISP that we cannot quickly migrate away from.
We have built sufficient redundancy to ensure that we can survive the loss of one or two of these eight servers, but we don’t yet have the capability to survive the loss of all of them.
Although we haven’t yet received a full explanation from our ISP, it appears that this problem was caused by some kind of networking failure which required a technician to visit the datacenter to resolve. For reasons which are unclear, it took more than 12 hours for this to be arranged.
As we mentioned in our previous report, we have been working on making changes to our system so we can move to a new ISP. These changes are almost complete and we will now accelerate our migration away from this ISP, which we hope to complete within the next few months.
If you’re an IRCCloud subscriber, we’re happy to issue you a month’s refund in compensation for this downtime - drop us an email at firstname.lastname@example.org with the email address associated with your account.
Last night we experienced approximately 12 hours of downtime between around 18:00 and 06:40 UTC, caused by a prolonged period of internet routing issues which our ISP has attributed to a failed line card in one of their routers. This was our longest period of downtime in many years and we’re very sorry for the disruption it caused.
Running a large service which interfaces with the venerable IRC protocol poses a different set of challenges to most modern web services. Firstly, we have to manage a large number of outbound IRC connections while ensuring as few disconnections as possible. Secondly, IRC networks expect our users to connect from a consistent set of IP addresses. Lastly, IRCCloud is subject to a high volume of distributed denial of service (DDoS) attacks.
These constraints mean that our outbound connection servers, which actually make your outbound IRC connections, have been hosted for years by a specialist DDoS-resistant hosting service provided by a major ISP. This is a costly part of our infrastructure, and it wouldn’t be economical for us to completely duplicate these servers elsewhere to guard against rare situations like the one last night. Switching to another ISP - even if we could find one to provide the required servers at short notice - would involve a long process of getting new IP addresses whitelisted by IRC networks.
Our current architecture also restricts us to running our outbound connection servers in relatively close proximity to the rest of our infrastructure (which is hosted on Amazon Web Services). Over the last few months we’ve been working on a significant update of our backend software to remove this restriction - in fact, we started rolling this update out yesterday.
These improvements will make it easier for us to investigate other approaches for our outbound connection servers in future, and we’ll certainly be discussing network redundancy with our ISP and future providers.
If you’re an IRCCloud subscriber, we’re happy to issue you a month’s refund in compensation for this downtime - drop us an email at email@example.com.