Thursday September 20 2012 • posted by James

Downtime post-mortem

Last week, on Thursday September 13, we attempted a planned upgrade of IRCCloud with an expected downtime of up to 2 hours. It ended up taking all day, with partial-to-no availability for most users throughout.

Sorry for the poor service - we realise how much of a pain it is to be without your normal IRC client, and we’re busy working to make sure it doesn’t happen again.

Background

IRCCloud started out hosted on one server, where we changed settings and tuned the system as we went along. As part of growing the service, we recently invested time in moving all our configuration management to Chef, with automated deployments and upgrades provided by Jenkins (the subject of another post in future).

These changes allow us to easily deploy additional servers, safely upgrade the systems holding open your IRC connections, and run the site in a more robust and manageable way.

We tried to move the core service to our new infrastructure last week, but ultimately rolled back to the old system to properly address issues uncovered by the migration.

Alpha site

For the last 4 months we’ve been running an updated alpha version of the site with a small group of testers, in parallel with the beta.

The alpha site lives on the new servers and infrastructure. Our plan was to migrate all users to the new site last week. This was going to involve some downtime while we shut down user accounts and IRC connections, migrated them to a new database structure, and started them up again on the new system. Logs also needed migrating, but this was being done ahead of the downtime.

What happened

On Thursday morning we shut down the beta site, disconnecting everyone as planned, and began to migrate user accounts in the order they signed up. At the same time we redirected the irccloud.com domain to the new server, so people could immediately log in once migrated.

After resolving some initial problems with the upgrade procedure, we noticed the migration queue was being processed far more slowly than expected, owing to heavy load on the alpha server.

Throughout the day, we patched the system to address problems that were cropping up. With hindsight, we should definitely have aborted early on in the process and switched back to the old system.

We also had a couple of crashes caused by incorrectly set configuration parameters: settings we had previously changed on the old system, but missed when building the new server. All our configuration is managed by Chef now, so mistakes like this will be much less likely when we deploy new servers in future.
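As a rough illustration of how missed settings like these could be caught before a migration, here is a minimal sketch in Python that compares the settings of two servers and reports any that differ or are absent. It is not the tooling we use: the function name and the example setting names and values are invented for illustration, and it assumes each server's settings have already been collected into a dictionary.

```python
def config_drift(old, new):
    """Return {setting: (old_value, new_value)} for every setting that differs.

    A setting that is absent on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(old) | set(new)):
        old_value = old.get(key)
        new_value = new.get(key)
        if old_value != new_value:
            drift[key] = (old_value, new_value)
    return drift


# Hypothetical settings, invented for this example.
old_server = {"max_open_files": 65536, "tcp_backlog": 1024, "timezone": "UTC"}
new_server = {"max_open_files": 1024, "timezone": "UTC"}

drift = config_drift(old_server, new_server)
for name, (old_value, new_value) in drift.items():
    print(f"{name}: old={old_value!r} new={new_value!r}")
```

A check like this only finds differences between two snapshots; keeping the settings in configuration management is what stops them drifting apart in the first place.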

At the end of the day, we had almost all users migrated to the new server. We left the remaining few migrations running overnight. Things were working, but sluggish.

The following morning, rather than frantically patch the system in place, we decided to move everyone back to the old server. Although painful, this was better than inflicting slow and laggy performance on everyone for any longer. We should have done this much sooner.

Artificial load testing: too artificial

As part of testing performance on the new server, we generate artificial load with a test suite. It turns out our artificial tests were too artificial. The distribution of messages/joins/quits/parts/renames/etc we use for testing no longer resembles the traffic we see in the live environment.

We’ve fixed many of the issues that led to the server overloading, and we’re improving our load testing to better model the specifics of a full restart and ongoing operations. We’re also working to better simulate other transient events in our test environment (e.g. netsplits and mass disconnects/reconnects) before we migrate any more accounts to the new system.
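To make the idea of matching test traffic to live traffic concrete, here is a minimal sketch in Python. It measures each event type's share of an observed sample and then draws synthetic events in the same proportions. This is an illustration under assumptions, not our test suite: the function names are hypothetical and the sample counts below are made up rather than measured.

```python
import random
from collections import Counter


def event_weights(observed_events):
    """Return each event type's share of the observed traffic."""
    counts = Counter(observed_events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def synthetic_events(weights, n, seed=None):
    """Draw n event types at random, in proportion to the given weights."""
    rng = random.Random(seed)
    kinds = sorted(weights)
    return rng.choices(kinds, weights=[weights[k] for k in kinds], k=n)


# Made-up sample standing in for events recorded from live traffic.
observed = (["message"] * 70 + ["join"] * 12 + ["quit"] * 8
            + ["part"] * 6 + ["rename"] * 4)

weights = event_weights(observed)
load = synthetic_events(weights, 1000, seed=1)
print(Counter(load))
```

Sampling event types independently like this reproduces the overall mix but not bursts, so transient events such as a netsplit (many quits followed closely by many joins) would still need to be scripted separately.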

Ongoing issues

We have a list of IRC networks where our session limit needs updating for the new servers. We’ve been contacting the relevant network admins this week to improve matters. Please email team@irccloud.com if you’re an IRC network admin and have any questions.

Some users are reporting a varying backlog gap in some channels between the end of August and near the time of the failed migration. This is a temporary display problem that should resolve itself as more activity in those channels pushes the gap out of cache. This might take a bit longer for quieter channels though.

Separately, we have a known issue on beta where backlog from before August 12th is inaccessible. This backlog is safe: it was moved to the new system because we ran out of space on the current server. This is obviously an ugly solution and would have been fixed by the database migration. Since the rollback, this data is again unavailable. It will be accessible again once we migrate over to the new server.

If you’re having any other issues in the meantime, you can contact us via email, Twitter, or in our #feedback channel on irc.irccloud.com.