Over the course of Wednesday, March 6th through Friday, March 8th, Nodejitsu experienced its most serious service disruption to date. A significant number of users' applications were rendered intermittently unreachable due to load balancing and routing issues. Furthermore, a variety of circumstances combined that both increased the prevalence of the issues and prolonged the amount of time it took to reliably fix them. We consider the situation to have been completely unacceptable, and since then, the majority of our team has been working hard to ensure that it never happens again.
At this point, we'd like to take the time to explain what went wrong, and give some insight into some of the things we've been doing to make sure these issues don't repeat themselves.
Our unfortunate tale begins last Wednesday, March 6th. In the early hours of the morning, our DevOps team deployed a fairly straightforward change to our load balancing layer. We use CouchDB to store app state, and the CouchDB `_changes` feed to keep our load balancers up-to-date when apps are created, destroyed, or deployed to new drone servers. We had, up until this point, been using custom code to follow the `_changes` feed with Node's HTTP API, but had decided to switch to using `follow` instead. We made the changes, we reviewed the changes, and the changes spent time in our staging environment being tested. Everyone, myself included, was confident that this change would both eliminate the occasional hiccups we'd been seeing with the `_changes` feed, and be too straightforward to introduce new problems. So, in the wee hours of Wednesday, we deployed it to production.
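For the curious, the change amounted to replacing hand-rolled HTTP polling with `follow`'s callback API. The sketch below shows the general pattern of keeping a routing table in sync with a `_changes` feed; the database URL and the `drone` field are hypothetical stand-ins, not our actual schema:

```javascript
// Pure helper: fold one _changes-feed entry into an in-memory routing table.
// The `drone` field is a made-up stand-in for our real app-state documents.
function applyChange(routes, change) {
  if (change.deleted) {
    delete routes[change.id];               // app destroyed: drop its route
  } else if (change.doc && change.doc.drone) {
    routes[change.id] = change.doc.drone;   // app created or redeployed
  }
  return routes;
}

// Wiring the helper to CouchDB with follow (npm install follow):
function watchAppState(routes, dbUrl) {
  var follow = require('follow');
  follow({ db: dbUrl, include_docs: true, since: 'now' }, function (err, change) {
    if (err) return console.error('changes feed error:', err);
    applyChange(routes, change);
  });
}
```

The appeal of `follow` is exactly that it hides the reconnect-and-resume bookkeeping that our custom code had to get right by hand.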
As anyone who's ever tried to write a serious benchmark of a network service can tell you, it's not easy to replicate large amounts of authentic internet traffic for testing. It's easy to make one computer, or several, make lots and lots of requests to the same host. The real-world experience of having millions of people all over the world making requests to your site or service, however - requests from different operating systems, different hardware, all travelling over completely different network paths - is extremely difficult to replicate accurately. Needless to say, the testing our balancers got in our staging environment did not turn out to be a reasonable facsimile of our usual production load.
In and amongst all of the events which unfolded, there was required travel taking place on Thursday and Friday for two key members of our DevOps team, including my own flight for my move from San Francisco back home to New York.
Wednesday, and most of Thursday, stayed quiet. The trouble with a subtle bug that produces an incorrect state is that it starts out as an occasional, sporadic issue - if thousands of apps are working fine, but one user is seeing a problem, simple deduction suggests that the issue, while not necessarily the user's fault, lies with some aspect of their app that sets it apart from the rest of the apps in question. Since this looked like a few odd users seeing something strange, and not like a large systemic problem, our support engineers began investigating the apps in question rather than our infrastructure.
In the meantime, they offered a simple workaround: if a user runs `jitsu apps start` on their app, the app will be deployed to a new drone, the database will be updated, and the balancers will receive the new network location for the app. In almost all cases, this returns the app in question to a working state, and gives us time to look at what happened without giving our users serious downtime.
Sometime around dawn UTC on Friday, a few of our users who were affected by the ongoing problems cleverly decided to write scripts that would run `jitsu apps start` over and over and over. While I don't fault our users for taking measures to ensure their uptime, this very quickly escalated into a situation of "deploy spam" - a significant number of users got fancy with Node's `child_process` module, and had several concurrent instances of `jitsu` in memory, all trying to start the same app over and over.
If you think that sounds a lot like an accidental DDoS, you're right.
By the time my flight was departing from San Francisco early Friday, it had become apparent that this was a subtle and systemic problem - it was still only affecting a very small percentage of apps, but it was affecting "too simple to fail" apps (simple tests and the like) and, most importantly, different load balancers would respond differently, indicating a more pervasive issue. We gathered our Ops team, as best we were able, given the travel circumstances, and began investigating in earnest.
By mid morning, our deployment infrastructure was under a level of load that surpassed what we saw during the last hour of Node Knockout this past year. We were up around 20-30x normal deploy traffic from the aggressive workarounds, and our CouchDB started feeling the strain. The balancers started to get even more out-of-sync with one another, and deploys started to fail from CouchDB timeouts and other related issues.
And that brings me to the other bug.
We operate with a pool of free drone servers, under normal circumstances - when it gets below a certain threshold, auto-provisioning algorithms kick in and start provisioning new VMs to add extra capacity to our cloud. This process is sometimes manual, but it generally requires very little attention from DevOps.
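The pool check itself is simple; here's a minimal sketch, with the threshold value and function name invented for illustration:

```javascript
// How many new VMs to provision to top the free pool back up.
// POOL_THRESHOLD is a made-up number, not our real configuration.
var POOL_THRESHOLD = 10;

function dronesToProvision(freeDrones, threshold) {
  if (threshold === undefined) threshold = POOL_THRESHOLD;
  // Below the threshold: provision the difference. At or above: do nothing.
  return freeDrones < threshold ? threshold - freeDrones : 0;
}
```

The catch is that provisioning a VM takes minutes, so a sudden spike can drain the pool far faster than this loop can refill it.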
Under a massive sudden load spike, however, and with serious ongoing database issues, the free drone pool got very low, and the high error rate led to more inconsistent state. This time, several attempts to `jitsu apps start` the same app would each receive a success message, and each would tell the database - and therefore the balancers - that the app was running at the same IP address.
Our standard drone VMs are small - 256MB of RAM - and we only provision one app per small VM. They really just don't work well otherwise.
The problem, then, is what happens when, for example, three or four Express apps, all using the default port 3000, all land on the same IP address. Only one app - usually the last one to start - would be running successfully, and the other users would visit their URLs and see a completely different site.
By the time this happened, we realized it was time to admit defeat, roll back our balancer changes, and start looking into ways to lessen the load. We put our API into a short emergency maintenance window, let everything calm down a little, added a lot of new capacity to our drone pool, and reverted to the pre-Wednesday version of our load balancers, prior to the switch to `follow`.
While this was going on, Maciej Malecki, our self-styled Hotfixer-in-Chief, attacked our provisioning service head-on, and successfully found and fixed the edge case that allowed apps to incorrectly be deployed to already-used drones when the pool was empty. With the balancers reverted, and this edge case fixed, the cloud rapidly returned to normal.
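The shape of that fix comes down to a guard in the allocator. All names here are hypothetical - the real provisioning service is far more involved - but the class of bug looks like this:

```javascript
// Hypothetical sketch: with an empty free pool, the allocator must
// provision a fresh VM rather than fall through and hand out a drone
// that's already running someone else's app.
function allocateDrone(freePool, usedPool, provision) {
  if (freePool.length === 0) {
    freePool.push(provision()); // the fix: never reuse an occupied drone
  }
  var drone = freePool.pop();
  usedPool.push(drone);
  return drone;
}
```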
Where We Go From Here
We're determined that these issues - or any issues of this size and scope - not happen again, and we’ve taken the following measures towards prevention:
- We have initiated a complete overhaul of both our staging environment and our staging practices, and are reaching for new heights of OCD-level Continuous Integration and QA.
- Our staging environment has been completely replaced. As of this writing, every VM in question was deleted and reprovisioned from scratch.
- We're phasing out most of our manual QA in favor of an automatic deployment bot that will send amusing, demeaning emails to the team whenever app deploys don't work in staging.
- We've implemented an improved approach to internal load testing in staging, seeking to better answer the complex question of how to accurately simulate real traffic from the wild internet, and to avoid any more sudden surprises when new code meets production load.
- We are greatly expanding and decentralizing our CouchDB, to prevent any database growing pains as we continue to scale.
- In the future, scheduled deploys won’t happen if several senior members of the DevOps team aren’t available due to travel or other circumstances.
In closing, I apologize profusely. Issues like this are unacceptable in any hosting provider, and our commitment to being the best Node.js host out there won't let us let things like this slide.