Recent instability and choosing the right cloud

From Thursday, September 5th through Monday, September 9th the Nodejitsu Platform-as-a-Service experienced a serious service disruption affecting both application deployments and running applications. Our operations team takes these events very seriously and worked tirelessly throughout this time to determine the complex root cause of this disruption.

Although the initial event that created the conditions for our downtime was outside our control, we want to take the time to explain what went wrong, what we're doing to avoid this going forward, and how we're working to give all of our users more transparency into the live status of our platform.

But first, I’d like to apologize on behalf of Nodejitsu for this outage. While it did not affect the availability of other services that we run (such as the public npm registry), the severity and the length of the outage are both completely unacceptable to us. I'm very sorry this happened, especially so soon after the launch of major new features we’ve been working on for a very long time.

Relying on the Wrong Cloud

In February, we announced our partnership with Telefónica Digital to use their Instant Servers product. The partnership was centered around expanding our support of the European developer community beyond Joyent's Amsterdam datacenter. Although we were enthusiastic at first, we did not see the interest we had hoped for from our users. We decided to continue supporting these datacenters given that the added overhead was minimal at the time.

That all changed when all of our servers running in both Telefónica's London and Madrid datacenters were accidentally deleted without notice around 8am Eastern Time on Thursday, September 5th, 2013. This event caused an unexpected cascading failure of our Load Balancers and Application Deployments which we will explain in detail. As of today we have terminated our partnership with Telefónica Digital. There is a formal press release at nodejitsu.com.

We remain committed to the European developer community through Joyent's Amsterdam datacenter and our recent sponsorship of both NodeConf.eu and JSConf.eu. If you had an application running in London or Madrid it is now running in Amsterdam.

Digging into the Specifics

As we started discussing in our series of posts on Infrastructure at Nodejitsu, there are several distinct services that we maintain at scale to power the Nodejitsu platform, among them our load balancers, our application deployment and provisioning systems, our DNS hosts, our databases, and our monitoring servers.

All of these services are orchestrated together using OpsMezzo, with multiple redundant hosts to avoid a single point of failure. These systems were working together in harmony right up until hundreds of Telefónica servers went offline simultaneously.


Load Balancers

Each one of our load balancers is aware of all other load balancer peers in all datacenters. This is how we can proxy an incoming HTTP(S) request that arrives at a European datacenter, e.g.:

GET / HTTP/1.1  
Host: blog.jyt.eu.ams1.jit.su  

to the drone running on a private IP address in the correct US datacenter:

GET / HTTP/1.1  
Host: blog.jit.su  
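
To make the mechanics concrete, here is a minimal sketch of Host-header routing built on the open-source http-proxy module. It is an illustration only, not our production balancer: the routing table, private IP, and port are made-up values.

var http = require('http');
var httpProxy = require('http-proxy'); // npm install http-proxy

// Hypothetical routing table: public Host header -> private drone address.
var routes = {
  'blog.jit.su': 'http://10.80.0.42:8080'
};

var proxy = httpProxy.createProxyServer({});

http.createServer(function (req, res) {
  var target = routes[req.headers.host];
  if (!target) {
    res.writeHead(502);
    return res.end('Bad gateway: no drone for ' + req.headers.host);
  }
  // Forward the request to the drone (or, for a cross-datacenter hop,
  // to the peer balancer that owns that drone).
  proxy.web(req, res, { target: target });
}).listen(80);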

Unfortunately, there was a bug in our peer detection which caused all of our load balancers to begin crashing when all of our Telefónica load balancers went offline. This is obviously our fault: we optimized for the more common case of a single datacenter going offline, not multiple datacenters.
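
As a purely hypothetical illustration of that class of bug (none of this is our actual peer-detection code), assuming that every datacenter always has at least one live peer is exactly the kind of shortcut that falls over when an entire provider disappears at once:

// Hypothetical peer table; every Telefónica peer has just gone offline.
var peersByDatacenter = {
  'telefonica-lon': [],
  'telefonica-mad': [],
  'joyent-ams': ['10.80.1.2', '10.80.1.3']
};

function pickPeer(datacenter) {
  var peers = peersByDatacenter[datacenter] || [];

  // Buggy assumption: "at least one peer in this datacenter is alive",
  // so peers[0] gets dereferenced unconditionally and the process
  // crashes when a whole provider goes away at once.

  // Safer: fall back to any datacenter that still has live peers.
  if (peers.length === 0) {
    var withPeers = Object.keys(peersByDatacenter).filter(function (dc) {
      return peersByDatacenter[dc].length > 0;
    });
    peers = withPeers.length ? peersByDatacenter[withPeers[0]] : [];
  }
  return peers[0] || null;
}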

To mitigate the issue, our devops team immediately deployed a fix to our load balancers and DNS hosts that removed the Telefónica-based load balancers. To fully resolve it, we needed to move up the release timeline for the next version of our load balancers. That release was originally scheduled for Monday, September 9th, and had already undergone testing in our staging environment, but had not yet seen production traffic.

This was obviously a mistake on our part and not one we are going to make again.


Application Deployments

Because of the large number of 502 errors being returned from our load balancers for the reasons outlined above, we suggested that users restart any applications in a bad state.

Running jitsu apps start triggers an update in our database, which repopulates our cache and invalidates the in-memory LRU cache in all of our load balancers. In most cases this worked as expected, but combined with the last outstanding issue from our new architecture it generated network congestion in our monitoring servers through excessive 'start' events.
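
Conceptually, the balancer-side invalidation looks something like the sketch below, built on the lru-cache module. The cache keys and data shape are assumptions for illustration, not our actual internals.

var LRU = require('lru-cache'); // npm install lru-cache

// Hypothetical per-balancer route cache: app domain -> live drone addresses.
var routeCache = LRU({ max: 10000, maxAge: 10 * 60 * 1000 });

// Filled lazily from the database-backed cache on a cache miss.
routeCache.set('blog.jit.su', ['10.80.0.42:8080']);

// When a 'start' event for an application arrives, its entry is dropped so
// the next request re-reads the fresh drone list instead of a stale one.
function onAppStart(appDomain) {
  routeCache.del(appDomain);
}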

This network congestion caused 'start' events to be dropped, leading to more 502 errors and in turn more 'start' events, ad nauseam. Similar to the issues with our load balancers, we had a significant improvement being tested in our staging environment, but did not have the confidence to deploy it immediately.

Through a concerted effort by our support and devops engineers we tested and deployed these improvements ahead of our release schedule on Sunday, September 8th. This improvement to forza and solenoid removed the network congestion permanently by getting initial start information over SSH instead of our TCP-based monitoring servers.

It also enabled a long-requested feature from our users: passing CLI arguments to node and to your start scripts. You can now include these in scripts.start in your package.json:

{
  "scripts": {
    "start": "node --harmony app.js --app-argument true"
  }
}

Looking for all of the available node options? Check out:

node --help  
node --v8-options  
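
If you want to verify those flags locally before deploying, npm runs the same scripts.start entry:

npm start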


Residual Memory Leaks

At this point we were relatively confident that all systems were operating normally, but we were still receiving reports of ETIMEOUT and 502 errors from our users. The only consistent symptom our devops team saw was that after about two hours of operation a given load balancer process would exceed 1.5GB of RAM, causing the V8 garbage collector to execute more frequently. This in turn led to the persistent errors over the course of Monday, September 9th.

The root cause was clear: our load balancers were rapidly leaking memory. This is where running on top of Joyent and SmartOS was invaluable to us. If you were not aware, SmartOS has more advanced post-mortem debugging facilities for node.js than any other platform.

Over the course of several executions of gcore and ::findjsobjects -v we were able to detect the following pattern in the memory usage of our load balancers:

OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS  
80209271     4158      5 Command: buffer_args, callback, command, args, ...  
6da79ce1    11292      5 Command: buffer_args, callback, command, args, ...  
6326ebb5    21378      5 Command: buffer_args, callback, command, args, ...  
5ce0fdb5    53125      5 Command: buffer_args, callback, command, args, ...  
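
For those who want to try this themselves, the session that produces output like the above looks roughly like this on SmartOS (the process id and resulting core file name are placeholders):

gcore 12345
mdb core.12345
> ::load v8
> ::findjsobjects -v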

This introspection gave us the insights we needed to resolve the underlying cause of the memory leak, which originated in the redis module, and finally return all systems to normal in the early hours of Tuesday, September 10th.

Improving our Transparency

Perhaps more frustrating to both us and our customers was the lack of visibility into the status of the situation during a prolonged period of instability. It is the unknown unknowns that cause the most anxiety and erode confidence in our product.

Running a publicly visible company, we know we can't ignore (or even seem to ignore) the comments and criticism we received during the outage.

Since December of 2012 we have had a live status page to communicate any and all issues with the Nodejitsu platform. During this extended outage it became obvious to us that we cannot accept "status page inception": the status page must not go down along with the platform it reports on. Here's what we're doing to make things better:

  • This week we will be moving our Status Page to a redundant host on a separate IaaS substrate to ensure maximum availability.
  • Now that we are gathering metrics about all applications from forza, our engineers will be working hard to expose this information to you and to present load averages on our Status Page, similar to GitHub's.

Wrapping up

I’m truly sorry about the availability problems of our Platform. It has not been up to our standards, and we are taking each and every one of the lessons these outages have taught us to heart. We can and will do better. Thank you for supporting us at Nodejitsu, especially during the difficult times like this.