Nodejitsu


Public npm post-mortem - Jan 24, 2014


From 12:45am ET (UTC-5) until 2:25am ET (UTC-5) this morning, January 24th, 2014, the secondary master of the npm registry was down while it recovered from backups on the primary master. This was necessary because a bad restart of CouchDB (caused by our supervision code) started two CouchDB processes on the same registry.couch file.

The root cause, as we mentioned, was our supervision code, which we deployed and then rolled back once the bug was detected. Unfortunately, the corrupted registry.couch file left us unable to recover immediately. We needed to roll this update out because scaling the public npm registry from two to three masters becomes more necessary every day. The supervision code, while robust, was not designed to work in a multi-master setup. This has caused several small outages (less than 5-10 minutes each) in January, during which our operations team manually restarted CouchDB on the secondary master. If this trend continues, manual restarts of non-primary masters will not be tenable once we are running three or four masters.
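The failure mode described above, two processes started against the same registry.couch file, is the kind of thing a supervisor can guard against with an exclusive lock file. The following is only a minimal sketch under stated assumptions, not the actual Nodejitsu supervision code: the helper names `acquireLock` and `releaseLock` and the lock path are hypothetical, and it does not handle stale locks left behind by a crashed process.

```javascript
const fs = require('fs');

// Hypothetical helper: try to take an exclusive lock before starting a
// database process. Returns true if this caller now owns the lock,
// false if another caller already holds it.
function acquireLock(lockPath, pid = process.pid) {
  try {
    // The 'wx' flag creates the file and fails with EEXIST if it already
    // exists, so only one caller can succeed.
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(pid)); // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // someone else holds the lock
    throw err; // any other error (permissions, missing directory) is unexpected
  }
}

// Hypothetical helper: release the lock once the supervised process has exited.
function releaseLock(lockPath) {
  fs.unlinkSync(lockPath);
}
```

A supervisor built this way would call `acquireLock` with a path next to the database file before spawning the process, refuse to start a second one if it returns false, and call `releaseLock` only after the process has exited.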

We take the operation of the public npm registry very seriously and we are deeply sorry for the inconvenience this caused. We always deploy fixes like this outside of normal business hours for North America and Europe, but obviously with a service as important as npm, any outage causes headaches for users. We will attempt this update again next week, after the third master has been brought up as a hot spare in the rare case this happens again despite planned, more rigorous testing in our staging environment.

More information about the current state of the art for the operation of the public npm registry will be in our monthly #scalenpm blog post, which will be released on Monday, January 27th.

This outage, while unacceptable, was a perfect test for our new bifurcated infrastructure announced on Tuesday, which was completely unaffected. If you'd like to take it for a spin, simply run:

npm install npm --registry "http://registry.nodejitsu.com"  

<3, the Nodejitsu Operations Team