From March 6th through March 8th, we at Nodejitsu experienced our worst service disruption to date. Following this series of outages, we immediately reallocated our Engineering and DevOps resources to properly assess the damage and to fix parts of our core infrastructure so that it does not happen again.
Since the outage, we have overhauled our entire load balancing architecture, also known as Conductor. Below is a detailed look at our old load balancing system and our newly implemented solution.
The previous implementation illustrated above suffers from several problems:
- Long startup time: populating the entire cache from CouchDB can take upwards of several minutes.
- No retry on failure: incoming requests are simply killed with
- Lack of consistency: since each balancer maintains its own _changes listener, the state it holds is independent of the other balancers.
The new implementation addresses the problems of the previous implementation by:
- Reducing startup time: There is no cache built at startup, so balancers will be immediately available to take requests.
- Retry on failure: requests that fail are retried, up to a set limit.
- Consistent central broker: the Cache Manager is now the authoritative and consistent source for host information and proactively invalidates the in-memory LRU cache on balancers as changes come in from CouchDB.
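To make the caching model above more concrete, here is a minimal sketch of a balancer that starts with an empty in-memory LRU cache, fills it lazily on the first request for a host, and drops an entry when told that the host has changed. This is an illustration under stated assumptions, not Conductor's actual code: `LruCache`, `Balancer`, `cacheManager.lookup`, and `onInvalidate` are hypothetical names, and how the Cache Manager actually delivers invalidations to balancers is not described in this post.

```javascript
// Minimal LRU cache keyed by app hostname. A Map iterates in insertion
// order, so the first key is always the least recently used one.
class LruCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  invalidate(key) {
    return this.entries.delete(key);
  }
}

// Hypothetical balancer: it starts with an empty cache (so there is no
// startup delay) and calls the assumed cacheManager.lookup(host) only
// on a cache miss.
class Balancer {
  constructor(cacheManager, maxEntries = 1000) {
    this.cacheManager = cacheManager;
    this.cache = new LruCache(maxEntries);
  }

  async resolve(host) {
    const cached = this.cache.get(host);
    if (cached !== undefined) return cached;
    const backends = await this.cacheManager.lookup(host);
    this.cache.set(host, backends);
    return backends;
  }

  // Assumed hook: called when the Cache Manager reports a change for host.
  onInvalidate(host) {
    this.cache.invalidate(host);
  }
}
```

Because entries are only fetched on demand and removed on change, a balancer in this sketch never has to load the full data set before it can serve traffic, and a stale entry lives only until the next invalidation arrives.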
Put more plainly, our new implementation of Conductor makes us less reliant on memory caching and the risks that come with it. It also moves us to a model where a failure to connect to your app triggers a series of automatic retries, so the platform can self-correct. This will limit instances of ECONNREFUSED to times when the issue lies within a customer's application rather than our platform. Our team has had to shelve some other work to make this a reality, but reliability is our top priority, and we believe this will bring a significant improvement to our core infrastructure.
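The retry behavior can be sketched roughly as follows. This is a simplified illustration, not the code running in Conductor: `proxyWithRetry`, `connect`, and the default of three attempts are assumptions made for the example, as is the choice to retry only on `ECONNREFUSED`.

```javascript
// Hypothetical helper: try backends in turn, up to maxAttempts in total.
// `connect` is any function that returns a promise for a response and
// rejects with an error whose `code` describes the failure.
async function proxyWithRetry(backends, connect, maxAttempts = 3) {
  if (backends.length === 0) {
    throw new Error('no backends available');
  }
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Cycle through the list so each retry targets a different backend
    // when more than one is available.
    const backend = backends[attempt % backends.length];
    try {
      return await connect(backend);
    } catch (err) {
      // Assumption: only refused connections are retried, because no
      // part of the request has reached the app yet. Other errors
      // are rethrown immediately.
      if (err.code !== 'ECONNREFUSED') throw err;
      lastError = err;
    }
  }
  // Every attempt was refused: surface the last error to the caller.
  throw lastError;
}
```

In this sketch the caller sees `ECONNREFUSED` only after every attempt has been refused, which matches the goal of surfacing that error only when the application itself cannot be reached.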
Thanks for hanging in there. We hope customers across the platform will be seeing a more reliable hosting experience.