As mentioned in our most recent blog post, we recently rolled out some major changes to our infrastructure. We made these changes to address problems with our existing architecture.
We decided to take this opportunity to make this the first post in a series about how the infrastructure underlying the Nodejitsu platform works. Building a Platform-as-a-Service requires a lot of moving parts: provisioning, configuration management, logging, monitoring, building snapshots, and deploying them. It's a lot to keep track of! In this post we will discuss how we deploy applications at Nodejitsu. We will answer the question "how does my app get started?", explain how our new agentless architecture works, and describe the problems it solves compared to our previous implementation.
Still interested? Keep reading!
Understanding the problem
On the surface, downloading, installing, and executing application code seems relatively simple:
- Fetch the code.
- Install it.
- Start the application.
- Send outbound metrics and logs for collection.
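At a high level, the four steps above form a small pipeline. Here is a hypothetical sketch of that ordering (none of these functions exist in the Nodejitsu stack; they only illustrate the flow):

```javascript
// Hypothetical sketch of the deploy pipeline described above.
// Each function is a stand-in for a real step; we record the order
// in which the steps run.
const steps = [];

function fetchSnapshot(app)   { steps.push('fetch');   return `${app}.tgz`; }
function installSnapshot(tgz) { steps.push('install'); return tgz.replace('.tgz', '/'); }
function startApp(dir)        { steps.push('start');   return { dir, pid: 1234 }; }
function attachMetrics(proc)  { steps.push('metrics'); return proc; }

attachMetrics(startApp(installSnapshot(fetchSnapshot('my-app'))));
console.log(steps.join(' -> ')); // fetch -> install -> start -> metrics
```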
But under the covers there are a number of subtle issues that make this problem harder than it first appears.
What is haibu?
Our previous architecture was based on haibu, an application server. The deployment process was pretty straightforward: our build server put the snapshot in CloudFiles, and haibu fetched it and started the application.
There were three problems with this approach. Let's examine them in more detail:
1. Long-running Agent
Having a long-running daemon is often a good thing (sshd, for example), but for a PaaS it can cause significant headaches:
- Excess memory: daemons are fine when you have plenty of available memory, but when you're running inside a 256MB zone or VM, every megabyte counts. haibu ended up using ~20MB of RAM on average, which was about 10% of the total.
- Upgrades: because the daemon runs the application, any upgrade to the daemon itself requires the application to be restarted. On the surface this seems OK, but over time it caused a lot of inconsistencies in the underlying drones.
2. Calculus of node versions
Since haibu is built on node, we often had to run multiple versions of node on the same machine: the version haibu used and the version used by the application. To support new node versions, we had to upgrade haibu-carapace, our process jail. This meant that as features were added and removed while node evolved, we had to implement a number of shims (around the .fork() API, for example) to ensure everything kept working.
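The shims mentioned above followed a common pattern: detect the running node version and adapt the call accordingly. The following sketch is purely illustrative (the real haibu-carapace shims differed, and the "old nodes take no options object" behavior here is invented for the example):

```javascript
// Hypothetical sketch of a version shim around child_process.fork().
// We pretend, for illustration only, that very old nodes did not accept
// an options object, so env vars get folded into the argument list.
const semverParts = v => v.replace(/^v/, '').split('.').map(Number);

function envToArgs(env) {
  return Object.keys(env).map(k => `--env-${k}=${env[k]}`);
}

function forkArgsFor(version, modulePath, args, options) {
  const [major, minor] = semverParts(version);
  if (major === 0 && minor < 6) {
    // "old" node: no options argument in this imagined API
    return [modulePath, args.concat(envToArgs(options.env || {}))];
  }
  // "new" node: pass options through untouched
  return [modulePath, args, options];
}

console.log(forkArgsFor('v0.4.12', 'app.js', [], { env: { PORT: '8080' } }));
// -> [ 'app.js', [ '--env-PORT=8080' ] ]
```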
3. No metrics
We want our customers to have as much data about their applications as possible. haibu didn't provide any easy way to plug in a monitoring agent. If we wanted to continue using it, we'd have to add one more process to the existing stack (which meant yet more memory usage).
Through our use of haibu we saw the deficiencies outlined above, so we sat down to carefully analyze what needed to change. We distilled our list of requirements to:
- Backward compatible with running applications.
- Low memory usage.
- No daemons. All our processes should be as short-lived as possible.
- Easy to upgrade.
- Built-in monitoring, metrics, and logs.
The low memory requirement suggested using C for parts of the stack running on drone servers. The new stack is based on:
- nam by Bradley Meck: a node application manager.
- interposed: a tool for detecting which port a process listens on.
- aeternum by Charlie McConnell: a C process monitor using libuv.
This is what the new implementation looks like at a high level:
Digging in, you can see that we have removed haibu and replaced it with forza and solenoid.
Monitoring applications with forza
forza is a C application monitor written using libuv, the library behind node.js core. It's essentially an interface for plugins to communicate with both the metrics servers and the running process (in our case, the user's application).
One important thing to note about forza is that when its child process dies, it dies too. That makes it both easier to upgrade and less prone to significant memory leaks.
forza comes with a few interesting plugins which send additional metrics about the application it monitors:
- cpu - CPU load
- mem - used memory
- start - plugin for starting applications
- logs - log distribution
- process - process-level events
- uptime - process uptime
To get started with forza, first install it:
git clone https://github.com/opsmezzo/forza.git
cd forza
./configure --with-plugin cpu --with-plugin logs
make
Now launch a new terminal window and run nc to see all of the data that forza will send from the application.
nc -k -l 1337
Then start the application using forza:
./forza -h 127.0.0.1:1337 -- node <path-to-your-node-app>
When your app starts up, you should start seeing some logs and CPU statistics in a unified JSON format in the second terminal window.
Starting applications with solenoid
solenoid is an application starter. It performs several necessary tasks before starting an application:
- Fetches and unpacks the application snapshot from CloudFiles.
- Reads the package.json and determines which node version to run.
- Sets any environment variables passed in.
- Creates the solenoid user and restricts the file system.
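The version-selection step (step 2 above) can be sketched as follows. This is an illustrative simplification, not solenoid's actual code: semver matching is reduced to exact versions and x-ranges, and versions are compared lexicographically:

```javascript
// Illustrative sketch of picking a node version from package.json.
// Reads the "engines" field and returns the newest installed version
// that matches simple patterns like "0.10.x" or exact "0.10.21".
function pickNodeVersion(pkg, installed) {
  const want = (pkg.engines && pkg.engines.node) || 'x';
  const matches = v => {
    if (want === 'x' || want === '*') return true;
    const pattern = '^' + want.replace(/\./g, '\\.').replace(/x/g, '\\d+') + '$';
    return new RegExp(pattern).test(v);
  };
  const ok = installed.filter(matches).sort();
  return ok.length ? ok[ok.length - 1] : null;
}

console.log(pickNodeVersion(
  { engines: { node: '0.10.x' } },
  ['0.8.26', '0.10.21', '0.10.22']
)); // -> 0.10.22
```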
solenoid is specific to our needs, but we decided to open it up so the community can get better insight into how our platform works and help us improve it.
solenoid is started by the Nodejitsu API over an SSH connection, using the ssh2 library. We then wait for forza to determine whether the application started correctly and is responding to our users.
Verifying our assumptions
The new infrastructure has been out for over three weeks. We have gotten lots of positive feedback, but we realize that it may have been rough around the edges for some of our users. We're still working on some edge cases, but we can honestly say that deployments are far more reliable than they were before.
Upgrading software on our servers became a breeze - we can release new software to all of our servers in about an hour. We're extremely happy to finally ship something we've been working on for over six months.
Looking to learn more about the inner workings of Nodejitsu? Stay tuned! In future posts we will be digging into:
- Building application snapshots.
- Provisioning and configuring servers.
- Real-time logging and metrics infrastructure.
- Load balancing and network routing.
- Databases and backups.