Nodejitsu

Save time managing and deploying your node.js app. Code faster with jitsu and npm

Waiting for Godot


Monitoring sucks, right? At Nodejitsu, we run a large distributed system with next-to-zero external dependencies. Designing a monitoring solution that met our needs of:

  • Handling thousands of messages per second.
  • Scaling to accommodate the dynamic needs of thousands of applications.
  • Making decisions flexibly based on various criteria.

was challenging. In fact, it is something we've tried to address in several implementations. Over the past two years, it became clear to the team that we needed something new. Today we are happy to announce that we are releasing this new solution, godot, as open source software under the MIT license.

Prior Art

There is an endless number of monitoring solutions out there, but one caught our eye when considering our needs: Riemann. Riemann stood out because it is designed as a complex event processor. In their own words:

Riemann provides low-latency, transient shared state for systems with many moving parts.

The events that Riemann processes are simple, with a single metric property representing a single numeric value for the event. Godot uses this same event format for the structure of events.

{
  host:         "A hostname, e.g. 'api1', 'foo.com'",
  service:      "e.g. 'API port 8000 reqs/sec'",
  state:        "Any string less than 255 bytes, e.g. 'ok', 'warning', 'critical'",
  time:         "The time of the event, in unix epoch seconds",
  description:  "Freeform text",
  tags:         "Freeform list of strings, e.g. ['rate', 'fooproduct', 'transient']",
  metric:       "A number associated with this event, e.g. the number of reqs/sec.",
  ttl:          "A floating-point time, in seconds, that this event is valid for."
}
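To make the format concrete, here is a minimal sketch of building such an event in plain JavaScript. The `makeEvent` helper and its defaults are hypothetical and are not part of godot or Riemann; they only illustrate the fields listed above.

```javascript
// Hypothetical helper (not part of godot): fills in the fields described
// above so callers only supply what differs per event.
function makeEvent(fields) {
  var event = {
    host: fields.host,
    service: fields.service,
    state: fields.state || 'ok',
    // Unix epoch seconds, as the format above specifies.
    time: fields.time || Math.floor(Date.now() / 1000),
    description: fields.description || '',
    tags: fields.tags || [],
    metric: fields.metric,
    ttl: fields.ttl
  };

  // The metric is the single numeric value carried by the event.
  if (typeof event.metric !== 'number' || isNaN(event.metric)) {
    throw new TypeError('metric must be a number');
  }
  // The state must be a string of less than 255 bytes.
  if (Buffer.byteLength(String(event.state), 'utf8') >= 255) {
    throw new RangeError('state must be less than 255 bytes');
  }
  return event;
}

var sample = makeEvent({
  host: 'api1',
  service: 'API port 8000 reqs/sec',
  metric: 1200,
  tags: ['rate'],
  ttl: 15
});
```

Everything except `metric` is either passed through or defaulted, which keeps the per-event payload small.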

When we dug into the Riemann codebase, the overwhelming feeling (admittedly from a team predisposed to node.js) was that it felt too heavyweight and a little bit obtuse. That, and the strong emphasis on streams, made this type of system an ideal candidate for a node.js incarnation.

Enter godot.

Introducing Godot

Let's jump right in. A common use-case for monitoring is ensuring that everything is still running using heartbeats. In godot you just need to declare a client to send heartbeats and a server to ensure heartbeats are received.

Both of these operations can be done in a single JavaScript statement. First, let's focus on the server:

server.js

var godot = require('godot');

//
// Reactor server which will email `user@host.com`
// whenever any service matching /.*\/health\/heartbeat/
// fails to check in after 60 seconds.
//
godot.createServer({  
  //
  // Defaults to tcp
  //
  type: 'tcp',
  reactors: [
    godot.reactor()
      .where('service', '*/health/heartbeat')
      .expire(1000 * 60)
      .email({ to: 'user@host.com' })
  ]
}).listen(9876);

Now that the server is listening for connections from heartbeat clients, let's create a client to start sending messages.

client.js

var godot = require('godot');

//
// Producer client which sends events for the service
// `app.server/health/heartbeat` every 15 seconds.
//
godot.createClient({  
  //
  // Defaults to TCP
  //
  type: 'tcp',
  producers: [
    godot.producer({
      host: 'app.server.com',
      service: 'app.server/health/heartbeat',
      ttl: 1000 * 15
    })
  ]
}).connect(9876);

Both the .createClient and the .createServer methods of godot accept multiple producers and reactors, respectively, allowing you to compose complex behavior from multiple simple streams.
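The idea behind the heartbeat check in server.js can be sketched without godot at all: remember when each service last checked in, and flag the ones that have been silent for longer than the allowed window. The `createExpiryTracker` function below is a hypothetical illustration of that idea, not godot's implementation of `.expire()`.

```javascript
// Hypothetical sketch, not godot's implementation: tracks the last time each
// service checked in and reports those that have been silent for too long.
function createExpiryTracker(ttl) {
  var lastSeen = {};

  return {
    // Record a heartbeat for `service` at time `now` (milliseconds).
    heartbeat: function (service, now) {
      lastSeen[service] = now;
    },
    // Return the services whose last heartbeat is more than `ttl` ms old.
    expired: function (now) {
      return Object.keys(lastSeen).filter(function (service) {
        return now - lastSeen[service] > ttl;
      });
    }
  };
}

// Timestamps are passed in explicitly so the behavior is easy to follow.
var tracker = createExpiryTracker(1000 * 60);
tracker.heartbeat('app.server/health/heartbeat', 0);
tracker.heartbeat('app.server/health/heartbeat', 15000);

var stillHealthy = tracker.expired(70000); // 55s since the last heartbeat
var nowExpired = tracker.expired(80000);   // 65s since the last heartbeat
```

With a 60 second window, the service is still considered healthy at the first check and is reported as expired at the second, which is the point at which a reactor would send its email.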



Figure 1: High-level Godot Architecture

Let's examine the dataflow of a simple reactor that sends ops@host.com an email whenever the CPU load of a server goes above 50%.

godot.reactor()  
  .where('service', 'cpu')
  .over(50)
  .email({ to: 'ops@host.com' });


Figure 2: Data-flow for sending email on high CPU load
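The data-flow in this reactor is a chain of steps in which each step decides whether an event continues downstream. The sketch below mimics that with plain functions; `where`, `over`, and `createPipeline` here are hypothetical stand-ins written for illustration (for example, this `where` only does exact matching), not godot's stream-based implementation.

```javascript
// Hypothetical stand-in for a `where` step: passes events whose `key`
// property equals `value` exactly.
function where(key, value) {
  return function (event) {
    return event[key] === value;
  };
}

// Hypothetical stand-in for an `over` step: passes events whose metric
// is strictly greater than `threshold`.
function over(threshold) {
  return function (event) {
    return event.metric > threshold;
  };
}

// Runs an event through each step in order; the event only reaches
// `notify` if every step lets it through.
function createPipeline(steps, notify) {
  return function (event) {
    var passed = steps.every(function (step) {
      return step(event);
    });
    if (passed) {
      notify(event);
    }
    return passed;
  };
}

var alerts = [];
var cpuAlert = createPipeline(
  [where('service', 'cpu'), over(50)],
  function (event) { alerts.push(event); } // stands in for sending an email
);

cpuAlert({ service: 'cpu', metric: 72 });    // passes both steps
cpuAlert({ service: 'cpu', metric: 31 });    // stopped by over(50)
cpuAlert({ service: 'memory', metric: 90 }); // stopped by where('service', 'cpu')
```

Only the first event ends up in `alerts`; the other two are dropped before they reach the notification step.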

A bare threshold like this is obviously very naive: you should probably compute some kind of exponentially decaying moving average to avoid false positives on short-lived spikes (godot has EWMA built in via the window-stream module).

var windowStream = require('window-stream'),  
    godot = require('godot');

var M1_ALPHA = 1 - Math.exp(-5/60);

godot.reactor()  
  .where('service', 'cpu')
  .movingAverage({
    average: {
      type: 'exponential',
      alpha: M1_ALPHA
    },
    window: new windowStream.EventWindow({ size: 10 })
  })
  .over(50)
  .email({ to: 'ops@host.com' });
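To see why smoothing helps, here is a small standalone sketch of an exponentially weighted moving average using the same alpha as above. The `createEwma` function is a hypothetical illustration of the standard EWMA update rule; it is an assumption that godot and window-stream behave the same way in detail (for instance, in how the first sample is seeded).

```javascript
// Same smoothing factor as in the reactor above.
var M1_ALPHA = 1 - Math.exp(-5 / 60);

// Returns a function that folds each new sample into a running
// exponentially weighted moving average. The first sample seeds the
// average (an assumption made for this sketch).
function createEwma(alpha) {
  var average = null;

  return function update(sample) {
    if (average === null) {
      average = sample;
    } else {
      average = average + alpha * (sample - average);
    }
    return average;
  };
}

var smoothCpu = createEwma(M1_ALPHA);
smoothCpu(10);                      // steady load of 10%
var afterSpike = smoothCpu(100);    // a single short-lived spike to 100%
```

Because alpha is small, a single spike moves the average only a fraction of the way toward the new sample, so `afterSpike` stays well below the 50% threshold and no email would be sent. A sustained load, fed in repeatedly, would eventually pull the average over it.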

Performance

Before writing this, I noticed a recent post from aphyr (the author of Riemann) discussing in intimate detail how he achieved 200k messages per second in Riemann. That made me curious to see just how this little node program stacked up:

node test/perf/pummel.js -c 5  
Starting performance test with:  
  network protocol  tcp
  concurrency:      5
  sampling interval 10s
  duration:         10s
  ttl:              0
  port:             10557

Starting reactor 1  
Starting producer 1  
Starting producer 2  
Starting producer 3  
Starting producer 4  
Starting producer 5

Now receiving messages...

Received:  
  1069482 total messages
  106948.2 per second

So with little or no attention paid to performance, we're already processing 100k messages per second in node. This makes sense because this sort of I/O-bound application is exactly what node.js was designed for. I suspect (but have not yet had the time to investigate) that this 100k/second figure could be higher, because the benchmark is deficient in several places:

  1. Single reactor process: the underlying sockets are not shared between multiple processes.
  2. Not distributed: this benchmark was both producing and consuming data from the same machine, so there would be plenty of spare CPU if the messages were produced on a second machine.
  3. No framing: As aphyr points out, combining multiple messages into a single TCP packet can greatly increase network throughput. I'm eager to try this out.
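As a rough idea of what the framing in the last point could look like, the sketch below batches several events into one newline-delimited JSON payload and decodes them again even when the payload arrives split across chunks. This wire format, and the `encodeBatch` and `createDecoder` names, are assumptions made for illustration; the post does not describe godot's actual protocol.

```javascript
// Hypothetical framing scheme: one JSON document per line, so that many
// events can be written to the socket in a single call.
function encodeBatch(events) {
  return events.map(function (event) {
    return JSON.stringify(event);
  }).join('\n') + '\n';
}

// Returns a function that accepts arbitrary chunks of text and returns the
// events completed by each chunk, buffering any trailing partial line.
function createDecoder() {
  var buffered = '';

  return function decode(chunk) {
    buffered += chunk;
    var lines = buffered.split('\n');
    // The last piece is either an incomplete line or an empty string.
    buffered = lines.pop();
    return lines.filter(Boolean).map(function (line) {
      return JSON.parse(line);
    });
  };
}

var payload = encodeBatch([
  { service: 'cpu', metric: 42 },
  { service: 'cpu', metric: 57 }
]);

// Simulate the payload arriving in two pieces, as TCP makes no promises
// about where chunk boundaries fall.
var decode = createDecoder();
var firstPart = decode(payload.slice(0, 30));
var secondPart = decode(payload.slice(30));
var received = firstPart.concat(secondPart);
```

`JSON.stringify` escapes newlines inside strings, so a raw newline can safely act as the delimiter between events.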

So what is Orchestrion?

If you've been following Nodejitsu, then you've probably read about our monitoring product, Orchestrion. Orchestrion is built on top of Godot, adding primitives for orchestrating cloud applications:

  • Network monitoring: Concerned about network or HTTP latency? We've got you covered.
  • Server and application provisioning: Wouldn't you like to just scale your app instead of telling your ops team to scale your app?
  • SSH: Run arbitrary processes on demand to respond to incidents.
  • Process monitoring: Notify your team through alerting tools when crashes occur.

Let's consider how the above example, emailing your ops team on high CPU load, could be handled with Orchestrion:



Figure 3: Data-flow for scaling your app on high CPU load

Automating these common tasks frees up your developers and operations engineers to handle higher-level, more important problems like building your application and your business.

Interested in learning more? Why don't you request a quote for our Enterprise Product? If you're using our Public Cloud product today, then stay tuned: we'll be integrating many of these higher-level features of Orchestrion into the platform soon.




Cloud designed by Pieter J. Smits from The Noun Project.