jsdom + jQuery in 5 lines with node.js

Screen scraping has been a focus of engineering for quite some time, and every major language out there has its own library of choice for performing these tasks.

The challenge is that each of these libraries has its own quirks that can make working with HTML, CSS and Javascript difficult.

The Javascript difference

By working with server-side Javascript (in this case node.js), developers don't need to worry about these issues: thanks to jsdom, a server-side implementation of the DOM APIs developed by nodejitsu's own Elijah Insua, we can use widely accepted and battle-hardened libraries such as jQuery on the server. Adding jQuery to any page with jsdom is easy:

var jsdom = require('jsdom');

jsdom.env({  
  html: "<html><body></body></html>",
  scripts: [
    'http://code.jquery.com/jquery-1.5.min.js'
  ]
}, function (err, window) {
  var $ = window.jQuery;

  $('body').append("<div class='testing'>Hello World</div>");
  console.log($(".testing").text()); // outputs Hello World
});

The above code creates a new jsdom window and adds jQuery to the document via a script element. Although this is just an illustrative example, it is easy to modify to work with real pages retrieved from the Internet.

Working with live pages

Node.js makes it easy to retrieve pages online. The low-level APIs in the http module are somewhat verbose, but thankfully there are two libraries that make things easier: request and http-agent. You can install both through the node package manager, npm. All of the samples in this blog post are packaged up into a gist for easy use:

curl -o jsdom-jquery.tar.gz https://gist.github.com/gists/1009644/download  
tar -xvf jsdom-jquery.tar.gz  
cd gist1009644*  
npm install  
node request.js  
node jsdom-jquery.js  
node jquery-request.js  
node jquery-http-agent.js  
node http-agent.js  
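
If you'd rather wire the examples up yourself instead of starting from the gist, the same dependencies can be installed directly (a minimal sketch; request, http-agent and jsdom are the package names published on npm):

npm install request http-agent jsdom  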

Request is best suited for making requests for individual webpages:

var request = require('request');

request({ uri: 'http://www.google.com' }, function (error, response, body) {  
  if (error || response.statusCode !== 200) {
    return console.log('Error when contacting google.com');
  }

  // Print the google web page.
  console.log(body);
});

http-agent, on the other hand, is designed for scenarios where you need to iterate over a sequence of webpages, deciding what to retrieve next after each page is returned:

var httpAgent = require('http-agent'),  
    util = require('util');

var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {  
  console.log('Body of the current page: ' + agent.body);
  console.log('Response we saw for this page: ' + util.inspect(agent.response));

  // Go to the next page in the sequence
  agent.next();
});

agent.addListener('stop', function (err, agent) {  
  console.log('the agent has stopped');
});

agent.start();  

Real-life scraping tasks

The above examples can be combined into robust and easy-to-use code for scraping live pages. We simply need to make a new jsdom window for each page we get back, then add jQuery to the page. Here's an example using request:

var request = require('request'),  
    jsdom = require('jsdom');

request({ uri:'http://www.google.com' }, function (error, response, body) {  
  if (error || response.statusCode !== 200) {
    return console.log('Error when contacting google.com');
  }

  jsdom.env({
    html: body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function (err, window) {
    var $ = window.jQuery;

    // jQuery is now loaded on the jsdom window created from the response body
    console.log($('body').html());
  });
});

whereas with http-agent the code would be:

var httpAgent = require('http-agent'),  
    jsdom = require('jsdom');

var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {  
  jsdom.env({
    html: agent.body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function (err, window) {
    var $ = window.jQuery;

    // jQuery is now loaded on the jsdom window created from 'agent.body'
    console.log($('body').html());

    agent.next();
  });
});

agent.addListener('stop', function (err, agent) {  
  console.log('the agent has stopped');
});

agent.start();  
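
Once the window is loaded, any jQuery selector works against the scraped page. As a minimal sketch (the page and selector here are just illustrative, not part of the original gist), this variant of the request-based version prints every link it finds:

var request = require('request'),  
    jsdom = require('jsdom');

request({ uri: 'http://www.google.com' }, function (error, response, body) {  
  if (error || response.statusCode !== 200) {
    return console.log('Error when contacting google.com');
  }

  jsdom.env({
    html: body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function (err, window) {
    var $ = window.jQuery;

    // Print the text and href of every anchor on the page
    $('a').each(function () {
      console.log($(this).text() + ' -> ' + $(this).attr('href'));
    });
  });
});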

Wrapping up

To sum up, there are some key benefits to doing your screen scraping with node.js and jsdom:

  • Works with the same battle-hardened DOM libraries used in production every day.
  • Node.js is blazing fast and designed for exactly this kind of highly asynchronous task.
  • Easy to use and deploy with new tools like nodejitsu.

Happy scraping!