With over a million members, GitHub has an amazing pool of open-source developers across a wide range of projects. More amazing is how GitHub has changed the technical recruiting game. The days of resumes and references for developers are slowly dying. All the work I've gotten in the past year has been a direct result of my GitHub account. I send potential employers my GitHub account first. They can see a history of the work I have done, the influence that work has had (followers / forks), and my coding style and abilities. GitHub introduces a transparency that has never been available before.
In my ongoing obsession with finding top-tier developers, I came across an article by Matt Biddulph called "algorithmic recruitment with github". He released some basic code for calculating the social influence of a user based on a geographical location. I was immediately hooked. I got the code running with a few minor modifications and ran: "New York"
the top 5...
Awesome! I placed 4th, but I noticed jashkenas was missing from the list. I had figured he would have placed in the top five as well...
I decided further research was required and ran the script against the following locations: Chicago, Boston, California, San Francisco, Los Angeles, San Diego, San Jose, Palo Alto, Portland, Oakland, Seattle, Florida, Alaska, Montreal, Toronto, Canada, Russia, Moscow, Ukraine, Uruguay, Chile, Japan, Taiwan, Korea, India, China, Israel, Argentina, Brazil, England, London, Germany, Spain, France, Switzerland, Sweden, Australia, New Zealand
The results were interesting. I cross-referenced as many of the regions as I could with various people on IRC and saw somewhat consistent results. The reason jashkenas doesn't show up in "New York" is simple: his location is listed as "NYC", so the "New York" query didn't find him. The same normalization issue occurs for many regions and winds up excluding potentially influential developers from the graph. I also saw what I would consider "artificial" influence from accounts that were simply following large numbers of people (such as webiest).
The solutions to these issues are relatively straightforward.
To fix the normalization issues there are two options. We could ask GitHub to change the location field on profiles so it could be geocoded (unlikely to happen anytime soon), or we could manually set up aliases for each region and then change the script to perform an OR operation on these aliases. For instance, to calculate the graph for New York we could query "New York, NYC, New York City, Brooklyn, Queens, Manhattan, Staten Island". I suppose we could also just try to take the current location string and geocode it. The point is there is room for improvement.
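Here is a rough sketch of what the alias approach might look like in Ruby. It is not taken from the original script: the REGION_ALIASES table and the normalize_location and in_region? helpers are names I made up for illustration, and it assumes each profile's location is available as a plain string.

```ruby
# Hypothetical alias table: each region maps to the location strings
# that should count as that region.
REGION_ALIASES = {
  "New York" => ["New York", "NYC", "New York City", "Brooklyn",
                 "Queens", "Manhattan", "Staten Island"]
}.freeze

# Lowercase a free-form location string and replace punctuation with
# spaces so that "Brooklyn, NY" and "brooklyn ny" compare equal.
def normalize_location(location)
  location.to_s.downcase.gsub(/[^a-z0-9\s]/, " ").squeeze(" ").strip
end

# True when the location contains any alias of the region as a whole
# word or phrase (an OR across the aliases). Regions without an alias
# entry fall back to matching the region name itself.
def in_region?(location, region, aliases = REGION_ALIASES)
  haystack = " #{normalize_location(location)} "
  aliases.fetch(region, [region]).any? do |name|
    haystack.include?(" #{normalize_location(name)} ")
  end
end
```

With this, in_region?("NYC", "New York") matches, while a plain substring search for "New York" would not. Padding both strings with spaces keeps short aliases like "NYC" from matching inside unrelated words.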
To fix the results being skewed by people who just happen to follow a lot of projects, we could change the script to take other information into account, such as: number of projects, number of watchers on projects, following-to-followers ratio, fork-to-ownership ratio, and commit history. There is also an issue of legacy organizations and organizations having followers...but that one is a bit tricky.
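To make that idea concrete, here is a minimal sketch of a score that blends a few of those signals. Everything in it is an assumption for illustration: the UserStats fields are not the shape of any GitHub API response, the weights are arbitrary placeholders that would need tuning, and commit history is left out entirely.

```ruby
# Hypothetical per-user stats; the field names are assumptions.
UserStats = Struct.new(:login, :followers, :following, :owned_repos,
                       :forked_repos, :watchers, keyword_init: true)

# Combine several signals into one score. The weights (0.5 and so on)
# are placeholders, not values from the original script.
def influence_score(stats)
  # Damp accounts that follow far more people than follow them back.
  follow_ratio = stats.followers.to_f / [stats.following, 1].max
  damping = [follow_ratio, 1.0].min

  # Share of repositories that are original work rather than forks.
  total_repos = stats.owned_repos + stats.forked_repos
  ownership = total_repos.zero? ? 0.0 : stats.owned_repos.to_f / total_repos

  raw = stats.followers + 0.5 * stats.watchers
  raw * damping * (0.5 + 0.5 * ownership)
end
```

An account with 100 followers that follows 5,000 people and owns no original repositories scores 1.0 here, while one with 1,000 followers, 50 following, 20 owned repos, 5 forks, and 4,000 watchers scores about 2,700. That is the kind of separation the raw follower graph does not give.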
The current code is written in Ruby, and I'm thinking the next version will try to use EventMachine to speed up the synchronous network requests, or maybe I'll just try to port the whole thing to Node. Once there are more accurate results I'll publish the entire list for each region. If you want to try this on your own, you can try here or here.