Saturday, March 19, 2011

node-xmpp-bosh: A BOSH (connection manager) session manager for the XMPP protocol

You can find out all about what BOSH is from XEP-0124 and XEP-0206.

Most respectable XMPP servers out there, such as Openfire, ejabberd, and Tigase, support BOSH as part of the server.

node-xmpp-bosh has a new home at github! That's where you can download the node-xmpp-bosh BOSH server from.

I'll discuss some of the reasons why it's good to have a BOSH session manager running outside of the XMPP server:
  1. It's easier to scale the BOSH server and the XMPP server independently (independently being the key here). See my previous blog post for a detailed explanation.
  2. You can support users from multiple domains such as gmail.com, chat.facebook.com, jabber.org, pandion.im, etc., using a single BOSH server.
  3. Any customizations to the BOSH server are easier to make if it is running out of process (you can restart it independently of your XMPP server - warming up an XMPP server may be more expensive than warming up a BOSH server).
  4. You can use protocols other than XMPP at the back, but still retain the same interface as far as clients are concerned. This will still require you to use XMPP for the client-facing communication. (Note: This can be accomplished via transports [for yahoo, msn, aim, etc.] available for any XMPP server [that supports XEP-0114] as a Jabber Component.) But, if you please, you can also write a small drop-in replacement server that speaks basic XMPP, but does something entirely bizarre at the back!!
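
To illustrate point 4: below is a minimal, hypothetical sketch (not node-xmpp-bosh itself) of a Node server that accepts an XMPP stream header over a plain TCP socket and is then free to do whatever it pleases at the back. The domain name and everything beyond the stream handshake are made up for illustration.

var net = require('net');

// Accept TCP connections on the standard XMPP client port (5222).
var server = net.createServer(function (socket) {
    socket.setEncoding('utf8');
    socket.on('data', function (chunk) {
        // Answer the client's <stream:stream> opener with our own header...
        if (chunk.indexOf('<stream:stream') !== -1) {
            socket.write("<?xml version='1.0'?>" +
                "<stream:stream xmlns='jabber:client' " +
                "xmlns:stream='http://etherx.jabber.org/streams' " +
                "from='example.com' version='1.0'>");
        }
        // ...and handle (or mangle) everything that follows however you like.
    });
});

server.listen(5222);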

Scale out with services - Scale the services separately

Much of what I am writing about here was taught to me by this brilliant write-up, which I quote from the DBSlayer project page:

"The typical LAMP strategy for scaling up data-driven applications is to replicate slave databases to every web server, but this approach can hit scaling limitations for high-volume websites, where processes can overwhelm their given backend DB's connection limits. Quite frankly, we wanted to scale the front-end webservers and backend database servers separately without having to coordinate them. We also needed a way to flexibly reconfigure where our backend databases were located and which applications used them without resorting to tricks of DNS or other such "load-balancing" hacks."

What this means is that it is better (from a scaling perspective) to split your application into services rather than run it as one monolithic application on some very powerful machine.

Why is it better to do so? Because each "service" may have different performance requirements and characteristics, which can be better addressed in isolation if the service runs as a separate entity.

For example, consider a typical web-stack which consists of:
  1. Load balancer (nginx)
  2. Web server/script execution engine (apache/php)
  3. Data Store (mysql/pgsql)
  4. Caching service (redis/memcached)
  5. Asynchronous processing infrastructure (RabbitMQ)

For a medium to high load web site (a few million hits/day => about 20-30 req/sec, since 2 million requests spread over 86,400 seconds works out to roughly 23 req/sec), you could make do with just a single instance of nginx, 2 machines running apache/php, 2 machines running MySQL, and one machine running RabbitMQ. Even for this deployment, you can see that each of the components has different requirements as far as the machine (and hardware usage) characteristics are concerned. For example,
  1. nginx is network I/O heavy. You could deploy it on a machine with a modest (1GB) amount of RAM, hardly any disk space, and not a very fast CPU
  2. The apache/php servers, on the other hand, would need more RAM as well as more CPU, but hardly any disk space
  3. The MySQL nodes would need a lot of RAM, CPU as well as fast disks
  4. The node running RabbitMQ (a message queue) could again comfortably run on a machine with a configuration similar to nginx (assuming that data is stored on MySQL)

Thus, we have sliced our stack into components that we can club together not just based on function, but also based on the characteristics of the hardware they would best run on. Aside: This reminds me of Principal Component Analysis.

Node.js as "the next BIG thing".

I generally hate to talk about "the next BIG thing" because, of all the many things we encounter these days, relatively few end up influencing our lives in significant ways (precisely because we encounter so many).

However, I feel that node.js is going to be quite influential in the years to come.

So, what is "node.js"? Node (put briefly) is a javascript execution engine. It incorporates asynchronous I/O as part of its core design, and hence can fairly be called asynchronous by design. This is where it starts getting fascinating for me.
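
To make that concrete, here's the canonical hello-world HTTP server in Node (the port number below is arbitrary). The call to listen() returns immediately; the callback fires asynchronously, once per incoming request.

var http = require('http');

// The callback runs once for every incoming request; nothing blocks in between.
http.createServer(function (req, res) {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('Hello, World!\n');
}).listen(8124);

console.log('Server running at http://127.0.0.1:8124/');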

Do you remember all those $.ajax() requests you made in the browser? Well, what if you could do the same on the server side? What if stitching together a page on the server were as easy as making AJAX calls and filling in empty <div> tags? You needn't stretch your imagination too much, because that's exactly what Node lets you do!!

Even more exciting is the fact that this asynchronous API isn't limited to just HTTP/AJAX calls, but extends to protocols such as SMTP, POP, and XMPP, and to database handlers such as PGSQL, MySQL, and SQLite. This is because socket and file system I/O in Node is asynchronous by design. Almost all the APIs that do I/O are asynchronous in nature; for some, blocking counterparts are provided solely for convenience.
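
For instance, the file system API follows the same pattern (the file name below is hypothetical). The asynchronous form takes a callback, while the blocking counterpart carries a Sync suffix:

var fs = require('fs');

// Asynchronous (the default): the callback fires when the read completes.
fs.readFile('/tmp/example.txt', 'utf8', function (err, data) {
    if (err) throw err;
    console.log(data);
});

// The blocking counterpart, provided solely for convenience.
var data = fs.readFileSync('/tmp/example.txt', 'utf8');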


Toy task: Make 10 HTTP web calls and present the results of each of these 10 calls, separated by newlines. Assume that each HTTP call takes 1 sec to produce its result.

Sample code (in your favourite programming language; using blocking I/O):

out = "";
for i = 1 to 10:
 # We assume that http_call blocks till the fetch completes.
 response = http_call(url);
 out += (response + "\n");
write_output(out);

You notice that the code takes 10 sec to run - 1 sec for each HTTP web call. We ignore the time taken to write out the results.


Sample code (using non-blocking I/O):

out = "";
res = array of size 10;
for i = 1 to 10:
 # We assume that http_call does NOT block and returns immediately,
 # processing the task in the background (another thread or using
 # non-blocking I/O)
 res[i] = http_call(url);

for i = 1 to 10:
 # We assume that the when statement blocks execution till the 
 # condition is met
 when res[i].complete:
  out += (res[i].data + "\n");

write_output(out);

You notice that this code takes just 1 sec to run, since all 10 HTTP web calls fire at the same time. We ignore the time taken to write out the results.
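
For the curious, here's roughly what the non-blocking version looks like in actual Node.js (the host and paths are hypothetical, and error handling is omitted for brevity). All 10 requests fire immediately; a simple counter tells us when the last response has arrived.

var http = require('http');

var results = [];
var pending = 10;

for (var i = 0; i < 10; ++i) {
    (function (i) {  // capture the loop variable for the callbacks below
        http.get({ host: 'example.com', path: '/' + i }, function (res) {
            var body = '';
            res.on('data', function (chunk) { body += chunk; });
            res.on('end', function () {
                results[i] = body;      // keep responses in request order
                if (--pending === 0) {  // the last of the 10 calls just finished
                    console.log(results.join('\n'));
                }
            });
        });
    })(i);
}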

That's a phenomenal 10x speedup over the blocking version. Now, why wouldn't anyone use Node for such a use case??