Saturday, April 16, 2011

Lazy Logging on Javascript (2)

A few weeks ago, I wrote about how to do lazy logging in Javascript by passing functions to the logging functions. These functions would do the CPU-intensive evaluation (if needed) and return something that would subsequently be printed on the screen. The key thing to notice is that the item(s) to be printed would be computed ONLY if they were actually going to be printed, as determined by the current logging level and the 1st parameter to the logging function.

I noticed that a lot of code had started looking quite ugly:
log_it("DEBUG", function() {
  return [ some_var1.toString(), some_var2.toString() ];
});

Note: Any expensive computation that should be performed MUST go inside the toString() method of some_var1 and some_var2.

I have instead decided to solve it by making a function do what the caller was earlier doing:
// toString() Promise
function tsp() {
  var args = arguments;
  return {
    toString: function() {
      return Array.prototype.slice.call(args).map(function(v) {
        return v.toString();
      }).join(' ');
    }
  };
}

As a result of this modification, all code that does logging now looks like this:
log_it("DEBUG", tsp(some_var1, some_var2));

This works because log_it() calls toString() (if required) on each of its arguments before printing them.
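For reference, here is a minimal sketch of a log_it() that behaves this way (the actual implementation isn't shown in this post, so the level table and variable names below are my assumptions):

var LEVELS = { ERROR: 0, WARN: 1, INFO: 2, DEBUG: 3 };
var current_level = LEVELS.INFO;

function log_it(level) {
  // Bail out early; the tsp() promise is dropped and its
  // toString() never runs.
  if (LEVELS[level] > current_level)
    return;

  // Only now is the (potentially expensive) toString() invoked.
  var args = Array.prototype.slice.call(arguments, 1).map(function(v) {
    return v.toString();
  });
  console.log.apply(console, [level].concat(args));
}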

Much neater, eh?

Thursday, April 07, 2011

A comment on an email signature

I've added this signature to my outgoing emails at work
n="63686180107683"; n.split('').splice(0,7).map(function() {
   t=n%100; n=n/100; return parseInt(t);
}).reverse().map(function(v) {
   return String.fromCharCode(v+6*6);
}).join('');

and was pleasantly surprised to receive a fairly long and detailed analysis of it from Joel Rosario.

I wonder how many people have run and tried to understand your signature. I just did. It is devious, and evil.

* What you really want is a loop. Why splice the large string from 0 to 7? It unnecessarily misleads the reader into thinking you're doing a meaningful string operation.
* Why use map with an anonymous function that takes no parameter? This trips up the unwary reader into reading the function, and wondering what it is being applied to.
* The use of n % 100 and n=n/100 seems to show off javascript's perlish ability to treat strings as numbers.
* The above might also confuse the hapless reader. What happens when you cross a string with a number, hmmm?
* By this time, having nearly given up trying to understand what's going on, the reader is hit with reverse(), surprised by a map(), and presented with a function that expects a parameter that is clearly an integer. The arithmetic operations of +6*6 applied to an already mystical parameter v will be too much for any sane person to bear. By now your reader has already succumbed to his fear and fled for a coffee break.

Why lay such traps for the unwary? Why risk the equanimity, the morale, indeed the sanity of your esteemed colleagues? Why couldn't you just have signed your mail with console.log('chat.pw')?

- Joel Rosario.

Needless to say, some of my emails will randomly have console.log("chat.pw") in them ;)
Emails like this make my day; nay, week!!

Saturday, March 26, 2011

Lazy Logging on Javascript

One of the things that I like about C/C++ is compile-time macros. With them, debug print statements are not only compiled out, but the expressions passed to them are never evaluated.

I have code like this in javascript:
var logging = false;
function log() {
  if (logging)
    console.log.apply(console, arguments);
}

var xml = an_xml_stanza();
log("XML Stanza:", xml.toString());

Even if logging is false, the expression xml.toString() will be evaluated, which can be quite costly in a production setup (I'm talking about node.js here, not the browser).

The way I've solved this is by making sure that my log() function can accept functions as well. Hence, code such as this becomes possible:

var xml = an_xml_stanza();
log(function() {
  return ["XML Stanza:", xml.toString()];
});

The log() function needs to be patched to look like this:
log = function() {
  if (!logging)
    return;

  // Don't reassign 'arguments' itself; that breaks in strict mode.
  var args = arguments;
  if (args.length == 1 && typeof args[0] == "function")
    args = args[0]();

  console.log.apply(console, args);
}

This essentially means that you've gotten rid of the runtime cost of calling xml.toString() when logging is disabled.
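A quick sanity check (with a hypothetical stanza object) that the expensive work really is skipped when logging is disabled:

var xml = {
  toString: function() {
    console.error("expensive toString() ran!");
    return "<presence/>";
  }
};

logging = false;
log(function() {
  return ["XML Stanza:", xml.toString()];
});
// Prints nothing: the function (and hence xml.toString()) never ran.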

Monday, March 21, 2011

Node and the Proxy/Decorator Pattern

The proxy pattern.
The decorator pattern.

Proxies & Decorators are patterns that abstract away or enhance certain functionality in an entiry.

Since they are both almost the same, it isn't very helpful to know the finer differences between the two, but here they are just for completeness.

Example of a Proxy:
When you proxy some functionality, you generally delegate the responsibility to someone else. The most practical example is an HTTP Proxy that forwards requests on your behalf to the real web-server.

Example of a Decorator:
When you decorate some functionality, you generally add more functionality to it. For this example, I shall discuss the code that is presented below. It is basically an EventEmitter that transparently decompresses the string that is passed to it and re-raises the events that the original EventEmitter raised. You can use it to transparently read a compressed/secure stream given that you have a decorated EventEmitter for it.


File: gzip.js
var compress = require('compress');

function _inflater(stream) {
 this._stream = stream;
 var self = this;
 var gunzip = new compress.Gunzip;
 gunzip.init();

 this._stream.on('data', function(d) {
  var _d = gunzip.inflate(d.toString('binary'), 'binary');
  self.emit('data', _d);
 })
 .on('end', function() {
  self.emit('end');
 })
 .on('error', function() {
  var args = Array.prototype.slice.call(arguments, 0);
  args.unshift('error');
  self.emit.apply(self, args);
 });
}

_inflater.prototype = new process.EventEmitter();


function _deflater(stream) {
 this._stream = stream;
 var self = this;
 var gzip = new compress.Gzip;
 gzip.init();

 this._stream.on('data', function(d) {
  var _d = gzip.deflate(d.toString('binary'), 'binary');
  self.emit('data', _d);
 })
 .on('end', function() {
  self.emit('end');
 })
 .on('error', function() {
  var args = Array.prototype.slice.call(arguments, 0);
  args.unshift('error');
  self.emit.apply(self, args);
 });
}

_deflater.prototype = new process.EventEmitter();



exports.inflater = function(stream) {
 return new _inflater(stream);
};

exports.deflater = function(stream) {
 return new _deflater(stream);
};

File: test.js
var gz   = require("./gzip.js");
var http = require("http");

var req = http.request({
 host: "duckduckgo.com", 
 port: 80, 
 path: "/"
}, function(response) {
 console.log("response headers:", response.headers);

 //
 // We check if we need to create a proxy event emitter that
 // inflates data on the go.
 //
 if ((response.headers['content-encoding'] || '').search('gzip') != -1) {
  //
  // The response object is now the proxy object. A similar 
  // technique is used to support TLS encrypted sockets too.
  // We just wrap up the original EventEmitter instance in 
  // a proxy instance that does its magic.
  //
  response = gz.inflater(response);
 }

 response.on('data', function(d) {
  console.log("Got:", d.toString());
 });
});

req.setHeader("Accept-Encoding", "gzip");
req.end();

Sunday, March 20, 2011

Issue list in code

A few weeks ago, there was a debate/argument/discussion about what issue tracking tool should be used for our project. The 2 main contenders were JIRA and Basecamp. I'll try to capture the essence of the discussion.

Things going for JIRA:
  1. Very customizable and configurable
  2. Offers many workflows and views
  3. Lots of plugins
  4. Easy for manager to generate reports (good reporting features)

Things going against JIRA:
  1. Not the most convenient to use for developers
  2. Too much (unnecessary) detail to be filled in for each task at times

Things going for Basecamp:
  1. Easy to use. Clean interface. More like a todo list
  2. Developers like it

Things going against Basecamp:
  1. No easy custom report generation facility
  2. Not reporting/manager friendly

We've decided to try both out for a few months and see which one "works out" for us. We were using JIRA before, but the enthusiasm seems to have died out after a few months, and the issue tracker is no longer in sync with the actual issues/reports/bugs/features being developed.

I have an entirely different point of view though. For me what has worked best is comments in code that say something like:
// TODO: Such and such is broken. It needs to be fixed in so and so way.

The reason is that it's too much tedium for me to update things in 2 places (code and issue tracker) when I fix an issue. If I forget, then it's even harder to relate the issue to the actual change made. This method offers a way of alleviating that problem.

I'm also okay with having comments that are verbose just so that they can explain in sufficient detail the reason (or purpose) of the code below.

However, this has been mostly for projects where I have been the sole developer. I don't know how well it would scale for projects with 2-3 developers, 5-6 developers and > 6 developers.

I would guess that it's fine for projects with up to 6 developers, but beyond that it would get cumbersome. Also, this way of tracking issues lacks a good reporting method which allows you to track changes across versions. Of course, if you want to adopt this as a workflow, I'm sure some tooling around this general idea can be developed which works by generating diffs between revisions to track which bugs this revision (or set of revisions) fixes.
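As a taste of what such tooling might look like, here's a minimal sketch (the revisions and names below are made up for illustration): a fixed issue shows up as a deleted "// TODO:" line in the diff between two revisions.

var exec = require('child_process').exec;

function todos_fixed_between(rev1, rev2, cb) {
  exec('git diff ' + rev1 + ' ' + rev2, function(err, stdout) {
    if (err) return cb(err);
    // Lines starting with '-' were removed; a removed TODO is
    // (hopefully) a fixed issue.
    var fixed = stdout.split('\n').filter(function(line) {
      return /^-.*\/\/ *TODO:/.test(line);
    });
    cb(null, fixed);
  });
}

todos_fixed_between('v1.0', 'v1.1', function(err, fixed) {
  if (!err) console.log(fixed.join('\n'));
});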

Saturday, March 19, 2011

node-xmpp-bosh: A BOSH (connection manager) session manager for the XMPP protocol

You can find out all about what BOSH is from XEP-0124 and XEP-0206.

Most respectable XMPP servers out there such as Openfire, Ejabberd, and Tigase support BOSH as part of the server.

This is where you can download the node-xmpp-bosh BOSH server from.
node-xmpp-bosh has a new home at github!

I'll just discuss some of the reasons why it's good to have a BOSH session manager running outside of the XMPP server.
  1. Easier to scale the BOSH server and the XMPP server independently (independently being the key here). See my previous blog post for a detailed explanation
  2. You can support users from multiple domains such as gmail.com, chat.facebook.com, jabber.org, pandion.im, etc... using the single BOSH server
  3. Any customizations to the BOSH server are easier made if it is running out of process (you can restart it independently of your XMPP server - since getting an XMPP server warmed up may be more expensive than getting a BOSH server warmed up)
  4. Use protocols other than XMPP at the back, but still retain the same interface as far as clients are concerned. This will still require you to use XMPP for the actual communication (Note: This can be accomplished via transports [for yahoo, msn, aim, etc...] available for any XMPP server [that supports XEP-0114] as a Jabber Component). But, if you please, you can also write a small drop-in replacement server that speaks basic XMPP, but does something entirely bizarre at the back!!

Scale out with services - Scale the services separately

Whatever I am writing about here has been taught to me by this brilliant write up that I quote from the DBSlayer project page.

"The typical LAMP strategy for scaling up data-driven applications is to replicate slave databases to every web server, but this approach can hit scaling limitations for high-volume websites, where processes can overwhelm their given backend DB's connection limits. Quite frankly, we wanted to scale the front-end webservers and backend database servers separately without having to coordinate them. We also needed a way to flexibly reconfigure where our backend databases were located and which applications used them without resorting to tricks of DNS or other such "load-balancing" hacks."

What this means is that it is better (from a scaling perspective) to split your application into services rather than have them run as one monolithic application on some very powerful machine.

Why is it better to do so? That's because each "service" may have different performance requirements and characteristics which can be better addressed in isolation if the service is running as a separate entity.

For example, consider a typical web-stack which consists of:
  1. Load balancer (nginx)
  2. Web server/script execution engine (apache/php)
  3. Data Store (mysql/pgsql)
  4. Caching service (redis/memcached)
  5. Asynchronous processing infrastructure (RabbitMQ)

For a medium to high load web site (a few million hits/day => about 20-30 req/sec), you could make do with just a single instance of nginx, 2 machines running apache/php, 2 machines running MySQL and one machine running RabbitMQ. Even for this deployment, you can see that each of the components have different requirements as far as the machine (and hardware usage) characteristics are concerned. For example,
  1. nginx is network I/O heavy. You could deploy it on a machine with a modest (1GB) amount of RAM, no hard disk space, and not a very fast CPU
  2. The Apache/php servers on the other hand would need more RAM as well as more CPU, but no disk space
  3. The MySQL nodes would need a lot of RAM, CPU as well as fast disks
  4. The node running RabbitMQ (a message queue) could again comfortably run on a machine with a configuration similar to nginx (assuming that data is stored on MySQL)

Thus, we have sliced our stack into components we can club together not just based on function, but also based on the characteristics of the hardware that they would best be able to run on. Aside: This reminds me of Principal Component Analysis.

Node.js as "the next BIG thing".

I generally hate to talk about "the next BIG thing" because, of all the things we encounter these days, relatively few end up influencing our lives in significant ways (simply because we encounter so many things).

However, I feel that node.js is going to be quite influential in the years to come.

So, what is "node.js"? Node (put briefly) is a javascript execution engine. It incorporates asynchronous I/O as part of its design and hence can be considered asynchronous by design. This is where it starts getting fascinating for me.

Do you remember all those $.ajax() requests you made in the browser? Well, what if you could do the same on the server side? What if stitching together a page on the server was as easy as making AJAX calls and filling in empty <div> tags? You needn't stretch your imagination too much because that's exactly what Node lets you do!!

Even more exciting is the fact that this asynchronous API isn't limited to just HTTP/AJAX calls, but extends to protocols such as SMTP, POP, XMPP, and database handlers such as PGSQL, MySQL, and Sqlite. This is because socket & file system I/O on Node is asynchronous by design. Almost(??) all the APIs which do I/O are asynchronous in nature. For some, blocking counterparts are provided solely for convenience.


Toy task: Make 10 HTTP web calls and present the output as the result of each of these 10 calls separated by a new line. Assume that each HTTP call takes 1 sec to produce the result.

Sample code (in your favourite programming language; using blocking I/O):

out = "";
for i = 1 to 10:
 # We assume that http_call blocks till the fetch completes.
 response = http_call(url);
 out += (response + "\n");
write_output(out);

You notice that the code takes 10 sec to run - 1 sec for each HTTP web call. We ignore the time taken to write out the results.


Sample code (using non-blocking I/O):

out = "";
res [10] = [ ];
for i = 1 to 10:
 # We assume that http_call does NOT block and returns immediately, 
 # processing the task in the background (another thread or using
 # non-blocking I/O)
 res[i] = http_call(url);

for i = 1 to 10:
 # We assume that the when statement blocks execution till the 
 # condition is met
 when res[i].complete:
  out += (res[i].data + "\n");

write_output(out);

You notice that the code takes 1 sec to run - 1 sec for each HTTP web call; all of which fire at the same time. We ignore the time taken to write out the results.

That's a phenomenal increase of 10x over the blocking version. Now, why wouldn't anyone use Node for such a use case??
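Here's what the non-blocking version might look like in node.js itself (the host and path are placeholders): all 10 requests are fired concurrently via http.get, and the output is written once the last response arrives.

var http = require('http');

var results = [];
var pending = 10;

for (var i = 0; i < 10; i++) {
  (function(idx) {
    http.get({ host: 'example.com', path: '/' }, function(res) {
      var body = '';
      res.on('data', function(d) { body += d; });
      res.on('end', function() {
        results[idx] = body;
        // Write the output once all 10 responses have arrived.
        if (--pending === 0)
          console.log(results.join('\n'));
      });
    });
  })(i);
}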

Tuesday, March 08, 2011

jQuery stack == evolution

So, twitter was kinda cramped, so I'm writing this here. A bunch of context around this. @addyosmani posted a tweet/link:

"Tools For jQuery Application Architecture – The Printable Chart http://bit.ly/fryIKW"

which I re-tweeted and to which @rakesh314 replied with this tweet/link:

"@addyosmani http://t.co/xWddsTK /cc: @dhruvbird"

So, there's one thing I strongly believe, which is that the jQuery stack seems to be a product of evolution, and evolution is generally not wrong - but that's a separate debate. Lots of people have tried different things and jQuery has evolved to permit all of them to co-exist in (relative) harmony.

The LAMP stack was not THE LAMP STACK before people realized that this set of tools was always used together. It was only after the fact that they decided to bundle them. Different Linux distros are healthy in the sense that they package what they think their users want.

Similarly, I feel that the jQuery stack is progressing in that general direction and I won't be surprised to see vendors releasing their "version" or "distro" of the jQuery stack.

I'm not as much "against" Dojo as I am "for" jQuery.

Sunday, March 06, 2011

HTTPS for all browsing

I've been thinking about using HTTPS for everything (even sites that don't support HTTPS) by routing all my traffic through a proxy that makes connections to the actual site (possibly on HTTP). This at least secures my traffic from packet sniffing on the local LAN.

What this means is that if you are at school or the office, no colleague can run Wireshark or TCPDump on the local LAN and capture/sniff your traffic. Also, you can now safely browse the web over http on insecure/potentially sniffed networks such as stray wireless networks without having to worry about your data being compromised! Welcome starbucks internet :-p

Traditionally, if the browser connects directly to a public proxy, then HTTP traffic still goes unencrypted (to the best of my understanding). Hence, this is what I've thought of doing.
  1. Set up a local proxy on the same machine, which connects to a remote proxy over HTTPS (see the sketch after this list).
  2. Ensure that the remote proxy is running on a safe/trusted network (it could be your home PC if you want to use insecure wireless networks securely)
  3. This remote proxy can now make HTTP connections and the issue of local packet sniffing is resolved.
  4. However, it doesn't prevent remote packet sniffing (on the network where the remote proxy resides), which is why it is important to have the remote proxy sitting on a secure network.
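Here's a minimal sketch of step 1 in node.js (the host, the ports, and the assumption that the remote proxy speaks the plain HTTP proxy protocol over TLS are all mine): a local listener that tunnels every browser connection to the trusted remote proxy. You'd point the browser's HTTP proxy setting at localhost:8080.

var net = require('net');
var tls = require('tls');

net.createServer(function(local) {
  // One TLS connection to the remote proxy per browser connection.
  var remote = tls.connect(443, 'proxy.example.com', function() {
    local.pipe(remote);
    remote.pipe(local);
  });
  remote.on('error', function() { local.destroy(); });
  local.on('error', function() { remote.destroy(); });
}).listen(8080);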

If you are seriously planning to use this proxy, and you aren't yet using HTTPS Everywhere, I would strongly suggest that you start using it since it will reduce the load on the proxy and is more secure (since the encryption is end-to-end and not proxy-to-end).

Mamma says that there shall be a day when browsers pop up a warning when you view an http based page (as opposed to an https based one).

Update: You can grab the code for this proxy here

Sunday, February 06, 2011

Attitude - and a bit of tea!!

Yesterday, I had gone to have Chai (tea) and Brun Pav at Yazdani Bakery. When I was paying the owner (Rashid Zend in the photograph on the previous link) the amount due, he asked one of the waiters to bring a half eaten pav (bread) that a customer seemed to have left behind. He picked it up from the plate, examined it carefully, and asked the waiter (in Hindi) "Why did he leave this uneaten?" I was quite taken by surprise by what he had just asked, but was also very very excited about having eaten at a place where the owner himself took such a keen interest in the produce and tried to rectify even a single small error that may have occurred on his (bakery's) part.

Now, if we (as programmers) take every customer feedback this seriously and act upon it, I am fairly sure that the quality of the final product will go up by leaps and bounds!!

Monday, January 31, 2011

Fixing javascript variable leaks

Today I happened to have a look at my application's window object and was shocked to see that 'i' was defined on it! I doggedly started searching my code to see where the leak had crept in from. After an hour of debugging, which included using Google's online closure compiler (Thanks Sandy!!) and jslint, I came up with nothing.

It struck me that it could be some of the js-libraries that I was using that were causing the leak, and not my code. I tried doing a binary-search on the js-includes, but that didn't work since the code wouldn't even run without most of those includes. At this point I really didn't know how to even go about trying to fix it!!

A few hacky thoughts later, I decided to introduce this code in my application:
var _iv = setInterval(function() {
  if (typeof window.i != "undefined") {
    // 'i' leaked into the global scope some time in the last 10ms.
    console.log("Global 'i' just appeared!");
    clearInterval(_iv);
  }
}, 10);

Of course, this relied on the fact that I had inserted a slew of console.log() statements in my code at various points (it actually helped to have them!!).

This bit of hackery at least helped me determine that the error happened after a certain set of statements (functions) had executed. From here, it was manual work, checking every function and callback (yes) in my code as well as any library that I was using.

After about 30 minutes, I was able to trace the leak to embed.ly's jQuery library. I also discovered that the fb-complete jQuery library I was using had this leak, but it wasn't the one responsible for the specific instance of the variable 'i' that I was seeing then.

Sunday, January 30, 2011

When to use events & when to use callbacks?

With the recent spate of javascript programming that I've been subject to, I've often found myself asking: "Should I use callbacks here or should I use events (jQuery trigger & bind)?"

I've found this initial approximation quite useful till now: "If the operation that will follow depends upon the exact instance of this operation, then wrap the state in a closure and use a callback. Otherwise, use events." The advantage of using events is that:
1. You need not think of all the actions that could result in this event handler being called.
2. You need not tell anyone that this particular event handler needs to be invoked. Everyone just publishes some data (like a tweet) and all those interested (followers) just gobble that data up.
3. You can dynamically attach event handlers. So, if some component in the system suddenly decides to react to certain events, it can just plonk itself into the listeners queue for that event.

jQuery goes one step further and allows you to trigger (and bind) events on specific DOM nodes only!! And then it goes a step further and lets you define an event namespace to try and prevent event name conflicts.
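For instance (the selector and event names below are made up):

// Bind a handler in the 'chat' namespace on a specific DOM node.
$('#inbox').bind('message.chat', function(ev, text) {
  console.log('New message:', text);
});

// Only #inbox's listeners are notified, not other DOM nodes.
$('#inbox').trigger('message.chat', ['hello!']);

// The namespace lets us detach our handlers without touching
// anyone else's 'message' handlers.
$('#inbox').unbind('.chat');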

Tuesday, January 18, 2011

SQL Query optimization - Part-8 (Indexes Schmindexes)

The "Why are my queries slow despite indexes on all columns" Edition.

Consider the following table:
CREATE TABLE florist_orders(
 order_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, 
 basket_color INT (1) NOT NULL, -- This can be either 0, 1, 2 or 3
 num_roses INT NOT NULL, 
 num_sunflowers INT NOT NULL, 
 num_daisies INT NOT NULL, 
 processed INT(1) NOT NULL DEFAULT 0, 
 created TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Let's assume that the packaging for every basket of a different color is done by a different person. Additionally, basket packaging also differs based on the maximum number of flowers of each type: if there is more than 1 of any type, the packaging differs from when there is at most 1 of each type of flower. Each packager wants to check the orders that she needs to process.

The following query works fine:
SELECT * FROM florist_orders
WHERE
 basket_color = 2 AND
 processed = 0 AND
 (num_roses > 1 OR num_sunflowers > 1 OR num_daisies > 1)
ORDER BY created ASC -- Process older orders first to avoid starvation
;

However, as the size of the florist_orders table grows, you realize that each query is taking too long. Why is this? You have an index on every column and yet queries are slow?

You decide to create composite indexes. As a first attempt, you create an index on
(basket_color, processed, num_roses, num_sunflowers, num_daisies)
but notice that it hasn't really affected the query's execution time. Additionally, EXPLAIN says that no indexes are being used!! This is because the index can serve only the two leading equality columns, and the OR across three different flower columns cannot be answered by any single index; with basket_color and processed having so few distinct values, the optimizer decides a full table scan is just as cheap. What to do now?

To solve this issue, you would need to modify your table and make it friendly for this specific query. These are the steps you would need to take:
  1. Add a column named num_flowers which is set to MAX(num_roses, num_sunflowers, num_daisies) on every INSERT and UPDATE via TRIGGERS.
  2. Then, create an index on:
    (basket_color, processed, num_flowers)
  3. Modify your query to read:
    SELECT * FROM florist_orders
    WHERE
     basket_color = 2 AND
     processed = 0 AND
     num_flowers > 1
    ORDER BY created ASC -- Process older orders first to avoid starvation
    ;
This should pretty much fix any speed issues with the aforementioned query and ensure that all queries remain responsive. This trick will work ONLY if all the flowers are being checked to be greater than the SAME number.

Tuesday, January 04, 2011

Firefox, I love you!!

The day was saved by Firefox's page cache. I typed in a reply on a forum thread, clicked "post reply" and then closed the tab, only to realize that I wasn't logged in and that a login dialog had popped up just as I had closed the tab. I right-clicked the tab-list and clicked "Undo Closed Tab" and poof!! the complete tab with all its state was there, awaiting my credentials!!

Awesome stuff Firefox... I love you!!

Saturday, January 01, 2011

Peer-2-Peer systems at their best!!

Recently, the skype network went down. You can read all about it here.

However, the exciting take away from this is (quoting from the link above) "In order to restore Skype functionality, the Skype engineering and operations team introduced hundreds of instances of the Skype software into the P2P network to act as dedicated supernodes, which we nick-named “mega-supernodes,” to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud."

Which is exactly my idea of how a self-healing peer-2-peer system should behave. Introducing more nodes into the mix should ideally help solve scaling and availability issues, which is exactly what the Skype team did. I would like to congratulate them for that!!

Wednesday, December 29, 2010

Practical SVD for the curious reader

Let's see an example of how SVD can be used to perform seemingly daunting tasks.
Consider that we are implementing a rating system for a contest and we want to know which players may have similar skill. Additionally, we would like to know the difficulty of the tasks in the contest, since it is not known a priori.

We build a matrix that holds player number on the y-axis and task number on the x-axis; each entry is 1 if that player solved that task and 0 otherwise. Let us suppose that we roughly know the first 3 problems to be easy, the next 3 to be of medium difficulty and the last 3 to be hard. The built matrix would look like this:

# 0  1  2  3  4  5  6  7  8
[ 1, 1, 1, 0, 0, 0, 0, 0, 0 ], #0
[ 1, 1, 1, 1, 0, 0, 1, 0, 0 ], #1
[ 1, 0, 0, 0, 0, 1, 1, 1, 0 ], #2
[ 1, 1, 1, 1, 1, 1, 0, 0, 0 ], #3
[ 1, 0, 1, 1, 0, 0, 1, 1, 1 ], #4
[ 1, 0, 1, 0, 0, 0, 0, 0, 0 ], #5
[ 1, 0, 0, 1, 1, 1, 0, 0, 0 ], #6
[ 1, 0, 0, 1, 1, 1, 1, 1, 1 ], #7

Now, we compute the SVD of this matrix and keep only the top 4 features (singular values).
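In other words, we factor the 8x9 matrix A and keep only the first k = 4 singular values, along with the corresponding columns of U and V:

\[ A_{8 \times 9} \;\approx\; U_{8 \times 4} \, \Sigma_{4 \times 4} \, V_{9 \times 4}^{T} \]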

We get [U=]
[[-0.22828916 -0.50341088  0.06758431 -0.27577721]
 [-0.3796649  -0.38290566  0.21189598  0.16725752]
 [-0.29514798  0.33343393  0.18247298 -0.80693185]
 [-0.4283514  -0.28592541 -0.50767041  0.05970937]
 [-0.42379852  0.09624233  0.5690528   0.38393312]
 [-0.18451958 -0.30010945  0.12296957 -0.24683212]
 [-0.31511921  0.15603403 -0.56486439  0.03629706]
 [-0.46924257  0.53231036 -0.03863212  0.17781851]]

We get [S=]
[[ 4.86583872  0.          0.          0.        ]
 [ 0.          2.40125574  0.          0.        ]
 [ 0.          0.          2.02979097  0.        ]
 [ 0.          0.          0.          1.29857912]]

We get [V=]
[[-0.55984867 -0.14756061  0.02109021 -0.38852126]
 [-0.21297571 -0.48817872 -0.11242051 -0.03758748]
 [-0.33799385 -0.57307894  0.22851232  0.06799022]
 [-0.41435336  0.0482063  -0.16268579  0.63532176]
 [-0.24923004  0.16758689 -0.54742924  0.21086504]
 [-0.3098872   0.30644504 -0.45753181 -0.41053094]
 [-0.32221659  0.24115755  0.45560831 -0.06000612]
 [-0.24418998  0.40061814  0.35121531 -0.18880653]
 [-0.18353282  0.26176     0.26131788  0.43258946]]

Next, we need to compute U*S and V*S.
What is the significance of these matrices?
The matrix U*S will allow us to find users that are similar in skill to each other.
The matrix V*S will allow us to determine problems that have been misclassified or are actually similar in difficulty to each other.

After computing the U*S and V*S matrices, we will take the row-wise sum of squared differences (SOSD) and check the pairs of rows that have the least such difference. These are the rows that are of interest to us.
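Concretely, if R stands for either U*S or V*S, the number tabulated below for a pair of rows (i, j) is:

\[ \mathrm{SOSD}(i, j) = \sum_{k=1}^{4} \left( R_{ik} - R_{jk} \right)^2 \]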

We get [SOSD(U*S)=]
[(0, 1, 1.0430999999999999),
 (0, 2, 4.6740000000000004),
 (0, 3, 2.7736000000000001),
 (0, 4, 4.7484000000000002),
 (0, 5, 0.29770000000000002),
 (0, 6, 4.4981999999999998),
 (0, 7, 7.9534000000000002),
 (1, 2, 4.7319000000000004),
 (1, 3, 2.2631000000000001),
 (1, 4, 1.9745999999999999),
 (1, 5, 1.2628999999999999),
 (1, 6, 4.2881999999999998),
 (1, 7, 5.2785000000000002),
 (2, 3, 5.8609),
 (2, 4, 3.7233999999999998),
 (2, 5, 3.1476999999999999),
 (2, 6, 3.6909999999999998),
 (2, 7, 2.7824),
 (3, 4, 5.7964000000000002),
 (3, 5, 3.2058),
 (3, 6, 1.4441999999999999),
 (3, 7, 4.8299000000000003),
 (4, 5, 3.7522000000000002),
 (4, 6, 5.8014999999999999),
 (4, 7, 2.7383999999999999),
 (5, 6, 3.6880000000000002),
 (5, 7, 6.3265000000000002),
 (6, 7, 2.5535000000000001)]

Analysis of SOSD(U*S): User #0 & user #5 seem to be similarly skilled (they have the lowest SOSD). The following pairs of users also seem to be of similar calibre: (0, 1), (0, 5), (1, 4), (1, 5), (3, 6), (4, 7), (6, 7)
Some of this analysis may seem questionable, especially considering that we designated the last 3 problems to be hard. How are users #1 & #4 similar if this is the case?
Let's look at the V*S matrix.

We get [SOSD(V*S)=]
[(0, 1, 3.7989000000000002),
 (0, 2, 2.7381000000000002),
 (0, 3, 2.629),
 (0, 4, 4.7946),
 (0, 5, 3.6124999999999998),
 (0, 6, 3.1680999999999999),
 (0, 7, 4.6081000000000003),
 (0, 8, 5.6936999999999998),
 (1, 2, 0.9093),
 (1, 3, 3.3931),
 (1, 4, 3.3944000000000001),
 (1, 5, 4.5884),
 (1, 6, 4.6798999999999999),
 (1, 7, 5.5022000000000002),
 (1, 8, 4.2117000000000004),
 (2, 3, 3.5369999999999999),
 (2, 4, 5.8647999999999998),
 (2, 5, 6.8044000000000002),
 (2, 6, 4.0688000000000004),
 (2, 7, 5.8483000000000001),
 (2, 8, 4.8121),
 (3, 4, 1.6414),
 (3, 5, 2.8456000000000001),
 (3, 6, 2.806),
 (3, 7, 3.6351),
 (3, 8, 2.3344),
 (4, 5, 0.88270000000000004),
 (4, 6, 4.4261999999999997),
 (4, 7, 3.9102999999999999),
 (4, 8, 2.931),
 (5, 6, 3.6707999999999998),
 (5, 7, 2.931),
 (5, 8, 3.7172000000000001),
 (6, 7, 0.36359999999999998),
 (6, 8, 1.0225),
 (7, 8, 0.88270000000000004)]

The following pairs of problems seem to be equally difficult/easy: (3, 4), (1, 2), (4, 5), (6, 8), (7, 8), (6, 7)
You can immediately see a clique of (6, 7, 8) which were all originally designated as hard problems!
Similarly, (3, 4, 5) seem to be related and so do (1, 2). The reason that problem #0 cannot be classified by the system could be that, since it has been solved by everyone, not much can be said conclusively about it.
You would have noticed that users #1 & #4 have both solved problems #0, #2, #3 & #6. As it turns out, it seems that the features (as deduced by the system) have given a higher weightage to one or more of these problems, which is why users #1 & #4 have been clustered together.

Tuesday, December 28, 2010

SVD 101

I've been trying to find out more about SVD (Singular Value Decomposition) in the last few days, and here is what I have come up with (links marked with a star[*] MUST be READ). I'll try updating it as and when possible.
  1. http://www.funpecrp.com.br/GMR/year2007/vol4-6/xm0017_full_text.html
  2. http://www.cs.um.edu.mt/~csaw/CSAW04/Proceedings/09.pdf
  3. http://www.math.umn.edu/~lerman/math5467/svd.pdf
  4. http://www.seobook.com/lsi/tdm.htm
  5. http://web.mit.edu/snikolov/www/topicweb_verbose.pdf
  6. http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
  7. http://en.wikipedia.org/wiki/Latent_semantic_indexing
  8. * http://web.ics.purdue.edu/~park283/wp-content/uploads/2010/10/Singular_Value_Decomposition_Tutorial.pdf
  9. http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
  10. http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
  11. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html
  12. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-2-computing-singular-values.html
  13. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html
  14. * http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
  15. * http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html (explains the significance of the U & V matrices)
  16. * http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm (Thanks to Shashank Singh for this link)
SVD can be used to:
  1. Determine Document Similarity
  2. Determine User similarity in a competition (or in a rating-based system such as movie recommendations)
  3. Perform Image compression
  4. Perform Principal Component Analysis
  5. and much more...
You can use the following tools to compute the SVD of a matrix:
  1. SciLab
  2. NumPy (SciPy)

Saturday, December 25, 2010

Suggest more recipients (in a chat conference)

If you don't already know, gmail has had a feature called Suggest more recipients for a while now.

From the link above, "Gmail will suggest people you might want to include based on the groups of people you email most often. So if you always email your mom, dad, and sister together, and you start composing a message to your mom and dad, Gmail will suggest adding your sister. Enter at least two recipients and any suggestions will show up like this."

I am planning on extending this even for the chat conversation start page in the chat client that I'm working on. So, if you start entering JIDs of people you want to chat with, the client should automatically suggest more participants.

Thanks to Sandy for providing the link to the gmail blog!!

Update (02/01/2011): This feature is now implemented and part of the dev. versions.

Thursday, December 23, 2010

Debugging PHP's $_REQUEST

The $_REQUEST page in the PHP manual, says that "$_REQUEST is an associative array that by default contains the contents of $_GET, $_POST and $_COOKIE. The variables in $_REQUEST are provided to the script via the GET, POST, and COOKIE input mechanisms and therefore could be modified by the remote user and cannot be trusted. The presence and order of variables listed in this array is defined according to the PHP variables_order configuration directive."

I was under the impression that $_REQUEST would only contain data from the $_GET and the $_POST variables. In my application, I was setting cookies that had the same key as one of the GET parameters that I was expecting. Since cookies come after GET and POST in the default variables_order (EGPCS), the cookie value overwrites the GET value in $_REQUEST, so the script was getting the data set by the cookie and not the GET parameter!! I went half mad trying to debug this and was suspecting some caching issue. Things got queerer when changing some of the GET parameters worked, but others refused to budge from their original values!! You can imagine the perplexing situation I was in. Thankfully, though, it dawned on me to check the PHP manual to see what $_REQUEST was supposed to contain and I was able to fix my code to use $_GET.

Since then, I have come across this very useful page which mentions the exact problem I had and also says why one should avoid using $_REQUEST.