Sunday, September 23, 2007

An HTTP server with diff capabilities.

These days, traffic on the internet is surging, and many sites serve dynamic content to their users. Static content hasn't lost its popularity, though: many sites still serve a lot of their pages more or less statically. These static pages don't change very often, yet clients keep re-requesting them and asking for them to be refreshed. Google's home page is one example. On top of that, site administrators often don't want clients to cache page content at all, because they want clients to see the latest version of any static content that may have changed since their last visit. So they advertise HTTP headers that forbid browser and proxy caches from caching their pages.

If you look carefully, even when used wisely, the If-Modified-Since header involves sending the entire page back whenever it has been modified. What I'm suggesting here is a diff-based scheme, wherein the client sends a request to the server indicating a previous version of the page that it already possesses, along with a flag indicating that it is willing to accept a diff to the current version of the page. The server can then (optionally) send back a patch, which the client applies to the copy of the page it already has. Sending the patch is optional because the server may not have kept the version the client is referring to, in which case it cannot generate a patch. The server then sends back the whole page, just as it does today.
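The exchange described above can be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not a worked-out protocol: the `DiffServer` class, the use of a SHA-256 hash as the version identifier, and the copy/insert patch format are all hypothetical, and real HTTP headers are left out entirely.

```python
import hashlib
from difflib import SequenceMatcher


def version_id(page):
    # Hypothetical version identifier: a hash of the page content.
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


def make_patch(old, new):
    # Build a line-based patch: "copy" ranges taken from the old page,
    # plus "insert" runs of lines that only exist in the new page.
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    patch = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            patch.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": ship the new lines
            patch.append(("insert", new_lines[j1:j2]))
        # "delete": nothing to send, the old lines are simply not copied
    return patch


def apply_patch(old, patch):
    # Client side: rebuild the new page from the old copy and the patch.
    old_lines = old.splitlines(keepends=True)
    out = []
    for op in patch:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return "".join(out)


class DiffServer:
    # Toy server that remembers every version of one page it has published.
    def __init__(self, page):
        self.current = page
        self.history = {version_id(page): page}

    def publish(self, page):
        self.current = page
        self.history[version_id(page)] = page

    def respond(self, have_version=None, accepts_diff=False):
        # Send a patch only if the client asked for one and the server
        # still holds the version the client has; otherwise the full page.
        old = self.history.get(have_version) if accepts_diff else None
        if old is None:
            return ("full", self.current)
        return ("patch", make_patch(old, self.current))
```

A client holding an older copy would call `respond(version_id(old_copy), accepts_diff=True)` and run `apply_patch` on a `"patch"` reply; a `"full"` reply is used as-is, which covers the case where the server no longer has the old version.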

What this scheme should effectively do is reduce bandwidth consumption, most noticeably in places where a central proxy serves many users who access the same static content repeatedly. If the content changes only a little, or doesn't change at all, the patch will be much smaller than the page, so less data crosses the network. The trade-off is CPU time for network bandwidth: generating the diff on the server and applying the patch on the client are both CPU-intensive compared to simply sending the whole page.
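To get a feel for the bandwidth side of that trade-off, here is a small sketch that compares the size of a full page with the size of a unified diff against the previous version. The 200-paragraph page and the single edited line are made up for illustration, and the unified diff format is just one convenient choice, not something the scheme requires.

```python
from difflib import unified_diff


def transfer_sizes(old, new):
    # Return (bytes for the full new page, bytes for a diff from old to new).
    diff = "".join(
        unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            n=0,  # no context lines, to keep the patch as small as possible
        )
    )
    return len(new.encode("utf-8")), len(diff.encode("utf-8"))


# Hypothetical static page with one small change between versions.
old_page = "".join(f"<p>paragraph {i}</p>\n" for i in range(200))
new_page = old_page.replace("paragraph 100<", "paragraph one hundred<")

full_bytes, diff_bytes = transfer_sizes(old_page, new_page)
print(f"full page: {full_bytes} bytes, diff: {diff_bytes} bytes")
```

When the two versions are identical the diff is empty, which is the best case for this scheme. The CPU cost of computing the diff is not measured here.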
