Saturday, November 01, 2014

Macho-ism in Computer Science

It's common for me to see blog posts by companies talking about the high traffic volume (in QPS/RPS) they handle and the amount of data they process, and that's super cool. But there's another class of facts I see floating around a lot: the size of a company's Hadoop or serving cluster, where having "dozens or hundreds" of machines in the fleet seems to be something to be proud of. I don't understand this way of thinking, or at least I don't see the point of it. As I see things, it's nicer to get more done with fewer machines than to have more machines in your fleet.

Sunday, October 12, 2014

Static To Dynamic Transforms or: How I learnt to stop worrying and love static data structures

Gaurav in his blog post describes in great detail what a static to dynamic transform is, when it is applicable, how to dynamize a static data structure, and the costs of inserting and looking up values in a dynamized data structure (relative to the costs in the corresponding static structure). I'll skip all of that since it's been presented so well in the link above, but will give a short description of the essentials.
  • What is a static data structure? A static data structure is one where it isn't computationally efficient to insert a new element, and typically involves re-building the whole structure to be able to add just one element. e.g. A sorted array. If you want to keep an array of elements sorted, then inserting a single element could involve shifting all the elements one place to the right.
  • What is a dynamic data structure? A dynamic data structure is one where adding a single element is computationally efficient, and doesn't involve touching every element in the data structure. e.g. A height-balanced binary search tree. Inserting a single element in a height-balanced binary search tree takes O(log n) time, including any node rotations. See this page to read more about the differences between static and dynamic data structures.
  • What is the amortized insertion time to insert an element in a dynamized sorted array? Inserting a single element into a dynamized sorted array costs O(log n) per insertion.
  • How much extra space do you need to dynamize a sorted array? You need O(n) extra space to dynamize a sorted array, since you need intermediate storage space when merging parts (levels) of the dynamized data structure.
  • What is the query time in a dynamized sorted array? Searching for an element in a sorted array costs O(log n), whereas searching in a dynamized sorted array costs O(log² n).
We can see that the overhead of dynamizing a sorted array is something we can live with. In fact, it's almost unbelievable that we can dynamize a sorted array by paying only as much as an O(log n) overhead per insertion, and an O(log n) overhead per query.
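The bounds above can be made concrete with a minimal sketch of a dynamized sorted array. It assumes the usual logarithmic-method layout, where level i is either empty or a sorted run of exactly 2^i keys. The class name `DynamizedSortedArray` and its methods are hypothetical names chosen for illustration, not taken from the linked post.

```python
import bisect
import heapq


class DynamizedSortedArray:
    """Sketch: levels[i] is either [] or a sorted list of exactly 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        # Works like incrementing a binary counter: a full level is merged
        # into the carry, emptied, and the carry moves up one level.
        carry = [key]
        for i, level in enumerate(self.levels):
            if not level:
                self.levels[i] = carry
                return
            # Both runs are sorted and of equal length, so this is a linear merge.
            carry = list(heapq.merge(level, carry))
            self.levels[i] = []
        self.levels.append(carry)

    def __contains__(self, key):
        # One binary search per level: O(log n) levels times O(log n) each.
        for level in self.levels:
            j = bisect.bisect_left(level, key)
            if j < len(level) and level[j] == key:
                return True
        return False
```

The merge step is where the O(n) scratch space goes, and the loop in `__contains__` is where the extra O(log n) factor on queries comes from.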

Static to Dynamic transforms in practice: Consider that you're working with an inherently static data structure such as the SSTable (Sorted String Table), and that your system must maintain all its data as part of some SSTable. In such a case, inserting even a single row means that you need to build a new SSTable which contains all the rows from the previous SSTable plus the newly inserted row. This means that the cost of inserting a single row can be linear in N, N being the number of elements in the newly created SSTable. This is extremely undesirable since it means that inserting N elements into the system will cost O(N²).

This is exactly where the Static To Dynamic Transform comes in super handy. You just apply the transform, and almost magically, you've gone from an overall running time of O(n²) down to O(n log n) for inserting n elements into the data structure.
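To get a feel for the gap between the two running times, here is a rough sketch that counts how many rows get written under each scheme. It is a simplified cost model, not a measurement of any real SSTable implementation: it assumes one write per row each time a table or run is (re)built, and the function names are made up for this example.

```python
def naive_rebuild_writes(n):
    # The i-th insertion rewrites a table holding i rows.
    return sum(range(1, n + 1))


def dynamized_writes(n):
    # occupied[i] is True when level i currently holds a run of 2**i rows.
    occupied = []
    writes = 0
    for _ in range(n):
        run = 1
        writes += run  # write out the new single-row run
        i = 0
        while i < len(occupied) and occupied[i]:
            occupied[i] = False
            run *= 2       # merge with the equally sized run at level i
            writes += run  # the merged run is written out in full
            i += 1
        if i == len(occupied):
            occupied.append(True)
        else:
            occupied[i] = True
    return writes
```

Under this model, inserting n = 1024 rows costs 524,800 row writes with the rebuild-everything approach and 11,264 with the transform, which is n(log2 n + 1).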

Saturday, October 11, 2014

Deamortizing disk writes during log rotation

According to this page on deamortization, "deamortization refers to the process of converting an algorithm with an amortized bound into one with a worst-case bound."

Log Rotation refers to the automated process used in system administration in which dated log files are archived. logrotate is one such UNIX command that lets you perform log rotation.

Many services (daemons in the UNIX world) write out log files that need to be periodically rotated and compressed. This is achieved using the logrotate command (mentioned above). Certain tools such as multilog also assist in augmenting the log with extra information such as the timestamp, etc...

When a log file is rotated and compressed, the rotation process ends up reading the complete log file from disk (if it isn't present in the kernel caches), and using a lot of CPU for the duration of the compression. Depending upon the size of the log file, this could take anywhere from a fraction of a second to tens of seconds. If the log file is large (which is usually the case), this ends up starving other processes that need to use the disk or CPU. If the supervising process (performing the job of log rotation and compression) is synchronously reading from the daemon service's stdout/stderr, and is taking too long to rotate/compress the logs, then it will block writes to the daemon's stdout/stderr file descriptors since the pipes between the supervisor and the daemon will be full.

This can be mitigated by deamortizing the costs of log rotation/compression: write out the compressed file incrementally, instead of compressing the uncompressed log file when it is rotated and a new log file is started. This not only deamortizes the I/O and CPU costs, but also ensures that we write far fewer bytes to disk. This is because we were earlier writing out the uncompressed file and subsequently compressing it and writing the compressed log file, whereas we will now only write out the compressed file (eliminating writes of the uncompressed file, which are a large fraction of total log file writes). This simple change can go a long way in helping you scale your services to be able to handle a larger number of overall requests.
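As a minimal sketch of the idea, assuming gzip as the compression format, a supervisor could compress each log line as it arrives. The `CompressedLogWriter` class and its method names are hypothetical; this is not how logrotate or multilog work internally.

```python
import gzip


class CompressedLogWriter:
    """Hypothetical helper: compresses each log line as it arrives."""

    def __init__(self, path):
        self._open(path)

    def _open(self, path):
        # "ab" appends a new gzip member if the file already exists.
        self._file = gzip.open(path, "ab")

    def write_line(self, line):
        # The compression cost is paid here, a little at a time.
        self._file.write(line.encode("utf-8") + b"\n")

    def rotate(self, next_path):
        # Rotation only closes one stream and opens another;
        # no existing log data is re-read or recompressed.
        self._file.close()
        self._open(next_path)

    def close(self):
        self._file.close()
```

With this shape, `rotate()` does a small, roughly constant amount of work, so it shouldn't stall the pipe from the daemon for long. One trade-off to keep in mind is that recently written lines may sit in the compressor's buffer until the stream is flushed or closed.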

Thursday, March 27, 2014

Gyro balls and yoga

The Powerball is a relatively new gyro-based exercise device intended for strength building in the wrists and fore-arms. I'm using it to help me in my yoga practice.
Aasanas that require wrist and forearm strength are greatly benefited by using the wrist exercising gyro for 3-5 minutes a day.

For example, using the gyro helped me get going with hamsasana (where your fingers are pointing towards your face)
which is a lot harder than mayurasana (where your fingers are pointing away from your face, and towards your feet).

You can immediately spot the difference between the final pose for each posture.
  1. In hamsasana, your body is bent at the buttocks, whereas in mayurasana, the body is mostly straight.
  2. In hamsasana, your face is closer to the ground than it is in mayurasana.
  3. In hamsasana, you are bent further forward (angle that the forearms make with the floor) than you are in mayurasana.
  4. In hamsasana, your back is slightly more bent compared to mayurasana, where your back is mostly straight.
  5. In hamsasana, your elbow is further down your abdomen compared to mayurasana, where it is tucked into your stomach.
  6. Hamsasana needs more wrist strength compared to mayurasana.

Tuesday, February 18, 2014

To Muir Woods and back

S, V, and D decide to bike to Muir Woods from SF Caltrain. I have a sore index finger, so this will be brief. This is the path we took in the going direction, and this is the elevation profile. On the way back we took a slightly longer route.

6:30am: Wake up.

7:55am: Leave for Caltrain Station

9:36am: Reach San Francisco Caltrain Station

9:40am: Purchase fruits and food items for the trip from Safeway

Here are our bikes:

10:00am: Start biking on The Embarcadero towards the Golden Gate Bridge

The Bay Bridge is on the way:

Some other exhibit. Looks like an arrow facing the ground:

10:05am: Arrive at The Bike Hut since V forgot their helmet at home. The guy at the bike hut took a while to inflate his shop, but once he was done, offered V a helmet gratis. Very kind man.

Don't remember when, but reached the Golden Gate Bridge (south side) and then biked across to the north side of the bridge. This was fun. Didn't take any photos since there was quite a bit of uphill and we didn't think of taking photos at that time.

We reach the north side of the bridge and rest for a while. Then take the road on to Sausalito. This road is marginally uphill but mostly downhill. I don't enjoy the downhill as much in fear of having to bike back up :-p

Reach Sausalito and can't believe the breathtaking views from the place.

Someone is flying a Quadcopter here as well. The battery life on these seems to be ~20 minutes. The owner prefers to use it just for 15. That's 25% energy wasted. Greenpeace, are you listening??

We keep biking along the coast (Bridgeway) and eventually cross 101 at some point in time. Then on to Mill Valley-Sausalito Path, and then left on to Sycamore Ave. Nothing particularly interesting between Sausalito and now.

We have to constantly consult our maps because the road is no longer just a straight road. We're afraid of taking a wrong turn and biking in the wrong direction. The constant checks make biking frustrating and make us uneasy. V suggests investing in a phone holder for bikes. I provide a nod of approval.

Our first reality check is at Molino Park where we wonder how far we want to go since the uphill slopes have been tiring. V's determination and unwillingness to give up inspires me. We turn on to Edgewood Avenue with renewed energy and determination.

1:30pm: We reach the tip top of Sequoia Valley Road, and the views are just breathtaking. S reaches shortly after. I've dismounted and crossed the road and am already taking photos.

The road ahead is ~700ft of pure downhill pleasure (which means uphill on our way back). S asks if we want to continue on and whether we'll be able to make it back in time. I make some quick approximate calculations and boldly suggest that we should be back by 5:30pm. Sounds like an estimate for delivery on a software project. Read on to find out what happens next.

V arrives and we unanimously decide to continue (against my better judgment, but hey, I did want to visit Muir Woods too :-p). The next ~7 minutes is just pure downhill orgasmic thrill.

Muir Woods Visitor Center is here. We decide to stay till 2:30pm. We have a Bartlett Pear each. V needs caffeine in their blood, so we stop for a coffee break.

There's a tiny bridge over a stream and a lot of visitors!

And some #firstworldproblems:
2:45pm: And we're back on the road. This is an ~700ft ascent, and we do it fairly slowly.

3:15pm: The view at this time of the day on the tip top of Muir Woods Road is breathtaking.

V's bike:

3:30pm: There is a slight hiccup when we don't know which road we are on. We think that gmaps is showing us the right road, but the road sign seems to be wrong. We ask a tired runner, and he guides us. We later realize that the road name changes, and gmaps doesn't show where this transition happens. This can get confusing.

4:05pm: We stop at a place where we see a sea-plane and contemplate taking it :-p We also see a helicopter landing there. We all unanimously decide to take the ferry back from Sausalito to Pier39. S is the only one with power in their cell phone. The next ferry is at 5:35pm, and it will take us ~30 minutes to get there according to gmaps. We amble our way to the ferry terminal.

4:45pm: We're almost at the ferry terminal when S spots R and family. We're pleasantly surprised by the meeting and exchange pleasantries and indulge in small talk.

5:00pm: We head to the ferry terminal and stand in line with the rest of the bikers. S determines the status of tickets and gets a Clipper Card. S tells V that tickets cost $4.25, but the clipper card needs $10.25 balance. I am unsure if I have that sort of cash, but I assume that I can refill my card. I ask S some more questions since I am confused about what is going on. S is searching for their Caltrain ticket. S and I need to communicate more. I have a hard time understanding things and S thinks I am smarter than I am.

5:20pm: The line moves forward, but the ferry is filling up fast. I see a ticket window and wonder why we aren't buying tickets. Everything seems bizarre to me. S then explains that we can pay with the Clipper Card on the ferry and that $10.25 is the cost of the ferry and $4.25 is the minimum balance required. I really need to find a good way to communicate with S. We casually joke that we shouldn't be the first ones in line for the next ferry.

We are the first ones in line for the next ferry. Never has a STOP sign hurt so much.

5:45pm: We wait for the ferry to leave (since it isn't over till it's over), and start biking back. I am somewhat pleased that we aren't waiting for the next ferry, which is at 6:45pm.

Sausalito uphill seems easier than the one on Muir Woods Road. We are all pretty tired.

6:00pm: We've reached a point in the road where we need to decide whether we should turn right or go straight. I remember turning left after emerging from a tunnel on our way here, so I suggest turning right. I check the map for good measure, and it looks okay. This is my first blunder. We turn right into a tunnel that's longer than any other tunnel we've taken so far and find ourselves on Bunker Road. The phone doesn't have data, but the GPS is able to tell us where we are. We can't route back so we decide to retreat back into the tunnel we came from.

6:05pm: We're back at the mouth of the tunnel, and we decide to route back to SF Caltrain. gmaps suggests a path that doesn't exist. We don't know this, so we take the path. On the way down (which is a steep downhill), S suggests that they don't remember gaining so much altitude on the way up. We're halfway down. I completely agree with S, but don't have a better idea, so I don't stop. This is my second blunder. We find ourselves at the bottom of the Golden Gate Bridge. The view is majestic.

We crawl back up to the Golden Gate Bridge and realize that we have to cross since the west side of the pedestrian/bike path is now closed and we need to go over to the east side. My 5:30pm estimate has been blown to bits.

7:05pm: We take a short break to determine when we'll reach Cafe Chaat (dinner!!). I speculate another 20-25 minutes to reach Cafe Chaat.

7:55pm: We reach Cafe Chaat. Dinner is on the menu :-p

9:15pm: Take the last Caltrain back (it's Sunday).

Friday, January 31, 2014

Making soft chapatis that balloon

I've been making chapatis for a while now, and conventional wisdom says that getting the dough well kneaded is the key to soft chapatis since it forms the gluten strands in the dough and makes it tough. I very recently chanced upon the real method to make soft pliable dough that also results in chapatis that balloon even under sub-optimal rolling.

The key is to knead the dough in salty water. Take some salt (to taste) and mix it in some water (at room temperature). When the salt has completely dissolved in the water, use that water to knead your dough. You will find it much easier to knead your dough, and it will result in a soft pliable dough that is somewhat forgiving of mistakes while rolling (such as the chapati getting stuck to the rolling pin and folding over itself).

I additionally add some turmeric and red chilli powder to my dough for enhanced taste and colour! :)

Friday, January 10, 2014

Merging AVL Trees

Problem Statement: Given two AVL trees T1 and T2, where the largest key in T1 is less than the smallest key in T2, Join(T1, T2) returns an AVL tree containing the union of the elements in T1 and T2. Give an algorithm (in pseudocode) for Join() that runs in time O(log n), where n is the size of the resulting AVL tree. Justify the correctness and efficiency of your algorithm.

Problem statement and solution stolen from these assignment solutions.

Solution: Begin by computing the heights h1 of T1 and h2 of T2. This takes time O(h1 + h2). You simply traverse a path from the root, going to the left child if the balance factor is -1, to the right child if it is positive, and to either child if the balance factor is 0, until you reach a leaf. Assume that h1 > h2; the other case is symmetric.

Next, DELETE the smallest element x from T2, leaving T'2 of height h. This takes O(h2) time.

Find a node v on the rightmost path from the root of T1, whose height is either h or h + 1, as follows:

v = root(T1)
h' = h1
while h' > h + 1 do
  if balance factor(v) = -1
  then h' = h' - 2
  else h' = h' - 1
  v = right-child(v)

This takes O(h1) time.

The reason we accept a node of height h or h + 1 is that heights along the path can change by 2 in a single step: if we are at a node of height h and we move to its parent node, then the height of the parent might be greater by 2 (that is, h + 2), since the sibling of the node from which we moved up might be taller (by 1) than that node. The walk down can therefore skip over height h + 1 and land directly on height h.

Let u denote the parent of v.

Form a new tree whose root contains the key x, whose left sub-tree is the sub-tree rooted at v and whose right sub-tree is T'2.

Note that this is a valid binary search tree, since all the keys in the sub-tree rooted at v are in T1 and, hence, smaller than x, and, by construction, x is smaller than or equal to all elements in T'2. The balance factor of the root of this new tree is h - h' which is either -1 or 0, so this new tree is a valid AVL tree. The height of this new tree is h' + 1, which is 1 bigger than v's height. Let the root of this new tree be the right child of u, in place of v. Again, since all keys in this new tree are at least as big as u, this results in a valid binary search tree. This part of the construction takes constant time.

Now, as in the INSERT algorithm, we go up the tree, starting at u, fixing balance factors and perhaps doing a rotation. This takes O(h1) time. Correctness follows because the condition at u before this process begins is one that can arise during the INSERT algorithm.

Since h1, h2 ∈ O(log n), the total time taken by this algorithm is in O(log n).
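Here is a rough sketch of Join() along the lines of the solution above. It departs from the solution in two ways, so treat it as an illustration rather than the assignment's answer: each node stores its height instead of a balance factor, and the walk down the right spine is written recursively, so the "go up the tree fixing balance" step happens as the recursion unwinds. All names (`Node`, `rebalance`, `join_right` and so on) are my own.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(height(left), height(right))


def height(t):
    return t.height if t else 0


def update(t):
    t.height = 1 + max(height(t.left), height(t.right))
    return t


def rotate_left(t):
    r = t.right
    t.right = r.left
    r.left = update(t)
    return update(r)


def rotate_right(t):
    l = t.left
    t.left = l.right
    l.right = update(t)
    return update(l)


def rebalance(t):
    # Restore the AVL property at t, assuming both subtrees are valid AVL trees.
    update(t)
    balance = height(t.right) - height(t.left)
    if balance > 1:
        if height(t.right.left) > height(t.right.right):
            t.right = rotate_right(t.right)
        return rotate_left(t)
    if balance < -1:
        if height(t.left.right) > height(t.left.left):
            t.left = rotate_left(t.left)
        return rotate_right(t)
    return t


def delete_min(t):
    # Returns (smallest key, remaining tree); O(height) time.
    if t.left is None:
        return t.key, t.right
    x, t.left = delete_min(t.left)
    return x, rebalance(t)


def join_right(t1, x, t2):
    # Walk down the right spine of t1 to a node of height h or h + 1,
    # where h = height(t2), and hang a new node with key x there.
    if height(t1) <= height(t2) + 1:
        return Node(x, t1, t2)
    t1.right = join_right(t1.right, x, t2)
    return rebalance(t1)


def join_left(t1, x, t2):
    # Symmetric case: t2 is the taller tree, so walk down its left spine.
    if height(t2) <= height(t1) + 1:
        return Node(x, t1, t2)
    t2.left = join_left(t1, x, t2.left)
    return rebalance(t2)


def join(t1, t2):
    # Requires every key in t1 to be smaller than every key in t2.
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    x, t2 = delete_min(t2)
    if height(t1) >= height(t2):
        return join_right(t1, x, t2)
    return join_left(t1, x, t2)
```

`delete_min` costs O(h2), and the spine walk plus the rebalancing on the way back up costs O(h1) in the worst case, which matches the O(log n) bound argued above.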