Thursday, August 02, 2012

SQL Query Execution Optimization opportunities missed: Part-1

There are some common queries that are used almost everywhere, yet most (all??) SQL execution engines don't execute them as cleverly as they could.

An example of one such query is the quintessential pagination query. Assume you have a schema for comments that looks like this:


CREATE TABLE COMMENTS(post_id INT, 
  comment_id INT, 
  comment_text TEXT,
  poster_id INT,
  posted_at TIMESTAMP,
  PRIMARY KEY (comment_id),
  UNIQUE KEY (post_id, posted_at, comment_id)
);


Now, you display the first page (say the most recent 10 comments) for a post (with id=33) using a query such as:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 0,10;


The LIMIT 0,10 syntax fetches at most 10 comments from offset 0 in the result set.

The most natural way to fetch the 2nd page would be to say:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 10,10;


Similarly, you can get the 3rd page by saying:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 20,10;


and so on...

You must have noticed that when LIMIT 1000,10 is specified, a naive execution of the query involves fetching (and discarding) the first 1000 rows before the 10 interesting rows are returned, and most databases will do just that. There is, however, a better way to execute such queries.

(Let's leave aside the discussion of whether pagination queries should be implemented this way. The observant reader will notice that we have an index on (post_id, posted_at, comment_id), which can be used to fetch the next 10 results given the post_id, posted_at, and comment_id of the last comment on the previous page; a sketch of such a query follows.)
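
Such a "keyset" query might look roughly like the following in MySQL-style SQL. This is only a sketch: :last_posted_at and :last_comment_id are hypothetical placeholders for the values taken from the last comment of the previous page, comment_id is added to the ORDER BY as a tie-breaker to make the ordering deterministic, and whether the optimizer uses the (post_id, posted_at, comment_id) index efficiently for the row comparison varies by database and version.

SELECT * FROM COMMENTS
  WHERE post_id=33
    AND (posted_at, comment_id) < (:last_posted_at, :last_comment_id)
  ORDER BY posted_at DESC, comment_id DESC
  LIMIT 10;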

We know that most (if not all) databases use B-Trees to store indexed information. It is easy to augment the B-Tree to hold information such as "how many elements are under this node (including the node itself)?" This information alone lets us locate the node of interest very quickly (in O(log n) I/O look-ups). Suppose we want the entry from the COMMENTS table at offset 1000 among the rows with post_id=33. We first perform a look-up for the first entry with post_id=33; during that same descent we can also determine, at no extra cost, how many entries are less than the located entry (since we already have the counts cached at each node). Let the found entry be at offset OFF1. Subsequently, we can query the B-Tree for the entry at offset OFF1 + 1000, again in O(log n), since we have the cumulative counts cached for every value in the B-Tree node!
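
To make this concrete, here is a minimal sketch in Python. It uses a plain (unbalanced) binary search tree rather than a disk-based B-Tree, but the idea is the same: every node caches the size of its subtree, select() returns the key at a given 1-based rank, and rank() returns a key's position. The class and function names are mine, purely for illustration.

# Order-statistic tree sketch: a binary search tree in which each node
# caches the size of the subtree rooted at it. Keys can be tuples such as
# (post_id, posted_at, comment_id), which Python compares lexicographically.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                    # nodes in the subtree rooted here

def size(node):
    return node.size if node else 0

def insert(root, key):
    """Insert key, keeping the cached subtree sizes up to date."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    root.size = 1 + size(root.left) + size(root.right)
    return root

def select(root, k):
    """Return the k-th smallest key (1-indexed), or None if k is out of range."""
    while root is not None:
        left = size(root.left)
        if k == left + 1:
            return root.key
        if k <= left:
            root = root.left
        else:
            k -= left + 1
            root = root.right
    return None

def rank(root, key):
    """Return the number of keys <= key, i.e. the key's 1-indexed position."""
    r = 0
    while root is not None:
        if key < root.key:
            root = root.left
        else:
            r += size(root.left) + 1     # left subtree plus this node
            root = root.right
    return r

For the COMMENTS example above, one would find the first entry with post_id=33, compute r = rank(root, that_key), and call select(root, r + 1000) to land directly on the row at offset OFF1 + 1000, without touching the 1000 rows in between.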

For example, consider the Binary Tree shown below:



The letters in the top half of every node are the keys, and the number in the bottom half of each node is the count of nodes in the sub-tree rooted at that node.

We can answer the following query on this tree in O(log n) time:

SELECT KEY FROM BINARY_TREE
  WHERE KEY > D
  ORDER BY KEY
  LIMIT 3,2;


i.e. we want the 5 smallest keys greater than 'D', but only the last 2 of them. In other words, we skip the 3 smallest keys greater than 'D' and return the next 2.

We proceed as follows:
  1. Find D
  2. Find Rank(D) = 4
  3. Find the KEY K such that Rank(K) = Rank(D) + 3 = 7. This happens to be the KEY 'G', the last key to be skipped
  4. Repeatedly move to the in-order successor of the node we are on, collecting keys, until we either run out of nodes or have collected 2 keys (whichever comes first)
Note: Within the sub-tree rooted at a node, the rank of that node is computed as:

Rank(Node) = SubtreeSize(Node->Left) + 1 [if Node->Left exists]
Rank(Node) = 1                           [otherwise]

To get a key's rank in the whole tree, we accumulate these contributions while descending from the root: every time we move into a right child we add SubtreeSize(Node->Left) + 1 for the node we are leaving, and when we reach the target node we add its own contribution from the formula above. (A code sketch of steps 1-4 appears at the end of this post.)


Hence, we can satisfy this query in O(log n + Result_Set_Size) time, where n is the number of elements in the Binary Tree.
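
To tie the steps together, here is a rough sketch of the whole LIMIT 3,2 procedure, reusing the select() and rank() helpers from the earlier sketch (again, my names, not any database's API). For brevity it calls select() once per returned key instead of walking in-order successors as step 4 describes; the successor walk is what brings the cost down to O(log n + Result_Set_Size).

def limit_after(root, key, offset, count):
    """LIMIT offset,count over 'KEY > key ORDER BY KEY'."""
    # rank(key) + offset is the rank of the last key we skip (step 3);
    # the answers are the keys at the next `count` ranks (step 4).
    start = rank(root, key) + offset
    result = []
    for i in range(1, count + 1):
        k = select(root, start + i)
        if k is None:                    # ran out of keys
            break
        result.append(k)
    return result

# On the example tree: rank(root, 'D') == 4, select(root, 7) == 'G',
# and limit_after(root, 'D', 3, 2) returns the two keys that follow 'G'.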