Saturday, May 07, 2011

Computing the Running Median of an Array of numbers

The problem of computing the running median of an array of numbers involves choosing a window size for which you want to compute the median for. For example, suppose you have the following list of 10 integers:

4, 6, 99, 10, 90, 12, 17, 1, 21, 32

and you want to compute the running median for a window of size 7, you will get the following list of 4 medians:

Median for 4, 6, 99, 10, 90, 12, 17  : 12
Median for 6, 99, 10, 90, 12, 17, 1  : 12
Median for 99, 10, 90, 12, 17, 1, 21 : 17
Median for 10, 90, 12, 17, 1, 21, 32 : 17

The naive way would be to always sort the array and pick the middle element. This costs O(n.log n) per operation (when an integer leaves and enters the window). A slightly smarter solution will try to keep the array sorted and remove and insert the integers in O(n) time. Even though this is better, we hope to do better.

You can find many resources online that try to solve the problem, but surprisingly, very few mention an O(log n) solution (per insertion and removal of an integer), which I believe is quite easy to implement.
  1. http://mail.python.org/pipermail/python-list/2009-October/1222595.html
  2. http://stats.stackexchange.com/questions/134/algorithms-to-compute-the-running-median
  3. http://stats.stackexchange.com/questions/3372/is-it-possible-to-accumulate-a-set-of-statistics-that-describes-a-large-number-of/3376
  4. http://code.activestate.com/recipes/577059-running-median/

I shall first describe a general (data structure independent) procedure for computing the running median of a set of integers and provide other specific tricks that use a balanced binary tree to achieve the same.

General Technique:
The general technique involves keeping the data (we assume that all integers are distinct for the sake of simplicity) partitioned into 3 sets, namely:
  1. All elements less than the median
  2. The median
  3. All elements greater than the median

Set-1 should support the following operations with the below mentioned runtime complexity:
  • Insertion of an element: O(log n)
  • Removal of an arbitrary element (given its handle - the handle is returned by the data structure when the element is inserted): O(log n)
  • Query the maximum valued element: O(log n)

Any operation on Set-2 is pretty trivial to implement, so I'll skip to the next set.

Set-3 should support the following operations with the below mentioned runtime complexity:
  • Insertion of an element: O(log n)
  • Removal of an arbitrary element (given its handle - the handle is returned by the data structure when the element is inserted): O(log n)
  • Query the minimum valued element: O(log n)

What we shall do is initially sort the first 'k' (window size) elements in the stream and create the 3 sets as described above. We also maintain a queue of handles as returned by each of the data structures managing Sets 1 & 3. This will help us remove the oldest element from these sets.

When a new integer is to be inserted, we follow these steps:
  1. Get the handle to the oldest element and remove it from the set in which it currently is
  2. Insert the new element in Set-1 (or Set-3) and push the returned handle into the queue
  3. If the median element goes missing, we pick up the largest element from Set-1 (or the smallest element from Set-3), remove that from the set and set that as the median element
  4. If we notice that the sizes of Sets 1 & 3 are not within 1 of each other, we shift the elements via the median from one set to the other
  5. We always shift the largest element from Set-1 or the smallest element from Set-3
  6. We ensure that the handles get updated whenever the elements move from one set to the other. This requires the handle to be smart and contain a reference to its parent set

If we follow the steps mentioned above, we are ensured that the median is always present in Set-2 (Set-2 always has just 1 element).

A Max-Heap can be used to implement Set-1 and a Min-Heap can be used to implement Set-3. Alternatively, a Min-Max-Heap can be used to implement both. We can also use a balanced binary tree (such as an AVL Tree, Red-Black Tree, AA Tree) to implement Sets 1 & 3.

Balanced Binary Tree Technique:
We can use a Balanced Binary Tree such as an AVL Tree or a Red-Black Tree with order-statistic querying to implement the running median algorithm. Insertions and deletions can be performed in O(log n) time (as before, we maintain a queue of the elements - we do not need handles in this case since a balanced binary tree can delete an element in O(log n) given the element's value). Furthermore, we can query the tree for the WINDOW_SIZE/2 ranked element (which is the median) at every stage. This is by far the simplest and cleanest technique. You can find an implementation of an AVL Tree that supports querying the k-th smallest element (for any k) here.

Example code to use the AVLTree linked above:

var algo = require('algorithm');
var items = [4, 6, 99, 10, 90, 12, 17, 1, 21, 32];

function print_median(tree) {
  console.log("Median:", tree.find_by_rank(Math.ceil(tree.length / 2)));
}

function running_median(items, window_size) {
  var tree = new algo.AVLTree();
  for (var i = 0; i < items.length; ++i) {
    tree.insert(items[i]);
    if (tree.length == window_size) {
      print_median(tree);
      tree.remove(items[i - window_size + 1]);
    }
  }
}

running_median(items, 7);

2 comments:

Anant said...

Terrific post!
While I read the heap-based solutions before, it's the first time that I'm reading the BST solution.

Thanks so much!

Dhruv Matani said...

@Anant: There is actually another way to do this in a very simple manner if you are using C++'s std::multimap<> container. The observation is that the median will shift only 1 before or 1 after the current median every time you insert (or delete) an element. You can use this to (at every insertion/deletion) check which side of the median the inserted (or deleted) element lies and increment (or decrement) the median iterator. Iterators in std::multimap<> remain valid even if adjacent elements are removed. Incrementing/Decrementing an iterator is O(1) (amortized).