Tuesday, May 12, 2009

Longest Common Increasing Subsequence


I recently came across the problem of computing the Longest Common Increasing Subsequence(LCIS) of 2 given sequences. What does LCIS mean? Given 2 sequences of integers, we need to find a longest sub-sequence which is common to both the sequences, and the numbers of such a sub-sequence are in strictly increasing order.

So for example, given the 2 sequences:


  • 4,3,5,6,7,1,2 and

  • 1,2,3,50,6,4,7



Possible common increasing sub-sequence of the 2 sequences above are:

(4), (3), (7), (6), (1), (2), (1,2), (3,6), (3,7), (6,7), (4,7) and (3,6,7).

The LCIS of the 2 sequences would be (3,6,7). It so happens that the 2 sequences above have a unique LCIS, though that may not always be the case.

Now, how do we go about computing the LCIS of 2 sequences in a reasonable amount of time? If we think a bit, we can come up with the recurrence:

LCIS_length(seq1, seq2, prev_elem) is equal to:
  1. MAX(LCIS_length(seq1 less it's first element, seq2 less it's first element, first element of seq1)+1,
    LCIS_length(seq1 less it's first element, seq2, prev_elem),
    LCIS_length(seq1, seq2 less it's first element, prev_elem)).
    If the first elements of seq1 and seq2 are the same AND the first element of seq1 is > prev_elem.

  2. MAX(LCIS_length(seq1 less it's first element, seq2, prev_elem),
    LCIS_length(seq1, seq2 less it's first element, prev_elem)).
    If the first elements of seq1 and seq2 are different.



Now, if we implement it just the way it is, it is going to be quite expensive to compute the LCIS_length for any sequence having a length of more than say 20 elements. So, we want to use an approach which would work for larger input. On further thinking, we think that a faster(polynomial time) approach is possible.

Let's lay out both the sequences along the first row and column of an nXm matrix, where n and m are the lengths of the 2 sequences given.














4356712
1
2
3
50
6
4
7




Let's define each cell to hold the length of the LCIS ending at the row(or column) that that cell is defining. For example, cell [6,1] will hold the length of the LCIS that ends with the number 4. The cell [5,2] will not hold any meaningful value since there can be no sub-sequence that ends with both 6 and 3. Similarly, the cell [7,5] will hold the length of an LCIS that ends with the number 7.

What we will do is we run 2 for loops, one for the row and one for the column, and each time we find a match between an element in the top row and the left column(i.e. The input elements), we will try and find some sub-sequence computed till now that can be extended by the number that that cell represents.

It should be clear by now that the number that the cell [x,y] represents is the number in the first column in the row 'x' OR the number in the first row in column 'y' ONLY IF both these numbers are the same. For example, the cell [2,3] can NOT represent any number, whereas the cell [3,2] represents the number 3.

A sub-sequence that can be extended by a number in cell [x,y] will always lie in the matrix of which the cell [x-1,y-1] is the right-bottom most cell. Let is refer to such a matrix as the candidate matrix for cell [x,y]. For example, we have highlighted all the cells which can have potential candidate sub-sequences that can be extended by the cell [5,4](which represents the number 6).















4356712
1
2
3
50
6
4
7





Let's start working out a solution row-by-row.

We need to remember that while processing the 1st row, there is NO sub-sequence that can be extended by any sub-sequence that ends with a number in the first row! This would be the situation after we finished processing the first row.
















4356712
10000010
2
3
50
6
4
7



When processing further rows, we check the candidate matrix of each cell being processed, and always extend the LCIS in the candidate matrix that ends with a number less than the representative number of the cell being processed.

When processing the 2nd row, we notice that a sub-sequence ending at 2 can extend a subsequence ending at 1 because that sub-sequence is the LCIS that ends with a number less than 2 and lies in the candidate matrix for the cell [2,7].














4356712
10000010
20000002
3
50
6
4
7




Similarly, after processing all the rows, our matrix looks like this.














4356712
10000010
20000002
30100000
500000000
60002000
41000000
70000300




To find the LCIS, we search for the largest number in the matrix, and that gives you the length of the LCIS and also the number at which such an LCIS ends. To trace the LCIS, we need to find the largest number that lies in the candidate matrix of the cell to which this largest number belongs. In the figure below, the cell and candidate matrix have been highlighted in green and yellow colours respectively.














4356712
10000010
20000002
30100000
500000000
60002000
41000000
70000300




Tracing thus, we land up with 3 cells(marked in green).














4356712
10000010
20000002
30100000
500000000
60002000
41000000
70000300




The LCIS is just the numbers represented by the cells marked in green, when taken in increasing order of their value. i.e. (3,6,7).

The complexity of the approach presented above is O(n4). Can we bring it down?

If you notice, you will see that for each index representing numbers in the top most row, there is a unique integer associated with it. For example, indexes 1,2,3,4 are associated with numbers 4,3,5,6. There is always one integer associated with every such index. If we think a bit, it should be clear that whenever a CIS ends at some such index, and another longer CIS also ends at that index, then it is sufficient to just keep the larger of the 2. This is because the way in which we are processing the input(row-wise) allows us to track just the LCIS ending at any index.

If we adopt such an approach and use an auxiliary array to hold the LCIS ending at every such index, we land up with an O(n3) solution since we now need to scan just a single array for every element instead of an entire matrix.

However, if we think a little bit more, we can see that we don't even need that scan, since we can always maintain(for every row scanned), a single variable holding the length of the LCIS that the number(N) in the first column of that row can extend. Whenever we encounter a cell which has the same representative number(N), we store a number one more than an LCIS that can be extended by this representative number of such a cell. This brings down the complexity further to O(n2).

Re-constructing the LCIS can be trivially accomplished in time O(n) if we use extra space to the order of O(n2) and this is left as an exercise for the reader.