Wednesday, December 29, 2010

Practical SVD for the curious reader

Let's see an example of how SVD (singular value decomposition) can be used to perform seemingly daunting tasks.
Consider that we are implementing a rating system for a contest and we want to find players that may have similar skill. Additionally, we would like to estimate the difficulty of the tasks in the contest, since it is not known a priori.

We build a matrix with one row per player and one column per task; an entry of 1 means that the player solved the task. Let us suppose that we roughly know the first 3 problems to be easy, the next 3 to be of medium difficulty and the last 3 to be hard. Written as a NumPy array (which matches the output format used below), the matrix looks like this:

import numpy as np

# Tasks:        0  1  2  3  4  5  6  7  8
A = np.array([
  [ 1, 1, 1, 0, 0, 0, 0, 0, 0 ],  # player 0
  [ 1, 1, 1, 1, 0, 0, 1, 0, 0 ],  # player 1
  [ 1, 0, 0, 0, 0, 1, 1, 1, 0 ],  # player 2
  [ 1, 1, 1, 1, 1, 1, 0, 0, 0 ],  # player 3
  [ 1, 0, 1, 1, 0, 0, 1, 1, 1 ],  # player 4
  [ 1, 0, 1, 0, 0, 0, 0, 0, 0 ],  # player 5
  [ 1, 0, 0, 1, 1, 1, 0, 0, 0 ],  # player 6
  [ 1, 0, 0, 1, 1, 1, 1, 1, 1 ],  # player 7
])

Now, we compute the SVD of this matrix and keep only the 4 strongest features (a rank-4 truncation).
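Here is a minimal sketch of this step, assuming NumPy. Note that numpy.linalg.svd returns V already transposed, and that an SVD is only determined up to the sign of each column, so a rerun may differ from the matrices below by sign flips:

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 4                 # keep only the 4 strongest features
U = U[:, :k]          # 8 x 4: one row per player
S = np.diag(s[:k])    # 4 x 4: singular values on the diagonal
V = Vt[:k, :].T       # 9 x 4: one row per task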

We get [U=]
[[-0.22828916 -0.50341088  0.06758431 -0.27577721]
 [-0.3796649  -0.38290566  0.21189598  0.16725752]
 [-0.29514798  0.33343393  0.18247298 -0.80693185]
 [-0.4283514  -0.28592541 -0.50767041  0.05970937]
 [-0.42379852  0.09624233  0.5690528   0.38393312]
 [-0.18451958 -0.30010945  0.12296957 -0.24683212]
 [-0.31511921  0.15603403 -0.56486439  0.03629706]
 [-0.46924257  0.53231036 -0.03863212  0.17781851]]

We get [S=]
[[ 4.86583872  0.          0.          0.        ]
 [ 0.          2.40125574  0.          0.        ]
 [ 0.          0.          2.02979097  0.        ]
 [ 0.          0.          0.          1.29857912]]

We get [V=]
[[-0.55984867 -0.14756061  0.02109021 -0.38852126]
 [-0.21297571 -0.48817872 -0.11242051 -0.03758748]
 [-0.33799385 -0.57307894  0.22851232  0.06799022]
 [-0.41435336  0.0482063  -0.16268579  0.63532176]
 [-0.24923004  0.16758689 -0.54742924  0.21086504]
 [-0.3098872   0.30644504 -0.45753181 -0.41053094]
 [-0.32221659  0.24115755  0.45560831 -0.06000612]
 [-0.24418998  0.40061814  0.35121531 -0.18880653]
 [-0.18353282  0.26176     0.26131788  0.43258946]]

Next, we need to compute U*S and V*S.
What is the significance of these matrices?
Each row of U*S places a player in the reduced feature space, so it will allow us to find users that are similar in skill to each other.
Each row of V*S places a task in the same space, so it will allow us to spot problems that have been misclassified or are actually similar in difficulty to each other.
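In code these are plain matrix products of the truncated factors:

US = np.dot(U, S)   # each row places a player in the 4-dimensional feature space
VS = np.dot(V, S)   # each row places a task in the same feature space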

After computing the U*S and V*S matrices, we take the sum of squared differences (SOSD) between every pair of rows and look for the pairs with the smallest such difference. These are the rows that are of interest to us.
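A possible implementation of the SOSD step, reusing the US and VS arrays from the sketch above:

from itertools import combinations

def sosd(M):
    # Sum of squared differences between every (unordered) pair of rows.
    return [(i, j, float(np.sum((M[i] - M[j]) ** 2)))
            for i, j in combinations(range(len(M)), 2)]

print(sosd(US))   # 28 player pairs
print(sosd(VS))   # 36 task pairs

# The interesting pairs are the ones with the smallest SOSD:
print(sorted(sosd(US), key=lambda t: t[2])[:5])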

We get [SOSD(U*S)=]
[(0, 1, 1.0430999999999999),
 (0, 2, 4.6740000000000004),
 (0, 3, 2.7736000000000001),
 (0, 4, 4.7484000000000002),
 (0, 5, 0.29770000000000002),
 (0, 6, 4.4981999999999998),
 (0, 7, 7.9534000000000002),
 (1, 2, 4.7319000000000004),
 (1, 3, 2.2631000000000001),
 (1, 4, 1.9745999999999999),
 (1, 5, 1.2628999999999999),
 (1, 6, 4.2881999999999998),
 (1, 7, 5.2785000000000002),
 (2, 3, 5.8609),
 (2, 4, 3.7233999999999998),
 (2, 5, 3.1476999999999999),
 (2, 6, 3.6909999999999998),
 (2, 7, 2.7824),
 (3, 4, 5.7964000000000002),
 (3, 5, 3.2058),
 (3, 6, 1.4441999999999999),
 (3, 7, 4.8299000000000003),
 (4, 5, 3.7522000000000002),
 (4, 6, 5.8014999999999999),
 (4, 7, 2.7383999999999999),
 (5, 6, 3.6880000000000002),
 (5, 7, 6.3265000000000002),
 (6, 7, 2.5535000000000001)]

Analysis of SOSD(U*S): users #0 and #5 have the smallest SOSD, so they seem to be similarly skilled. The following pairs of users also appear to be of similar calibre: (0, 1), (0, 5), (1, 4), (1, 5), (3, 6), (4, 7), (6, 7)
Some of this analysis may seem questionable, especially considering that we designated the last 3 problems as hard. How can users #1 and #4 be similar if that is the case?
Let's look at the V*S matrix.

We get [SOSD(V*S)=]
[(0, 1, 3.7989000000000002),
 (0, 2, 2.7381000000000002),
 (0, 3, 2.629),
 (0, 4, 4.7946),
 (0, 5, 3.6124999999999998),
 (0, 6, 3.1680999999999999),
 (0, 7, 4.6081000000000003),
 (0, 8, 5.6936999999999998),
 (1, 2, 0.9093),
 (1, 3, 3.3931),
 (1, 4, 3.3944000000000001),
 (1, 5, 4.5884),
 (1, 6, 4.6798999999999999),
 (1, 7, 5.5022000000000002),
 (1, 8, 4.2117000000000004),
 (2, 3, 3.5369999999999999),
 (2, 4, 5.8647999999999998),
 (2, 5, 6.8044000000000002),
 (2, 6, 4.0688000000000004),
 (2, 7, 5.8483000000000001),
 (2, 8, 4.8121),
 (3, 4, 1.6414),
 (3, 5, 2.8456000000000001),
 (3, 6, 2.806),
 (3, 7, 3.6351),
 (3, 8, 2.3344),
 (4, 5, 0.88270000000000004),
 (4, 6, 4.4261999999999997),
 (4, 7, 3.9102999999999999),
 (4, 8, 2.931),
 (5, 6, 3.6707999999999998),
 (5, 7, 2.931),
 (5, 8, 3.7172000000000001),
 (6, 7, 0.36359999999999998),
 (6, 8, 1.0225),
 (7, 8, 0.88270000000000004)]

The following pairs of problems seem to be equally difficult/easy: (3, 4), (1, 2), (4, 5), (6, 8), (7, 8), (6, 7)
You can immediately see a clique of (6, 7, 8), all of which were originally designated as hard problems!
Similarly, (3, 4, 5) seem to be related, and so do (1, 2). The reason problem #0 cannot be classified by the system could be that, since it has been solved by everyone, it tells us nothing conclusive.
You may have noticed that users #1 and #4 have both solved problems #0, #2, #3 and #6. It seems that the features deduced by the system have given higher weight to one or more of these problems, which is why users #1 and #4 have been clustered together.
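This overlap is easy to verify against the original matrix:

# Element-wise product of two 0/1 rows keeps only the tasks solved by both players.
print(np.nonzero(A[1] * A[4])[0])   # -> [0 2 3 6]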