Saturday, January 01, 2011

Peer-2-Peer systems at their best!!

Recently, the Skype network went down. You can read all about it here.

However, the exciting takeaway from this is (quoting from the link above): "In order to restore Skype functionality, the Skype engineering and operations team introduced hundreds of instances of the Skype software into the P2P network to act as dedicated supernodes, which we nick-named “mega-supernodes,” to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud."

This is exactly my idea of how a self-healing peer-2-peer system should behave: introducing more nodes into the mix should ideally help solve scaling and availability issues, which is exactly what the Skype team did. I would like to congratulate them for that!!

Wednesday, December 29, 2010

Practical SVD for the curious reader

Let's see an example of how SVD can be used to perform seemingly daunting tasks.
Suppose we are implementing a rating system for a contest and we want to find players who may have similar skill. Additionally, we would like to know the difficulty of the tasks in the contest, since it is not known a priori.

We build a matrix with one row per player and one column per task; an entry of 1 means that the player solved that task. Let us suppose that we roughly know the first 3 problems to be easy, the next 3 to be of medium difficulty, and the last 3 to be hard. The matrix looks like this:

# 0  1  2  3  4  5  6  7  8
[ 1, 1, 1, 0, 0, 0, 0, 0, 0 ], #0
[ 1, 1, 1, 1, 0, 0, 1, 0, 0 ], #1
[ 1, 0, 0, 0, 0, 1, 1, 1, 0 ], #2
[ 1, 1, 1, 1, 1, 1, 0, 0, 0 ], #3
[ 1, 0, 1, 1, 0, 0, 1, 1, 1 ], #4
[ 1, 0, 1, 0, 0, 0, 0, 0, 0 ], #5
[ 1, 0, 0, 1, 1, 1, 0, 0, 0 ], #6
[ 1, 0, 0, 1, 1, 1, 1, 1, 1 ], #7

Now, we compute the SVD of this matrix and keep only the top 4 features (singular values), i.e., a rank-4 approximation.
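Here is a minimal NumPy sketch of this step (the variable names are mine; also note that the signs of the singular vectors can differ between SVD implementations):

import numpy as np

# Player-vs-task matrix from above: entry (i, j) is 1 if player i solved task j.
A = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0, 0],  # 0
    [1, 1, 1, 1, 0, 0, 1, 0, 0],  # 1
    [1, 0, 0, 0, 0, 1, 1, 1, 0],  # 2
    [1, 1, 1, 1, 1, 1, 0, 0, 0],  # 3
    [1, 0, 1, 1, 0, 0, 1, 1, 1],  # 4
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # 5
    [1, 0, 0, 1, 1, 1, 0, 0, 0],  # 6
    [1, 0, 0, 1, 1, 1, 1, 1, 1],  # 7
], dtype=float)

k = 4  # number of features (singular values) to keep
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U = U_full[:, :k]   # 8 x 4: one row per player
S = np.diag(s[:k])  # 4 x 4: diagonal matrix of the top singular values
V = Vt[:k, :].T     # 9 x 4: one row per problem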

We get [U=]
[[-0.22828916 -0.50341088  0.06758431 -0.27577721]
 [-0.3796649  -0.38290566  0.21189598  0.16725752]
 [-0.29514798  0.33343393  0.18247298 -0.80693185]
 [-0.4283514  -0.28592541 -0.50767041  0.05970937]
 [-0.42379852  0.09624233  0.5690528   0.38393312]
 [-0.18451958 -0.30010945  0.12296957 -0.24683212]
 [-0.31511921  0.15603403 -0.56486439  0.03629706]
 [-0.46924257  0.53231036 -0.03863212  0.17781851]]

We get [S=]
[[ 4.86583872  0.          0.          0.        ]
 [ 0.          2.40125574  0.          0.        ]
 [ 0.          0.          2.02979097  0.        ]
 [ 0.          0.          0.          1.29857912]]

We get [V=]
[[-0.55984867 -0.14756061  0.02109021 -0.38852126]
 [-0.21297571 -0.48817872 -0.11242051 -0.03758748]
 [-0.33799385 -0.57307894  0.22851232  0.06799022]
 [-0.41435336  0.0482063  -0.16268579  0.63532176]
 [-0.24923004  0.16758689 -0.54742924  0.21086504]
 [-0.3098872   0.30644504 -0.45753181 -0.41053094]
 [-0.32221659  0.24115755  0.45560831 -0.06000612]
 [-0.24418998  0.40061814  0.35121531 -0.18880653]
 [-0.18353282  0.26176     0.26131788  0.43258946]]

Next, we need to compute U*S and V*S.
What is the significance of these matrices?
The matrix U*S will allow us to find users that are similar in skill to each other.
The matrix V*S will allow us to determine problems that have been misclassified or are actually similar in difficulty to each other.

After computing the U*S and V*S matrices, we take the row-wise sum of squared differences (SOSD) and look for the pairs of rows with the smallest such difference. These are the rows that are of interest to us.
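Continuing the NumPy sketch from above, the products and the pairwise row differences can be computed as follows (values rounded to four decimal places, as in the listings below):

from itertools import combinations

US = U.dot(S)  # players in the reduced feature space
VS = V.dot(S)  # problems in the reduced feature space

def sosd(M):
    # Sum of squared differences between every pair of rows of M.
    return [(i, j, round(float(np.sum((M[i] - M[j]) ** 2)), 4))
            for i, j in combinations(range(M.shape[0]), 2)]

print(sosd(US))  # SOSD(U*S)
print(sosd(VS))  # SOSD(V*S)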

We get [SOSD(U*S)=]
[(0, 1, 1.0430999999999999),
 (0, 2, 4.6740000000000004),
 (0, 3, 2.7736000000000001),
 (0, 4, 4.7484000000000002),
 (0, 5, 0.29770000000000002),
 (0, 6, 4.4981999999999998),
 (0, 7, 7.9534000000000002),
 (1, 2, 4.7319000000000004),
 (1, 3, 2.2631000000000001),
 (1, 4, 1.9745999999999999),
 (1, 5, 1.2628999999999999),
 (1, 6, 4.2881999999999998),
 (1, 7, 5.2785000000000002),
 (2, 3, 5.8609),
 (2, 4, 3.7233999999999998),
 (2, 5, 3.1476999999999999),
 (2, 6, 3.6909999999999998),
 (2, 7, 2.7824),
 (3, 4, 5.7964000000000002),
 (3, 5, 3.2058),
 (3, 6, 1.4441999999999999),
 (3, 7, 4.8299000000000003),
 (4, 5, 3.7522000000000002),
 (4, 6, 5.8014999999999999),
 (4, 7, 2.7383999999999999),
 (5, 6, 3.6880000000000002),
 (5, 7, 6.3265000000000002),
 (6, 7, 2.5535000000000001)]

Analysis of SOSD(U*S): Users #0 & #5 have the smallest difference and so seem to be similarly skilled. The following pairs of users also appear to be of similar calibre: (0, 1), (0, 5), (1, 4), (1, 5), (3, 6), (4, 7), (6, 7)
Some of this analysis may seem questionable, especially considering that we designated the last 3 problems to be hard. How are users #1 & #4 similar if this is the case?
Let's look at the V*S matrix.

We get [SOSD(V*S)=]
[(0, 1, 3.7989000000000002),
 (0, 2, 2.7381000000000002),
 (0, 3, 2.629),
 (0, 4, 4.7946),
 (0, 5, 3.6124999999999998),
 (0, 6, 3.1680999999999999),
 (0, 7, 4.6081000000000003),
 (0, 8, 5.6936999999999998),
 (1, 2, 0.9093),
 (1, 3, 3.3931),
 (1, 4, 3.3944000000000001),
 (1, 5, 4.5884),
 (1, 6, 4.6798999999999999),
 (1, 7, 5.5022000000000002),
 (1, 8, 4.2117000000000004),
 (2, 3, 3.5369999999999999),
 (2, 4, 5.8647999999999998),
 (2, 5, 6.8044000000000002),
 (2, 6, 4.0688000000000004),
 (2, 7, 5.8483000000000001),
 (2, 8, 4.8121),
 (3, 4, 1.6414),
 (3, 5, 2.8456000000000001),
 (3, 6, 2.806),
 (3, 7, 3.6351),
 (3, 8, 2.3344),
 (4, 5, 0.88270000000000004),
 (4, 6, 4.4261999999999997),
 (4, 7, 3.9102999999999999),
 (4, 8, 2.931),
 (5, 6, 3.6707999999999998),
 (5, 7, 2.931),
 (5, 8, 3.7172000000000001),
 (6, 7, 0.36359999999999998),
 (6, 8, 1.0225),
 (7, 8, 0.88270000000000004)]

The following pairs of problems seem to be equally difficult/easy: (3, 4), (1, 2), (4, 5), (6, 8), (7, 8), (6, 7)
You can immediately see a clique of (6, 7, 8), all of which were originally designated as hard problems!
Similarly, (3, 4, 5) seem to be related, and so do (1, 2). The reason problem #0 cannot be classified by the system could be that, since it has been solved by everyone, not much can be said about it conclusively.
You may have noticed that users #1 & #4 have both solved problems #0, #2, #3 & #6. As it turns out, the features (as deduced by the system) seem to have given a higher weight to one or more of these problems, which is why users #1 & #4 have been clustered together.
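One quick way to eyeball this, continuing the sketch above, is to print the magnitude of each problem's coordinates in V*S; a problem with a large entry in a column loads heavily on that feature:

# Each row is a problem, each column a retained feature; large absolute
# values show which problems dominate a feature.
print(np.round(np.abs(VS), 3))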

Tuesday, December 28, 2010

SVD 101

I've been trying to find out more about SVD (Singular Value Decomposition) over the last few days, and here is what I have come up with (links marked with a star [*] MUST be read). I'll try to update the list as and when possible.
  1. http://www.funpecrp.com.br/GMR/year2007/vol4-6/xm0017_full_text.html
  2. http://www.cs.um.edu.mt/~csaw/CSAW04/Proceedings/09.pdf
  3. http://www.math.umn.edu/~lerman/math5467/svd.pdf
  4. http://www.seobook.com/lsi/tdm.htm
  5. http://web.mit.edu/snikolov/www/topicweb_verbose.pdf
  6. http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
  7. http://en.wikipedia.org/wiki/Latent_semantic_indexing
  8. * http://web.ics.purdue.edu/~park283/wp-content/uploads/2010/10/Singular_Value_Decomposition_Tutorial.pdf
  9. http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
  10. http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
  11. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html
  12. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-2-computing-singular-values.html
  13. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html
  14. * http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
  15. * http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html (explains the significance of the U & V matrices)
  16. * http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm (Thanks to Shashank Singh for this link)
SVD can be used to:
  1. Determine Document Similarity
  2. Determine User similarity in a competition (or in a rating-based system such as movie recommendations)
  3. Perform Image compression
  4. Perform Principal Component Analysis
  5. and much more...
You can use the following tools to compute the SVD of a matrix:
  1. SciLab
  2. NumPy (SciPy)
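For instance, here is a minimal NumPy session (the 2x2 matrix is just a made-up example):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
U, s, Vt = np.linalg.svd(A)

# The factors reconstruct A (up to floating-point error).
print(np.allclose(A, U.dot(np.diag(s)).dot(Vt)))  # True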