There are two aspects of the problem that I consider particularly interesting: a) defining the problem, since being able to pin down the task and justify your answer is half the puzzle in much NLP work; b) scaling to a corpus of this size (a vocabulary of 1M words), since this scale is tricky but useful in many web problems.
Semioticians typically distinguish between paradigmatic and syntagmatic axes of semantic relatedness. Paradigmatic relatedness means that two words occur alongside similar sets of other words, e.g. they typically have the same word immediately to their left, like "blue" and "azure". Syntagmatic relatedness means that the words typically co-occur in usage, like "blue" and "sky". Check out the image on this page for another illustration of these axes: http://www.aber.ac.uk/media/Documents/S4B/sem03.html
Regardless of whether you choose to do a paradigmatic or syntagmatic analysis, it's interesting to see how you motivate your approach and if you can scale it to 1M different vocabulary words.
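To make the distinction concrete, here is a minimal sketch (my own toy illustration, not any particular system) of how the two axes could be measured from raw co-occurrence counts: syntagmatic relatedness as direct co-occurrence within a window, paradigmatic relatedness as similarity between the words' context distributions. At a 1M-word vocabulary you would of course need sparse matrices and pruning rather than plain dicts, but the definitions stay the same.

```python
from collections import Counter, defaultdict

def context_counts(tokens, window=2):
    """Count, for each word, how often every other word appears within +/- window."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

def syntagmatic(counts, a, b):
    """Syntagmatic relatedness: how often a and b directly co-occur (e.g. 'blue'/'sky')."""
    return counts[a][b]

def paradigmatic(counts, a, b):
    """Paradigmatic relatedness: cosine similarity of the two words' context
    vectors, i.e. whether they occur alongside the same other words
    (e.g. 'blue'/'azure')."""
    ca, cb = counts[a], counts[b]
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sum(v * v for v in ca.values()) ** 0.5
    nb = sum(v * v for v in cb.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

In practice the raw counts would be replaced by PMI or another association weight, but even this sketch shows why the two axes give different neighbor lists for the same word.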
I think the fairer comparison would be to show THE SAME random subset for each entry, i.e. the same X words for all result sets.
Otherwise, a superior system might happen to be shown on words for which no method produces good results, while an inferior one gets words that are better covered.
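The fix is cheap: draw the evaluation subset once with a fixed seed and reuse it for every system. A sketch (function name and parameters are mine):

```python
import random

def eval_subset(vocab, k=50, seed=42):
    """Pick one fixed random subset of words to evaluate every system on.

    Sorting before sampling makes the draw deterministic regardless of
    the iteration order of the input collection.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(vocab), k)
```

Every result set is then judged on exactly the same words, so differences reflect the methods rather than the luck of the draw.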