If I had enough time for the problem, I would first run a morphological and syntactic analyzer to determine (at least with high probability) the morphological structure of each word and tag it with its syntactic category (part of speech). That way I would treat, for example, 'work' and 'worked' as the same word, but 'can:aux' and 'can:noun' as different words.
Next I would eliminate every word that is not a noun, an adjective or a verb (the, a, can:aux, may, etc.). These are 'asemantical' words.
I might also eliminate extremely rare words (those with only 1 or 2 occurrences).
After this preprocessing comes the semantic relatedness analysis.
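A minimal sketch of this preprocessing, assuming NLTK's tokenizer, tagger and WordNet lemmatizer (any equivalent tools would do); the coarse tag set and the min_count threshold are illustrative choices, not part of the method:

```python
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

KEEP = {'N': 'n', 'V': 'v', 'J': 'a'}   # Penn tag prefix -> WordNet POS; everything else is dropped
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Return 'lemma:pos' tokens, keeping only nouns, adjectives and verbs."""
    tokens = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        pos = KEEP.get(tag[0])
        if pos is None:                  # determiners, auxiliaries ('can:aux'), etc.
            continue
        tokens.append(f"{lemmatizer.lemmatize(word.lower(), pos)}:{pos}")
    return tokens

def drop_rare(docs_tokens, min_count=3):
    """Remove words occurring only once or twice in the whole corpus."""
    counts = Counter(t for doc in docs_tokens for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs_tokens]
```

This collapses 'work' and 'worked' into the same token while the part-of-speech suffix keeps 'can:n' distinct from the auxiliary, which is simply filtered out.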
A very primitive solution:
----
For each word A, take the ten words B for which the following measure is maximal:
f(AB) / (f(A) * f(B))
f can be the frequency. f(AB) can be the frequency of A and B occurring together in a document, or within some distance (say 5 words). You can also fine-tune it: weight each co-occurrence by the actual distance.
You don't have to maintain the full matrix. As you go through the documents, keep at most the 100 words with the highest value of this measure for each word, and at the end take the 10 best of those 100.
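A rough sketch of this counting scheme, assuming the documents are already token lists from the preprocessing above; the 5-word window and the 100-candidate cap are tunable assumptions:

```python
from collections import Counter, defaultdict

def related_words(docs_tokens, window=5, keep=100, top=10):
    freq = Counter()                      # f(A): word frequencies
    pair = Counter()                      # f(AB): co-occurrence counts
    for doc in docs_tokens:
        freq.update(doc)
        for i, a in enumerate(doc):
            for b in doc[i + 1:i + 1 + window]:
                if a != b:
                    pair[frozenset((a, b))] += 1

    # No full matrix: keep only the 'keep' best-scoring partners per word.
    best = defaultdict(list)
    for ab, fab in pair.items():
        a, b = tuple(ab)
        score = fab / (freq[a] * freq[b])
        for x, y in ((a, b), (b, a)):
            best[x].append((score, y))
            if len(best[x]) > keep:
                best[x] = sorted(best[x], reverse=True)[:keep]

    # Finally take the 'top' best of the kept candidates for each word.
    return {w: [y for _, y in sorted(cands, reverse=True)[:top]]
            for w, cands in best.items()}
```

The frozenset key makes the pair symmetric, so f(AB) counts each co-occurring pair once regardless of which word comes first.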
----
A more sophisticated solution for the semantic relatedness analysis:
Put all the words on a 2-dimensional map with completely random x, y coordinates. Now, as you go through the documents, imagine that words occurring near each other attract each other on the map, so their distance is multiplied by, say, 0.8. They gravitate. Imagine that the mass of a word is its frequency: when the distance shrinks, the low-frequency word moves a lot towards the high-frequency word, and the high-frequency word moves only a little, just like gravitation. Of course, from time to time you change the scale: blow the whole map up a bit so that everything does not collapse into one point. This is a 'global' algorithm in the sense that it captures indirect relations, like synonyms, which are semantically related even though they rarely occur in the same sentence. At the end, finding the 10 nearest neighbours quickly is easy: use a simple grid as space partitioning. (A sketch follows below.)
Edit: A 2-dimensional map is not necessarily optimal; it might be too tight. I would also experiment with setting the map dimension to 3, 4, 5, etc.
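A hedged sketch of the gravitation idea, with the dimension as a parameter so you can try 2, 3, 4, and so on. The 0.8 shrink factor, the rescaling schedule and the brute-force neighbour search (a grid over the map would be faster) are all simplifications:

```python
import math
import random
from collections import Counter

def gravity_embed(docs_tokens, dim=2, shrink=0.8, window=5, passes=5):
    """Place words at random coordinates, then pull co-occurring words together."""
    freq = Counter(t for doc in docs_tokens for t in doc)
    pos = {w: [random.uniform(-1.0, 1.0) for _ in range(dim)] for w in freq}

    for _ in range(passes):
        for doc in docs_tokens:
            for i, a in enumerate(doc):
                for b in doc[i + 1:i + 1 + window]:
                    if a == b:
                        continue
                    # Distance becomes shrink * old distance; the rarer ('lighter')
                    # word takes most of the move, the frequent one barely shifts.
                    total = 1.0 - shrink
                    wa = freq[b] / (freq[a] + freq[b])   # a's share of the move
                    wb = freq[a] / (freq[a] + freq[b])   # b's share of the move
                    pa, pb = pos[a], pos[b]
                    for k in range(dim):
                        d = pb[k] - pa[k]
                        pa[k] += total * wa * d
                        pb[k] -= total * wb * d
        # Rescale the whole map so it keeps roughly unit radius instead of
        # gravitating into a single point.
        radius = max(math.sqrt(sum(c * c for c in p)) for p in pos.values()) or 1.0
        for p in pos.values():
            for k in range(dim):
                p[k] /= radius

    return pos

def nearest(pos, word, n=10):
    """Brute-force nearest neighbours; a simple grid partition would speed this up."""
    px = pos[word]
    dist = lambda q: math.sqrt(sum((a - b) ** 2 for a, b in zip(px, q)))
    return sorted((w for w in pos if w != word), key=lambda w: dist(pos[w]))[:n]
```

Splitting the move by relative frequency is what makes frequent words behave like heavy masses: they anchor the map while rare words drift towards whatever they co-occur with.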