Automatic association of Web directories to word senses

Santamaría, C., Gonzalo, J. and Verdejo, M. F.
UNED group in Natural Language Processing

The aim of this research is to develop and apply algorithms that combine lexical information with web directories, in order to associate WordNet word senses with ODP (Open Directory Project) directories. Such associations can be used as rich domain labels, and also to acquire sense-tagged corpora automatically, cluster topically related senses, and detect sense specializations.

Our current algorithm has been evaluated on the 29 nouns (147 senses) used in the Senseval 2 competition. It obtained 148 word sense/Internet directory associations, which cover 88% of the domain-specific word senses in the test data with 86% accuracy.
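
As an illustration of how figures of this kind can be computed, the following minimal sketch measures coverage and accuracy over a set of manually judged associations. The definitions used here are assumptions, not necessarily those of the paper: coverage is taken as the fraction of domain-specific senses that receive at least one association, and accuracy as the fraction of associations judged correct. The `Association` class, the sense identifiers, and the directory paths are all hypothetical toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    sense_id: str    # hypothetical sense identifier, e.g. "bar#1"
    directory: str   # made-up directory path, for illustration only
    correct: bool    # manual judgement of the association


def coverage(associations, domain_specific_senses):
    """Fraction of domain-specific senses with at least one association (assumed definition)."""
    senses = set(domain_specific_senses)
    if not senses:
        return 0.0
    covered = {a.sense_id for a in associations} & senses
    return len(covered) / len(senses)


def accuracy(associations):
    """Fraction of associations judged correct (assumed definition)."""
    if not associations:
        return 0.0
    return sum(a.correct for a in associations) / len(associations)


# Toy data: four domain-specific senses, five associations.
SENSES = ["bar#1", "bar#2", "circuit#1", "circuit#2"]
ASSOCIATIONS = [
    Association("bar#1", "Example/Food/Bars", True),
    Association("bar#1", "Example/Nightlife/Pubs", True),
    Association("circuit#1", "Example/Electronics/Circuits", True),
    Association("circuit#1", "Example/Engineering/Design", True),
    Association("circuit#2", "Example/Sports/Racing", False),
]

if __name__ == "__main__":
    print(f"coverage: {coverage(ASSOCIATIONS, SENSES):.2f}")
    print(f"accuracy: {accuracy(ASSOCIATIONS):.2f}")
```

Note that a sense may receive more than one directory, so the number of associations can exceed the number of covered senses.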

The richness of Internet directories as sense characterizations is evaluated in a supervised Word Sense Disambiguation task using the Senseval 2 test suite. The results indicate that, when the directory/word sense association is correct, the training samples acquired automatically from the Internet directories are as useful for training as the original Senseval 2 training instances.

The following data is currently available:

Related publications:

Santamaría, C., Gonzalo, J. and Verdejo, M. F. (2003). Automatic Association of Web Directories to Word Senses. Computational Linguistics 29 (3), MIT Press.