Resources

News corpora for document clustering

The three corpora are comparable, not parallel: the Spanish and English texts cover similar topics but are not translations of one another.

HE189 Corpus

The corpus is a subset of the news compiled by the HERMES project (http://nlp.uned.es/hermes/index.html). It contains 189 news stories, 99 in Spanish and the remaining 90 in English. The uncompressed files take up about 6.38 MB.

The reference solution consists of 35 clusters: 33 bilingual and 2 monolingual.

CR231 Corpus

The corpus contains 231 news stories in two languages (English and Spanish), all published in 2007 and collected from different online newspapers. The uncompressed files take up about 12.9 MB.

The reference solution consists of 22 clusters: 17 bilingual and 5 monolingual.

CL214 Corpus

The corpus is a subset of the Reuters Corpus and contains 214 news stories, 108 in Spanish and 106 in English. The uncompressed files take up about 12.1 MB.

The reference solution consists of 11 clusters: 4 bilingual and 7 monolingual.
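
The bilingual/monolingual counts reported for each reference solution can be checked with a short script. The following is only a minimal sketch in Python: it assumes that a cluster counts as bilingual when it holds stories in both languages, and the (document id, language, cluster id) layout and the function name are hypothetical, since the file format of the reference solutions is not described here.

```python
from collections import defaultdict


def count_cluster_types(assignments):
    """Return (bilingual, monolingual) cluster counts.

    `assignments` is an iterable of (doc_id, language, cluster_id) tuples.
    This layout is an assumption made for illustration only.
    """
    languages_by_cluster = defaultdict(set)
    for _doc_id, language, cluster_id in assignments:
        languages_by_cluster[cluster_id].add(language)

    # Assumed definition: a cluster is bilingual if it holds more than one language.
    bilingual = sum(1 for langs in languages_by_cluster.values() if len(langs) > 1)
    monolingual = len(languages_by_cluster) - bilingual
    return bilingual, monolingual


# Toy data, not taken from any of the corpora.
toy = [
    ("d1", "es", "c1"),
    ("d2", "en", "c1"),
    ("d3", "es", "c2"),
    ("d4", "es", "c2"),
    ("d5", "en", "c3"),
]
print(count_cluster_types(toy))  # (1, 2)
```

In the toy data only cluster c1 mixes Spanish and English, so it is the single bilingual cluster.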

If you use any of these corpora in your research, please cite the following paper:

Soto Montalvo, Víctor Fresno, Raquel Martínez. NESM: a Named Entity based Proximity Measure for Multilingual News Clustering. Revista Española para el Procesamiento del Lenguaje Natural, 48, pp. 81-88, 2012.

Data sets for cognate identification

The following data sets consist of word pairs that are either unrelated or cognates/false friends. Each data set is distributed in two files and covers a different language pair:

A data set composed of pairs of Spanish and English words. [Download]

A data set composed of pairs of Italian and English words. [Download]

A data set composed of pairs of Spanish and Portuguese words. [Download]

A data set composed of pairs of Portuguese and Italian words. [Download]

If you use the data sets in your research, please cite the following paper:

Soto Montalvo, Eduardo G. Pardo, Raquel Martínez, and Víctor Fresno. Automatic cognate identification based on a Fuzzy combination of string similarity measures. IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2012.
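
To give an idea of what a string similarity measure over such word pairs looks like, here is a minimal sketch in Python of a single measure, the normalized Levenshtein (edit distance) similarity. It is not the fuzzy combination described in the paper, and the function names and example pairs are hypothetical rather than taken from the data sets.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means the lowercased words are identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical word pairs chosen for illustration.
for pair in [("family", "familia"), ("animal", "animal"), ("dog", "perro")]:
    print(pair, round(similarity(*pair), 3))
```

A high score only signals orthographic closeness. It cannot separate cognates from false friends on its own, which is one reason to combine several measures.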

Data sets for equivalent Named Entities identification

Three data sets are available. For each type of named entity, each data set contains two text files: one with the equivalent entities and one with the non-equivalent entities.

DS-LinNeMw1. Data set with three types of named entities: location, organization and miscellaneous.
DS-LinNeMw2. Data set with four types of named entities: person, location, organization and miscellaneous.
DS-LinNeNoMw. Data set with four types of named entities: person, location, organization and miscellaneous. All entities consist of a single word.

If you use the data sets in your research, please cite the following article:

Soto Montalvo, Raquel Martínez, Víctor Fresno, Agustín Delgado. Exploiting Named Entities for Bilingual News Clustering. Journal of the Association for Information Science and Technology, 66(2): 363-376, 2015.