HERMES_192 dataset: a comparable corpus for multilingual news clustering evaluation.

Martínez, R., Casillas, A. and Montalvo, S.

This corpus consists of news items from the news agency EFE, written in Spanish and English during the same time period and compiled by the HERMES project. The manually clustered HERMES_192 dataset is made up of 35 clusters: 2 monolingual and 33 multilingual.

The news items were automatically categorized and belong to a variety of IPTC categories, including "politics", "crime law / justice", "disasters / accidents", "sports", "lifestyle / leisure", "social issues", "health", "environmental issues", "science / technology" and "unrest conflicts / war", but without subcategories. According to the automatic categorization, some news items belong to more than one IPTC category. Since we were interested in multilingual document clustering that goes beyond the high-level IPTC categories and produces clusters of finer granularity, we carried out a manual clustering of each subset. Three people read the news items and grouped them according to their content. They judged independently, and only the clusters that were identical across all three were selected.
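The selection rule described above (keep only the clusters on which all judges agree exactly) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used by the authors: the function name `agreed_clusters` and the document identifiers are hypothetical, and it assumes that "identical" means the same set of documents regardless of order.

```python
def agreed_clusters(*annotations):
    """Return the clusters that appear, with identical members, in every annotation.

    Each annotation is an iterable of clusters, and each cluster is an
    iterable of document identifiers. Order inside a cluster is ignored.
    """
    if not annotations:
        return set()
    # Represent each judge's clustering as a set of frozensets so that
    # clusters can be compared by membership alone.
    as_sets = [
        {frozenset(cluster) for cluster in annotation}
        for annotation in annotations
    ]
    # Keep only the clusters present in every judge's clustering.
    return set.intersection(*as_sets)


# Hypothetical document identifiers, invented for illustration only.
annotator_a = [["es_01", "en_07"], ["es_02", "es_03"], ["en_04"]]
annotator_b = [["en_07", "es_01"], ["es_02"], ["es_03", "en_04"]]
annotator_c = [["es_01", "en_07"], ["es_02", "es_03"], ["en_04"]]

kept = agreed_clusters(annotator_a, annotator_b, annotator_c)
print(kept)
```

In this toy example only the cluster containing `es_01` and `en_07` survives, because it is the only grouping that all three hypothetical judges produced.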

The following data is currently available:

Analyzed corpus: PoS tagging and Named Entity detection and classification - Some stats

The linguistic analysis of each document was carried out with the FreeLing tool (specifically: morpho-syntactic analysis, lemmatization, and recognition and classification of Named Entities).

News + Summary - XML format

If you want to reference this dataset in your academic work, you may give the URL of this web page and cite the following paper:

Montalvo, S., Martínez, R., Casillas, A. and Fresno, V. Multilingual news clustering: Feature translation vs. identification of cognate named entities. Pattern Recognition Letters 28 (16), 2305-2311, Elsevier (2007).