HERMES_192 dataset: a comparable corpus for multilingual news clustering evaluation.

Martínez, R., Casillas, A. and Montalvo, S.


This corpus is a compilation of news items written in Spanish and English during the same time period, taken from the news agency EFE and compiled by the HERMES project. The manually clustered HERMES_192 dataset is made up of 35 clusters: 2 monolingual and 33 multilingual.

The news items were automatically categorized and belong to a variety of IPTC categories, including "politics", "crime law / justice", "disasters / accidents", "sports", "lifestyle / leisure", "social issues", "health", "environmental issues", "science / technology" and "unrest conflicts / war", but without subcategories. Some items belong to more than one IPTC category according to the automatic categorization. Since we were interested in multilingual document clustering that goes beyond the high-level IPTC categories and produces clusters of finer granularity, we carried out a manual clustering on each subset. Three people read the news items and grouped them according to their content. They judged independently, and only the clusters that were identical across all three annotators were selected.
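The selection rule described above (keep only the clusters on which all three annotators agree exactly) can be illustrated with a minimal Python sketch. This is not the tooling used to build the dataset; the function name agreed_clusters and the document identifiers such as "es_01" and "en_07" are hypothetical and serve only as an example.

```python
from typing import FrozenSet, Iterable, List, Set


def agreed_clusters(
    annotations: List[Iterable[Iterable[str]]],
) -> Set[FrozenSet[str]]:
    """Return the clusters that appear, with identical members,
    in every annotator's clustering."""
    # Represent each cluster as a frozenset so that member order is ignored
    # and whole clusters can be compared and intersected.
    per_annotator = [
        {frozenset(cluster) for cluster in clustering}
        for clustering in annotations
    ]
    if not per_annotator:
        return set()
    return set.intersection(*per_annotator)


# Hypothetical document IDs; the language prefix is an assumption.
annotator_a = [["es_01", "en_07"], ["es_02", "en_03", "en_09"], ["es_05"]]
annotator_b = [["en_07", "es_01"], ["es_02", "en_03"], ["en_09", "es_05"]]
annotator_c = [["es_01", "en_07"], ["es_02", "en_03", "en_09"], ["es_05"]]

selected = agreed_clusters([annotator_a, annotator_b, annotator_c])
```

In this example only the cluster containing "es_01" and "en_07" is kept, because it is the only grouping that all three annotators produced with exactly the same members.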


The following data is currently available:


If you want to reference this dataset in your academic work, you may give the URL of this web page and cite the following paper: