  • MC4WEPS (Multilingual Corpus for WEb People Search).

    The MC4WEPS (Multilingual Corpus for WEb People Search) corpus provides a realistic scenario for training and evaluating systems that disambiguate web people search results. Its two main features are that it includes multilingual results and that it preserves social networking profiles.

    We would like to keep track of who has downloaded the corpus. Please contact us by email in order to download it.

  • Heterogeneity Based Ranking.

    The heterogeneity property of text evaluation measures states that the probability of a real (i.e., human-assessed) similarity increase is directly related to the heterogeneity of the set of automatic similarity measures that corroborate that increase. This script implements a method for combining similarity measures based on the heterogeneity principle. The method is completely unsupervised (it does not use any human assessments of the quality of the measures being combined) and yields top-performing combined similarity measures in multiple tasks, including Document Clustering, Textual Entailment, Semantic Textual Similarity, and automatic MT and Summarization evaluation.
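    The core idea can be sketched as follows: an item's combined score is the fraction of pairwise comparisons, across all available measures, in which the item comes out ahead. This is a minimal illustrative sketch, not the released script; the function name, data layout, and toy scores are all assumptions.

```python
# Illustrative sketch (not the released implementation): combine several
# similarity measures without supervision by counting, for each item,
# how many (other item, measure) comparisons it wins.

def heterogeneity_rank(scores):
    """scores[m][i] is measure m's similarity score for item i.
    Returns one combined score per item: the fraction of pairwise
    comparisons, over all measures, that the item wins."""
    n_measures = len(scores)
    n_items = len(scores[0])
    combined = []
    for i in range(n_items):
        wins = total = 0
        for j in range(n_items):
            if i == j:
                continue
            for m in range(n_measures):
                total += 1
                if scores[m][i] > scores[m][j]:
                    wins += 1
        combined.append(wins / total if total else 0.0)
    return combined

# Three toy measures scoring four candidate items:
measures = [
    [0.9, 0.2, 0.5, 0.1],
    [0.8, 0.3, 0.6, 0.2],
    [0.7, 0.1, 0.9, 0.3],
]
print(heterogeneity_rank(measures))
```

    The more heterogeneous the set of measures that agree an item ranks higher, the stronger the evidence for the combined ranking; no gold-standard assessments are consulted at any point.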

  • Unanimous Improvement Ratio.

    Many Artificial Intelligence tasks cannot be evaluated with a single quality criterion, and some sort of weighted combination is needed to produce system rankings. A problem with weighted combination measures is that slight changes in the relative weights may produce substantial changes in the system rankings. This software implements the Unanimous Improvement Ratio (UIR), a measure that complements standard metric combination criteria (such as van Rijsbergen's F-measure) and indicates how robust the measured differences are to changes in the relative weights of the individual metrics. UIR is meant to elucidate whether a perceived difference between two systems is an artifact of how individual metrics are weighted.
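    One plausible formalization, sketched below as an assumption for illustration rather than the reference implementation: a test case counts for system A when A is at least as good as B on every metric and strictly better on at least one, and UIR balances the two counts over all test cases.

```python
# Hedged sketch of a Unanimous Improvement Ratio computation; the exact
# normalization used here is an assumption, not the released software.

def unanimous_improvement_ratio(sys_a, sys_b):
    """sys_a, sys_b: per-test-case metric tuples, e.g. [(p, r), ...].
    A case is a unanimous improvement for A when A >= B on every metric
    and A > B on at least one (symmetrically for B)."""
    a_wins = b_wins = 0
    for a, b in zip(sys_a, sys_b):
        if all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b)):
            a_wins += 1
        elif all(y >= x for x, y in zip(a, b)) and any(y > x for x, y in zip(a, b)):
            b_wins += 1
    return (a_wins - b_wins) / len(sys_a)

# Two systems evaluated with (precision, recall) on four test cases:
a = [(0.8, 0.7), (0.6, 0.9), (0.5, 0.5), (0.9, 0.4)]
b = [(0.7, 0.6), (0.7, 0.8), (0.5, 0.5), (0.8, 0.3)]
print(unanimous_improvement_ratio(a, b))  # prints 0.5
```

    Because only unanimous wins are counted, the result does not depend on how the individual metrics are weighted, which is what makes the ratio a robustness check on weighted rankings.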

  • Reliability and Sensitivity (extended BCubed) (New Version!)

    Some key Information Access tasks -- Document Retrieval, Clustering, Filtering, etc. -- can be seen as instances of a generic "document organization" problem that establishes priority and relatedness relationships between documents. We propose two complementary evaluation measures -- Reliability and Sensitivity -- for the generic Document Organization task, derived from a proposed set of formal constraints (properties that any suitable measure must satisfy).

    For each of the tasks subsumed under the document organization problem, Reliability and Sensitivity satisfy more formal constraints than previously existing evaluation metrics. Their most characteristic feature, in addition, is their strictness: in order to reach high Reliability and Sensitivity values, a system must also achieve high values with all standard evaluation measures.
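    As background for the "extended BCubed" connection, here is a sketch of classic BCubed precision and recall for clustering, which Reliability and Sensitivity generalize. The function, data layout, and toy labels are illustrative assumptions, not part of the released software.

```python
# Background sketch only: classic BCubed precision/recall for clustering.
# For each item, precision asks how much of its system cluster shares its
# gold category; recall asks how much of its gold category it shares a
# cluster with. Both are averaged over all items.

def bcubed(system, gold):
    """system, gold: dicts mapping item -> cluster / category label."""
    items = list(system)
    prec = rec = 0.0
    for e in items:
        same_cluster = [x for x in items if system[x] == system[e]]
        same_cat = [x for x in items if gold[x] == gold[e]]
        correct = [x for x in same_cluster if gold[x] == gold[e]]
        prec += len(correct) / len(same_cluster)
        rec += len(correct) / len(same_cat)
    n = len(items)
    return prec / n, rec / n

system = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
gold   = {"d1": "x", "d2": "x", "d3": "x", "d4": "y"}
p, r = bcubed(system, gold)
print(p, r)  # prints 0.75 0.666...
```

    The strictness noted above means that, unlike a single such score, high Reliability and Sensitivity can only be reached by systems that also score well on standard measures of this kind.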

  • A Corpus for Entity Profiling in Microblog Posts.

    Two sets of annotations for evaluating the task of entity profiling in microblog posts. The first dataset was created using a pooling methodology: several methods for automatically extracting entity-relevant aspects from tweets were implemented, and human assessors labeled each candidate as relevant or not. The second dataset contains opinion targets: annotators examined individual tweets related to an entity and manually identified whether each tweet is opinionated; if so, they annotated which part of the tweet is subjective and what the target of the sentiment is.

  • HotelReview Corpus.

    1,000 reviews extracted from booking.com.

  • SentiSense Affective Lexicon.

    5,496 words and 2,190 synsets from WordNet 2.1 labeled with an emotional category.

  • SentiSense Tagger and SentiSense Visualizer.

    SentiSense Tagger and SentiSense Visualizer are included in the SentiSense Tools package.

  • Automatic Association Of Web Directories To Word Senses

    The aim of this research is the development and application of algorithms that combine lexical information with web directories in order to associate WordNet word senses with ODP (Open Directory Project) directories.


  • Information Synthesis Test-Suite

    Test-suite for Information Synthesis studies, made up of 72 manually generated reports (topic-oriented summaries of large sets of relevant documents).

  • iCLEF 2008-2009 User logs

    User logs capturing all the information relevant to user interaction with the search interface during the iCLEF 2008-2009 campaigns.

  • WePS (Web People Search Corpus)

    A corpus testbed for people searching algorithms.

  • Hermes_192 (multilingual clustering benchmark)

    A comparable corpus for multilingual news clustering evaluation.

  • Open online doctorate course on Natural Language Processing and Information Retrieval

    Within the inter-university doctoral program in Cognitive Science (UNED/UAM/UCM).