Jorge Carrillo-de-Albornoz

Associate Professor and Researcher in NLP

UNED

Bio

Welcome to my personal website. I am an Assistant Professor and Researcher in NLP in the Department of Lenguajes y Sistemas Informáticos at UNED. I am also a member of the research group Natural Language Processing and Information Retrieval. I completed my Ph.D. on the use of linguistic and semantic information for modeling emotions in text for polarity classification.

My research interests are in Natural Language Processing, especially Sentiment Analysis, Negation Detection, eHealth in Social Networks, and Systems Evaluation for multiple AI tasks. At present, I am working on Controversy Detection and Sexism Detection in Social Networks in the MISMIS Project. I am also working on EvALL, an online service for Information Systems Evaluation.

Interests

  • Controversy Detection
  • Sexism Identification
  • Bias Understanding
  • Natural Language Processing
  • Machine Learning
  • Systems Evaluation

Education

  • Ph.D. in Computer Science, 2011

    Universidad Complutense de Madrid (UCM)

  • MSc in Artificial Intelligence, 2008

    Universidad Complutense de Madrid (UCM)

  • BSc in Computer Science, 2006

    Universidad Complutense de Madrid (UCM)

Resources

MeTwo: Machismo and sExism TWitter identificatiOn dataset

The MeTwo dataset is a corpus for the detection of sexist expressions and attitudes on Twitter. MeTwo is the first corpus in Spanish designed to identify sexism in a broad sense, from hostile to much more subtle sexism. The MeTwo dataset is available for research on GitHub.

CEM-Ord: Closeness Evaluation Measure for Ordinal Classification

We propose a new metric for Ordinal Classification, the Closeness Evaluation Measure, that is rooted in Measurement Theory and Information Theory. Our theoretical analysis and experimental results over both synthetic data and data from NLP shared tasks indicate that the proposed metric captures quality aspects from different traditional tasks simultaneously.
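As a rough illustration of how a closeness-based ordinal metric can be computed, the sketch below follows the formulation commonly given for CEM-Ord: the proximity between a predicted class and the gold class is the negative log of the share of gold items lying between them, and the score is normalized by the proximity of each gold class to itself. The function name `cem_ord`, its signature, and the example labels are assumptions made for illustration; this is not the reference implementation.

```python
import math
from collections import Counter


def cem_ord(system, gold, classes):
    """Sketch of a closeness-based score for ordinal classification.

    system, gold: equal-length sequences of labels.
    classes: all labels, ordered from lowest to highest (assumed input).
    """
    if len(system) != len(gold) or not gold:
        raise ValueError("system and gold must be non-empty and equally long")

    rank = {label: i for i, label in enumerate(classes)}
    counts = Counter(gold)  # class distribution taken from the gold standard
    total = len(gold)

    def proximity(predicted, actual):
        # Half of the predicted class plus every class up to and including
        # the gold class, expressed as a fraction of all gold items.
        i, j = rank[predicted], rank[actual]
        step = 1 if j >= i else -1
        mass = counts[predicted] / 2
        k = i
        while k != j:
            k += step
            mass += counts[classes[k]]
        return -math.log2(mass / total)

    achieved = sum(proximity(s, g) for s, g in zip(system, gold))
    ideal = sum(proximity(g, g) for g in gold)
    return achieved / ideal


labels = ["negative", "neutral", "positive"]
gold = ["negative", "neutral", "positive", "positive"]
near_miss = ["negative", "neutral", "positive", "neutral"]
far_miss = ["negative", "neutral", "positive", "negative"]

print(cem_ord(gold, gold, labels))       # perfect output scores 1.0
print(cem_ord(near_miss, gold, labels))  # one error, adjacent class
print(cem_ord(far_miss, gold, labels))   # one error, two classes away
```

In this toy example the adjacent-class error is penalized less than the two-class error, which is the ordinal behavior the description refers to.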

RepLab Summarization Dataset

The RepLab summarization dataset contains company data from the RepLab 2013 dataset, in which Twitter users discuss different topics related to the companies. Each topic consists of a different number of tweets. The collection covers 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.

eDiseases Dataset

The eDiseases dataset contains patient data from the MedHelp health site from three communities: allergies, Crohn's disease and breast cancer. In total, we extracted 146 posts for allergies, 191 posts for Crohn's disease, and 142 posts for breast cancer, which include 983 sentences for allergies, 1,780 sentences for Crohn's disease, and 1,029 sentences for breast cancer, covering a six-year time interval. Three frequent users of health forums annotated each sentence in the dataset for factuality (OPINION, FACT, EXPERIENCE) and polarity (POSITIVE, NEUTRAL, NEGATIVE).

Rank-Biased Utility Metric

We define a constraint-based axiomatic framework to study the suitability of existing metrics in search result diversification scenarios. The analysis informed the definition of Rank-Biased Utility (RBU), an adaptation of the well-known Rank-Biased Precision metric that takes into account redundancy and the user effort associated with the inspection of documents in the ranking.
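For context, the sketch below computes plain Rank-Biased Precision (RBP), the metric that RBU adapts. It does not implement RBU itself: the redundancy and user-effort components described above are omitted. The function name, the `persistence` parameter name, and the example relevance values are assumptions chosen for illustration.

```python
def rank_biased_precision(relevances, persistence=0.8):
    """Plain RBP: (1 - p) * sum over ranks of r_i * p**(i - 1).

    relevances: relevance of each ranked document, in rank order, in [0, 1].
    persistence: probability p that the user moves on to the next document.
    """
    if not 0 <= persistence < 1:
        raise ValueError("persistence must be in [0, 1)")
    discounted = sum(
        rel * persistence ** position  # position 0 is the top-ranked document
        for position, rel in enumerate(relevances)
    )
    return (1 - persistence) * discounted


# Relevant documents at ranks 1 and 3, with an impatient user (p = 0.5).
print(rank_biased_precision([1, 0, 1], persistence=0.5))
```

Documents further down the ranking contribute geometrically less, which models a user who may stop inspecting results at each step.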

EvALL: Open Access Evaluation for Information Access Systems

The EvALL online evaluation service aims to provide a unified evaluation framework for Information Access systems. EvALL allows researchers to: (i) evaluate their results in a way that is compliant with measurement theory and with state-of-the-art evaluation practices in the field; (ii) quantitatively and qualitatively compare their results with the state of the art; (iii) provide their results as reusable data to the scientific community; (iv) automatically generate evaluation figures and a (low-level) interpretation of the results, both as a PDF report and as LaTeX source.

The RepLab 2014 Dataset

RepLab 2014 focuses on Reputation Monitoring on Twitter, targeting two new tasks: the categorization of messages with respect to standard reputation dimensions (Performance, Leadership, Innovation, etc.) and the characterization of Twitter profiles (author profiling) with respect to a certain activity domain, classifying authors as journalists, professionals, etc. and finding the opinion makers in the domain. The dataset contains tweets in two languages: English and Spanish.

ORMA: A Semi-Automatic Tool for Online Reputation Monitoring in Twitter

We present a semi-automatic tool that assists experts in their daily work of monitoring the reputation of entities (companies, organizations or public figures) on Twitter. The tool automatically annotates tweets for relevance (Is the tweet about the entity?) and reputational polarity (Does the tweet convey positive or negative implications for the reputation of the entity?), groups tweets into topics, and displays the topics in decreasing order of relevance from a reputational perspective.

The RepLab 2013 Dataset

The RepLab 2013 task is a (multilingual) evaluation exercise for Online Reputation Management systems. RepLab 2013 focused on monitoring the reputation of entities (companies, organizations, etc.) on Twitter. The monitoring task consists of searching the stream of tweets for potential mentions of the entity, filtering those that do refer to the entity, detecting topics (i.e., clustering tweets by subject) and ranking them based on the degree to which they signal reputation alerts (i.e., issues that may have a substantial impact on the reputation of the entity).

SentiSense Affective Lexicon

The SentiSense Affective Lexicon consists of 5,496 words and 2,190 synsets from WordNet 2.1 labeled with an emotional category. The main part of the lexicon consists of nouns and adjectives, followed by verbs and a small set of adverbs. SentiSense is available in English (WordNet 2.1 and WordNet 3.0) and in Spanish (WordNet 3.0). Polar words are also provided in both languages.

SentiSense Affective Tools

SentiSense comes with a set of tools that allow users to visualize the lexicon and some statistics about the distribution of synsets and emotions in SentiSense, as well as to easily expand the lexicon. These tools are only available for the English version of SentiSense that uses WordNet 2.1.

Hotel Review Corpus

The HotelReview Corpus is a corpus of 1,000 reviews extracted from booking.com. Each review has been manually tagged with a 5-class category (Excellent, Good, Fair, Poor, Very poor) and with a 3-class category (Good, Fair, Poor).

Contact