I’m an Assistant Professor in the Department of Languages and Information Systems at UNED and a researcher in the NLP & IR UNED group.
My expertise spans several areas of Natural Language Processing, with a special interest in practical applications in the biomedical domain and social networks. Currently, my research focuses on extracting and summarizing information from online patient forums, as well as detecting and analyzing sexist expressions and behaviors in social networks.
I am also a member of the Observatory for AI in Spanish, whose aim is to promote research in language technologies and resources, and therefore an important part of my research is devoted to the development of textual corpora in Spanish for training NLP systems.
During 2022 and 2023 I am collaborating with Damiano Spina as a Visiting Research Fellow at the Royal Melbourne Institute of Technology.
BSc in Business Administration, 2016
Universidad Nacional de Educación a Distancia (UNED)
Ph.D. in Computer Science, 2011
Universidad Complutense de Madrid (UCM)
MSc in Artificial Intelligence, 2008
Universidad Complutense de Madrid (UCM)
BSc in Computer Science, 2006
Universidad Carlos III de Madrid (UC3M)
Associate Professor, School of Computer Science
Researcher in Language Technologies, School of Computer Science (UNED)
Visiting Research Fellow, School of Computing Technologies
Sexism comprises any form of oppression or prejudice against women because of their sex. The aim of the EXIST dataset is to cover sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours.
The MeTwo dataset is a corpus for the detection of sexist expressions and attitudes on Twitter. MeTwo is the first corpus in Spanish designed to identify sexism in a broad sense, from hostile to much more subtle sexism.
The RepLab summarization dataset contains company data from the RepLab 2013 dataset. The collection comprises tweets about 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.
The eDiseases dataset contains patient data from MedHelp. We extracted 146 posts for allergies, 191 posts for Crohn's disease, and 142 posts for breast cancer, which include 983 sentences for allergies, 1,780 sentences for Crohn's disease, and 1,029 sentences for breast cancer. Each sentence in the dataset is labeled with Factuality (OPINION, FACT, EXPERIENCE) and Polarity (POSITIVE, NEUTRAL, NEGATIVE).
The SentiSense Affective Lexicon consists of 5,496 words and 2,190 synsets from WordNet 2.1 labeled with an emotional category. The main part of the lexicon consists of nouns and adjectives, followed by verbs and a small set of adverbs. SentiSense is available in English (WordNet 2.1 and WordNet 3.0) and in Spanish (WordNet 3.0). Polar words are also provided in both languages.
SentiSense comes with a set of tools that allow users to visualize the lexicon and some statistics about the distribution of synsets and emotions in SentiSense, as well as to easily expand the lexicon. These tools are only available for the English version of SentiSense that uses WordNet 2.1.
The HotelReview Corpus is a corpus of 1,000 reviews extracted from booking.com, where each review has been manually tagged with a five-class label (Excellent, Good, Fair, Poor, Very poor) and with a three-class label (Good, Fair, Poor).