Welcome to my personal website. I am an Associate Professor at the Department of Languages and Information Systems at UNED and a senior researcher at the NLP & IR Research Group.
My work bridges academic research and industrial application through the technical leadership of production-grade AI platforms. I am the lead architect and technical coordinator of the main applications within the ODESIA initiative (funded by Red.es), including the Spanish NLP Portal (ODESIA Portal) and the AI Model Leaderboard (ODESIA Leaderboard), both designed to support large-scale benchmarking and informed decision-making around LLMs. I am also the principal lead of EvALL, a specialized service for the comprehensive evaluation of Information Systems, with a strong focus on language technologies. In addition to leading the platform, I am the main developer of PyEvALL, the evaluation framework powering EvALL and enabling reproducible, extensible, and fine-grained assessment of AI systems.
My current research and development focus on Human-Centric AI (HCAI) and the design of advanced RAG (Retrieval-Augmented Generation) architectures to improve the reliability and factual accuracy of generative models. Through the ANNOTATE project, I also work on detecting sexism and bias in multimedia environments.
I am open to industrial collaborations, technical consulting, and R&D partnerships, specifically in LLM implementation, RAG optimization, and AI fairness auditing.
Ph.D. in Computer Science, 2011
Universidad Complutense de Madrid (UCM)
MSc in Artificial Intelligence, 2008
Universidad Complutense de Madrid (UCM)
BSc in Computer Science, 2006
Universidad Complutense de Madrid (UCM)
I help organizations deploy reliable, auditable, and high-performing language technologies when off-the-shelf AI is not enough. My work bridges state-of-the-art NLP research and business-critical deployment, especially in high-risk, multilingual, or regulated environments.
I have served as technical lead and Principal Investigator in competitive R&D projects for over six years, coordinating multidisciplinary teams and delivering results in large-scale public and industry-funded initiatives. Previously, I worked as a Systems Analyst at SATEC, gaining hands-on experience with enterprise IT environments and real-world deployment constraints.
Interested in a partnership?
Initial conversations are exploratory and non-binding.
Contact me via email to discuss how advanced NLP can support your organization’s needs.
Sexism comprises any form of oppression or prejudice against women because of their sex. The aim of the EXIST datasets is to cover sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours.
PyEvALL (The Python library to Evaluate ALL) is an evaluation tool for information systems that supports a wide range of metrics across several evaluation contexts, including classification, ranking, and LeWiDi (Learning with Disagreement).
The MeTwo dataset is a corpus for the detection of sexist expressions and attitudes on Twitter. MeTwo is the first corpus in Spanish designed to identify sexism in a broad sense, from hostile to much more subtle sexism.
We propose a new metric for Ordinal Classification, the Closeness Evaluation Measure (CEM), which is rooted in Measurement Theory and Information Theory.
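As a rough illustration, a closeness-based ordinal score of this kind could be sketched as follows. This follows one reading of the published CEM definition, in which the closeness between an assigned class and the gold class is the negative log of the share of gold items lying between them. The function names `proximity` and `cem_ord` and the toy labels are assumptions made for this example and are not the PyEvALL API; the paper and PyEvALL remain the reference.

```python
import math
from collections import Counter


def proximity(assigned, gold, counts, total):
    """Closeness of class `assigned` to class `gold` (integer ordinal indices).

    Assumed form: -log2((n_assigned / 2 + items in classes from the one next
    to `assigned` up to and including `gold`) / total).
    """
    mass = counts[assigned] / 2
    if assigned != gold:
        step = 1 if gold > assigned else -1
        # Walk from the class next to `assigned` through `gold` inclusive.
        mass += sum(counts[c] for c in range(assigned + step, gold + step, step))
    return -math.log2(mass / total)


def cem_ord(system, gold):
    """Ratio of achieved closeness to the closeness of a perfect output."""
    counts = Counter(gold)  # class sizes are taken from the gold standard
    total = len(gold)
    achieved = sum(proximity(s, g, counts, total) for s, g in zip(system, gold))
    ideal = sum(proximity(g, g, counts, total) for g in gold)
    return achieved / ideal


gold = [0, 1, 2, 2]
near_miss = [1, 1, 2, 2]  # first item off by one class
far_miss = [2, 1, 2, 2]   # first item off by two classes
print(cem_ord(near_miss, gold) > cem_ord(far_miss, gold))
```

In this sketch a perfect output scores 1.0, and an error that lands further from the gold class, or in a more populated class, is penalized more heavily.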
The RepLab summarization dataset contains company data from the RepLab 2013 dataset. The collection comprises tweets about 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.
The eDiseases dataset contains patient data from MedHelp. We extracted 146 posts for allergies, 191 posts for Crohn's disease, and 142 posts for breast cancer, which include 983 sentences for allergies, 1,780 sentences for Crohn's disease, and 1,029 sentences for breast cancer. Each sentence in the dataset is labeled with Factuality (OPINION, FACT, EXPERIENCE) and Polarity (POSITIVE, NEUTRAL, NEGATIVE).
We define the Rank-Biased Utility (RBU) metric, an adaptation of the well-known Rank-Biased Precision metric, which takes into account redundancy and the user effort associated with inspecting documents in the ranking for the diversity task.
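For context, the underlying Rank-Biased Precision (RBP) metric can be sketched in a few lines. This is plain RBP, not RBU: it omits the redundancy and effort components that RBU adds. The function name `rbp` and the example persistence value are assumptions made for illustration.

```python
def rbp(relevances, persistence):
    """Rank-Biased Precision: (1 - p) * sum of r_i * p**(i - 1) over ranks i.

    `relevances` holds graded or binary gains in [0, 1], best rank first;
    `persistence` is the probability that the user moves on to the next rank.
    """
    return (1 - persistence) * sum(
        gain * persistence ** rank for rank, gain in enumerate(relevances)
    )


# A relevant document at rank 1 contributes more than one at rank 2.
print(rbp([1, 0, 0], 0.5), rbp([0, 1, 0], 0.5))
```

A higher persistence value models a more patient user, so documents deeper in the ranking contribute more to the score.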
The EvALL online evaluation service aims to provide a unified evaluation framework for Information Access systems. EvALL allows users to: (i) evaluate results in a way compliant with measurement theory; (ii) provide their results as reusable data to the scientific community; (iii) automatically generate evaluation figures and a (low-level) interpretation of the results, both as a PDF report and as LaTeX source.
RepLab 2014 focuses on Reputation Monitoring on Twitter, targeting two new tasks: the categorization of messages with respect to standard reputation dimensions (Performance, Leadership, Innovation, etc.) and the characterization of Twitter profiles (author profiling) with respect to a certain activity domain.
We present a semi-automatic tool that assists experts in their daily work of monitoring the reputation of entities (companies, organizations, or public figures) on Twitter.
The RepLab 2013 task is a (multilingual) evaluation exercise for Online Reputation Management systems. RepLab 2013 focused on monitoring the reputation of entities (companies, organizations, etc.) on Twitter. The monitoring task consists of filtering the tweets that refer to the entity, detecting topics (i.e., clustering tweets by subject), and ranking the topics based on the degree to which they signal reputation alerts.
The SentiSense Affective Lexicon consists of 5,496 words and 2,190 synsets from WordNet 2.1 labeled with an emotional category. The main part of the lexicon consists of nouns and adjectives, followed by verbs and a small set of adverbs. SentiSense is available in English (WordNet 2.1 and WordNet 3.0) and in Spanish (WordNet 3.0). Polar words are also provided in both languages.
SentiSense comes with a set of tools that allow users to visualize the lexicon and some statistics about the distribution of synsets and emotions in SentiSense, as well as to easily expand the lexicon. These tools are only available for the English version of SentiSense that uses WordNet 2.1.
The HotelReview Corpus is a corpus of 1,000 reviews extracted from booking.com, where each review has been manually tagged with a five-class category (Excellent, Good, Fair, Poor, Very poor) and with a three-class category (Good, Fair, Poor).