Current Projects


Duration: 2018-2019
Financing institution: IMIENS
Summary: In this project we aim to design algorithms that help identify relevant relationships between different diseases. This information is very useful for making new diagnoses, testing new treatments or drugs, or anticipating the likely course of a disease. Many diseases share one or more aspects, such as symptoms, progression or treatment, but this does not always mean that they are related. We therefore propose a system capable of detecting relationships between diseases that can be considered significant. Significance will be determined by the coincidence of aspects beyond what chance would explain, which will be captured by defining an appropriate statistical model. Relationships between diseases can be established on the basis of different patterns, separately or jointly: co-occurrence, common symptoms, similar treatments, etc. These relationships can be encoded as Association Rules (AR), which can be regarded as a way of representing the medical knowledge underlying the set of electronic health records (HCE in Spanish) stored in the clinical information repository.
This project is part of the IMIENS call for grants for joint research projects between research groups at UNED and the Instituto de Salud Carlos III.


Duration: 2016-2018 
Financing institution: Ministerio de Economía, Industria y Competitividad.

A thematic network of excellence funded by the Ministerio de Economía, Industria y Competitividad (reference TIN2016-81739-REDT) to create active communication forums among Natural Language Processing researchers, where they can reach common ground in the process of standardizing their services.

The NLP&IR group is one of its members.


Análisis Automático del Significado y la Autoridad en Social Media

Duration: 2016-2018 
Financing institution: Ministerio de Economía y Competitividad
2015 call, Modality 1: R&D&I projects, under the Programa Estatal de Investigación, Desarrollo e Innovación Orientada a los Retos de la Sociedad.
Summary: For an average citizen of our digital era, the problem is no longer finding relevant information, but assimilating the massive amount of relevant available information at any moment in time. This is not possible without the help of a new generation of machines able to digest all relevant sources into a readable, personalized synthesis of the stream of relevant information. And such machines need to acquire two crucial, interdependent skills: (i) the ability to automatically discern when different texts convey approximately the same message; and (ii) the ability to discern the credibility of messages.
Our goal is to address the challenge of computing both textual similarity and source authority in online media, focusing on three different and challenging tasks in three relevant application scenarios: identification and synthesis of controversy in the medical domain, generation of reputation profiles for companies and brands, and recommendation of instructional materials in e-learning environments.


Modelado y AutoMatización de exTracción de Relaciones y cAtegorización de informes MEDicos para la recomendación de códigos CIE-10 (TIN2016-77820-C3-2-R)

Duration: 2017-2018 
Financing institution: Ministerio de Economía y Competitividad

Summary: The automatic processing of Electronic Medical Records (EMR) poses challenges for the field of Natural Language Processing (NLP), which to a great extent are related to the adaptation of existing techniques to the domain of medicine. In addition, tasks such as assigning diagnosis and procedure codes to EMRs, carried out manually by experts, raise the need to explore and propose Text Mining and Information Retrieval techniques that allow the relevant codes to be inferred automatically from EMR descriptions.

The Alcorcon Foundation University Hospital (HUFA in Spanish), with which the sub-project will collaborate, is a public university hospital that is part of the Madrid Health Service (SERMAS in Spanish). Like all Madrid Health Service hospitals, it moved from the old CIE-9 discharge report coding scheme to the newer CIE-10 scheme on 1 January 2016. This change has resulted in a 75% decrease in the performance of the coding teams, which are made up of personnel trained for the task. Commercial applications are available that help assign CIE-10 codes by using existing mappings between CIE-9 and CIE-10. Nevertheless, the greater detail and comprehensiveness of CIE-10, combined with the fact that CIE-9 contains combination codes with no corresponding code in CIE-10, makes this mapping impossible in a large number of cases. All hospitals would benefit from a tool able to automatically assign codes to diagnoses and procedures directly from the free text found in medical reports. This health-sector problem will be the main focus and use case of this subproject.

We propose to study, adapt and develop NLP and unsupervised learning techniques, with which this group has a great deal of experience, in order to develop a tool that recommends and assigns CIE-10 codes to discharge reports. An unsupervised approach is imperative given the currently limited availability of manually written records with which to train supervised systems. As records written in Spanish will be readily available, we will focus on this language, although the methods can be applied to other languages, and they are expected to be validated by the work done with other languages in the coordinated project.

The development of this tool encompasses research challenges from several diverse fields: anonymization of reports, lexical normalization within the domain, disambiguation of domain acronyms, representation of the documents, identification of concepts and expressions, extraction of relationships, structured information retrieval and unsupervised learning. The use of unsupervised learning techniques will be studied in order to categorize discharge reports with CIE-10 codes, assessing data modeling by means of distributed representations with deep learning algorithms and Information Retrieval techniques. Likewise, statistical models will be applied in order to identify the underlying relationships among reports labelled with CIE-10 codes. This knowledge base of relationships will make it possible to recommend codes for new reports. The ideal method for combining the different code recommendation algorithms will be analyzed by studying techniques based on automatic and heuristic learning.


Museología e integración social: la difusión del patrimonio artístico y cultural del Museo del Prado a colectivos con especial accesibilidad (invidentes, sordos y reclusos)

Duration: 2016-2018 
Financing institution: 2015 call for R&D Activity Programmes between research groups of the Comunidad de Madrid, organized by the Dirección General de Universidades e Investigación of the Consejería de Educación, Juventud y Deporte of the Comunidad de Madrid (S2015/HUM3494)


Summary: The work is structured around three focal points. The first will detect the specific needs and interests of the different groups. The second will deal with the design and creation of applications, systems and virtual exhibitions adapted for these three groups, based on virtual thematic tours or visits of the Museo del Prado. The third will seek to invigorate an international network that links the social projection of museology with its application to making culture accessible to specific groups, all through the development of new technological tools.


Our concern for the heritage of the Community of Madrid, especially the art collection of the Museo del Prado, leads us to consider the museum as a "cultural artifact" that goes beyond its research and conservation functions. The aim is to bring the museum to viewers, whatever their diversity and condition, letting them share in the contact with artistic reality and inviting them not only to contemplate a work of art directly but also to interact with the institution and its collections. The purpose is to overcome the barrier of the sacredness of works of art and the elitist character that the nineteenth-century perception of traditional collections may imply.


EXTracción de RElaciones entre Conceptos Médicos en fuentes de información heterogéneas

Duration: 2014-2017
Financing institution: MINECO (TIN2013-46616-C2-2-R)
Summary: The overall objective of this project is to address the generation of techniques and tools to allow efficient and intelligent access to the contents of medical documents of multilingual nature such as i) general scientific documents, ii) medical records and iii) general information on the Internet. The project will demonstrate, through a series of use cases, the benefits of the application of language technology in the health sector, using advanced Natural Language Processing techniques such as information retrieval applied to large amounts of resources about medical information on the Internet.


Duration: 2014-2016 
Financing institution: Ministerio de Economía y Competitividad (TIN2013-4709-C3-1P)
Summary: Online Reputation Management has recently become a fundamental aspect of Public Relations for organizations, personalities and entities in general. The very reason why the online dimension of reputation is now essential (the fact that it is the biggest, richest and most up-to-date source of information, opinions and attitudes around any entity) is also the reason why manual analysis of information streams in media and social networks is not viable. Automatic processing of online information crucially depends on advances in many research fields (data structures and algorithms for real-time Natural Language Processing, Opinion Mining, Textual Synthesis, Novelty Detection and Recommendation, multimedia search, social network analysis, etc.) that, up to now, have paid little attention to the online reputation scenario. For instance, opinion mining has focused on product reviews, and its results are not applicable to the (much more complex) problem of evaluating how the content of information streams in social networks may affect the reputation of a company. The project aims to create a new generation of online reputation monitoring systems, able to understand, process, aggregate and synthesize, in real time, facts, opinions and attitudes around an entity, to present such information in multiple dimensions, and to interact with reputation experts so that they can accomplish their task better and faster. Our research will range from fundamental problems such as textual similarity or data structures for real-time Natural Language Processing to prototype validation with reputation experts. Besides algorithms and prototypes, we will also create and distribute test collections to evaluate all relevant technologies in the reputation management scenario.

Past Projects


Readers: Evaluation And DEvelopment of Reading Systems

Duration: 2013 - 2015
Financing institution: EU (CHIST-ERA 2011) + MINECO (PCIN-2013-002-C02-01)
Summary: The READERS project proposes new unsupervised computational models to automatically extract background knowledge after reading large amounts of unstructured text. This knowledge will be in the form of classes, categorized entities and predicates whose arguments are typified by probability distributions over classes. Classes themselves will be automatically organized into taxonomies related to the predicates in which they participate.


Linguistically Motivated Semantic Aggregation Engines

Duration: 2011-2014
Financing institution: European Commission, FP7-ICT
Summary: The LiMoSINe vision is to transition access to online information from a document-centric search paradigm focused on returning disconnected atomic pieces to a truly semantic aggregation paradigm. In this new paradigm, machines will understand a user's intent, discover and organize facts, identify opinions, experiences and trends, all from inherently multilingual online sources and open knowledge repositories. LiMoSINe's aggregation engines will automatically organize search results in semantically meaningful ways.


Evaluating Information Access Systems

Duration: 2011-2016
Financing institution:  European Science Foundation
Summary: ELIAS will define a new measurement paradigm for the evaluation of search engines based on so-called living laboratories. This paradigm involves (i) exploitation of novel market places and forums where large numbers of users are recruited into early stage evaluation experiments to test a particular aspect of an information access system; and (ii) using operational systems as experimental platforms on which to conduct user-based experiments at scale.


The automatic encyclopedia of people and organizations.

Duration: 2010-2012
Financing institution: MICINN (TIN2010-21128-C02)
Summary: The main goal of the project is to develop algorithms, techniques and systems able to mine and aggregate information relative to people and organizations from unstructured and structured web sources, such as social networks, blogs, news, semantic web data, and websites in general.


Mejorando el Acceso, el Análisis y la Visibilidad de la Información y los Contenidos Multilingüe y Multimedia en Red para la Comunidad de Madrid

Duration: 2010-2013
Financing institution: Regional Government of Madrid (S2009/TIC-1542)
Summary: Improving access, analysis and visibility of multilingual and multimedia Web contents.


Duration: 2009-2012
Financing institution: CDTI (CEN-20091026)
Summary: Development of a true Multimedia Semantic Search Engine.


Financing institution: Sub-contracts by Grupo ALMA
Summary: Online reputation management.


Quantitative Evaluation of Academic Websites Visibility

Duration: 2008-2010
Financing institution: CICYT (TIN 2007-67581-C02-01)
Summary: Automated classification of academic websites by topic and language, in order to build rankings of them. The main goal of the project is to improve the accessibility and visibility of academic information on the World Wide Web.


Evaluation Best Practice and Collaboration for Multilingual Information Access

Financing institution: European Commission
Summary: TrebleCLEF supports the development and consolidation of expertise in the multidisciplinary research area of multilingual information access (MLIA) and disseminates this knowhow to the application communities through a set of complementary activities.

Text-Mess (subproyecto INES)


Duration: 2007-2009
Financing institution: CICYT (TIN2006-15265-C06-02)


Multilingual/Multimedia Access To Cultural Heritage

Duration: 2006-2009
Financing institution: European Commission, 6FP (STREP 033104)
Summary: MultiMatch plans to develop a multilingual search engine specifically designed for access, organisation and personalised presentation of cultural heritage information.


Mejorando el acceso y visibilidad de la información multilingüe en red para la Comunidad de Madrid

Duration: 2006-2009
Financing institution: Comunidad de Madrid, IV PRICIT, (S-0505/TID/0267)
Summary: MAVIR is a research network formed by a multidisciplinary team of scientists, engineers, linguists and documentalists to develop an integrating effort in research, training and technology transfer.


Quality Labelling of Medical Web Content using Multilingual Information Extraction.

Duration: 2006-2008
Financing institution: European Commission (EC Programme: Public Health 61383)
Summary: Quality Labelling of Medical Web Content using Multilingual Information Extraction


Speech Web and Images Interactive Search Assistants

Duration: 2006-2007
Financing institution: UNED
Summary: Study of the application of interactive assistants to three lines of work: cross-lingual search over images, over the Web, and over automatic transcriptions from speech recognizers.

R2D2 (subproyecto Syembra)

Recuperación de Respuestas en Documentos Digitalizados

Duration: 2003-2006
Financing institution: CICYT (TIC2003-07158-C04)
Summary: Evaluation of cross-lingual answer retrieval systems.


Recuperación de Información en Bibliotecas Digitales

Duration: 2001-2004
Financing institution: CYTED VII.19
Summary: Ibero-American cooperation in research and development of technologies for information retrieval and digital libraries.


Cross-Language Evaluation Forum

Duration: 2001-2003
Financing institution: European Commission, 5FP (IST-2000-31002)
Summary: Evaluation of Cross-Language Information Retrieval Systems for European Languages


European Schools Treasury Browser

Duration: 2000-2002
Financing institution: European Commission, 5FP (IST Programme)
Summary: Access to meta-information about educational resources and new technologies in Europe.


DELOS: a Network of Excellence on Digital Libraries

Duration: 2000-2002
Financing institution: European Commission, IST Programme
Summary: The main objective of DELOS is to coordinate a joint programme of activities of the major European teams working in digital library related areas.


News Agencies Multilingual Information Categorization

Duration: 1999-2002
Financing institution: European Commission, 5FP (IST-1999-12392)
Summary: NAMIC's main objective is to develop and bring to a marketable stage advanced NLP technologies for multilingual news customization and broadcasting throughout distributed services.


Duration: 1996-1999
Financing institution: European Commission, 4FP (Telematics, LE 4003)
Summary: The project aimed at building a multilingual lexical database with semantic relations between words in 8 European languages (Spanish, English, Italian, Dutch, French, German, Estonian and Czech). Every monolingual wordnet is linked to the others via an InterLingual Index derived from WordNet 1.5.

ELSNET LE Training Showcase

Financing institution: ACO*HUM (Socrates), ELSNET, European Commission
Summary: A project under the auspices of the ELSNET and ACO*HUM excellence networks to develop 6 specialization courses around Natural Language Processing and Speech Recognition and Synthesis. Our task was to develop an open distance learning course on Natural Language Processing and Information Retrieval.


Duration: 2001-2003
Financing institution: CICYT (TIC2000-0335-C03-01)
Summary: Multilingual named-entity recognition, hyperlinking, phrase extraction, summarization and semantic indexing for information access on a digital news archive.


Servidor de Recursos para el Desarrollo de la Ingeniería Lingüística en Español

Duration: 1999-2000
Financing institution: M.I.N.E.R.
Summary: The goal of RILE is to develop a pilot for a server with resources, tools and information related to the development of applications in the field of Language Engineering for Spanish.


Recuperación de Información Textual en un Entorno Multilíngüe

Duration: 1996-1999
Financing institution: CICyT (TIC96-1243-C03-01)
Summary: Development and integration of Language Engineering resources and tools for Spanish, Catalan, Basque and English and demonstration of such tools in a multilingual search engine with NLP capabilities.



Duration: 1993-1995
Financing institution: European Commission (Esprit BRA 7315)
Summary: The goal was to explore the utility of constructing a multilingual lexical knowledge base from machine-readable versions of conventional dictionaries, by exploring the utility of machine-readable textual corpora as a source of lexical information not coded in conventional dictionaries, and by adding dictionary publishing partners to exploit the lexical database and corpus extraction software developed by the projects for conventional lexicography.