Project Background
The CINDOC Cybermetrics Research Group has developed new indicators of scientific activity based on publication on the Web. This has been possible thanks to methodological developments in the automatic harvesting of Web data by means of robots.
The most important previous results have been published on the following Web sites:
- Webometrics Ranking of World Universities
- European Indicators, Cyberspace and the Science-Technology-Economy System
- Web Indicators Portal
The UNED Natural Language Processing and Information Retrieval Group has participated in several European and national projects in the field of Human Language Technologies (HLT), involving applications with tasks such as Information Extraction, Terminology Extraction, Lexical Knowledge Acquisition, Multilingual Information Retrieval, Categorization, Clustering, and Automatic Evaluation of Text Summarization. The results are available at http://nlp.uned.es. HLT have been applied successfully in restricted domains and/or well-defined tasks. Problems of tailorability and scalability remain, however, as does a trade-off between cost savings and precision. This project is a promising test case for combining the potential of HLT and cybermetrics to improve the state of the art.
UNED is participating in the Text-Mess project (TIN2006-15265-C06), which involves many tasks related to Human Language Technologies and Information Access. We expect to reuse as many of its results as possible in this project, especially in the fields of Automatic Classification, Information Extraction (IE) and Named Entity Recognition (NER), where two researchers divide their time between both projects, Text-Mess and QEAVis. However, the particularities of the QEAVis application domain will make it difficult to exploit the Text-Mess results extensively. First, the clustering tools do not fit the classification problem, even if they were tuned for academic disciplines (Text-Mess is focused on search engine results). Second, the IE and NER tasks in QEAVis require processing Web pages in which the entities are not part of readable, fairly long plain text but part of HTML structures such as tables, which call for different representation features and algorithms. Third, QEAVis cannot make use of the wrappers developed for semi-structured documents. Finally, the QEAVis IE and NER tasks are not related to the identification of Weak Named Entities or to the clustering and disambiguation of entities, which are the focus of the NER activity in Text-Mess.
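The second point can be illustrated with a minimal sketch of what "entities inside HTML structures" means in practice. It is not part of the QEAVis or Text-Mess tooling: the class name `TableCellExtractor`, the helper `extract_rows` and the sample table (including the person and department names) are hypothetical and invented purely for illustration. The sketch uses only Python's standard `html.parser` module and assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is only kept while a cell is open; inline markup is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and normalise whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of cell texts."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page fragment, for illustration only.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Department</th></tr>
  <tr><td>Jane <b>Doe</b></td><td>Physics</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
header, records = rows[0], rows[1:]
for record in records:
    # Pair each cell with its column header: the structural context
    # that running text would otherwise provide through the sentence.
    print(dict(zip(header, record)))
```

In a table like this, a candidate entity has no surrounding sentence; the only context available is the column header and the neighbouring cells, which is why features designed for running text do not transfer directly.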