Web publication is still excluded from the standardized processes for evaluating research activity, even though it is already one of the main channels for both informal and formal scholarly communication.

Not taking these channels into account is creating digital divides even among developed countries, and some regions are at risk of scientific and cultural colonialism because of the strong predominance of English in web content.

The proposed project aims to measure the visibility and impact of the academic Web and the distribution of its contents by institutional, thematic or geographic criteria. Neither the true degree of adoption of the new methods of electronic communication nor the use of Spanish as a primary vehicle for scientific communication is known in detail. These gaps could be addressed with the help of quantitative indicators, an approach that allows not only a global synthesis but also the description of local scenarios such as the Spanish one. The underlying hypotheses of the proposed approach are:

  1. The use of quantitative indicators will make it possible to assess academic results while reducing subjective bias.
  2. The quantitative indicators can be obtained automatically from the study of presence, visibility, impact and popularity of the academic websites.
  3. The publication of rankings of academic departments, together with the criteria used to build them, will encourage improvements in the visibility of academic websites.
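
As an illustration of the second and third hypotheses, the following minimal sketch combines the four indicator families (presence, visibility, impact, popularity) into a single score and ranks websites by it. The field names, the max-normalization step and the weights are assumptions made for the example; the proposal does not specify how the indicators are measured or combined.

```python
from dataclasses import dataclass


@dataclass
class SiteMetrics:
    """Raw counts for one website (all measures here are hypothetical)."""
    domain: str
    presence: float    # e.g. number of pages found for the site
    visibility: float  # e.g. number of external links pointing to the site
    impact: float      # e.g. visibility relative to size
    popularity: float  # e.g. number of visits


INDICATORS = ("presence", "visibility", "impact", "popularity")


def normalize(values):
    """Scale values to [0, 1] by dividing by the largest one."""
    top = max(values)
    return [v / top if top else 0.0 for v in values]


def rank_sites(sites, weights):
    """Return (domain, score) pairs sorted from highest to lowest score."""
    columns = {
        name: normalize([getattr(site, name) for site in sites])
        for name in INDICATORS
    }
    scores = {
        site.domain: sum(weights[name] * columns[name][i] for name in INDICATORS)
        for i, site in enumerate(sites)
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Publishing the weights alongside the ranking, as the third hypothesis suggests, lets each department see which indicator it would need to improve to move up.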

Developing a complete methodology and the tools to automate the process is crucial to achieving this goal while covering as many academic fields and languages as possible.

In order to cope with the increasing size of the web, an automatic, scalable approach is needed to extract the data required to build the indicators. Human Language Technologies for mining the web will help classify web sites and will make it possible to extract richer multilingual textual information, opening the way for more fine-grained indicators.

Main Goals

The main goals of the project are the following:

  1. To advance the state of the art of classification and extraction techniques for effectively automating the process of identifying and obtaining relevant information from websites, in particular the data needed to evaluate the impact of a website in the context of a research community.
  2. To advance the state of the art in web indicators and in the methodological approach to testing their reliability.
  3. To gain insight into the presence and impact of the Spanish Humanities fields on the WWW.

To meet these goals, the project requires background and knowledge in (a) Cybermetrics, (b) Information Processing, (c) Internet and (d) Human Language Technologies (HLT).

The current proposal involves two teams that together cover this range, one from CINDOC (a, b, c) and the other from UNED (b, c, d). Each group brings its own background and experience, which are complementary and both necessary for the main goals of this project. UNED's expertise includes crawling, multilingual information retrieval, and classification and extraction techniques. CINDOC has developed cybermetric methods to analyse and rank the visibility of web sites, mainly through manual expert analysis. The challenges are:

  1. To automate some of the tasks involved in this analysis, in order to cope with the growing number of web sites. Some of the automatic classification and extraction techniques are based on machine learning and require an annotated corpus. Here the contribution of CINDOC is essential for creating the corpus and validating the results, providing feedback to UNED to tune and improve the automatic techniques.
  2. To improve information classification and extraction techniques for websites, so that cybermetric indicators beyond the current ones can be explored, in order to better evaluate the impact of the information offered to a target community.

The union of both groups aims at:

  1. Improving the positioning and visibility of Spanish websites.
  2. Producing a significant theoretical and practical contribution both to cybermetrics and to the application of HLT, in order to strengthen our international presence in the research community.

The project is organized in two subprojects, to be carried out by UNED and CINDOC respectively. Each subproject includes a set of work packages. Each work package is described as a set of tasks, each with a task leader and assigned participants. There are two common work packages: coordination management and dissemination. For the rest of the tasks, the interactions between the two subprojects (prerequisites and feedback among work packages) are identified as milestones and described in each task. The coordination task will be in charge of monitoring the milestones of each task through meetings with the task leaders. The entire project will be managed with the help of a project management tool.

CINDOC (subproject e-Humanities) will provide the background in cybermetrics needed for the quantitative evaluation of the size, visibility, impact and popularity of academic websites. UNED (subproject Catiex), in turn, will provide the background in Human Language Technologies needed to perform the automatic classification and extraction of the required information.

This collaboration will open the opportunity to extend cybermetric studies to more academic fields and to more countries and languages on the web. The collaboration between CINDOC and UNED is an outcome of the first year of the MAVIR excellence network. The goal of the project is thus the result of the collaborative study, evaluation and selection of a use case in which Human Language Technologies can be applied to a very well-defined need.

The two subprojects with their specific goals and their interaction are:

  1. Catiex: Multilingual Web Categorization and Information Extraction, carried out by the UNED Natural Language Processing and Information Retrieval Group. The general goal is to provide the classification and descriptive information of academic websites in order to build a database that serves as the source for the application of cybermetrics. To achieve this general goal, this subproject must:
    1. Crawl websites in controlled domains in order to extract their terminology and characterize the academic fields. The hierarchy of academic fields and the terminological seeds must be provided by e-Humanities.
    2. Automatically extract the terminology that characterizes each academic field, as input for the automatic classification.
    3. Crawl the academic websites under study in order to extract their descriptive information, determine their academic profile and categorize their academic field. The catalog of websites must be provided by e-Humanities.
    4. Automatically extract the group/department/entity names, the research and academic areas, the profile and the people involved in the group described by the website.
    5. Automatically classify the websites under study in order to group those in the same academic field under the same ranking.
  2. e-Humanities: Web mediators in the scholarly communication processes, carried out by the Cybermetrics Research Group of CINDOC-CSIC. The general goal is to determine the web mediators in the humanities and apply to them web indicators that quantify their presence and visibility on the web. The specific goals of the subproject are:
    1. Identification and classification of the Web mediators in order to build a web catalogue of domains (University level).
    2. Automatic compilation of the Academic Web subdomains (department level).
    3. Manual filtering of automatic classification in the field of humanities. Catiex must provide the automatic classification of the websites.
    4. Institutional and geographical assignment.
    5. Development and application of web indicators in order to build a ranking of websites per academic field. The rankings will contain the descriptive information given by Catiex. Catiex will also provide further criteria for filtering the rankings, such as the academic profile.
    6. Study of the digital divide by country and discipline, comparing the web indicators with the bibliometric data.
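
The interaction between the two subprojects (terminology per field, automatic classification of websites, grouping by field before ranking) can be sketched as follows. This is only a keyword-overlap baseline written for illustration, not the machine-learning techniques the proposal refers to; the seed terms in `FIELD_TERMS`, the tokenizer and the function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical terminological seeds per academic field. In the project these
# would come from e-Humanities and from the terminology extraction step.
FIELD_TERMS = {
    "history": {"medieval", "archive", "history", "historiography"},
    "linguistics": {"syntax", "corpus", "phonology", "linguistics"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def classify(text, field_terms=FIELD_TERMS):
    """Return the field whose seed terms occur most often, or None if none occur."""
    counts = Counter(tokenize(text))
    scores = {
        field: sum(counts[term] for term in terms)
        for field, terms in field_terms.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def group_by_field(pages):
    """Map each field to the list of domains classified under it.

    `pages` maps a domain name to the text extracted from that website.
    """
    groups = defaultdict(list)
    for domain, text in pages.items():
        groups[classify(text)].append(domain)
    return dict(groups)
```

Websites that match no seed term fall under the `None` key, which marks them as candidates for the manual filtering step described for e-Humanities.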

In order to ensure the appropriate coordination and collaboration between the two subprojects, a specific coordination activity involving both groups is defined in the work plan.