QA4MRE: Question Answering for Machine Reading Evaluation

Summary

The main objective of this exercise is to develop a methodology for evaluating Machine Reading systems through Question Answering and Reading Comprehension Tests.
Systems should be able to extract knowledge from large volumes of text and use this knowledge to answer questions. This methodology should make it possible to compare systems' performance and to identify the most effective approaches.

Task Overview

The Machine Reading (MR) task addresses the problem of building a bridge between knowledge encoded as natural text and the formal reasoning systems that need such knowledge. The knowledge contained in naturally occurring texts should be made available in forms that machines can use to perform some kind of reasoning and expand the system's inference capabilities. In contrast to text mining (or text harvesting, sometimes also called macro-reading), where the system reads and combines evidence from hundreds or thousands of texts, MR is the task of obtaining an in-depth understanding of just one text, or a small number of texts. In fact, the task will focus on the reading of single documents, where correct answers require some inference and the consideration of previously acquired background knowledge.

Test Data

As in the previous campaign, the task focuses on the reading of single documents and the identification of the answers to a set of questions about information that is stated or implied in the text. Systems should be able to use knowledge obtained automatically from the given texts to answer a set of questions posed for one document at a time. Questions are posed in multiple-choice form, and for a significant portion of them none of the proposed alternatives is correct. While the principal answer is to be found among the facts contained in the test document provided, systems may use knowledge from additional given texts (the Background Corpus) to assist them in answering the questions. Some questions will also test a system's ability to understand certain propositional aspects of meaning, such as modality and negation.

The 2013 test set will be composed of 4 topics: AIDS, Climate Change, Music and Society, and Alzheimer. Each topic will include 4 reading tests, and each reading test will consist of a single document with at least 15 questions and a set of five choices per question.
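As a rough illustration of this structure (not the official distribution format), one reading test can be modelled as a single document plus its multiple-choice questions. The field names below are illustrative only; the optional correct index reflects that some questions have no correct alternative among the five choices.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Question:
        """One multiple-choice question with five candidate answers."""
        text: str
        choices: List[str]                    # five alternatives per question
        correct_index: Optional[int] = None   # None when no alternative is correct

    @dataclass
    class ReadingTest:
        """A single test document with at least 15 questions."""
        topic: str                            # e.g. "AIDS" or "Climate Change"
        document: str                         # full text of the single document
        questions: List[Question] = field(default_factory=list)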

Collection

Associated with each topic, the organization will provide a reference corpus of un-annotated documents related to that topic. Systems should use these comparable collections, available in the various task languages, to acquire the reading capabilities and the background knowledge needed to fill in the gaps when answering a test on the topic. As last year, ad-hoc collections in all the languages involved in the exercise will be provided. Texts will be drawn from a diverse range of sources, e.g. newspapers, newswire, the web, blogs, and Wikipedia entries.
  • The background collections of additional text about the domain, created in the different languages, will be available to all participants who sign a license agreement. Thus, additional knowledge may be learned and used in one language or several.
  • The 2013 background collections are based on, but not identical to, the 2012 collections.

Languages

Document collections and reading tests will be available in Arabic, Bulgarian, English, Spanish, and Romanian.

Important:
  • The 2013 background collections will be made available to all participants at the beginning of March, subject to signing a license agreement, so that they can be used to acquire domain-specific knowledge in one or several languages prior to taking part in the QA4MRE task.
  • The 2011-2012 background collections are already available for training purposes in the Downloads section.
  • The reading tests will be exactly the same in all languages (parallel translations).

Evaluation

Evaluation will be performed automatically by comparing the answers given by systems to the ones given by humans. No manual assessment will be required.

Each test will receive an evaluation score between 0 and 1 using c@1. This measure, already used in previous CLEF QA Tracks, encourages systems to reduce the number of incorrect answers while maintaining the number of correct ones by leaving some questions unanswered.
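For reference, c@1 gives each unanswered question a fraction of credit equal to the system's overall accuracy, so abstaining pays off only when it avoids likely errors. A minimal sketch of the computation, following the published definition of the measure (n_R correct answers, n_U unanswered questions, n questions in total), is given below; the function name is illustrative.

    def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
        """Compute c@1 = (n_R + n_U * n_R / n) / n."""
        return (n_correct + n_unanswered * n_correct / n_total) / n_total

    # Example: 6 correct, 4 incorrect, 5 unanswered out of 15 questions
    # c@1 = (6 + 5 * 6/15) / 15 = 8/15 ≈ 0.53
    print(c_at_1(6, 5, 15))

A system that answers every question obtains exactly its accuracy; leaving a question unanswered raises the score only when that question would otherwise have been answered incorrectly.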