Answer Validation Exercise

Summary

Systems must emulate human assessment of QA responses and decide whether an Answer to a Question is correct or not according to a given Text.

Participant systems will receive a set of triplets (Question, Answer, Supporting Text) and they must return a boolean value for each triplet. Results will be evaluated against the QA human assessments. See Exercise Description for more details.

Keywords: Answer Validation, Question Answering, Recognising Textual Entailment (RTE)
Related Work: QA at CLEF, RTE Pascal Challenge

Objective

Promote the development and evaluation of subsystems aimed at validating the correctness of the answers given by a QA system. More information in:
Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo. Testing the Reasoning for Question Answering Validation. Special Issue on Natural Language and Knowledge Representation, Journal of Logic and Computation, 2007 [1].

Exercise Description

Languages available (subtasks)

There is a subtask for each language involved in QA:
  • Basque
  • Bulgarian
  • German
  • English
  • Spanish
  • French
  • Italian
  • Dutch
  • Portuguese
  • Romanian
  • Greek

Test data format

Systems have to consider triplets (Question, Answer, Supporting Text) and decide whether the Answer to the Question is correct and supported by the given Supporting Text. The input format will be similar to:

<q id="1" lang="EN">     

    <q_str>Who was Yasser Arafat?</q_str>

    <a id="1" value="">

        <a_str>Palestine Liberation Organization Chairman
        </a_str>

        <t_str doc="LA030394-0270">President Clinton appealed personally to Palestine Liberation Organization Chairman Yasser Arafat and angry Palestinians on Wednesday to resume peace talks with Israel </t_str>

    </a>

    <a id="2" value="">

                ....

    </a>

                ....

</q>

<q id="2" lang="EN">

 ....

</q>

....


where pairs (Answer, Supporting Text) are grouped by question. Systems must consider the question and validate each of these pairs according to the response format below:
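
For illustration, here is a minimal Python sketch of how such a test file could be traversed. It assumes the <q> elements are wrapped in a single root element (the actual root tag of the released files is not specified here); element and attribute names follow the example above.

import xml.etree.ElementTree as ET

def iter_triplets(path):
    """Yield (q_id, question, a_id, answer, doc, text) tuples from a test file."""
    root = ET.parse(path).getroot()
    for q in root.iter("q"):                      # one <q> element per question
        question = (q.findtext("q_str") or "").strip()
        for a in q.findall("a"):                  # (Answer, Supporting Text) pairs grouped by question
            answer = (a.findtext("a_str") or "").strip()
            t = a.find("t_str")                   # supporting text, with its source document id
            doc = t.get("doc") if t is not None else None
            text = (t.text or "").strip() if t is not None else ""
            yield q.get("id"), question, a.get("id"), answer, doc, text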
 
Response format
The systems' output will have a format similar to:

q_id a_id [SELECTED|VALIDATED|REJECTED] confidence

where all the answers for all questions receive one and only one of the following values:
  • VALIDATED indicates that the answer is correct and supported, although it is not the one selected. There is no restriction on the number of VALIDATED answers (from zero to all).
  • SELECTED indicates that the answer is VALIDATED and is the one chosen as the output of a hypothetical QA system. The SELECTED answers will be evaluated against the QA systems of the Main Track (see evaluation below). No more than one answer per question can be marked as SELECTED. At least one of the VALIDATED answers must be marked as SELECTED.
  • REJECTED indicates that the answer is incorrect or that there is not enough evidence of its correctness. There is no restriction on the number of REJECTED answers (from zero to all).
and the confidence score is a real value in the range [0,1], where:
  • 0 (not sure)
  • 1 (completely sure)
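
For illustration, a hypothetical fragment of a run file (question and answer ids are invented) could look like:

1 1 SELECTED 0.85
1 2 REJECTED 0.70
2 1 VALIDATED 0.60
2 2 SELECTED 0.90
2 3 REJECTED 1

Here, answer 1 is the one chosen for question 1, while answer 1 of question 2 is judged correct and supported but answer 2 is the one chosen as the hypothetical QA output.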

Number of Runs

Participating teams will be allowed to submit results for up to two systems.

Number of samples

The answers are grouped by question. The exercise contains between 150 and 200 questions, but the number of answers to validate per question depends on the participation in the Question Answering main task.

Evaluation

Answers will be judged by humans as CORRECT, INCORRECT or UNKNOWN. UNKNOWN answers will be ignored in the evaluation.  
 
In order to evaluate systems' performance, we use two groups of measures. In the first group we use precision, recall and the F-measure (their harmonic mean) over the answers that must be VALIDATED (SELECTED answers are considered VALIDATED for these measures) [1,2].
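
As a sketch, assuming the standard formulation from [1,2] (where "correctly VALIDATED" means VALIDATED or SELECTED answers that the human assessors judged CORRECT):

\[
\mathrm{precision} = \frac{\#\ \text{correctly VALIDATED answers}}{\#\ \text{VALIDATED answers}}, \qquad
\mathrm{recall} = \frac{\#\ \text{correctly VALIDATED answers}}{\#\ \text{CORRECT answers}}, \qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]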

The second group of measures aims at comparing the performance of QA systems with the potential gain that the participating Answer Validation systems could add to them. Besides the qa_accuracy measure used last year [3], we will also use qa_rej_accuracy, qa_accuracy_max and estimated_qa_performance, which reward the identification of questions whose set of candidate answers contains no correct one.
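
As a reference point, qa_accuracy follows the definition used last year [3]; a sketch:

\[
\mathrm{qa\_accuracy} = \frac{\#\ \text{questions whose SELECTED answer is correct}}{\#\ \text{questions}}
\]

The remaining measures (qa_rej_accuracy, qa_accuracy_max and estimated_qa_performance) additionally credit questions for which all candidate answers are correctly REJECTED because no correct answer was present.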
 
 
The F-measure, qa_accuracy and estimated_qa_performance will be the measures used to rank participating systems.
 
References

[1] Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo. Testing the Reasoning for Question Answering Validation. Journal of Logic and Computation, 2007.
    doi: 10.1093/logcom/exm072
    http://logcom.oxfordjournals.org/cgi/reprint/exm072

[2] Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo. Overview of the Answer Validation Exercise 2006. CLEF 2006, Lecture Notes in Computer Science LNCS 4730. Springer, Berlin.
    doi: 10.1007/978-3-540-74999-8_32
    http://www.springerlink.com/content/a1x43676577483u4/

[3] Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo. Overview of the Answer Validation Exercise 2007. CLEF 2007, Lecture Notes in Computer Science LNCS 5152. Springer, Berlin.
    doi: 10.1007/978-3-540-85760-0_2
    http://www.springerlink.com/content/m87n2r1m37618377/

Resources

Hypothesis patterns


The hypotheses in the AVE 2006 pairs were built automatically from manually generated "hypothesis patterns". The following zip file contains all the question-pattern pairs for all the languages, in the following format:

<question id="0001" type="OBJECT">What is Atlantis</question>
<pattern id="0001">Atlantis is <answer/>.</pattern>
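
A minimal sketch of how a hypothesis could be instantiated from such a pattern (the function name and the example answer below are illustrative, not part of the released data):

def build_hypothesis(pattern: str, answer: str) -> str:
    # Substitute the candidate answer into the <answer/> slot of the pattern.
    return pattern.replace("<answer/>", answer)

# build_hypothesis("Atlantis is <answer/>.", "a legendary island")
# -> "Atlantis is a legendary island."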

Public Spanish Collection

A training corpus named SPARTE has been developed from the Spanish assessments produced during the 2003, 2004 and 2005 editions of QA@CLEF. SPARTE contains 2804 text-hypothesis pairs from 635 different questions (an average of 4.42 pairs per question). All the pairs have a document label and a TRUE/FALSE value indicating whether the document entails the hypothesis or not.

Download SPARTE

The number of pairs with validation value TRUE is 676 (24%) and the number of pairs with value FALSE is 2128 (76%). Notice that the percentage of FALSE pairs is much larger than the percentage of TRUE pairs. We decided to keep this proportion since it is the result of real QA system submissions.
For this reason, we propose an evaluation based on the detection of pairs with entailment (entailment value TRUE). Translated into the Answer Validation problem, the proposed evaluation is focused on the detection of correct answers.

A. Peñas, A. Rodrigo, and F. Verdejo. SPARTE, a Test Suite for Recognising Textual Entailment in Spanish. In A. Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, CICLing 2006, Lecture Notes in Computer Science. Springer, 2006.

Public English Collection


A similar corpus, named ENGARTE, has been developed for English.

Download ENGARTE