Guidelines

Track: The iCLEF challenge

The high-level goal of the Interactive Track in CLEF-2001 is to investigate cross-language searching (by users who cannot read the document language) as an interactive task, examining the process as well as the outcome. To this end, an experimental framework has been designed with the following common features:

The framework will allow groups to estimate the effect of their experimental manipulation free of the main (additive) effects of participant and topic, and it will reduce the effect of interactions.

In CLEF 2001, the emphasis will be on each group's exploration of different approaches to supporting the common searcher task and understanding the reasons for the results they get. No formal coordination of hypotheses or comparison of systems across sites is planned for CLEF 2001, but groups are encouraged to seek out and exploit synergies. As a first step, groups are strongly encouraged to make the focus of their planned investigations known to other track participants as soon as possible, preferably via the track listserv at iclef@listserv.uned.es. Contact Julio Gonzalo to join.


Questions

The track will look at two types of questions:

  1. Broad questions, asking about a general subject
  2. Narrow questions, asking about a specific event

The questions will be selected from the CLEF-2000 topics that had good coverage in the English and French collections, and they will be balanced between the two types.


Data provided


Searcher task

The searcher's task will be to begin at the top of a ranked list produced by a cross-language retrieval system and examine a translation of each foreign-language document in the list to determine whether the document is relevant, somewhat relevant, or irrelevant to a topic described by a written topic description. A maximum of 20 minutes is allowed for each ranked list. Searchers may also indicate that they are unsure of their assessment for particular documents, and they may choose to leave some documents unassessed.
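The guidelines do not prescribe a logging format at this point (the submission format is covered under "Data to be collected and submitted to UNED"), but the following minimal Python sketch illustrates one way a site might record a searcher's judgments for a single ranked list. The numeric coding (2 = relevant, 1 = somewhat relevant, 0 = irrelevant) is an assumption chosen to match the evaluation rule below, under which only judgments of 2 count as relevant; the class and field names are hypothetical.

from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class DocumentJudgment:
    docid: str                       # document identifier from the ranked list
    rank: int                        # position in the ranked list (1-based)
    judgment: Optional[int] = None   # 2 = relevant, 1 = somewhat relevant, 0 = irrelevant, None = unassessed
    unsure: bool = False             # searcher flagged this assessment as unsure

@dataclass
class SearchSession:
    searcher_id: str
    system_id: str
    topic_id: str
    judgments: List[DocumentJudgment] = field(default_factory=list)
    elapsed_seconds: float = 0.0     # should stay within the 20-minute limit

    def within_time_limit(self) -> bool:
        return self.elapsed_seconds <= 20 * 60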


Instructions to be given to the searchers

The goal of this experiment is to determine how well an information retrieval system can provide you with information about foreign language documents that would allow you to reliably decide whether each document is relevant to a topic (which we define as "a written statement of a searcher's information need").

You will be asked to judge documents with respect to two topics with one system and two topics with another. For each topic, you will be shown information about 50 documents. The documents are arranged so that the documents that an automatic search system has determined are most likely to be relevant appear in the most prominent positions (for example, near the top of a ranked list). You may select any individual document for closer examination, or you may judge the relevance of a document based on the summary information that is initially displayed. More credit will be awarded for accurately assessing relevant documents than for the number of documents that are assessed, because in a real application you might need to pay for a high-quality translation of each selected document. You may mark each document as relevant, not relevant, or unsure, or you may leave it unassessed. You will have twenty minutes for each search, with one brief break in the middle of the session.

You will also be asked to complete several additional questionnaires, described in the following section.


Searcher questionnaires

The questionnaires can be downloaded here.

In the questionnaires, <ENGLISH/SPANISH> must be replaced with the native language of the searchers.


Data to be collected and submitted to UNED

Several sorts of result data will be collected for evaluation/analysis (for all questions unless otherwise specified):


Instructions about where to submit all data will be mailed to the distribution list.


Evaluation of data submitted to UNED

The CLEF-2000 relevance assessments for the document language used by the participant will be used as ground truth. The primary measure of a searcher's effectiveness will be van Rijsbergen's F_ALPHA measure: F_ALPHA = 1/[ALPHA/P + (1-ALPHA)/R], where P is precision and R is recall. Values of ALPHA greater than 0.5 emphasize precision (ALPHA weights the 1/P term), and values below 0.5 emphasize recall. For this evaluation, ALPHA=0.8 will be the default value, modeling the case in which missing some relevant documents would be less objectionable than paying to obtain fluent translations of many documents that later turn out not to be relevant. RELEVANCEJUDGMENTs of 2 will be treated as relevant and all other RELEVANCEJUDGMENTs will be treated as not relevant for purposes of computing the F_ALPHA measure. This is an exploratory track in which one of our most important goals is to develop good evaluation measures for this task. Participating teams are therefore encouraged to compute and report any other measures that they believe offer a useful degree of insight.
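As a concrete illustration of the measure, the short Python sketch below computes F_ALPHA for one search from a searcher's numeric judgments and the CLEF-2000 ground truth. It assumes that only a judgment of 2 counts as relevant (as stated above) and computes recall against the relevant documents among the documents shown to the searcher; that denominator is an assumption, since the official computation is performed by UNED.

def f_alpha(precision: float, recall: float, alpha: float = 0.8) -> float:
    """van Rijsbergen's F_ALPHA = 1 / (ALPHA/P + (1-ALPHA)/R)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def score_search(judged: dict, ground_truth: set, alpha: float = 0.8) -> float:
    # judged maps docid -> numeric judgment for the documents shown to the searcher;
    # ground_truth is the set of docids judged relevant in the CLEF-2000 assessments.
    selected = {d for d, j in judged.items() if j == 2}
    relevant_shown = {d for d in judged if d in ground_truth}
    if not selected or not relevant_shown:
        return 0.0
    hits = len(selected & ground_truth)
    return f_alpha(hits / len(selected), hits / len(relevant_shown), alpha)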


Experiment schedule

The design will be a within-subject design like that used for the TREC interactive track, but with a different number of topics and a different task.

Each user will be presented with all of the topics. The presentation order for topics will be varied systematically, with 2 variations to ensure that each topic is searched in a different position, but that the same presentation order is used for each system. The topic presentation order will be

The experiment will take about three hours. Each topic will take about 25 minutes: 2 minutes beforehand to examine the topic description, 20 minutes for the search, and 3 minutes afterwards to complete the post-search survey. Searchers should not be asked to work for more than an hour without a break. An example schedule for an experimental session would be as follows:

Introductory stuff 10 minutes
Initial survey 5 minutes
Tutorials (2 systems) 30 minutes total
Break 10 minutes
Searching (system A, 2 topics) 50 minutes
Post-system survey 5 minutes
Break
Searching (system B, 2 topics) 50 minutes
Post-system survey 5 minutes
Final survey 10 minutes
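The exact presentation orders for iCLEF 2001 are distributed with the topics and are not reproduced in this section, but the following Python sketch shows the general shape of the counterbalancing: two system orders crossed with two topic orders, so that each topic is searched in more than one position and each system is seen first by half of the searchers. All identifiers (T1-T4, A, B) are placeholders, not the official topic or system labels.

# Illustrative counterbalancing only; the official design matrix is provided
# separately by the track organizers.
TOPIC_ORDERS = [["T1", "T2", "T3", "T4"],
                ["T3", "T4", "T1", "T2"]]
SYSTEM_ORDERS = [["A", "B"], ["B", "A"]]

def session_plan(searcher_index: int):
    """Rotate each searcher through one of the four system-order x topic-order
    combinations: the first two topics are searched with the first system,
    the last two with the second."""
    systems = SYSTEM_ORDERS[searcher_index % 2]
    topics = TOPIC_ORDERS[(searcher_index // 2) % 2]
    return [(systems[0], topics[:2]), (systems[1], topics[2:])]

for s in range(4):
    print("searcher", s + 1, session_plan(s))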

 


Analysis

The nature of the detailed analysis is up to each site, but sites are strongly encouraged to take advantage of the experimental design and undertake exploratory data analysis to examine the patterns of correlation, interaction, etc. involving the major factors. Some example plots for the TREC-6 interactive data (recall or precision by searcher or topic) are available on the Interactive Track web site at http://www.itl.nist.gov/iad/894.02/projects/t10i/ under "Interactive Track History." The computation of analysis of variance (ANOVA), where appropriate, can provide useful insights into the separate contributions of searcher, topic and system as a first step in understanding why the results of one search are different from those of another.
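For sites that want to run the suggested ANOVA, the sketch below shows one way to do it in Python with pandas and statsmodels. The file name and column names (searcher, system, topic, f_score) are assumptions about how a site might tabulate its own per-search results; they are not part of the track's data formats.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per search: which searcher, which system, which topic, and the F score.
df = pd.read_csv("results.csv")

# Fit a linear model with searcher, topic, and system as categorical factors,
# then compute a type-II ANOVA table to separate their contributions.
model = ols("f_score ~ C(system) + C(searcher) + C(topic)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))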


Schedule

ASAP Join the iCLEF mailing list
6 Jun Topics and documents available
Systran translations available
10 July Submit relevance judgments to UNED
25 July Results available from UNED
6 August Submit notebook papers to CNR
13 August Submit additional results to UNED
3-4 September CLEF Workshop in Darmstadt, Germany

 

 
