Multilingual Web Person Name Disambiguation (M-WePNaD)

Introduction

M-WePNaD is a shared task on the disambiguation of person names on the Web that takes the Web's multilingual nature into account.

Nowadays, many of the queries entered into web search engines are composed of person names. For such a query, a user may want to know how many different individuals appear in the search results, as well as to see those results clustered by individual. This task has attracted substantial interest in the scientific community in recent years, as shown by a number of shared tasks that have tackled it (WePS-1, WePS-2, WePS-3). Despite the Web's multilingual nature, however, existing work on person name disambiguation has not considered search results in multiple languages. This shared task therefore centers on person name disambiguation over web search results, with the additional challenge that the results for a query, as well as the pages associated with a single individual, can be written in multiple languages.

The organizing committee will provide the participants with the training corpus to be used for the development of their systems. A test corpus will be provided later to evaluate the performance of the systems they developed.

Final results of this task will be presented and discussed as part of the IberEval 2017 workshop, which will take place in Murcia, Spain, on 19 September 2017, co-located with SEPLN 2017.

 

Description

The person name disambiguation task on the Web consists of distinguishing the different individuals referred to in the search results for a person name query. It can be defined as a clustering task: the input is a ranked list of n search results, and the output must provide both the number of different individuals identified in those results and the set of pages associated with each individual. While the task has so far been limited to monolingual scenarios, here it will be assessed in a multilingual setting.

We have compiled an evaluation corpus called MC4WePS, which has been manually annotated by three experts. This corpus will be used to evaluate the performance of multilingual disambiguation systems. The evaluation can be carried out for different genres as the corpus includes not only web pages but also social media posts. 

The MC4WePS corpus has been split into two parts, one for training and one for testing. Participants will have nearly two months to develop their systems using the training corpus (see Important Dates). Afterwards, the test corpus will be released; participants will run their systems on it and send the results to the task organizers, who will evaluate the performance of the systems and put together a ranked list.

More details about the corpus can be found in the Resources section. It is worth noting that different person names in the corpus have different degrees of ambiguity, that a single web page can be multilingual, and that not all documents in the corpus are HTML pages; other kinds of content, such as social media posts, are also included. To facilitate access to the different document types, all contents have been processed with Apache Tika (https://tika.apache.org/), and the resulting plain texts have also been included in the corpus.

We also provide performance scores for different baseline approaches. Participants may submit at most five different result sets (runs). They will then write a paper (working notes) describing their systems. Further guidelines describing the format of the result files and the paper can be found in the Submission section.

 

Important Dates

13 Mar, 2017: Training set released.
12 May, 2017: Test set released.
16 May, 2017: Deadline for submission of results. Extended deadline: 18 May, 2017.
22 May, 2017: Competition results announced. Extended date: 23 May, 2017.
05 Jun, 2017: Deadline for participants to submit working notes. Extended deadline: 15 Jun, 2017.
15 Jun, 2017: Reviews of working notes sent out to authors. Extended deadline: 22 Jun, 2017.
01 Jul, 2017: Deadline to submit revised working notes.
19 Sep, 2017: Workshop at SEPLN 2017.

 

Resources

The corpus was collected in 2014 by issuing numerous search queries and storing those whose results met the requirements of ambiguity and multilingualism.

Each query consisted of a first name and a last name, without quotes, and was issued to both Google and Yahoo. The criteria used to choose the queries were as follows:

Ambiguity: names can be non-ambiguous, ambiguous or highly ambiguous. A person name is considered highly ambiguous when it has results for more than 10 individuals within the first 100 to 300 results; cases with 2 to 9 individuals are considered ambiguous, while those with a single individual are deemed non-ambiguous.

Language: results can be monolingual, where all pages are written in the same language, or multilingual, where pages are written in more than one language. Likewise, the documents pertaining to a particular individual can be monolingual or multilingual.

More details about the corpus can be found here.

Training Corpus


The M-WePNaD training corpus includes 65 different person names randomly sampled from the MC4WePS benchmark corpus [Montalvo et al., 2016].

The corpus comprises the following person names:

PersonName | #Webs | #C | %S | %NRs | L | %OL
Adam Rosales | 110 | 8 | 9.09% | 10.0% | EN | 0.91%
Albert Claude | 106 | 9 | 10.38% | 24.53% | EN | 13.21%
Álex Rovira | 95 | 20 | 23.16% | 6.32% | EN | 43.16%
Alfred Nowak | 109 | 15 | 3.67% | 66.06% | EN | 30.28%
Almudena Sierra | 100 | 22 | 12.0% | 63.0% | ES | 1.0%
Amber Rodríguez | 106 | 73 | 11.32% | 10.38% | EN | 9.43%
Andrea Alonso | 105 | 49 | 9.52% | 20.95% | ES | 6.67%
Antonio Camacho | 109 | 39 | 24.77% | 46.79% | EN | 29.36%
Brian Fuentes | 100 | 12 | 7.0% | 3.0% | EN | 2.0%
Chris Andersen | 100 | 6 | 5.0% | 2.0% | EN | 26.0%
Cicely Saunders | 110 | 2 | 7.27% | 10.91% | EN | 1.82%
Claudio Reyna | 107 | 5 | 7.48% | 2.8% | EN | 4.67%
David Cutler | 98 | 37 | 15.31% | 19.39% | EN | 0.0%
Elena Ochoa | 110 | 15 | 8.18% | 4.55% | ES | 10.0%
Emily Dickinson | 107 | 1 | 3.74% | 0.93% | EN | 0.0%
Francisco Bernis | 100 | 4 | 4.0% | 29.0% | EN | 50.0%
Franco Modigliani | 109 | 2 | 2.75% | 1.83% | EN | 38.53%
Frederick Sanger | 100 | 2 | 0.0% | 5.0% | EN | 0.0%
Gaspar Zarrías | 110 | 3 | 4.55% | 0.0% | ES | 2.73%
George Bush | 108 | 4 | 2.78% | 13.89% | EN | 25.0%
Gorka Larrumbide | 109 | 3 | 4.59% | 32.11% | ES | 9.17%
Henri Michaux | 98 | 1 | 3.06% | 1.02% | EN | 7.14%
James Martin | 100 | 48 | 5.0% | 14.0% | EN | 2.0%
Javi Nieves | 106 | 3 | 4.72% | 1.89% | ES | 3.77%
Jesse Garcia | 109 | 26 | 6.42% | 16.51% | EN | 31.19%
John Harrison | 109 | 50 | 15.6% | 19.27% | EN | 11.01%
John Orozco | 100 | 9 | 11.0% | 20.0% | EN | 4.0%
John Smith | 101 | 52 | 10.89% | 10.89% | EN | 0.0%
Joseph Murray | 105 | 47 | 7.62% | 20.0% | EN | 0.95%
Julián López | 109 | 28 | 4.59% | 1.83% | ES | 6.42%
Julio Iglesias | 109 | 2 | 2.75% | 0.92% | ES | 14.68%
Katia Guerreiro | 110 | 8 | 10.91% | 0.0% | EN | 26.36%
Ken Olsen | 100 | 41 | 5.0% | 6.0% | EN | 0.0%
Lauren Tamayo | 101 | 8 | 11.88% | 10.89% | EN | 3.96%
Leonor García | 100 | 53 | 9.0% | 12.0% | ES | 3.0%
Manuel Alvar | 109 | 4 | 3.67% | 34.86% | ES | 0.92%
Manuel Campo | 103 | 7 | 3.88% | 2.91% | ES | 0.0%
María Dueñas | 100 | 5 | 6.0% | 0.0% | ES | 14.0%
Mary Lasker | 103 | 3 | 1.94% | 15.53% | EN | 0.0%
Matt Biondi | 106 | 12 | 10.38% | 5.66% | EN | 9.43%
Michael Bloomberg | 110 | 2 | 6.36% | 1.82% | EN | 0.0%
Michael Collins | 108 | 31 | 1.85% | 13.89% | EN | 0.0%
Michael Hammond | 100 | 79 | 20.0% | 11.0% | EN | 1.0%
Michael Portillo | 105 | 2 | 4.76% | 0.95% | EN | 7.62%
Michel Bernard | 100 | 5 | 0.0% | 95.0% | FR | 39.0%
Michelle Bachelet | 107 | 2 | 8.41% | 4.67% | EN | 16.82%
Miguel Cabrera | 108 | 3 | 5.56% | 3.7% | EN | 0.93%
Miriam González | 110 | 43 | 11.82% | 5.45% | ES | 29.09%
Olegario Martínez | 100 | 38 | 12.0% | 10.0% | ES | 15.0%
Oswald Avery | 110 | 2 | 7.27% | 3.64% | EN | 9.09%
Palmira Hernández | 105 | 37 | 8.57% | 60.95% | ES | 20.95%
Paul Erhlich | 99 | 9 | 4.04% | 7.07% | EN | 16.16%
Paul Zamecnik | 102 | 6 | 1.96% | 6.86% | EN | 2.94%
Pedro Duque | 110 | 5 | 4.55% | 12.73% | ES | 4.55%
Pierre Dumont | 99 | 39 | 10.1% | 15.15% | EN | 41.41%
Rafael Matesanz | 110 | 6 | 7.27% | 2.73% | EN | 44.55%
Randy Miller | 99 | 52 | 12.12% | 33.33% | EN | 0.0%
Raúl González | 107 | 32 | 4.67% | 1.87% | ES | 10.28%
Richard Rogers | 100 | 40 | 13.0% | 16.0% | EN | 9.0%
Richard Vaughan | 108 | 5 | 4.63% | 5.56% | ES | 7.41%
Rita Levi | 104 | 2 | 1.92% | 1.92% | ES | 47.12%
Robin López | 102 | 10 | 12.75% | 13.73% | EN | 1.96%
Roger Becker | 103 | 29 | 4.85% | 18.45% | EN | 13.59%
Virginia Díaz | 106 | 40 | 11.32% | 16.04% | ES | 17.92%
William Miller | 107 | 40 | 7.48% | 37.38% | EN | 0.0%
AVG | 104.69 | 19.95 | 7.66% | 14.88% | - | 12.29%

 
Where:

        - #Webs: Number of search results associated with the person name in question.
        - #C: Number of different individuals (clusters) occurring in the search results for a given person name.
        - %S: Percentage of web pages pertaining to social media.
        - %NRs: Percentage of unrelated ("Not Related") web pages.
        - L: Most common language for a given person name, according to the annotations performed by linguists.
        - %OL: Percentage of web pages written in a language other than the most frequent one (L).

The corpus is structured in directories. Each directory corresponds to a specific search query following the pattern “name_lastname” and includes the search results associated with that person name. Each search result is in turn stored in a separate subdirectory, whose name reflects the rank of that particular result in the full list of search results. Each search result directory contains the following files:

-       The web page linked by the search result. Note that not all search results point to HTML web pages; other document formats (PDF, DOC, etc.) also appear.

-       A metadata.xml file with the following information:

  • URL of the search result.
  • ISO 639-1 codes of the languages the web page is written in (a comma-separated list when several languages were found).
  • Download date.
  • Name of the annotator.

An illustration of this file is shown below (the element names and values are indicative only; see the corpus itself for the exact schema):
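
   <metadata>
     <!-- Hypothetical values; the element names are illustrative only. -->
     <url>http://www.example.com/result/page.html</url>
     <languages>en,es</languages>
     <downloadDate>2014-06-10</downloadDate>
     <annotator>annotator_1</annotator>
   </metadata>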

 

-       A file with the plain text of the search result, which was extracted using Apache Tika (https://tika.apache.org/).
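
As a quick orientation, the following Python sketch walks this directory layout and counts, for each person name, how many search result subdirectories contain a metadata.xml file. The corpus root path is an assumption and should be adapted to wherever the corpus has been unpacked:

    import os

    # Hypothetical path to the unpacked training corpus (MWePNaDTraining.zip).
    CORPUS_ROOT = "MWePNaDTraining"

    # Each top-level directory corresponds to one "name_lastname" query;
    # each numbered subdirectory holds one ranked search result.
    for person in sorted(os.listdir(CORPUS_ROOT)):
        person_dir = os.path.join(CORPUS_ROOT, person)
        if not os.path.isdir(person_dir):
            continue
        results = [d for d in sorted(os.listdir(person_dir))
                   if os.path.isdir(os.path.join(person_dir, d))]
        with_metadata = sum(
            os.path.isfile(os.path.join(person_dir, r, "metadata.xml"))
            for r in results)
        print(f"{person}: {len(results)} results, {with_metadata} with metadata.xml")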

Clusters can overlap when a search result belongs to two or more different individuals with the same name. When a search result does not belong to any of the individuals, it is annotated as “Not Related”.

The corpus is available in the Downloads section.

 

Test Corpus

The M-WePNaD test corpus includes 35 different person names randomly sampled from the MC4WePS benchmark corpus [Montalvo et al., 2016].

The corpus is available in the Downloads section.
 

Registration

Registration for the M-WePNaD task is now open and will stay open until May 8, 2017.

To register please send an email to the following address: m-wepnadorganizers@listserv.uned.es

Once registered, you will receive the data access details.

 

Submission

The submission should be a single file formatted as follows. Each line contains the following fields, separated by tab characters:

Person name        Web ID        Cluster ID

Where:

·     The first column contains the person name to disambiguate.

·     The second column contains the web page ID, that is, the name of the subfolder containing the HTML page and the other files associated with that web page.

·     The third column contains the cluster ID, which can be any string without spaces.

 

Overlapping clusters are allowed. For example, if Adam Rosales's web page 001 is included in clusters 0 and 1, the file would contain the following lines:

adam rosales     001         0

adam rosales     001         1
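
For reference, a small Python sketch that writes a run file in this format from an in-memory clustering might look as follows (the clusters dictionary and the output file name TEAM_1 are only examples):

    # Hypothetical system output: person name -> cluster ID -> web page IDs.
    clusters = {
        "adam rosales": {"0": ["001", "002"], "1": ["001", "003"]},
    }

    with open("TEAM_1", "w", encoding="utf-8") as out:
        for person, person_clusters in clusters.items():
            for cluster_id, web_ids in person_clusters.items():
                for web_id in web_ids:
                    # One line per (person name, web page, cluster) assignment;
                    # a page listed in several clusters produces several lines.
                    out.write(f"{person}\t{web_id}\t{cluster_id}\n")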

 

The submission procedure is as follows:

- You should choose a team name that identifies your company or institution.

- The filename of each run should include the team name and a consecutive run number starting at 1, for example TEAM_1, TEAM_2, ...

- The maximum number of runs per team is 5.

- The runs should be sent to soto.montalvo@urjc.es (email subject: M-WePNaD RUN SUBMISSION), together with the names and affiliations of the participants.

 

Evaluation 

We will use a set of evaluation metrics that take overlapping clusters into account: Reliability (R), Sensitivity (S) and their harmonic mean F0.5(R,S) [Amigó et al., 2013]. In this task the final evaluation score will be the average of F0.5(R,S) over all person names.
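
Concretely, assuming the usual F_α definition with α = 0.5, this harmonic mean is computed for each person name as

    F0.5(R, S) = 1 / (0.5/R + 0.5/S) = 2·R·S / (R + S)

so that a system must balance Reliability and Sensitivity to obtain a high score.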

The Java Archive (JAR) containing the application to run the evaluation can be found in the Downloads section.

The evaluator receives three parameters in the following order:

-        Path to the gold standard (pathGoldStandard).

-        Path with the runs to be evaluated (pathRuns). It must be a folder that contains one or more runs. Each run is a file that follows the submission format.

-        Path of the output folder.

The evaluator is invoked as follows:

java -jar Evaluator.jar pathGoldStandard pathRuns output

The output of the evaluation consists of as many files as runs, plus a file called GLOBAL_RESULTS.txt which contains the average results of each run. Each run output file contains the values of R, S and F for each person name, and the mean of these values over all person names. Moreover, the R, S and F values are reported both including and excluding the “Not Related” pages in the evaluation.

 

Downloads

- Training Corpus: MWePNaDTraining.zip

- Gold standard for Training Corpus: GoldStandardTraining.txt

- Java application to run the evaluation: Evaluator.jar

- Test Corpus: MWePNaDTest.zip

 

Baseline Results

We provide the results of the following two baselines applied to the training set:

- ALL IN ONE returns one cluster containing all the search results.
- ONE IN ONE returns each search result as a singleton cluster.

System | BP(Related) | BR(Related) | F0.5(Related) | BP(ALL) | BR(ALL) | F0.5(ALL)
ALL_IN_ONE(Training) | 0.54 | 1.0 | 0.62 | 0.51 | 1.0 | 0.61
ONE_IN_ONE(Training) | 1.0 | 0.25 | 0.34 | 1.0 | 0.2 | 0.29
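
Both baselines can be generated directly from the corpus directory layout described in the Resources section; a minimal Python sketch is shown below (the corpus root path and the way person names are derived from directory names are assumptions):

    import os

    # Hypothetical path to the unpacked training corpus.
    CORPUS_ROOT = "MWePNaDTraining"

    with open("ALL_IN_ONE", "w", encoding="utf-8") as all_in_one, \
         open("ONE_IN_ONE", "w", encoding="utf-8") as one_in_one:
        for person in sorted(os.listdir(CORPUS_ROOT)):
            person_dir = os.path.join(CORPUS_ROOT, person)
            if not os.path.isdir(person_dir):
                continue
            # Directories follow the "name_lastname" pattern described in Resources.
            name = person.replace("_", " ")
            for web_id in sorted(os.listdir(person_dir)):
                if not os.path.isdir(os.path.join(person_dir, web_id)):
                    continue
                # ALL IN ONE: every result for a person name goes into one cluster.
                all_in_one.write(f"{name}\t{web_id}\t0\n")
                # ONE IN ONE: every result forms its own singleton cluster.
                one_in_one.write(f"{name}\t{web_id}\t{web_id}\n")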
                   
                        

Final Results

We provide the ranking of the results after evaluating all submissions received.

The first table shows the results when “Not Related” pages are excluded from the evaluation, and the second table shows the results when they are included.

 

Results considering only related web pages:

System | R | S | F0.5
ATMC_UNED - run 3 | 0.80 | 0.84 | 0.81
ATMC_UNED - run 4 | 0.79 | 0.85 | 0.81
ATMC_UNED - run 2 | 0.82 | 0.79 | 0.80
ATMC_UNED - run 1 | 0.79 | 0.83 | 0.79
LSI_UNED - run 3 | 0.59 | 0.85 | 0.61
LSI_UNED - run 4 | 0.74 | 0.71 | 0.61
LSI_UNED - run 2 | 0.52 | 0.93 | 0.58
LSI_UNED - run 5 | 0.52 | 0.92 | 0.57
PanMorCresp_Team - run 4 | 0.53 | 0.87 | 0.57
LSI_UNED - run 1 | 0.49 | 0.97 | 0.56
Baseline - ALL-IN-ONE | 0.47 | 0.99 | 0.54
Loz_Team - run 3 | 0.57 | 0.71 | 0.52
Loz_Team - run 5 | 0.51 | 0.83 | 0.52
Loz_Team - run 2 | 0.55 | 0.65 | 0.50
Loz_Team - run 4 | 0.50 | 0.81 | 0.50
PanMorCresp_Team - run 3 | 0.53 | 0.82 | 0.47
Loz_Team - run 1 | 0.50 | 0.76 | 0.46
PanMorCresp_Team - run 1 | 0.80 | 0.51 | 0.43
Baseline - ONE-IN-ONE | 1.0 | 0.32 | 0.42
PanMorCresp_Team - run 2 | 0.50 | 0.65 | 0.41

 

Results considering all web pages:

System | R | S | F0.5
ATMC_UNED - run 3 | 0.79 | 0.74 | 0.75
ATMC_UNED - run 4 | 0.78 | 0.75 | 0.75
ATMC_UNED - run 1 | 0.78 | 0.73 | 0.74
ATMC_UNED - run 2 | 0.82 | 0.69 | 0.73
LSI_UNED - run 3 | 0.59 | 0.81 | 0.60
LSI_UNED - run 2 | 0.52 | 0.92 | 0.59
LSI_UNED - run 5 | 0.52 | 0.90 | 0.59
LSI_UNED - run 1 | 0.49 | 0.97 | 0.58
LSI_UNED - run 4 | 0.74 | 0.66 | 0.58
Loz_Team - run 1 | 0.49 | 0.73 | 0.58
PanMorCresp_Team - run 4 | 0.52 | 0.86 | 0.58
Baseline - ALL-IN-ONE | 0.47 | 1.0 | 0.56
Loz_Team - run 5 | 0.50 | 0.80 | 0.54
Loz_Team - run 3 | 0.56 | 0.66 | 0.53
Loz_Team - run 4 | 0.49 | 0.78 | 0.52
Loz_Team - run 2 | 0.54 | 0.61 | 0.50
PanMorCresp_Team - run 3 | 0.53 | 0.81 | 0.50
PanMorCresp_Team - run 2 | 0.49 | 0.62 | 0.43
PanMorCresp_Team - run 1 | 0.79 | 0.46 | 0.40
Baseline - ONE-IN-ONE | 1.0 | 0.25 | 0.36

 

The M-WePNaD Organizing Committee thanks all participants for their interest in this task.

 

References

[Amigó et al., 2013] E. Amigó, J. Gonzalo, F. Verdejo. A General Evaluation Measure for Document Organization Tasks. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), pp. 643-652. (2013). http://doi.acm.org/10.1145/2484028.2484081.
 
[Montalvo et al., 2016] S. Montalvo, R. Martínez, L. Campillos, A. D. Delgado, V. Fresno, F. Verdejo. MC4WePS: a multilingual corpus for web people search disambiguation, Language Resources and Evaluation (2016). http://dx.doi.org/10.1007/s10579-016-9365-4