WePS-3

WePS 3: searching information about entities in the Web

WePS 3 will be a competitive evaluation campaign including two tasks concerning the Web entity search problem:

Task 1 is related to Web People Search and focuses on person name ambiguity and person attribute extraction on Web pages.
Task 2 is related to Online Reputation Management (ORM) for organizations and focuses on the problem of ambiguity for organization names and the relevance of Web data for reputation management purposes.

The results of the evaluation campaign will be discussed in a one-day workshop as a CLEF 2010 Lab in Padova (Italy), 22 or 23 September 2010.

See the WePS-3 Call for Participation

Background

The first two editions of the WePS campaign focused on the Web People Search problem: WePS 1 was run as a SemEval 1 task in 2007, where 16 teams submitted results (making it one of the largest tasks in SemEval), and WePS 2 was run as a workshop of the WWW 2009 Conference, with the participation of 19 research teams.

The Web People Search task was defined in WePS as the problem of organizing web search results for a given person name. Web search engines return a ranked list of URLs which typically refer to various people sharing the same name. Ideally, the user would instead see the documents grouped into clusters, each containing the documents that refer to the same individual, possibly with a list of person attributes that helps identify the person the user is actually looking for.

From a practical point of view, the task is highly relevant: between 11% and 17% of web queries include a person name, and 4% of web queries are just a person name. Person names are also highly ambiguous: according to the US Census Bureau, only 90,000 different names are shared by more than 100,000,000 people. Indirect evidence of the relevance of the problem is the fact that, since 2005, a number of web startups have been created precisely to address it (Spock.com and Zoominfo.com being the best known).

From a research point of view, the task is challenging (the number of clusters is not known a priori; the degree of ambiguity does not seem to follow a normal distribution; and web pages are noisy sources from which attributes and other indexes are difficult to extract) and has connections with Natural Language Processing and Information Retrieval tasks (Text Clustering, Information Extraction, Word Sense Discrimination) in the context of the WWW as data source.

Goals

Our current proposal represents a third step in a growth path for WePS which is illustrated in the following figure.

WePS 3 Tasks

WePS 1 and WePS 2 were focused on the people search task: in the first campaign we addressed only the name coreference problem, defining the task as clustering of web search results for a given person name. In the second campaign we refined the evaluation metrics and added an attribute extraction task for web documents returned by the search engine for a given person name.

For this third campaign we aim at merging both problems into one single task, where the system must return both the documents and the attributes for each of the different people sharing a given name. This is not a trivial step from the point of view of evaluation: a system may correctly extract attribute profiles from different URLs but then incorrectly merge profiles.
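The evaluation difficulty mentioned above can be illustrated with a small Python sketch. It is not the official WePS-3 evaluation procedure: the function name, the URLs, and the attribute values are all invented for illustration, and the per-document attributes are assumed to have been extracted correctly.

```python
from collections import defaultdict


def merge_profiles(doc_attributes, doc_cluster):
    """Union the attributes of all documents assigned to the same cluster.

    doc_attributes: URL -> set of (attribute, value) pairs (hypothetical format).
    doc_cluster:    URL -> cluster id, i.e. which individual the page is about.
    Returns the set of merged profiles, one frozenset per individual.
    """
    profiles = defaultdict(set)
    for url, attrs in doc_attributes.items():
        profiles[doc_cluster[url]] |= set(attrs)
    return {frozenset(profile) for profile in profiles.values()}


# Invented toy data: attributes per document, assumed to be extracted correctly.
DOC_ATTRIBUTES = {
    "url1": {("occupation", "professor"), ("affiliation", "University X")},
    "url2": {("occupation", "professor")},
    "url3": {("occupation", "painter")},
}

# Reference clustering: url1 and url2 are about person A, url3 about person B.
GOLD = {"url1": "A", "url2": "A", "url3": "B"}

# System clustering: url2 is wrongly grouped with the painter.
SYSTEM = {"url1": "A", "url2": "B", "url3": "B"}

gold_profiles = merge_profiles(DOC_ATTRIBUTES, GOLD)
system_profiles = merge_profiles(DOC_ATTRIBUTES, SYSTEM)
print(gold_profiles == system_profiles)
```

Both calls use identical per-document attributes, yet the system's merged profiles differ from the reference ones because one page was assigned to the wrong person: the second individual ends up described as both a professor and a painter. An evaluation of the merged task therefore has to account for clustering and extraction errors together.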

In addition, we want to consider another type of entity: organizations. Name ambiguity for organizations is a highly relevant problem faced by Online Reputation Management systems. Take, for instance, the online company Amazon. In order to trace mentions and opinions about Amazon in web data (including news and blog feeds and input from social networks), the system must filter out alternative senses of “Amazon” (the South American river, the nation of female warriors, etc.). But such filtering cannot be done by liberally adding keywords to a query (e.g. “amazon online store”), because that may harm recall, and recall is crucial for reputation management.
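The recall problem with keyword-narrowed queries can be sketched in a few lines of Python. The mentions below are invented toy data, the substring matching is a deliberately naive stand-in for a real search engine, and the function names are hypothetical; the point is only to show how adding keywords trades recall for precision.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented toy mentions: (text, is the mention about the company?).
MENTIONS = [
    ("Amazon cut prices on its e-readers today", True),
    ("My Amazon order arrived two days late", True),
    ("The Amazon online store has a new return policy", True),
    ("The Amazon river basin spans several countries", False),
    ("Amazon warriors appear in Greek mythology", False),
]


def matches(text, keywords):
    """Naive query model: every keyword must occur in the text."""
    lowered = text.lower()
    return all(keyword in lowered for keyword in keywords)


def evaluate(keywords):
    """Precision and recall of a keyword query over the toy mentions."""
    retrieved = [i for i, (text, _) in enumerate(MENTIONS) if matches(text, keywords)]
    relevant = [i for i, (_, about_company) in enumerate(MENTIONS) if about_company]
    return precision_recall(retrieved, relevant)


broad = evaluate(["amazon"])
narrow = evaluate(["amazon", "online", "store"])
```

On this toy data the broad query retrieves every mention of the company along with the river and the warriors, while the narrowed query returns only clean results but misses two of the three relevant mentions. That loss of recall is what makes keyword narrowing unsuitable for reputation management.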

WePS 3 Focus: involvement of industrial stakeholders

WePS 1 and WePS 2 focused on consolidating a research community around the problem and an optimal evaluation methodology. In WePS 3 the focus is on involving industrial stakeholders in the evaluation campaign, as providers of input to the task design phase and also as providers of realistic-scale datasets. To reach this goal we have incorporated a representative from industry in each of the tasks:

For the Web People Search Task, co-coordinator Andrew Borthwick is principal scientist at Intelius, Inc., one of the main Web People Search services, which provides advanced people attribute extraction and profile matching from web pages.
For the Online Reputation Management task, co-coordinator Adolfo Corujo is Senior Director of Online Communication at Llorente & Cuenca, the leading communications consultancy firm in Spain and Latin America.

Organizers

The general lab coordinators are:

Julio Gonzalo (UNED, Madrid)
Satoshi Sekine (NYU, New York)

The coordinators for Task 1 (people search) are:

Javier Artiles (UNED, Madrid)
Andrew Borthwick (Intelius, Inc., Bellevue, Washington)

The coordinators for Task 2 (organizations search) are:

Bing Liu (University of Illinois at Chicago)
Enrique Amigó (UNED, Madrid)
Adolfo Corujo (Llorente & Cuenca, Madrid)

Besides the track coordinators, WePS has a representative Steering Committee.

WePS 3 Agenda

This is the tentative agenda for WePS 3:

Release of trial data	15 February 2010
Release of test data	7 June 2010
Submissions due	21 June 2010
Release of official results	15 July 2010
Papers due	15 August 2010
Workshop	23 September (CLEF 2010, Padua)

The organization of the workshop will follow the successful model used for WePS 2, and will include (i) overviews of the two tasks, (ii) selected presentations from participants, focusing on successful strategies and innovative proposals, (iii) invited talks by leading researchers, industrial stakeholders and experts in evaluation methodologies, (iv) a poster session where all participants can present and discuss their approaches, and (v) discussion sessions to shape future WePS campaigns.

WePS-3 is sponsored by Intelius

Person attribute extraction and clustering are core technologies for Intelius. Intelius' support of WePS-3 continues its history of support for research in this area, as shown by the $50,000 Spock Challenge (2007), which was sponsored by the Intelius subsidiary spock.com. Intelius is actively hiring people with expertise in people record linkage and attribute extraction for its data research team. Those interested should see our ad or contact Dr. Borthwick for more information.