A Corpus for Entity Profiling in Microblog Posts

On this page you can find the datasets presented in the paper A Corpus for Entity Profiling in Microblog Posts. They comprise two manually annotated corpora for evaluating the task of identifying aspects on Twitter, both based on the WePS-3 ORM task dataset.

Aspects dataset

[entityProfiling_ORM_Twitter_aspects_dataset.zip]

The aspects dataset was annotated using a pooling methodology, for which we implemented several methods that automatically extract, from tweets, the aspects that are relevant to an entity.

The dataset is organized into the following three files:

aspects_terms_annotations.tsv: A tab-separated values file containing the annotations. Each line corresponds to a term, and the columns give the entity name, the term itself, and the assessments given by the three judges (J₁, J₂ and J₃). Assessments are encoded as follows: 1 = relevant, 2 = not relevant, 3 = competitor, 4 = unknown.
aspects_goldstandard_qrels: This file contains the terms annotated as relevant/competitor by two or more judges. It is a standard TREC qrels file, so it can be used as the gold standard in evaluation tools such as trec_eval.
aspects_queries_ids.tsv: A table that maps each query_id used in the qrels file above to the company name in the WePS-3 ORM task dataset.
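
As a rough illustration of how the "two or more judges" rule relates the annotations file to the gold standard, the following Python sketch selects agreed-upon terms. It is not the script used to build aspects_goldstandard_qrels. It assumes that each row has exactly five tab-separated columns (entity, term, J₁, J₂, J₃) with no header row, and that a judge counts toward agreement when the assessment is either 1 (relevant) or 3 (competitor). The function name gold_terms and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Assessment codes as listed for aspects_terms_annotations.tsv.
RELEVANT, NOT_RELEVANT, COMPETITOR, UNKNOWN = 1, 2, 3, 4


def gold_terms(tsv_file, min_judges=2):
    """Return {entity: [term, ...]} for terms that at least `min_judges`
    judges marked as relevant or competitor.

    Assumption: five tab-separated columns per row (entity, term, J1, J2, J3)
    and no header row. Adjust the indices if the actual file differs.
    """
    gold = defaultdict(list)
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or malformed lines
        entity, term = row[0], row[1]
        try:
            votes = [int(v) for v in row[2:5]]
        except ValueError:
            continue  # non-numeric assessments, e.g. an unexpected header row
        positive = sum(v in (RELEVANT, COMPETITOR) for v in votes)
        if positive >= min_judges:
            gold[entity].append(term)
    return dict(gold)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "acme\tbattery\t1\t1\t2\n"
    "acme\tweather\t2\t2\t4\n"
    "acme\trivalco\t3\t1\t2\n"
)
print(gold_terms(sample))  # {'acme': ['battery', 'rivalco']}
```

With the real file, the same function can be called on an open file handle, for example open("aspects_terms_annotations.tsv", encoding="utf-8", newline="").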

Opinion Targets dataset

[entityProfiling_ORM_Twitter_opinionTargets_dataset.zip]

The opinion targets dataset has been annotated by considering individual tweets related to an entity and manually identifying whether each tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.

The dataset consists of an XML file (opinion_targets_annotation.xml) that includes all the annotations. For each annotated entity, the list of the annotated tweets is given. Each tweet includes the following information:

id: The id of the tweet
weps3-label: The label in the WePS-3 ORM dataset (related if the tweet refers to the given entity, unrelated if it does not).
subjectivity: true if the tweet contains an explicit opinionated expression, false otherwise.
subjective-phrases: If present, it contains the phrases annotated as expressing subjectivity. Each phrase contains the character offsets in the content of the tweet (start and end attributes), as well as the phrase itself.
opinion-targets: If present, it contains the phrases annotated as opinion targets. Each phrase contains the character offsets in the content of the tweet (start and end attributes), as well as the phrase itself.

The XML Schema that validates the annotation file is also provided (opinion_targets_schema.xsd).
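
The following Python sketch shows one way the per-tweet fields described above might be read and how a phrase could be checked against its character offsets. The element and attribute layout in SAMPLE (entity, tweet and phrase elements, with id, weps3-label and subjectivity as attributes) is a guess made for illustration only; opinion_targets_schema.xsd is the authoritative description and the names should be adjusted to match it. Whether the end offset is inclusive or exclusive is also not stated here, so the hypothetical helper span_matches accepts either convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the layout below is assumed from the field list,
# not copied from opinion_targets_schema.xsd. The tweet is made up.
SAMPLE = """
<entity name="acme">
  <tweet id="1" weps3-label="related" subjectivity="true">
    <subjective-phrases>
      <phrase start="0" end="6">I love</phrase>
    </subjective-phrases>
    <opinion-targets>
      <phrase start="11" end="17">camera</phrase>
    </opinion-targets>
  </tweet>
</entity>
"""

GROUPS = ("subjective-phrases", "opinion-targets")


def read_annotations(xml_text):
    """Return one dict per tweet with its label, subjectivity flag and spans."""
    root = ET.fromstring(xml_text)
    tweets = []
    for tweet in root.iter("tweet"):
        record = {
            "id": tweet.get("id"),
            "weps3-label": tweet.get("weps3-label"),
            "subjective": tweet.get("subjectivity") == "true",
        }
        for group in GROUPS:
            # Each span is (start, end, phrase); the list is empty if the
            # group is absent from the tweet.
            record[group] = [
                (int(p.get("start")), int(p.get("end")), p.text or "")
                for container in tweet.iter(group)
                for p in container.iter("phrase")
            ]
        tweets.append(record)
    return tweets


def span_matches(text, start, end, phrase):
    """Check a phrase against the tweet text, trying both an exclusive and
    an inclusive end offset, since the convention is an open assumption."""
    return text[start:end] == phrase or text[start:end + 1] == phrase


annotations = read_annotations(SAMPLE)
tweet_text = "I love the camera"  # would come from the downloaded tweet
for start, end, phrase in annotations[0]["opinion-targets"]:
    print(phrase, span_matches(tweet_text, start, end, phrase))
```

Because the tweet content itself is not redistributed, a check like span_matches only becomes possible once the original tweets have been downloaded.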

To comply with Twitter's TOS, tweets are not redistributed; only tweet IDs and author usernames are provided. The opinion_targets_tweets.dat file contains the tweet ID, the username and the tweet URL of each tweet annotated with opinion targets. The original tweets can then be downloaded using the TREC Microblog Corpus Tool:


java -Xmx4g -cp 'twitter-corpus-tools/lib/*:twitter-corpus-tools/dist/twitter-corpus-tools-0.0.1.jar' com.twitter.corpus.download.AsyncEmbeddedJsonStatusBlockCrawler -data opinion_targets_tweets.dat -output opinion_targets_tweets.json.gz

A newer version of this tool is available at https://github.com/lintool/twitter-tools.

Citation

Please cite the article below if you use these resources in your research:

Spina D., Meij E., Oghina A., Bui M. T., Breuss M., and de Rijke M.
A Corpus for Entity Profiling in Microblog Posts.
LREC Workshop on Language Engineering for Online Reputation Management, 2012.

BibTeX

@InProceedings{spina2012corpus,
	title = "{A Corpus for Entity Profiling in Microblog Posts}",
	booktitle = "{LREC Workshop on Language Engineering for Online Reputation Management}",
	author = "D. Spina and E. Meij and A. Oghina and M. T. Bui and M. Breuss and M. de Rijke",
	year = "2012"
}