On this page you can find the datasets presented in the paper A Corpus for Entity Profiling in Microblog Posts. They comprise two manually annotated corpora for evaluating the task of identifying aspects on Twitter, both based on the WePS-3 ORM task dataset.

Aspects dataset

The aspects dataset was annotated using a pooling methodology, for which we implemented several methods that automatically extract aspects from the tweets relevant to an entity.

The dataset is organized into the following three files:


Opinion Targets dataset

The opinion targets dataset has been annotated by considering individual tweets related to an entity and manually identifying whether each tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.

The dataset consists of an XML file (opinion_targets_annotation.xml) that includes all the annotations. For each annotated entity, the list of the annotated tweets is given. Each tweet includes the following information:


The XML Schema that validates the annotation file is also provided (opinion_targets_schema.xsd).
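
Because the per-tweet fields are defined by the schema, a quick way to get oriented in the annotation file is to tally the element names that actually occur in it. The following is a minimal sketch using only Python's standard library. The function name count_tags is illustrative and not part of the distribution, and the code makes no assumption about the element names used in the file.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element name occurs in an XML document.

    `source` is a file path or a binary file object. Each element is
    cleared once it has been counted, which keeps memory use low.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children, text, and attributes
    return counts
```

Calling count_tags("opinion_targets_annotation.xml") returns a Counter keyed by element name. If the file declares an XML namespace, ElementTree reports names in the form {namespace}name.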

In order to respect Twitter's TOS, tweets are not redistributed; only tweet IDs and author usernames are provided. The opinion_targets_tweets.dat file contains the tweet ID, the username, and the tweet URL of each tweet annotated with opinion targets. The original tweets can then be downloaded using the TREC Microblog Corpus Tool:

java -Xmx4g -cp 'twitter-corpus-tools/lib/*:twitter-corpus-tools/dist/twitter-corpus-tools-0.0.1.jar' com.twitter.corpus.download.AsyncEmbeddedJsonStatusBlockCrawler -data opinion_targets_tweets.dat -output opinion_targets_tweets.json.gz

A newer version of this tool is available at https://github.com/lintool/twitter-tools.
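
Tweets may have been deleted or made private since annotation, so it can be useful to check which annotated tweets were actually retrieved. The sketch below rests on two assumptions that should be checked against your own files: that opinion_targets_tweets.dat is whitespace-separated with the tweet ID in the first column and the username in the second, and that the crawler output is a gzip-compressed file with one JSON status per line carrying an id_str or id field. The function names are illustrative and not part of the dataset or the crawler.

```python
import gzip
import json


def read_annotated_ids(dat_path):
    """Map tweet ID -> username from the .dat file.

    Assumed layout: ID first, username second, separated by whitespace.
    """
    ids = {}
    with open(dat_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                ids[fields[0]] = fields[1]
    return ids


def read_downloaded_ids(json_gz_path):
    """Collect the IDs of the tweets present in the crawler output.

    Assumed layout: gzip-compressed text, one JSON object per line.
    """
    found = set()
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                status = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if not isinstance(status, dict):
                continue
            tweet_id = status.get("id_str") or status.get("id")
            if tweet_id is not None:
                found.add(str(tweet_id))
    return found


def missing_tweets(dat_path, json_gz_path):
    """Return {tweet ID: username} for annotated tweets not downloaded."""
    annotated = read_annotated_ids(dat_path)
    found = read_downloaded_ids(json_gz_path)
    return {tid: user for tid, user in annotated.items() if tid not in found}
```

For example, missing_tweets("opinion_targets_tweets.dat", "opinion_targets_tweets.json.gz") lists the annotated tweets that could not be retrieved.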


Citation

Please cite the article below if you use these resources in your research:
Spina D., Meij E., Oghina A., Bui M. T., Breuss M., and de Rijke M.
A Corpus for Entity Profiling in Microblog Posts.
LREC Workshop on Language Engineering for Online Reputation Management, 2012

BibTeX

@InProceedings{ spina2012corpus,
	title = "{A Corpus for Entity Profiling in Microblog Posts}",
	booktitle = "{LREC Workshop on Language Engineering for Online Reputation Management}",
	author = "D. Spina and E. Meij and A. Oghina and M.T. Bui and M. Breuss and M. de Rijke",
	year = "2012"
}