Detecting malicious tweets in trending topics using a statistical analysis of language.
Juan Martinez-Romo and Lourdes Araujo.
Expert Syst. Appl. 40(8): 2992-3000 (2013)

Twitter spam detection is a recent area of research in which most previous work has focused on the
identification of malicious user accounts and on honeypot-based approaches. However, in this paper we
present a methodology based on two new aspects: the detection of spam tweets in isolation, without
prior information about the user; and the application of a statistical analysis of language to detect spam
in trending topics. Trending topics capture the emerging Internet trends and topics of discussion that are
on everybody’s lips. This growing microblogging phenomenon therefore allows spammers to disseminate
malicious tweets quickly and massively. In this paper we present the first work that tries to detect spam
tweets in real time using language as the primary tool. We first collected and labeled a large dataset with
34K trending topics and 20 million tweets. We then proposed a reduced set of features that are hard for
spammers to manipulate. In addition, we have developed a machine learning system with some orthogonal
features that can be combined with other sets of features with the aim of analyzing emergent characteristics
of spam in social networks. We have also conducted an extensive evaluation showing that our system
obtains an F-measure at the same level as the best state-of-the-art systems based on the detection of
spam accounts. Thus, our system can be applied to Twitter spam detection in trending topics in real
time, mainly because it analyzes tweets instead of user accounts.