Web spam identification through language model analysis.
Juan Martinez-Romo, Lourdes Araujo
Proc. Fifth International Workshop on Adversarial Information Retrieval on the Web (AirWeb 2009).
ACM International Conference Proceeding Series, pp. 21-28 (2009).

This paper applies a language model approach to different sources of information extracted from a Web page,
in order to provide high-quality indicators for the detection of Web Spam. Two pages linked by a hyperlink should
be topically related, even if the contextual relation is weak. For this reason we have analysed different
sources of information in a Web page that belong to the context of a link, and we have applied Kullback-Leibler
divergence to them to characterise the relationship between two linked pages. Moreover, we combine some of these
sources of information in order to obtain richer language models. Given the different nature of internal and external
links, our study also distinguishes between these two types of link, which yields a significant improvement in classification
tasks. The result is a system that improves the detection of Web Spam on two large public datasets,
WEBSPAM-UK2006 and WEBSPAM-UK2007.
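
The core measurement described in the abstract, a Kullback-Leibler divergence between language models built from two text sources around a link, can be illustrated with a minimal sketch. The abstract does not specify the model order, the smoothing, or the tokenisation, so the unigram models, the add-alpha smoothing, the whitespace tokeniser, and the function names below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary.

    Smoothing is an assumption of this sketch; it keeps every probability
    non-zero so that the divergence below is always finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(source_text, target_text, alpha=1.0):
    """KL(P || Q), with P built from source_text and Q from target_text.

    In the setting of the abstract, source_text could be a piece of the
    link context (for example the anchor text) and target_text a piece of
    the linked page (for example its title or body).
    """
    vocabulary = set(source_text.lower().split()) | set(target_text.lower().split())
    if not vocabulary:
        return 0.0  # two empty texts: nothing to compare
    p = unigram_model(source_text, vocabulary, alpha)
    q = unigram_model(target_text, vocabulary, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)


if __name__ == "__main__":
    anchor = "cheap flights to madrid"
    related_page = "cheap flights and hotels in madrid"
    unrelated_page = "buy pills online casino poker"
    print(kl_divergence(anchor, related_page))
    print(kl_divergence(anchor, unrelated_page))
```

With these made-up strings, the anchor text diverges less from the topically related page than from the unrelated one, which is the kind of signal a classifier could use as a feature. Computing the same value separately for internal and external links would follow the distinction drawn in the abstract.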