Web spam detection: new classification features based on qualified link analysis and language models.
Lourdes Araujo, Juan Martinez-Romo
IEEE Transactions on Information Forensics and Security 5(3): 581-590 (2010)

Web spam is a serious problem for search engines because the quality of their results can be severely
degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system
based on a classifier that combines new link-based features with language-model (LM)-based ones. These features
are not only related to quantitative data extracted from the Web pages, but also to qualitative properties,
mainly of the page links. We consider, for instance, the ability of a search engine to find, using information
provided by the page for a given link, the page that the link actually points at. This can be regarded as
indicative of the link reliability. We also check the coherence between a page and another one pointed at by any
of its links. Two pages linked by a hyperlink should be semantically related, at least through a weak contextual relation.
Thus, we apply an LM approach to different sources of information from a Web page that belongs to the context
of a link, in order to provide high-quality indicators of Web spam. We have specifically applied the Kullback-Leibler
divergence on different combinations of these sources of information in order to characterize the relationship
between two linked pages. The result is a system that significantly improves the detection of Web spam using
fewer features, on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
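
To make the Kullback-Leibler idea above concrete, here is a minimal sketch of how the divergence between two text sources around a link (for example, the anchor text and the content of the target page) could be computed. It is not the authors' implementation: the whitespace tokenization, the add-alpha smoothing, and the function names unigram_lm and kl_divergence are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Add-alpha smoothed unigram language model over a shared vocabulary.

    Smoothing (an assumption of this sketch) keeps every probability
    non-zero, so the KL divergence below is always finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(source_text, target_text, alpha=1.0):
    """KL(P_source || P_target) between two unigram language models.

    A low value suggests the two texts use similar vocabulary; a high
    value suggests the link joins weakly related content.
    """
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_lm(source_text, vocab, alpha)
    q = unigram_lm(target_text, vocab, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


# Hypothetical example texts: an anchor and two candidate target pages.
anchor = "cheap flights to madrid"
related = "book cheap flights to madrid and barcelona"
unrelated = "win casino bonus poker online now"

print(kl_divergence(anchor, related))
print(kl_divergence(anchor, unrelated))
```

In this sketch the related target yields a smaller divergence than the unrelated one. In a classifier, such a divergence would be one feature per combination of information sources rather than a decision on its own.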