Discovering taxonomies in Wikipedia by means of grammatical evolution.
Lourdes Araujo, Juan Martinez-Romo, Andres Duque
Soft Comput. 22(9): 2907-2919 (2018)

Wikipedia is a free encyclopedia created as an international collaborative project.
One of its peculiarities is that any user can edit its contents almost without restrictions,
which has given rise to a phenomenon known as vandalism. Vandalism
is any deliberate attempt to damage the integrity of the encyclopedia.
To address this problem, in recent years several automatic detection
systems and associated features have been developed. This work implements
one of these systems, which uses three sets of new features based on different
techniques. Specifically, we study the applicability of a leading technology
such as deep learning to the problem of vandalism detection. The first set is obtained
by expanding a list of vandal terms taking advantage of the existing
semantic-similarity relations in word embeddings and deep neural networks.
Deep learning techniques are applied to the second set of features, specifically
Stacked Denoising Autoencoders (SDA), in order to reduce the dimensionality
of a bag of words model obtained from a set of edits taken from Wikipedia. The
last set uses graph-based ranking algorithms to generate a list of vandal terms
from a vandalism corpus extracted from Wikipedia. These three sets of new
features are evaluated separately as well as together to study their complementarity,
improving on state-of-the-art results. The system evaluation has
been carried out on a corpus extracted from Wikipedia (WP Vandal) as well
as on another called PAN-WVC-2010 that was used in a vandalism detection
competition held at the CLEF conference.
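
The third feature set described above, a list of vandal terms produced by graph-based ranking, can be illustrated with a minimal sketch. The abstract does not say which ranking algorithm or graph construction is used, so everything here is an assumption: the graph links words that co-occur in the same edit, the ranking is plain PageRank, and the names build_cooccurrence_graph, pagerank, top_terms and the toy edits are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(edits):
    """Link every pair of distinct words that appear in the same edit.

    Assumption: whitespace tokenization and edit-level co-occurrence;
    the paper's actual graph construction is not given in the abstract.
    """
    graph = defaultdict(set)
    for edit in edits:
        words = set(edit.lower().split())
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected, unweighted graph."""
    nodes = sorted(graph)
    if not nodes:
        return {}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def top_terms(edits, k=3):
    """Return the k highest-ranked words, ties broken alphabetically."""
    ranks = pagerank(build_cooccurrence_graph(edits))
    return sorted(ranks, key=lambda w: (-ranks[w], w))[:k]


# Hypothetical toy "vandalism corpus", for illustration only.
toy_edits = [
    "stupid page lol",
    "stupid article sucks",
    "lol stupid sucks",
    "article deleted",
]

print(top_terms(toy_edits, k=1))
```

In this toy corpus the word "stupid" co-occurs with the most other words, so it ranks first. In a system like the one described, membership in such a ranked list could then feed a feature of an edit classifier.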