NLP Group at UNED - Víctor Fresno - PhD thesis

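"Representación Autocontenida de documentos HTML: una propuesta basada en combinaciones heurísticas de criterios" (Self-Contained Representation of HTML Documents: A Proposal Based on Heuristic Combinations of Criteria)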

A copy of the document can be found here (in Spanish).


Abstract:

In this dissertation, a new approach to self-contained web page representation is proposed. Two term weighting functions are presented as part of the definition of a document representation model. The approach rests on one main hypothesis: reading is an active process in which author and reader both contribute their experience and knowledge to the communication.

The Internet can be seen as a huge amount of unstructured online information. This inherent chaos creates the need for systems that help us search for and access information efficiently. The main aim of this research is the development of web page representations based only on text content, and the fields of application are automatic web page classification and document clustering. These tasks are applied to the creation of web directories and to the clustering of documents retrieved by a search engine. In these contexts, representations are usually mixed: they are based on an analysis of the hypergraph structure and on the page content. The proposed approach can be complementary, since it explores the analysis of text content.

One function, called ACC (Analytical Combination of Criteria), is based on a linear combination of heuristic criteria extracted from the processes of reading and writing text. The other, FCC (Fuzzy Combination of Criteria), is built on a fuzzy engine that combines the same criteria. ACC and FCC allow us to represent HTML documents without any analysis of a reference document collection. There is no need to count term frequencies across the documents of a collection, so representations are generated without downloading any additional web pages. Furthermore, the design of ACC and FCC is independent of the document type; the same heuristics are applied to any web page.
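
The idea of a linear combination of per-page criteria can be illustrated with a minimal Python sketch. It is not the ACC function defined in the thesis: the abstract does not list the criteria or their coefficients, so the four criteria used here (frequency, title, emphasis, position), the values in WEIGHTS, and the name acc_like_weights are all assumptions made for illustration. The sketch only shows that such a score can be computed from a single page, with no collection statistics.

```python
import re
from collections import Counter

# Hypothetical criteria and coefficients (assumption): the abstract does not
# name the criteria or their weights, so these values are illustrative only.
WEIGHTS = {"frequency": 0.3, "title": 0.3, "emphasis": 0.25, "position": 0.15}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def acc_like_weights(title, emphasized, body):
    """Score each body term as a linear combination of per-page criteria.

    Only the page itself is used: no document collection is consulted.
    """
    body_tokens = tokenize(body)
    if not body_tokens:
        return {}
    counts = Counter(body_tokens)
    max_count = max(counts.values())
    title_terms = set(tokenize(title))
    emphasized_terms = set(tokenize(emphasized))
    total = len(body_tokens)

    # Index of the first occurrence of each term in the body.
    first_pos = {}
    for index, token in enumerate(body_tokens):
        first_pos.setdefault(token, index)

    scores = {}
    for term, count in counts.items():
        criteria = {
            "frequency": count / max_count,               # relative frequency in the page
            "title": 1.0 if term in title_terms else 0.0,
            "emphasis": 1.0 if term in emphasized_terms else 0.0,
            "position": 1.0 - first_pos[term] / total,    # earlier terms score higher
        }
        scores[term] = sum(WEIGHTS[name] * value for name, value in criteria.items())
    return scores
```

Because every criterion lies in [0, 1] and the illustrative coefficients sum to 1, each term weight also lies in [0, 1]. A fuzzy variant in the spirit of FCC would replace the weighted sum with a rule-based inference step over the same criteria.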

The evaluation is carried out on web page classification and clustering tasks. A Naïve Bayes classifier is selected for the supervised machine learning process and a partitional algorithm is chosen for the clustering process. The Naïve Bayes algorithm is very simple and has previously obtained good results in many studies. The selected clustering algorithm, which belongs to the CLUTO toolkit, has also obtained very good results in many different document clustering tasks.

In the experimental analysis, ACC and FCC showed the best overall behaviour. In Naïve Bayes classification, four prior probability functions were analyzed. The F-measure results varied depending on the selected term weighting function. In general, ACC and FCC showed the most stable F-measure results, and were among the best when the dimensions of the representations were smallest. Therefore, similar rates can be obtained with lower-dimensional representations in classification tasks. In the partitional clustering problems, the results obtained by the ACC and FCC functions were generally the best, and they improved as the number of clusters increased.
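
For readers unfamiliar with the F-measure used in the evaluation, the following Python sketch computes it from raw counts using the standard definition (the weighted harmonic mean of precision and recall). The abstract does not state which variant was used in the thesis, so the default beta=1.0 and the function name f_measure are assumptions.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Return the F-beta score from raw counts, or 0.0 when it is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta=1.0, precision and recall count equally; for example, 8 correct assignments with 2 false positives and 2 false negatives give a precision and a recall of 0.8 each.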