Evolving natural language grammars without supervision.
Lourdes Araujo, Jesús Santamaría

Proc. IEEE Congress on Evolutionary Computation (CEC 2010)
IEEE Press, pp 1-8 (2010)

Unsupervised grammar induction is one of the most difficult tasks in language processing. Its goal is to
extract a grammar representing the language structure using texts without annotations of this structure.
We have devised an evolutionary algorithm which, for each sentence, evolves a population of trees that represent
different parse trees of that sentence. Each of these trees represents a part of a grammar. The evaluation
function takes into account the contexts in which each sequence of Part-Of-Speech tags (POSseq) appears in the
training corpus, as well as the frequencies of those POSseqs and contexts. The grammar for the whole training
corpus is constructed in an incremental manner. The algorithm has been evaluated using a well-known annotated
English corpus, though the annotations have only been used for evaluation purposes. Results indicate that the
proposed algorithm is able to improve on the results of a classical optimization algorithm, such as EM (Expectation
Maximization), for short grammar constituents (the right-hand sides of the grammar rules), and that its precision is better in general.
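
The abstract says the evaluation function draws on the contexts in which each POSseq appears and on the frequencies of those POSseqs and contexts. The sketch below shows one possible way to gather such statistics from a POS-tagged corpus. It is not the authors' evaluation function: the function name `collect_posseq_stats`, the `max_len` parameter, the boundary marker `#`, and the choice of a single tag on each side as the "context" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

BOUNDARY = "#"  # hypothetical sentence-boundary marker (an assumption)


def collect_posseq_stats(tagged_sentences, max_len=3):
    """Count each POS-tag sequence and the (left, right) contexts it occurs in.

    tagged_sentences: iterable of lists of POS tags, one list per sentence.
    max_len: longest tag sequence to record (illustrative parameter).
    Returns (seq_freq, contexts): seq_freq maps a tag tuple to its count,
    contexts maps a tag tuple to a Counter of (left_tag, right_tag) pairs.
    """
    seq_freq = Counter()
    contexts = defaultdict(Counter)
    for tags in tagged_sentences:
        # Pad so that every sequence has a left and a right neighbour.
        padded = [BOUNDARY] + list(tags) + [BOUNDARY]
        n = len(tags)
        for length in range(1, max_len + 1):
            for start in range(1, n - length + 2):
                seq = tuple(padded[start:start + length])
                context = (padded[start - 1], padded[start + length])
                seq_freq[seq] += 1
                contexts[seq][context] += 1
    return seq_freq, contexts


if __name__ == "__main__":
    corpus = [
        ["DT", "NN", "VBD"],
        ["DT", "NN", "VBD", "DT", "NN"],
    ]
    seq_freq, contexts = collect_posseq_stats(corpus, max_len=2)
    print(seq_freq[("DT", "NN")])        # frequency of the POSseq
    print(dict(contexts[("DT", "NN")]))  # contexts it appears in
```

A fitness measure could then, for example, favour tree nodes whose POSseq is frequent and shares its contexts with other candidate constituents; how the paper actually combines these quantities is not stated in the abstract.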