Integrating the Probabilistic Model BM25/BM25F into Lucene.

Joaquín Pérez-Iglesias


Please if you want to cite this software, use this bibtex.

Introduction

This document describes the BM25 and BM25F implementation using the Lucene Java Framework. The implementation described here can be downloaded from http://nlp.uned.es/~jperezi/Lucene-BM25/jar/Lucene-BM25-1.0.jar. Both models have stood out at TREC by their performance and are considered as state-of-the-art in the IR community. BM25 is applied to plain documents, that is documents without fields, on the other hand BM25F is applied to documents with structure.

Apache Lucene is a high-performance and full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Lucene is scalable and offers high-performance indexing, and has become one of the most used search engine libraries in both academia and industry.

Lucene ranking function, the core of any search engine applied to determine how relevant a document is to a given query, is built on a combination of the Vector Space Model (VSM) and the Boolean model of Information Retrieval. The main idea behind VSM approach is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Lucene uses also the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the query specification.

In this paper, the implementation of BM25 probabilistic model and its extension for semi-structured IR, BM25F, is described in detail.

One of the main Lucene's constraints to be widely used by IR community is the lack of different retrieval models implementations. Our goal with this work is to offer to IR community a more advanced ranking model which can be compared with other IR libraries available.

Motivation

There exists previous implementations of alternative Information Retrieval Models for Lucene. The most representative case of that is the Language Model implementationhttp://ilps.science.uva.nl/resources/lm-lucene from Intelligent Systems Lab Amsterdam. Another example is described at http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf where Lucene is compared with Juru system. At this example Lucene document length normalization is improved in order to evaluate Lucene ranking function performance.

BM25 has been widely use by IR researchers and engineers to improve search engine relevance, so from our point of view, a BM25/BM25F implementation for Lucene become necessary to make Lucene more popular for IR community.

Included Models

The developed models are based on the information that can be found at http://www.zaragozas.info/hugo/academic/pdf/tutorial_sigir07_2d.pdf. More specifically the implemented ranking functions are as next:

BM25

$\displaystyle R(q,d) = \sum_{t \in q} \frac{occurs_{t}^{d}}{k_1 ((1-b)+b \frac{l_d}{avl_d}) + occurs_{t}^{d}}$    

where $ occurs_t^d$ is the term frequency of $ t$ in $ d$; $ l_d$ is the document $ d$ length; $ avl_{d}$ is the document average length along the collection; $ k_1$ is a free parameter usually 2 and $ b \in[0,1]$ (usually 0.75). Assigning 0 to $ b$ is equivalent to avoid the process of normalisation and therefore the document length will not affect the final score. If $ b$ takes 1, we will be carrying out a full length normalisation.

$\displaystyle idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}$    

where $ N$ is the number of document in the collection and $ df$ is the number of documents where appears the term $ t$.

A different version of this formula, appears at Wikipedia, in this case the obtained bm25 weight is multiplied by the constant $ (k_1 + 1)$ in order to normalize the weight of terms with a frequency equals to 1 that occurs in documents with an average length.

BM25F

First we obtain the accumulated weight of a term over all fields as next:

$\displaystyle \displaystyle \textit{weight(t,d)} = \sum_{\textit{c in d}} \frac{occurs_{t,c}^d \cdot boost_c}{((1-b_c)+b_c \cdot \frac{l_c}{avl_c})}$    

where $ l_c$ is the field length; $ avl_{c}$ is the average length for the field $ c$; $ b_{c}$ is a constant related to the field length, similar to $ b$ in BM25 and $ boost_c$ is the boost factor applied to field $ c$.

Next, a non-linear saturation is applied $ \frac{weight}{k_1 + weight}$:

$\displaystyle \displaystyle \textit{R(q,d)} = \sum_{\textit{t in q}} \frac{weight(t,d)}{k_1 + weight(t,d) \cdot idf(t)}$    

and equal equation for $ idf(t)$

$\displaystyle \displaystyle idf(t)= \log{\frac{N-df(t)+0.5}{df(t)+0.5}}$    

where $ N$ is the number of document in the collection and $ df$ is the number of documents where appears the term $ t$.

Implementation

The main purpose of this implementation was to integrate the new model ranking into the search Lucene functionalities. In order to accomplish this objective a new Query, Weight, and several Scorers were developed. The main functionalities are implemented at Scorer level, since the main responsibilities of Query and Weight are to prepare the necessary parameters for the Scorers, and create Scorers instances when the search method is invoked. More information in the Query-Weight-Scorer model can be found at http://lucene.apache.org/java/2_4_0/scoring.html.

Query

The execution of a query can be spit into two parts, a boolean filtering and the ranking evaluation. The boolean filtering is carried out by the Scorers ShouldBooleanScorer, MustBooleanScorer and NotBooleanScorer depending on the logic operators applied, while ranking functions are implemented in the score method of BM25TermScorer and BM25FTermScorer.

BM25BooleanScorer will create BM25TermScorer or BM25FTermScorer instances depending on the invoked constructor, as next:

BM25BooleanScorer will ignore any information related to fields that is treated by Lucene QueryParser, so the search will be carried out only amongst the field(s), passed as parameters in the constructor. Besides only boolean queries are supported, any other query type will be split into terms and executed as a boolean query.

It should be noted that both ranking functions do not use query weights, therefore all computation can be done at scorer level.

Scoring

How to use it

The supplied implementation can be used in a similar way as searches are carried out with Lucene, except that BM25Parameters or BM25FParameters must be set before the query is executed, this has to be done in order to set the average lengths, others parameters can be omitted since they are set to default values.

A couple examples of use can be seen below:

BM25

	IndexSearcher searcher = new IndexSearcher("IndexPath");

	//Load average length
	BM25Parameters.load(avgLengthPath);
	BM25BooleanQuery query = new BM25BooleanQuery("This is my Query", 
		"Search-Field",
		new StandardAnalyzer());
	
	TopDocs top = searcher.search(query, null, 10);
	ScoreDoc[] docs = top.scoreDocs;
	
	//Print results
	for (int i = 0; i $<$ top.scoreDocs.length; i++) {
	      System.out.println(docs[i].doc + ":"+docs[i].score);
	}

BM25F

	String[] fields ={"FIELD1","FIELD2"};
	IndexSearcher searcher = new IndexSearcher("IndexPath");

	//Set explicit average Length for each field
	BM25FParameters.setAverageLength("FIELD1", 123.5f);
	BM25FParameters.setAverageLength("FIELD2", 42.2f);
	
	//Set explicit k1 parameter
	BM25FParameters.setK1(1.2f);
	
	//Using boost and b defaults parameters
	BM25BooleanQuery queryF = new BM25BooleanQuery("This is my query",
		fields, new StandardAnalyzer());
	
	//Retrieving NOT normalized scorer values
	TopDocs top = searcher.search(queryF, null, 10);
	ScoreDoc[] docs = top.scoreDocs;
	
	//Print results
	for (int i = 0; i $<$ top.scoreDocs.length; i++) {
	      System.out.println(docs[i].doc + ":"+docs[i].score);
	}

Acknowledgement

Authors want to thank Hugo Zaragoza for his review and comments.



Please send any comments to Joaquin Perez-Iglesias, LSI,UNED. Last update: 12-1-09 11:50