Joaquín Pérez-Iglesias
Apache Lucene is a high-performance and full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Lucene is scalable and offers high-performance indexing, and has become one of the most used search engine libraries in both academia and industry.
Lucene ranking function, the core of any search engine applied to determine how relevant a document is to a given query, is built on a combination of the Vector Space Model (VSM) and the Boolean model of Information Retrieval. The main idea behind VSM approach is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Lucene uses also the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the query specification.
In this paper, the implementation of BM25 probabilistic model and its extension for semi-structured IR, BM25F, is described in detail.
One of the main Lucene's constraints to be widely used by IR community is the lack of different retrieval models implementations. Our goal with this work is to offer to IR community a more advanced ranking model which can be compared with other IR libraries available.
There exists previous implementations of alternative Information Retrieval Models for Lucene. The most representative case of that is the Language Model implementationhttp://ilps.science.uva.nl/resources/lm-lucene from Intelligent Systems Lab Amsterdam. Another example is described at http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf where Lucene is compared with Juru system. At this example Lucene document length normalization is improved in order to evaluate Lucene ranking function performance.
BM25 has been widely use by IR researchers and engineers to improve search engine relevance, so from our point of view, a BM25/BM25F implementation for Lucene become necessary to make Lucene more popular for IR community.
![]() |
![]() |
A different version of this formula, appears at Wikipedia,
in this case the obtained bm25 weight is multiplied by the constant
in order to normalize the weight of terms with a frequency equals to 1 that occurs in documents with an average length.
First we obtain the accumulated weight of a term over all fields as next:
![]() |
Next, a non-linear saturation is applied
:
![]() |
and equal equation for
![]() |
BM25BooleanScorer will create BM25TermScorer or BM25FTermScorer instances depending on the invoked constructor, as next:
public BM25BooleanQuery(String query, String field, Analyzer analyzer) throws ParseException, IOException
public BM25BooleanQuery(String query, String[] fields, Analyzer analyzer) throws ParseException,IOException
BM25BooleanScorer will ignore any information related to fields that is treated by Lucene QueryParser, so the search will be carried out only amongst the field(s), passed as parameters in the constructor. Besides only boolean queries are supported, any other query type will be split into terms and executed as a boolean query.
It should be noted that both ranking functions do not use query weights, therefore all computation can be done at scorer level.
public class CollectionSimilarityIndexer extends DefaultSimilarity {
private static Map< String,Long> length = new HashMap<String, Long>();
@Override
public float lengthNorm(String fieldName, int numTokens) {
Long aux = CollectionSimilarityIndexer.length.get(fieldName);
if (aux==null)
aux = new Long(0);
aux+=numTokens;
CollectionSimilarityIndexer.length.put(fieldName,aux);
return super.lengthNorm(fieldName, numTokens);
}
public static long getLength(String field){
return CollectionSimilarityIndexer.length.get(field);
}
}
After the indexing process we can retrieve the length of a specific field, it can be divided by collection numdocs and save the computed value to a file, where can be read when a Searcher is opened. In the provided implementation a method load(String filePath) is supplied in BM25Parameters in order to load average lengths, more details can be found in the javadoc documentation at http://nlp.uned.es/~jperezi/Lucene-BM25/javadoc.
A couple examples of use can be seen below:
IndexSearcher searcher = new IndexSearcher("IndexPath");
//Load average length
BM25Parameters.load(avgLengthPath);
BM25BooleanQuery query = new BM25BooleanQuery("This is my Query",
"Search-Field",
new StandardAnalyzer());
TopDocs top = searcher.search(query, null, 10);
ScoreDoc[] docs = top.scoreDocs;
//Print results
for (int i = 0; i $<$ top.scoreDocs.length; i++) {
System.out.println(docs[i].doc + ":"+docs[i].score);
}
String[] fields ={"FIELD1","FIELD2"};
IndexSearcher searcher = new IndexSearcher("IndexPath");
//Set explicit average Length for each field
BM25FParameters.setAverageLength("FIELD1", 123.5f);
BM25FParameters.setAverageLength("FIELD2", 42.2f);
//Set explicit k1 parameter
BM25FParameters.setK1(1.2f);
//Using boost and b defaults parameters
BM25BooleanQuery queryF = new BM25BooleanQuery("This is my query",
fields, new StandardAnalyzer());
//Retrieving NOT normalized scorer values
TopDocs top = searcher.search(queryF, null, 10);
ScoreDoc[] docs = top.scoreDocs;
//Print results
for (int i = 0; i $<$ top.scoreDocs.length; i++) {
System.out.println(docs[i].doc + ":"+docs[i].score);
}