ApacheCon Europe 2012

Rhein-Neckar-Arena, Sinsheim, Germany

5–8 November 2012

Text categorization with Lucene and Solr

Tommaso Teofili

Audience level:
Lucene, Solr & Friends

Thursday 11 a.m.–11:45 a.m. in Level 1 Right


This talk shows how Lucene indexes can serve as a knowledge base for building effective NLP classifiers, using approaches such as the vector space model and naive Bayes, and how Solr can use those classifiers for tasks such as automatic text categorization of input documents.


Apache Lucene indexes can hold large numbers of documents along with other metadata (for example, term vectors), which are usually used for searching, scoring, and other common search engine tasks. Interestingly, all the information already in the index can be seen as a corpus of data with features (i.e. the fields) and labels (i.e. specific values of some fields), so it can be used to build models for NLP and machine learning tasks efficiently. The talk will present how to use existing Lucene / Solr facilities to create NLP classifiers (based on vector space and naive Bayes models) that can classify unseen text and therefore automatically categorize input documents.
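
As a rough illustration of the idea of treating indexed fields as features and field values as labels, here is a minimal naive Bayes sketch in plain Python. It does not use the Lucene or Solr APIs and is not the speaker's implementation: the in-memory INDEX list, the field names "text" and "category", and the functions tokenize, train and classify are all hypothetical stand-ins, and the whitespace tokenizer and add-one smoothing are assumptions made for brevity.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for an index: each "document" has a text field
# (the features) and a category field (the label). Not a real Lucene index.
INDEX = [
    {"text": "lucene index search query scoring", "category": "search"},
    {"text": "solr search server faceting query", "category": "search"},
    {"text": "naive bayes classifier training labels", "category": "ml"},
    {"text": "machine learning model training features", "category": "ml"},
]


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def train(index, text_field="text", label_field="category"):
    """Collect per-label document counts and per-label term counts."""
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for doc in index:
        label = doc[label_field]
        doc_counts[label] += 1
        for term in tokenize(doc[text_field]):
            term_counts[label][term] += 1
            vocabulary.add(term)
    return doc_counts, term_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log prior plus log likelihood."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior: share of indexed documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in tokenize(text):
            # Add-one (Laplace) smoothing so unseen terms do not zero out a label.
            count = term_counts[label][term]
            score += math.log((count + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train(INDEX)
```

With this toy data, classify("search query on a solr index", MODEL) returns "search" and classify("training a classifier model", MODEL) returns "ml". In a real index the per-label term statistics would come from the index itself rather than from a Python list.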