We recently had a presentation at Activate 2018 about entity extraction in the context of product search. For example: how to tell, when the user typed in Activate 2018, that the intent was to run conference:Activate AND date:2018?
One of the technologies that can solve this problem is OpenNLP. We ran a demo of OpenNLP during our Activate presentation, and you can find the commands on our GitHub account. In this blog post, we’ll run through that same demo and give some more details on the thinking behind it.
OpenNLP is, to quote the website, a machine learning based toolkit for the processing of natural language text. It provides lots of functionality, like tokenization, lemmatization and part-of-speech (PoS) tagging. Of this functionality, Named Entity Recognition (NER) can help us with query understanding.
OpenNLP is open-source (Apache license) and it’s already integrated, to various degrees, in our favorite search engines: Solr and Elasticsearch. Solr has OpenNLP-based tokenizing, lemmatizing, sentence and PoS detection in its analysis chain. There is also an OpenNLP NER update request processor. Elasticsearch, on the other hand, has a well-maintained Ingest plugin based on OpenNLP NER.
Setup and basic usage
Once you download and extract OpenNLP, you can go ahead and use the command line tool (bin/opennlp) to test and build models. You won’t use this tool in production though, for two reasons:
- if you’re running a Java application (which includes Solr/Elasticsearch), you will likely prefer the Name Finder Java API. It has more options than the command line tool.
- running bin/opennlp loads the model every time, which adds latency. If you expose NER functionality through a REST API, you only need to load the model on startup. This is what the current Solr/Elasticsearch implementations do.
We’ll still use the command-line tool here, because it makes it easier to explore OpenNLP’s functionality. You can build models with bin/opennlp and use them with the Java API as well.
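For reference, using a model from Java looks roughly like the sketch below. This is a minimal, illustrative example and assumes the opennlp-tools library is on the classpath and that the pre-built en-ner-date.bin model sits in the working directory; the class name NerDemo and the helper spanText are ours, not part of OpenNLP:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NerDemo {

    // Joins the tokens covered by a span back into a string
    static String spanText(String[] tokens, Span span) {
        return String.join(" ", Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
    }

    public static void main(String[] args) throws Exception {
        // Load the model once, at startup -- this is the expensive part
        // that bin/opennlp repeats on every invocation
        try (InputStream in = new FileInputStream("en-ner-date.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(in);
            NameFinderME finder = new NameFinderME(model);

            // The Java API expects pre-tokenized sentences
            String[] tokens = {"introduction", "to", "solr", "2018"};
            for (Span span : finder.find(tokens)) {
                // span.getType() is the entity label, e.g. "date"
                System.out.println(span.getType() + ": " + spanText(tokens, span));
            }
            // Reset adaptive data between independent documents/queries
            finder.clearAdaptiveData();
        }
    }
}
```

Loading the model once and reusing the NameFinderME instance is exactly what the Solr/Elasticsearch integrations do to avoid per-request latency.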
To get started, we’ll pass a string to bin/opennlp’s standard input. We’ll then provide the class name (TokenNameFinder for NER) and the model file as parameters:
echo "introduction to solr 2018" | bin/opennlp TokenNameFinder en-ner-date.bin
The pre-built date detection model picks up 2018 as a date:
Loading Token Name Finder model ... done (1.416s)
introduction to solr <START:date> 2018 <END>

Average: 27.8 sent/s
Total: 1 sent
Runtime: 0.036s
Execution time: 2.209 seconds
For anything more sophisticated, you’ll likely need your own model. Say we want “youtube” back as a URL part: we can try the pre-built Organization model, but it won’t find anything:
$ echo "solr elasticsearch youtube" | bin/opennlp TokenNameFinder en-ner-organization.bin
Loading Token Name Finder model ... done (1.411s)
solr elasticsearch youtube

Average: 24.4 sent/s
Total: 1 sent
Runtime: 0.041s
Execution time: 2.159 seconds
To use OpenNLP for detecting URL chunks, we need to provide a custom model.
Building a new model
We’ll need a few ingredients for our model:
- some data that’s already labeled with the entities we want to extract (URL parts in this case)
- optionally, change how OpenNLP extracts features from the training data
- optionally, change the algorithm used for building the model
The training data set can look like our GitHub sample. Here’s a snippet:
elasticsearch introduction demo on <START:url> youtube <END>
solr docker tutorial on <START:url> youtube <END>
The most important characteristics are:
- entities need to be surrounded by tags. Here, we want to identify youtube as a url
- add spaces between tags (START/END) and labeled data
- if possible, use one label per model (here, url). Multiple labels are possible, but not recommended
- have lots of data. The OpenNLP documentation recommends a minimum of 15,000 sentences
- each line is a “sentence”. Some features (we’ll touch on them below) look at the position of the entity in the sentence. Does it tend to be at the beginning or the end? If you do entity extraction on queries (like we do here), the query is usually one sentence. For index-time entity extraction, you could have multiple sentences in a document
- empty lines delimit documents. This is more relevant for index-time entity extraction, where there’s a difference between documents and sentences. Document boundaries are relevant for document-level feature generators (like DocumentBegin) and those influenced by previous outcomes within the document (usually, feature generators extending AdaptiveFeatureGenerator)
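To sanity-check training data in this format before feeding it to the trainer, a short, dependency-free sketch like the one below can pull out the labeled spans. The class and method names here are ours for illustration, not part of OpenNLP:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotationCheck {
    // Matches "<START:label> entity <END>" with the required surrounding spaces
    private static final Pattern SPAN = Pattern.compile("<START:(\\w+)> (.+?) <END>");

    // Returns "label=entity" pairs found in one training sentence
    static List<String> labeledSpans(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher m = SPAN.matcher(sentence);
        while (m.find()) {
            result.add(m.group(1) + "=" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(labeledSpans("solr docker tutorial on <START:url> youtube <END>"));
        // prints [url=youtube]
    }
}
```

A sentence for which this returns an empty list either has no entities or, more worryingly, has malformed tags (e.g. missing spaces) that the trainer would silently treat as plain tokens.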
The training tool runs through the data set, extracts some features and feeds them to the machine learning algorithm. A feature could be whether a token is a number or a string. Or whether the previous tokens were numbers or strings. In OpenNLP, such features are generated by feature generators. You can find all options here. That said, you can always implement your own feature generators.
Once you’ve identified the feature generators to use and their parameters, put them in an XML file. Check out our GitHub account for a feature generation example close to the default one.
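To make the idea of a feature concrete, here is a toy, stdlib-only sketch of the kind of signals feature generators emit per token: the token itself, a crude token class (number vs. lowercase word), and the class of the previous token. This only mirrors the concept; real generators are configured through the XML file and the names below are ours:

```java
import java.util.ArrayList;
import java.util.List;

public class ToyFeatures {
    // Crude token class: number, lowercase word, or something else
    static String tokenClass(String token) {
        if (token.matches("\\d+")) return "num";
        if (token.matches("[a-z]+")) return "lc";
        return "other";
    }

    // Features for the token at position i, including left context
    static List<String> features(String[] tokens, int i) {
        List<String> f = new ArrayList<>();
        f.add("w=" + tokens[i]);                                        // the token itself
        f.add("wc=" + tokenClass(tokens[i]));                           // its class
        f.add(i == 0 ? "bos" : "prevwc=" + tokenClass(tokens[i - 1]));  // previous token's class
        return f;
    }

    public static void main(String[] args) {
        String[] tokens = {"introduction", "to", "solr", "2018"};
        System.out.println(features(tokens, 3));
        // prints [w=2018, wc=num, prevwc=lc]
    }
}
```

A feature like prevwc=lc combined with wc=num is exactly the sort of contextual signal that lets a model learn that a number after a word is likely a date in queries like the ones above.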
Algorithm selection and tuning
OpenNLP comes out of the box with classifiers based on maximum entropy (the default), perceptrons, and naive Bayes. To choose the classifier, you’d provide a parameters file. There are examples for all supported algorithms here.
In the parameters file, there are at least three important aspects to look at:
- algorithm choice. Naive Bayes will train the fastest, but will work as if the provided features are independent. This might or might not be the case. The maximum entropy and perceptron-based classifiers are more expensive to run, but tend to give better results, especially when features depend on each other
- number of iterations. The more times you go through the training data, the more influence provided features will have on the output. This is a trade-off between how much will be learned on one hand, and overfitting on the other hand. And of course training will take longer with more iterations.
- cutoff. Features that are encountered less than N times are ignored, to reduce noise.
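The cutoff’s effect can be sketched in a few lines of stdlib Java: count how often each feature occurs across the training data, then keep only those seen at least N times. This mirrors the idea, not OpenNLP’s internals, and the names are ours:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CutoffDemo {
    // Keep only features seen at least `cutoff` times, with their counts
    static Map<String, Integer> applyCutoff(String[] observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String f : observedFeatures) {
            counts.merge(f, 1, Integer::sum);
        }
        Map<String, Integer> kept = new TreeMap<>();
        counts.forEach((f, c) -> {
            if (c >= cutoff) kept.put(f, c);
        });
        return kept;
    }

    public static void main(String[] args) {
        String[] observed = {"wc=num", "wc=num", "wc=num", "prevwc=lc", "prevwc=lc", "w=typo"};
        // With a cutoff of 2, the one-off "w=typo" feature is dropped as noise
        System.out.println(applyCutoff(observed, 2));
        // prints {prevwc=lc=2, wc=num=3}
    }
}
```

Raising the cutoff trims rare (often noisy or misspelled) features at the risk of also discarding genuinely rare but informative ones.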
Training and testing the model
Now we can put everything together and build our model. We’ll use the TokenNameFinderTrainer class this time:
bin/opennlp TokenNameFinderTrainer -model urls.bin -lang ml -params params.txt -featuregen features.xml -data queries -encoding UTF8
Where the parameters are:
- -model filename. The output file name for our model
- -lang language. Only relevant if you want to use different models for different languages
- -params params.txt. Parameters file for the algorithm options
- -featuregen features.xml. Feature generation XML file
- -data queries. File with labeled training data
- -encoding UTF8. Encoding for the training data file
Finally, we can use the new model to make sure “youtube” is detected as a URL component:
$ echo "solr elasticsearch youtube" | bin/opennlp TokenNameFinder urls.bin
Loading Token Name Finder model ... done (0.135s)
solr elasticsearch <START:url> youtube <END>

Average: 28.6 sent/s
Total: 1 sent
Runtime: 0.035s
Execution time: 0.955 seconds
To properly test the model, we can use the Evaluation Tool on another labeled dataset (written in the same format as the training dataset). We’ll use the TokenNameFinderEvaluator class, with parameters similar to the TokenNameFinderTrainer command (provide the model, dataset and encoding):
$ bin/opennlp TokenNameFinderEvaluator -model urls.bin -data test_queries -encoding UTF-8
Loading Token Name Finder model ... done (0.025s)

Average: 635.6 sent/s
Total: 75 sent
Runtime: 0.118s

Evaluated 74 samples with 74 entities; found: 74 entities; correct: 74.

TOTAL: precision: 100.00%; recall: 100.00%; F1: 100.00%.
url: precision: 100.00%; recall: 100.00%; F1: 100.00%. [target: 74; tp: 74; fp: 0]

Execution time: 0.605 seconds
OpenNLP is a versatile tool for entity extraction. Default options and built-in feature generators work well for natural language, like picking up entities from books or articles at index time. That’s why current OpenNLP integrations for Solr and Elasticsearch are on the indexing side, rather than the query side. For query understanding, it’s usually more work to build a model that can accurately extract entities from a small context. But it can definitely be done, with enough data and the right features and algorithm for the use case.
If you find this stuff exciting, please join us: we’re hiring worldwide. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please reach out, because we provide:
- Solr, Elasticsearch and Elastic Stack consulting
- Solr, Elasticsearch and Elastic Stack production support
- Solr, Elasticsearch and Elastic Stack training classes (on site and remote, public and private)
- Monitoring, log centralization and tracing for not only Solr and Elasticsearch, but for other applications (e.g. Kafka, Zookeeper), infrastructure and containers