What is Entity Extraction?
In the context of search, entity extraction is the process of figuring out which fields a query should target, as opposed to always hitting all fields. We use it to improve precision. For example, when the user types Apple iPhone, how do we tell that the intent was to run company:Apple AND product:iPhone, and not to bring back phone stickers in the shape of an apple?
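To make the idea concrete, here is a minimal sketch of dictionary-based extraction in plain Python. The dictionary contents, the field names and the fallback field text are assumptions made up for illustration; this is not how Solr's Text Tagger is implemented, only the general shape of the technique.

```python
# Hypothetical dictionary mapping known terms to the field they belong to.
ENTITY_DICTIONARY = {
    "apple": "company",
    "samsung": "company",
    "iphone": "product",
    "galaxy": "product",
}


def build_fielded_query(user_query, default_field="text"):
    """Turn free text into a fielded query by looking up each token."""
    clauses = []
    for token in user_query.split():
        # Unknown tokens fall back to a catch-all field (an assumption).
        field = ENTITY_DICTIONARY.get(token.lower(), default_field)
        clauses.append(f"{field}:{token}")
    return " AND ".join(clauses)


print(build_fielded_query("Apple iPhone"))
# prints: company:Apple AND product:iPhone
```

A real dictionary would also need to handle multi-word entities and ambiguous terms, which this sketch ignores.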
Where do I start?
At Activate 2018, we gave a presentation designed to answer this question. For example, if you’re already using Solr to serve your product searches, you can use Text Tagger to perform dictionary-based entity extraction. We also wrote a Solr Text Tagger tutorial. If you’re using Elasticsearch instead of Solr, you may be interested in our Elasticsearch for Product Searches online training.
Outside Solr/Elasticsearch, we mentioned three technologies that can help:
- OpenNLP – a Java-based natural language processing framework that’s already integrated into Solr for index-time entity extraction and analysis. We also wrote a tutorial for query-time entity extraction with OpenNLP.
- spaCy – a Python-based natural language processing framework. It provides powerful out-of-the-box functionality. For the use-case of product search, you’ll find more details in the spaCy entity extraction howto.
- Scikit-learn – a Python-based machine learning library. While not purpose-built for NLP, it does provide a wide array of classifiers. In our scikit-learn tutorial, we argue that, in the context of product search, you can treat entity extraction as a classification problem.
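The classification view of entity extraction can be sketched as follows: each query token becomes a small set of features, and a classifier predicts which field it belongs to. To keep the example dependency-free, it uses a tiny hand-written Naive Bayes rather than a scikit-learn classifier. The training tokens, labels and features are all invented for illustration, and a real setup would need far more data.

```python
import math
from collections import Counter, defaultdict


def token_features(token):
    """Describe a token with a few simple, hand-picked features."""
    lower = token.lower()
    return [
        f"word={lower}",
        f"suffix={lower[-3:]}",
        f"has_digit={any(ch.isdigit() for ch in token)}",
    ]


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, tokens, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for token, label in zip(tokens, labels):
            self.feature_counts[label].update(token_features(token))
        self.vocabulary = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, token):
        def score(label):
            counts = self.feature_counts[label]
            total = sum(counts.values()) + len(self.vocabulary)
            # Unnormalized log prior; the constant is the same for all labels.
            log_prob = math.log(self.label_counts[label])
            for feature in token_features(token):
                log_prob += math.log((counts[feature] + 1) / total)
            return log_prob

        return max(self.label_counts, key=score)


# Hypothetical training data: tokens from past queries, labeled by field.
TRAIN = [
    ("Apple", "company"), ("Samsung", "company"), ("Sony", "company"),
    ("iPhone", "product"), ("Galaxy", "product"), ("PS5", "product"),
    ("S23", "product"), ("cheap", "other"), ("new", "other"), ("for", "other"),
]
model = TinyNaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])

print([(t, model.predict(t)) for t in "cheap Apple iPhone".split()])
# prints: [('cheap', 'other'), ('Apple', 'company'), ('iPhone', 'product')]
```

Unlike a dictionary lookup, the classifier can generalize to tokens it has not seen: here the unseen token S24 is labeled product because it contains a digit, like PS5 and S23 in the training data.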
Next steps
Entity extraction is not the only thing that you may want to add on top of your keyword search in order to improve relevance. As we showed in our Activate presentation, there are at least two other important pieces:
- Query expansion. In other words, how do I get my search for dog food to also match puppy food? Static synonyms are one way, but we also explored using word2vec to dynamically generate synonyms.
- Result re-ranking. Different users – or groups of users – have different preferences, and may expect different results. We showed how to use Solr’s Learning to Rank to re-score the top N documents from a regular query, based on a model that fits your use-case. As an example, we’ve built a model on top of data fed by the Significant Terms Streaming Expression.
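Both pieces can be sketched in a few lines of Python. The synonym table, the document fields and the scoring function below are hypothetical stand-ins: in practice the synonyms might come from a word2vec model and the scores from a Learning to Rank model, neither of which is shown here.

```python
# Hypothetical static synonyms; a word2vec model could generate these instead.
SYNONYMS = {"dog": ["puppy"]}


def expand_query(user_query):
    """OR each term with its synonyms, then AND the groups together."""
    groups = []
    for term in user_query.lower().split():
        variants = [term] + SYNONYMS.get(term, [])
        if len(variants) > 1:
            groups.append("(" + " OR ".join(variants) + ")")
        else:
            groups.append(term)
    return " AND ".join(groups)


def rerank_top_n(results, model_score, n):
    """Re-score only the first n results; the tail keeps its original order."""
    head = sorted(results[:n], key=model_score, reverse=True)
    return head + results[n:]


print(expand_query("dog food"))
# prints: (dog OR puppy) AND food

# Results as returned by the regular query, best keyword score first.
results = [
    {"id": "a", "brand": "X", "score": 3.0},
    {"id": "b", "brand": "Y", "score": 2.5},
    {"id": "c", "brand": "Y", "score": 2.0},
    {"id": "d", "brand": "Y", "score": 1.0},
]


def prefers_brand_y(doc):
    """Stand-in for a learned model: this user group favors brand Y."""
    return doc["score"] + (2.0 if doc["brand"] == "Y" else 0.0)


print([doc["id"] for doc in rerank_top_n(results, prefers_brand_y, n=3)])
# prints: ['b', 'c', 'a', 'd']
```

Re-scoring only the top N keeps the more expensive model off the bulk of the result set, which is why document d stays in last place even though the model would have boosted it.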
Video and slides
The blog posts mentioned above are detailed tutorials on how to deal with entity extraction, query expansion, and re-ranking. But if you’re in a hurry, check out our presentation from Activate 2018. Warning: It’s packed with demos!
You can find the slides here:
And finally, feel free to play with the demo entity extraction code.
Final words
If you want to improve product search on top of Elasticsearch, don’t forget to have a look at our 2-hour use case class, Elasticsearch for Product Searches Online Training; you can join one of our upcoming classes. Or, if you’re into search relevancy, please join us: we’re hiring worldwide. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please reach out, because we provide:
- Search Relevance Consulting (which includes Entity Extraction, of course)
- Solr, Elasticsearch and Elastic Stack consulting
- Solr, Elasticsearch and Elastic Stack production support
- Solr, Elasticsearch and Elastic Stack training classes (on-site and remote, public and private)
- Monitoring, log centralization and tracing, not only for Solr and Elasticsearch but also for other applications (e.g. Kafka, Zookeeper), infrastructure and containers