Entity Extraction with spaCy

April 12, 2019

What is Entity Extraction?

Entity extraction is, in the context of search, the process of figuring out which fields a query should target, as opposed to always hitting all fields. The reason we may want to involve entity extraction in search is to improve precision. For example: how do we tell that, when the user typed in Apple iPhone, the intent was to run company:Apple AND product:iPhone? And not bring back phone stickers in the shape of an apple?
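
To make the goal concrete, here's a minimal sketch of how extracted entities could be turned into a fielded query. extract_entities() is a hypothetical placeholder for whatever NER step you end up using (such as the spaCy models below):

def to_fielded_query(user_query):
    # extract_entities() is hypothetical: for "Apple iPhone" it would
    # return something like [('Apple', 'company'), ('iPhone', 'product')]
    entities = extract_entities(user_query)
    return ' AND '.join('%s:%s' % (label, text) for text, label in entities)

# to_fielded_query(u"Apple iPhone") -> "company:Apple AND product:iPhone"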

What is spaCy?

spaCy is a Python framework that can do many Natural Language Processing (NLP) tasks. Named Entity Recognition (NER) is one of them, along with text classification, part-of-speech tagging, and others.

If this sounds familiar, that may be because we previously wrote about a different Python framework that can help with entity extraction: Scikit-learn. Scikit-learn, though, is more a collection of machine learning tools than an NLP framework; in terms of functionality, spaCy is closer to OpenNLP. We used all three for entity extraction during our Activate 2018 presentation.

Getting spaCy is as easy as:

pip install spacy

In this post, we’ll use a pre-built model to extract entities, then we’ll build our own model.

Using a pre-built model

spaCy comes with pre-built models for lots of languages. For example, to get the English one, you’d do:

python -m spacy download en_core_web_sm

Then, in your Python application, it’s a matter of loading it:

import spacy

nlp = spacy.load('en_core_web_sm')

And then you can use it to extract entities. In our Activate example, we did:

doc = nlp(u"#bbuzz 2016: Rafał Kuć - Running High Performance And Fault Tolerant Elasticsearch")
for entity in doc.ents:
  print(entity.label_, ' | ', entity.text)

Which outputs:

MONEY  |  #bbuzz
DATE  |  2016
PERSON  |  Rafał Kuć - Running High

For this particular example, this result is “approximate” at best. 2016 is indeed a date, but #bbuzz isn’t money. And I doubt that Rafał was Running High while giving that presentation.

For this use-case, we’d need to build our own model.

Training a new model

To train a new model, we first need to create a pipeline that defines how we process data. In this case, we want to extract entities. Then, we’ll train a model by running test data through this pipeline. Once the model is trained, we can use it to extract entities from new data as well.

Let’s zoom into each step.

spaCy pipelines

With spaCy you can do much more than entity extraction. For example, before extracting entities, you may need to pre-process the text, say via stemming. Or you may want to do part-of-speech tagging: is this word a verb or a noun?

For the scope of our tutorial, we’ll create an empty model, give it a name, then add a simple pipeline to it. That simple pipeline will only do named entity recognition (NER):

nlp = spacy.blank('en')  # new, empty model. Let’s say it’s for the English language
nlp.vocab.vectors.name = 'example_model_training'   # give a name to our list of vectors
# add NER pipeline
ner = nlp.create_pipe('ner')  # our pipeline would just do NER
nlp.add_pipe(ner, last=True)  # we add the pipeline to the model
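
As a quick sanity check (assuming spaCy 2.x, which this post uses), the pipeline should now contain just the NER component:

print(nlp.pipe_names)  # should output: ['ner']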

Data and labels

To train the model, we’ll need some training data. In the case of product search, these would be queries, where we pre-label entities. For example:

DATA = [
  (u"Search Analytics: Business Value & BigData NoSQL Backend, Otis Gospodnetic ", {'entities': [ (58,75,'PERSON') ] }),
  (u"Introduction to Elasticsearch by Radu ", {'entities': [ (16,29,'TECH'), (32, 36, 'PERSON') ] }),
  # …
]

Our training data has a few characteristics:

  • The text itself is Unicode
  • The entities array contains a list of tuples; each tuple is one labeled entity from the text
  • Each tuple contains three elements: start offset, end offset, and entity label (see the sanity check below)
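
Since character offsets are easy to get wrong, a quick way to verify them is to slice each text by its offsets and check that you get back exactly the entity you meant to label:

for text, annotations in DATA:
    for start, end, label in annotations['entities']:
        print(label, ' | ', text[start:end])  # should print exactly the entity text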

Training the model

Before training, we need to make our model aware of the possible entities. To do that, we add all the labels we’re aware of:

nlp.entity.add_label('PERSON')
nlp.entity.add_label('TECH')
# ...

Now we can begin training. First, we initialize the model weights and get an optimizer via the model’s begin_training() method:

optimizer = nlp.begin_training()

Then we update the model with our training data: each text, with its annotations (those labeled entities), is passed to the model’s update() method, along with the newly created optimizer:

nlp.update([text], [annotations], sgd=optimizer)

In our Activate example, because we have little training data, we just loop through it a few times, in random order:

for i in range(20):
    random.shuffle(DATA)
    for text, annotations in DATA:
        nlp.update([text], [annotations], sgd=optimizer)
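
Though not required, update() also accepts a losses dictionary that it fills in as it trains (this matches the spaCy 2.x API used throughout this post); printing it per pass is a cheap way to check that the model is actually learning:

for i in range(20):
    random.shuffle(DATA)
    losses = {}  # reset each pass, so we see the loss of this pass only
    for text, annotations in DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print('pass', i, 'NER loss:', losses.get('ner'))  # should trend downward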

And that’s it! Now we have a model built for our own use-case.

Predicting entities

The model we just built is already loaded in memory. If you don’t want to train it every time, you can save it to disk and load it back when needed. For example (a minimal sketch; 'our_model' is just a directory name we picked):
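
nlp.to_disk('our_model')       # serialize the trained model to a directory
nlp = spacy.load('our_model')  # later, load it back from that directory

With the model loaded, you’ll use it to predict entities just as you would with a pre-built model: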

doc = nlp(u"#bbuzz 2016: Rafał Kuć - Running High Performance And Fault Tolerant Elasticsearch")
for entity in doc.ents:
  print(entity.label_, ' | ', entity.text)

Even with this small dataset, results typically look better than with the default model:

PERSON  |  Rafał Kuć
TECH  |  Elasticsearch

I said typically because, due to the randomization, the model comes out slightly different on each run. Ultimately, if you want accurate results, there’s no substitute for training set size. Unless something was indeed fishy with Rafał in 2016, because at times I get:

PERSON  |  Rafał Kuć
TECH  |  High

Conclusions and next steps

Like in the OpenNLP example we showed before, spaCy comes with pre-built models and makes it easy to build your own. It also comes with a command-line training tool. That said, it’s less configurable: you don’t have all the options as accessible as in a purpose-built tool like Scikit-learn. For entity extraction, spaCy uses a convolutional neural network by default, but you can plug in your own model if you need to.

If you find this stuff exciting, please join us: we’re hiring worldwide. And if you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please reach out.
