
Generating Word Embeddings with Gensim’s word2vec

December 17, 2018


During our Activate presentation, we talked about how to do query expansion by dynamically generating synonyms. Instead of statically defining synonym lists, we showed a demo of how you could use word2vec to derive synonyms from a dataset.


In this post, we’ll expand on that demo to explain what word2vec is and how it works, where you can use it in your search infrastructure and how. Let’s dive in!

What is word2vec

Word2vec is a tool that creates word embeddings: given an input text, it will create a vector representation of each word. Word2vec was originally implemented at Google by Tomáš Mikolov et al., but nowadays you can find lots of other implementations.

To create word embeddings, word2vec uses a neural network with a single hidden layer. The input is each word, along with a configurable context (typically 5 to 10 words). You’d train this neural network to either predict the word from its context or the other way around (predict the context from the word). This depends on the training mode, but let’s illustrate the first approach: say you have the word solr out of introduction to solr, and we’re trying to predict it with three neurons:

Figure: Using context to predict the word with continuous bag-of-words (CBOW)

Most implementations allow you to choose between two training modes: continuous bag of words (CBOW) and continuous skip-gram. With CBOW you predict the word from the context, like in the figure above. With skip-gram you predict the context from the word. We don’t care about the predictions here, because we don’t use word2vec to predict either. Instead, when training is done, word2vec takes the weights of the hidden layer neurons for each word. In the figure above, we have [0.2, 0.3, 0.1] for the word solr. These weights are the vectors we’re looking for: the generated word embeddings.

Still, you might ask, how do you choose between CBOW and skip-gram? The key difference is that CBOW doesn’t care about distance between words, while skip-gram does. That’s where the bag of words and gram naming comes from. In practice, this translates to faster training for CBOW and, if the training set is large enough, better accuracy for skip-gram, especially on low-frequency words.
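To make the two training modes more concrete, here is a minimal sketch of the training examples each mode would build from a single sentence. The helper names (context_of, cbow_examples, skipgram_examples) are hypothetical and not part of any library, and real implementations add further steps that this sketch leaves out.

```python
def context_of(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per word, mapping the whole context to the target word."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (word, context word) pair."""
    return [(tokens[i], context_word)
            for i in range(len(tokens))
            for context_word in context_of(tokens, i, window)]


sentence = ["introduction", "to", "solr"]
print(cbow_examples(sentence, window=2))
print(skipgram_examples(sentence, window=2))
```

For the word solr, CBOW gets a single example with the context (introduction, to) as input, while skip-gram gets two separate examples, one per context word. This is one way to see why skip-gram typically takes longer to train on the same data.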

Where to use word2vec

If all goes well, word embeddings capture the semantics of each word pretty well. That’s because we assume that the context typically defines the word, which is true for most use-cases. For example, “Intel CPU, 16GB of RAM” and “Intel processor, 16GB of RAM” will create similar embeddings for CPU and processor. That said, sometimes the surrounding words don’t provide enough information, or that information could be misleading. For example, “this tool rocks” and “this tool stinks” will make rocks similar to stinks.

Still, if embeddings do capture the semantics to a reasonable level, you could use them for:

  • dynamic synonym expansion. If someone is searching for CPU, you may want to include processor in the query. This is what we aimed to do in our demo at Activate.
  • use distance between vectors to measure similarity. If we come up with a good way to squash multiple word vectors into one (averaging is the simplest option), then we can compare squashed vectors from the query string with squashed vectors from the document string. The cosine similarity of the two vectors could be a relevancy score. Or at least a component of it. At Activate, there were a few talks exploring this idea.
  • classification or clustering. If vectors of two words are closer (by cosine similarity), they are more likely to belong to the same group. The original implementation of word2vec also allows you to cluster words using K-means.
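The averaging and cosine similarity idea from the list above can be sketched in a few lines of plain Python. The three-dimensional vectors in toy_embeddings are made up for illustration (real embeddings would come from a trained model), and the function names are assumptions of this sketch.

```python
import math


def average_vector(vectors):
    """Squash several word vectors into one by averaging each dimension."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def text_vector(text, embeddings):
    """Average the vectors of the known words in a text."""
    known = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not known:
        raise ValueError("no known words in text")
    return average_vector(known)


# Made-up embeddings, for illustration only.
toy_embeddings = {
    "intel": [0.9, 0.1, 0.0],
    "cpu": [0.8, 0.3, 0.1],
    "processor": [0.7, 0.4, 0.1],
    "youtube": [0.0, 0.2, 0.9],
}

query = text_vector("Intel CPU", toy_embeddings)
document = text_vector("Intel processor", toy_embeddings)
print(cosine_similarity(query, document))
```

The resulting number could serve as a relevancy score, or as one component of it. Averaging ignores word order and weights every word equally, so treat it as a baseline rather than a recommendation.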

How to use word2vec

Now that we’ve dealt with the background, let’s look at each step of our demo from Activate.

Installing Gensim

Out of the existing word2vec implementations, we took Gensim: a Python library that does a lot of NLP tasks, from phrase detection to topic modeling and of course, word2vec.

To install Gensim you’d do:

pip install gensim

It’s a good idea to have Cython installed beforehand: it makes parallel training much faster than with regular Python.

Preparing and pre-processing data

Word2vec processes words. But what exactly is a word? Should we care about casing? Probably not. What about compound words? As with search, you’ll want to pre-process your data before running it through word2vec.

In our demo, we had some sample queries referring to various videos we posted:

logging solr tuning performance on youtube
elasticsearch introduction demo on youtube
...

We just split the text of each query by whitespace and lowercased it. That’s a simple definition of a word, but you might want to eliminate punctuation, perform stemming or lemmatization, etc.
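As a rough sketch of that pre-processing, the first function below does the whitespace split and lowercasing described above. The second one is an optional extra step (not part of the demo) that also strips punctuation from the edges of each token. Both function names are hypothetical.

```python
import string


def tokenize(line):
    """Lowercase a query and split it on whitespace."""
    return line.lower().split()


def tokenize_without_punctuation(line):
    """Optional extra step: strip leading and trailing punctuation from each token."""
    stripped = (token.strip(string.punctuation) for token in tokenize(line))
    return [token for token in stripped if token]


print(tokenize("Logging Solr tuning performance on YouTube"))
print(tokenize_without_punctuation("Elasticsearch introduction, demo on YouTube!"))
```

Stemming or lemmatization would slot in as a further step over the returned tokens.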

Training the model

In the demo code, we train the model here:

import gensim

# Parameter names are those of Gensim 3.x; Gensim 4.x renames iter to epochs
model = gensim.models.Word2Vec(sentences=words,
                               min_count=5,
                               window=5,
                               iter=5000)

These are the most important options. First of all, data comes from sentences, an iterator of arrays. Each emitted array is a sentence: this prevents word2vec from crossing sentence boundaries when considering the context of a word.

The sentences iterator could be a simple array of arrays: this would be OK with our small set of 74 queries. However, on large datasets you’d want to stream: for example, we read the file line by line and emit one sentence per line.
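A streaming input along those lines could look like the sketch below. The class name LineSentences is hypothetical, and the tokenization is the same lowercase and whitespace split used earlier. It is written as a class with an __iter__ method rather than a plain generator because training goes over the data more than once, and a generator can only be consumed once.

```python
class LineSentences:
    """Re-iterable stream of sentences: one lowercased, whitespace-split line each."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the data can be iterated repeatedly
        # without ever being held in memory as a whole.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens
```

Assuming the queries live in a file such as queries.txt (a made-up name), you would then pass LineSentences("queries.txt") as the sentences argument.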

Besides the input sentences, we have some training parameters:

  • min_count tells word2vec to ignore rare words: those that occur fewer times than the specified value.
  • window gives the width of the context: how many surrounding words to consider.
  • iter tells word2vec how many times to go over the dataset. Note that the default is 5, so with 5000 we’re effectively overfitting massively. We could make things a little better by tuning other parameters (see below), but ultimately you’d just need a lot of (good) data to generate relevant word embeddings for your use-case.

Using the model

Our Word2Vec object now contains the embeddings of each word. For example:

> print(model.wv["solr"])
[-0.6374694   0.3056958  -1.5692029   0.39552835 -0.06371246 ...

We can also find similar words, calculated by cosine similarity. Use topn to specify how many of the top matches you want to show:

> print(model.wv.most_similar("youtube", topn=1))
[('vimeo', 0.9817387461662292)]
> print(model.wv.most_similar("solr", topn=1))
[('elasticsearch', 0.9409826397895813)]

Those similar words can be used for query or synonym expansion. Did the user look for solr? You may want to suggest some elasticsearch approaches. Or vice-versa.
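Here is a minimal sketch of how such an expansion could work. The expand_query function and its min_similarity threshold are assumptions of this sketch, not part of the demo. To keep the example self-contained, it uses a small stand-in lookup built from the two similarities printed above instead of a trained model.

```python
def expand_query(query, most_similar, topn=1, min_similarity=0.9):
    """Append the closest neighbours of each query word, if they are similar enough."""
    expanded = []
    for word in query.lower().split():
        expanded.append(word)
        for synonym, score in most_similar(word, topn):
            if score >= min_similarity and synonym not in expanded:
                expanded.append(synonym)
    return expanded


# Stand-in for a trained model, holding only the two results shown earlier.
toy_neighbours = {
    "youtube": [("vimeo", 0.9817387461662292)],
    "solr": [("elasticsearch", 0.9409826397895813)],
}


def toy_most_similar(word, topn):
    """Return up to topn (word, similarity) pairs, or nothing for unknown words."""
    return toy_neighbours.get(word, [])[:topn]


print(expand_query("solr youtube", toy_most_similar))
```

With a real model you could pass a small wrapper around model.wv.most_similar instead of toy_most_similar. That call raises a KeyError for words outside the vocabulary, so the wrapper would need to handle that case.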

Next steps

Once you’ve got a proof of concept running, you can experiment with different training parameters, to see which works better for you in terms of performance and accuracy. We’ve already mentioned min_count, iter and window. Here are others you might want to tweak:

  • size. This is the size of the hidden layer which, in turn, determines the size of the word embedding vector. Defaults to 100. You might think that more neurons might capture more nuances, but beyond a certain point (higher with the amount of training data) there might be too much noise coming down from their initial state. And of course training time and resources will grow as size grows.
  • alpha and min_alpha. Learning rate starts at alpha and drops linearly, with each iteration, to min_alpha. Defaults are 0.025 and 0.0001 respectively. The learning rate determines how easily the hidden layer adjusts to new input. Higher values will make training still effective with less data (like in our case), but will likely incorporate more noise (i.e. overfit).
  • sg. By default, CBOW is used, but set sg=1 to use skip-gram.
  • hs. If you set hs=1, word2vec will use hierarchical softmax to train the model. Otherwise it will use negative sampling. The rule of thumb here is that hierarchical softmax works better for infrequent words, as it builds a tree over the whole vocabulary. Negative sampling works better for frequent words, because it tries to differentiate signal from noise. hs=0 is the default, in which case negative sampling will be used.
  • negative and ns_exponent. For negative sampling, the signal vs noise differentiation is done by comparing words from the context with random words from outside the context. A default negative=5 tells word2vec to pick 5 random words outside the context. ns_exponent tells word2vec which words to prefer when sampling: 1.0 means words are sampled exactly in proportion to their frequency, 0.0 means all words are sampled equally (no frequency preference), while negative values will prefer lower frequency words. The default is 0.75, which should work well for natural language: you’d pick “similar” words (by frequency) and try to figure out which ones actually define the word we’re looking at. Negative values of ns_exponent might work better for e-commerce and other short texts. That’s because word2vec would do a better job of delimiting “context words” (e.g. “where”, “is”, “cheap”) from the important, typically rare, words (e.g. a brand name).
  • workers and batch_words. These will influence training performance: how many threads you have and how many words to send at once for each worker thread.
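To get a feel for ns_exponent, the sketch below computes the probability of each word being picked as a negative sample, assuming that probability is proportional to the word count raised to ns_exponent. The word counts are made up (acme stands in for a rare brand name) and the function name is hypothetical.

```python
def negative_sampling_probabilities(word_counts, ns_exponent=0.75):
    """Probability of drawing each word, proportional to count ** ns_exponent."""
    weights = {word: count ** ns_exponent for word, count in word_counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Made-up counts: a very frequent word, a common word and a rare brand name.
counts = {"is": 1000, "cheap": 100, "acme": 10}

for exponent in (1.0, 0.75, 0.0, -0.5):
    probabilities = negative_sampling_probabilities(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probabilities.items()})
```

Running it shows the preference shifting from frequent words (at 1.0), to no preference (at 0.0), to rare words (at negative values).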

You may ask yourself: should I really care about these parameters? Our answer is a definite yes, because some small tweaks might make the difference between relevant and useless vectors. For example, we would get similar results to the demo model, on the same dataset, using these parameters:

model = gensim.models.Word2Vec(sentences=words,
                               min_count=5,
                               window=5,
                               iter=100,
                               alpha=0.25,
                               min_alpha=0.01,
                               size=30,
                               negative=20,
                               ns_exponent=0.9)

Note how we reduced the number of iterations from 5000 to 100. Granted, 100 is still a lot and we have high learning rates, but we’d have to move those neuron weights somehow, even if it implies some degree of overfit. This ultimately only highlights that there’s no replacement for dataset size.
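To illustrate what those learning rates mean over 100 iterations, here is a simplified sketch of a linear decay from alpha to min_alpha, with one value per pass over the data. The function name is hypothetical, and an actual implementation may adjust the rate more often than once per pass.

```python
def learning_rate_schedule(alpha, min_alpha, epochs):
    """Linearly decaying learning rates, one per pass, from alpha down to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * epoch for epoch in range(epochs)]


rates = learning_rate_schedule(alpha=0.25, min_alpha=0.01, epochs=100)
print(rates[0], rates[49], rates[-1])
```

Even the final rate here (about 0.01) is not far below the default starting rate of 0.025, which gives an idea of how aggressively this configuration moves the weights.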

Conclusions

Word embeddings can be a good path to improve relevance: whether it’s synonym/query expansion, classification or ranking. If relevance is critical to your business, you’d likely want to give word2vec a spin, whether it’s provided by Gensim, like we used here, TensorFlow, or another library.

You’d actually need more than one spin to get the right embeddings for your use-case: there are many parameters to tune. You’d typically run many experiments to get to good results.
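One simple way to organize such experiments is to enumerate parameter combinations from a grid, as in the sketch below. The grid values are arbitrary examples, the function name is hypothetical, and the parameter names follow the ones used in this post (Gensim 3.x).

```python
from itertools import product

# Arbitrary example values; pick ranges that make sense for your dataset.
param_grid = {
    "window": [3, 5],
    "size": [30, 100],
    "negative": [5, 20],
}


def parameter_combinations(grid):
    """Yield one dict per combination of the values in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


for params in parameter_combinations(param_grid):
    print(params)
```

Each dict could then be unpacked into the model constructor (for example with **params) and the resulting embeddings compared using whatever relevance measure matters for your use-case.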

Do you find relevancy, or search in general, exciting? You might want to join us: we’re hiring worldwide. Or if you just need relevancy tuning or any other help with your search infrastructure, please reach out.
