Products :: Morphological Analyzer
aka. Morpho Analyzer
Morphological Analyzer is a software component that detects
morphemes in a piece of text. English text is commonly
pre-processed before it is indexed, and one of the common
pre-processing steps is stemming. Stemming is the process of
removing suffixes so that words are reduced to their stems. For
example, a simple suffix stripper might stem the word "caring"
to "car". Stemming rules for the English language are
relatively simple. Several
known algorithms have been published and their implementations
are freely available. Many other languages have more complex
morphosyntactic characteristics (e.g. different suffixes or
prefixes can be used with a single word depending on the tense,
gender, number, case, etc.) and thus need more complex stemming
rules. In most cases there are no publicly known algorithms
and/or no available stemming products. Our
Morphological Analyzer uses Statistical Natural Language
Processing (NLP) to learn a language's morphosyntactic
structure and uses that knowledge to detect morphemes. It works
exceptionally well for highly inflected languages, that is,
languages whose words tend to have lots of affixes, such as
Polish, Czech, Slovak, Croatian, Serbian, etc.
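
To make the "caring" to "car" example concrete, here is a minimal
sketch of rule-based suffix stripping in Python. The suffix list and
the naive_stem function are hypothetical and written only for
illustration; they are not how Morphological Analyzer works
internally, and published English stemmers use many more rules.

```python
# Hypothetical, deliberately tiny suffix list (an assumption for this sketch).
# Real English stemmers apply many more rules and conditions.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        stem_len = len(lowered) - len(suffix)
        if lowered.endswith(suffix) and stem_len >= min_stem_len:
            return lowered[:stem_len]
    # No suffix matched (or the stem would be too short): leave the word as-is.
    return lowered


if __name__ == "__main__":
    for word in ("caring", "cars", "indexed"):
        print(word, "->", naive_stem(word))
```

A fixed list like this is workable for English but has to be written
by hand for every language, and it grows quickly once suffixes and
prefixes vary with tense, gender, number and case.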
Business Value / Benefits
- Makes it possible to index content in various languages and make it searchable
- Provides a single component that can be used for all languages you need to handle
- Frees your developers from spending weeks or months
figuring out rules for each language you need to handle
and then writing software for it from scratch
Do You Need It?
How do you determine if Morphological Analyzer is for you?
- You need to handle content in multiple languages
- You currently treat content in all languages in the same fashion
(but may know this is sub-optimal)
- Your search results don't seem to retrieve all content you think
they should
- You do not want your developers to spend weeks or months
learning the rules of each language you need to support and
writing custom code for each of them
Integration
Morphological Analyzer integrates tightly with Lucene and Solr.
It exposes the typical Analyzer and Filter APIs for Lucene and
an additional FilterFactory for Solr. To detect morphemes in a
given language, Morphological Analyzer must first be trained on
content in that language. We have already done this for all
supported languages.
FAQ
Q: Which languages can Morphological Analyzer handle?
A: It is most suitable for highly inflected
languages, such as the Slavic family of languages.
Q: How accurate is the Morphological Analyzer?
A: Accuracy depends on the quality and size
of the training set. In our experiments, we have achieved
precision and recall that matched the state of the art.
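
As a rough illustration of what precision and recall can mean for
morpheme detection, the sketch below scores predicted morpheme
boundaries against a hand-segmented reference. The "+" separator, the
function names and the sample words are assumptions made for this
example; they do not describe the evaluation used in our experiments.

```python
def boundaries(segmentation, sep="+"):
    """Return the character offsets at which a segmented word is split."""
    positions = set()
    offset = 0
    # Every morpheme except the last one ends at a boundary.
    for morpheme in segmentation.split(sep)[:-1]:
        offset += len(morpheme)
        positions.add(offset)
    return positions


def precision_recall(predicted, gold):
    """Compute boundary precision and recall over parallel lists of segmentations."""
    true_pos = false_pos = false_neg = 0
    for pred_seg, gold_seg in zip(predicted, gold):
        pred_b, gold_b = boundaries(pred_seg), boundaries(gold_seg)
        true_pos += len(pred_b & gold_b)
        false_pos += len(pred_b - gold_b)
        false_neg += len(gold_b - pred_b)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = ["car+ing", "in+dex+ed", "searches"]
    gold = ["car+ing", "index+ed", "search+es"]
    print(precision_recall(predicted, gold))
```

Precision drops when the analyzer proposes boundaries that are not in
the reference, and recall drops when it misses boundaries that are.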