Key Phrase Extractor

Key Phrase Extractor is a toolkit for extracting key terms (keywords) and phrases from text. It is also known as Keyword Extractor, Key Word Extractor, Concept Extractor, Collocation Extractor, or SIP Extractor.

It is designed to be used in two main modes:

Mode 1: Extractor of common (frequently occurring) phrases. These phrases are known as Collocations.

In this mode the Key Phrase Extractor identifies key phrases in the input text. For example, if Key Phrase Extractor were to analyze the content of Lucene in Action, it would find terms like "Lucene" and "search", as well as phrases such as "inverted index", "information retrieval", "query parser", and so on.
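As a rough illustration of what finding frequently occurring phrases can look like, the sketch below counts one- and two-word sequences and keeps those that occur often enough. It is not the Key Phrase Extractor implementation: the class name CollocationSketch, its methods, and the tiny stop word list are assumptions made up for this example, and a real extractor would use proper tokenization and scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of the Key Phrase Extractor API.
final class CollocationSketch {
    // Tiny illustrative stop word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "in", "is", "to", "for", "with");

    /** Counts every 1..maxLen word sequence that neither starts nor ends with a stop word. */
    static Map<String, Integer> countPhrases(String text, int maxLen) {
        // Naive tokenization: lowercase, split on anything that is not a letter or digit.
        // Sentence boundaries are ignored, so phrases may span two sentences.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.length; start++) {
            if (tokens[start].isEmpty() || STOP_WORDS.contains(tokens[start])) {
                continue;
            }
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxLen && start + len <= tokens.length; len++) {
                String last = tokens[start + len - 1];
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(last);
                if (!STOP_WORDS.contains(last)) {
                    counts.merge(phrase.toString(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns phrases seen at least minCount times, most frequent first, ties in alphabetical order. */
    static List<String> frequentPhrases(String text, int maxLen, int minCount) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countPhrases(text, maxLen).entrySet());
        entries.removeIf(e -> e.getValue() < minCount);
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Lucene builds an inverted index. The inverted index maps terms "
                + "to documents. Lucene search uses the inverted index.";
        System.out.println(frequentPhrases(text, 2, 2));
    }
}
```

Raw frequency alone favors single common words, which is one reason a real extractor would score phrases rather than only count them.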

Mode 2: Extractor of phrases based on the comparison of two sets of documents (also known as background and foreground corpora). These phrases are known as Statistically Improbable Phrases or SIPs.

In this mode the Key Phrase Extractor finds the phrases that most differentiate one document set from another. For example, given news articles from the last 7 days (the background) and articles from the last 24 hours (the foreground), the Key Phrase Extractor identifies the key terms and phrases in the news from the last 24 hours. These may be names of people, such as "Steve Jobs" or "Warren Buffett", or phrases such as "Swine Flu" or "Somali Pirates": people and concepts that are mentioned more in the last 24 hours than in the days before.

Used in this mode, the Key Phrase Extractor is an excellent tool for extracting popular terms and phrases from a stream of text data, such as news and social media (e.g. blogs, tweets, feeds).
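The foreground/background comparison in Mode 2 can be sketched as a ratio of relative frequencies. The example below is only an assumption about one simple way to do this, not the scoring Key Phrase Extractor actually uses: the class SipSketch, its methods, and the add-one smoothing are all made up for illustration. A phrase scores above zero when it is relatively more frequent in the foreground than in the background.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the Key Phrase Extractor API.
final class SipSketch {
    /**
     * Log ratio of smoothed relative frequencies. The "+ 1.0" terms are an
     * arbitrary smoothing choice that avoids division by zero for phrases
     * absent from the background.
     */
    static double score(long fgCount, long fgTotal, long bgCount, long bgTotal) {
        double fgRate = (fgCount + 1.0) / (fgTotal + 1.0);
        double bgRate = (bgCount + 1.0) / (bgTotal + 1.0);
        return Math.log(fgRate / bgRate);
    }

    private static long total(Map<String, Integer> counts) {
        long sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /** Returns up to limit foreground phrases, most over-represented first. */
    static List<String> rank(Map<String, Integer> foreground,
                             Map<String, Integer> background, int limit) {
        long fgTotal = total(foreground);
        long bgTotal = total(background);
        List<String> phrases = new ArrayList<>(foreground.keySet());
        phrases.sort((a, b) -> {
            double sa = score(foreground.get(a), fgTotal, background.getOrDefault(a, 0), bgTotal);
            double sb = score(foreground.get(b), fgTotal, background.getOrDefault(b, 0), bgTotal);
            int byScore = Double.compare(sb, sa);
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(phrases.subList(0, Math.min(limit, phrases.size())));
    }

    public static void main(String[] args) {
        // Made-up phrase counts: last 24 hours (foreground) vs. last 7 days (background).
        Map<String, Integer> last24Hours = Map.of("swine flu", 8, "stock market", 5, "president", 10);
        Map<String, Integer> last7Days = Map.of("swine flu", 2, "stock market", 40, "president", 70);
        System.out.println(rank(last24Hours, last7Days, 2));
    }
}
```

In this toy data "president" has the highest raw count in the foreground, but it is just as common in the background, so it ranks below "swine flu".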

  • News & Media: Phrase and term extraction from a continuous content stream
  • Content enrichment: Content tagging (auto-tagging)
  • Search Results Relevance: Key Phrases can be indexed in separate fields whose matches are weighted higher than matches in other indexed fields, thus increasing the quality of search results.
  • Search Experience: Key Phrases can be used to power AutoComplete functionality, which helps people search faster and reduces misspellings and typos, thus improving the overall search experience for the end user.
  • Search Experience: Key Phrases can be used to populate fields used for faceted search, thus increasing the findability and browsability of content and improving the overall search experience.
  • Extracts key concepts from content
  • Extracts key concepts from multiple pieces of content based on content difference
  • Identifies key terms and phrases useful for describing main concepts from a larger piece of text
  • Finds key terms and phrases that enhance search results by providing additional navigational metadata

Key Phrase Extractor (KPE) exposes a simple Java API, as well as an HTTP API. Given a piece of text, it returns a list of phrases ordered by their computed score. The API can filter the returned phrases, and the KPE package includes several useful filters. The filter API is simple and extensible, so you can write and plug in your own filters, too.
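To make the idea of pluggable filters concrete, here is a minimal sketch of what a filter chain over scored phrases could look like. Every name in it (FilterSketch, ScoredPhrase, PhraseFilter, minScore, minWords, apply) is hypothetical and invented for this example; the actual KPE Java API is not shown here and may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these types are the real KPE API.
final class FilterSketch {
    /** Hypothetical value type: a phrase plus its computed score. */
    static final class ScoredPhrase {
        final String text;
        final double score;

        ScoredPhrase(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Hypothetical filter contract: return true to keep the phrase. */
    interface PhraseFilter {
        boolean accept(ScoredPhrase phrase);
    }

    /** Keeps phrases whose score is at least the threshold. */
    static PhraseFilter minScore(double threshold) {
        return p -> p.score >= threshold;
    }

    /** Keeps phrases made of at least n whitespace-separated words. */
    static PhraseFilter minWords(int n) {
        return p -> p.text.trim().split("\\s+").length >= n;
    }

    /** Keeps phrases accepted by every filter and orders them by score, highest first. */
    static List<ScoredPhrase> apply(List<ScoredPhrase> phrases, List<PhraseFilter> filters) {
        List<ScoredPhrase> kept = new ArrayList<>();
        for (ScoredPhrase p : phrases) {
            boolean accepted = true;
            for (PhraseFilter f : filters) {
                if (!f.accept(p)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                kept.add(p);
            }
        }
        kept.sort((a, b) -> Double.compare(b.score, a.score));
        return kept;
    }

    public static void main(String[] args) {
        // Made-up phrases and scores for illustration.
        List<ScoredPhrase> phrases = Arrays.asList(
                new ScoredPhrase("search", 0.9),
                new ScoredPhrase("inverted index", 0.8),
                new ScoredPhrase("query parser", 0.2),
                new ScoredPhrase("information retrieval", 0.85));
        for (ScoredPhrase p : apply(phrases, Arrays.asList(minScore(0.5), minWords(2)))) {
            System.out.println(p.text + " " + p.score);
        }
    }
}
```

Because PhraseFilter has a single method, a custom filter in this sketch can be written as a lambda, for example `p -> !p.text.contains("http")`.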
