Language Identifier

Also known as LangID, Language Guesser, or Language Detector.

Language Identifier detects the language used in a piece of text (for example, a document, email, or web page). It uses statistical natural language processing (NLP): it learns the characteristics of each language from training text and uses them to identify the language of new input.
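To make "statistical" concrete, here is a minimal sketch of one common technique, character-trigram scoring with add-one smoothing. This page does not say which model Language Identifier actually uses, so the class name TrigramLangIdSketch, its methods, and the toy training sentences are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a tiny character-trigram language guesser.
// This is NOT Language Identifier's actual model or API.
class TrigramLangIdSketch {

    // Per-language trigram counts and total trigram counts learned from training text.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Splits lower-cased, space-padded text into overlapping 3-character windows.
    static List<String> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    // Adds one training sample for the given language code.
    void train(String lang, String sample) {
        Map<String, Integer> c = counts.computeIfAbsent(lang, k -> new HashMap<>());
        for (String g : trigrams(sample)) {
            c.merge(g, 1, Integer::sum);
            totals.merge(lang, 1, Integer::sum);
        }
    }

    // Returns languages ordered by descending average log-probability.
    // Add-one smoothing: the vocabulary is approximated by the trigrams seen
    // for that language plus one bucket for unseen trigrams.
    List<Map.Entry<String, Double>> rank(String text) {
        List<String> grams = trigrams(text);
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (String lang : counts.keySet()) {
            Map<String, Integer> c = counts.get(lang);
            double denom = totals.getOrDefault(lang, 0) + c.size() + 1.0;
            double score = 0.0;
            for (String g : grams) {
                score += Math.log((c.getOrDefault(g, 0) + 1.0) / denom);
            }
            ranked.add(Map.entry(lang, grams.isEmpty() ? 0.0 : score / grams.size()));
        }
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        TrigramLangIdSketch id = new TrigramLangIdSketch();
        id.train("en", "the quick brown fox jumps over the lazy dog and the cat is in the house");
        id.train("de", "der schnelle braune fuchs springt hinter den faulen hund und die katze ist im haus");
        // Prints each language with its score, best match first.
        for (Map.Entry<String, Double> e : id.rank("the dog is in the house")) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The scores here are average log-probabilities, so they are only meaningful relative to each other. A real system would train on far larger corpora and would likely normalize the scores into confidences.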

Business Value / Benefits

Do You Need It?

How do you determine if Language Identifier is for you?

Integration

Language Identifier exposes a simple Java API. Given a piece of text, it returns a list of languages ordered by confidence score. It integrates with Lucene and Solr but is not tied to search, so it can also be used in applications that have nothing to do with search. It also runs as a REST web service, which allows integration with any software component that can invoke it over HTTP.
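The sketch below shows what calling code might look like for both integration styles. The product's real class names, method signatures, and REST endpoint are not documented on this page, so LanguageGuesser, LanguageGuess, bestLanguage, buildRestRequest, the "text" query parameter, and the localhost URL are all hypothetical stand-ins. The HTTP request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative only: every type and URL below is a hypothetical stand-in,
// not the actual Language Identifier API.
class LangIdIntegrationSketch {

    // Hypothetical result type: one candidate language and its confidence score.
    static final class LanguageGuess {
        final String language;
        final double confidence;

        LanguageGuess(String language, double confidence) {
            this.language = language;
            this.confidence = confidence;
        }
    }

    // Hypothetical Java API shape: text in, guesses out, ordered by descending confidence.
    interface LanguageGuesser {
        List<LanguageGuess> identify(String text);
    }

    // Returns the top guess, or the fallback when there is no guess
    // or the best confidence is below the caller's threshold.
    static String bestLanguage(LanguageGuesser guesser, String text,
                               double minConfidence, String fallback) {
        List<LanguageGuess> guesses = guesser.identify(text);
        if (guesses.isEmpty() || guesses.get(0).confidence < minConfidence) {
            return fallback;
        }
        return guesses.get(0).language;
    }

    // Builds (but does not send) a GET request to a hypothetical REST endpoint
    // that takes the text as a URL-encoded "text" query parameter.
    static HttpRequest buildRestRequest(String endpoint, String text) {
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(endpoint + "?text=" + encoded)).GET().build();
    }

    public static void main(String[] args) {
        // A stub stands in for the real identifier so the sketch runs on its own.
        LanguageGuesser stub = text -> List.of(
                new LanguageGuess("en", 0.92),
                new LanguageGuess("de", 0.05));
        System.out.println(bestLanguage(stub, "Hello world", 0.5, "unknown"));
        System.out.println(buildRestRequest("http://localhost:8080/langid", "Hello world").uri());
    }
}
```

Applying a confidence threshold in the caller, as bestLanguage does, is one way to handle short or mixed-language text where the top guess may be unreliable.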

FAQ

Q: Which languages can Language Identifier recognize?
A: It can detect any language it has been trained on, regardless of the character set or encoding used. Training can be done with, for example, Wikipedia dumps or any other custom corpus.
Q: How accurate is the Language Identifier?
A: Accuracy depends on the quality and size of the training set.
Q: How does one integrate or use Language Identifier?
A: Through its Java API or its REST API.
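Because accuracy depends on the training set, it helps to measure it on labeled text that was held out from training. The following is a minimal sketch of such a check. It treats the identifier as a plain function from text to language code, and the class name LangIdAccuracySketch, the accuracy helper, and the toy rule-based identifier are assumptions for illustration, not part of the product.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: measures how often an identifier's top guess matches
// the known language of held-out samples.
class LangIdAccuracySketch {

    // labeled maps sample text to its expected language code.
    // Returns the fraction of samples whose predicted language matches.
    static double accuracy(Function<String, String> identifier, Map<String, String> labeled) {
        if (labeled.isEmpty()) {
            throw new IllegalArgumentException("need at least one labeled sample");
        }
        int correct = 0;
        for (Map.Entry<String, String> sample : labeled.entrySet()) {
            if (sample.getValue().equals(identifier.apply(sample.getKey()))) {
                correct++;
            }
        }
        return (double) correct / labeled.size();
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained identifier: guesses German when it sees "der".
        Function<String, String> toy = text -> text.contains("der") ? "de" : "en";

        Map<String, String> heldOut = new LinkedHashMap<>();
        heldOut.put("the cat sleeps", "en");
        heldOut.put("der hund bellt", "de");
        heldOut.put("a dog barks", "en");
        heldOut.put("die katze schlaeft", "de");

        System.out.println("accuracy = " + accuracy(toy, heldOut));
    }
}
```

Running the same check after adding or cleaning training data gives a direct view of how training-set quality and size affect results.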
