Some of our products are available for evaluation. Please contact us to inquire.

Contact Sales:
+1 347-480-1610
info@sematext.com

Technology

We are heavily involved in, and in several cases are active developers of, a number of excellent open-source search and big data products: Lucene, Solr, and Nutch (LSN), ElasticSearch, Hadoop, HBase, Flume, and Mahout. We know most of them inside out.

The LSN trio is a mature set of products developed over the years under the Apache Software Foundation umbrella, and ElasticSearch is rapidly gaining in popularity. Lucene, Solr, Nutch, and ElasticSearch are used by giants such as AOL, Apple, Comcast, SalesForce, and a number of other companies, some of which you can see on the Sematext client list. Altogether, the Lucene family of products sees over 5,000 downloads every day. That is more than 1.8 million downloads a year!

A number of our own search-related products integrate seamlessly with Lucene, Solr, ElasticSearch, and Nutch, but are also designed to be search-provider agnostic whenever possible. That makes them usable with Endeca, FAST, Google Search Appliance, Autonomy, Attivio, Vivisimo, or any other commercial/enterprise search solution. Our products are built on top of core search and are designed to enhance the overall search experience through features such as query spellchecking, related searches, and search auto-completion.

Apache Lucene

Apache Lucene is a high-performance, scalable Information Retrieval (IR) library. Information retrieval refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your applications. It is a mature, free, open-source Apache Software Foundation project implemented in Java and licensed under the liberal Apache License. Lucene is, and has been for quite a few years, the most popular free IR library.

Sematext's founder has been a Lucene developer for 10+ years and is the co-author of Lucene in Action (1st and 2nd ed.), the best-selling Lucene book.

Apache Solr

Apache Solr is the popular, blazing-fast open-source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g. Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
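As a rough illustration of how simple the HTTP API is to call, the following Python sketch builds a Solr query URL without sending it. The host, port, and the `title:lucene` query are assumptions made up for this example. `select`, `q`, `rows`, and `wt` are standard Solr request handler and parameter names, but the exact path depends on how a given Solr instance and its cores are configured.

```python
from urllib.parse import urlencode

def build_solr_select_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response.

    base_url is an assumed location of a Solr instance, e.g.
    "http://localhost:8983/solr"; adjust it to your own deployment.
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    # urlencode percent-escapes characters such as ":" in the query string.
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical host and query, for illustration only.
url = build_solr_select_url("http://localhost:8983/solr", "title:lucene", rows=5)
print(url)
```

The resulting URL can be fetched with any HTTP client, which is what makes Solr usable from virtually any programming language.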

Sematext provides Apache Solr consulting and tech support, and runs the popular SPM for Solr service, which is used for monitoring the performance of Solr clusters.

Sematext's founder has been a Solr developer since 2006. One of Sematext's engineers is the author of the Solr Cookbook.

ElasticSearch

ElasticSearch is a cluster- and cloud-aware, high-performance, open-source search server. It features cluster node auto-discovery, index sharding and replication, distributed search, cluster monitoring, faceting, filtering, highlighting, etc.

It is written in Java and runs as a standalone full-text search server. ElasticSearch uses the Lucene Java search library at its core for full-text indexing and search, and has a REST-like HTTP and JSON API that makes it easy to use from virtually any programming language. It has an extensible architecture and a growing list of pluggable modules.
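To give a feel for the JSON API, here is a minimal Python sketch that prepares a search request without sending it. The host, port, index name `articles`, and field name `title` are hypothetical. The `_search` endpoint and the `match` query are part of the ElasticSearch query DSL, though details can differ between versions.

```python
import json

def build_search_request(base_url, index, field, text, size=10):
    """Return the (url, json_body) pair for a simple match query.

    Nothing is sent over the network; the caller would POST json_body
    to url with an HTTP client of their choice.
    """
    url = f"{base_url.rstrip('/')}/{index}/_search"
    body = {"query": {"match": {field: text}}, "size": size}
    return url, json.dumps(body)

# Hypothetical host, index, and field, for illustration only.
url, body = build_search_request(
    "http://localhost:9200", "articles", "title", "lucene", size=5
)
print(url)
print(body)
```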

Sematext provides ElasticSearch consulting and tech support, and runs the popular SPM for ElasticSearch service, which is used for monitoring the performance of ElasticSearch clusters.

Mahout

Mahout is a scalable Machine Learning Java library. It contains parallelizable implementations of a number of Machine Learning algorithms for Classification / Categorization, Clustering, Recommendations, Pattern Mining, Regression, Dimension Reduction, Evolutionary Algorithms, etc. Because it makes use of Hadoop and can run on many machines in parallel, it is suitable for very large data sets.

Hadoop / HDFS / MapReduce

Hadoop is an open-source solution for reliable, scalable, distributed computing. HDFS (Hadoop Distributed File System) is a core Hadoop sub-system that provides high-throughput access to application data. Hadoop's MapReduce implementation is a framework for distributed processing of large data sets on compute clusters, with the data stored in HDFS.
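For readers new to the MapReduce model, the classic word-count example can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps only. It does not use Hadoop or its API, and all function names are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["Lucene and Solr", "Solr and Hadoop"])
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines, and the framework handles the shuffle, scheduling, and failure recovery.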

Sematext employs 2 Certified Hadoop Developers.

HBase

HBase is a scalable, distributed column-oriented database that supports structured data storage for large data sets. It works well with MapReduce, allowing developers to process data stored in HBase with MapReduce-based jobs. HBase is modeled after Google BigTable.

Sematext provides HBase consulting and runs the popular SPM for HBase service, which is used for monitoring the performance of HBase clusters.

We have gone through Cloudera's HBase training courses, have contributed patches to HBase, have open-sourced HBase-based projects, and have non-trivial products built on top of HBase.

Voldemort

Voldemort is a high-performance distributed Key-Value Store that includes data partitioning, replication, rebalancing, graceful failure handling, pluggable storage engines, etc. Voldemort was developed at LinkedIn, where it is used in environments with high data and request volumes.

We have a product built on top of Voldemort.

Cassandra

Cassandra is a highly scalable column-oriented distributed database originally developed by Facebook and later donated to the Apache Software Foundation. Cassandra is a Google BigTable and Amazon Dynamo hybrid featuring data partitioning, replication, rebalancing, tunable eventual consistency settings, elastic run-time cluster expansion, data durability, fault tolerance, etc.

We have gone through the Cassandra training course given by one of the Cassandra developers and founders of DataStax.

Nutch

Nutch is open-source web-search software. It builds on Lucene Java and Solr, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc. It is scalable and runs on top of Hadoop (i.e. it uses Hadoop's MapReduce and HDFS).

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

We have implemented Flume's HBase sink and have contributed it to Flume.

Other

Our experience and expertise don't end with search technologies. The following are some of the other technologies we use regularly:

Lucene Project Downloads

Hadoop Project Downloads