Technology
We are heavily involved in, and in many cases active developers of,
several excellent open-source search and data-processing products:
Lucene, Solr, Nutch (LSN), ElasticSearch, Hadoop, HBase, Flume, and
Mahout. We know most of them inside out.
The LSN trio is a mature set of products developed over the years
under the Apache Software Foundation umbrella, and ElasticSearch
is rapidly gaining in popularity. Lucene, Solr, Nutch, and
ElasticSearch are used by giants such as AOL, Apple, Comcast,
SalesForce, and a number of other companies, some of which you can
see on the Sematext client list.
Altogether, the Lucene family of products sees over 5000
downloads every day. That is nearly 2 million downloads a year!
A number of our own search-related products integrate seamlessly
with Lucene, Solr, ElasticSearch, and Nutch, but are also designed
to be search-provider agnostic whenever possible. That makes
them usable with Endeca, FAST, Google Search Appliance, Autonomy,
Attivio, Vivisimo, or any other commercial/enterprise search
solution. Our products are built on top of core search and are
designed to enhance the overall search experience, for example
through query spellchecking, related searches, and search
auto-completion.
Apache Lucene is a high-performance, scalable Information
Retrieval (IR) library. Information retrieval refers to the
process of searching for documents, information within documents,
or metadata about documents. Lucene lets you add search
capabilities to your applications. It is a mature, free,
open-source project implemented in Java, hosted at the Apache
Software Foundation and licensed under the liberal Apache
License. Lucene is currently, and has been for quite a few years,
the most popular free IR library.
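To make the idea of an IR library concrete, here is a minimal sketch of an inverted index, the general data structure that full-text search engines are built around. It is written in Python for brevity and is not Lucene's API or implementation; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Hadoop is for distributed computing",
}
index = build_index(docs)
matches = search(index, "lucene search")
```

A real library adds much more on top of this, such as text analysis, relevance scoring, and compact on-disk index structures.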
Sematext's founder has been a Lucene developer for 10+ years and is
the co-author of Lucene in Action (1st and 2nd ed.), the
best-selling Lucene book.
Apache Solr is the popular, blazing-fast, open-source enterprise
search platform from the Apache Lucene project. Its major features
include powerful full-text search, hit highlighting, faceted
search, dynamic clustering, database integration, and rich
document (e.g. Word, PDF) handling. Solr is highly scalable,
providing distributed search and index replication, and it powers
the search and navigation features of many of the world's largest
internet sites.
Solr is written in Java and runs as a standalone full-text search
server within a servlet container such as Tomcat or Jetty. Solr
uses the Lucene Java search library at its core for full-text
indexing and search, and has REST-like HTTP/XML and JSON APIs that
make it easy to use from virtually any programming
language. Solr's powerful external configuration allows it to be
tailored to almost any type of application without Java coding,
and it has an extensive plugin architecture when more advanced
customization is required.
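As an illustration of how little client code the HTTP API requires, the sketch below builds a query URL for Solr's select handler and asks for a JSON response. The host, port, and the core name "articles" are assumptions chosen for the example, and the URL layout may differ between Solr versions and deployments.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler asking for a JSON response."""
    # q = the query string, rows = max results, wt = response writer type.
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local server and core name.
url = solr_select_url("http://localhost:8983", "articles", "title:lucene", rows=5)
```

The resulting URL can be fetched with any HTTP client, which is why Solr is easy to call from almost any language.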
Sematext provides Apache Solr consulting and tech support, and runs
the popular SPM for Solr service, which monitors the performance of
Solr clusters.
Sematext's founder has been a Solr developer since 2006. One of
Sematext's engineers is the author of the Solr Cookbook.
ElasticSearch is a cluster- and cloud-aware, high-performance,
open-source search server. It features cluster node
auto-discovery, index sharding and replication, distributed
search, cluster monitoring, faceting, filtering, highlighting,
etc.
It is written in Java and runs as a standalone full-text search
server. ElasticSearch uses the Lucene Java search library at its
core for full-text indexing and search, and has a REST-like HTTP and
JSON API that makes it easy to use from virtually any programming
language. It has an extensible architecture and a growing list of
pluggable modules.
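The sketch below shows what a search request against the JSON API can look like. It only prepares the request and does not send it. The host, port, index name "articles", and field name "title" are assumptions for the example, and the exact query syntax can vary between ElasticSearch versions.

```python
import json
from urllib.request import Request


def es_search_request(base_url, index, field, text):
    """Prepare (but do not send) a search request with a JSON match query."""
    body = {"query": {"match": {field: text}}}
    return Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local node, index, and field names.
req = es_search_request("http://localhost:9200", "articles", "title", "lucene")
```

Passing `req` to `urllib.request.urlopen` would execute the search against a running node.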
Sematext provides ElasticSearch consulting and tech support, and
runs the popular SPM for ElasticSearch service, which monitors the
performance of ElasticSearch clusters.
Mahout is a scalable Machine Learning Java library. It contains
parallelizable implementations of a number of Machine Learning
algorithms for Classification / Categorization, Clustering,
Recommendations, Pattern Mining, Regression, Dimension Reduction,
Evolutionary Algorithms, etc. Because it makes use of Hadoop and
can run on many machines in parallel, it is suitable for very
large data sets.
Hadoop is an open-source solution for reliable, scalable,
distributed computing. HDFS (Hadoop Distributed File System) is a
core Hadoop sub-system that provides high-throughput access to
application data. Hadoop's MapReduce implementation is a
framework suitable for distributed processing of large data sets
on compute clusters and HDFS.
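The MapReduce programming model can be sketched in a few lines with the classic word-count example. This is a single-process Python illustration of the map, shuffle, and reduce phases only, and is not Hadoop's API. On a real cluster the framework runs these phases in parallel across machines and reads input from HDFS. All function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input lines standing in for a large data set.
lines = ["hadoop stores data", "hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread across many machines without changing the logic.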
Sematext employs 2 Certified Hadoop Developers.
HBase is a scalable, distributed column-oriented database that
supports structured data storage for large data sets. It works
well with MapReduce, allowing developers to process data stored in
HBase with MapReduce-based jobs. HBase is modeled after Google
BigTable.
Sematext provides HBase consulting and runs the popular SPM for
HBase service, which monitors the performance of HBase clusters.
We have gone through Cloudera's HBase training courses, have
contributed patches to HBase, have open-sourced HBase-based
projects, and have non-trivial products built on top of HBase.
Voldemort is a high-performance distributed key-value store that
includes data partitioning, replication, rebalancing, graceful
failure handling, pluggable storage engines, etc. Voldemort was
developed at LinkedIn, where it is used in environments with high
data and request volumes.
We have a product built on top of Voldemort.
Cassandra is a highly scalable column-oriented distributed
database originally developed by Facebook and later donated to
the Apache Software Foundation. Cassandra is a Google BigTable and
Amazon Dynamo hybrid featuring data partitioning, replication,
rebalancing, tunable eventual consistency settings, elastic
run-time cluster expansion, data durability, fault tolerance, etc.
We have gone through the Cassandra training course given by one
of the Cassandra developers and founders of DataStax.
Nutch is open-source web-search software. It builds on Lucene
Java and Solr, adding web-specifics, such as a crawler, a
link-graph database, parsers for HTML and other document formats,
etc. It is scalable and runs on top of Hadoop (i.e. it uses
Hadoop's MapReduce and HDFS).
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of
log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery
mechanisms. The system is centrally managed and allows for
intelligent dynamic management. It uses a simple extensible data
model that allows for online analytic applications.
We have implemented Flume's HBase sink and have contributed it to Flume.
Other
Our experience and expertise don't end with search technologies.
The following are some of the other technologies we use regularly:
- BerkeleyDB (aka BDB)
- Droids
- Tika
- MySQL
- PostgreSQL
- ...
Lucene Project Downloads
Hadoop Project Downloads