Get Started with Apache Solr
An Intro to Apache Solr Basics: Tips and In-depth Resources
The ability to search is a key feature of most modern applications. While encompassing huge amounts of data, they need to allow the end-user to find what they’re searching for without delay. DevOps need to look beyond the traditional databases with complicated and non-user-friendly (even if smart and innovative) SQL query-based solutions to implement search functionality.
That’s where Apache Solr comes in: it helps smooth users’ search experience with features such as autosuggest in search fields, range or category browsing using facets, and more. So let’s dive in and “strip” Solr to the basics. Find out what Apache Solr is, why it is important, and how it works:
What is Apache Solr?
Apache Solr (Searching On Lucene w/Replication) is a free, open-source search engine based on the Apache Lucene library. An Apache Lucene subproject, it has been available since 2004 and is one of the most popular search engines in use today. Solr, however, is more than a search engine: it’s also often used as a document-based NoSQL database with transactional support, serving as a data store and even as a key-value store.
To better understand the relationship between Solr and Lucene, read our post on Solr vs. Lucene (coming soon).
Written in Java, Solr has RESTful XML/HTTP and JSON APIs, along with client libraries for many programming languages, including Java, Python, Ruby, C# and PHP. These are used to build search-based and big data analytics applications for websites, databases, files, etc.
Solr takes in structured, semi-structured and unstructured data from various sources, stores and indexes it, and makes it available for search in near real-time. Solr is also used for its analytical capabilities, enabling you to do faceted product search, log/security event aggregation, social media analysis and so on.
Solr can work with large amounts of data in what has traditionally been called master-slave mode, but it allows further scaling via clusters in SolrCloud mode. Learn how to migrate from master-slave to SolrCloud and check out the video where we explain how to scale Solr with SolrCloud.
Read more about SolrCloud: SolrCloud: Dealing with Large Tenants and Routing and Running Solr on Docker (coming soon).
Solr is completely open source, and companies usually run it on their own servers. If you’re just starting out with Solr, you can enroll in one of our Solr Training classes where you can learn to master Solr in just a few hours. And if you’re already using Solr but need some further expertise to optimize it to better suit your needs, we can also help you with Solr Support and Solr Consulting.
Solr competes with Elasticsearch but it also rivals commercial search and analytics solutions such as Splunk.
Read more about the differences between Elasticsearch and Solr.
Why Use Apache Solr?
Solr supports a multi-tenant architecture that enables you to scale, distribute and manage indexes for large-scale applications.
In a nutshell, Solr is a stable, reliable and fault-tolerant search platform with a rich set of core functions that enable you to improve both user experience and the underlying data modeling. For instance, among functionalities that help deliver good user experience, we can name spell checking, geospatial search, faceting, or auto-suggest, while backend developers may benefit from features like joins, clustering, being able to import rich document formats, and many more.
To make the most of it, here are Solr’s core features and the reasons you may want to use it:
Powerful Full-Text Search Capabilities
Solr provides advanced near real-time searching capabilities such as fielded search, Boolean queries, phrase queries, fuzzy queries, spell check, wildcards, joins, grouping, auto-complete and many more across different types of data.
Read further about Sematext Solr AutoComplete: Introduction and How to.
Comprehensive Administration Interfaces
Solr provides a built-in responsive user interface that enables you to perform administrative tasks, such as managing logging, adding, deleting, updating or searching documents.
High Scalability and Flexibility
With tools such as Apache ZooKeeper, it’s easy to scale Solr up or down, as it relies heavily on automated index replication, distribution, load-balancing, and automated failover and recovery.
Therefore, depending on the needs and size of your operation, Solr can be deployed in standalone, distributed or cloud setups, all while keeping configuration simple.
Extensible Plugin Architecture
Solr publishes extension points that make it easy to plug in both index-time and query-time plugins.
Solr comes with features that address several aspects of security:
- SSL for encryption of HTTP traffic between Solr clients and Solr, as well as between nodes
- Basic and Kerberos-based authentication
- Authorization APIs for defining users, roles, and permissions
Read more about Solr Security (coming soon).
Solr exposes its metrics via JMX MBeans, so you can do some ad-hoc monitoring (more like spot checking) using tools like JConsole or JMXC. Since version 6.4, Solr has also exposed its metrics via an HTTP API.
For monitoring Solr in production, there are commercial and open source tools you can use, such as Sematext Java Agent.
To get in-depth insights into the key Solr metrics, some level of expertise is required, and Sematext is an excellent Solr performance monitoring tool should you need one.
Read more on how you can monitor Solr with Sematext.
Besides English, both Solr and Lucene work with a number of other languages such as Chinese, Japanese, Korean, Arabic, German, French, Spanish, and many others. Solr has language detection built in and provides language-specific text analysis tools accordingly.
Powerful Analytical Capabilities
Solr has two ways of analyzing data:
- Facets. These are good for real-time analytics. For example, in product search, you’d break down results by brand. In log analysis, you’d look at the volume of errors per hour.
- Streaming aggregations. They allow you to do more complex processing, though they’re typically slower than facets. Examples include joining results with a different data set (potentially outside Solr) and machine learning tasks such as clustering or regression.
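To make the two facet examples above more concrete, here is a minimal sketch that builds request bodies in the style of Solr’s JSON Facet API. The field names (`brand`, `category`, `level`, `timestamp`) and the facet labels are hypothetical, and the sketch only builds the JSON; it does not contact a Solr server.

```python
import json


def terms_facet(field, limit=10):
    """Bucket matching documents by the distinct values of one field."""
    return {"type": "terms", "field": field, "limit": limit}


def hourly_range_facet(field, start="NOW/HOUR-24HOURS", end="NOW/HOUR"):
    """Bucket matching documents into one-hour slices of a date field."""
    return {"type": "range", "field": field,
            "start": start, "end": end, "gap": "+1HOUR"}


# Product search: count matching products per brand (hypothetical fields).
product_request = {
    "query": "category:shoes",
    "limit": 0,  # only the facet counts are wanted, not the documents
    "facet": {"by_brand": terms_facet("brand")},
}

# Log analysis: count ERROR events per hour over the last 24 hours.
log_request = {
    "query": "level:ERROR",
    "limit": 0,
    "facet": {"errors_per_hour": hourly_range_facet("timestamp")},
}

product_body = json.dumps(product_request)
log_body = json.dumps(log_request)
```

Each body would typically be sent as a POST to a collection’s search handler; the exact endpoint and field names depend on your own setup.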
Solr Terminology: Understanding the Basic Concepts Used in Solr
Before diving into the process of how Solr works, it’s important to understand the key terms used when working with Solr, from cores to documents, nodes, shards and more.
A document is a basic unit of information in Solr which can be stored and indexed. Documents are stored in collections. They can be added, deleted, and updated, typically through index handlers.
A field stores the data in a document as a key-value pair, where the key is the field name and the value is the actual field data. Solr supports different field types: float, long, double, date, text, integer, boolean, etc.
A Solr Collection is a group of shards/cores that form a single logical index. Each collection has its own set of configuration and schema definition, which can be different from those of other collections.
To create or delete a collection, list available collections, or perform other management tasks, check out the Solr Collections API.
Shards allow you to split your index into one or more pieces and store them separately. A shard is thus a slice of a collection. Each shard lives on a node and is hosted in a core.
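As an illustration of those management tasks, the sketch below builds Collections API URLs for creating, listing and deleting a collection. The base URL assumes a Solr node on the default local port, and the collection name `products` is hypothetical; nothing is sent over the network.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local Solr node


def collections_api_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


# Create a collection split into 2 shards, each with 2 replicas.
create_url = collections_api_url(
    "CREATE", name="products", numShards=2, replicationFactor=2
)
list_url = collections_api_url("LIST")
delete_url = collections_api_url("DELETE", name="products")
```

Requesting any of these URLs against a running SolrCloud cluster (for example with `urllib.request.urlopen`) would perform the corresponding action.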
Also read How to Handle Shards in SolrCloud
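The idea of slicing a collection into shards can be sketched with a simple hash-based assignment of document ids to shards. This is a deliberately simplified stand-in: SolrCloud’s own document routing is more involved, and the function names and the use of MD5 here are assumptions made for the example.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in [0, num_shards)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def distribute(doc_ids, num_shards):
    """Group document ids by the shard they are assigned to."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


doc_ids = [f"doc-{i}" for i in range(1000)]
shards = distribute(doc_ids, num_shards=4)
```

Because the assignment depends only on the id, the same document always lands in the same shard, which is what lets a later update or delete find it again.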
A node is a single Java Virtual Machine instance running Solr, also known as a Solr server. A node can host multiple shards.
A replica is a physical copy of a shard which runs as a core in a node. One of these copies is a leader (see below). Other copies of the same shard will replicate data from the leader. Read more on types of replicas and Solr replication.
The leader is the replica of a shard that forwards SolrCloud update requests to the other replicas in the shard whenever the index changes, such as when documents are added or deleted. If the leader goes down, one of the other replicas will be elected as a leader automatically.
Specific to SolrCloud, a cluster is made up of one or more nodes that store all the data, providing distributed indexing and search capabilities across all nodes. Read more about SolrCloud here.
NOTE: So far we’ve described SolrCloud, which is the newer (and usually preferred) way of running Solr. SolrCloud is distributed, relying on Apache ZooKeeper to store its cluster state. That said, there’s still the option to run Solr in a standalone or master-slave setup. There, you can create or remove cores via the CoreAdmin API (and parameters will be stored in the core.properties file). But usually, for larger-scale setups, you’d migrate from Solr master-slave to SolrCloud.
How Does Solr Work?
Solr works by gathering, storing and indexing documents from different sources and making them searchable in near real-time. It follows a 3-step process that involves indexing, querying, and finally, ranking the results, and it does so quickly even when working with huge volumes of data.
More specifically, Solr performs the following operations in sequence to search for a document:
Step 1: Indexing
There are several ways to index documents:
- If your files are already in JSON, XML/XSLT or CSV formats, you can upload them directly to Solr by calling the index request handler (or simply index handler).
- If you want to index rich text documents such as PDF files or Office documents that are supported by Apache Tika out of the box, you can use the ExtractingRequestHandler, also known as Solr Cell. This request handler parses incoming files with Tika and extracts fields that you need to index.
- You can also import data from a database, emails, RSS feeds, XML data, plain text files, etc. Solr has a plugin called the DataImportHandler, which can fetch data from a database and index it, using column names as document field names.
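For the first option in the list above, a minimal sketch of preparing a JSON indexing request might look like this. The base URL, the collection name `products` and the document fields are assumptions for illustration; the request object is only built, not sent.

```python
import json
from urllib.request import Request


def build_update_request(base_url, collection, docs, commit=True):
    """Prepare a POST to the collection's update handler with JSON documents."""
    url = f"{base_url}/{collection}/update"
    if commit:
        url += "?commit=true"  # make the documents searchable right away
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Hypothetical documents: each one is a set of field name/value pairs.
docs = [
    {"id": "1", "name": "Trail running shoes", "price": 89.9},
    {"id": "2", "name": "Leather boots", "price": 129.0},
]

request = build_update_request("http://localhost:8983/solr", "products", docs)
```

Passing `request` to `urllib.request.urlopen` would send the documents to a running Solr instance.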
Solr uses Lucene to create an inverted index, so called because it inverts a page-centric data structure (documents ⇒ words) to a keyword-centric structure (word ⇒ documents). It’s like the index you see at the end of any book where you can find where certain words occur in the book. Similarly, the Solr index is a list that holds the mapping of words, terms or phrases and their corresponding places in the documents stored.
Solr, therefore, achieves faster responses because it searches for keywords in the index instead of scanning the text directly.
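The toy example below shows the idea of an inverted index in a few lines of Python. It is only an illustration of the data structure, not how Lucene stores its index; the sample documents and function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return ids of documents containing all the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is a search library",
    3: "Solr scales with SolrCloud",
}
index = build_inverted_index(docs)

lucene_docs = search(index, "lucene")        # documents 1 and 2
both = search(index, "solr", "lucene")       # only document 1
nothing = search(index, "elasticsearch")     # no matches
```

A lookup touches only the entries for the query words, regardless of how much text the documents contain.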
Solr uses fields to index a document. However, before being added to the index, data goes through a field analyzer, where Solr uses char filters, tokenizers, and token filters to make data searchable. Char filters can make changes to the string as a whole. Then, tokenizers break field data into lexical units or tokens that then pass through filters which decide to keep, transform (e.g. setting all the data to lowercase, reducing words to their stems) or discard them, or create new ones. These final tokens are added to the index or searched at query time.
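The analysis chain described above can be imitated in plain Python. This is a simplified sketch with made-up function names, a tiny stop-word list and a very naive stemmer; Solr’s real analyzers are configured per field type and are far more capable.

```python
import re

STOP_WORDS = {"a", "an", "and", "are", "of", "the"}


def strip_html(text):
    """Char filter: changes the raw string before it is tokenized."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: breaks the string into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: transforms every token."""
    return [token.lower() for token in tokens]


def drop_stop_words(tokens):
    """Token filter: discards very common words."""
    return [token for token in tokens if token not in STOP_WORDS]


def naive_stem(tokens):
    """Token filter: strips a few common suffixes (a crude stemmer)."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def analyze(text):
    """Run the char filter, the tokenizer, then each token filter in order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, drop_stop_words, naive_stem):
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>The Foxes</b> and the dogs are Searching")
```

Here `tokens` ends up as `["fox", "dog", "search"]`. Running the same chain on a query such as "searches" yields `["search"]`, which is why the indexed text and the query can match.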
However, you need to set up rules for processing content found in fields as documents are indexed. These rules specify field types, which fields are required and which should be used as the primary/unique key, and how to index and search each field.
The fields and the rules are defined in the managed-schema file (formerly schema.xml), typically stored in the confDir for your core or collection.
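With a managed schema, such field rules can also be defined through Solr’s Schema API rather than by editing the file by hand. The sketch below builds `add-field` commands as JSON; the field names are hypothetical, and the field types (`text_general`, `pfloat`, `string`) are assumed to exist in your configset, as they do in the default one shipped with recent Solr versions.

```python
import json


def add_field_command(name, field_type, stored=True, indexed=True,
                      required=False, multi_valued=False):
    """Build a Schema API command that adds one field definition."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,        # keep the original value for retrieval
            "indexed": indexed,      # make the field searchable
            "required": required,
            "multiValued": multi_valued,
        }
    }


commands = [
    add_field_command("title", "text_general", required=True),
    add_field_command("price", "pfloat"),
    add_field_command("tags", "string", multi_valued=True),
]
payloads = [json.dumps(command) for command in commands]
```

Each payload would be POSTed to the collection’s `/schema` endpoint with a JSON content type.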
But just defining these rules is not enough to ensure optimum performance. Instead, there are several things you need to consider so that you can get the most out of Solr in terms of performance whenever you’re updating your index (add, delete documents). In this talk we explain how and when to optimize Solr. If you don’t have time to watch the video, check out the Optimize Is (Not) Bad for You slides instead.
Read more on indexing with Solr:
- Solr Streaming Expressions for Collection auto-updating
- DocValues Reindexing with Solr Streaming Expressions
- Presentation: Solr for Indexing and Searching Logs
Step 2: Querying
You can search for various things, such as keywords, images or geolocation data. When you send a query, Solr processes it with a query request handler (or simply query handler) that works similarly to the index handler, except that it is used to return documents from the Solr index instead of uploading them.
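A query to the standard `/select` handler is an HTTP request with parameters such as `q` (the query), `fq` (filter queries), `fl` (fields to return) and `rows`. Here is a minimal sketch of building such a URL; the base URL, collection name and field names are assumptions for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base_url, collection, q, fq=None, fl=None, rows=10):
    """Build a URL for the collection's /select query handler."""
    params = [("q", q)]
    for filter_query in fq or []:
        params.append(("fq", filter_query))  # fq may be repeated
    if fl:
        params.append(("fl", ",".join(fl)))
    params.append(("rows", rows))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = select_url(
    "http://localhost:8983/solr",
    "products",
    q="name:shoes",
    fq=["brand:acme", "price:[10 TO 50]"],
    fl=["id", "name", "score"],
    rows=5,
)

# Decode the query string again to check what Solr would receive.
decoded = parse_qs(urlsplit(url).query)
```

`urlencode` takes care of escaping characters such as `:`, `[` and spaces, which are common in Solr query syntax.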
NOTE: Before running the actual query, you may want to identify the fields you want to target with each keyword. This process is called Entity Extraction, and you can use the Solr Text Tagger for this purpose.
Step 3: Ranking the Results
As it matches indexed documents to a query, Solr ranks the results by their relevancy score – the most relevant hits appear at the top of the matched documents.
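To give a feel for relevancy ranking, the sketch below scores documents with a simplified TF-IDF formula and sorts them by score. This is an illustration only: Solr delegates scoring to Lucene’s similarity implementation (BM25 by default in recent versions), which is more refined than this. The sample documents and function name are made up.

```python
import math
from collections import Counter


def tf_idf_rank(query, docs):
    """Return (doc_id, score) pairs for matching docs, best match first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    total_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if doc_freq == 0 or counts[term] == 0:
                continue
            term_freq = counts[term] / len(tokens)    # how prominent in this doc
            inverse_doc_freq = math.log(1 + total_docs / doc_freq)  # how rare overall
            score += term_freq * inverse_doc_freq
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a": "solr search engine",
    "b": "solr solr solr tutorial",
    "c": "cooking recipes",
}
ranked = tf_idf_rank("solr", docs)
ranked_ids = [doc_id for doc_id, _ in ranked]
```

Document "b" ranks above "a" because the query term makes up a larger share of its text, while "c" does not match at all and is left out.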
Read more on querying with Solr:
- SolrCloud and SQL Queries
- Parameterizing Queries in Solr and Elasticsearch
- Solr Learning To Rank and Streaming Expressions
Download our Solr / SolrCloud Cheat Sheet to learn how to access all Solr features, from running Solr to data manipulation, searching, faceting, streaming aggregations. You can also enroll in one of our Solr Training classes.
Solr Use Cases & Applications
Solr is a search engine with multiple uses that has proven critical to business operations. Besides the powerful search features, Solr makes for an exceptional data store for analytics use. Solr is, therefore, the backbone of applications with sophisticated search and analytics requirements in virtually any domain, from marketing, energy and education to HR, healthcare, retail, real estate and many more.
Due to its extensible nature and customizable search features, it can easily be adapted to suit your particular needs. Companies like Apple, Netflix, Instagram, NASA, Zappos, Goldman Sachs, and The White House are just a few that use Solr to support their business. In fact, it’s used by many of the Fortune 500 companies.
Solr is popular for websites as it can be used to index and search multiple sites, as well as for enterprise search because it can index and search documents and email attachments. However, it’s not helpful only in the IT domain, but also in scientific applications, like searching for DNA patterns, or in scientific research to search for particular genes or nucleotide sequences to identify an organism. There is a nearly endless number of use cases where Solr can be of use.
Here, we’ve gathered some use-cases for you to better understand how you can leverage Solr features in a BI infrastructure:
Hiring managers of recruitment agencies have to scan piles of resumes to find just a few suitable candidates to interview. Solr can help reduce the time spent on going through resumes. With Apache Tika, it can index unstructured data coming from rich text documents such as PDFs, Word documents, XML or plain text. The search engine can pull keywords and phrases, identify and convert different word forms, and detect the language that was used. Furthermore, after hire, you can create a predictive model based on the employee’s resume to use in future recruitment processes for similar positions.
When expanding a store chain, Solr can help strategic planners decide where the new location should be. Using its geospatial functionality, it can map out the existing and potential customers and include distance as a criterion when ranking each potential location. Additionally, by analyzing customer purchases, it can group customers by distance travelled, number of visits, or the amount purchased.
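The distance-based ranking in this use case can be sketched in plain Python with the haversine formula. Solr has its own spatial query support for this kind of calculation; the code below only illustrates the idea, and the site names and coordinates are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_sites(candidates, customers):
    """Order candidate sites by average distance to customers, nearest first."""
    def average_distance(site):
        lat, lon = candidates[site]
        total = sum(haversine_km(lat, lon, c_lat, c_lon)
                    for c_lat, c_lon in customers)
        return total / len(customers)

    return sorted(candidates, key=average_distance)


# Hypothetical customer locations and candidate store sites.
customers = [(40.00, -75.00), (40.02, -75.01), (39.98, -74.99)]
candidates = {"downtown": (40.01, -75.00), "suburb": (40.50, -75.30)}

ranking = rank_sites(candidates, customers)
```

With these sample points, "downtown" ranks first because it is much closer to the customers on average.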
Log File Analytics
Manufacturing is a good example of how Solr can support very large indexes. In this kind of operation, parts are tracked from the moment they enter the inventory until they leave the line fully assembled. And not just once, but by every machine they pass through on the assembly line. Solr can handle this massive amount of data and provide efficient ingestion and search capabilities in near real-time. With Solr, you can easily see production rate and defect rate, and group data by date range, product line, location, etc.
Monitoring Solr with Sematext
Now that you know how Solr works and what it is used for, you can understand why monitoring Solr is important: it gives you insight into the health and compliance of your application, enabling you to act fast and informed whenever you get a red flag. With Sematext Cloud you can easily monitor Solr logs and metrics in one place.
If you don’t have the expertise yet to monitor Solr, enroll in one of our Solr Training classes, and we’ll help you get there. Or if you’re already using Solr but need support along the way to fine-tune it, we can also help you with Solr Support and Solr Consulting.