
Scalable and Flexible Elasticsearch Reindexing via rsyslog

Earlier on, we posted a recipe on reindexing data from within an Elasticsearch 2.3+ cluster. But this doesn’t work if you want to reindex in a different cluster or if your Elasticsearch is older than 2.3. Or both, when you’re trying to migrate from 1.x to 2.x or later.

For such cases, we posted a Logstash reindexing recipe. However, Logstash can sometimes become a bottleneck, so we needed something faster for indexing lots of data. We turned to rsyslog, a log shipper with performance as its #1 feature.

The plan

As rsyslog doesn’t have an Elasticsearch input like Logstash does, we’ve used an external application to scroll through Elasticsearch documents and push them to rsyslog via TCP. The flow would be:

rsyslog to Elasticsearch reindex flow

This is an easy way to extend rsyslog, using whichever language you’re comfortable with, to support more inputs. Here, we piggyback on the TCP input. You can do a similar job with filters/parsers – you can find some examples here – by piggybacking on the mmexternal module, which uses stdin and stdout for communication. The same is possible for outputs, normally added via the omprog module: we did this to add a Solr output and one for SPM custom metrics.

The custom script in question doesn’t have to be multi-threaded; you can simply spin up more copies of it, each scrolling different indices. In this particular case, using two scripts gave us slightly better throughput, saturating the network:

rsyslog to Elasticsearch reindex flow multiple scripts

Writing the custom script

Before starting to write the script, one needs to know what the messages sent to rsyslog should look like. To be able to index data, rsyslog will need an index name, a type name and, optionally, an ID. In this particular case, we were dealing with logs, so the ID wasn’t necessary.

With this in mind, I see a number of ways of sending data to rsyslog:

  • one big JSON per line. One can use mmnormalize to parse that JSON, which then allows rsyslog to use values from within it as the index name, type name, and so on
  • for each line, begin with the bits of “extra data” (like the index and type names), then put the JSON document that you want to reindex. Again, you can use mmnormalize to parse, but this time you can simply trust that the last part is the JSON and send it to Elasticsearch directly, without the need to parse it
  • if you only need to pass two variables (index and type name, in this case), you can piggyback on the vague spec of RFC3164 syslog and send something like
    destination_index document_type:{"original": "document"}
    

This last option will put the provided index name in the hostname variable, the type in syslogtag and the original document in msg. A bit hacky, I know, but quite convenient (it makes the rsyslog configuration straightforward) and very fast, since the RFC3164 parser is very quick and it runs on all messages anyway. There’s no need for mmnormalize, unless you want to change the document in-flight with rsyslog.

Below you can find the Python code that can scan through existing documents in an index (or index pattern, like logstash_2016.05.*) and push them to rsyslog via TCP. You’ll need the Python Elasticsearch client (pip install elasticsearch) and you’d run it like this:

python elasticsearch_to_rsyslog.py source_index destination_index

The script being:

from elasticsearch import Elasticsearch
import json, socket, sys

# source Elasticsearch cluster and local rsyslog TCP listener
source_cluster = ['server1', 'server2']
rsyslog_address = '127.0.0.1'
rsyslog_port = 5514

es = Elasticsearch(source_cluster,
      retry_on_timeout=True,
      max_retries=10)
# plain TCP connection to rsyslog's imtcp listener
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((rsyslog_address, rsyslog_port))


# start a scan; with search_type='scan', the initial response contains no hits,
# documents come with each subsequent scroll call
result = es.search(index=sys.argv[1], scroll='1m', search_type='scan', size=500)

while True:
  # fetch the next batch of documents for this scroll ID
  result = es.scroll(scroll_id=result['_scroll_id'], scroll='1m')
  if not result['hits']['hits']:
    break
  for hit in result['hits']['hits']:
    # format: destination_index document_type:{original JSON document}
    s.sendall(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(hit["_source"]) + '\n')

s.close()
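
As mentioned earlier, the script isn’t multi-threaded, so to push more data you can simply run two or more copies against different index patterns. The index names below are just placeholders for your own indices:

python elasticsearch_to_rsyslog.py 'logstash_2016.05.*' reindexed_2016.05 &
python elasticsearch_to_rsyslog.py 'logstash_2016.06.*' reindexed_2016.06 &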

If you need to modify messages, you can parse them in rsyslog via mmjsonparse and then add/remove fields through rsyslog’s scripting language. However, I couldn’t find a nice way to change field names – for example, to remove the dots that are forbidden since Elasticsearch 2.0 – so I did that in the Python script:

def de_dot(my_dict):
  # build a new dict rather than modifying my_dict while iterating over it
  new_dict = {}
  for key, value in my_dict.iteritems():
    if type(value) is dict:
      value = de_dot(value)
    new_dict[key.replace('.', '_')] = value
  return new_dict

And then the “send” line becomes:

s.sendall(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(de_dot(hit["_source"])) + '\n')

Configuring rsyslog

The first step here is to make sure you have the latest rsyslog, though the config below works with versions all the way back to 7.x (which can be found in most Linux distributions). You just need to make sure the rsyslog-elasticsearch package is installed, because we need the Elasticsearch output module.

# messages bigger than this are truncated
$maxMessageSize 10000000  # ~10MB

# load the TCP input and the ES output modules
module(load="imtcp")
module(load="omelasticsearch")

main_queue(
  # buffer up to 1M messages in memory
  queue.size="1000000"
  # these threads process messages and send them to Elasticsearch
  queue.workerThreads="4"
  # rsyslog processes messages in batches to avoid queue contention
  # this will also be the Elasticsearch bulk size
  queue.dequeueBatchSize="4000"
)

# we use templates to specify what the data sent to Elasticsearch looks like
template(name="document" type="list"){
  # the "msg" variable contains the document
  property(name="msg")
}
template(name="index" type="list"){
  # "hostname" has the index name
  property(name="hostname")
}
template(name="type" type="list"){
  # "syslogtag" has the type name
  property(name="syslogtag")
}

# start the TCP listener on the port we pointed the Python script to
input(type="imtcp" port="5514")

# sending data to Elasticsearch, using the templates defined earlier
action(type="omelasticsearch"
  template="document"
  dynSearchIndex="on" searchIndex="index"
  dynSearchType="on" searchType="type"
  server="localhost"  # destination Elasticsearch host
  serverport="9200"   # and port
  bulkmode="on"  # use the bulk API
  action.resumeretrycount="-1"  # retry indefinitely if Elasticsearch is unreachable
)

This configuration doesn’t have to disturb your local syslog setup (i.e. it doesn’t need to replace /etc/rsyslog.conf). You can put it someplace else and run a separate rsyslog process:

rsyslogd -i /var/run/rsyslog_reindexer.pid -f /home/me/rsyslog_reindexer.conf

And that’s it! With rsyslog started, you can start the Python script(s) and do the reindexing.

If you need any help with Elasticsearch, rsyslog, Logstash and the like, check out our Elasticsearch consulting, Logging consulting, Elasticsearch production support and Elasticsearch and Logging training info.


Elasticsearch Ingest Node vs Logstash Performance

Starting from Elasticsearch 5.0, you’ll be able to define pipelines within it that process your data, in the same way you’d normally do it with something like Logstash. We decided to take it for a spin and see how this new functionality (called Ingest) compares with Logstash filters in both performance and functionality.

Specifically, we tested the grok processor on Apache common logs (we love logs here), which can be parsed with a single rule, and on CISCO ASA firewall logs, for which we have 23 rules. This way we could also check how both Ingest and Logstash scale when you start adding more rules.

Baseline performance

To get a baseline, we pushed logs with Filebeat 5.0alpha1 directly to Elasticsearch, without parsing them in any way. We used an AWS c3.large for Filebeat (2 vCPU) and a c3.xlarge for Elasticsearch (4 vCPU). We also installed SPM to monitor Elasticsearch’s performance.

It turned out that network was the bottleneck, which is why pushing raw logs doesn’t saturate the CPU:
raw logs CPU

Even so, we got a healthy throughput of 12-14K EPS:
raw logs throughput

But raw, unparsed logs are rarely useful. Ideally, you’d log in JSON and push directly to Elasticsearch. Conveniently, Filebeat can parse JSON since 5.0. In that setup, throughput dropped to about 4K EPS, because JSON logs are bigger and saturate the network:
Throughput of JSON logs

CPU usage dropped as well, but not by as much, because now Elasticsearch has to do more work (more fields to index):
JSON logs CPU

This 4K EPS throughput at 40 percent CPU is the most efficient way to send logs to Elasticsearch – if you can log in JSON. If you can’t, you’ll need to parse them. So we added another c3.xl instance (4 vCPUs) to do the parsing, first with Logstash, then with a dedicated Elasticsearch Ingest node.

Logstash

With Logstash 5.0 in place, we pointed Filebeat to it, while tailing the raw Apache logs file. On the Logstash side, we have a beats listener, a grok filter and an Elasticsearch output:

input {
  beats {
    port => 5044
  }
}

filter {
   grok {
     match => ["message", "%{COMMONAPACHELOG}%{GREEDYDATA:additional_fields}"]
   }
}

output {
  elasticsearch {
    hosts => "10.154.238.233:9200"
    workers => 4
  }
}

The default of 2 pipeline workers seemed enough, but we specified more output workers to make up for the time each of them spends waiting for Elasticsearch to reply. Even so, the network was again the bottleneck, so throughput was capped at 4K EPS, like with JSON logs:
Logstash apache logs throughput

Meanwhile, Logstash used just about the same amount of CPU as Elasticsearch, at 40-50%:
Logstash apache logs CPU usage

Then we parsed CISCO ASA logs. The config looks similar, except there were 23 grok rules instead of one. Logstash handled the load surprisingly well – throughput was again capped by the network, slightly lower than before because JSONs were bigger:
Logstash CISCO ASA grok throughput

While CPU usage only increased to 60-70%:
Logstash CISCO ASA CPU usage

This means the throughput-to-CPU ratio only went down by about 1.5x after adding a lot more rules. However, in both cases Logstash proved pretty heavy, using about the same CPU to parse the data as Elasticsearch used for indexing it. Let’s see if the Ingest node can do better.

Ingest node

We used the same c3.xl instance for the Ingest node tests: we set node.master and node.data to false in its elasticsearch.yml, to make sure it only does grok and nothing else. We also set node.ingest to false on the data node, so it can focus on indexing.

Next step was to define a pipeline that does the grok processing on the Ingest node:

curl -XPUT localhost:9200/_ingest/pipeline/apache?pretty -d '{
  "description": "grok apache logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "pattern": "%{COMMONAPACHELOG}%{GREEDYDATA:additional_fields}"
      }
    }
  ]
}'

Then, to trigger the pipeline for a certain document/bulk, we added the name of the defined pipeline to the HTTP parameters like pipeline=apache. We used curl this time for indexing, but you can add various parameters in Filebeat, too.
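
For example, here’s a minimal sketch of indexing a single document through that pipeline using Python and the requests library instead of curl (the index and type names are made up for illustration):

import json, requests

doc = {"message": '127.0.0.1 - - [15/Jun/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 1234'}

# the pipeline=apache parameter makes Elasticsearch run the document through
# the Ingest pipeline defined above before indexing it
resp = requests.post('http://localhost:9200/weblogs/apache?pipeline=apache',
                     data=json.dumps(doc))
print(resp.json())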

With Apache logs, the throughput numbers were nothing short of impressive (12-16K EPS):
ingest node apache logs grok throughput

This used up all the CPU on the data node, while the ingest node was barely breaking a sweat at 15%:
ingest node grok apache logs CPU usage

Because Filebeat only sent raw logs to Elasticsearch (specifically, the dedicated Ingest node), there was less strain on the network. The Ingest node, on the other hand, also acted like a client node, distributing the logs (now parsed) to the appropriate shards, using the node-to-node transport protocol. Overall, the Ingest node provided ~10x better CPU-to-throughput ratio than Logstash.

Things still look better, though not as dramatically, with CISCO ASA logs. We have multiple sub-types of logs here, and therefore multiple grok rules. With Logstash, you can specify an array of match directives:

grok {
  match => [
   "cisco_message", "%{CISCOFW106001}",
   "cisco_message", "%{CISCOFW106006_106007_106010}",
...

There’s no such thing for Ingest node yet, so you need to define one rule, and then use the on_failure block to define another grok rule (effectively saying “if this rule doesn’t match, try that one”) and keep nesting like that until you’re done:

"grok": {
  "field": "cisco_message",
  "pattern": "%{CISCOFW106001}",
  "on_failure": [
    {
      "grok": {
      "field": "cisco_message",
      "pattern": "%{CISCOFW106006_106007_106010}",
      "on_failure": [...

The other problem is performance. Because now there are up to 23 rules to evaluate, throughput goes down to about 10K EPS:
Ingest node CISCO ASA grok throughput

And the CPU bottleneck shifts to the Ingest node:
Ingest node CISCO ASA grok CPU

Overall, the throughput-to-CPU ratio of the Ingest node dropped by a factor of 9 compared to the Apache logs scenario.

Conclusions

  • Logstash is easier to configure, at least for now, and performance didn’t deteriorate as much when adding rules
  • Ingest node is lighter across the board. For a single grok rule, it was about 10x faster than Logstash
  • Ingest nodes can also act as “client” nodes
  • Define the grok rules matching most logs first, because both Ingest and Logstash exit the chain on the first match by default

You’ve made it all the way down here? Bravo! If you need any help with Elasticsearch – don’t forget @sematext does Elasticsearch Consulting, Production Support, as well as Elasticsearch Training.

Monitoring rsyslog with Kibana and SPM

A while ago we published this post where we explained how you can get stats about rsyslog, such as the number of messages enqueued, the number of output errors and so on. The point was to send them to Elasticsearch (or Logsene, our logging SaaS, which exposes the Elasticsearch API) in order to analyze them.

This is part 2 of that story, where we share how we process these stats in production. We’ll cover:

  • an updated config, working with Elasticsearch 2.x
  • what Kibana dashboards we have in Logsene to get an overview of what rsyslog is doing
  • how we send some of these metrics to SPM as well, in order to set up alerts on their values: both threshold-based alerts and anomaly detection

Read More

Recipe: Apache Logs + rsyslog (parsing) + Elasticsearch

More than two years ago we posted a recipe on how to centralize syslog logs in Elasticsearch in order to search and analyze them with Kibana, all by using only rsyslog. This works well, because rsyslog is fast and light, as we’ve shown in later posts and recent presentations.

This recipe is about tailing Apache HTTPD logs with rsyslog, parsing them into structured JSON documents, and forwarding them to Elasticsearch (or our log analytics SaaS, Logsene, which exposes the Elasticsearch API). Having them indexed in a structured way will allow you to do better analytics with tools like Kibana:

Kibana_screenshot

We’ll also cover pushing system logs and how to buffer them properly, so it’s an updated, more complete recipe compared to the old one.

Getting the ingredients

Even though most distros already have rsyslog installed, it’s highly recommended to get the latest stable version from the rsyslog repositories. This way you’ll benefit from the last two to five years of development (depending on how conservative your distro is). The packages you’ll need are rsyslog itself, plus rsyslog-mmnormalize and rsyslog-elasticsearch, which provide the parsing and Elasticsearch output modules used below.

With the ingredients in place, let’s start cooking a configuration. The configuration needs to do the following:

  • load the required modules
  • configure inputs: tailing Apache logs and system logs
  • configure the main queue to buffer your messages. This is also the place to define the number of worker threads and batch sizes (which will also be Elasticsearch bulk sizes)
  • parse common Apache logs into JSON
  • define a template specifying how the JSON messages should look. You’d use this template to send logs to Logsene/Elasticsearch via the Elasticsearch output

Loading modules

Here, we’ll need imfile to tail files, mmnormalize to parse them, and omelasticsearch to send them. If you want to tail the system logs, you’d also need to include imuxsock and imklog (for kernel logs).

# system logs
module(load="imuxsock")
module(load="imklog")
# file
module(load="imfile")
# parser
module(load="mmnormalize")
# sender
module(load="omelasticsearch")

Configure inputs

For system logs, you typically don’t need any special configuration (unless you want to listen to a non-default Unix Socket). For Apache logs, you’d point to the file(s) you want to monitor. You can use wildcards for file names as well. You also need to specify a syslog tag for each input. You can use this tag later for filtering.

input(type="imfile"
      File="/var/log/apache*.log"
      Tag="apache:"
)

NOTE: By default, rsyslog will not poll for file changes every N seconds. Instead, it will rely on the kernel (via inotify) to poke it when files get changed. This makes the process quite realtime and scales well, especially if you have many files changing rarely. Inotify is also less prone to bugs when it comes to file rotation and other events that would otherwise happen between two “polls”. You can still use the legacy mode="polling" by specifying it in imfile’s module parameters.

Queue and workers

By default, all incoming messages go into a main queue. You can also separate flows (e.g. files and system logs) by using different rulesets but let’s keep it simple for now.

For tailing files, this kind of queue would work well:

main_queue(
  queue.workerThreads="4"
  queue.dequeueBatchSize="1000"
  queue.size="10000"
)

This would be a small in-memory queue of 10K messages, which works well if Elasticsearch goes down, because the data is still in the file and rsyslog can stop tailing when the queue becomes full, and then resume tailing. 4 worker threads will pick batches of up to 1000 messages from the queue, parse them (see below) and send the resulting JSONs to Elasticsearch.

If you need a larger queue (e.g. if you have lots of system logs and want to make sure they’re not lost), I would recommend using a disk-assisted memory queue, that will spill to disk whenever it uses too much memory:

main_queue(
  queue.workerThreads="4"
  queue.dequeueBatchSize="1000"
  queue.highWatermark="500000"    # max no. of events to hold in memory
  queue.lowWatermark="200000"     # use memory queue again, when it's back to this level
  queue.spoolDirectory="/var/run/rsyslog/queues"  # where to write on disk
  queue.fileName="stats_ruleset"
  queue.maxDiskSpace="5g"        # it will stop at this much disk space
  queue.size="5000000"           # or this many messages
  queue.saveOnShutdown="on"      # save memory queue contents to disk when rsyslog is exiting
)

Parsing with mmnormalize

The message normalization module uses liblognorm to do the parsing. So in the configuration you’d simply point rsyslog to the liblognorm rulebase:

action(type="mmnormalize"
  rulebase="/opt/rsyslog/apache.rb"
)

where apache.rb contains rules for parsing Apache logs, which can look like this:

version=2

rule=:%clientip:word% %ident:word% %auth:word% [%timestamp:char-to:]%] "%verb:word% %request:word% HTTP/%httpversion:float%" %response:number% %bytes:number% "%referrer:char-to:"%" "%agent:char-to:"%"%blob:rest%

Where version=2 indicates that rsyslog should use liblognorm’s v2 engine (which was introduced in rsyslog 8.13) and then you have the actual rule for parsing logs. You can find more details about configuring those rules in the liblognorm documentation.
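
To make that rule more concrete, here’s roughly what it would extract from a typical access-log line – an illustration written as a Python dict, not actual lognormalizer output, and the exact value types (string vs. number) depend on liblognorm’s field types:

# input: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08"
parsed = {
  "clientip": "127.0.0.1",
  "ident": "-",
  "auth": "frank",
  "timestamp": "10/Oct/2000:13:55:36 -0700",
  "verb": "GET",
  "request": "/apache_pb.gif",
  "httpversion": "1.0",
  "response": "200",
  "bytes": "2326",
  "referrer": "http://www.example.com/start.html",
  "agent": "Mozilla/4.08",
  "blob": ""
}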

Beyond this Apache example, creating new rules typically requires a lot of trial and error. To check your rules without messing with rsyslog, you can use the lognormalizer binary like:

head -1 /path/to/log.file | /usr/lib/lognorm/lognormalizer -r /path/to/rulebase.rb -e json

NOTE: If you’re used to Logstash’s grok, this kind of parsing rule will look very familiar. However, things are quite different under the hood. Grok is a nice abstraction over regular expressions, while liblognorm builds parse trees out of specialized parsers. This makes liblognorm much faster, especially as you add more rules. In fact, it scales so well that, for all practical purposes, performance depends on the length of the log lines and not on the number of rules. This post explains the theory behind this assumption, and there are already some preliminary tests to prove it as well (some of which we’ll present at Lucene Revolution). The downside is that you’ll lose some of the flexibility offered by regular expressions. You can still use regular expressions with liblognorm (you’d need to set allow_regex to on when loading mmnormalize), but then you’d lose a lot of the benefits that come with the parse tree approach.

Templates and actions

If you’re using rsyslog only for parsing Apache logs (and not system logs) and you send your logs to Logsene, this bit is rather simple, because by the time parsing is done you already have all the relevant fields in the $!all-json variable, which you’ll use as a template:

template(name="all-json" type="list"){
  property(name="$!all-json")
}

Then you can use this template to send logs to Logsene via the Elasticsearch API, using your Logsene application token as the index name:

action(type="omelasticsearch"
  template="all-json"  # use the template defined earlier
  searchIndex="LOGSENE-APP-TOKEN-GOES-HERE"
  searchType="apache"
  server="logsene-receiver.sematext.com"
  serverport="80"
  bulkmode="on"  # use the bulk API
  action.resumeretrycount="-1"  # retry indefinitely if Logsene/Elasticsearch is unreachable
)

Putting both Apache and system logs together

Finally, if you use the same rsyslog to handle system logs, mmnormalize won’t parse them (because they don’t match Apache’s common log format). In this case, you’ll need to pick the rsyslog properties you want and build an additional JSON template:

template(name="plain-syslog"
  type="list") {
    constant(value="{")
      constant(value="\"timestamp\":\"")     property(name="timereported" dateFormat="rfc3339")
      constant(value="\",\"host\":\"")        property(name="hostname")
      constant(value="\",\"severity\":\"")    property(name="syslogseverity-text")
      constant(value="\",\"facility\":\"")    property(name="syslogfacility-text")
      constant(value="\",\"tag\":\"")   property(name="syslogtag" format="json")
      constant(value="\",\"message\":\"")    property(name="msg" format="json")
    constant(value="\"}")
}

Then you can make rsyslog decide: if a log was parsed successfully, use the all-json template. If not, use the plain-syslog one:

if $parsesuccess == "OK" then {
 action(type="omelasticsearch"
  template="all-json"
  ...
 )
} else {
 action(type="omelasticsearch"
  template="plain-syslog"
  ...
 )
}

And that’s it! Now you can restart rsyslog and get both your system and Apache logs parsed, buffered and indexed into Logsene. If you rolled your own Elasticsearch cluster, there’s one more step on the rsyslog side.

Time-based indices in your own Elasticsearch cluster

Logsene uses time-based indices out of the box, but in a local setup you’ll need to do this yourself. Such a design will give your cluster a lot more capacity due to the way Elasticsearch merges data in the background (we covered this in detail in our presentations at GeeCON and Berlin Buzzwords).

To make rsyslog use daily or other time-based indices, you need to define a template that builds an index name from the timestamp of each log. This one names them logstash-YYYY.MM.DD, like Logstash does by default:

template(name="logstash-index"
  type="list") {
    constant(value="logstash-")
    property(name="timereported" dateFormat="rfc3339" position.from="1" position.to="4")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="6" position.to="7")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="9" position.to="10")
}

And then you’d use this template in the Elasticsearch output instead of a static index name (this also requires setting dynSearchIndex to on):

action(type="omelasticsearch"
  template="all-json"
  dynSearchIndex="on"
  searchIndex="logstash-index"
  searchType="apache"
  server="MY-ELASTICSEARCH-SERVER"
  bulkmode="on"
  action.resumeretrycount="-1"
)

And now you’re really done, at least as far as rsyslog is concerned. For tuning Elasticsearch, have a look at our GeeCON and Berlin Buzzwords presentations. If you have additional questions, please let us know in the comments. And if you find this topic exciting, we’re happy to let you know that we’re hiring worldwide.

Replaying Elasticsearch Slowlogs with Logstash and JMeter

Sometimes we just need to replay production queries – whether it’s because we want a realistic load test for the new version of a product or because we want to reproduce, in a test environment, a bug that only occurs in production (isn’t it lovely when that happens? Everything is fine in tests but when you deploy, tons of exceptions in your logs, tons of alerts from the monitoring system…).

With Elasticsearch, you can enable slowlogs to make it log queries taking longer (per shard) than a certain threshold. You can change the thresholds on demand. For example, the following request will make Elasticsearch log virtually all queries that hit test-index:

curl -XPUT localhost:9200/test-index/_settings -d '{
  "index.search.slowlog.threshold.query.warn" : "1ms"
}'

You can run those queries from the slowlog in a test environment via a tool like JMeter. In this post, we’ll cover how to parse slowlogs with Logstash to write only the queries to a file, and how to configure JMeter to run queries from that file on an Elasticsearch cluster.
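
The post uses Logstash for the extraction, but just to illustrate the idea, here’s a quick Python sketch that pulls the query body out of each slowlog line and writes one query per line. It assumes the default slowlog line layout, where the query appears inside a source[...] section, and the file names are placeholders; adjust both to your environment:

import re

with open('elasticsearch_index_search_slowlog.log') as slowlog, open('queries.txt', 'w') as out:
  for line in slowlog:
    # grab everything between source[ and ], extra_source[
    match = re.search(r'source\[(.*)\], extra_source\[', line)
    if match:
      out.write(match.group(1) + '\n')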

Read More

Solr Redis Plugin Use Cases and Performance Tests

The Solr Redis Plugin is an extension for Solr that provides a query parser which uses data stored in Redis. It is open-sourced on Github by Sematext. This tool is basically a QParserPlugin that establishes a connection to Redis and takes data stored in SET, ZRANGE and other Redis data structures in order to build a query. The data fetched from Redis is used by RedisQParser to build the query. Moreover, this plugin provides a highlighter extension which can be used to highlight parts of aliased Solr Redis queries (this will be described in a future post).

Use Case: Social Network

Imagine you have a social network and you want to implement a search solution that can search things like: events, interests, photos, and all your friends’ events, interests, and photos. A naive, Solr-only-based implementation would search over all documents and filter by a “friends” field. This requires denormalization and indexing the full list of friends into each document that belongs to a user. Building a query like this is just searching over documents and adding something like a “friends:1234” clause to the query. It seems simple to implement, but the reality is that this is a terrible solution when you need to update a list of friends, because it requires a modification of each document. So when the number of documents (e.g., photos, events, interests, friends and their items) connected with a user grows, the number of potential updates rises dramatically and each modification of connections between users becomes a nightmare. Imagine a person with 10 photos and 100 friends (all of whom have their own photos, events, interests, etc.). When this person gets their 101st friend, the naive system with flattened data would have to update a lot of documents/rows. As we all know, connections between people in a social network are constantly being created and removed, so such a naive Solr-only system could not really scale.

Social networks also have one very important attribute: the number of connections of a single user is typically not expressed in millions. That number is usually relatively small: tens, hundreds, sometimes thousands. This begs the question: why not carry information about user connections in each query sent to the search engine? That way, instead of sending queries with a “friends:1234” clause, we can simply send queries with multiple user IDs connected by an “OR” operator. When a query has all the information needed to search the entities that belong to a user’s friends, there is no need to store a list of friends in each user’s document. Storing user connections in each query, however, leads to sending rather large queries to the search engine, each of them containing multiple terms with user IDs (e.g., id:5 OR id:10 OR id:100 OR …) connected by a disjunction operator. As the number of terms grows, the query requests become very big. And that’s a problem, because preparing them and sending them to a search engine over the network becomes awkward and slow.
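
To get a feel for the problem, here’s a tiny illustration (with hypothetical numbers, not taken from the plugin’s benchmarks) of how big such a query-side filter gets as the number of connections grows:

# a user with 5,000 connections turns into a filter clause of roughly 50 KB of query text
friend_ids = range(1, 5001)
naive_clause = ' OR '.join('id:%d' % fid for fid in friend_ids)
print(len(naive_clause))  # tens of thousands of characters for a single query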

How Does It Work?

The image below presents how Solr Redis Plugin works.

Read More

Apache Spark Monitoring in SPM

Apache Spark is an open-source, large-scale data processing engine built on top of the Hadoop Distributed File System (HDFS) and enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.  So it’s not surprising the usage of Spark is booming as this Google Trends graph shows.

And while Spark usage has been going through the roof, engineers and DevOps handling Spark have not had a good monitoring tool at their disposal. Well, that is, until now. By adding Spark monitoring to SPM Performance Monitoring, Alerting and Anomaly Detection, Sematext has released the first Spark monitoring product to market and filled a big hole in the Spark ecosystem.

Having just been added — along with other goodies — to the latest SPM release, SPM for Spark monitors all Spark metrics.  It includes alerting, anomaly detection, log correlation, custom dashboards, events graphing, custom metrics, and a ton more.  SPM can be installed On Premises or one can use the Cloud version run by Sematext, in which case the setup takes less than 5 minutes before graphs with performance metrics start appearing in real-time.

Enough with the words – Show me what Spark Monitoring looks like!

Have a look at a few screenshots to see how we graph Spark metrics in SPM.  While we don’t use Spark at Sematext at this time and thus don’t have a live demo to show you, you can check out SPM’s live demo and see some other types of apps we monitor, such as Hadoop, HBase, Cassandra, Kafka, Storm, ZooKeeper, Elasticsearch, Solr, NGINX and NGINX Plus, Apache, MySQL, Redis, Java webapps and generic Java applications, as well as custom metrics.

Screenshot – Spark Executor metrics [click to enlarge]

Spark_screenshot_Executor_3

Screenshot – Spark Worker metrics  [click to enlarge]

Spark_screenshot_Worker_2

And One More Thing…

SPM now works hand-in-hand with Logsene Log Management and Analytics.  This makes the integration of performance metrics, logs, events and anomalies more robust for those of you looking to combine performance monitoring and centralized log management in one place — not only knowing that SOMETHING affected performance of your Spark cluster when you look at your performance metrics graphs or get an alert, but also exactly WHAT happened with the cluster by having immediate access to all relevant Spark event logs right there!

Take a Test Drive — It’s Easy and Free to Get Started

Like what you see here?  Sound like something that could benefit your organization?  Then try SPM and/or Logsene for Free for 30 days by registering here.  There’s no commitment and no credit card required.

Announcement: Percentiles added to SPM

In the spirit of continuous improvement, we are happy to announce that percentiles have recently been added to SPM’s arsenal of measurement tools.  Percentiles provide more accurate statistics than averages, and users are able to see 50%, 95% and 99% percentiles for specific metrics and set both regular threshold-based as well as anomaly detection alerts.  We will go more into the details about how the percentiles are computed in another post, but for now we want to put the word out and show some of the related graphs — click on them to enlarge them.  Enjoy!

Elasticsearch – Request Rate and Latency

pecentiles_es

Garbage Collectors Time

percentiles_gc

Kafka – Flush Time

percentiles_kafka_1

Kafka – Fetch/Produce Latency 1

percentiles_kafka_2

Kafka – Fetch/Produce Latency 2

percentiles_kafka_3

Solr Req. Rate and Latency 1

percentile_solr

Solr – Req. Rate and Latency 2

percentiles_solr_2

If you enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, Storm, we’re hiring planet-wide!

Rsyslog 8.1 Elasticsearch Output Performance

Recently, the first rsyslog version 8 release was announced. Major changes in its core should give outputs better performance, and the one for Elasticsearch should benefit a lot. Since we’re using rsyslog and Elasticsearch in Sematext’s own Logsene, we had to take the new version for a spin.

The Weapon and the Target

For testing, we used a good old i3 laptop with 8GB of RAM. We generated 20 million logs, sent them to rsyslog via TCP and from there to Elasticsearch in the Logstash format, so they can be explored with Kibana. The objective was to stuff as many events per second into Elasticsearch as possible.

Rsyslog Architecture Overview

In order to tweak rsyslog effectively, one needs to understand its architecture, which is not that obvious (although there’s an ongoing effort to improve the documentation). The gist of its architecture is represented in the figure below.

  • you have input modules taking messages (from files, network, journal, etc.) and pushing them to a main queue
  • one or more main queue threads take those events and parse them. By default, they parse syslog formats, but you can configure rsyslog to use message modifier modules to do additional parsing (e.g. CEE-formatted JSON messages). Either way, this parsing generates structured events, made out of properties
  • after parsing, the main queue threads push events to the action queue. Or queues, if there are multiple actions and you want to fan-out
  • for each defined action, one or more action queue threads take properties from events according to templates, and make messages that will be sent to the destination. In Elasticsearch’s case, a template should make Elasticsearch JSON documents, and the destination would be the REST API endpoint
rsyslog message flow

There are two more things to say about rsyslog’s architecture before we move on to the actual test:

  • you can have multiple independent flows (like the one in the figure above) in the same rsyslog process by using rulesets. Think of rulesets as swim-lanes. They’re useful for example when you want to process local logs and remote logs in a completely separate manner
  • queues can be in-memory, on disk, or a combination called disk-assisted. Here, we’ll use in-memory because they’re the fastest. For more information about how queues work, take a look here

Configuration

To generate messages, we used tcpflood, a small and light tool that’s part of rsyslog’s testbench. It generates messages and sends them over to the local syslog via TCP.

Rsyslog received those messages with the imtcp input module, queued them and forwarded them to Elasticsearch 0.90.7, which was also installed locally. We also tried Elasticsearch 1.0 Beta 1 and got the same results (see below).

The flow of messages in this test is represented in the following figure:

Flow of messages in this test

The actual rsyslog config is listed below. It can be tuned further (for example by using the multithreaded imptcp input module), but we didn’t get significant improvements.

module(load="imtcp")   # TCP input module
module(load="omelasticsearch") # Elasticsearch output module

input(type="imtcp" port="13514")  # where to listen for TCP messages

main_queue(
    queue.size="1000000" # capacity of the main queue
    queue.dequeuebatchsize="1000" # process messages in batches of 1000
    queue.workerthreads="2" # 2 threads for the main queue
)

# template to generate JSON documents for Elasticsearch
template(name="plain-syslog"
         type="list") {
           constant(value="{")
             constant(value=""@timestamp":"")      property(name="timereported" dateFormat="rfc3339")
             constant(value="","host":"")        property(name="hostname")
             constant(value="","severity":"")    property(name="syslogseverity-text")
             constant(value="","facility":"")    property(name="syslogfacility-text")
             constant(value="","syslogtag":"")   property(name="syslogtag" format="json")
             constant(value="","message":"")    property(name="msg" format="json")
             constant(value=""}")
         }

action(type="omelasticsearch"
       template="plain-syslog"  # use the template defined earlier
       searchIndex="test-index"
       bulkmode="on"
       queue.dequeuebatchsize="5000"   # ES bulk size
       queue.size="100000"
       queue.workerthreads="5"
       action.resumeretrycount="-1"  # retry indefinitely if ES is unreachable
)

You can see from the configuration that:

  • both main and action queues have a defined size in number of messages
  • both have a number of threads that deliver messages to the next step. The action needs more because it has to wait for Elasticsearch to reply
  • moving of messages from the queues happens in batches. For the Elasticsearch output, the batch of messages is sent through the Bulk API, which makes queue.dequeuebatchsize effectively the bulk size

Results

We started with default Elasticsearch settings. Then we tuned them to leave rsyslog with a more significant slice of the CPU. We measured the indexing rate with SPM. Here are the average results over 20 million indexed events:

  • with default Elasticsearch settings, we got 8,000 events per second
  • after setting Elasticsearch up in a more production-like way (5 second refresh interval, increased index buffer size, translog thresholds, etc. – see the sketch after this list), the throughput went up to an average of 20,000 events per second
  • in the end, we went berserk and used in-memory indices, updated the mapping to disable any storing or indexing for any field, to have Elasticsearch do as little work as possible and make room for rsyslog. Got an average of 30,000 events per second. In this scenario, rsyslog was using between 1 and 1.5 of the 4 virtual CPU cores, with tcpflood using 0.5 and Elasticsearch using from 2 to 2.5
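
As a reference for the production-like tuning mentioned above, here’s a minimal sketch of changing just the refresh interval through the index settings API. The index buffer size is a node-level setting and the translog thresholds vary by Elasticsearch version, so they’re left out:

import json, requests

# relax the refresh interval so Elasticsearch spends less time refreshing and more time indexing
requests.put('http://localhost:9200/test-index/_settings',
             data=json.dumps({"index": {"refresh_interval": "5s"}}))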

Conclusion

20K EPS on a low-end machine with production-like configuration means Elasticsearch is quick at indexing. This is very good for logs, where you typically have lots of messages being generated, compared to how often you search.

If you need some tool to ship your logs to Elasticsearch with minimum overhead, rsyslog with its new version 8 may well be your best bet.


NOTE: We’re hiring, so if you find this interesting, please check us out. Search, logging and performance is our thing!

Video: Using Solr for Logs with Rsyslog, Flume, Fluentd and Logstash

A while ago we published the slides from our talk at Lucene Revolution about using Solr for indexing and searching logs. This topic is of special interest for us, since we’ve released Logsene and we’re also offering consulting services for logging infrastructure. If you’re also into working with search engines or logs, please note that we’re hiring worldwide.

The video for that talk is now available, and you can watch it below. The talk is made of three parts:

  • one that discusses the general concepts of what a log is, structured logging and indexing logs in general, whether it’s Solr or Elasticsearch
  • one that shows how to use existing tools to send logs to Solr: Rsyslog and Fluentd to send structured events (yes, structured syslog!); Apache Flume and Logstash to take unstructured data, make it structured via Morphlines and Grok, and then send it to Solr
  • one that shows how to optimize Solr’s performance for handling logs, from tuning the commit frequency and merge factor to using time-based collections with aliases