Hiring: Elasticsearch Engineer – Instructor – Consultant

Sematext is hiring! More specifically, we are looking for people with Elasticsearch skills to join our Professional Services team. Our Elasticsearch Professional Services include:

  • Consulting
  • Production Support
  • Training

We do most of our consulting work remotely, but occasionally a short on-site visit is needed. Our Production Support already benefits from our team being distributed, but we are looking to strengthen our presence in the western hemisphere for better timezone coverage. We run public Elasticsearch classes several times a year and are looking to run them more frequently. We also offer on-site training, which most often takes place in North America and Europe.

We regularly work with some of the world’s biggest companies and are looking for teammates to join us as Elasticsearch support engineers and consultants, as well as instructors (aka trainers).

Because you can’t really be, become, and remain an Elasticsearch expert unless you work with Elasticsearch, you’ll also get to apply and improve your expertise by working on our products. One is Logsene, a log management solution with Elasticsearch under the hood, an Elasticsearch API, and Kibana integrated in the UI; the other is SPM, an application performance monitoring solution whose subsystems also include Elasticsearch, among other things.

We’re small, young, completely self-funded, distributed, and diverse; we contribute to open source, write books, speak at conferences, and so on. Because we have both a services side and multiple products, there is a lot of variety and there are opportunities to contribute in different areas of our business and grow in multiple directions, both up and “sideways”.

If you’d like to join us, get in touch. See our Jobs page for other openings.


Elasticsearch Training, San Francisco & New York, October

If you are using Elasticsearch and are looking for Elasticsearch training to quickly improve your Elastic Stack skills, we’re running several Elasticsearch classes this October in San Francisco and New York.

All classes are also available virtually. This means you get to participate in the class, see the whiteboard, and see and hear the instructor as well as the other attendees – and they get to see and hear you – all without having to travel.

Have two people attend the training from the same company? The second one gets 25% off.

To see full course details, the complete outline, and information about the instructor, click on the class names below.

San Francisco:

New York City:

All classes include breakfast and lunch. If you have special dietary needs, please let us know ahead of time. If you have any questions about any of the classes you can use our live chat (see bottom-right of the page), email us at training@sematext.com or call us at 1-347-480-1610.


Elasticsearch / Elastic Stack Training – NYC June 13-16

Next month, June 13-16, 2016, we will be running three Elastic Stack (aka ELK Stack) classes in New York City:

  1. June 13 & 14: Elasticsearch for Developers Training Workshop
  2. June 15: Elasticsearch Operations Training Workshop
  3. June 16: Elasticsearch for Logging Training Workshop

All classes cover Elasticsearch 2.x as well as Elasticsearch 5.x!

You can see the complete course outlines under Training Overview.  All three classes include lots of valuable hands-on exercises.  Be prepared to learn a lot!

Cost:

  • 2-day course: $1,200 early bird rate (valid through June 1) and $1,500 afterwards.
  • 1-day course: $700 early bird rate (valid through June 1) and $800 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:
462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.


Scalable and Flexible Elasticsearch Reindexing via rsyslog

Earlier on, we posted a recipe on reindexing data from within an Elasticsearch 2.3+ cluster. But this doesn’t work if you want to reindex in a different cluster or if your Elasticsearch is older than 2.3. Or both, when you’re trying to migrate from 1.x to 2.x or later.

For such cases, we posted a Logstash reindexing recipe. However, Logstash can sometimes become a bottleneck, so we needed something faster for indexing lots of data. We turned to rsyslog, a log shipper with performance as its #1 feature.

The plan

As rsyslog doesn’t have an Elasticsearch input like Logstash does, we’ve used an external application to scroll through Elasticsearch documents and push them to rsyslog via TCP. The flow would be:

[Figure: rsyslog to Elasticsearch reindex flow]

This is an easy way to extend rsyslog, using whichever language you’re comfortable with, to support more inputs. Here, we piggyback on the TCP input. You can do a similar job with filters/parsers – you can find some examples here – by piggybacking on the mmexternal module, which uses stdin and stdout for communication. The same is possible for outputs, normally added via the omprog module: we did this to add a Solr output and one for SPM custom metrics.

The custom script in question doesn’t have to be multi-threaded; you can simply spin up more copies of it, each scrolling a different index (we show an example of this after the script below). In this particular case, using two scripts gave us slightly better throughput, saturating the network:

[Figure: rsyslog to Elasticsearch reindex flow with multiple scripts]

Writing the custom script

Before starting to write the script, one needs to know what the messages sent to rsyslog should look like. To be able to index data, rsyslog needs an index name, a type name and, optionally, an ID. In this particular case, we were dealing with logs, so the ID wasn’t necessary.

With this in mind, I see a number of ways of sending data to rsyslog:

  • one big JSON per line. One can use mmnormalize to parse that JSON, which then allows rsyslog to use values from within it as the index name, type name, and so on
  • for each line, begin with the bits of “extra data” (like index and type names), then append the JSON document that you want to reindex. Again, you can use mmnormalize to parse, but this time you can simply trust that the last part is JSON and send it to Elasticsearch directly, without parsing it
  • if you only need to pass two variables (index and type name, in this case), you can piggyback on the vague spec of RFC3164 syslog and send something like
    destination_index document_type:{"original": "document"}
    

This last option will parse the provided index name into the hostname variable, the type into syslogtag and the original document into msg. A bit hacky, I know, but quite convenient (it makes the rsyslog configuration straightforward) and very fast, since the RFC3164 parser is very quick and it runs on all messages anyway. There’s no need for mmnormalize, unless you want to change the document in-flight with rsyslog.
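
For example, if we were reindexing logs into a hypothetical logstash_reindexed index with a logs type, each line sent over TCP would look like this:

    logstash_reindexed logs:{"@timestamp":"2016-05-04T10:00:00Z","message":"hello world"}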

Below you can find the Python code that can scan through existing documents in an index (or index pattern, like logstash_2016.05.*) and push them to rsyslog via TCP. You’ll need the Python Elasticsearch client (pip install elasticsearch) and you’d run it like this:

python elasticsearch_to_rsyslog.py source_index destination_index

The script being:

from elasticsearch import Elasticsearch
import json, socket, sys

source_cluster = ['server1', 'server2']
rsyslog_address = '127.0.0.1'
rsyslog_port = 5514

es = Elasticsearch(source_cluster,
      retry_on_timeout=True,
      max_retries=10)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((rsyslog_address, rsyslog_port))


# open a scroll over the source index; with search_type=scan the first response has no hits
result = es.search(index=sys.argv[1], scroll='1m', search_type='scan', size=500)

while True:
  # fetch the next batch of documents from the scroll
  result = es.scroll(scroll_id=result['_scroll_id'], scroll='1m')
  if not result['hits']['hits']:
    break
  for hit in result['hits']['hits']:
    # format expected by rsyslog: destination_index document_type:{original JSON document}
    s.send(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(hit["_source"]) + '\n')

s.close()
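
As mentioned above, the script is single-threaded, so to saturate the network you can simply run several copies in parallel, each scrolling a different index or index pattern. A minimal sketch, with hypothetical index names:

python elasticsearch_to_rsyslog.py logstash_2016.05.01 logstash_reindexed &
python elasticsearch_to_rsyslog.py logstash_2016.05.02 logstash_reindexed &
wait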

If you need to modify messages, you can parse them in rsyslog via mmjsonparse and then add/remove fields through rsyslog’s scripting language. However, I couldn’t find a nice way to change field names – for example, to remove the dots that are forbidden since Elasticsearch 2.0 – so I did that in the Python script:

def de_dot(my_dict):
  # recursively replace dots in field names, e.g. {"a.b": 1} becomes {"a_b": 1}
  for key in list(my_dict.keys()):
    value = my_dict.pop(key)
    if type(value) is dict:
      value = de_dot(value)
    my_dict[key.replace('.', '_')] = value
  return my_dict

And then the “send” line becomes:

s.send(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(de_dot(hit["_source"]))+'\n')

Configuring rsyslog

The first step here is to make sure you have the latest rsyslog, though the config below works with versions all the way back to 7.x (which can be found in most Linux distributions). You just need to make sure the rsyslog-elasticsearch package is installed, because we need the Elasticsearch output module.
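
For example, on Debian/Ubuntu or RHEL/CentOS systems (assuming your distribution or the official rsyslog repositories provide these package names), the install would look something like:

sudo apt-get install rsyslog rsyslog-elasticsearch   # Debian/Ubuntu
sudo yum install rsyslog rsyslog-elasticsearch       # RHEL/CentOS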

# messages bigger than this are truncated
$maxMessageSize 10000000  # ~10MB

# load the TCP input and the ES output modules
module(load="imtcp")
module(load="omelasticsearch")

main_queue(
  # buffer up to 1M messages in memory
  queue.size="1000000"
  # these threads process messages and send them to Elasticsearch
  queue.workerThreads="4"
  # rsyslog processes messages in batches to avoid queue contention
  # this will also be the Elasticsearch bulk size
  queue.dequeueBatchSize="4000"
)

# we use templates to specify what the data sent to Elasticsearch looks like
template(name="document" type="list"){
  # the "msg" variable contains the document
  property(name="msg")
}
template(name="index" type="list"){
  # "hostname" has the index name
  property(name="hostname")
}
template(name="type" type="list"){
  # "syslogtag" has the type name
  property(name="syslogtag")
}

# start the TCP listener on the port we pointed the Python script to
input(type="imtcp" port="5514")

# sending data to Elasticsearch, using the templates defined earlier
action(type="omelasticsearch"
  template="document"
  dynSearchIndex="on" searchIndex="index"
  dynSearchType="on" searchType="type"
  server="localhost"  # destination Elasticsearch host
  serverport="9200"   # and port
  bulkmode="on"  # use the bulk API
  action.resumeretrycount="-1"  # retry indefinitely if Elasticsearch is unreachable
)

This configuration doesn’t have to disturb your local syslog setup (i.e. you don’t need to touch /etc/rsyslog.conf). You can put it someplace else and run a separate rsyslog process:

rsyslogd -i /var/run/rsyslog_reindexer.pid -f /home/me/rsyslog_reindexer.conf

And that’s it! With rsyslog started, you can start the Python script(s) and do the reindexing.

If you need any help with Elasticsearch, rsyslog, Logstash and the like, check out our Elasticsearch consulting, Logging consulting, Elasticsearch production support and Elasticsearch and Logging training info.


Elasticsearch Ingest Node vs Logstash Performance

Starting from Elasticsearch 5.0, you’ll be able to define pipelines within it that process your data, in the same way you’d normally do it with something like Logstash. We decided to take it for a spin and see how this new functionality (called Ingest) compares with Logstash filters in both performance and functionality.

Specifically, we tested the grok processor on Apache common logs (we love logs here), which can be parsed with a single rule, and on CISCO ASA firewall logs, for which we have 23 rules. This way we could also check how both Ingest and Logstash scale when you start adding more rules.

Baseline performance

To get a baseline, we pushed logs with Filebeat 5.0alpha1 directly to Elasticsearch, without parsing them in any way. We used an AWS c3.large for Filebeat (2 vCPU) and a c3.xlarge for Elasticsearch (4 vCPU). We also installed SPM to monitor Elasticsearch’s performance.

It turned out that network was the bottleneck, which is why pushing raw logs doesn’t saturate the CPU:
[Chart: CPU usage when shipping raw logs]

Even though we got a healthy throughput rate of 12-14K EPS:
[Chart: throughput when shipping raw logs]

But raw, unparsed logs are rarely useful. Ideally, you’d log in JSON and push directly to Elasticsearch. Conveniently, Filebeat can parse JSON since 5.0. That said, throughput dropped to about 4K EPS because JSON logs are bigger and saturate the network:
[Chart: throughput when shipping JSON logs]

CPU dropped as well, but not that much because now Elasticsearch has to do more work (more fields to index):
[Chart: CPU usage when shipping JSON logs]

This 4K EPS throughput/40 percent CPU ratio is the most efficient way to send logs to Elasticsearch – if you can log in JSON. If you can’t, you’ll need to parse them. So we added another c3.xl instance (4 vCPUs) to do the parsing, first with Logstash, then with a separate Elasticsearch dedicated Ingest node.

Logstash

With Logstash 5.0 in place, we pointed Filebeat to it, while tailing the raw Apache logs file. On the Logstash side, we have a beats listener, a grok filter and an Elasticsearch output:

input {
  beats {
    port => 5044
  }
}

filter {
   grok {
     match => ["message", "%{COMMONAPACHELOG}%{GREEDYDATA:additional_fields}"]
   }
}

output {
  elasticsearch {
    hosts => "10.154.238.233:9200"
    workers => 4
  }
}

The default of 2 pipeline workers seemed enough, but we specified more output workers to make up for the time each of them waits for Elasticsearch to reply. That said, the network was again the bottleneck, so throughput was capped at 4K EPS, like with JSON logs:
[Chart: Logstash Apache logs throughput]
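
If you want to experiment with the number of pipeline workers, you can also override it on the command line when starting Logstash. A minimal sketch, assuming the config above is saved as apache.conf (a hypothetical file name):

bin/logstash -f apache.conf -w 2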

Meanwhile, Logstash used just about the same amount of CPU as Elasticsearch, at 40-50%:
[Chart: Logstash Apache logs CPU usage]

Then we parsed CISCO ASA logs. The config looks similar, except there were 23 grok rules instead of one. Logstash handled the load surprisingly well – throughput was again capped by the network, slightly lower than before because JSONs were bigger:
[Chart: Logstash CISCO ASA grok throughput]

While CPU usage only increased to 60-70%:
[Chart: Logstash CISCO ASA CPU usage]

This means the throughput-to-CPU ratio only went down by about 1.5x after adding a lot more rules. However, in both cases Logstash proved pretty heavy, using about the same CPU to parse the data as Elasticsearch used for indexing it. Let’s see if the Ingest node can do better.

Ingest node

We used the same c3.xl instance for the Ingest node tests: we set node.master and node.data to false in its elasticsearch.yml, to make sure it only does grok and nothing else. We also set node.ingest to false on the data node, so it can focus on indexing.

The next step was to define a pipeline that does the grok processing on the Ingest node:

curl -XPOST localhost:9200/_ingest/pipeline/apache?pretty -d '{
  "description": "grok apache logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "pattern": "%{COMMONAPACHELOG}%{GREEDYDATA:additional_fields}"
      }
    }
  ]
}'

Then, to trigger the pipeline for a certain document/bulk, we added the name of the defined pipeline to the HTTP parameters, like pipeline=apache. We used curl this time for indexing, but you can add such parameters in Filebeat, too.
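
For reference, indexing a single document through this pipeline looks roughly like the following (the weblogs index and apache_logs type are just example names):

curl -XPOST 'localhost:9200/weblogs/apache_logs?pipeline=apache' -d '{
  "message": "127.0.0.1 - - [10/May/2016:08:00:00 +0000] \"GET / HTTP/1.1\" 200 2326"
}'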

With Apache logs, the throughput numbers were nothing short of impressive (12-16K EPS):
[Chart: Ingest node Apache logs grok throughput]

This used up all the CPU on the data node, while the ingest node was barely breaking a sweat at 15%:
[Chart: Ingest node Apache logs grok CPU usage]

Because Filebeat only sent raw logs to Elasticsearch (specifically, the dedicated Ingest node), there was less strain on the network. The Ingest node, on the other hand, also acted like a client node, distributing the logs (now parsed) to the appropriate shards, using the node-to-node transport protocol. Overall, the Ingest node provided ~10x better CPU-to-throughput ratio than Logstash.

Things still look better, but not as dramatically, with the CISCO ASA logs. We have multiple sub-types of logs here, and therefore multiple grok rules. With Logstash, you can specify an array of match directives:

grok {
  match => [
   "cisco_message", "%{CISCOFW106001}",
   "cisco_message", "%{CISCOFW106006_106007_106010}",
...

There’s no such thing for Ingest node yet, so you need to define one rule, then use the on_failure block to define another grok rule (effectively saying “if this rule doesn’t match, try that one”), and keep nesting like that until you’re done:

"grok": {
  "field": "cisco_message",
  "pattern": "%{CISCOFW106001}",
  "on_failure": [
    {
      "grok": {
      "field": "cisco_message",
      "pattern": "%{CISCOFW106006_106007_106010}",
      "on_failure": [...

The other problem is performance. Because now there are up to 23 rules to evaluate, throughput goes down to about 10K EPS:
[Chart: Ingest node CISCO ASA grok throughput]

And the CPU bottleneck shifts to the Ingest node:
[Chart: Ingest node CISCO ASA grok CPU usage]

Overall, the throughput-to-CPU ratio of the Ingest node dropped by a factor of 9 compared to the Apache logs scenario.

Conclusions

  • Logstash is easier to configure, at least for now, and performance didn’t deteriorate as much when adding rules
  • Ingest node is lighter across the board. For a single grok rule, it was about 10x faster than Logstash
  • Ingest nodes can also act as “client” nodes
  • Define the grok rules matching most logs first, because both Ingest and Logstash exit the chain on the first match by default

You’ve made it all the way down here? Bravo! If you need any help with Elasticsearch – don’t forget @sematext does Elasticsearch Consulting, Production Support, as well as Elasticsearch Training.

Monitoring rsyslog with Kibana and SPM

A while ago we published this post where we explained how you can get stats about rsyslog, such as the number of messages enqueued, the number of output errors and so on. The point was to send them to Elasticsearch (or Logsene, our logging SaaS, which exposes the Elasticsearch API) in order to analyze them.

This is part 2 of that story, where we share how we process these stats in production. We’ll cover:

  • an updated config, working with Elasticsearch 2.x
  • what Kibana dashboards we have in Logsene to get an overview of what rsyslog is doing
  • how we send some of these metrics to SPM as well, in order to set up alerts on their values: both threshold-based alerts and anomaly detection

Read More


Documents Update By Query with Elasticsearch

SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Just recently, we described how to re-index your Elasticsearch data using the built-in re-index API in Elasticsearch 2.3 (and above). Today, we’ll look at another addition to the upcoming Elasticsearch 2.3+ – the Update by Query API. Yes, you got that right: you will be able to update your documents using a query, without having to do any expensive fetching and processing on the application side.

You know how updates work in Elasticsearch, or in Apache Lucene in general? That’s right – Lucene segments are immutable, so once you update a document, the old version gets marked as deleted in its segment and a new version of the document gets indexed. Of course, Elasticsearch builds some additional processing on top of Lucene, so we can use scripts to update our data, use optimistic locking, etc., but the above picture still holds.

However, some use cases force us to update documents, sometimes a lot of them at once. To update a batch of documents matching a query, we needed to know their identifiers. This is how things used to work; the general principle was:

  1. Run a query
  2. Gather the results (probably using Scroll API if you expect a lot of them)
  3. Update returned documents one by one or use bulk API
  4. Repeat from 1) when in need
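
To make the old approach concrete, here is a rough sketch of those steps with curl, using the same videosearch index that appears later in this post (the type name and document ID in the bulk request are placeholders – in practice you would take them from the search response):

curl -XGET 'localhost:9200/videosearch/_search?scroll=1m&size=500&pretty' -d '{
  "query" : { "term" : { "tags" : "solr" } }
}'

# keep fetching batches with the _scroll_id returned by the previous call
curl -XGET 'localhost:9200/_search/scroll?scroll=1m&scroll_id=SCROLL_ID_FROM_RESPONSE&pretty'

# update the returned documents in batches via the bulk API
curl -XPOST 'localhost:9200/videosearch/_bulk?pretty' -d '
{ "update" : { "_type" : "video", "_id" : "1" } }
{ "script" : { "inline" : "ctx._source.likes += 1" } }
'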

That is finally over: just as Elasticsearch builds its document update features on top of Lucene, starting from version 2.3 we get the ability to run a query and update all documents matching it. Welcome the Update by Query API. 🙂


For the purposes of this blog post we will again use the same small data set that we used when describing the Re-Index API – the one available on our Github account (https://github.com/sematext/berlin-buzzwords-samples/tree/master/2014/sample-documents). After indexing the data we should have 18 documents:

$ curl -XGET 'localhost:9200/videosearch/_search?size=0&pretty'
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 18,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

Let’s assume that we would like to update all the documents that have solr (yes, yes, I know) in the tags field and increment the values stored in their likes field. With the Update by Query API, this is as simple as running the following command:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty' -d '{
 "query" : {
  "term" : {
   "tags" : "solr"
  }
 },
 "script" : {
  "inline" : "ctx._source.likes += num_likes",
  "params" : {
   "num_likes" : 1
  }
 }
}'

As you can see, this was easy. We provided a simple term query and included a script that increments the data. The whole request was sent to the _update_by_query REST endpoint of the index we are interested in.

The Elasticsearch response for the above request, on our example data set, should be similar to the following (don’t forget to enable inline scripting by adding script.inline: on to elasticsearch.yml):

{
  "took" : 60,
  "timed_out" : false,
  "total" : 11,
  "updated" : 11,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : 0,
  "failures" : [ ]
}

The response Elasticsearch returns tells us about the number of updated documents, the number of batches that were created, and about conflicts and retries. Finally, we have information on failures.

Is there something that we can control when using the Update by Query API? Again, the answer is yes. We can control the language of the script, the write consistency, replication (synchronous or asynchronous), routing, timeout, and the response. For example, to get information about all processed documents we could use the following request:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&response=all' -d '{
 "query" : {
  "term" : {
   "tags" : "solr"
  }
 },
 "script" : {
  "inline" : "ctx._source.likes += num_likes",
  "params" : {
   "num_likes" : 1
  },
  "lang" : "groovy"
 }
}'

Or we can control consistency and timeout:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&consistency=one&timeout=1m' -d '{
 "query" : {
  "term" : {
   "tags" : "solr"
  }
 },
 "script" : {
  "inline" : "ctx._source.likes += num_likes",
  "params" : {
   "num_likes" : 1
  },
  "lang" : "groovy"
 }
}'

Let’s take a second to look at what the response parameter does. It controls which bulk response items are included in the response of the command. The possible values are:

  • none – the default value, which means that no response items will be returned,
  • failed – only information about documents that failed to be updated will be returned,
  • all – information about all processed documents will be returned. Please remember that this option can lead to very large responses when your update by query request processes a lot of data, which may result in high memory consumption or even out-of-memory situations.

What’s next?

Of course, the Update by Query API and Re-index API that we recently wrote about are nice, but what if the update request execution takes a very long time? It would be nice to be able to control or even cancel its execution, wouldn’t it? Well, we have good news – this is coming soon, probably in the next major version of Elasticsearch – see Github issue #15117.

If you need any help with Elasticsearch, check out our Elasticsearch Consulting, Elasticsearch Production Support, and Elasticsearch Training info.


Reindexing Data with Elasticsearch

SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Sooner or later, you’ll run into the problem of reindexing the data in your Elasticsearch instances. When we do Elasticsearch consulting for clients, we always look at whether they have some way to efficiently reindex previously indexed data. The reasons for reindexing vary – from data type changes and analysis changes to the introduction of new fields that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Up to Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing of data with Logstash. However, today we would like to look at the new functionality that will be added to Elasticsearch 2.3 – the re-index API.
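
As a quick preview, the simplest form of the new API is a single call that copies documents from one index to another – a minimal sketch with hypothetical index names:

curl -XPOST 'localhost:9200/_reindex?pretty' -d '{
  "source" : { "index" : "old_videosearch" },
  "dest" : { "index" : "new_videosearch" }
}'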

Read More