Sematext Elastic Stack Training

Elasticsearch / Elastic Stack Training – NYC June 13-16

Next month, June 13-16, 2016, we will be running three Elastic Stack (aka ELK Stack) classes in New York City:

  1. June 13 & 14: Elasticsearch for Developers Training Workshop
  2. June 15: Elasticsearch Operations Training Workshop
  3. June 16: Elasticsearch for Logging Training Workshop

All classes cover Elasticsearch 2.x as well as Elasticsearch 5.x!

You can see the complete course outlines under Training Overview.  All three classes include lots of valuable hands-on exercises.  Be prepared to learn a lot!

Cost:

  • 2-day course: $1,200 early bird rate (valid through June 1) and $1,500 afterwards.
  • 1-day course: $700 early bird rate (valid through June 1) and $800 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:
462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.


Scalable and Flexible Elasticsearch Reindexing via rsyslog

Earlier on, we posted a recipe on reindexing data from within an Elasticsearch 2.3+ cluster. But this doesn’t work if you want to reindex in a different cluster or if your Elasticsearch is older than 2.3. Or both, when you’re trying to migrate from 1.x to 2.x or later.

For such cases, we posted a Logstash reindexing recipe. However, Logstash can sometimes become a bottleneck, so we needed something faster for indexing lots of data. We turned to rsyslog, a log shipper with performance as its #1 feature.

The plan

As rsyslog doesn’t have an Elasticsearch input like Logstash does, we’ve used an external application to scroll through Elasticsearch documents and push them to rsyslog via TCP. The flow would be:

rsyslog to Elasticsearch reindex flow

This is an easy way to extend rsyslog, using whichever language you’re comfortable with, to support more inputs. Here, we piggyback on the TCP input. You can do a similar job with filters/parsers – you can find some examples here – by piggybacking the mmexternal module, which uses stdin and stdout for communication. The same is possible for outputs, normally added via the omprog module: we did this to add a Solr output and one for SPM custom metrics.

The custom script in question doesn’t have to be multi-threaded; you can simply spin up more instances of it, each scrolling a different index. In this particular case, using two scripts gave us slightly better throughput, saturating the network:

rsyslog to Elasticsearch reindex flow multiple scripts

Writing the custom script

Before starting to write the script, one needs to know what the messages sent to rsyslog should look like. To be able to index data, rsyslog will need an index name, a type name and optionally an ID. In this particular case, we were dealing with logs, so the ID wasn’t necessary.

With this in mind, I see a number of ways of sending data to rsyslog:

  • one big JSON per line. One can use mmnormalize to parse that JSON, which then allows rsyslog to use values from within it as the index name, type name, and so on
  • for each line, begin with the bits of “extra data” (like index and type names), then put the JSON document that you want to reindex. Again, you can use mmnormalize to parse, but this time you can simply trust that the last part is the JSON document and send it to Elasticsearch directly, without needing to parse it
  • if you only need to pass two variables (index and type name, in this case), you can piggyback on the vague spec of RFC3164 syslog and send something like
    destination_index document_type:{"original": "document"}
    

With this last option, rsyslog’s RFC3164 parser will put the provided index name into the hostname variable, the type into syslogtag and the original document into msg. A bit hacky, I know, but quite convenient (it keeps the rsyslog configuration straightforward) and very fast, since the RFC3164 parser is very quick and it runs on all messages anyway. No need for mmnormalize, unless you want to change the document in-flight with rsyslog.

Below you can find the Python code that can scan through existing documents in an index (or index pattern, like logstash_2016.05.*) and push them to rsyslog via TCP. You’ll need the Python Elasticsearch client (pip install elasticsearch) and you’d run it like this:

python elasticsearch_to_rsyslog.py source_index destination_index

The script being:

from elasticsearch import Elasticsearch
import json, socket, sys

source_cluster = ['server1', 'server2']
rsyslog_address = '127.0.0.1'
rsyslog_port = 5514

es = Elasticsearch(source_cluster,
      retry_on_timeout=True,
      max_retries=10)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((rsyslog_address, rsyslog_port))


result = es.search(index=sys.argv[1], scroll='1m', search_type='scan', size=500)

while True:
  # fetch the next batch and stop when the scroll runs dry
  result = es.scroll(scroll_id=result['_scroll_id'], scroll='1m')
  if not result['hits']['hits']:
    break
  for hit in result['hits']['hits']:
    s.send(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(hit["_source"]) + '\n')

s.close()

If you need to modify messages, you can parse them in rsyslog via mmjsonparse and then add/remove fields through rsyslog’s scripting language. However, I couldn’t find a nice way to change field names – for example, to remove the dots that are forbidden since Elasticsearch 2.0 – so I did that in the Python script:

def de_dot(my_dict):
  # iterate over a copy of the items so keys can be renamed safely while looping
  for key, value in list(my_dict.items()):
    if type(value) is dict:
      value = de_dot(value)
    if '.' in key:
      del my_dict[key]
      key = key.replace('.', '_')
    my_dict[key] = value
  return my_dict

And then the “send” line becomes:

s.send(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(de_dot(hit["_source"]))+'\n')

Configuring rsyslog

The first step here is to make sure you have the latest rsyslog, though the config below works with versions all the way back to 7.x (which can be found in most Linux distributions). You just need to make sure the rsyslog-elasticsearch package is installed, because we need the Elasticsearch output module.

# messages bigger than this are truncated
$maxMessageSize 10000000  # ~10MB

# load the TCP input and the ES output modules
module(load="imtcp")
module(load="omelasticsearch")

main_queue(
  # buffer up to 1M messages in memory
  queue.size="1000000"
  # these threads process messages and send them to Elasticsearch
  queue.workerThreads="4"
  # rsyslog processes messages in batches to avoid queue contention
  # this will also be the Elasticsearch bulk size
  queue.dequeueBatchSize="4000"
)

# we use templates to specify what the data sent to Elasticsearch looks like
template(name="document" type="list"){
  # the "msg" variable contains the document
  property(name="msg")
}
template(name="index" type="list"){
  # "hostname" has the index name
  property(name="hostname")
}
template(name="type" type="list"){
  # "syslogtag" has the type name
  property(name="syslogtag")
}

# start the TCP listener on the port we pointed the Python script to
input(type="imtcp" port="5514")

# sending data to Elasticsearch, using the templates defined earlier
action(type="omelasticsearch"
  template="document"
  dynSearchIndex="on" searchIndex="index"
  dynSearchType="on" searchType="type"
  server="localhost"  # destination Elasticsearch host
  serverport="9200"   # and port
  bulkmode="on"  # use the bulk API
  action.resumeretrycount="-1"  # retry indefinitely if Elasticsearch is unreachable
)

This configuration doesn’t have to disturb your local syslog (i.e. it doesn’t need to replace /etc/rsyslog.conf). You can put it somewhere else and run a separate rsyslog process:

rsyslogd -i /var/run/rsyslog_reindexer.pid -f /home/me/rsyslog_reindexer.conf

And that’s it! With rsyslog started, you can start the Python script(s) and do the reindexing.

If you need any help with Elasticsearch, rsyslog, Logstash and the like, check out our Elasticsearch consulting, Logging consulting, Elasticsearch production support and Elasticsearch and Logging training info.


Elasticsearch Ingest Node vs Logstash Performance

Starting from Elasticsearch 5.0, you’ll be able to define pipelines within it that process your data, in the same way you’d normally do it with something like Logstash. We decided to take it for a spin and see how this new functionality (called Ingest) compares with Logstash filters in both performance and functionality.

Specifically, we tested the grok processor on Apache common logs (we love logs here), which can be parsed with a single rule, and on CISCO ASA firewall logs, for which we have 23 rules. This way we could also check how both Ingest and Logstash scale when you start adding more rules.

Baseline performance

To get a baseline, we pushed logs with Filebeat 5.0alpha1 directly to Elasticsearch, without parsing them in any way. We used an AWS c3.large for Filebeat (2 vCPU) and a c3.xlarge for Elasticsearch (4 vCPU). We also installed SPM to monitor Elasticsearch’s performance.

It turned out that network was the bottleneck, which is why pushing raw logs doesn’t saturate the CPU:
raw logs CPU

Even though we got a healthy throughput rate of 12-14K EPS:
raw logs throughput

But raw, unparsed logs are rarely useful. Ideally, you’d log in JSON and push directly to Elasticsearch. Conveniently, Filebeat can parse JSON since 5.0. That said, throughput dropped to about 4K EPS because JSON logs are bigger and saturate the network:
Throughput of JSON logs

CPU dropped as well, but not that much because now Elasticsearch has to do more work (more fields to index):
JSON logs CPU

This 4K EPS throughput/40 percent CPU ratio is the most efficient way to send logs to Elasticsearch – if you can log in JSON. If you can’t, you’ll need to parse them. So we added another c3.xl instance (4 vCPUs) to do the parsing, first with Logstash, then with a dedicated Elasticsearch Ingest node.

Logstash

With Logstash 5.0 in place, we pointed Filebeat to it, while tailing the raw Apache logs file. On the Logstash side, we have a beats listener, a grok filter and an Elasticsearch output:

input {
  beats {
    port => 5044
  }
}

filter {
   grok {
     match => ["message", "%{COMMONAPACHELOG}%{GREEDYDATA:additional_fields}"]
   }
}

output {
  elasticsearch {
    hosts => "10.154.238.233:9200"
    workers => 4
  }
}

The default number of 2 pipeline workers seemed enough, but we’ve specified more output workers to make up for the time each of them waits for Elasticsearch to reply. That said, network was again the bottleneck so throughput was capped at 4K EPS like with JSON logs:
Logstash apache logs throughput

Meanwhile, Logstash used just about the same amount of CPU as Elasticsearch, at 40-50%:
Logstash apache logs CPU usage

Then we parsed CISCO ASA logs. The config looks similar, except there were 23 grok rules instead of one. Logstash handled the load surprisingly well – throughput was again capped by the network, slightly lower than before because JSONs were bigger:
Logstash CISCO ASA grok throughput

While CPU usage only increased to 60-70%:
Logstash CISCO ASA CPU usage

This means the throughput-to-CPU ratio only went down by about 1.5x after adding a lot more rules. However, in both cases Logstash proved pretty heavy, using about the same CPU to parse the data as Elasticsearch used for indexing it. Let’s see if the Ingest node can do better.

Ingest node

We used the same c3.xl instance for the Ingest node tests: we set node.master and node.data to false in its elasticsearch.yml, to make sure it only does grok and nothing else. We also set node.ingest to false on the data node, so it can focus on indexing.

Next step was to define a pipeline that does the grok processing on the Ingest node:

curl -XPOST localhost:9200/_ingest/pipeline/apache?pretty -d '{
  "description": "grok apache logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "pattern": "%{COMMONAPACHELOG}%{GREEDYDATA:additional_fields}"
      }
    }
  ]
}'

Then, to trigger the pipeline for a certain document/bulk, we added the name of the defined pipeline to the HTTP parameters like pipeline=apache. We used curl this time for indexing, but you can add various parameters in Filebeat, too.
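
As a quick illustration, here’s a minimal sketch in Python (using the requests library) of indexing one raw log line through that pipeline. The host, index and type names are made up for the example; the pipeline=apache URL parameter is the relevant bit:

import json
import requests

# a raw Apache log line wrapped in a "message" field, like Filebeat would send it
doc = {"message": '127.0.0.1 - - [10/May/2016:14:07:01 +0000] "GET /index.html HTTP/1.1" 200 2326'}

# pipeline=apache tells Elasticsearch to run the Ingest pipeline defined above
# before indexing the document (hypothetical index/type names)
resp = requests.post('http://localhost:9200/weblogs-2016.05.10/access/',
                     params={'pipeline': 'apache'},
                     data=json.dumps(doc))
print(resp.json())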

With Apache logs, the throughput numbers were nothing short of impressive (12-16K EPS):
ingest node apache logs grok throughput

This used up all the CPU on the data node, while the ingest node was barely breaking a sweat at 15%:
ingest node grok apache logs CPU usage

Because Filebeat only sent raw logs to Elasticsearch (specifically, the dedicated Ingest node), there was less strain on the network. The Ingest node, on the other hand, also acted like a client node, distributing the logs (now parsed) to the appropriate shards, using the node-to-node transport protocol. Overall, the Ingest node provided ~10x better CPU-to-throughput ratio than Logstash.

Things still look better, but not as dramatic, with CISCO ASA logs. We have multiple sub-types of logs here, and therefore multiple grok rules. With Logstash, you can specify an array of match directives:

grok {
  match => [
   "cisco_message", "%{CISCOFW106001}",
   "cisco_message", "%{CISCOFW106006_106007_106010}",
...

There’s no such thing for the Ingest node yet, so you need to define one rule and then use the on_failure block to define another grok rule (effectively saying “if this rule doesn’t match, try that one”), nesting like that until you’re done (a fuller sketch follows the snippet below):

"grok": {
  "field": "cisco_message",
  "pattern": "%{CISCOFW106001}",
  "on_failure": [
    {
      "grok": {
      "field": "cisco_message",
      "pattern": "%{CISCOFW106006_106007_106010}",
      "on_failure": [...

The other problem is performance. Because now there are up to 23 rules to evaluate, throughput goes down to about 10K EPS:
Ingest node CISCO ASA grok throughput

And the CPU bottleneck shifts to the Ingest node:
Ingest node CISCO ASA grok CPU

Overall, the throughput-to-CPU ratio of the Ingest node dropped by a factor of 9 compared to the Apache logs scenario.

Conclusions

  • Logstash is easier to configure, at least for now, and performance didn’t deteriorate as much when adding rules
  • Ingest node is lighter across the board. For a single grok rule, it was about 10x faster than Logstash
  • Ingest nodes can also act as “client” nodes
  • Define the grok rules matching most logs first, because both Ingest and Logstash exit the chain on the first match by default

You’ve made it all the way down here? Bravo! If you need any help with Elasticsearch – don’t forget @sematext does Elasticsearch Consulting, Production Support, as well as Elasticsearch Training.

Monitoring rsyslog with Kibana and SPM

A while ago we published this post where we explained how you can get stats about rsyslog, such as the number of messages enqueued, the number of output errors and so on. The point was to send them to Elasticsearch (or Logsene, our logging SaaS, which exposes the Elasticsearch API) in order to analyze them.

This is part 2 of that story, where we share how we process these stats in production. We’ll cover:

  • an updated config, working with Elasticsearch 2.x
  • what Kibana dashboards we have in Logsene to get an overview of what rsyslog is doing
  • how we send some of these metrics to SPM as well, in order to set up alerts on their values: both threshold-based alerts and anomaly detection

Read More


Documents Update By Query with Elasticsearch

SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Just recently, we’ve described how to re-index your Elasticsearch data using the built-in re-index API in Elasticsearch 2.3 (and above). Today, we’ll look at another addition to the upcoming Elasticsearch v2.3+ – the Update by Query API. Yes, you got that right, you will be able to update your documents using a query without having to do any expensive fetching and processing on the application side.

You know how updates work in Elasticsearch, or in Apache Lucene in general? Yes, that’s true – Lucene segments are immutable, so once you’ve updated the document, the old one gets marked as deleted in the segment and a new version of the document gets indexed. Of course, Elasticsearch builds some additional processing on top of Lucene, so we can use scripts to update our data, use optimistic locking, etc., but the above picture is still true.

However, some use cases force us to update documents, sometimes a lot of them at once. To update a batch of documents matching a query, we needed to know their identifiers. This is how things used to work, and the general principle (sketched in code after the list) was:

  1. Run a query
  2. Gather the results (probably using Scroll API if you expect a lot of them)
  3. Update returned documents one by one or use the bulk API
  4. Repeat from 1) when needed
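
For illustration, here’s roughly what that old pattern looks like with the official Python client’s scan and bulk helpers – a minimal, non-atomic sketch using the index and field names from the example below:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])

# 1) + 2) run the query and scroll through all matching documents
hits = helpers.scan(es, index='videosearch',
                    query={'query': {'term': {'tags': 'solr'}}})

# 3) turn every hit into a partial-update action and ship them via the bulk API
actions = ({
  '_op_type': 'update',
  '_index': hit['_index'],
  '_type': hit['_type'],
  '_id': hit['_id'],
  'doc': {'likes': hit['_source'].get('likes', 0) + 1}
} for hit in hits)

helpers.bulk(es, actions)

# 4) repeat the whole dance whenever the data needs another update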

That is finally over: just as Elasticsearch builds its document update features on top of Lucene, starting from version 2.3 it gives us the ability to run a query and update all documents matching it. Welcome the Update by Query API. 🙂


For the purposes of this blog post we will again use the same small data set that we used when describing the Re-Index API – the one available on our Github account (https://github.com/sematext/berlin-buzzwords-samples/tree/master/2014/sample-documents). After indexing the data we should have 18 documents:

$ curl -XGET 'localhost:9200/videosearch/_search?size=0&pretty'
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 18,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

Let’s assume that we would like to update all the documents that have solr (yes, yes, I know) in the tags field and increment the values stored in their likes field. With the Update by Query API, this is as simple as running the following code:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty' -d '{
 "query" : {
  "term" : {
   "tags" : "solr"
  }
 },
 "script" : {
  "inline" : "ctx._source.likes += num_likes",
  "params" : {
   "num_likes" : 1
  }
 }
}'

As you can see, this was easy. We’ve provided a simple term query and included a script that increments the data. The whole request was sent to the _update_by_query REST end-point of the index we are interested in.

The response of Elasticsearch for the above request, on our example data set, would be similar to the following one (don’t forget to enable inline scripting by adding script.inline: on to elasticsearch.yml):

{
  "took" : 60,
  "timed_out" : false,
  "total" : 11,
  "updated" : 11,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : 0,
  "failures" : [ ]
}

The response Elasticsearch returns tells us about the number of updated documents, the number of batches that were created, and information about conflicts and retries. Finally, we have the information on failures.

Is there something that we can control when using the Update by Query API? Again, the answer is yes. We can control the language of the script, the write consistency, replication (synchronous or asynchronous), routing, timeout and the response. For example, to get information about all processed documents we could use the following request:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&response=all' -d '{
 "query" : {
  "term" : {
   "tags" : "solr"
  }
 },
 "script" : {
  "inline" : "ctx._source.likes += num_likes",
  "params" : {
   "num_likes" : 1
  },
  "lang" : "groovy"
 }
}'

Or we can control consistency and timeout:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&consistency=one&timeout=1m' -d '{
 "query" : {
  "term" : {
   "tags" : "solr"
  }
 },
 "script" : {
  "inline" : "ctx._source.likes += num_likes",
  "params" : {
   "num_likes" : 1
  },
  "lang" : "groovy"
 }
}'

Let’s take a moment to look at what the response parameter does. It controls which bulk response items to include in the response of the command. The possible values are (a small example follows the list):

  • none – the default value, which means that no response items will be returned,
  • failed – only information about documents that failed to be updated will be returned,
  • all – information about all processed documents will be returned. Please remember that this option can lead to very large responses when your update by query request processes a lot of data. Because of that, you may run into high memory consumption or even out of memory situations.
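
Here is a small sketch of how that could look from Python with the requests library, following this post’s description of the response parameter (which, like the rest of the API, may still change before the final release); host and index names are just examples:

import json
import requests

body = {
  "query": {"term": {"tags": "solr"}},
  "script": {
    "inline": "ctx._source.likes += num_likes",
    "params": {"num_likes": 1},
    "lang": "groovy"
  }
}

# ask only for the failed bulk items back, to keep the response small
resp = requests.post('http://localhost:9200/videosearch/_update_by_query',
                     params={'response': 'failed'},
                     data=json.dumps(body))

result = resp.json()
print('updated:', result.get('updated'))
for failure in result.get('failures', []):
  print(failure)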

What’s next?

Of course, the Update by Query API and Re-index API that we recently wrote about are nice, but what if the update request execution takes a very long time? It would be nice to be able to control or even cancel its execution, wouldn’t it? Well, we have good news – this is coming soon, probably in the next major version of Elasticsearch – see Github issue #15117.

If you need any help with Elasticsearch, check out our Elasticsearch Consulting, Elasticsearch Production Support, and Elasticsearch Training info.


Reindexing Data with Elasticsearch

SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Sooner or later, you’ll run into a problem of reindexing the data of your Elasticsearch instances. When we do Elasticsearch consulting for clients we always look at whether they have some way to efficiently reindex previously indexed data. The reasons for reindexing vary – from data type changes and analysis changes to the introduction of new fields that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Up to Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing of data with Logstash. However, today we would like to look at the new functionality that will be added to Elasticsearch 2.3 – the re-index API.

Read More

Logagent-js – an alternative to Logstash, Filebeat, Fluentd, rsyslog?

What is the easiest way to parse, ship and analyze my web server logs? You should know that I’m a Node.js fan boy and not very thrilled with the idea of running a heavy process like Logstash on my low-memory server hosting my private Ghost blog. I looked into Filebeat, a very lightweight log forwarder written in Go with an impressively low memory footprint of only a few MB, but Filebeat ships only unparsed log lines to Elasticsearch. In other words, it sort of still needs Logstash to parse web server logs, which include many fields and numeric values! Of course, structuring logs is essential for analytics. The setup for rsyslog with Elasticsearch and regex parsers is a bit more time consuming, but very efficient compared to Logstash. Are there any better alternatives – something with a quick setup, well-structured logs and a low memory footprint?

Guess what?  There is! Meet logagent-js – a log parser and shipper with log patterns for a number of popular log formats: various Docker images (including Nginx and Apache), Linux and Mac system logs, Elasticsearch, Redis, Solr, MongoDB and more. Logagent-js detects the log format automatically using the built-in pattern definitions (and also lets you provide your own, custom patterns).

Logagent-js includes a command line tool with default settings for Logsene as the Elasticsearch backend for storing the shipped logs.  Logsene is compatible with the Elasticsearch API, but can do much more, such as role-based access control, account sharing for DevOps teams,  ad-hoc charts in the Logsene UI, alerts on logs, and finally it integrates Kibana to ease the life of everybody dealing with log data!

Now let’s see what I run on my private blog site: logagent-js as a single command to tail, parse and ship logs, all with less than 40 MB of RAM. Compare that to Logstash, which would not even start with just 40 MB of JVM heap.  Logagent-js can be installed as a command line tool with npm, which is included in Node.js (>0.12):

npm i logagent-js -g

Logagent-js needs only the Logsene Token as a parameter to ship logs to Logsene. When running it as a background process or daemon, it makes sense to limit the Node.js memory to 60-100 MB with --max-old-space-size, just in case. Without such a setting, Node.js could consume more memory to improve performance in a long-running process:

node --max-old-space-size=60 /usr/local/bin/logagent -s -t your-logsene-token-here logs/access_log &

You can also run logagent-js as upstart or systemd service, of course.

A few seconds after you start it you’ll see all your logs, parsed and structured into fields, with correct timestamps, numeric fields, etc., all without any additional configuration! A real gift and a huge time saver for busy ops people!

[Screenshot: creating an ad-hoc chart in Logsene]

Charting Logs

Next, let’s create some fancy charts with data from our logs. Logsene has ad-hoc charting functions (look for the little blue chart icons in the above screenshot) that let you draw Pie, Area, Line, Spline, Bar, and other types of charts. Logsene is smart and automatically chooses Pie charts to display distinct values and bar/line charts for numeric values over time.

[Screenshot: ad-hoc charts of top viewed pages and HTTP status codes]

In the above screenshot we see the top viewed pages and the distribution of HTTP status codes.  We were able to generate these charts literally with just a few mouse clicks. The charts use the current query, so we could search for specific URLs and exclude e.g. images, stylesheets or traffic from robots using Logsene’s query language (e.g. ‘NOT css AND NOT jpg AND NOT png AND NOT seoscanners’ or, more simply: -css -jpg -png -seoscanners).

Kibana Dashboards

If you prefer Kibana dashboards then you’ll need more complex Elasticsearch queries to remove stylesheets, JavaScripts or other URLs from the top list. Open Kibana 4 in the Logsene UI and create a visualization to filter specific URLs – a ‘Terms Query’ can use regular expressions in its Exclude and Include filters.

[Screenshot: Kibana Terms visualization with Include/Exclude filters]

This visualization could be saved and added to a Kibana dashboard. If you know Kibana, this takes a few minutes per visualization.  The result is a stored dashboard that can be shared with colleagues who might not know how to create such dashboards.

Alert Me

The final thing I usually do is define alert queries e.g. to get notified about a growing number of HTTP error messages. For my private blog I use e-mail notifications, but Logsene integrates well with PagerDuty, HipChat, Slack or arbitrary WebHooks.

There are even more options like using Grafana with Logsene, or shipping logs automatically when using Docker.

Finally, a few more words about  logagent-js, which I consider a ‘swiss army knife’ for logs.  It integrates seamlessly with Logsene, while at the same time it can also work with other log destinations. It provides what I believe is a good compromise in terms of performance and setup time – I’d say it’s somewhere between rsyslog and logstash.

All tools for log processing require memory for this processing, but looking at the initial memory usage after starting the tools gives you an impression of the minimum resource usage.  Here are some numbers taken from my server:

Contributions to the pattern library for even more log formats are welcome – we are happy to help with additional log formats or input sources besides the existing inputs (standard input, file, Heroku, CloudFoundry and syslog UDP). Feel free to contact me @seti321 or @sematext to get up and running with your special setup!

If you don’t want to run and manage your own Elasticsearch cluster but would like to use Kibana for log and data analysis, then give Logsene a quick try by registering here – we do all the backend heavy lifting so you can focus on what you want to get out of your data and not on infrastructure.  There’s no commitment and no credit card required.  

We are happy to answer questions or receive feedback – please drop us a line or get us @sematext.

Slack Analytics & Search with Elasticsearch, Node.js and React

The Sematext team is highly distributed. We are ex-Skype users who recently switched to Slack for team collaboration. We’ve been happy with Slack’s features and especially its integrations for watching our Github repositories and Jenkins, or receiving SPM or Logsene alerts from our production servers through its ChatOps support. The ability to add custom integrations is really awesome! Being search experts, it is hard for us to accept any limitations in the search functionality of the tools we use. For example, I personally miss the ability to search over all teams and all channels, and I really miss analytics on user activity and channel usage. Elasticsearch has become a popular data store for analytical queries.  What if we could take all Slack messages and index them into Elasticsearch? This would make it possible to perform advanced analytics with Kibana or Grafana, such as finding the top terms used, or the most active users and channels. Finally, a simple mobile web page to access only the indexed data from various Teams and Channels might be handy to have, too.

In this post we’re going to see how to build what we just described.  We’ll use the Slack API, Node.js, React and Elasticsearch in 3 steps:

  • Index Data from Slack
  • Analyse Data from Slack
  • Create a custom Web-App for search

Index Data from Slack

The Slack API provides several ways to access data, for example outgoing webhooks. This looks useful at first; however, it needs a setup per channel or keywords as triggers. Then I discovered a better way – the Node.js Slack Client.  Simply log in with your Slack account and get all Slack messages! I wrote a little Node.js app to dump the relevant information as JSON to the console or to a file.  This JSON output can then be piped to logagent-js, a smart log shipper written in Node.js. I packaged this as “slack-elasticsearch-indexer” so it’s super easy to run:

npm install slack-elasticsearch-indexer
# Set Elasticsearch Server, btw. the Logsene Receiver is the default
export LOGSENE_URL=https://logsene-receiver.sematext.com/_bulk
# 1 - Slack API Token from https://api.slack.com/web
# 2 - Index name or Logsene Token from https://apps.sematext.com
npm start SLACK_WEB_API_TOKEN LOGSENE_TOKEN

The LOGSENE_TOKEN is what you can get from Logsene – the “ELK log management service”.  Using Logsene means you don’t have to bother running your own Elasticsearch, plus the volume of most teams’ Slack data is probably so small that it fits in Logsene’s free plan! 🙂

Once you run the above you should see new Slack Messages on the console.  At the same time the messages will also be sent to Logsene and you will see them in the Logsene UI (or your local Elasticsearch server or cluster) right away.

Analyze Slack Messages in Logsene

Now that our Slack messages are in Logsene we can build our Kibana dashboards to visualize channel utilization, top terms, the chattiest people, and so on.  But … did you know that Logsene comes with a nice ad-hoc charting function? Simply open one of the Slack messages in Logsene and click on the little chart symbol next to the userName and channel fields (see below).

[Screenshot: a Slack message in Logsene with chart icons on the userName and channel fields]

This will very quickly render top users and channels for you:

[Screenshot: pie charts of top Slack users and channels]

Slack Alerting

Imagine a support chat channel – wouldn’t it be nice to be notified when people start mentioning “Error”, “Problems” and “Broken” things increasingly frequently? This is where we can make use of Logsene Alerts and its ability to do anomaly detection. Any triggered alerts can be delivered via email, PagerDuty, Slack, HipChat or WebHooks:

[Screenshot: defining an alert in Logsene]

While Logsene is great for alerts, analytics and Slack message search, as a general ‘data viewer’ the message rendering in Logsene does not show application-specific things like users’ profile pictures, which would allow much faster recognition of user messages. Thus, as our next step, we’ll create a simple Web Client with nice rendering of indexed Slack messages. Let’s see how this can be done very quickly using some cutting-edge Web technology together with Logsene.

Create a Custom Web-App for Search

We recently started using Facebook’s React.js for rendering various UI parts, like the views for Top Database Operations, and we came across a new set of React UI components for Elasticsearch called SearchKit. Thanks to Logsene’s Elasticsearch API, SearchKit works out of the box with Logsene!
After a few lines of CSS and some JavaScript a simple Slack Search UI is born. Check it out!

[Screenshot: the Slack search UI built with SearchKit and React]

You can edit the source code on codepen.io.

You just need to use your Logsene token as the Elasticsearch index name to run this app on your own data. For production we recommend adding a proxy to Elasticsearch (or Logsene) on the server side as described in the SearchKit UI documentation to hide connection details from the client application.

While this post shows how to index your Slack messages in Logsene for the purpose of archiving, searching, and analytics, we hope it also serves as inspiration to build your own custom search application with SearchKit, React, Node.js and Logsene.

If you haven’t used Logsene before, give it a try – you can get a free account and have your logs and other event data in Logsene in no time. Drop us an email or hit us on Twitter with suggestions, questions or comments.