
Container Monitoring: Top Docker Metrics to Watch

Monitoring Docker environments is challenging. Why? Because each container typically runs a single process, has its own environment, uses virtual networks, and has its own way of managing storage. Traditional monitoring solutions take metrics from each server and the applications it runs. These servers and the applications running on them are typically very static, with very long uptimes. Docker deployments are different: a set of containers may run many applications, all sharing the resources of one or more underlying hosts. It’s not uncommon for Docker servers to run thousands of short-lived containers (e.g., for batch jobs) while a set of permanent services runs in parallel. Traditional monitoring tools, built for such static environments, are not suited for such dynamic deployments. Some modern monitoring solutions (e.g. SPM from Sematext), on the other hand, were built with such dynamic systems in mind and even have out-of-the-box reporting for Docker monitoring.

Moreover, container resource sharing calls for stricter enforcement of resource usage limits, an additional issue you must watch carefully. To make appropriate adjustments to resource quotas you need good visibility into any limits containers have reached or errors they have caused. We recommend setting alerts on the defined limits; this way you can adjust limits or resource usage even before errors start happening.


Docker Containers != VMs or Servers. Forget your grandpa’s old monitoring. Use monitoring designed for Docker.


SPM Docker Metrics Overview

Note: All images in this post are from Sematext’s SPM Performance Monitoring tool and its Docker monitoring integration.

Watch Resources of your Docker Hosts

Host CPU

Understanding the CPU utilization of hosts and containers helps one optimize the resource usage of Docker hosts. Container CPU usage can be throttled to prevent a single busy container from slowing down other containers by consuming all available CPU resources. Throttling CPU time is a good way to ensure a minimum of processing power for essential services – it’s like the good old nice levels in Unix/Linux.
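
For example, relative CPU shares or a hard quota can be set when starting a container. A minimal sketch (myservice is a hypothetical image name):

# give the container half the default relative CPU weight (the default is 1024)
docker run -d --cpu-shares=512 myservice

# or enforce a hard cap: 25ms of CPU time per 100ms period, i.e. ~25% of one core
docker run -d --cpu-period=100000 --cpu-quota=25000 myservice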

When resource usage is optimized, high CPU utilization might actually be expected and even desired, and alerts might make sense only when CPU utilization drops (a possible service outage) or stays above some maximum limit (e.g., 85%) for a longer period.


An overutilized Docker host is a sign of trouble.
An underutilized host is a sign you are wasting money.



Host Memory

The total memory used on each Docker host is important to know for current operations and for capacity planning. Dynamic cluster managers like Docker Swarm use the total memory available on the host and the memory requested by containers to decide on which host a new container should ideally be launched. Deployments might fail if a cluster manager is unable to find a host with sufficient resources for the container. That’s why it is important to know both host memory usage and the memory limits of containers. Adjusting the capacity of new cluster nodes according to the footprint of Docker applications can help optimize resource usage.


No, Linux didn’t eat your RAM.
But when buffered || cached memory goes to 0 it’s time to expand the cluster.



 

Host Disk Space

Docker images and containers consume additional disk space. For example, an application image might include a Linux operating system and might have a size of 150-700 MB, depending on the size of the base image and the tools installed in the container. Persistent Docker volumes consume disk space on the host as well. In our experience, watching disk space and using cleanup tools is essential for the continuous operation of Docker hosts.
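
To see where that disk space goes on a given host, the standard Docker CLI is often enough:

# list images with their sizes
docker images
# list all containers, including the size of their writable layers
docker ps -a -s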


Good kids clean up their rooms.
Good Docker ops clean up their disks by removing unused containers & images.



Disk Space on Docker Nodes
Disk space usage on Docker hosts

Because disk space is critical, it makes sense to define alerts for disk space utilization to serve as early warnings and provide enough time to clean up disks or add additional volumes. For example, SPM automatically sets alert rules for disk space usage for you, so you don’t have to remember to do it.

A good practice is to frequently run tasks that clean up the disk by removing unused containers and images.
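
For example, a periodic cleanup job might run something like the following (a sketch – test carefully, since these commands permanently delete stopped containers and dangling images):

# remove all exited containers
docker rm $(docker ps -a -q -f status=exited)
# remove dangling (untagged) images
docker rmi $(docker images -q -f dangling=true)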

Total Number of Running Containers

The current and historical number of containers is an interesting metric for many reasons. For example, it is very handy during deployments and updates to check that everything is running as before.

When cluster managers like Docker Swarm, Mesos, Kubernetes, or CoreOS Fleet automatically schedule containers to run on different hosts using different scheduling policies, the number of containers running on each host can help one verify that the configured scheduling policies work as intended. A stacked bar chart displaying the number of containers on each host and the total number of containers provides a quick visualization of how the cluster manager distributed the containers across the available hosts.
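
For a quick ad-hoc check on a single host, the Docker CLI already provides the raw number:

# count the containers currently running on this host
docker ps -q | wc -l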

Docker Container Count

Container counts per Docker host over time

 


Use anomaly detection, not threshold-based alerts
to catch sudden container migrations that mean trouble.



This metric can have different “patterns” depending on the use case. For example, batch jobs running in containers vs. long-running services commonly result in different container count patterns. A batch job typically starts a container on demand, or starts it periodically, and the container with that job terminates after a relatively short time. In such a scenario one might see a big variation in the number of running containers, resulting in a “spiky” container count metric. On the other hand, long-running services such as web servers or databases typically run until they get re-deployed during software updates. Although scaling mechanisms might increase or decrease the number of containers depending on load, traffic, and other factors, the container count metric will typically be relatively steady because in such cases containers are often added and removed more gradually. Because of that, there is no general pattern we could use for a default Docker alert rule on the number of running containers.

Nevertheless, alerts based on anomaly detection, which detect sudden changes in the total number of containers (or the number on specific hosts) within a short time window, can be very handy for most use cases. Simple threshold-based alerts make sense only when the maximum or minimum number of running containers is known, and in dynamic environments that scale up and down based on external factors, this is often not the case.

Container Metrics

Container metrics are basically the same metrics available for every Linux process, but include limits set via cgroups by Docker, such as limits on CPU or memory usage. Please note that sophisticated monitoring solutions like SPM for Docker are able to aggregate container metrics on different levels: Docker host/cluster node, image name or ID, and container name or ID. Having the ability to do that makes it easy to track resource usage by host, application type (image name), or specific container. In the following examples we might use aggregations on various levels.


Use modern Docker monitoring solutions
to slice & dice by host, node, image or container.
You’ll need that.



Container CPU – Throttled CPU Time

One of the most basic bits of information is how much CPU is being consumed by all containers, by specific images, or by specific containers. A great advantage of using Docker is the capability to limit CPU utilization per container. Of course, you can’t tune and optimize something if you don’t measure it, so monitoring such limits is a prerequisite. Observing the total time that a container’s CPU usage was throttled provides the information one needs to adjust the setting for CPU shares in Docker. Please note that CPU time is throttled only when the host CPU usage is maxed out. As long as the host has spare CPU cycles available for Docker it will not throttle containers’ CPU usage. Therefore, throttled CPU is typically zero, and a spike in this metric is typically a good indication of one or more containers needing more CPU power than the host can provide.
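
Under the hood these throttling numbers come from the kernel’s cgroup accounting. On hosts using the cgroup v1 layout you can inspect a container’s counters directly; note that the exact path below is an assumption and varies between distributions:

cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_periods 520             <- elapsed enforcement intervals (example values)
# nr_throttled 17            <- how many times the container was throttled
# throttled_time 832349000   <- total throttled time, in nanoseconds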

Docker Container CPU usage

Container CPU usage and throttled CPU time

The following screenshot shows containers with a 5% CPU quota, started with the command “docker run --cpu-quota=5000 nginx”. We can clearly see how the throttled CPU time grows until CPU usage levels off at around 5%, the limit enforced by the Docker engine.


Container CPU usage and throttled CPU time with CPU quota of 5%

Container Memory – Fail Counters

It is a good practice to set memory limits for containers. Doing so helps avoid a memory-hungry container taking all available memory and starving all other containers on the same server. Runtime constraints on resources can be defined in the Docker run command. For example, “-m 300M” sets the memory limit for the container to 300 MB. Docker exposes a metric called the container memory fail counter. This counter is increased each time memory allocation fails – that is, each time the pre-set memory limit is hit. Thus, spikes in this metric indicate one or more containers needing more memory than was allocated. If the process in the container terminates because of this error, we might also see out-of-memory events from Docker.
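
For example, one might start a container with a 300 MB limit and then watch its fail counter grow as the limit is hit (the container name is arbitrary, and the cgroup v1 path is an assumption that varies by distribution):

docker run -d -m 300M --name mycontainer nginx
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.failcnt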


 

Docker Memory Fail Counters tell you when containers need more memory.
Alerts are your friends.



A spike in memory fail counters is a critical event, and alerting on the memory fail counter is very helpful for detecting wrong memory limit settings or discovering containers that try to consume more memory than expected.


Container Memory Usage

Different applications have different memory footprints. Knowing the memory footprint of the application containers is important for having a stable environment. Container memory limits ensure that applications perform well without using too much memory, which could affect other containers on the same host. The best practice is to tune memory settings in a few iterations:

  1. Monitor memory usage of the application container (a quick docker stats example follows below)
  2. Set memory limits according to the observations
  3. Continue monitoring memory usage, memory fail counters, and out-of-memory events. If OOM events happen, the container memory limits may need to be increased, or debugging is required to find the reason for the high memory consumption.
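
A simple way to handle step 1 from the shell is docker stats, which reports per-container memory (and CPU) usage:

# one-shot snapshot of CPU %, memory usage/limit and memory % per container
docker stats --no-stream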

Docker Container Memory Usage

Container memory usage

Container Swap

Like the memory of any other process, a container’s memory can be swapped to disk. For applications like Elasticsearch or Solr one often finds instructions to deactivate swap on the Linux host – but if you run such applications in Docker containers it might be sufficient just to set “--memory-swap=-1” in the Docker run command!


Don’t like to see your container swapping?
Use --memory-swap=-1 in the Docker run command and be done with it!



Docker Container Swap

Container swap, memory pages, and swap rate

Container Disk I/O

In Docker, multiple applications use the same resources concurrently. Thus, watching disk I/O helps one define limits for specific applications and give higher throughput to critical applications like data stores or web servers, while throttling disk I/O for batch operations. For example, the command “docker run -it --device-write-bps /dev/sda:1mb mybatchjob” would limit the container’s disk writes to a maximum of 1 MB/s.

Docker Container I/O

Container I/O throughput


To keep a Docker container from eating all your disk I/O use
e.g. --device-write-bps /dev/sda:1mb



Container Network Metrics

Networking for containers can be very challenging. By default all containers on a host share a network, or containers might be linked together to share a separate network on the same host. However, when it comes to networking between containers running on different hosts, an overlay network is required, or containers can share the host network. Having many options for network configuration means there are many possible causes of network errors.

Moreover, errors and dropped packets are not the only things to watch for. Most applications today depend heavily on network communication. The throughput of virtual networks can be a bottleneck, especially for containers like load balancers. In addition, network traffic can be a good indicator of how much applications are used by clients, and sometimes you might see high spikes, which could indicate denial-of-service attacks, load tests, or failures in client apps. So watch the network traffic – it is a useful metric in many cases.

Docker Container Network I/O

Network traffic and transmission rates

Summary

There you have it — the top Docker metrics to watch. Staying focused on these top metrics and corresponding analysis will help you stay on the road while driving towards successful Docker deployments on many platforms such as Docker Swarm, Docker Cloud, Docker Datacenter or any other platform supporting Docker containers. If you’d like to learn even more about Docker Monitoring and Logging stay tuned on sematext.com/blog or follow @sematext.

To monitor your Docker environment, any of Sematext’s other pre-built integrations, your app’s custom metrics, or your Docker logs, sign up for a free 30-day trial account below.


Kafka Consumer Lag Monitoring

SPM is one of the most comprehensive Kafka monitoring solutions, capturing some 200 Kafka metrics, including Kafka Broker, Producer, and Consumer metrics. While lots of those metrics are useful, there is one particular metric everyone wants to monitor – Consumer Lag.

What is Consumer Lag

When people talk about Kafka or about a Kafka cluster, they are typically referring to Kafka Brokers. You can think of a Kafka Broker as a Kafka server. A Broker is what actually stores and serves Kafka messages. Kafka Producers are applications that write messages into Kafka (Brokers). Kafka Consumers are applications that read messages from Kafka (Brokers).

Inside Kafka Brokers data is stored in one or more Topics, and each Topic consists of one or more Partitions. When writing data a Broker actually writes it into a specific Partition. As it writes data it keeps track of the last “write position” in each Partition. This is called the Latest Offset, also known as the Log End Offset. Each Partition has its own independent Latest Offset.

Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of its “read position” in each Partition whose data it is consuming. That is, it keeps track of which data it has read. This is known as the Consumer Offset. The Consumer Offset is periodically persisted (to ZooKeeper or a special Topic in Kafka itself) so it can survive Consumer crashes or unclean shutdowns and avoid re-consuming too much old data.

Kafka Consumer Lag and Read/Write Rates

In our diagram above the yellow bars represent the rate at which Brokers are writing messages created by Producers. The orange bars represent the rate at which Consumers are consuming messages from Brokers. The rates look roughly equal – and they need to be, otherwise the Consumers will fall behind. However, there is always going to be some delay between the moment a message is written and the moment it is consumed. Reads are always going to lag behind writes, and that is what we call Consumer Lag. The Consumer Lag is simply the delta between the Latest Offset and the Consumer Offset.
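
As a concrete example: if a Partition’s Latest Offset is 5,000 and a Consumer’s Consumer Offset in that Partition is 4,200, that Consumer lags by 800 messages. Outside of SPM, Kafka’s bundled CLI can report this per-partition lag; a sketch (the group name is hypothetical, and older Kafka versions may additionally need the --new-consumer flag):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group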

Why is Consumer Lag Important

Many applications today are based on being able to process (near) real-time data. Think of a performance monitoring system like SPM or a log management service like Logsene. They continuously process infinite streams of near real-time data. If they were to show you metrics or logs with too much delay – if the Consumer Lag were too big – they’d be nearly useless. The Consumer Lag tells us how far behind each Consumer (Group) is in each Partition. The smaller the lag, the more real-time the data consumption.

Monitoring Read and Write Rates

Kafka Consumer Lag and Broker Offset Changes

As we just learned, the delta between the Latest Offset and the Consumer Offset is what gives us the Consumer Lag. In the above chart from SPM you may have noticed a few other metrics:

  • Broker Write Rate
  • Consume Rate
  • Broker Earliest Offset Changes

The rate metrics are derived metrics. If you look at Kafka’s metrics you won’t find them there. Under the hood SPM collects a few metrics with various offsets from which these rates are computed. In addition, it charts Broker Earliest Offset Changes, which is the earliest known offset in each Broker’s Partition. Put another way, this offset is the offset of the oldest message in a Partition. While this offset alone may not be super useful, knowing how it’s changing can be handy when things go awry. Data in Kafka has a certain TTL (Time To Live) to allow for easy purging of old data. This purging is performed by Kafka itself. Every time such purging kicks in, the offset of the oldest data changes. SPM’s Broker Earliest Offset Change surfaces this information for your monitoring pleasure. This metric gives you an idea of how often purges are happening and how many messages they’ve removed each time they ran.
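
The purging itself is governed by the Brokers’ retention settings in server.properties; the values below are only illustrations:

# keep data for 7 days
log.retention.hours=168
# and/or cap each partition at ~1GB, whichever limit is hit first
log.retention.bytes=1073741824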

There are several Kafka monitoring tools out there, like LinkedIn’s Burrow, whose Offset and Consumer Lag monitoring approach is used in SPM. If you need a good Kafka monitoring solution, give SPM a go. Ship your Kafka and other logs into Logsene and you’ve got yourself a DevOps solution that will make troubleshooting easy instead of dreadful.

 

5 Minute Recipe: Heroku Log Drain Setup

Since we wrote about how to ship Heroku Logs to ELK we’ve received good feedback from Heroku users and, encouraged by that feedback, deployed a log ingestion service for apps running on Heroku. This makes it super easy to get structured Heroku Logs into Logsene, the hosted ELK logging service.  Let’s see how that’s done in under five minutes (check the current time!):

Step 1 – Create your Logsene App

If you don’t have a Logsene account already simply get a free account and create a Logsene App. This will get you a Logsene Application Token.

Step 2 – Configure Log Drain for your Heroku App

Once you create your Logsene app you’ll see a command to set up the Heroku Log Drain including the Logsene Token.

Simply copy that command and run it in one of two ways:

  1. in the Heroku app directory, like this:

heroku drains:add https://logsene-heroku-receiver.sematext.com/LOGSENE_TOKEN

  2. alternatively, specify your app name in the command instead of calling the command from your Heroku app directory:

heroku drains:add https://logsene-heroku-receiver.sematext.com/LOGSENE_TOKEN -a YOUR_HEROKU_APP_NAME
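
To verify that the drain was added, you can list your app’s drains:

heroku drains -a YOUR_HEROKU_APP_NAME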

Step 3 – Watch your Logs in Logsene

If you now access your Heroku App, Heroku should log your HTTP request, and a few seconds later the logs will be visible in Logsene. And not in just any format! You’ll see PERFECTLY STRUCTURED HEROKU LOGS:


Parsed Heroku Logs in Logsene

 

Check the time!  Under five minutes?  If you like your Heroku app logs in Logsene tweet us your setup time. 🙂

Sematext Elastic Stack Training

Elasticsearch / Elastic Stack Training – NYC June 13-16

Next month, June 13-16, 2016, we will be running three Elastic Stack (aka ELK Stack) classes in New York City:

  1. June 13 & 14: Elasticsearch for Developers Training Workshop
  2. June 15: Elasticsearch Operations Training Workshop
  3. June 16: Elasticsearch for Logging Training Workshop

All classes cover Elasticsearch 2.x as well as Elasticsearch 5.x!

You can see the complete course outlines under Training Overview.  All three classes include lots of valuable hands-on exercises.  Be prepared to learn a lot!

Cost:

  • 2-day course: $1,200 early bird rate (valid through June 1) and $1,500 afterwards.
  • 1-day course: $700 early bird rate (valid through June 1) and $800 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:
462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.

Sematext Solr Training

Apache Solr Training in NYC June 13-14

If you’ve missed our Core Solr training in October 2015 in New York, here is another chance – we’re running the 2-day Core Solr class again next month – June 13 & 14, 2016.
This course covers Solr 5.x as well as Solr 6.x!  You can see the complete course outline under Solr & Elasticsearch Training Overview. The course is very comprehensive — it includes over 20 chapters and lots of hands-on exercises.  Be prepared to learn a lot!

Cost:
$1,200 early bird rate (valid through June 1) and $1,500 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:

462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.  To sign up, just register here.


DocValues Reindexing with Solr Streaming Expressions

Last time, when talking about Solr 6, we learned how to use streaming expressions to automatically update data in a collection. As you can imagine, this is not the only cool thing you can do with streaming expressions. Today, we will see how to re-index data in your collection for fields that are using doc values. For that we will use Solr 6.1, because of a simple bug that was fixed in that version (see SOLR-9015).

Let’s assume we have two collections: one called video, which will be the source of the data, and a second called video_new, which will be the target collection. We assume the collections have a slightly different structure – slightly different field names. The video collection will have the following fields:

  • id – document identifier
  • url – URL of the video
  • likes – number of likes
  • views – number of views

The second collection, video_new, will have the following fields:

  • id – document identifier
  • url – URL of the video
  • num_likes – number of likes
  • num_views – number of views

Exporting the data

The first thing we need to figure out is how to export data from the source collection in an efficient fashion. We can’t just set the rows parameter to a gazillion, because it is not efficient and can lead to Solr running out of memory. So we will use the /export request handler instead. The only limitation of that request handler is that the data needs to be sorted and needs to use doc values. That is not a problem for our data, but you should be aware of this requirement.
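
In practice this means every field in our fl list needs doc values enabled in the schema. A sketch of what such definitions might look like in the video collection’s schema (the field type names here are assumptions):

<field name="likes" type="int" indexed="true" stored="true" docValues="true"/>
<field name="views" type="int" indexed="true" stored="true" docValues="true"/>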

We will start by exporting the data using the standard Solr way – using the request params with the /export handler. The request looks like this:

curl -XGET 'localhost:8983/solr/video/export?q=*:*&sort=id+desc&fl=id,url,likes,views'

The above will result in Solr using the /export handler and returning all data, not only the first page of the results.

However, we want to use streaming expressions to re-index the data. Because of that we can change the above request to use the search streaming expression, which looks as follows:

search(
  video,
  zkHost="localhost:9983",
  qt="/export",
  q="*:*",
  fl="id,url,likes,views",
  sort="id desc")

The working command with the request looks like this:

curl --data-urlencode 'expr=search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc")' http://localhost:8983/solr/video/stream

We use the search streaming expression and provide the name of the collection (video in our case), the ZooKeeper host (yes, we can read from other clusters), and the name of the request handler, which is /export in our case and is required. Finally, we provide the match-all query, the list of fields that we are interested in, and the sorting expression. Please remember that when using the /export handler all fields listed in the fl parameter must use doc values.

Changing field names

Our collections have different field names and because of that the above search request is not enough. We need to alter the name of the fields by using the select streaming expression. We will change the name of the likes field to num_likes and the name of the views field to num_views. The expression that does that is:

select(
  search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),
  id, 
  url, 
  likes as num_likes,
  views as num_views
)

The select streaming expression lets us choose which fields should be used in the resulting tuples and how they will be named. In our case we take the id and url fields as is and we change the name of the likes and views fields.

To test the result of that expression you can simply use the following command:

curl --data-urlencode 'expr=select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views)' http://localhost:8983/solr/video/stream

Running the re-indexing

Finally, we have the data prepared and read in an efficient way, so we can send it to Solr for indexing. We do that using the update streaming expression, simply by specifying the target collection name and the batch size, like this:

update(
  video_new, 
  batchSize=100, 
  select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))

And the command that we would send to Solr:

curl --data-urlencode 'expr=update(video_new,batchSize=100,select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))' http://localhost:8983/solr/video/stream

Please note that we send the command to the source collection /stream handler – in our case to the video collection. This is important.

Verifying the re-indexing

Once Solr has finished the task we can check the number of documents in each collection to verify that the data has been re-indexed properly. We can do that by running these commands:

curl -XGET 'localhost:8983/solr/video/select?q=*:*&indent=true&rows=0'

and

curl -XGET 'localhost:8983/solr/video_new/select?q=*:*&indent=true&rows=0'

Both result in the following number of documents:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">38</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
    <str name="rows">0</str>
  </lst>
</lst>
<result name="response" numFound="18" start="0">
</result>
</response>

And that means that everything works as intended 🙂

Interested in Solr Streaming Expressions? Subscribe to this blog or follow @sematext – we have more Streaming Expressions blog posts in the queue. If you need any help with Solr / SolrCloud – don’t forget @sematext does Solr Consulting, Production Support, as well as Solr Training!


Scalable and Flexible Elasticsearch Reindexing via rsyslog

Earlier on, we posted a recipe on reindexing data from within an Elasticsearch 2.3+ cluster. But this doesn’t work if you want to reindex in a different cluster or if your Elasticsearch is older than 2.3. Or both, when you’re trying to migrate from 1.x to 2.x or later.

For such cases, we posted a Logstash reindexing recipe. However, Logstash can sometimes become a bottleneck, so we needed something faster for indexing lots of data. We turned to rsyslog, a log shipper with performance as its #1 feature.

The plan

As rsyslog doesn’t have an Elasticsearch input like Logstash does, we’ve used an external application to scroll through Elasticsearch documents and push them to rsyslog via TCP. The flow would be:

rsyslog to Elasticsearch reindex flow

This is an easy way to extend rsyslog, using whichever language you’re comfortable with, to support more inputs. Here, we piggyback on the TCP input. You can do a similar job with filters/parsers – you can find some examples here – by piggybacking on the mmexternal module, which uses stdin and stdout for communication. The same is possible for outputs, normally added via the omprog module: we did this to add a Solr output and one for SPM custom metrics.

The custom script in question doesn’t have to be multi-threaded; you can simply spin up more instances of it, scrolling different indices. In this particular case, using two scripts gave us slightly better throughput, saturating the network:

rsyslog to Elasticsearch reindex flow multiple scripts

Writing the custom script

Before starting to write the script, one needs to know what the messages sent to rsyslog should look like. To be able to index data, rsyslog needs an index name, a type name and, optionally, an ID. In this particular case, we were dealing with logs, so the ID wasn’t necessary.

With this in mind, I see a number of ways of sending data to rsyslog:

  • one big JSON per line. One can use mmnormalize to parse that JSON, which then allows rsyslog to use values from within it as the index name, type name, and so on
  • for each line, begin with the bits of “extra data” (like index and type names), then put the JSON document that you want to reindex. Again, you can use mmnormalize to parse, but this time you can simply trust that the last part is JSON and send it to Elasticsearch directly, without the need to parse it
  • if you only need to pass two variables (index and type name, in this case), you can piggyback on the vague spec of RFC3164 syslog and send something like
    destination_index document_type:{"original": "document"}
    

This last option will put the provided index name in the hostname variable, the type in syslogtag, and the original document in msg. A bit hacky, I know, but quite convenient (it makes the rsyslog configuration straightforward) and very fast, since we know the RFC3164 parser is very quick and it runs on all messages anyway. No need for mmnormalize, unless you want to change the document in-flight with rsyslog.

Below you can find the Python code that can scan through existing documents in an index (or index pattern, like logstash_2016.05.*) and push them to rsyslog via TCP. You’ll need the Python Elasticsearch client (pip install elasticsearch) and you’d run it like this:

python elasticsearch_to_rsyslog.py source_index destination_index

The script being:

from elasticsearch import Elasticsearch
import json, socket, sys

source_cluster = ['server1', 'server2']
rsyslog_address = '127.0.0.1'
rsyslog_port = 5514

es = Elasticsearch(source_cluster,
      retry_on_timeout=True,
      max_retries=10)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((rsyslog_address, rsyslog_port))


result = es.search(index=sys.argv[1], scroll='1m', search_type='scan', size=500)

while True:
  # each scroll call returns the next batch; note we must update 'result' here,
  # otherwise we'd loop over the initial (empty) scan response forever
  result = es.scroll(scroll_id=result['_scroll_id'], scroll='1m')
  if not result['hits']['hits']:
    break
  for hit in result['hits']['hits']:
    # send "index_name type:{original JSON}" as described above
    s.send(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(hit["_source"]) + '\n')

s.close()

If you need to modify messages, you can parse them in rsyslog via mmjsonparse and then add/remove fields through rsyslog’s scripting language. I couldn’t find a nice way to change field names, though – for example to remove the dots that are forbidden since Elasticsearch 2.0 – so I did that in the Python script:

def de_dot(my_dict):
  # iterate over a copy of the items so keys can be renamed while looping
  for key, value in list(my_dict.items()):
    if type(value) is dict:
      value = de_dot(value)
    new_key = key.replace('.', '_')
    del my_dict[key]
    my_dict[new_key] = value
  return my_dict

And then the “send” line becomes:

s.send(sys.argv[2] + ' ' + hit["_type"] + ':' + json.dumps(de_dot(hit["_source"]))+'\n')

Configuring rsyslog

The first step here is to make sure you have the latest rsyslog, though the config below works with versions all the way back to 7.x (which can be found in most Linux distributions). You just need to make sure the rsyslog-elasticsearch package is installed, because we need the Elasticsearch output module.

# messages bigger than this (~10MB) are truncated
$maxMessageSize 10000000

# load the TCP input and the ES output modules
module(load="imtcp")
module(load="omelasticsearch")

main_queue(
  # buffer up to 1M messages in memory
  queue.size="1000000"
  # these threads process messages and send them to Elasticsearch
  queue.workerThreads="4"
  # rsyslog processes messages in batches to avoid queue contention
  # this will also be the Elasticsearch bulk size
  queue.dequeueBatchSize="4000"
)

# we use templates to specify what the data sent to Elasticsearch looks like
template(name="document" type="list"){
  # the "msg" variable contains the document
  property(name="msg")
}
template(name="index" type="list"){
  # "hostname" has the index name
  property(name="hostname")
}
template(name="type" type="list"){
  # "syslogtag" has the type name
  property(name="syslogtag")
}

# start the TCP listener on the port we pointed the Python script to
input(type="imtcp" port="5514")

# sending data to Elasticsearch, using the templates defined earlier
action(type="omelasticsearch"
  template="document"
  dynSearchIndex="on" searchIndex="index"
  dynSearchType="on" searchType="type"
  server="localhost"  # destination Elasticsearch host
  serverport="9200"   # and port
  bulkmode="on"  # use the bulk API
  action.resumeretrycount="-1"  # retry indefinitely if Elasticsearch is unreachable
)

This configuration doesn’t have to disturb your local syslog (i.e. by replacing /etc/rsyslog.conf). You can put it someplace else and run a different rsyslog process:

rsyslogd -i /var/run/rsyslog_reindexer.pid -f /home/me/rsyslog_reindexer.conf

And that’s it! With rsyslog started, you can start the Python script(s) and do the reindexing.

If you need any help with Elasticsearch, rsyslog, Logstash and the like, check out our Elasticsearch consulting, Logging consulting, Elasticsearch production support and Elasticsearch and Logging training info.


Solr Streaming Expressions for Collection auto-updating

One of the things that changed extensively in Solr 6.0 is Streaming Expressions and what we can do with them (hint: amazing stuff!). We already described Solr SQL support. Today, we’ll dig into the functionality that makes Solr SQL support possible – Streaming Expressions. Using Streaming Expressions we will put together a mechanism that lets us re-index data in a given Solr collection – all within Solr, without any external moving parts or dependencies.

Read More