Recipe: rsyslog + Kafka + Logstash

This recipe is similar to the previous rsyslog + Redis + Logstash one, except that we’ll use Kafka as a central buffer and connecting point instead of Redis. You’ll have more of the same advantages:

  • rsyslog is light and crazy-fast, including when you want it to tail files and parse unstructured data (see the Apache logs + rsyslog + Elasticsearch recipe)
  • Kafka is awesome at buffering things
  • Logstash can transform your logs and connect them to N destinations with unmatched ease

There are a couple of differences to the Redis recipe, though:

  • rsyslog already has Kafka output packages, so it’s easier to set up
  • Kafka has a different set of features than Redis (trying to avoid flame wars here) when it comes to queues and scaling

As with the other recipes, I’ll show you how to install and configure the needed components. The end result would be that local syslog (and tailed files, if you want to tail them) will end up in Elasticsearch, or a logging SaaS like Logsene (which exposes the Elasticsearch API for both indexing and searching). Of course you can choose to change your rsyslog configuration to parse logs as well (as we’ve shown before), and change Logstash to do other things (like adding GeoIP info).

Getting the ingredients

First of all, you’ll probably need to update rsyslog. Most distros come with ancient versions and don’t have the plugins you need. From the official packages you can install:

If you don’t have Kafka already, you can set it up by downloading the binary tar. And then you can follow the quickstart guide. Basically you’ll have to start Zookeeper first (assuming you don’t have one already that you’d want to re-use):

bin/zookeeper-server-start.sh config/zookeeper.properties

And then start Kafka itself and create a simple 1-partition topic that we’ll use for pushing logs from rsyslog to Logstash. Let’s call it rsyslog_logstash:

bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic rsyslog_logstash

Finally, you’ll have Logstash. At the time of writing this, we have a beta of 2.0, which comes with lots of improvements (including huge performance gains of the GeoIP filter I touched on earlier). After downloading and unpacking, you can start it via:

bin/logstash -f logstash.conf

Though you also have packages, in which case you’d put the configuration file in /etc/logstash/conf.d/ and start it with the init script.

Configuring rsyslog

With rsyslog, you’d need to load the needed modules first:

module(load="imuxsock")  # will listen to your local syslog
module(load="imfile")    # if you want to tail files
module(load="omkafka")   # lets you send to Kafka

If you want to tail files, you’d have to add definitions for each group of files like this:

input(type="imfile"
  File="/opt/logs/example*.log"
  Tag="examplelogs"
)

Then you’d need a template that will build JSON documents out of your logs. You would publish these JSON’s to Kafka and consume them with Logstash. Here’s one that works well for plain syslog and tailed files that aren’t parsed via mmnormalize:

template(name="json_lines" type="list" option.json="on") {
  constant(value="{")
  constant(value=""timestamp":"")
  property(name="timereported" dateFormat="rfc3339")
  constant(value="","message":"")
  property(name="msg")
  constant(value="","host":"")
  property(name="hostname")
  constant(value="","severity":"")
  property(name="syslogseverity-text")
  constant(value="","facility":"")
  property(name="syslogfacility-text")
  constant(value="","syslog-tag":"")
  property(name="syslogtag")
  constant(value=""}")
}

By default, rsyslog has a memory queue of 10K messages and has a single thread that works with batches of up to 16 messages (you can find all queue parameters here). You may want to change:
– the batch size, which also controls the maximum number of messages to be sent to Kafka at once
– the number of threads, which would parallelize sending to Kafka as well
– the size of the queue and its nature: in-memory(default), disk or disk-assisted

In a rsyslog->Kafka->Logstash setup I assume you want to keep rsyslog light, so these numbers would be small, like:

main_queue(
  queue.workerthreads="1"      # threads to work on the queue
  queue.dequeueBatchSize="100" # max number of messages to process at once
  queue.size="10000"           # max queue size
)

Finally, to publish to Kafka you’d mainly specify the brokers to connect to (in this example we have one listening to localhost:9092) and the name of the topic we just created:

action(
  broker=["localhost:9092"]
  type="omkafka"
  topic="rsyslog_logstash"
  template="json"
)

Assuming Kafka is started, rsyslog will keep pushing to it.

Configuring Logstash

This is the part where we pick the JSON logs (as defined in the earlier template) and forward them to the preferred destinations. First, we have the input, which will use to the Kafka topic we created. To connect, we’ll point Logstash to Zookeeper, and it will fetch all the info about Kafka from there:

input {
  kafka {
    zk_connect => "localhost:2181"
    topic_id => "rsyslog_logstash"
  }
}

At this point, you may want to use various filters to change your logs before pushing to Logsene or Elasticsearch. For this last step, you’d use the Elasticsearch output:

output {
  elasticsearch {
    hosts => "logsene-receiver.sematext.com:443" # it used to be "host" and "port" pre-2.0
    ssl => "true"
    index => "your Logsene app token goes here"
    manage_template => false
    #protocol => "http" # removed in 2.0
    #port => "443" # removed in 2.0
  }
}

And that’s it! Now you can use Kibana (or, in the case of Logsene, either Kibana or Logsene’s own UI) to search your logs!

14 thoughts on “Recipe: rsyslog + Kafka + Logstash

    1. Right, your logs will be stored in Kafka, too, but only according to your TTL. This would allow you to reindex data if something goes horribly wrong with ES and your shards get corrupted. A good practice here is to hold enough data in Kafka to be able to replay from your last backup. For example, if you have daily backups (e.g. via Curator), you might want to keep one day’s worth of data in Kafka.

  1. thanks for the recipe. I had a little problem with the rsyslog stuff. It was the template that is a bit wrong (due to HTML on the blog?). The inner ” should be escaped with backslash

    run rsyslog -N6 to check config.
    It looked liked rsyslog restarted when checking systemctl, but it hadn’t.

    1. Thanks, Raphael!

      I’m not sure what you mean by using this stack. You want something like Logstash -> Kafka -> Logstash -> Elasticsearch <- Kibana ? This should be doable by using the Kafka output on the first Logstash, and then the rest would work like in this post. But you'd probably want something light in the first step (like rsyslog here, and also Filebeat will support Kafka in version 5), then use Kafka for buffering and do the heavy Logstash processing after the buffer.

      Also, since I mentioned Filebeat 5, you can use it to push data directly to Elasticsearch and then have an Elasticsearch Ingest node do some of the heavy processing. If what Ingest can do is enough for you, then you'll end up with a simpler and faster setup, check this post for details: https://sematext.com/blog/2016/04/25/elasticsearch-ingest-node-vs-logstash-performance/

  2. Great post! I was trying to get rsyslogd v8 installed through yum on CentOS5. However, it does not come with omkafka. I can find very long and error prone instruction to compile rsyslogd with omkafka module. However, I was wondering if there’s a binary version of rsyslogd which comes with omkafka that I can use, without having to compile the whole thing.

    1. Just a bit more info that may help: I see there are omkafka packages on rsyslog.com for RHEL/CentOS 6 and 7, but not for 5. I suspect this is due to issues in building librdkafka there (a quick search revealed some clues here: https://github.com/edenhill/librdkafka/issues/319#issuecomment-117812222). If you can help with this (i.e. figure out how to build librdkafka on CentOS5, or maybe the issue in the link I pointed to was fixed in the meantime), then please post on the rsyslog mailing list and I’m sure Florian (in charge with packages here: https://github.com/rsyslog/rsyslog-pkg-rhel-centos) will gladly add omkafka packages for RHEL/CentOS5.

  3. Hi,
    Is it possible to redirect different topics to different logstash indices using some conditional statement in the logstash output plugin? e.g. let’s say I have two kafka topics, ‘topic-1’ and ‘topic-2’ and I want to send those to ‘logstash-1’, ‘logstash-2’ indices respectively.

    1. Hi Sayan,

      Yes, it’s possible. I think the way to go here is to have two Kafka inputs, each listening to one topic (via “topic_id”). You can mark messages coming to each topic by using “tags” and then on the output side you can use a conditional based on the tag. (something like “if tags=topic-1 then elasticsearch(logstash-1) else elasticsearch(logstash-2)).

      Alternatively, you can have multiple Logstash processes, each listening to one topic. This will add the overhead of having multiple JVMs but should be easier to scale (i.e. you can scale independently per topic) and you should get better throughput in the end because you won’t need to do any tagging or filtering. And because the flows are independent, if one of the flows blocks (e.g. failure to write to Elasticsearch) the other doesn’t necessarily block.

      1. Hi,
        Thanks for the response. The issue was resolved using ‘white_list’ and ‘decorate_events’ in kafka input plugin and then extracting the topic id from the kafka event using ruby filter.

  4. Hi there, thanks for your post.

    Do you use this setup in production ? How does it perform ? Do you have any number like event rate ?

    Thanks

    1. Hi,

      We use both rsyslog and Kafka in production, but we currently don’t connect them this way. I know of others who do use rsyslog+Kafka in production. I haven’t benchmarked this setup yet, but I’m expecting similar numbers to the rsyslog+Redis setup (assuming the I/O can take it), where I got 200K EPS on a single thread.

      Even if you parse unstructured data in rsyslog (and I would do that because it’s very fast and stays just as fast as you’re adding more rules), I would expect the heavier processing done by Logstash to be the bottleneck. In that case, you can easily add and remove Logstash “consumers” because they connect to Zookeeper.

      You cand find more numbers from the rsyslog&Logstash pipeline performance tests (as well as the configs) in our Lucene Revolution presentation: http://blog.sematext.com/2015/10/16/large-scale-log-analytics-with-solr/

      I hope this helps.

Leave a Reply