Kubernetes Containers: Logging and Monitoring support

In this post we will:

  • Introduce Kubernetes concepts and motivation for Kubernetes-aware monitoring and logging tooling
  • Show how to deploy the Sematext Docker Agent to each Kubernetes node with DaemonSet
  • Point out key Kubernetes metrics and log elements to help you troubleshoot and tune Docker and Kubernetes

Managing microservices in containers is typically done with cluster managers and orchestration tools such as Google Kubernetes, Apache Mesos, Docker Swarm, Docker Cloud, Amazon ECS and HashiCorp Nomad, to mention just a few. However, each platform offers slightly different options for deploying containers or scheduling tasks on cluster nodes. This is why we started a series of blog posts with Docker Swarm Monitoring, and continue today with a quick tutorial for container monitoring and log collection on Kubernetes.

Read More

Logging Libraries vs Log Shippers

In the context of centralizing logs (say, to Logsene or your own Elasticsearch), we often get the question of whether one should log directly from the application (e.g. via an Elasticsearch or syslog appender) or use a dedicated log shipper.

In this post, we’ll look at the advantages of each approach, so you’ll know when to use which.

Logging Libraries

Most programming languages have libraries to assist you with logging. Most commonly, they support local files or syslog, but more “exotic” destinations are often added to the list, such as Elasticsearch/Logsene. Here’s why you might want to use them:

  • Convenience: you’ll want a logging library anyway, so why not go with it all the way, without having to set up and manage a separate application for shipping? (well, there are some reasons below, but you get the point)
  • Fewer moving parts: logging from the library means you don’t have to manage the communication between the application and the log shipper
  • Lighter: logs serialized by your application can be consumed by Elasticsearch/Logsene directly, instead of having a log shipper in the middle deserialize/parse them and then serialize them again (a rough sketch of such a direct call follows below)
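
For illustration, here's roughly the kind of HTTP call an Elasticsearch appender ends up making under the hood; the index and type names are made up, and a real appender would typically batch events via the bulk API:

# one log event sent straight to Elasticsearch, with no shipper in between
curl -XPOST 'http://localhost:9200/logs-2016.11.25/log' -d '{
  "@timestamp": "2016-11-25T10:15:00Z",
  "severity": "INFO",
  "message": "payment processed"
}'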

Log Shippers

Your log shipper can be Logstash or one of its alternatives. A logging library is still needed to get logs out of your application, but you’ll only write locally, either to a file or to a socket. A log shipper will take care of taking that raw log all the way to Elasticsearch/Logsene:

  • Reliability: most log shippers have buffers of some form. Whether it tails a file and remembers where it left off, or keeps data in memory/disk, a log shipper would be more resilient to network issues or slowdowns. Buffering can be implemented by a logging library too, but in reality most either block the thread/application or drop data
  • Performance: buffering also means a shipper can process data and send it to Elasticsearch/Logsene in bulks. This design will support higher throughput. Once again, logging libraries may have this functionality too (only tightly integrated into your app), but most will just process logs one by one
  • Enriching: unlike most logging libraries, log shippers often are capable of doing additional processing, such as pulling the host name or tagging IPs with Geo information
  • Fanout: logging to multiple destinations (e.g. local file + Logsene) is normally easier with a shipper
  • Flexibility: you can always change your log shipper to one that suits your use-case better. Changing the library you use for logging may be more involved
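
As a minimal sketch of the shipper approach (the file path and Elasticsearch host are placeholders), here's a Logstash config that tails the local file your logging library writes to and takes care of getting its contents to Elasticsearch:

# write a minimal shipper config: tail a local log file, ship it to Elasticsearch
cat > shipper.conf <<'EOF'
input {
  file {
    path => "/var/log/myapp/app.log"   # the file your logging library writes to
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
EOF
# run it (adjust the path to your Logstash installation)
bin/logstash -f shipper.conf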

Conclusions

Design-wise, the difference between the two approaches is simply tight vs loose coupling, but the way most libraries and shippers are actually implemented is more likely to influence your decision on sending data to Elasticsearch/Logsene:

  • logging directly from the library might make sense for development: it’s easier to set up, especially if you’re not (yet) familiar with a log shipper
  • in production you’ll likely want to use one of the available log shippers, mostly because of buffers: blocking the application or dropping data (immediately) are often non-options in a production deployment

If logging isn’t critical to your environment (i.e. you can tolerate the occasional loss of data), you may want to fire-and-forget your logs to Logsene’s UDP syslog endpoint. This takes reliability out of the equation, meaning you can use a shipper if you need enriching or support for other destinations, or a library if you just want to send the raw logs (which may well be JSON).
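
If you go the fire-and-forget route, it can be as simple as pushing a syslog-formatted line over UDP; the host and port below are placeholders for your actual endpoint:

# one syslog line over UDP: no retries, no delivery guarantees
echo "<134>$(hostname) myapp: order 42 processed" | nc -u -w1 logs.example.com 514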

Shippers or libraries, if you want to send logs with anything that can talk to Elasticsearch or syslog, you can sign up for Logsene here. No credit card or commitment is required, and we offer 30-day trials for all plans, in addition to the free ones.

If, on the other hand, you enjoy working with logs, metrics and/or search engines, come join us: we’re hiring worldwide.

Exploring Windows Kernel with Fibratus and Logsene

This is a guest post by Nedim Šabić, developer of Fibratus, a tool for exploration and tracing of the Windows kernel. 

Unlike Linux / UNIX environments, which provide a plethora of open source and native tools to instrument the user / kernel space internals, the Windows operating systems are pretty limited when it comes to the diversity of tools and interfaces for performing such tasks. Prior to Windows 7, you could use some not-so-legal techniques, like SSDT hooking, to intercept system calls issued from user space and do your custom pre-processing, but they are far from efficient or stable. A kernel mode driver could be helpful, but it requires a digital signature granted by Microsoft. Some tools, like Sysmon or Process Monitor, can be helpful, but they are closed source and don’t leave much room for extensibility or integration with external systems such as message queues, databases, endpoints, etc.

Read More

Black Friday log management (with the Elastic Stack) checklist

For this Black Friday, Sematext wishes you:

  • more products sold
  • more traffic and exposure
  • more logs 🙂

Now seriously, applications tend to generate a lot more logs on Black Friday, and they also tend to break down more – making those logs even more precious. If you’re using the Elastic Stack for centralized logging, in this post we’ll share some tips and tricks to prepare you for this extra traffic.

If you’re still grepping through your logs via ssh, doing that on Black Friday might be that much more painful, so you have two options:

  • get started with the Elastic Stack now. Here’s a complete ELK howto. It should take you about an hour to get started and you can move on from there. Don’t forget to come back to this post for tips! 🙂
  • use Logsene, which takes care of the E(lasticsearch) and K(ibana) from ELK for you. Most importantly for this season, we take care of scaling Elasticsearch. You can get started in 5 minutes with Logstash or choose another log shipper. Anything that can push data to Elasticsearch via HTTP can work with Logsene, since it exposes the Elasticsearch API. So you can log directly from your app or from a log shipper (here are all the documented options).

Either way, let’s move to the tips themselves.

Tips for Logstash and Friends

The big question here is: can the pipeline easily max out Elasticsearch, or will it become the bottleneck itself? If your logs go directly from your servers to Elasticsearch, there’s little to worry about: as you spin more servers for Black Friday, your pipeline capacity for processing and buffering will grow as well.

You may get into trouble if your logs are funnelled through one (or a few) Logstash instances, though. If you find yourself in that situation you might check the following:

  • Bulk size. The ideal size depends on your Elasticsearch hardware, but usually you want to send a few MB at a time. Gigantic batches will put unnecessary strain on Elasticsearch, while tiny ones will add too much overhead. Calculate how many logs (of your average size) make up a few MB and you should be good (there’s a sample output config after this list).
  • Number of threads sending data. When one thread goes through a bulk reply, Elasticsearch shouldn’t be idling – it should get data from another thread. The optimal number of threads depends on whether these threads are doing something else (in Logstash, for example, pipeline threads also take care of parsing, which can be expensive) and on your destination hardware. As a rule of thumb, about 4 threads with few things to do (e.g. no grok or geoip in Logstash) per Elasticsearch data node should be enough to keep them busy. If threads have more processing to do, you may need more of them.
  • The same applies for processing data: many shippers work on logs in batches (recent versions of Logstash included) and can do this processing on multiple threads.
  • Distribute the load between all data nodes. This will prevent any one data node from becoming a hotspot. In Logstash specify an array of destination hosts. Or, you can start using Elasticsearch “client” nodes (with both node.data and node.master set to false in elasticsearch.yml) and point Logstash to two of those (for failover).
  • The same applies for the shipper sending data to the central Logstash servers – the load needs to be balanced between them. For example, in Filebeat you can specify an array of destination Logstash hosts or you can use Kafka as a central buffer.
  • Make sure there’s enough memory to do the processing (and buffering, if the shipper buffers in memory). For Logstash, the default 1GB of heap may not cope with heavy load – depending on how much processing you do, it may need 2GB or more (monitoring Logstash’s heap usage will tell for sure).
  • If you use grok and have multiple rules, put the rules matching more logs and the cheaper ones earlier in the array. Or use Ingest Nodes to do the grok instead of Logstash.
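
To make the bulk size, thread count and host balancing tips above concrete, here's a minimal sketch of a tuned Logstash elasticsearch output. Host names and the config path are placeholders, and the flush_size/workers options shown are from the 2.x era – in Logstash 5.x the equivalent knobs are pipeline.batch.size and pipeline.workers in logstash.yml:

# write a tuned output config (hosts and path are placeholders)
cat > /etc/logstash/conf.d/es-output.conf <<'EOF'
output {
  elasticsearch {
    hosts      => ["es-data-1:9200", "es-data-2:9200"]  # spread bulks across data (or client) nodes
    flush_size => 2000  # roughly a few MB per bulk, depending on your average log size
    workers    => 4     # parallel senders, so Elasticsearch doesn't idle between bulks
  }
}
EOF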

Tips for Elasticsearch

Let’s just dive into them:

  • Refresh interval. There’s an older blog post on how refresh interval influences indexing performance. The conclusions from it are still valid today: for Black Friday at least, you might want to relax the real-time-ness of your searches to get more indexing throughput.
  • Async transaction log. By default, Elasticsearch will fsync the transaction log after every operation (2.x) or request (5.x). You can relax this safety guarantee by setting index.translog.durability to async. This way it will fsync every 5s (the default value for index.translog.sync_interval) and save you some precious IOPS (there’s a sample settings call after this list).
  • Size based indices. If you’re using strict time-based indices (like one index every day), Black Friday traffic may cause a drop in indexing throughput like this (mainly because of merges):

Indexing throughput graph from SPM Elasticsearch monitor

In order to continue writing at that top speed, you’ll need to rotate indices before they reach that “wall size”, which is usually at 5-10GB per shard. The point is to rotate when you reach a certain size, and not purely by time, and use an alias to always write to the latest index (in 5.x this is made easier with the Rollover Index API).

  • Ensure load is balanced across data nodes. Otherwise some nodes will become bottlenecks. This requires your number of shards to be proportional to the number of data nodes. Feel free to twist Elasticsearch’s arm into balancing shards by configuring index.routing.allocation.total_shards_per_node: for example, if you have 4 shards and one replica on a 4-data-node cluster, you’ll want a maximum of 2 shards per node.
  • Overshard so you can scale out if you need to, while keeping your cluster balanced. You’d do this by setting a [reasonable] number of shards that has enough divisors. For example, if you have 4 data nodes then 12 shards and 1 replica per shard might work well. You could scale up to 6, 8, 12 or even 24 nodes and your cluster will still be perfectly balanced.
  • Relax the merge policy. This will slow down your full-text searches a bit (though aggregations would perform about the same), use some more heap and open files in order to allow more indexing throughput. 50 segments_per_tier, 20 max_merge_at_once and 500mb max_merged_segment should give you a good boost.
  • Don’t store what you don’t need. Disable _all and search in specific fields (and, to search a general field like “message” by default, point index.query.default_field to it). Skip indexing fields not used for full-text search and skip doc values for fields on which you don’t aggregate.
  • Use doc values for aggregations (instead of the in-memory field data) – this is the default for all fields except analyzed strings since 2.0, but you’ll need to be extra careful if you’re still on 1.x. Otherwise you’ll risk running out of heap and crash/slow down your cluster.
  • Use dedicated masters. This is also a stability measure that helps your cluster remain consistent even if load makes your data nodes unresponsive.
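
As a rough illustration of the refresh interval, translog, shard balancing and merge policy tips above, here's what applying them to one (hypothetical) index could look like via the settings API. Most of these can be changed on a live index, but the exact values – and whether a given setting is dynamic – depend on your hardware and Elasticsearch version, so double-check before copying:

curl -XPUT 'http://localhost:9200/logs-2016.11.25/_settings' -d '{
  "index.refresh_interval": "30s",
  "index.translog.durability": "async",
  "index.routing.allocation.total_shards_per_node": 2,
  "index.merge.policy.segments_per_tier": 50,
  "index.merge.policy.max_merge_at_once": 20,
  "index.merge.policy.max_merged_segment": "500mb"
}'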

You’ll find even more tips and tricks, as well as more details on implementing the above, in our Velocity 2016 presentation. But the ones described above should give you the most bang per buck (or rather, per time, but you know what they say about time) for this Black Friday.

Final Words

Tuning & scaling Elasticsearch isn’t rocket science, but it often requires time, money or both. So if you’re not into taking care of all this plumbing, we suggest delegating this task to us by using Logsene, our log analytics SaaS. With Logsene, you’d get:

  • The same Elasticsearch API when it comes to indexing and querying. We have Kibana, too, in addition to our own UI, plus you can use Grafana Elasticsearch integration.
  • Free trials for any plan, even the Black Friday-sized ones. You can sign up for them without any commitment or credit card details.
  • No lock in – because of the Elasticsearch API, you can always go [back] to your own ELK Stack if you really want to manage your own Elasticsearch clusters. We can even help you with that via Elastic Stack consulting, training and production support.
  • A lot of extra goodies on top of Elasticsearch, like role-based authentication, alerting and integration with SPM for your application monitoring. This way you can have your metrics and logs in one place.

If, on the other hand, you are passionate about this stuff and work with it, you might like to hear that we’re hiring worldwide, for a wide range of positions (at the time of this writing there are openings for backend, frontend (UX, UI, ReactJS, Redux…), sales, work on Docker, consulting and training). 🙂

Elasticsearch on EC2 vs AWS Elasticsearch Service

Many of our clients use AWS EC2. In the context of Elasticsearch consulting or support, one question we often get is: should we use AWS Elasticsearch Service instead of deploying Elasticsearch ourselves? The question is valid whether “self hosted” means in EC2, some other cloud or your own datacenter. As always, the answer is “it depends”, but in this post we’ll show how the advantages of AWS Elasticsearch Service compare to those of deploying your own Elasticsearch cluster. This way, you’ll be able to decide what fits your use-case and knowledge.

Why AWS Elasticsearch?

  • It automatically replaces failed nodes: you don’t need to get paged in the middle of the night, spin a new node and add it to the cluster
  • You can add/remove nodes through an API – otherwise you’ll have to make sure you have all the automation in place so that when you spin a node you don’t spend extra time manually installing and configuring Elasticsearch
  • You can manage access rights via IAM: this is easier than setting up a reverse proxy or a security addon (cheaper, too, if the addon is paid)
  • Daily snapshots to S3 are included. This saves you the time and money to set it up (and the storage cost) for what is a mandatory step in most use-cases
  • CloudWatch monitoring included. You will want to monitor your Elasticsearch cluster anyway (whether you build or buy)

Why install your own Elasticsearch?

  • Equivalent on-demand instances are cheaper by ~29%. The delta differs from instance to instance (we checked m3.2xl and i2.2xl ones). You get an even bigger discount for your own cluster if you use reserved instances
  • More instance types and sizes are available. You can use bigger i2 instances than AWS Elasticsearch, and you have access to the latest generation of c4 and m4 instances. This way, you are likely to scale further and get more bang per buck, especially with logs and metrics (more specific hardware recommendations and Elasticsearch tuning here)
  • You can change more index settings, beyond analysis and number of shards/replicas. For example, delayed allocation, which is useful when you have a lot of data per node. You can also change the settings of all indices at once by hitting the /_settings endpoint (see the sample calls after this list). By properly using the various settings Elasticsearch makes available, you can better optimize your setup for your particular use case, make better use of the underlying resources, and thus drive the cost down further.
  • You can change more cluster-wide settings, such as number of shards to rebalance at once
  • You get access to all other APIs, such as Hot Threads, which is useful for debugging
  • You can use a more comprehensive Elasticsearch monitoring solution. Currently, CloudWatch only collects a few metrics, such as cluster status, number of nodes and documents, heap pressure and disk space. For most use-cases, you’ll need more info, such as the query latency and indexing throughput. And when something goes wrong, you’ll need more insight on JVM pool sizes, cache sizes, Garbage Collection or you may need to profile Elasticsearch
  • You can have clusters of more than 20 nodes
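
For example, here are two things mentioned above that you can only do on a self-managed cluster: changing a setting on all indices at once (the delayed allocation value below is just an illustration) and pulling Hot Threads when debugging:

# raise delayed allocation on all indices at once (useful with lots of data per node)
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index.unassigned.node_left.delayed_timeout": "15m"
}'

# see what each node is busy doing right now
curl 'http://localhost:9200/_nodes/hot_threads'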

Conclusions

You may see a pattern emerging from the bullets above: AWS Elasticsearch is easy to set up and comes with a few features on top of Elasticsearch that you’ll likely need. However, it’s limited when it comes to scaling – both in terms of the number and size of nodes and in terms of Elasticsearch features.

If you already know your way around Elasticsearch, AWS Elasticsearch Service will likely only make sense for small clusters. If you’re just getting started, you can go quite a long way with it before it pays off to boost your knowledge (e.g. via an Elasticsearch training) and install your own Elasticsearch cluster (maybe with the help of our consulting or support). Or you can delegate the whole scaling part to us by using Logsene, especially if your use-case is about logs or metrics.

Finally, if you think there are too many “if”s in the above paragraph, here’s a flowchart to cover all the options:

Flowchart: hosted Elasticsearch vs. AWS Elasticsearch Service

Elasticsearch for logs and metrics: A deep dive – Velocity 2016, O’REILLY CONFERENCES

We are known worldwide for our Elasticsearch, ELK Stack and Solr consulting services, and we are always happy to help others improve their skills in these technologies, not only through Solr & Elastic Stack trainings, but also by sharing our knowledge at meetups and conferences. This week, on 7-9 November 2016, we joined the O’Reilly Velocity 2016 conference, discussing the latest tech in Elasticsearch.

Our colleagues Radu Gheorghe and Rafał Kuć were present in Amsterdam and gave a talk titled “Elasticsearch for logs and metrics: A deep dive”. And it was a great experience! They met with web operations and DevOps professionals interested in improving their Elastic Stack skills.

Lots of comments and questions were answered!

Curious to check their presentation? You may find it below.

Tuning Solr & Pipeline for Logs – Video & Slides

Not everyone uses Splunk or ELK stack for logs. A few weeks ago, at the Lucene/Solr Revolution conference in Boston, we gave a talk about using Solr for logging, along with lots of good info about how to tune the logging pipeline. The talk also goes over the best AWS instance types, optimal EBS setup, log shipping (see Top 5 Logstash Alternatives), and so on.

Docker “Swarm Mode”: Full Cluster Monitoring & Logging with 1 Command

Until recently, automating the deployment of performance monitoring agents in Docker Swarm clusters was challenging because monitoring agents had to be deployed to each cluster node, and previous Docker releases (before Docker Engine v1.12 / Docker Swarm 1.2.4) had no global service scheduler (GitHub issue #601). Scheduling services via docker-compose and scheduling constraints required manual updates whenever the number of nodes in the Swarm cluster changed – definitely not convenient for dynamic scaling of clusters! In Docker Swarm Monitoring and Logging we shared some Linux shell acrobatics as a workaround for this issue.

The good news: all of this has changed with Docker Engine v1.12 and the new Swarm mode. Docker v1.12 provides many new features for orchestration, and the new Swarm mode makes it much easier to deploy Swarm clusters.

With Docker v1.12, services can be scheduled globally – similar to Kubernetes DaemonSets, RancherOS global services or CoreOS global fleet services.

Read More

5 Logstash Alternatives

When it comes to centralizing logs to Elasticsearch, the first log shipper that comes to mind is Logstash. People hear about it even if it’s not clear what it does:
– Bob: I’m looking to aggregate logs
– Alice: you mean… like… Logstash?

When you get into it, you realize centralizing logs often implies a bunch of things, and Logstash isn’t the only log shipper that fits the bill:

  • fetching data from a source: a file, a UNIX socket, TCP, UDP…
  • processing it: appending a timestamp, parsing unstructured data, adding Geo information based on IP
  • shipping it to a destination. In this case, Elasticsearch. And because Elasticsearch can be down or struggling, or the network can be down, the shipper would ideally be able to buffer and retry

In this post, we’ll describe Logstash and its alternatives – 5 “alternative” log shippers (Filebeat, Fluentd, rsyslog, syslog-ng and Logagent), so you know which fits which use-case.
Read More

RancherOS Monitoring and Logging Support

RancherOS is one of the few “container only” operating systems and it evolved into an excellent orchestration tool for containers, competing e.g. with CoreOS. It supports several types of schedulers such as its own “Cattle” scheduler, as well as Kubernetes, Docker Swarm, and Mesos. A unique feature of RancherOS is its GUI for container orchestration based on templates for application stacks. Rancher Labs maintains a catalog for such templates and has integrated a community catalog, which includes Sematext Docker Agent for the collection of metrics, events and logs from all RancherOS cluster nodes and containers.

Monitoring all RancherOS nodes can be done several different ways, depending on which orchestration tool you use:

  • Deployment via rancher-compose for the whole RancherOS cluster using the Cattle scheduler
  • Deployment via the GUI (rancher server) using the Community Catalog entry (available for the Cattle scheduler)
  • Deployment as Kubernetes DaemonSet via kubectl for the Kubernetes scheduler
  • Deployment as Swarm global service using Swarm scheduler

This post provides the walk-throughs for all these deployment/orchestration options, with the common goal of collecting metrics, logs and events from each RancherOS node and all containers.

Sematext Logging and Monitoring on RancherOS

Setup via Sematext Catalog Entry

When you run the Rancher server user interface, simply search in the community catalog for “sematext”, “monitoring” or “logs” and select “Sematext Docker Agent”.

Sematext Docker Agent in RancherOS Community Catalog

Choose “View Details”, and in the “Configuration Options” enter the SPM and Logsene App tokens. You can obtain these from https://apps.sematext.com, where you can sign up and create your SPM and Logsene apps. If your Rancher cluster runs behind a firewall, you might need to specify the proxy URL in the HTTPS_PROXY or HTTP_PROXY environment variable.

If you’d like to collect all logs, just press “Launch” without specifying any filter for containers or images.

You can find all the details about the template in the Rancher blog post “New in Rancher Community Catalog: Monitoring and Logging by Sematext”.

Setup via rancher-compose command

If you prefer the command line over GUIs and use the Cattle scheduler, then rancher-compose is the right tool to deploy Sematext Docker Agent. The following configuration will activate Sematext Docker Agent on every node in the RancherOS cluster. You’ll need to replace the SPM/Logsene App tokens, of course:

# sematext/docker-compose.yml
sematext-docker-agent:
  image: 'sematext/sematext-agent-docker:latest'
  environment:
    - LOGSENE_TOKEN=YOUR_LOGSENE_TOKEN
    - SPM_TOKEN=YOUR_SPM_TOKEN
  restart: always
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  labels:
    # tells the Cattle scheduler to run one instance on every host
    io.rancher.scheduler.global: 'true'

Store the configuration as sematext/docker-compose.yml and activate the service:

rancher-compose -f sematext/docker-compose.yml up -d

Setup via Kubernetes DaemonSet

If you have already deployed Kubernetes on RancherOS, the Cattle scheduler should not be activated, and you should use a Kubernetes DaemonSet to deploy Sematext Docker Agent to all cluster nodes:

  1. Create a sematext-agent.yml file with the DaemonSet definition (just replace SPM/Logsene App tokens):
    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
     name: sematext-agent
    spec:
     template:
       metadata:
         labels:
           app: sematext-agent
       spec:
         nodeSelector: {}
         dnsPolicy: "ClusterFirst"
         restartPolicy: "Always"
         containers:
         - name: sematext-agent
           image: sematext/sematext-agent-docker:latest
           imagePullPolicy: "Always"
           env:
           - name: SPM_TOKEN
             value: "YOUR_SPM_TOKEN"
           - name: LOGSENE_TOKEN
             value: "YOUR_LOGSENE_TOKEN"
           - name: KUBERNETES
             value: "1"
           volumeMounts:
             - mountPath: /var/run/docker.sock
               name: docker-sock
             - mountPath: /etc/localtime
               name: localtime
           securityContext:
             privileged: true
         volumes:
           - name: docker-sock
             hostPath:
               path: /var/run/docker.sock
           - name: localtime
             hostPath:
               path: /etc/localtime
    
  2. Activate the DaemonSet in the Kubernetes cluster:
    kubectl create -f sematext-agent.yml
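  3. Optionally, verify the rollout: one agent pod should be running on every node (the DaemonSet name and the app label below match the definition above):
     kubectl get daemonset sematext-agent
     kubectl get pods -l app=sematext-agent -o wide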

Setup via Swarm global services

As with Kubernetes on RancherOS, the Cattle scheduler is deactivated when you use Swarm. Thus, you can deploy Sematext Docker Agent as a global service on Swarm (Docker Engine 1.12 or later). Connect your Docker client to the RancherOS Swarm API endpoint and run the following global service definition with your SPM/Logsene App tokens. This will add Sematext Docker Agent to each Swarm node as soon as the node is launched.

docker service create --mode global --name sematext-agent-docker \
--mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
-e SPM_TOKEN=YOUR_SPM_TOKEN -e LOGSENE_TOKEN=YOUR_LOGSENE_TOKEN \
sematext/sematext-agent-docker
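
To double-check the result, you can list the service's tasks; since it's a global service, you should see one task per Swarm node (the service name matches the command above):

docker service ls
docker service ps sematext-agent-docker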

Setup via Mesos Marathon

The following configuration will activate Sematext Docker Agent on every node in the Mesos cluster. Please note that you have to specify the number of Mesos nodes (instances), SPM App Token, and Logsene App Token. Example call to the Marathon API:

curl -XPOST -H "Content-type: application/json" http://your_marathon_server:8080/v2/apps -d '
{
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "sematext/sematext-agent-docker",
      "privileged": true,
      "network": "BRIDGE"
    },
    "volumes": [
      {
        "containerPath": "/var/run/docker.sock",
        "hostPath": "/var/run/docker.sock",
        "mode": "RW"
      }
    ]
  },
  "env": {
    "LOGSENE_TOKEN": "YOUR_LOGSENE_TOKEN",
    "SPM_TOKEN": "YOUR_SPM_TOKEN"
  },
  "id": "sematext-agent-docker",
  "instances": 1,
  "cpus": 0.5,
  "mem": 300,
  "constraints": [["hostname","UNIQUE"]]
}'
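
After Marathon accepts the app definition, you can read its status back from the same API (same placeholder server address as above):

curl http://your_marathon_server:8080/v2/apps/sematext-agent-docker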

Summary

RancherOS provides many options and plays well with the most common orchestration tools.

Users can pick their preferred solution together with Sematext Docker Agent for the collection of metrics, events and logs. We have seen users move from one orchestration platform to another when they hit issues, and having the flexibility to do so easily is extremely valuable in the rapidly changing container world.

As we hope you can see, RancherOS provides this flexibility, while Sematext Docker Agent support for all leading orchestration platforms ensures continuous visibility into Docker operational intelligence – metrics, events, and logs, regardless of which orchestration platform you use.
