Kafka Consumer Lag Monitoring

SPM is one of the most comprehensive Kafka monitoring solutions, capturing some 200 Kafka metrics, including Kafka Broker, Producer, and Consumer metrics. While lots of those metrics are useful, there is one particular metric everyone wants to monitor – Consumer Lag.

What is Consumer Lag

When people talk about Kafka or about a Kafka cluster, they are typically referring to Kafka Brokers. You can think of a Kafka Broker as a Kafka server. A Broker is what actually stores and serves Kafka messages. Kafka Producers are applications that write messages into Kafka (Brokers). Kafka Consumers are applications that read messages from Kafka (Brokers).

Inside Kafka Brokers data is stored in one or more Topics, and each Topic consists of one or more Partitions. When writing data, a Broker actually writes it into a specific Partition. As it writes data it keeps track of the last “write position” in each Partition. This is called the Latest Offset, also known as the Log End Offset. Each Partition has its own independent Latest Offset.

Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of “read position” in each Partition whose data it is consuming. That is, it keeps track of which data it has read. This is known as Consumer Offset. This Consumer Offset is periodically persisted (to ZooKeeper or a special Topic in Kafka itself) so it can survive Consumer crashes or unclean shutdowns and avoid re-consuming too much old data.

Kafka Consumer Lag and Read/Write Rates

In our diagram above we can see yellow bars, which represent the rate at which Brokers are writing messages created by Producers.  The orange bars represent the rate at which Consumers are consuming messages from Brokers. The rates look roughly equal – and they need to be, otherwise the Consumers will fall behind.  However, there is always going to be some delay between the moment a message is written and the moment it is consumed. Reads are always going to be lagging behind writes, and that is what we call Consumer Lag. The Consumer Lag is simply the delta between the Latest Offset and the Consumer Offset.
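
If you want to see these offsets – and the lag computed from them – outside of SPM, Kafka ships with a consumer group tool that prints them per Partition. Here is a rough sketch; the script name and flags vary a bit between Kafka versions, and my-consumer-group is just a placeholder:

# Describe a consumer group to see per-partition read/write positions
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

# TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# my-topic  0          8523            8531            8
# my-topic  1          8101            8101            0
#
# LAG = LOG-END-OFFSET (Latest Offset) - CURRENT-OFFSET (Consumer Offset)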

Why is Consumer Lag Important

Many applications today are based on being able to process (near) real-time data. Think about a performance monitoring system like SPM or a log management service like Logsene. They continuously process infinite streams of near real-time data. If they were to show you metrics or logs with too much delay – if the Consumer Lag were too big – they’d be nearly useless.  This Consumer Lag tells us how far behind each Consumer (Group) is in each Partition.  The smaller the lag, the more real-time the data consumption.

Monitoring Read and Write Rates

Kafka Consumer Lag and Broker Offset Changes

As we just learned, the delta between the Latest Offset and the Consumer Offset is what gives us the Consumer Lag.  In the above chart from SPM you may have noticed a few other metrics:

  • Broker Write Rate
  • Consume Rate
  • Broker Earliest Offset Changes

The rate metrics are derived metrics.  If you look at Kafka’s metrics you won’t find them there.  Under the hood, SPM collects a few metrics with various offsets from which these rates are computed.  In addition, it charts Broker Earliest Offset Changes, which is the earliest known offset in each Broker’s Partition.  Put another way, this offset is the offset of the oldest message in a Partition.  While this offset alone may not be super useful, knowing how it’s changing can be handy when things go awry.  Data in Kafka has a certain TTL (Time To Live) to allow for easy purging of old data.  This purging is performed by Kafka itself.  Every time such purging kicks in, the offset of the oldest data changes.  SPM’s Broker Earliest Offset Change surfaces this information for your monitoring pleasure.  This metric gives you an idea of how often purges are happening and how many messages they removed each time they ran.
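
How aggressively Kafka purges old data is controlled by its log retention settings.  As a rough sketch – exact property names and defaults depend on the Kafka version, and the values below are purely illustrative – broker-level retention lives in server.properties, and individual Topics can override it:

# server.properties – broker-level retention
log.retention.hours=168                  # delete log segments older than 7 days
log.retention.bytes=1073741824           # or once a partition log exceeds ~1 GB
log.retention.check.interval.ms=300000   # how often the retention check runs

# Per-topic override (flags vary by Kafka version)
kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic \
  --config retention.ms=86400000

Every time one of these thresholds is crossed and old segments are deleted, the Broker Earliest Offset for the affected Partition jumps forward.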

There are several Kafka monitoring tools out there, like LinkedIn’s Burrow, whose Offset and Consumer Lag monitoring approach is used in SPM.  If you need a good Kafka monitoring solution, give SPM a go.  Ship your Kafka and other logs into Logsene and you’ve got yourself a DevOps solution that will make troubleshooting easy instead of dreadful.

 

Monitoring rsyslog with Kibana and SPM

A while ago we published this post where we explained how you can get stats about rsyslog, such as the number of messages enqueued, the number of output errors and so on. The point was to send them to Elasticsearch (or Logsene, our logging SaaS, which exposes the Elasticsearch API) in order to analyze them.

This is part 2 of that story, where we share how we process these stats in production. We’ll cover:

  • an updated config, working with Elasticsearch 2.x
  • what Kibana dashboards we have in Logsene to get an overview of what rsyslog is doing
  • how we send some of these metrics to SPM as well, in order to set up alerts on their values: both threshold-based alerts and anomaly detection

Read More

Sematext JVM Profiler

Introducing On-demand Java Profiling

If you are running apps on top of the JVM and want to be able to profile them in production, on-demand, without affecting your app’s performance and users, read on!  Screenshots, features, and other juicy stuff are further down.

Do you run any apps on the JVM?  How do you find bottlenecks in your apps once they are in production, so you can optimize them?  If they become slow, how do you find which part of the code in your app is slow?

Maybe you look at metrics like Garbage Collection.  Maybe you run commands like jstat to see if various memory pools are full or if there are too many FGCs (Full Garbage Collections).  Or perhaps you run jstack or do kill -3 PID (SIGQUIT) and look at thread dumps?

All of these are reasonable approaches… until your infrastructure grows and/or you get tired of running around, sshing to machines, running jstat and jstack or kill with sudo so you have sufficient rights to execute those commands, and so on.
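
For reference, that manual routine typically looks something like the sketch below, where PID stands for the Java process ID and the column meanings are abbreviated in the comments:

# Sample GC stats every second, 5 times
# (E/O = eden/old gen utilization %, FGC = full GC count, GCT = total GC time)
jstat -gcutil <PID> 1000 5

# Dump all thread stacks to stdout
jstack <PID>

# Or send SIGQUIT to have the JVM print a thread dump to its own stdout
kill -3 <PID>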

Another way to tackle this is to simply have your standard profiler attached to the process, or try to attach it on the fly, but that tends to be difficult, requires more manual work, and we all know full-blown profilers typically slow down apps to the point where they could become unusable.  In short, such a profiling approach is not really suitable for production.

There’s got to be a better way, right?

Indeed, there is.  Meet SPM‘s On-demand Java Profiler!
This low-impact profiler is designed to help you identify bottlenecks in your production environment without slowing down your apps.  It provides rich analysis of the running code, much like a typical profiler.

What Types of Apps can you Profile?

The SPM profiler can profile anything that runs on top of the JVM.  This means it can profile Java apps, Scala apps, even things like Clojure and Groovy.  You are not limited to profiling only your own apps – the apps you developed.  You can also use it to profile any of the other SPM Integrations that run on the JVM – you can profile your webapps running in Tomcat, Jetty, JBoss, or Glassfish, you can profile Solr or Elasticsearch, Spark, Kafka, Storm, and so on.

How to Profile your App with SPM

If your SPM agent version is 1.29.2 or newer, you’re set.  If you have an older version, update it first.

You specify the application you want to profile by:

  • selecting the SPM agent that monitors it,
  • choosing how long you want to profile it,
  • … and SPM does the rest.

Start Java Profiler in SPM
Start Java Profiler (click the image for a bigger view)

The selected SPM agent will then start profiling.  When done, the profiler will show you things like time spent in various methods and in their children.  It will show you the call tree, which you can look at top-down or bottom-up, etc.  It is really quite useful — in our very first profiler test run we immediately identified a suboptimal piece of code in one of our own applications!  Talk about instant gratification!

Profiling Features and Benefits

With SPM’s On-demand profiler you can:

  • Profile anything that runs on the JVM, such as Java, Scala, Clojure, Groovy, etc.
  • Start the profiling on demand and without having to add anything to the Java command-line and without needing to restart the process/app you want to monitor.
  • Find the root cause of performance bottlenecks in your or in 3rd party apps.
  • Capture stack traces and view aggregated call trees that let you drill down to a specific line in a method.
  • View the profiled call stack in top-down or bottom-up fashion.  The former shows a standard call tree from the entry point/method down to the leaf methods being called, while bottom-up surfaces the hottest methods and points out from where they are called.
  • See both CPU and Wall clock time.  The former shows how much CPU time is taken by each method, while Wall clock time shows how much real time each method took.
  • Exclude specific classes and methods from profiling through easy to use filtering.  This helps reduce the noise and volume of output, thus making it easier and faster to analyze.
  • Hide outliers – methods which are either not called frequently or are super fast and thus not worth looking at.

View profiler results in SPM
View Profiler Results (click the image for a bigger view)

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day trial by registering here. There’s no commitment and no credit card required.  We even offer On Premises SPM and Logsene packages in addition to SaaS if that’s more to your liking.  And, even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

Presentation: Top Node.js Metrics to Watch

Fresh from Germany’s largest Node.js Meetup, hosted by Wikimedia in Berlin, is the latest presentation from Sematext DevOps Evangelist Stefan Thies — “Top Node.js Metrics To Watch”. The event was shared with the Node.js Meetup in London via live video stream, and the full recording is available on YouTube.

Stefan’s talk goes through the challenges of developing Node.js monitoring solutions, such as SPM for Node.js, and covers key metrics, including examples from monitoring Kibana’s Node.js server.

Here is the video:

and the slides:

Please note, the article “Top Node.js Metrics to Watch” was originally published by O’Reilly Radar.

Have a look at our other Node.js posts — there is a lot more interesting material to discover, like  MongoDB monitoring made with Node.js.

Questions or Feedback?
If you have any questions or feedback for us, please contact us by email or hit us on Twitter.  We love talking about performance monitoring — and log management!

 

Introducing NetMaps

New Year, New Feature in SPM!  We are happy to announce the immediate availability of NetMaps in SPM!  Check out why they are useful or watch the short video below.

Ever wondered how different components of distributed apps are actually connected over the network? When it comes to troubleshooting distributed application stacks like Apache Kafka, Spark, Hadoop, Cassandra, Solr, or Elasticsearch — not to mention Microservice architectures or Docker Containers — information about the deployed infrastructure becomes critical. That architecture diagram you drew N months ago?  It’s probably out of date.  Apps we run today are often very dynamic. Instances, nodes, and containers come and go, whether because of elastic up/down scaling or other reasons.

Discovering This Dynamic Infrastructure

Watching the actual network traffic on all nodes could quickly answer many questions for DevOps engineers doing troubleshooting or planning setup changes. For example:

  • Which nodes are online and active?
  • How are nodes connected to other nodes?
  • What are the dependencies between network services?
  • What is the consumed bandwidth between nodes?
  • Which applications run on a specific network node?

Visualize Network Connections

Designed to visualize network connections and answer the above questions instantly, NetMaps also include:

  • Automatic Discovery of network nodes and applications
  • Filtering by application and host name
  • Automatic Visualizations as Network Map and Chord Diagrams
  • Interactive Explorer for following network links for each application node
  • Bandwidth consumption for all incoming and outgoing network connections
  • Navigation from the NetMap to all nodes and related performance metrics of the monitored App

The best practice is to activate network monitoring on all application server nodes that communicate with databases, message brokers, search engines, etc. That way it is easy to see how client applications communicate with backend servers.

NetMap “Map” View

NetMap_map

NetMap “Chord” View

NetMap_chord

It is very easy to activate Network Monitoring in the SPM Client, a collector for Host and Application Metrics. Intelligent network filters ensure that the resource usage for network monitoring stays low while capturing all relevant packets to explore your infrastructure using the “NetMap” tab in SPM. If you find network maps interesting, you might also be interested in SPM’s AppMap feature for JVM applications, which discovers relationships between monitored JVM applications such as Elasticsearch, Solr, Cassandra, Spark, or Kafka.

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here.  There’s no commitment and no credit card required.

SPM vs. New Relic APM – Performance Monitoring Solution Comparison

If you’ve found your way to this post then chances are high that you’re having second thoughts about diving into a New Relic APM subscription.  You’re not alone.  In fact, we hear from many fellow DevOps engineers looking at performance monitoring solutions who check out New Relic APM — or who are already using it — because it is so widely known, but wonder if there is a better tool, specifically with traits like:

  • Better pricing
  • On Premises deployment (not just SaaS)
  • Integration of metrics, logs and events in a single UI
  • Across-the-board anomaly detection

In all of the above cases — and others — SPM meets or exceeds New Relic APM.  This SPM vs. New Relic APM comparison document has all the details.

SPM_NR_header

So why not give SPM a try?  You can check out a free 30-day trial by registering here.  There’s no commitment and no credit card required.  Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

MongoDB Monitoring Support

For many of us in the DevOps field, MongoDB is a critical part of our IT stack.  With today’s acquisition of WiredTiger, MongoDB is further establishing itself as the NoSQL DB built to support massive data processing and storage.  It would be an understatement to say that MongoDB does a lot, with many organizations using it as their backend storage framework, analytics backend, and so on.

So your MongoDB cluster really, really needs to be in tip-top shape.  All the time.  And if it’s not then you need to know asap — or better yet — prevent problems before they kick in and make your life difficult.  That’s where SPM comes in — with MongoDB monitoring, alerting and anomaly detection.  MongoDB exposes a boatload of metrics, but instead of just throwing all of them on endless charts, we’ve taken the time to cherry-pick what we think are the top 50 most valuable MongoDB metrics to monitor. We have furthermore made it possible to filter the MongoDB metrics by server, as well as by database and collection where possible.
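
Most of these metrics ultimately come from MongoDB’s serverStatus command.  If you want to peek at the raw counters yourself, a minimal sketch – assuming a local mongod and the mongo shell – looks like this:

# Operation counters: queries, inserts, updates, deletes, getmores, commands
mongo --quiet --eval "printjson(db.serverStatus().opcounters)"

# Memory (resident, virtual, mapped) and network (bytesIn, bytesOut, numRequests)
mongo --quiet --eval "printjson(db.serverStatus().mem)"
mongo --quiet --eval "printjson(db.serverStatus().network)"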

The key metric groups we track are:

  • Database Operations
  • Database Memory
  • Database Storage
  • Documents
  • Locks
  • Network
  • Database Journal
  • Background Flushes

The Overview below provides 9 charts with key MongoDB metrics:

  • Row 1 displays CPU, Memory and Disk Metrics
  • Row 2 displays Database Operations, Database Memory and Database Storage Metrics.
  • Row 3 adds Collection/Document Metrics, Locks, and wait times; followed by Network Metrics for MongoDB

MongoDB_Overview

SPM for MongoDB Overview

If you monitor a MongoDB cluster, the Server tab provides a quick overview of the health of each node:

MongoDB_Server_view

SPM Server View

The Reports on the left side of the screen below provide detailed information for each group of metrics. Let’s have a quick look at them.

MongoDB_CPU_details

OS Metrics: CPU Metrics, Memory Usage, Disk Space and I/O

Below is an example of some of the key MongoDB Metrics found in SPM:

  • Database Operations: Counters for Queries, Insert, Update, Delete and other commands for the main database plus replica operations
  • Database Memory: Resident-, Virtual-, Mapped-, and Journal Memory
  • Database Storage: Size of Data Files, Namespace Files, DB Files etc., plus Size of Objects, Number of Collections and Objects

MongoDB_Storage

MongoDB Storage & Collections

The screenshot below shows:

  • Documents: Counters for Documents inserted, updated or returned by queries
  • Locks: Lock counters and lock acquisition wait times at the Global, Database, Collection and Journal levels. Since MongoDB 3.x, locks are not always global. SPM shows a breakdown for all lock types. These metrics are good candidates for alerting when anomalies are detected.  Simply add an alert from the menu in the top-left corner of each chart.

MongoDB_Locks

Metrics for all MongoDB Locks

Other key MongoDB metrics that SPM displays are:

  • Network: Number of client connections, Received and transmitted data, Request rate
  • Database Journal: Commits, Early Commits,  Commit times and lock times

MongoDB_Journal_Commits

MongoDB Journal Metrics

If you’d like to see MongoDB metrics together with the top Node.js metrics, you might like the idea of putting MongoDB and Node.js metrics from SPM for Node.js in a custom dashboard:

MongoDB_Locks-Node.js_Loop

SPM Custom Dashboard with MongoDB Locks and Node.js Event Loop Latency

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day trial by registering here.  There’s no commitment and no credit card required.  Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

Docker + Solr How-to: Monitoring the Official Solr Docker Image

The official Solr Image on Docker Hub was released just a few weeks ago and already has 16K pulls. Why not more? Well, there are more than 200 different Solr images on Docker Hub — probably because no official Image was available!

A rapidly growing number of organizations are using Solr and Docker in production and they are probably happy about the new official Image. Needless to say, monitoring Solr is essential in production. Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning.  These include:

  1. Changed deployment for Solr and its monitoring tools using Dockerfile, Docker Compose or various Orchestration Tools
  2. There is a new Layer to monitor: Container Metrics and Events, see: Docker Events and Metrics monitoring and SPM for Docker
  3. Logging has changed: containers log to the console, and logs need to be retrieved from the Docker daemon instead of from the Solr log file.  Check out our post on the subject: Innovative Docker Log Management
  4. Official Images may not provide options for monitoring (such as JMX).  However, the official Image for Solr provides an option to pass parameters to the Java Runtime Environment.  We will use this option for Solr monitoring in this post.

Next, I’m going to demonstrate the setup of a Solr node with SPM. The final setup will provide the full Solr & Docker Monitoring and Logging package:

  • Detailed Application Metrics for Solr, deployed on Docker
  • Detailed Container Metrics and Docker Events
  • Centralized Logs for all Containers by SPM for Docker

Let’s first decide on one of the following options to monitor Solr on Docker:

  1. Build your own Solr container with a mix of open-source monitoring/alerting tools. I’m not going to go into detail about this option today because dealing with a mix of open-source DevOps tools and a non-official Solr image doesn’t sound clean; plus, we can do better.
  2. Use a standalone monitoring agent, which queries metrics from the Solr container. This requires a setup for JMX and Docker networking configurations for the monitor and Solr. The metrics gathered by remote agents are limited and, in the Docker context, running an external monitoring process plus Solr processes consumes more resources.  And the next option …
  3. Inject an SPM in-process monitoring agent into Solr. This option has the lowest resource usage and has support for advanced monitoring functions like Transaction Tracing and AppMap.

We’ll go with Option #3 in this blog post, as it provides the best insights into Solr.  Sematext provides the SPM Client (this includes the monitoring agent and metrics sender) pre-installed in a Docker Image.  We refer to this dockerized SPM Client as “SPM Client Image/Container” in the following instructions.  The main trick here is to mount a volume from SPM Client Container into Solr Containers in order to load the monitoring library that’s part of the SPM Client Container.

Let’s have a look at the desired setup and how to get there:

SPM-Solr-Docker-Schema
Monitoring Setup

We’ll use the latest Docker-Compose version (> v1.5) because it supports environment variable substitution in Docker-Compose files.

1) Configure and start SPM-Client Container

The SPM Token is a unique identifier for monitored applications – if you haven’t created an SPM App for Solr, then create one here first. Should take about 37 seconds.

# Set the SPM Token as Environment Variable
export SPM_TOKEN=4feb144c-4da8-4081-83b5-b0b8e06e743a
# Set the JVM Name, which appears in SPM JVM Metrics Report
# In addition we will use it as Hostname for the Solr container
export JVM_NAME=SOLR1

2) Create the SPM Client and Solr services in docker-compose.yml. Note: you may copy this file to make changes for additional Solr options; all parameters are set as Environment Variables.

spm-client-solr:
  image: sematext/spm-client
  container_name: spm-client-solr
  hostname: spm-client-solr
  environment:
    - SPM_CONFIG=${SPM_TOKEN} solr javaagent ${JVM_NAME}

SOLR1:
  image: solr
  hostname: solr1
  ports:
    - "8983:8983"
  volumes_from:
    - spm-client-solr
  environment:
    - SOLR_OPTS=-Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-solr.jar=${SPM_TOKEN}::${JVM_NAME}
  command: bin/solr -f

In the Environment variable “SOLR_OPTS” in the Docker-Compose file above we see options for the SPM in-process monitor to inject a .jar file from the SPM Client Volume.  The SOLR_OPTS string is taken from SPM install instructions.  It includes the SPM Token (the ${SPM_TOKEN} part) and provides the JVM name so we can distinguish between multiple Solr instances if we run N of them on the same host (the ::${JVM_NAME} part).

3) Run Solr and SPM Monitor  

We are now ready to fire up Solr:

    docker-compose up -d

Solr_image_code

All done! After about a minute, metrics for the Docker Host, JVMs and Solr nodes will appear in SPM.  Because we chose consistent naming for the Container hostname and JVM name, we can immediately see, in every chart, the relevant filters named “SOLR1”.  This is much better than some random Container IDs.

Solr_image_screen_4

Solr Metrics Overview

But what about my Solr Logs and the Container Metrics?

Simply run SPM for Docker – it collects logs as well as container and host metrics.  It can also parse Solr logs and store them in Logsene (see Logsene 1-Click ELK Stack), which is awesome because it means you can have both Solr/OS/JVM metrics AND Solr logs all in one place!  Or do you maybe prefer to ssh to your servers and grep log files?

Docker Logs & Metrics Steps:

First we create the SPM App with the type “Docker” for Docker-specific metrics and then we create a Logsene App for our logs. Then we use the generated App Tokens to run Sematext Agent for Docker.

docker run -d --name sematext-agent -e SPM_TOKEN=SPM_DOCKER_APP_TOKEN -e LOGSENE_TOKEN=LOGSENE_APP_TOKEN sematext/sematext-agent-docker

After a few minutes, you will get Host and Container Metrics together with Events and Logs in SPM, as shown here:

Solr_image_screen_2

Please note that logs from the containers are automatically shipped and parsed! No setup for log shippers? That is correct — there is NO complicated setup of syslog, Logstash, Docker log drivers, etc.  All this work is done by SPM for Docker. For example, each log line has a “node_name” field for the Solr node. It takes the timestamp, severity, class, thread and source from the Solr log and each log is automatically tagged with the container ID and image name. Moving from SPM Metrics to detailed Solr Logs including Exceptions and parsed Stack Traces is just another mouse click away! Look:

Solr_image_screen_3
Multi-Line Exception, captured and parsed from Solr container

 

solr-logsene

The filters next to field stats on the right side of the screen make it easy to identify containers with the most logs by choosing “container_name”.  That’s just a little detail in the Logsene UI – feel free to explore it by creating Alerts or Kibana 4 Dashboards for your container logs.

Like what you saw here? To monitor Docker and Solr with SPM just get a free account here!  And drop us an email or hit us on Twitter with suggestions, questions or comments.  Solr and Docker are topics we enjoy chatting about with the community!

Top Node.js Metrics to Watch

Monitoring Node.js Applications has special challenges. The dynamic nature of the language provides many “opportunities” for developers to produce memory leaks, and a single function blocking the event queue can have a huge impact on overall application performance. Parallel execution of jobs is done with multiple worker processes via the “cluster” functionality to take full advantage of multi-core CPUs – but the master and worker processes belong to a single application, which means they should be monitored together. Let’s take a deep look at the top metrics in Node.js applications to get a better understanding of why they are so important to monitor.

Note: All images in this post are from Sematext’s SPM Performance Monitoring solution and its Node.js integration.

Garbage Collection & Process Memory – Node.js is based on Google’s Chrome V8 JavaScript engine. Garbage Collection reclaims memory used by objects that are no longer required. V8 garbage collection pauses program execution. Incremental GC cycles (scavenging) process only a part of the Heap and are very fast. Full GC cycles deal with objects that survived multiple Incremental GC runs. Full GC runs are executed less frequently to minimize pauses in the program execution.

With regard to garbage collection metrics, we should first measure the total time spent on garbage collection. In addition, it is useful to see how often a full GC cycle — or incremental GC cycle — is executed. The size of heap memory can be compared with the heap size after the last GC run to see if there is a growing trend. That’s why the following metrics should be monitored (a quick way to eyeball them locally is sketched after the list):

  • Time consumed for garbage collection
  • Counters for full GC cycles
  • Counters for incremental GC cycles
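
As a quick local sanity check – not a substitute for continuous monitoring – V8 and Node.js can print these numbers directly; in the sketch below, app.js stands in for your application’s entry file:

# Print a line for each GC event: type, heap size before/after, pause duration
node --trace-gc app.js

# One-off look at current heap usage from inside Node itself
node -e "console.log(process.memoryUsage())"
# -> { rss: ..., heapTotal: ..., heapUsed: ..., external: ... }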

Nodejs_garbage_collection

Garbage Collection Metrics

Aside from how often GC happens and how long it takes, we can measure the effect on memory by providing the following metrics:

  • Released memory between GC cycles (see above)
  • Process Heap Size and Heap Usage

Nodejs_process_memory

Process Memory Information

Event Loop – The secret of Node.js’s performance is its use of asynchronous, non-blocking operations; that way the CPU can be highly utilized and doesn’t waste cycles waiting for I/O operations. This means a server can handle many connections and will not be blocked by async operations. As soon as an operation is finished, callback functions are used to continue processing.  The implementation is based on a single event loop, which dispatches async operations and runs their callbacks as they complete. Using synchronous operations drags down performance because other operations need to wait to be executed.  That’s why the golden rule for Node.js performance is “don’t block the event loop”.

The metric to watch is the Latency to process the next event:

  • Slowest Event Handling (Max Latency)
  • Average Event Loop Latency
  • Fastest Event Handling (Min Latency)

Nodejs_slow_avg_fast

Slowest, average and fastest event processing

A high latency in the event loop might indicate the use of blocking (sync) or time-consuming functions in event handlers, which could impact the performance of the whole Node.js application.

Cluster Mode and number of processes – To scale Node.js beyond the capacity of a single process, the use of master and worker processes is required – the so-called “cluster” mode. The master process shares sockets with the forked worker processes and can exchange messages with them. A typical use case for web servers is forking N worker processes, which operate on the shared server socket and handle requests in round-robin fashion (since Node v0.12). In many cases N is chosen to match the number of CPUs the server provides – that’s why a constant number of worker processes should be the regular case.  If this number changes it means worker processes have been terminated for some reason.  In the case of processing queues, workers might be started on demand.  In this scenario it would be normal for the number of workers to change all the time, but it might be interesting to see how long a higher number of workers was active. Using a tool like SPM for Node.js lets one track the number of workers. When picking a monitoring solution or developing your own monitoring for Node.js, make sure it is capable of filtering by hostname and worker ID.  Keep in mind Node.js workers can have a very short lifetime that traditional monitoring tools may not be able to handle well.

Nodejs_worker_count

Worker Count

For example, to compare event loop latency in different Node.js sub-processes, we need to be able to select workers we want to compare:

Nodejs_event_loop

Event Loop Latency for each Worker

Web Frameworks – There is a steadily growing number of frameworks to build web services using Node.js.  The most popular are: Express, Hapi.js, Restify, Mean.io, Meteor, and many more. When doing HTTP monitoring, here are some of the key metrics to pay attention to:

  • Response time (http/https)
  • Request rate
  • Error rates (total, error categories)
  • Request/Response content size

Nodejs_HTTP:HTTPS

HTTP/HTTPS Metrics Overview

Of course, Node.js apps don’t run in a vacuum.  They connect to other services, other types of applications, caches, data stores, etc.  As such, while knowing what key Node.js metrics are, monitoring Node.js alone or monitoring it separately from other parts of the infrastructure is not the best practice.  If there is one piece of advice I can give to anyone looking into (Node.js) monitoring it is this: when you buy a monitoring solution — or if you are building it for your own use — make sure you end up with a solution that is capable of showing you the big picture.  For example, Node.js is often used with Elasticsearch (see Top 10 Elasticsearch Metrics to Watch post), Redis, etc.  Seeing metrics for all the systems that surround Node.js apps is precious.  Here is just a small example of a dashboard showing a few Node.js and Elasticsearch metrics together.

Nodejs_combined_dashboard

Combined Dashboard of Node.js and Elasticsearch Metrics

So, those are our top Node.js metrics — what are YOUR top 10 metrics? We’d love to know so we can compare and contrast them with ours in a future post.  Please leave a comment, or send them to us via email or hit us on Twitter: @sematext.

And…if you’d like to try SPM to monitor Node.js yourself, check out the free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

[Note: this post originally appeared on Radar.com]

Death to APM and Logging Silos

Sematext has combined the power of SPM and Logsene in a single pane of glass – a unified view into all the key bits of operational intelligence every DevOps engineer needs: server and application performance metrics, logs, events, anomalies, alerts, ChatOps integrations, etc.  In other words, the whole is greater than the sum of its parts.

Metrics + Logs Correlation using SPM and Logsene Together in One UI

This video demonstrates how the SPM + Logsene combination solves the problems of having too much data to manage yourself and the disconnect when metrics and logs are siloed.  We address two of the most common problems — and their solutions — below the video.

Problem 1 – Big Data, Big Burden: Servers, Containers, Apps, and Devices spew out more and more data: more metrics, more logs, more events. Collecting and storing all this data is a challenge and is often not cheap, both in terms of the time invested in building and maintaining large-scale data collection, storage, and retrieval systems, and in terms of the infrastructure needed to run them.

Solution: Focus on your organization’s core business, your core strength, and outsource needlessly painful or expensive parts to those who specialize in them.  We already outsource all the time, except we don’t call it “outsourcing”: we buy food, we don’t grow or raise it.  We buy cars and don’t build them.  Most of us don’t buy physical servers any more.  Why?  Because others do that better, faster, cheaper.

Problem 2 – Metrics vs. Log Silos: Collecting and visualizing performance metrics and getting alerts when things go awry is great, but performance charts can tell us only so much.  Code instrumentation, like SPM’s Transaction Tracing, goes deeper and provides more insight, but still doesn’t tell us the whole story.  Similarly, collecting logs and being able to search them is very valuable.  Unfortunately, oftentimes APM and log management solutions live in separate silos that don’t really talk to each other.

Solution: Don’t waste your time jumping between multiple disconnected solutions, be they open-source or commercial.  Time is the most precious thing each of us has, and our time as DevOps engineers is very expensive.  Use a tool or service that gives you access to as many of the bits of operational information you need as possible.  Not only is this more efficient, and thus cheaper, it’s also much more pleasant than jumping between browser and terminal, top, vmstat, dstat, less, grep, etc., which are needlessly manual and get boring.

Troubleshooting Doesn’t Need To Take Over Your Life

Troubleshooting production performance issues, dealing with APM alerts and even looking at logs (don’t even think about grepping!) isn’t that hard or time consuming.  Well, as long as you have the right tools, that is.

Correlate_1

Cloud & On Premises Deployments

Unlike most monitoring and logging solutions, Sematext offers both Cloud and On Premises deployments for SPM and Logsene.  We’re happy to discuss package pricing if you’d like to combine both products.

Got ideas how we could make metrics and logs correlation more useful for you?  Let us know via comments, email or @sematext.

Not using SPM and/or Logsene yet? Check out the free 30-day trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.