
Full Guide to Linux Disk IO Monitoring, Alerting and Tuning

February 5, 2025


Disk IO (Input/Output) is a core aspect of system performance. Whether you’re managing a database, a web application, or a cloud server, how efficiently your system reads and writes data affects everything from response times to stability.

Unlike high CPU usage or memory bottlenecks that often manifest immediately, disk IO issues tend to creep up silently—until they slow down critical processes. A sluggish database query, an application taking too long to load, or a system hanging under load can often be traced back to disk performance.

This guide walks through setting up disk IO monitoring on Linux, covering both built-in tools and more advanced solutions. By the end, you’ll have a clear understanding of how to monitor, alert on, and optimize disk performance to keep your systems running smoothly.

Understanding Disk IO in Linux

Disk IO refers to the read and write operations between RAM and storage devices (HDD, SSD, or network storage). When applications request data, the system either retrieves it from memory (fast) or from disk (slower). Multiple processes competing for disk access can lead to contention and performance degradation.

Key Metrics to Monitor

  1. Throughput – Measures data transfer speed (MB/s, GB/s).
  2. IOPS – Tracks how many individual disk operations occur per second.
  3. Latency – The time it takes for a read/write operation to complete (ms).
  4. Disk Utilization – The percentage of time the disk is actively processing requests.

Fun fact: in AWS, the IOPS you get can be tied to the size of the disk; gp2 EBS volumes, for example, get a baseline of 3 IOPS per GiB provisioned.
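If you’re curious where tools like iostat get these numbers, they all come from the kernel’s /proc/diskstats counters. Here’s a minimal sketch that computes IOPS and throughput for one device straight from those counters (the device name is an assumption, and the field positions follow the standard layout in the kernel’s iostats documentation; sectors in this file are always 512 bytes):

```shell
#!/bin/sh
# Derive r/s, w/s and MB/s for one device from /proc/diskstats.
# Field layout (kernel Documentation/admin-guide/iostats.rst):
#   $4 = reads completed, $6 = sectors read,
#   $8 = writes completed, $10 = sectors written
DEV=${1:-sda}   # device name is an assumption; pass yours as the first argument
T=1             # sampling interval in seconds

snap() { awk -v d="$DEV" '$3 == d {print $4, $6, $8, $10}' /proc/diskstats; }

s1=$(snap); sleep "$T"; s2=$(snap)

# Subtract the two samples and divide by the interval to get rates
echo "$s1 $s2" | awk -v t="$T" '{
  printf "r/s %.1f  w/s %.1f  rMB/s %.2f  wMB/s %.2f\n",
    ($5 - $1) / t, ($7 - $3) / t,
    ($6 - $2) * 512 / 1048576 / t, ($8 - $4) * 512 / 1048576 / t
}'
```

Latency and utilization come from the time-spent fields further right in the same file, which is exactly what iostat’s await and %util are computed from.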

Built-in Linux Tools for Disk IO Monitoring

If you are a console lover, this section is for you. We’re covering 5 powerful tools for monitoring disk IO. My favorite is dstat, but all these tools help track read/write speeds, disk utilization, and IOPS in real-time, making them essential for performance analysis and troubleshooting.

1. iostat – General Disk Performance Overview

iostat is one of the most effective tools for tracking disk IO performance.

Installation:
Most Linux distributions don’t include iostat by default. Why not!? Anyway, install it using:

sudo apt install sysstat   # Debian/Ubuntu
sudo yum install sysstat   # RHEL/CentOS
sudo dnf install sysstat   # Fedora

Basic Usage:

iostat -x 1
  • -x provides extended statistics (including utilization and queue depth).
  • 1 updates the stats every second.

Key Metrics in Output:

That -x output really is extended, but here are the key metrics in all that output that you want to pay extra attention to when troubleshooting disk IO.

  • r/s, w/s (Reads/Writes per second): How many read/write operations happen each second.
  • rMB/s, wMB/s (Read/Write throughput): Amount of data read/written per second in MB.
  • await (Average IO wait time in ms): High values indicate slow disk response times.
  • %util (Disk utilization): Percentage of time the disk is busy. If this is consistently above 80-90%, the disk may be a bottleneck.

If you are new to disk IO performance, the table below should point you in the right direction.

Symptom                        Possible Cause
-----------------------------  --------------------------------------------------
High await (above 20 ms)       Slow storage device or IO bottleneck
Low r/s, w/s but high %util    Disk is struggling with large requests
High avgqu-sz                  IO requests are piling up in the queue
High wrqm/s but low w/s        Writes are waiting too long before being committed

If %util is near 100% and await is high, the storage system is overloaded and may need tuning or hardware upgrades.

2. iotop – Process-Based IO Monitoring

iotop is a real-time disk monitoring tool that works similarly to top, but specifically for tracking disk read and write activity by the process. It will help you figure out which of your applications or services are generating the most IO load.

Installation:

sudo apt install iotop   # Debian/Ubuntu
sudo yum install iotop   # RHEL/CentOS

Basic Usage:

sudo iotop

It looks like top, surprise, surprise 🙂 

Understanding iotop Output

  • DISK READ/DISK WRITE – This shows how much data each process is reading and writing per second.
  • SWAPIN % – Indicates if the process is using swap space (low values are good).
  • IO> – The percentage of time a process is waiting for IO (higher means the process is disk-bound).
  • COMMAND – Displays the exact process using disk resources.

A high IO> value (above 80%) means the process is IO-limited: it spends most of its time waiting to read or write data on disk. Never a good thing. Not only will this application be slow, it may also slow down other applications using the same disk.

Detecting Performance Issues Using iotop

Symptom                                    Possible Cause
-----------------------------------------  --------------------------------------------------------
Process with high IO> but low CPU usage    IO bottleneck slowing down the app
mysqld consuming most disk reads/writes    Database queries might need optimization
High disk writes from rsync or logrotate   Excessive logging or backups impacting performance
nginx showing unexpected high reads        Serving large static files from disk instead of caching
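Under the hood, iotop reads per-process counters from /proc/<pid>/io. If iotop isn’t available, here’s a rough sketch of the same idea: rank processes by cumulative bytes written. Note these are totals since each process started, not per-second rates, and you’ll need root to see other users’ processes:

```shell
#!/bin/sh
# Top 5 writers by cumulative write_bytes, straight from /proc/<pid>/io
for p in /proc/[0-9]*; do
  [ -r "$p/io" ] || continue
  wb=$(awk '/^write_bytes/ {print $2}' "$p/io" 2>/dev/null)
  [ -n "$wb" ] || continue
  printf '%s %s %s\n' "$wb" "${p#/proc/}" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -5   # columns: bytes written, PID, command
```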

 

3. vmstat – System-Wide Performance Metrics

vmstat (Virtual Memory Statistics) is a versatile tool for monitoring overall system performance, including disk IO, CPU, memory, and processes. While it doesn’t provide per-process details like iotop, it offers a quick snapshot of your system’s health.

Basic Usage:

vmstat

Example Output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0  70400 137776  41652 369940    1    2   185   295   60  113  1  0 98  0  0

Relevant Columns:

  • r and b (Run/Block Processes):
    • r shows the number of processes waiting for CPU time.
    • b shows the number of processes blocked, often due to disk IO.
  • bi and bo (Block Input/Output):
    • bi measures data read from the disk (blocks per second).
    • bo measures data written to the disk (blocks per second).
    • Consistently high bo values may indicate excessive write activity.
  • wa (IO Wait):
    • Percentage of CPU time spent waiting for IO operations to complete.
    • High wa values (e.g., above 20%) suggest the system is IO-bound.
  • id (CPU Idle):
    • Percentage of CPU time spent idle. If id is low and wa is high, it’s a clear sign of IO bottlenecks. In plain English, this means that applications running on the host are waiting for the disk while not doing much.

Analyzing Performance with vmstat

Symptom               Possible Cause
--------------------  ------------------------------------------------------------------
High b values         Processes are blocked, likely due to disk IO contention
High bo but low bi    Write-heavy workload, possibly from logs or backups
High wa               Disk IO bottleneck; the storage device may be too slow or overloaded
High bi and low bo    Read-heavy workload, common in database queries or file access
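To turn the wa column into a single number instead of eyeballing the stream, you can average it with awk. The one-liner below assumes wa is the 16th column, which holds for recent procps versions, but count against the header on your system since the layout has shifted between releases:

```shell
# Average IO wait over five 1-second samples (skip the two header lines)
vmstat 1 5 | awk 'NR > 2 {sum += $16; n++} END {if (n) printf "avg wa: %.1f%%\n", sum / n}'
```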

4. dstat – Customizable Performance Monitoring

dstat is a powerful and flexible tool that combines features from iostat, vmstat, and netstat, which is why it’s my tool of choice when I’m working in the terminal. It provides real-time statistics for disk IO, network activity, CPU, memory, and more in an easy-to-read format. 

Installation:

sudo apt install dstat   # Debian/Ubuntu
sudo yum install dstat   # RHEL/CentOS

Basic Usage:

dstat

Example Output:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  1   0  99   0   0| 577k  923k|   0     0 |1408B 3786B| 101   188
  0   0 100   0   0|   0     0 |  70B  246B|   0     0 |  57    83
  0   0 100   0   0|   0     0 |  70B  134B|   0     0 |  41    58
  0   0 100   0   0|   0     0 |  70B  110B|   0     0 |  42    69
  0   0 100   0   0|   0     0 | 164B  208B|   0     0 |  51    76
  0   0 100   0   0|   0     0 |  70B  118B|   0     0 |  37    59
  0   0 100   0   0|   0     0 |  70B  110B|   0     0 |  39    69

Analyzing Disk IO Metrics with dstat

Metric      What It Tells You                 Action to Take
----------  --------------------------------  ----------------------------------------------------------------
read/writ   Real-time read/write throughput   High values may indicate heavy IO load
util        Disk utilization percentage       Consistently above 80% may indicate a bottleneck
tps         Transactions per second           Low TPS but high utilization may suggest inefficient IO patterns

Note that util and tps come from dstat’s --disk-util and --disk-tps plugins rather than the default view.

 

5. sar – Historical Disk IO Monitoring

sar (System Activity Reporter) is ideal for capturing and analyzing historical performance data. When you need to diagnose disk IO issues that occurred in the past or during specific time windows, this is the tool to reach for. Note that, unlike the tools above, which are pure command-line tools, sar relies on a service that runs continuously to collect data.

sar is part of the sysstat package and can record various system metrics, including disk IO, at regular intervals.

Installing and Enabling sar

To use sar, install the sysstat package:

sudo apt install sysstat   # Debian/Ubuntu 
sudo yum install sysstat   # RHEL/CentOS

Once installed, enable the sysstat service to start collecting data:

sudo systemctl enable --now sysstat

By default, sar collects system metrics every 10 minutes and stores them in /var/log/sysstat/. This interval can be adjusted in the configuration file located at /etc/sysstat/sysstat.
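On systemd-based distributions, recent sysstat packages drive collection with a timer unit rather than cron, so the interval is changed with a drop-in override. A sketch assuming the sysstat-collect.timer unit name used by current packages, switching collection to every 2 minutes:

```
# Created via: sudo systemctl edit sysstat-collect.timer
[Timer]
OnCalendar=
OnCalendar=*:00/2
```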

Basic Usage:

To view current disk IO metrics:

sar -d 1 5
  • -d specifies disk activity.
  • 1 5 collects data every 1 second for 5 iterations.

Example Output:

vagrant@vagrant:~$ sar -d 1 5
Linux 5.4.0-89-generic (vagrant) 01/28/25 _aarch64_ (2 CPU)
22:20:55          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
22:20:56       dev7-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-6      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-7      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56     dev259-0      2.00      0.00     32.00      0.00     16.00      0.00      0.50      0.40
22:20:56     dev253-0      8.00      0.00     32.00      0.00      4.00      0.00      0.00      0.40
22:20:56       dev7-8      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
22:20:56       dev7-9      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Key columns include:

  • tps – Transactions per second.
  • rkB/s & wkB/s – Kilobytes read and written per second (older sysstat releases report sectors instead, as rd_sec/s and wr_sec/s).
  • await – Average time (ms) for disk IO operations to complete.
  • %util – Disk utilization percentage.

Setting Up Alerts for Disk IO Issues

Monitoring disk IO metrics manually or periodically is helpful, but in production environments, automation is key. Setting up alerts ensures that you’re notified the moment disk performance issues occur, allowing for proactive troubleshooting before users or applications are affected.

This section covers how to automate disk IO monitoring and configure alerts using scripts, system tools, and external monitoring solutions.

1. Using Shell Scripts for Custom Alerts

You can write a shell script to monitor key disk IO metrics (e.g., from iostat) and trigger alerts when thresholds are exceeded.

Example: Monitoring Disk Utilization

Here’s a basic script that checks if disk utilization (%util) exceeds 80%:

#!/bin/bash

# Threshold for disk utilization
THRESHOLD=80

# Check disk utilization (%util is the last column of iostat -dx output).
# Use the second report: the first one shows averages since boot.
UTIL=$(iostat -dx 1 2 | awk '$1 == "sda" {u = $NF} END {print u}')

# Compare utilization with threshold
if (( $(echo "$UTIL > $THRESHOLD" | bc -l) )); then
  echo "Disk utilization is high: ${UTIL}% on /dev/sda" | mail -s "Disk Alert" admin@example.com
fi

Steps to Deploy

1. Save the script as disk_alert.sh and make it executable:

chmod +x disk_alert.sh

2. Schedule the script to run periodically using cron:

crontab -e

Add a line to run the script every 5 minutes:

*/5 * * * * /path/to/disk_alert.sh
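The same pattern works for latency. Here’s a hedged variant alerting on read latency; it locates the r_await column by name, since its position differs across sysstat versions (newer releases split await into r_await and w_await), and like the script above it assumes a working mail setup:

```shell
#!/bin/sh
# Alert when average read latency on sda exceeds 20 ms.
LIMIT=20

# Use the second iostat report (the first averages since boot), and find
# the r_await column from the header instead of hardcoding its position.
AWAIT=$(iostat -dx 1 2 | awk '
  {for (i = 1; i <= NF; i++) if ($i == "r_await") col = i}
  $1 == "sda" && col {val = $col}
  END {print val}')

if [ -n "$AWAIT" ] && awk -v a="$AWAIT" -v l="$LIMIT" 'BEGIN {exit !(a > l)}'; then
  echo "High read latency: ${AWAIT} ms on /dev/sda" | mail -s "Disk latency alert" admin@example.com
fi
```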

2. Cloud Solutions for Disk IO Monitoring and Alerting

While one could use the above approach to set up alerting, I wouldn’t recommend doing that in production. It’s a seriously poor man’s approach to monitoring, one that would quickly drive anyone crazy. There are several cloud-based solutions I would recommend instead, for which monitoring and alerting are the core functionality. For example, in Sematext you will see charts like these out of the box:

Note the I/O Read/Write chart. That’s the visual version of the read/write metrics the Linux tools above printed to the terminal. Of course, alerting is built into Sematext and easy to set up – note the little bell icon in the screenshot above. You can use it to set up anomaly detection and get alerted about unusual spikes or dips in read or write performance.

Yes, I’ve purposely generated a very “messy” chart with too many data series to show you that even in such situations you can pick out insights about strange or high disk IO.  You can see here that some set of hosts perform a ton of disk writes every night between XXX and XXX. If my job is to run such infrastructure, I’ll want to know about this, I’ll want to dig into what is happening there, ensure there is enough disk IO capacity, and so on.

Here is another example, a more distilled view of disk IO performance:

If your infrastructure is hosted in the cloud, you can also use platform-native monitoring services with built-in alerting features. AWS comes with CloudWatch, Google Cloud has the Google Cloud Operations Suite, and Azure has Azure Monitor.

Tuning Disk IO for Better Performance

Monitoring disk IO is only half the battle—optimizing and tuning disk performance ensures that your system runs efficiently. Below are several strategies to improve disk IO performance, reduce bottlenecks, and maximize throughput.

1. Optimize File System and Mount Options

Using the right file system and mount options can significantly improve performance.

  • Use a modern file system:
    • ext4 is optimized for general-purpose use.
    • XFS is ideal for large-scale and high-performance workloads.
    • btrfs provides advanced features like snapshotting and data integrity.
  • Enable write-back caching:
mount -o remount,noatime,commit=60 /dev/sda1 /mnt
  • noatime: Prevents unnecessary metadata writes when files are accessed.
  • commit=60: Reduces the frequency of metadata commits to disk (default is 5s).
  • Tune journal settings (ext4 only; XFS tunes its journal via mkfs and mount options instead):
tune2fs -o journal_data_writeback /dev/sda1
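The remount above lasts only until reboot; to make noatime and the longer commit interval persistent, put them in /etc/fstab. The UUID and mount point below are placeholders for your own volume:

```
# /etc/fstab – example entry with the options from above
UUID=xxxx-xxxx  /data  ext4  defaults,noatime,commit=60  0  2
```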

2. Adjust IO Scheduler for Workload-Specific Optimization

Linux offers different IO schedulers that impact how disk requests are handled. Choosing the right one depends on your workload.

  • Check current scheduler:
cat /sys/block/sda/queue/scheduler
  • Change scheduler (temporary):
echo "none" > /sys/block/sda/queue/scheduler
  • Make it persistent (GRUB method, for kernels before 5.0; newer blk-mq kernels ignore the elevator= parameter):

Edit /etc/default/grub and modify the kernel parameters:

GRUB_CMDLINE_LINUX_DEFAULT="elevator=none"

Run: 

sudo update-grub
sudo reboot

Scheduler choices:

  • none: Best for SSDs and NVMe drives.
  • mq-deadline: Good for databases and mixed workloads.
  • bfq: Ideal for desktop users to ensure responsive performance.
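Since the elevator= kernel parameter was removed in Linux 5.0 along with the legacy block layer, the usual way to make a scheduler choice persistent on modern kernels is a udev rule. A sketch assuming SATA/SCSI device naming:

```
# /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
```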

3. Increase Read/Write Buffers

Tuning kernel parameters can help improve disk throughput, especially in write-heavy workloads.

Adjust disk readahead (improves sequential reads):

blockdev --setra 4096 /dev/sda

Verify with:

blockdev --getra /dev/sda

Increase dirty writeback timers (delays syncing dirty pages to disk):

sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=10
  • vm.dirty_ratio=40: Allows up to 40% of RAM to be used for dirty pages before forcing a flush to disk.
  • vm.dirty_background_ratio=10: When dirty pages exceed 10% of RAM, the background flush daemon starts writing data.
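sysctl -w changes are lost on reboot; to persist them, drop the values into a file under /etc/sysctl.d (the file name here is arbitrary):

```
# /etc/sysctl.d/99-disk-writeback.conf – apply now with: sudo sysctl --system
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
```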

4. Enable TRIM for SSDs (Improves Performance and Longevity)

If using SSDs, enabling TRIM ensures efficient space reclamation.

  • Check if TRIM is supported:
lsblk --discard
  • Enable TRIM manually:
fstrim -av
  • Enable periodic TRIM (on systemd distributions):
sudo systemctl enable --now fstrim.timer

5. Reduce Swap Usage

Excessive swap usage can degrade disk IO performance. If you have enough RAM, consider reducing swappiness.

  • Check current swappiness:
cat /proc/sys/vm/swappiness
  • Reduce swappiness (recommended for servers):
sysctl -w vm.swappiness=10
6. Monitor and Reduce Unnecessary IO

Identify processes generating excessive IO and optimize them.

  • Find IO-intensive processes:
iotop -o
  • Limit IO priority per process (ionice; class 3 is the idle class, so the process only gets disk time when nothing else needs it):
    ionice -c3 -p <PID>
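ionice works per process; on cgroup-v2 systems you can also weight an entire service’s share of disk time through systemd’s IOWeight directive. A sketch with a hypothetical unit name:

```
# /etc/systemd/system/myapp.service.d/io.conf ("myapp" is hypothetical)
# Default weight is 100; lower means less disk time under contention
[Service]
IOWeight=50
```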

Conclusion

Like maxed-out CPU, maxed-out disk reads or writes can really degrade your users’ experience with your product and negatively impact revenue.  And who wants that!? If you are running a small operation, the Linux tools we covered here –  iostat, iotop, vmstat, dstat, and sar – will help you keep an eye on your disk utilization.

If you are running a proper production system, use a proper monitoring solution, be it Sematext or Datadog, or something else.

 
