5 Ways to Prevent CPU Overload on Linux Servers

February 5, 2025

Every server administrator’s nightmare starts with a single message: “CPU usage at 100%.”

It’s that critical moment when your Linux server transforms from a reliable workhorse into a sluggish mess, taking your applications and user experience down with it.

We’ve all been there… staring at a terminal, watching load averages climb, while frantically trying to figure out which process decided to throw a CPU-hungry party on our server.

Every percentage point above normal means slower response times, frustrated users, and potential revenue loss.

Why Prevention Trumps Reaction

Think about the last time you dealt with a CPU overload crisis. 

How many hours did you spend digging through logs, tracking down runaway processes, and trying to restore services? 

Now imagine if you could have prevented that entire situation with proper monitoring and management.

Here’s what happens when you’re stuck in reaction mode:

  • Your applications start failing as they compete for CPU time
  • Users experience timeouts and delays
  • Database connections pile up
  • Cache misses increase as processes wait for CPU time
  • Background jobs start backing up
  • Other system resources (like memory and I/O) get impacted
  • Your team drops everything to handle the emergency

Sidenote: Ask any engineering manager how they feel about the above. Every time this happens, one or more people jump on the problem and switch context away from whatever they were doing.

This is massively expensive for productivity and hurts the bottom line rather directly. Is paying a little extra for a little more CPU headroom a better ROI? 🙂 

When you proactively monitor and manage your CPU usage, you build a fortress around your system’s stability. 

You’ll spot potential issues while they’re still small ripples, not tsunamis. 

#1 Understanding Process Management to Prevent Linux CPU Overload

Think of Linux processes like guests at an all-you-can-eat CPU buffet. Without proper management, some will hog all the resources while others starve. Let’s fix that.

Process Priorities and Nice Values

Every Linux process has a priority that determines its CPU time allocation. The nice value, ranging from -20 (highest priority) to 19 (lowest priority), helps you control this priority. 

Here’s how to use nice and renice:

# Check current process nice values
ps axo pid,comm,nice

# Start a new process with a specific nice value
nice -n 10 ./my_script.sh

# Change nice value of a running process
renice 15 -p 1234

# For CPU-intensive background tasks, always use a higher nice value
nice -n 15 find / -name "*.log" > /dev/null 2>&1 &

Real-time Process Monitoring

The top and htop commands are your first line of defense. They show you exactly what’s eating your CPU:

# Basic top command with CPU-specific view
top -o %CPU

# Sort by CPU usage and show specific processes
ps aux --sort=-%cpu | head -n 5

Pro tip: In htop, press F6 (Linux: Esc → 6; Mac: hold fn + F6) and select PERCENT_CPU to sort by CPU usage. It’s more user-friendly than top and shows CPU cores individually. If you want to stick with top, press `1` to see a breakdown by CPU core.

Process Limits with cgroups

Control groups (cgroups) let you set hard limits on CPU usage. Here’s a practical example:

# Create a new cgroup and set CPU limits
sudo cgcreate -g cpu:/cpulimited
sudo cgset -r cpu.cfs_quota_us=50000 cpulimited  # 50% CPU limit
sudo cgset -r cpu.cfs_period_us=100000 cpulimited

# Run a process within this limited group
sudo cgexec -g cpu:cpulimited stress --cpu 2
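
The cgcreate/cgset/cgexec commands come from the older cgroup v1 tooling (the cgroup-tools package). On distributions that have moved to cgroup v2, a rough equivalent, assuming the cpu controller is enabled in the unified hierarchy and the stress tool is installed, looks like this:

# cgroup v2 equivalent: cpu.max takes "<quota> <period>" in microseconds
sudo mkdir -p /sys/fs/cgroup/cpulimited
echo "50000 100000" | sudo tee /sys/fs/cgroup/cpulimited/cpu.max  # 50% CPU limit

# Or let systemd handle the cgroup bookkeeping for a one-off command
sudo systemd-run --scope -p CPUQuota=50% stress --cpu 2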

Process Scheduling Best Practices

Here are two really cool tools and one trick. I suspect over 90% of DevOps/Ops/SREs out there don’t really know these tools. So this is show-off material for you!

#1 Use the batch command for CPU-intensive tasks:

batch << EOF
./heavy_computation.sh
EOF

#2 Set CPU affinity for critical processes:

# Pin a process to specific CPU cores
taskset -cp 0,1 1234  # Pins PID 1234 to cores 0 and 1

#3 Monitor and adjust process priorities dynamically:

#!/bin/bash
# Every 30 seconds, find the top CPU consumer and lower its priority if it exceeds 80%
while true; do
  pid=$(ps aux --sort=-%cpu | awk 'NR==2 {print $2}')
  if [ -n "$pid" ]; then
    cpu=$(ps -p "$pid" -o %cpu= | tr -d ' ')
    if [ "${cpu%.*}" -gt 80 ]; then
      renice 10 -p "$pid"
      logger "Adjusted priority for PID $pid (CPU: $cpu%)"
    fi
  fi
  sleep 30
done

Monitor these settings regularly and adjust based on your workload patterns. 

Remember: the goal isn’t to cripple high-CPU processes but to ensure fair resource distribution.

#2 Optimize System Resources to Prevent Linux CPU Overload

CPU Scheduling Optimization

The Linux kernel’s CPU scheduler is highly configurable. Here’s how to tune it (note that on newer kernels some of these tunables have moved out of sysctl into debugfs or been removed entirely, so check that they exist on your system first):

# Check current CPU scaling governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Set performance mode for all CPUs
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# View current sched_migration_cost_ns
sysctl kernel.sched_migration_cost_ns

Load Balancing Across Cores

Modern servers have multiple CPU cores, and proper load distribution is crucial:

# Check CPU core usage
mpstat -P ALL 1

# Enable IRQ balance for better interrupt distribution
systemctl enable --now irqbalance

# View interrupt distribution
cat /proc/interrupts

Here’s a quick script to check core balance:

#!/bin/bash
# Monitor core balance
while true; do
    echo "CPU Core Usage Distribution:"
    sar -P ALL 1 1 | grep -v Linux | grep -v Average | sort -k 2 -n
    sleep 5
done

Kernel Parameter Tuning

# Key kernel parameters for CPU optimization
sudo sysctl -w kernel.sched_autogroup_enabled=0
sudo sysctl -w kernel.sched_latency_ns=24000000
sudo sysctl -w kernel.sched_min_granularity_ns=3000000

# Make changes permanent
cat << EOF | sudo tee -a /etc/sysctl.conf
kernel.sched_autogroup_enabled=0
kernel.sched_latency_ns=24000000
kernel.sched_min_granularity_ns=3000000
EOF

CPU Throttling Management

Prevent thermal throttling from impacting performance. Yeah, that’s a thing.

# Check current CPU frequencies
cat /proc/cpuinfo | grep MHz

# Monitor for potential thermal throttling
# Note: frequencies also drop when the CPU is idle (power saving), so treat this as a rough signal
while true; do
    cur_freq=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
    max_freq=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq)
    if [ "$cur_freq" -lt "$max_freq" ]; then
        echo "Warning: CPU potentially throttled"
        echo "Current: $cur_freq"
        echo "Maximum: $max_freq"
    fi
    sleep 5
done

Resource Allocation Tips

#1 Use CPU sets for critical services:

# Create a dedicated CPU set for database services
cset shield -c 0-1 -k on
cset proc -m pid 1234 -t shield

This uses cpusets to dedicate CPU cores 0 and 1 exclusively to your critical process (PID 1234).

This ensures consistent performance for your most important services, especially useful for latency-sensitive applications like databases or real-time processing systems.
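
Note that cset ships in the cpuset package and isn’t available everywhere. If the workload is a systemd-managed service, a rough equivalent on newer systemd versions (v244+ with cgroup v2), using a placeholder service name, would be:

# Pin a systemd-managed service to cores 0 and 1
sudo systemctl set-property myservice.service AllowedCPUs=0-1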

#2 Implement CPU quotas:

# Set CPU quota for a service
systemctl set-property myservice.service CPUQuota=200%

This limits the service to using at most 200% of CPU resources – meaning it can use up to 2 full CPU cores’ worth of processing power.

Perfect for preventing a single service from consuming all available CPU resources during unexpected spikes. 

Example: if your web server suddenly gets hit with requests, it won’t be able to starve other critical systems.
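
To confirm the quota actually took effect, you can query the unit’s properties. A quick check, assuming the service name above (the exact property name may vary slightly across systemd versions):

# 200% shows up as 2s of CPU time per second of wall-clock time
systemctl show -p CPUQuotaPerSecUSec myservice.service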

#3 Monitor and adjust CPU shares:

# Adjust CPU shares for a cgroup
echo 2048 > /sys/fs/cgroup/cpu/mygroup/cpu.shares

This sets relative priority between process groups when competing for CPU time. 

The default is 1024, so 2048 means this group gets double the CPU time when there’s contention. 

Unlike hard limits, shares are proportional – they only matter when there’s competition for CPU resources. 

It’s like giving certain processes a “VIP pass” to get more CPU time when things get busy.
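
One caveat: cpu.shares is a cgroup v1 file. On cgroup v2 the equivalent knob is cpu.weight (default 100), and systemd exposes it as CPUWeight=. A rough equivalent of the 2x example above, assuming your group or service is named as shown:

# cgroup v2: weight defaults to 100, so 200 means roughly double the share under contention
echo 200 > /sys/fs/cgroup/mygroup/cpu.weight

# Or, for a systemd-managed service
sudo systemctl set-property myservice.service CPUWeight=200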

#3 Setting Up Monitoring & Alerting to Detect Linux CPU Overload

Raw CPU metrics without proper monitoring are like trying to predict weather patterns by looking at the sky. 

Essential CPU Metrics to Monitor

First, let’s understand what CPU metrics one might want to monitor. We can use vmstat for that, like this:

# Quick system metrics check
vmstat 1 5

The vmstat output surfaces some of the key metrics for CPU monitoring. Here’s the full list worth keeping an eye on (the commands sketched right after this list show where each metric comes from):

  • Load averages (1, 5, 15 minutes)
  • Per-core utilization
  • User time vs. system time
  • I/O wait
  • Context switches
  • Run queue length
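
Not everything above shows up in vmstat alone. Here’s a rough sketch of where each metric typically comes from, assuming the sysstat package is installed so mpstat is available:

# Load averages (1, 5, 15 minutes)
uptime
cat /proc/loadavg

# Per-core utilization, user vs. system time, I/O wait
mpstat -P ALL 1 1

# Run queue length (r) and context switches (cs)
vmstat 1 2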

Setting Up Basic Monitoring

Monitoring and alerting with shell scripts is… very 1990s. Do that if you like hacking around, but if you want to monitor production systems, please do yourself a favor and use a real server monitoring solution.

But OK, let’s go with a simple shell script first. It could be handy for playing around.

#!/bin/bash
# CPU monitoring with alerts
THRESHOLD=80
EMAIL="admin@yourdomain.com"

while true; do
    # Overall usage = 100 - idle; field positions can vary between top versions
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print int(100 - $8)}')

    if [ "$CPU_USAGE" -gt "$THRESHOLD" ]; then
        echo "High CPU Alert: ${CPU_USAGE}%" | \
        mail -s "CPU Usage Alert" "$EMAIL"

        # Collect diagnostic info
        top -bn1 > /tmp/cpu_spike.log
        ps aux --sort=-%cpu | head -10 >> /tmp/cpu_spike.log
    fi
    sleep 60
done

This script monitors your CPU usage every minute and springs into action when usage exceeds 80%, sending you an email alert along with detailed diagnostics of the top CPU-consuming processes.

Here’s another script.

This one monitors multiple metrics and uses them to decide whether an alert should be triggered. Specifically, the alerting condition is met when the load average exceeds your CPU core count or when I/O wait times get too high.

# Different thresholds for different metrics
LOAD_THRESHOLD=$(nproc)  # Number of CPU cores
IOWAIT_THRESHOLD=20
RUNQ_THRESHOLD=$(($(nproc) * 2))

# Check multiple conditions ([ ] can't compare floats, so use awk for the load average)
load=$(cut -d" " -f1 /proc/loadavg)
iowait=$(vmstat 1 2 | tail -1 | awk '{print $16}')  # "wa" column of the second (current) sample

if awk -v l="$load" -v t="$LOAD_THRESHOLD" 'BEGIN {exit !(l > t)}' || \
   [ "$iowait" -gt "$IOWAIT_THRESHOLD" ]; then
    # Alert logic here
    :
fi

Effective Alert Thresholds

So far we’ve mostly covered monitoring of CPU metrics (collecting and tracking them) and only touched on alerting. Based on 15+ years of running Sematext Cloud (an observability solution), here are some tips around CPU monitoring, alerting, and rightsizing:

#1 Use the right tool for the job 

Even though we’ve played around with shell scripts and the terminal here, don’t think you can use that approach in a serious production deployment without going insane. Just pick a good monitoring solution that you like, one that doesn’t break the bank, and set up your core infrastructure monitoring there. We’ve compared several solutions here.

#2 Know the metrics

CPU usage is really made up of several distinct metrics (user, system, nice, idle, iowait, irq, softirq, steal), and each of them means something specific. If you are not familiar with them, read XXXX now.

#3 CPU usage can be very spiky

This means that if you are not careful about how you set up CPU usage alerts you could end up triggering a ton of useless alerts. You don’t want that. You want to create alerts that are triggered only if high CPU usage persists for a long enough time that it signals that something is truly wrong. Short CPU bursts are OK.
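
If you’re still in shell-script land while experimenting, the usual trick is to require several consecutive samples above the threshold before firing an alert. A minimal sketch of that idea (the threshold and sample count below are arbitrary):

#!/bin/bash
# Alert only when CPU stays high for several checks in a row
THRESHOLD=85
CONSECUTIVE=5   # 5 checks x 60s = roughly 5 minutes of sustained load
count=0

while true; do
    usage=$(top -bn1 | grep "Cpu(s)" | awk '{print int(100 - $8)}')  # 100 - idle
    if [ "$usage" -gt "$THRESHOLD" ]; then
        count=$((count + 1))
    else
        count=0
    fi

    if [ "$count" -ge "$CONSECUTIVE" ]; then
        logger "Sustained high CPU: ${usage}% for $CONSECUTIVE consecutive checks"
        count=0
    fi
    sleep 60
done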

#4 Maxing out CPU usage can hurt performance

You (or anyone who cares about infrastructure costs, and maximizing its utilization) may be tempted to stuff applications (or Kubernetes pods) onto hosts to nearly max out their CPU usage. It feels like that’s how you maximize your investment. 

If you look at the CPU utilization of such hosts, it tends to sit close to 100% most of the time.

But beware that such hosts may not have enough CPU headroom.

This means that when CPU usage goes up, instead of a short spike, the CPU will remain maxed out for a longer period. This typically results in poor user experience and can lead to timeouts, even for services that are not user-facing. In such situations you will typically see a correlation with system load, which reflects how many processes are running on or waiting for the CPU.
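
A quick way to eyeball that relationship is to compare the load average with the number of cores: load consistently above the core count usually means processes are queueing for CPU. A minimal sketch:

# Rough headroom check: 1-minute load average per core
cores=$(nproc)
load=$(cut -d ' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN {printf "Load per core: %.2f\n", l/c}'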

Capacity Planning Through Monitoring

Use collected data for future planning:

# Collect CPU usage once a minute for a day (sar -u <interval> <count>)
sar -u 60 1440 > cpu_usage_$(date +%Y%m%d).log

# Generate daily CPU usage report (usage = 100 - %idle, the last sar column)
awk '
    BEGIN {printf "Hour\tCPU Usage\n"}
    /^[0-9]/ && $NF ~ /^[0-9.]+$/ {
        split($1, t, ":");
        cpu[t[1]] += 100 - $NF;
        count[t[1]]++
    }
    END {
        for (h in cpu)
            printf "%d\t%.2f\n", h, cpu[h]/count[h]
    }
' cpu_usage_$(date +%Y%m%d).log | sort -n

This script collects detailed CPU usage data throughout the day and generates an hourly usage report, helping you identify usage patterns and plan capacity upgrades before you hit performance bottlenecks.

#4 Implementing Automated Resource Controls to Prevent Linux CPU Overload

Resource automation is your 24/7 CPU bouncer. It watches the door, maintains order, and kicks out troublemakers before they can cause chaos.

Using systemd Resource Controls

Systemd resource controls set hard limits for your application: it can use at most 150% CPU (1.5 cores), gets a lower CPU weight (50, versus the default of 100) when competing for CPU time, is capped at 2GB of memory, and can’t spawn more than 100 tasks.

First, let’s set up systematic resource management:

# Create a service with resource limits
cat << EOF > /etc/systemd/system/myapp.service
[Unit]
Description=My Application

[Service]
ExecStart=/usr/bin/myapp
CPUQuota=150%
CPUWeight=50
MemoryLimit=2G
TasksMax=100

[Install]
WantedBy=multi-user.target
EOF

# Reload and restart
systemctl daemon-reload
systemctl restart myapp

Automated Process Control

The script below continuously monitors processes and, when it finds one using more than 90% CPU, reduces its priority (increases its nice value) unless it’s a parent process, preventing a single process from hogging the CPU.

Here’s a script that automatically manages runaway processes:

#!/bin/bash
# Auto-control high CPU processes
THRESHOLD=90
NICE_LEVEL=15

while true; do
    # Get the highest CPU-consuming process (skip the ps header line)
    line=$(ps aux --sort=-%cpu | awk 'NR==2')
    PID=$(echo "$line" | awk '{print $2}')
    CPU=$(echo "$line" | awk '{print $3}')

    if [ "$(echo "$CPU > $THRESHOLD" | bc)" -eq 1 ]; then
        # Only renice leaf processes (skip anything that has children)
        if ! pgrep -P "$PID" > /dev/null; then
            renice "$NICE_LEVEL" -p "$PID"
            logger "Adjusted priority of PID $PID (CPU: $CPU%)"
        fi
    fi
    sleep 30
done

Dynamic Resource Allocation

The script below watches system load and automatically adjusts CPU quotas for a service: when the system is under heavy load, it reduces the quota to 80%, and when the load decreases, it allows up to 150% CPU usage.

Implement adaptive resource management:

#!/bin/bash
# Dynamic CPU quota adjustment
adjust_quota() {
    service=$1
    # 1-minute load average, truncated to an integer
    current_load=$(cut -d ' ' -f1 /proc/loadavg | cut -d. -f1)
    cores=$(nproc)

    if [ "$current_load" -gt "$cores" ]; then
        systemctl set-property "$service" CPUQuota=80%
    else
        systemctl set-property "$service" CPUQuota=150%
    fi
}

# Monitor and adjust every 5 minutes
while true; do
    adjust_quota myapp.service
    sleep 300
done

Automated Scaling Rules

The script below monitors the system load average and triggers scaling actions when the load exceeds 75% of the available CPU cores. It’s useful for container environments or process managers like PM2.

Set up rules for automatic scaling decisions:

#!/bin/bash
# Auto-scaling trigger script
LOAD_THRESHOLD=$(( $(nproc) * 75 / 100 ))  # 75% of CPU cores

check_load() {
    load=$(cut -d ' ' -f1 /proc/loadavg)
    if [ $(echo "$load > $LOAD_THRESHOLD" | bc) -eq 1 ]; then
        # Trigger scaling action
        logger "High load detected: $load - triggering scale up"
        # Your scaling logic here (e.g., K8s scale, PM2 scale)
    fi
}

# Run check every minute
while true; do
    check_load
    sleep 60
done

Emergency Response Automation

The script below acts as a last-resort defense: when CPU usage hits 95%, it identifies the top CPU consumers and temporarily suspends them if they are not critical system processes (like systemd or kernel threads).

Create an automated emergency response system:

#!/bin/bash
# Emergency CPU relief script
CRITICAL_THRESHOLD=95

emergency_response() {
    # Log the event
    logger "CRITICAL: CPU Usage exceeded $CRITICAL_THRESHOLD%"

    # Find top 3 CPU-consuming processes (skip the ps header line)
    top_processes=$(ps aux --sort=-%cpu | head -4 | tail -3)

    # Take action on non-critical processes
    echo "$top_processes" | while read -r line; do
        pid=$(echo "$line" | awk '{print $2}')
        name=$(echo "$line" | awk '{print $11}')

        # Skip kernel threads (shown in brackets) and core system processes
        if ! echo "$name" | grep -qE "^\[|(^|/)(systemd|init)$"; then
            kill -SIGSTOP "$pid"
            logger "Suspended process $name (PID: $pid) due to critical CPU usage"
        fi
    done
}

# Monitor and respond (overall usage = 100 - idle; field positions vary across top versions)
while true; do
    cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print int(100 - $8)}')
    if [ "$cpu_usage" -gt "$CRITICAL_THRESHOLD" ]; then
        emergency_response
    fi
    sleep 10
done

#5 Regular System Maintenance and Updates to Prevent Linux CPU Overload

Even the best-tuned system needs regular maintenance. Think of it like servicing your car – skip it, and things will eventually break down.

Update Management

Here’s a smart way to handle system updates that won’t impact your CPU performance:

#!/bin/bash
# Schedule updates during low-usage periods
LOAD_THRESHOLD=2.0

current_load=$(cut -d ' ' -f1 /proc/loadavg)

# [ ] can't compare floats, so use awk for the comparison
if awk -v l="$current_load" -v t="$LOAD_THRESHOLD" 'BEGIN {exit !(l < t)}'; then
    apt-get update && apt-get upgrade -y
    logger "System updates completed successfully"
else
    logger "Updates deferred due to high system load"
fi

This smart update script only kicks in when your system isn’t busy. It checks the current load and either proceeds with updates or backs off, preventing those dreaded update-related slowdowns.

Service Optimization

#!/bin/bash
# Service audit and optimization
for service in $(systemctl list-units --type=service --state=active --no-legend | awk '{print $1}'); do
    main_pid=$(systemctl show -p MainPID "$service" | cut -d= -f2)
    # Skip services without a main process
    [ "$main_pid" -eq 0 ] && continue
    cpu_usage=$(ps -p "$main_pid" -o %cpu= | tr -d ' ')
    if [ -n "$cpu_usage" ] && [ "${cpu_usage%.*}" -gt 20 ]; then
        echo "High CPU service detected: $service ($cpu_usage%)"
        systemctl status "$service" >> /var/log/service-audit.log
    fi
done

This script monitors your active services and flags the hungry ones. It’s like having a security camera for your CPU usage, catching resource hogs in the act.

Resource Usage Analysis

#!/bin/bash
# Resource trending script: sample CPU once a second for an hour
sar -u 1 3600 > /var/log/cpu_trending_$(date +%Y%m%d).log

# Flag samples where usage (100 - %idle, the last sar column) exceeded 80%
awk '!/^$/ && !/^Linux/ && !/^Average/ && $NF ~ /^[0-9.]+$/ {
    usage = 100 - $NF
    if (usage > 80) print $1, "CPU Usage:", usage "%"
}' /var/log/cpu_trending_* > /var/log/high_cpu_incidents.log

Think of this as your system’s black box recorder. 

It tracks CPU patterns and creates a historical record of high-usage incidents, giving you the data needed to prevent future problems.

Automated Maintenance Schedule

#!/bin/bash
# Smart maintenance scheduler
HOUR=$(date +%H)
LOAD=$(cut -d ' ' -f1 /proc/loadavg | cut -d. -f1)

if [ "$HOUR" -ge 2 ] && [ "$HOUR" -le 4 ] && [ "$LOAD" -lt 5 ]; then
    # Run maintenance tasks
    echo "Starting maintenance at $(date)"

    # Rotate logs
    logrotate -f /etc/logrotate.conf

    # Clear page cache if memory usage is above 80%
    mem_used=$(free | awk '/Mem/ {printf "%d", $3/$2 * 100}')
    if [ "$mem_used" -gt 80 ]; then
        sync
        echo 1 > /proc/sys/vm/drop_caches
    fi

    logger "Maintenance completed successfully"
fi

This night owl script runs your maintenance when everyone else is asleep. It checks both the time and system load to ensure it won’t interfere with critical operations.

Keeping Your Linux Servers Healthy: The Path Forward

Managing CPU resources is about creating a sustainable, proactive approach to system health. 

Each technique we’ve covered, from process management to automated maintenance, forms part of a comprehensive defense against CPU overload.

Remember: Prevention will always be less stressful (and less expensive) than firefighting. It may feel expensive because you need to spend time setting up systems, but you’re really just paying the cost upfront.

In the long run, that ends up cheaper than having no systems in place and reacting to problems as they arise over days, weeks, months, or years.

While the scripts and configurations we’ve discussed provide a solid foundation, consider implementing a robust monitoring solution like Sematext Infrastructure Monitoring to get deeper insights into your system’s behavior patterns. 

Sematext can help you establish these monitoring patterns with real-time CPU metrics, custom alerting thresholds, and historical data analysis that shows exactly when and how your CPU usage patterns change. 

This insight is crucial for making informed decisions about resource allocation and capacity planning.

Most importantly, remember that every system is unique. These strategies aren’t one-size-fits-all solutions – they’re starting points. Adapt them to your specific needs, monitor their effectiveness, and adjust as your system evolves.
