No matter how well-designed, flashy, or useful your application is for your target users, they may not take kindly to it being slow or, even worse, crashing once in a while. You will lose customers and revenue as a result.
The solution is definitely not to add additional features to the application to bring back users. Instead, it’s as simple as paying close attention to the health of the servers where your application is hosted. More specifically, you need to take a closer look at server performance metrics.
Server performance metrics give you a window into your server’s health, helping you identify and prevent anomalies. In this post, we cover the key metrics worth monitoring to give you crucial insights into server performance.
What Are Server Performance Metrics?
Server metrics are measurements that provide insights into the health, efficiency, and performance of your processes when the system responds to user requests. They help you better understand what’s happening on the server hosting your application and troubleshoot issues that could be affecting application performance and user experience. Also known as server-side metrics, they help you drill down and identify the root cause of hardware or software issues so that you can prevent them going forward.
For example, continuous alerts or flags related to system capacity, if left unaddressed, could later lead to an application crash during peak hours, affecting your SLAs. Monitoring and tracking the associated system metrics helps you take remedial action proactively.
Some of the key areas to consider for monitoring server metrics are system capacity, availability, load tolerance, and application health. If server-side metrics are underperforming, the ripple effects will be felt by your users as well. However, what happens on the client side is also a crucial factor to consider.
Difference Between Client-Side and Server-Side Performance Metrics
Clients might be accessing your application from different platforms—mobile phones, laptops, tablets—and from different browsers/client-side apps. Client-side metrics give you end-to-end performance visibility into your application: interaction speed, location and connection demographics, and performance on different devices/browsers/apps. Together, these indicate the overall user experience.
Let’s look at a few client-side metrics to keep on your radar to ensure optimal UX.
Page Load Time
Page load time measures how long it takes for your application page to load on the client side. It is calculated as the time from the instant you click on the application link until the page is completely loaded. The lower the load time, the higher customer satisfaction will be. A lower load time will also help boost your app’s search engine ranking.
CPU Idle Time
CPU idle time measures how long the CPU remains idle while waiting for a response from a third party. Improving this metric also plays an important role in improving your SEO score and user experience.
Render time measures how long it takes for an app to be ready for consumers to interact with it. When there’s a host of external APIs and elements in your application, the rendering time of each can vary. Unless all aspects of your application are rendered within a reasonable time, user experience will be impacted.
A related metric tracks how long it takes for a page’s content to become visible. It should be in the range of a few milliseconds to ensure user engagement.
Most Important Server Performance Metrics
While client-side metrics are important, the root of many issues might actually originate in the server, from which your application is delivered. Let’s take a look at what server-side metrics you should monitor for server and application health and performance.
Server Availability and Uptime Metrics
Your servers need to be up and running to serve your app, so getting notified about availability ahead of time can help you save the day before things go wrong.
Server uptime refers to how long the server that hosts your application is up and functioning. Poor server uptime leads to a bad customer experience. Although 100% uptime is the desired target for this metric, it is often unachievable in real-world scenarios. Any uptime above 99% is acceptable; you should be able to push this to 99.5% or 99.9% depending on how well-maintained your servers are.
Even if the servers hosting your app are up, there could be scenarios where the service running on them is not available. Hence, you also need to monitor the availability of both your application and its functionalities. The rule of thumb for ensuring high availability is to eliminate single points of failure. Applications designed for high availability, with load distributed across multiple servers, can help achieve uptime closer to 99.9%.
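To put the uptime targets above in perspective, it can help to convert a percentage into the amount of downtime it actually allows. The snippet below is a minimal sketch of that arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
# Convert an uptime target (in percent) into the downtime it allows.
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime an uptime target permits over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


for target in (99.0, 99.5, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99% uptime still permits roughly 87.6 hours of downtime per year, while 99.9% permits under 9 hours.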
The error rate is a metric that shows you the percentage of failed requests out of the total number of requests received by your application. A failed request may return either an internal server code or an HTTP error code. When customers receive an HTTP error code instead of the expected response from an application, it impacts their engagement.
An error code can give critical insights into application malfunctions to help you improve performance and UX. For example, HTTP 5xx codes are server-side errors that could indicate underlying configuration issues with your application or server, while a 404 means a requested resource could not be found, which may point to broken links or routing problems. Either can be detrimental to the user experience.
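The error rate described above can be computed directly from the HTTP status codes of recent requests. The following is a rough sketch, assuming that any status of 400 or above counts as a failed request (some teams count only 5xx responses); the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def error_rate(status_codes: Iterable[int]) -> float:
    """Percentage of requests whose HTTP status is 400 or above."""
    codes: List[int] = list(status_codes)
    if not codes:
        return 0.0
    failed = sum(1 for code in codes if code >= 400)
    return 100 * failed / len(codes)


def errors_by_class(status_codes: Iterable[int]) -> Counter:
    """Group failed requests into classes such as '4xx' and '5xx'."""
    return Counter(f"{code // 100}xx" for code in status_codes if code >= 400)


# Made-up sample: ten requests, three of which failed.
sample = [200, 200, 404, 500, 200, 503, 200, 200, 200, 200]
print(error_rate(sample))
print(errors_by_class(sample))
```

Splitting failures by class helps separate client-side problems (4xx) from server-side ones (5xx) when deciding where to investigate first.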
Server Capacity Metrics
While ensuring your server’s availability and uptime might make your application more dependable, if the server runs out of capacity to service requests, your users may still have a negative experience. Here are the server metrics you should monitor to avoid such issues.
Requests per Second
Requests per second, also known as throughput, is the number of requests handled by your server during a specific period, providing a good idea of how busy it is. A high number of requests per second can lead to a server crash if the server does not have enough capacity to serve the load.
Implementing a scaling strategy based on the insights provided by this metric can save the day. Requests per second should be read in conjunction with the average time taken by the server to respond to requests before deciding on a scaling plan.
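As an illustration of how requests per second might be derived, the sketch below buckets request timestamps into one-second windows and reports the peak and average rate. It assumes the timestamps are Unix times in seconds taken from an access log; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def requests_per_second(timestamps: Iterable[float]) -> Counter:
    """Count requests in each one-second bucket (timestamps in Unix seconds)."""
    return Counter(int(ts) for ts in timestamps)


def peak_and_average_rps(timestamps: Iterable[float]) -> Tuple[int, float]:
    """Return (peak requests in any second, average requests per second)."""
    buckets = requests_per_second(timestamps)
    if not buckets:
        return 0, 0.0
    # Include idle seconds between the first and last request in the average.
    span_seconds = max(buckets) - min(buckets) + 1
    return max(buckets.values()), sum(buckets.values()) / span_seconds


# Made-up sample: six requests spread over four seconds.
print(peak_and_average_rps([0.1, 0.5, 0.9, 1.2, 3.4, 3.5]))
```

Comparing the peak against the average gives a first hint of how bursty the traffic is, which matters when sizing capacity for a scaling plan.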
“Data in” measures the size of the request payload received by the server, which should be kept small for better performance. Larger payloads combined with a high number of requests could result in the application sending more data than the server is capable of handling.
“Data out” is the size of the response payload sent back to users from the server. The larger the size here, the more time it will take for the application to load for users. It may also put strain on your server’s network infrastructure, as well as the network infrastructure of your users. Keeping track of your data out will help keep your responses light and ensure faster load times for the application.
Application Server Monitoring Metrics
Application server monitoring metrics can provide crucial indicators of how well your application is operating.
Average Response Time
Average response time indicates the average amount of time a server takes to process a request from the application. It is generally recommended to keep response times under a second to ensure user engagement. A high average response time could indicate that some components in the server are operating at suboptimal levels and need remediation. Also, keep in mind that, because the metric is an average, outliers over the examined period can skew the number.
Peak Response Time
Peak response time is the longest response time during a given period. The value should be read in conjunction with average response time. If the peak is significantly higher than the average for a specific type of request, it may indicate a performance bottleneck, although it could also be a transient issue. Consistently high values for both, on the other hand, can indicate underlying server performance concerns. Keep in mind that average response time gives you a general overview, while peak response time can lead you to the root cause of the issue.
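To see why the average and the peak should be read together, consider the small sketch below. It assumes response times have been collected in milliseconds for one type of request; the function name and the sample values are made up for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_response_times(durations_ms: Sequence[float]) -> Dict[str, float]:
    """Return the average, the peak, and the peak-to-average ratio."""
    if not durations_ms:
        raise ValueError("at least one response time is required")
    average = mean(durations_ms)
    peak = max(durations_ms)
    ratio = peak / average if average else float("inf")
    return {"average_ms": average, "peak_ms": peak, "peak_to_average": ratio}


# Four fast responses and one slow outlier.
print(summarize_response_times([120, 130, 110, 140, 1500]))
```

In this sample, a single 1,500 ms response pulls the average up to 400 ms even though most requests finished in under 150 ms, which is the kind of gap worth investigating.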
Server Load Sharing Metrics
These metrics provide insights into how well application load is being managed by your backend servers. This is especially significant in high-availability architectures, where multiple servers behind a load balancer serve user requests. Below are the key metrics to consider here.
Thread count is the total number of requests received at a given point in time. It gives you valuable information about how a server is handling load. Once the maximum threshold is reached, further requests are held until the server has capacity to process them. If the deferred requests wait too long, they will time out. The number of threads per process that a single server can handle is often capped, which can lead to errors if the cap is exceeded. Hence, it is important to monitor the thread count and, where needed, distribute requests across additional servers behind a load balancer.
Latency tracks the time it takes for a user to receive a response after sending a connection request to the server. Latency is one of the crucial metrics that can make or break user experience. For load-balanced servers, latency should be measured as the time taken for the response to be sent back from the load balancer. Optimizing your network path from the load balancer to the backend servers can also help improve latency.
A host can be deemed unhealthy if it is incapable of serving the application to users due to underlying hardware or software stack errors. If any of the servers serving user requests from behind a load balancer is unhealthy, it could impact the capacity as well as performance of your application. Monitoring hosts to identify unhealthy ones is a proactive measure to mitigate such issues. Healthy/unhealthy host metrics also help improve application availability and latency.
A well-designed architecture ensures that your application is still accessible—even if a few servers serving the application from behind the load balancer are faulty. This is measured by fault tolerance metrics and aids in determining the load balancer’s ability to efficiently handle load. Some of the common fault tolerance metrics include mean time to detect (MTTD), mean time to failure (MTTF), mean time to repair (MTTR), and mean time between system incidents (MTBSI). When tracked and measured, these metrics give insights into how resilient your application is. For example, MTTR tells you how long it typically takes to get a faulty server back up and running.
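Two of the fault tolerance metrics listed above, MTTD and MTTR, can be derived from basic incident records. The sketch below assumes each incident records when the fault started, when it was detected, and when it was resolved, and it measures repair time from detection to resolution (definitions vary between teams). The `Incident` class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring flagged it
    resolved: datetime  # when the server was back in service


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    """Average delay between a fault starting and being detected."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_repair(incidents: Sequence[Incident]) -> timedelta:
    """Average time from detection to resolution (one common definition)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.resolved - i.detected for i in incidents), timedelta())
    return total / len(incidents)


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 35)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15),
             datetime(2024, 1, 2, 15, 5)),
]
print(mean_time_to_detect(incidents), mean_time_to_repair(incidents))
```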
Throughput measures the number of requests successfully served by the load balancer without any errors per unit time. It is a good indicator of the effectiveness of the load balancer in serving user requests in a timely manner. Lower throughput could indicate issues with the load balancer or with the servers to which the requests are being redirected.
Migration time measures how long it takes for requests to be migrated across servers behind a load balancer. It helps improve load balancing efficiency so that users are always served requests from healthy servers.
Server reliability metrics give you a holistic view of the availability and performance consistency of your servers over time. These should be measured at both the server and load balancer level in distributed architectures. To mitigate overall reliability issues, you need to address any identified points of failure or constraints.
System-Level Performance Metrics
Server performance monitoring metrics give you ground-zero data on your server health. Along with server availability monitoring and metrics, you should also include server utilization metrics and OS logs in your list of server performance metrics to monitor.
OS logs contain information about any errors happening in the environment that need to be addressed. You could further create alerts based on critical OS error codes to flag issues promptly. Analyzing these logs helps, as it can be difficult to determine what is being written or updated on the server operating systems with so many application-related tasks running concurrently.
Hardware utilization is measured by some basic but important metrics, such as CPU and memory utilization, disk I/O, and disk usage, that play an important role in server performance. Any of these resources can become a bottleneck, so all of them should be monitored to ensure comprehensive performance coverage.
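As a small illustration of utilization checks, the sketch below reads disk usage with Python’s standard library and compares a set of readings against alert thresholds. The threshold values and the reading names are assumptions; CPU and memory readings would typically come from a monitoring agent or a third-party library and are supplied here as plain numbers.

```python
import shutil
from typing import Dict, List


def disk_usage_percent(path: str = ".") -> float:
    """Percentage of the disk holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def breached_thresholds(readings: Dict[str, float],
                        limits: Dict[str, float]) -> List[str]:
    """Return the names of readings that meet or exceed their limit."""
    return sorted(
        name for name, value in readings.items()
        if name in limits and value >= limits[name]
    )


# Hypothetical limits and readings; only the disk value is measured live.
limits = {"cpu_percent": 85, "memory_percent": 90, "disk_percent": 80}
readings = {
    "cpu_percent": 92.0,
    "memory_percent": 61.5,
    "disk_percent": disk_usage_percent("."),
}
print(breached_thresholds(readings, limits))
```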
Security-related metrics also play a key role in ensuring application performance, as any infiltrations or attacks can slow down your systems and have far-reaching consequences. Below are the server metrics you should measure.
Unauthorized access metrics help detect infiltrations in your organization that indicate compromised security. They help you keep track of any suspicious user or admin activity that could impact application performance, including sensitive file access, system configuration changes, and file modifications, to name a few.
Data breach metrics indicate the removal or theft of data from your servers by unauthorized personnel or services. They also cover data exfiltration by insiders. You should closely monitor any data movement patterns outside your normal security perimeters to identify a data breach. Data breach metrics can be broken down by the type of cyberattack involved, such as malware, phishing, or ransomware.
Intrusion detection metrics focus on the server network and help identify anomalous traffic patterns. For example, repeated attempts to access vulnerable ports in your server could indicate an attack vector. Detection can be either anomaly-based (flagging deviations from normal traffic patterns) or signature-based (matching known attack signatures). Alerts generated can notify administrators about active attacks in real time.
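The idea of flagging repeated access attempts can be sketched with a simple threshold rule. The example below assumes connection events are available as (source IP, port, allowed) tuples and flags any source with too many denied attempts; the event format, the threshold, and the function name are hypothetical, and real intrusion detection systems are considerably more sophisticated.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int, bool]  # (source_ip, port, allowed)


def suspicious_sources(events: Iterable[Event], threshold: int = 5) -> Dict[str, int]:
    """Return sources whose denied connection attempts reach the threshold."""
    denied = Counter(ip for ip, _port, allowed in events if not allowed)
    return {ip: count for ip, count in denied.items() if count >= threshold}


# Made-up events using documentation-reserved IP addresses.
events = (
    [("203.0.113.7", 22, False)] * 6      # repeated denied SSH attempts
    + [("198.51.100.4", 443, True)] * 10  # normal allowed traffic
    + [("198.51.100.9", 23, False)] * 2   # a couple of denied attempts
)
print(suspicious_sources(events))
```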
Vulnerability and Malware Detection
Attackers can target common application and server vulnerabilities to gain access to your environment. This could mean exploiting unprotected server ports, gaining backdoor entry via unsecure services, or even executing application-level attacks like SQL injection. Metrics based on vulnerability detection will help flag such issues in a timely manner.
Malware can wreak havoc on your server by encrypting data or collecting sensitive information without being discovered. You can implement signature-based or heuristic analysis for malware detection, with metrics helping you identify these issues before the attack vector spreads and impacts server performance.
Log Pattern Monitoring
Servers generate logs based on different application-related activities executed on them. The logs can either be natively generated by the OS or custom logs generated by application instrumentation. In both cases, monitoring log patterns to identify anomalies can provide early indications of underlying server issues.
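A very simple form of log pattern monitoring is to count log lines by severity and watch the share of errors. The sketch below assumes each log line contains one of the level keywords ERROR, WARN, INFO, or DEBUG; the log format and the function names are hypothetical and would need adapting to your own logs.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: each line carries one of these level keywords as a whole word.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN|INFO|DEBUG)\b")


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines by the first severity keyword found on each line."""
    counts: Counter = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def error_share(lines: Iterable[str]) -> float:
    """Fraction of recognized log lines that are errors."""
    counts = count_levels(lines)
    total = sum(counts.values())
    return counts["ERROR"] / total if total else 0.0


sample_log = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:05 ERROR disk write failed",
    "2024-01-01 10:00:06 ERROR disk write failed",
    "2024-01-01 10:00:09 WARN retrying write",
]
print(count_levels(sample_log), error_share(sample_log))
```

Tracking this share over time, rather than as a single snapshot, is what makes a sudden rise in errors stand out as an anomaly.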
How Do We Measure Server Performance Metrics?
Server performance metrics provide much-needed information on how your server resources are being utilized by your applications and associated services. Server monitoring tools enable this process by monitoring these standard server metrics, giving you insights via reports and visualizations.
Your servers could be deployed across hybrid environments including on-premises and multiple cloud platforms, so it’s important to identify the right monitoring tool that can track key server performance metrics irrespective of where they are hosted.
Server Metrics Monitoring with Sematext
Monitoring a large-scale infrastructure with multiple servers can be an overwhelming process. It takes careful deliberation to first determine what server metrics are worth monitoring for your environment. The next step is to zero in on a tool that can support those server metrics along with other metrics, logs, and events relevant to your application.
Sematext Monitoring offers a comprehensive monitoring solution for your servers and application stacks deployed across multiple cloud platforms and servers. It provides one tool for you to track important server performance as well as application-specific metrics. The data gathered is then summarized in simple dashboards that help you get a better understanding of your server’s overall performance.
Monitoring all of your components using a single tool cuts down on the time required to correlate and troubleshoot problems, allowing faster resolution.
Ehab has extensive experience in software engineering and technical leadership roles for over ten years. His main interests involve large-scale back-end development, microservices architecture, cloud infrastructures/DevOps, distributed systems, data engineering, technical writing, and people management. Ehab holds a master’s degree in computer science from the University of Bonn, Germany and he is currently leading the R&D team at Alma Health (UAE-based healthcare startup).