How to Extract Numerical Data from a Web Page for Dashboarding and Alerting

Over the years, first as a software engineer and now as a product manager, I’ve repeatedly needed to extract numerical data from a page on a periodic basis and turn it into visualizations, typically line charts that help me see trends over time. For example, I wanted to extract product prices and monitor them over time. Or I wanted to query a search engine periodically and extract the number of matches, or the position of a specific page, for SEO purposes. So I’d hack together scripts to fetch pages, parse them, store the extracted numerical data in a file, and then turn that data into charts. Here are some more use cases that all come down to this same need:

Then there are related use cases that are more about monitoring performance of web pages and websites:

All these use cases can be handled with more or less the same approach, because they all need the same pieces of functionality:

  1. Something that runs periodically, like a cronjob
  2. A mechanism to fetch a web page or make a call to an HTTP API (aka REST API or JSON API)
  3. The ability to parse the response to such requests and extract numerical data from it
  4. Charting and dashboarding capability to turn the collected data into a visual representation
  5. The ability to create alert rules with conditions and notification mechanisms like email, text/SMS, Slack, etc.
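
To make these pieces concrete, here is a minimal sketch of steps 2, 3, and 5 as a hand-rolled Python script, using only the standard library. It is an illustration under assumptions, not a recommended setup: the price pattern, the CSV layout, and the function names are all hypothetical, and scheduling (step 1) and charting (step 4) are left to whatever runs the script and reads the file.

```python
import csv
import re
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Hypothetical pattern: matches a dollar price such as "$1,299.99" in the response text.
PRICE_PATTERN = re.compile(r"\$\s*([0-9][0-9,]*(?:\.[0-9]+)?)")


def fetch(url: str, timeout: float = 10.0) -> str:
    """Step 2: fetch a web page or HTTP API response as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_price(text: str) -> Optional[float]:
    """Step 3: return the first price-like number in the text, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


def record(path: str, value: float) -> None:
    """Append a timestamped data point to a CSV file so it can be charted later."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), value])


def should_alert(value: float, threshold: float) -> bool:
    """Step 5: a trivial alert condition, true when the value drops below the threshold."""
    return value < threshold
```

A cron job would call `fetch`, pass the result to `extract_price`, and then call `record` and `should_alert` on each run. Sending the actual notification is not shown.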

The old me would install a bunch of open source tools together on some server, then write scripts to curl, parse the response, stick the script(s) in a cronjob, etc. I could still do that, but that approach feels like a hack to me now. Times have changed and there are easier ways. In this article, I’ll show you how I used a synthetic monitoring tool – specifically Sematext Synthetics – to handle several use cases listed above.

What is Synthetic Monitoring?

The primary synthetic monitoring use case is monitoring the performance of websites or APIs.

When you monitor website or API performance with a synthetic monitoring tool, you typically get several metrics out of the box, such as various Core Web Vitals, page response times, availability, and more.

Conveniently, synthetic monitoring tools tend to provide exactly what we need:

  1. They are designed to test the website or API periodically, so they act a little like a cronjob, but without you needing to have access to any server to run that cronjob.
  2. Because they test websites and APIs, they obviously can fetch their content.
  3. Not all synthetic monitoring solutions let you parse out numerical data, but we’ll use Sematext Synthetics, which has this functionality (see XXXX documentation), and of course, you can take this extracted data and create dashboards with charts.
  4. Finally, alerting is table stakes for monitoring tools, and typically they integrate with multiple notification mechanisms.

Tips

Here are some best-practice tips that apply to use cases like the ones described above. Follow them to get the most out of synthetic monitoring while keeping your costs minimal.

  1. Use a single location. When monitoring websites and APIs, you often want to do that from multiple locations so you can test performance from different geographical locations or different parts of the internet. For the use cases described here, where the goal is extracting data rather than measuring performance, you really only need one location.
  2. Use a long interval. When monitoring performance you typically want to be notified of performance degradations ASAP. However, when the goal is visualizing trends over longer periods of time, you typically don’t need to collect data frequently. So use the longest reasonable interval for running the monitor.
  3. Use the appropriate monitor. If you are extracting data from an API that returns JSON or XML, use the HTTP monitor. If you are extracting data from a web page that returns HTML or if you are looking to collect a performance metric from a web browser API then, of course, use the Browser monitor.
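
The difference between the two kinds of responses in tip 3 can be sketched in plain Python. This is not Sematext's monitor API, only an illustration of why JSON and HTML call for different extraction logic. The JSON key path, the element id, and the function names are hypothetical.

```python
import json
import re
from html.parser import HTMLParser
from typing import Optional, Sequence

NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def number_from_json(body: str, path: Sequence[str]) -> float:
    """Follow a sequence of keys into a parsed JSON document and return the number there."""
    node = json.loads(body)
    for key in path:
        node = node[key]
    return float(node)


class ElementText(HTMLParser):
    """Collects the text inside the first element whose id equals target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def number_from_html(body: str, element_id: str) -> Optional[float]:
    """Return the first number in the text of the element with the given id, if any."""
    parser = ElementText(element_id)
    parser.feed(body)
    match = NUMBER_PATTERN.search("".join(parser.chunks))
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))
```

With JSON, the number sits at a known key path. With HTML, it has to be located inside markup and cleaned up first, and pages that render their content with JavaScript need a real browser, which this sketch does not cover.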

Summary

Who says synthetic monitoring tools have to be used only for monitoring performance? Think of them as a friendly cronjob running in the cloud. And because it’s all in the cloud, everything listed above can be done via the UI. There is nothing to install, update, upgrade, patch, or manage, and, perhaps best of all, it’s all very affordable!
