OpenSearch Monitoring Integration
Integration
- Instructions: https://apps.sematext.com/ui/howto/opensearch/overview
Important Metrics to Watch and Alert on
OpenSearch specific metrics
Search Query performance metrics: Request Rate and Latency
When the cluster receives a search request, it may need to read data from multiple shards across multiple nodes. Tracking the rate at which requests are processed and returned, the number of requests currently in progress, and how long requests take shows whether the OpenSearch cluster is keeping up with its search load.
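
As a rough illustration of how rate and latency fall out of cumulative counters, the sketch below takes two readings of `query.count.total` and `query.latency.time.total` (both listed in the Metrics table) and derives queries per second and average latency for the interval. The `SearchSample` class and the function name are hypothetical helpers for this example, not part of any OpenSearch or Sematext API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SearchSample:
    """One reading of the cumulative search counters (hypothetical helper)."""
    timestamp_s: float   # time the sample was taken, in seconds
    query_count: int     # cumulative value of query.count.total
    query_time_ms: int   # cumulative value of query.latency.time.total


def search_rate_and_latency(
    prev: SearchSample, curr: SearchSample
) -> Optional[Tuple[float, float]]:
    """Return (queries per second, average latency in ms) between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    queries = curr.query_count - prev.query_count
    if elapsed <= 0 or queries < 0:
        # Out-of-order samples or a counter reset (for example after a node
        # restart): no meaningful rate can be computed for this interval.
        return None
    rate = queries / elapsed
    avg_latency = (curr.query_time_ms - prev.query_time_ms) / queries if queries else 0.0
    return rate, avg_latency


prev = SearchSample(timestamp_s=0.0, query_count=1000, query_time_ms=50000)
curr = SearchSample(timestamp_s=10.0, query_count=1200, query_time_ms=53000)
print(search_rate_and_latency(prev, curr))  # (20.0, 15.0)
```

Working on deltas rather than raw totals keeps the result tied to the sampling interval, so a latency spike shows up immediately instead of being averaged away over the node's lifetime.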
Indexing Rate and Merge Times
Monitoring the OpenSearch document indexing rate and merge time helps identify anomalies before they begin to affect the performance of the cluster. Reading these metrics alongside the health of each node can point to potential problems within the system or to opportunities to optimize performance.
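
A minimal sketch of one way to read the merge counters, assuming two snapshots stored as plain dictionaries keyed by the metric keys from the Metrics table (`indexing.merges.total`, `indexing.merges.time.total`, `indexing.merges.throttled.time.total`). The function name and the snapshot layout are assumptions made for this example. The throttled fraction is only a rough indicator that merges are falling behind.

```python
from typing import Dict, Optional

MERGE_COUNT = "indexing.merges.total"
MERGE_TIME = "indexing.merges.time.total"                 # milliseconds
MERGE_THROTTLED = "indexing.merges.throttled.time.total"  # milliseconds


def merge_stats(prev: Dict[str, int], curr: Dict[str, int]) -> Optional[Dict[str, float]]:
    """Summarize merge activity between two snapshots of cumulative counters."""
    merges = curr[MERGE_COUNT] - prev[MERGE_COUNT]
    merge_ms = curr[MERGE_TIME] - prev[MERGE_TIME]
    throttled_ms = curr[MERGE_THROTTLED] - prev[MERGE_THROTTLED]
    if merges < 0 or merge_ms < 0 or throttled_ms < 0:
        # A counter went backwards, most likely because the node restarted.
        return None
    return {
        "merges": merges,
        "avg_merge_ms": merge_ms / merges if merges else 0.0,
        "throttled_fraction": throttled_ms / merge_ms if merge_ms else 0.0,
    }


before = {MERGE_COUNT: 40, MERGE_TIME: 120000, MERGE_THROTTLED: 5000}
after = {MERGE_COUNT: 50, MERGE_TIME: 150000, MERGE_THROTTLED: 11000}
print(merge_stats(before, after))
# {'merges': 10, 'avg_merge_ms': 3000.0, 'throttled_fraction': 0.2}
```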
System and JVM Metrics
OS metrics like CPU, memory, disk I/O, and network play an essential role in OpenSearch monitoring.
OpenSearch runs within a Java Virtual Machine (JVM), so monitoring JVM heap usage is critical to cluster performance. Because the JVM reclaims memory through garbage collection, garbage collection frequency and duration are just as important to measure.
Finally, sustained high disk reads and writes can indicate a poorly tuned system. Because disk access is slow compared with memory access, a well-tuned system should reduce disk I/O wherever possible.
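
To make the heap and garbage collection guidance concrete, here is a small sketch that turns raw readings into two percentages: the share of wall-clock time spent in GC during a sampling interval (from the `gc.collection.time` counter) and heap used relative to heap committed (from the `heap.used` and `heap.committed` gauges). The function names are hypothetical, and what counts as a worrying value depends on the workload, so no thresholds are suggested here.

```python
from typing import Optional


def gc_overhead_percent(prev_gc_ms: int, curr_gc_ms: int, interval_s: float) -> Optional[float]:
    """Percentage of the interval spent in garbage collection.

    prev_gc_ms and curr_gc_ms are consecutive readings of the cumulative
    gc.collection.time counter; interval_s is the time between them.
    """
    if interval_s <= 0 or curr_gc_ms < prev_gc_ms:
        # Invalid interval or counter reset (for example a JVM restart).
        return None
    return 100.0 * (curr_gc_ms - prev_gc_ms) / (interval_s * 1000.0)


def heap_used_percent(heap_used_bytes: int, heap_committed_bytes: int) -> Optional[float]:
    """Heap in use as a percentage of the heap currently committed."""
    if heap_committed_bytes <= 0:
        return None
    return 100.0 * heap_used_bytes / heap_committed_bytes


GIB = 2 ** 30
print(gc_overhead_percent(prev_gc_ms=4000, curr_gc_ms=5500, interval_s=60))  # 2.5
print(heap_used_percent(6 * GIB, 8 * GIB))                                   # 75.0
```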
Metrics
| Metric Name | Key | Type (Unit) | Description |
|---|---|---|---|
| outgoing searches | adaptiveReplicaSelection.searches.outgoing | long gauge | Searches from the monitored node to the remote node |
| average queue size | adaptiveReplicaSelection.queue.size.avg | double gauge | Exponentially weighted moving average queue size for searches on the remote node |
| average service time | adaptiveReplicaSelection.service.time.avg | long gauge (ns) | Exponentially weighted moving average task execution time on the remote node |
| average response time | adaptiveReplicaSelection.response.time.avg | long gauge (ns) | Exponentially weighted moving average response time on the remote node |
| rank | adaptiveReplicaSelection.rank | double gauge | Rank of the remote node used for replica selection |
| inFlightRequests max size | circuitBreaker.inFlightRequests.size.max | long gauge (bytes) | Max in-flight requests size |
| inFlightRequests estimated size | circuitBreaker.inFlightRequests.size.estimate | long gauge (bytes) | Estimated in-flight requests size |
| inFlightRequests overhead | circuitBreaker.inFlightRequests.size.overhead | double gauge | In-flight requests overhead |
| inFlightRequests tripped | circuitBreaker.inFlightRequests.tripped | counter | In-flight requests circuit breaker tripped |
| nodes | cluster.nodes | long gauge | Number of nodes in the cluster |
| data nodes | cluster.nodes.data | long gauge | Number of data nodes in the cluster |
| active primary shards | cluster.health.shards.active.primary | long gauge | Number of active primary shards |
| active shards | cluster.health.shards.active | long gauge | Number of active shards |
| relocating shards | cluster.health.shards.relocating | long gauge | Number of currently relocating shards |
| unassigned shards | cluster.health.shards.unassigned | long gauge | Number of currently unassigned shards |
| pending tasks | cluster.health.pendingTasks.total | long gauge | Number of currently pending tasks in master's queue |
| pending tasks max waiting time | cluster.health.pendingTasks.maxQueueTime | long gauge (ms) | Maximum time for a task in master's queue |
| open TCP conns | connection.tcp.server.open | long gauge | Open TCP conns (server_open) |
| network received packets | transport.rx.packets | counter | Network received packets count (rx_count) |
| network received size | transport.rx.bytes | counter (bytes) | Network received size (rx_size) |
| network transmitted packets | transport.tx.packets | counter | Network transmitted packets count (tx_count) |
| network transmitted size | transport.tx.bytes | counter (bytes) | Network transmitted size (tx_size) |
| cluster state incompatible diff updates | cluster.state.published.diff.incompatible | counter | Cluster state incompatible diff updates published |
| cluster state compatible diff updates | cluster.state.published.diff.compatible | counter | Cluster state compatible diff updates published |
| docs count | index.docs.total | long gauge | Docs count on all (primary and replica) shards |
| docs deleted | index.docs.deleted.total | long gauge | Docs deleted on all (primary and replica) shards |
| ingest calls | ingest.calls.total | counter | Number of calls to this pipeline |
| ingest failures | ingest.calls.errors | counter | Number of failed calls to this pipeline |
| ingest time | ingest.time | counter (ms) | Time spent in this pipeline |
| read ops | disk.io.operations.read | counter | Disk IO read operations |
| write ops | disk.io.operations.write | counter | Disk IO write operations |
| gc collection count | gc.collection.count | counter | Count of GC collections |
| gc collection time | gc.collection.time | counter (ms) | Duration of GC collections |
| heap_used | heap.used | gauge (bytes) | JVM heap used memory |
| heap.committed | heap.committed | gauge (bytes) | JVM heap committed |
| non_heap_used | nonheap.used | gauge (bytes) | JVM non-heap used memory |
| non_heap_committed | nonheap.committed | gauge (bytes) | JVM non-heap committed |
| open files | files.open | gauge | JVM currently open files |
| max open files | files.max | gauge | JVM max open files limit |
| used | pool.used | gauge (bytes) | JVM pool used memory |
| max | pool.max | gauge (bytes) | JVM pool max memory |
| thread count | threads | gauge | JVM thread count |
| peak thread count | threads.peak | gauge | Peak JVM thread count |
| merge count | indexing.merges.total | counter | Merge count on all (primary and replica) shards |
| merge time | indexing.merges.time.total | counter (ms) | Merge time on all (primary and replica) shards |
| merged docs count | indexing.merges.docs.total | counter | Merged docs count on all (primary and replica) shards |
| merged docs size | indexing.merges.docs.size.total | counter (bytes) | Merged docs size on all (primary and replica) shards |
| throttled merge time | indexing.merges.throttled.time.total | counter (ms) | Throttled time for merges (i.e. when merges fall behind) on all (primary and replica) shards |
| throttled merge size | indexing.merges.throttled.size.total | counter (bytes) | Size of throttled merges on all (primary and replica) shards |
| field cache evictions | cache.field.evicted | counter | Field cache evictions |
| field cache size | cache.field.size | long gauge | Field cache size |
| number of processors | cpu.allocated.count | long gauge | Number of processors allocated to the OpenSearch process |
| refresh count | indexing.refreshes.total | counter | Refresh count on all (primary and replica) shards |
| refresh time | indexing.refreshes.time.total | counter (ms) | Refresh time on all (primary and replica) shards |
| script compilations | script.compilations.total | counter | Script compilations (use params in scripts to reduce them) |
| script cache evictions | script.cache.evictions | counter | Script cache evictions |
| script compilations limit triggered | script.compilations.limitTriggered | counter | Script compilations circuit breaker triggered (see script.max_compilations_rate setting) |
| query count | query.count.total | counter | Query count on all (primary and replica) shards |
| query latency | query.latency.time.total | counter (ms) | Query latency on all (primary and replica) shards |
| fetch count | fetch.count.total | counter | Fetch count on all (primary and replica) shards |
| fetch latency | fetch.latency.time.total | counter (ms) | Fetch latency on all (primary and replica) shards |
| suggest count | suggest.count.total | counter | Suggest count on all (primary and replica) shards |
| suggest latency | suggest.latency.time.total | counter (ms) | Suggest latency on all (primary and replica) shards |
| scroll count | scroll.count.total | counter | Scroll count on all (primary and replica) shards |
| scroll latency | scroll.latency.time.total | counter (ms) | Scroll latency on all (primary and replica) shards |
| search open contexts | search.opencontexts.total | long gauge | Open search contexts on all (primary and replica) shards |
| avg. query latency | query.latency.total.avg | gauge (ms) | Avg. query latency on all (primary and replica) shards |
| segments count | segments.count.total | long gauge | Number of segments |
| segments memory | segments.memory.total | long gauge (bytes) | Total memory for segment-related data structures |
| terms memory | segments.memory.terms | long gauge (bytes) | Memory used by the terms dictionary |
| stored fields memory | segments.memory.storedFields | long gauge (bytes) | Memory used by stored fields |
| term vectors memory | segments.memory.termVectors | long gauge (bytes) | Memory used by term vectors |
| norms memory | segments.memory.norms | long gauge (bytes) | Memory used by (length) norms |
| points memory | segments.memory.points | long gauge (bytes) | Memory used by point fields (includes numeric, date, geo) |
| doc values memory | segments.memory.docValues | long gauge (bytes) | Memory used by doc values |
| indexing buffer memory | segments.memory.indexWriter | long gauge (bytes) | Memory used by the IndexWriter |
| version map memory | segments.memory.versionMap | long gauge (bytes) | Memory used by the version map |
| fixed bitset memory | segments.memory.fixedBitSet | long gauge (bytes) | Memory used by the fixed bitset that speeds up nested queries/aggregations |
| unassigned shards | cluster.shards.unassigned | long gauge | Number of unassigned shards |
| active shards | cluster.shards.active | long gauge | Number of active shards |
| active primary shards | cluster.shards.active.primary | long gauge | Number of active primary shards |
| initializing shards | cluster.shards.initializing | long gauge | Number of initializing shards |
| relocating shards | cluster.shards.relocating | long gauge | Number of relocating shards |
| active threads | thread.pool.active | long gauge | Active threads |
| thread pool size | thread.pool.size | long gauge | Thread pool size |
| thread pool queue | thread.pool.queue | long gauge | Thread pool queue |
| thread pool queue size | thread.pool.queue.size | long gauge | Thread pool queue size |
| rejected threads | thread.pool.rejected | counter | Rejected threads |
| thread pool largest | thread.pool.largest | long gauge | Thread pool largest |
| completed threads | thread.pool.completed | counter | Complete threads |
| thread pool min | thread.pool.min | long gauge | Thread pool min |
| thread pool max | thread.pool.max | long gauge | Thread pool max |
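
The metrics above can feed simple alert checks. The sketch below is one possible shape for that, assuming each snapshot is a plain dictionary keyed by the metric keys from the table. The rule tables, the limits in them, and the `evaluate_alerts` function are all hypothetical values chosen for illustration, not recommended thresholds or a Sematext feature.

```python
from typing import Dict, List

# Hypothetical limits: alert when a gauge is above the given value.
GAUGE_LIMITS = {
    "cluster.health.shards.unassigned": 0,
    "cluster.health.pendingTasks.total": 50,
}

# Counters where any increase between two snapshots is worth a look.
WATCHED_COUNTERS = [
    "thread.pool.rejected",
    "circuitBreaker.inFlightRequests.tripped",
    "script.compilations.limitTriggered",
]


def evaluate_alerts(prev: Dict[str, float], curr: Dict[str, float]) -> List[str]:
    """Compare two metric snapshots and return human-readable alert messages."""
    alerts = []
    for key, limit in GAUGE_LIMITS.items():
        value = curr.get(key)
        if value is not None and value > limit:
            alerts.append(f"{key} = {value} (limit {limit})")
    for key in WATCHED_COUNTERS:
        # Only compare counters present in both snapshots.
        if key in prev and key in curr and curr[key] > prev[key]:
            alerts.append(f"{key} increased by {curr[key] - prev[key]}")
    return alerts


previous = {"thread.pool.rejected": 10, "circuitBreaker.inFlightRequests.tripped": 0}
current = {
    "cluster.health.shards.unassigned": 2,
    "cluster.health.pendingTasks.total": 3,
    "thread.pool.rejected": 14,
    "circuitBreaker.inFlightRequests.tripped": 0,
}
for message in evaluate_alerts(previous, current):
    print(message)
# cluster.health.shards.unassigned = 2 (limit 0)
# thread.pool.rejected increased by 4
```

Gauges are compared against a limit because their current value is meaningful on its own, while counters are compared against the previous snapshot because only their growth matters.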