Metrics
CommCare Infrastructure Metrics
CommCare uses Datadog and Prometheus for monitoring various system, application and custom metrics. Datadog supports a variety of applications and is easily extendable.
Below are a few tables listing various metrics of the system and service infrastructure used to run CommCare. The list is neither absolute nor exhaustive, but it provides a basis for monitoring the following aspects of your system:
Performance
Throughput
Utilization
Availability
Errors
Saturation
Each table has the following columns:

Metric
   Name of the metric.

Metric type
   Category or aspect of the system the metric speaks to.

Why care
   Brief description of why the metric is important.

User impact
   The impact on the user if an undesired reading is recorded.

How to measure
   A note on how the metric might be obtained. Please note that it is assumed that Datadog will be used as the monitoring solution unless specified otherwise.
General Host
The Datadog Agent ships with an integration which can be used to collect metrics from your base system. See the System Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
CPU usage (%) |
Utilization |
Monitoring server CPU usage helps you understand how much your CPU is being used, as a very high load might result in overall performance degradation. |
Lagging experience |
system.cpu.idle
system.cpu.system
system.cpu.iowait
system.cpu.user
|
Load averages 1-5-15 |
Utilization |
Load average (CPU demand) over 1 min, 5 min and 15 min which includes the sum of running and waiting threads. What is load average |
User might experience trouble connecting to the server |
system.load.1
system.load.5
system.load.15
|
Memory |
Utilization |
It shows the amount of memory used over time. Running out of memory may result in killed processes or more swap memory used, which will slow down your system. Consider optimizing processes or increasing resources. |
Slow performance |
system.mem.usable
system.mem.total
|
Swap memory |
Utilization |
This metric shows the amount of swap memory used. Swap memory is slow, so if your system depends too much on swap, you should investigate why RAM usage is so high. Note that it is normal for systems to use a little swap memory even if RAM is available. |
Server unresponsiveness |
system.swap.free
system.swap.used
|
Disk usage |
Utilization |
Disk usage is important to prevent data loss in the event that the disk runs out of available space. |
Data loss |
system.disk.in_use |
Disk latency |
Throughput |
The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. High disk latency will result in slow response times for things like reports, app installs and other services that read from disk. |
Slow performance |
system.io.await |
Network traffic |
Throughput |
This indicates the amount of incoming and outgoing traffic on the network. This metric is a good gauge on the average network activity on the system. Low or consistently plateauing network throughput will result in poor performance experienced by end users, as sending and receiving data from them will be throttled. |
Slow performance |
system.net.bytes_rcvd
system.net.bytes_sent
|
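
If you want to spot-check the host metrics above without going through Datadog, a small script can read the same values locally. The sketch below assumes the third-party ``psutil`` package is installed; it is an illustration only, not a replacement for the Datadog Agent.

.. code-block:: python

   import psutil

   def host_snapshot():
       load1, load5, load15 = psutil.getloadavg()   # 1/5/15 minute load averages
       mem = psutil.virtual_memory()                # RAM usage
       swap = psutil.swap_memory()                  # swap usage
       disk = psutil.disk_usage("/")                # root filesystem usage
       net = psutil.net_io_counters()               # cumulative network traffic
       return {
           "cpu_percent": psutil.cpu_percent(interval=1),
           "load_avg": (load1, load5, load15),
           "mem_percent": mem.percent,
           "swap_percent": swap.percent,
           "disk_percent": disk.percent,
           "net_bytes": (net.bytes_recv, net.bytes_sent),
       }

   if __name__ == "__main__":
       print(host_snapshot())
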
Gunicorn
The Datadog Agent ships with an integration which can be used to collect metrics. See the Gunicorn Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Requests per second |
Throughput |
This metric shows the rate of requests received. This can be used to give an indication of how busy the application is. If you’re constantly getting a high request rate, keep an eye out for bottlenecks on your system. |
Slow user experience or trouble accessing the site. |
gunicorn.requests |
Request duration |
Throughput |
Long request duration times can point to problems in your system / application. |
Slow experience and timeouts |
gunicorn.request.duration.* |
Http status codes |
Performance |
A high rate of error codes can either mean your application has faulty code or some part of your application infrastructure is down. |
User might get errors on pages |
gunicorn.request.status.* |
Busy vs idle Gunicorn workers |
Utilization |
This metric can be used to give an indication of how busy the gunicorn workers are over time. If most of the workers are busy most of the time, it might be necessary to start thinking of increasing the number of workers before users start having trouble accessing your site. |
Slow user experience or trouble accessing the site. |
gunicorn.workers |
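
The request metrics above rely on Gunicorn's built-in StatsD support sending data to the DogStatsD listener that ships with the Datadog Agent. Below is a minimal sketch of a ``gunicorn.conf.py``; the bind address, worker count, process name and StatsD address are placeholders to adapt for your deployment.

.. code-block:: python

   # Example Gunicorn configuration (placeholders, not a recommended setup).
   bind = "127.0.0.1:9010"            # address the application is served on
   workers = 4                        # busy vs. idle workers are reported per worker
   proc_name = "gunicorn: commcare"   # a distinct name helps the Agent's Gunicorn check find workers

   # Emit request counts, durations and status codes to the local DogStatsD
   # listener (default port 8125).
   statsd_host = "127.0.0.1:8125"
   statsd_prefix = "gunicorn"
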
Nginx
The Datadog Agent ships with an integration which can be used to collect metrics. See the Nginx Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Total requests |
Throughput |
This metric indicates the number of client requests your server handles. High rates mean a bigger load on the system. |
Slow experience |
nginx.requests.total |
Requests per second |
Throughput |
This metric shows the rate of requests received. This can be used to give an indication of how busy the application is. If you’re constantly getting a high request rate, keep an eye out for services that might need additional resources to perform optimally. |
Slow user experience or trouble accessing the site. |
nginx.net.request_per_s |
Dropped connections |
Errors |
If NGINX starts to drop an increasing number of connections, it usually indicates a resource constraint, such as NGINX’s worker_connections limit having been reached. An investigation might be in order. |
Users will not be able to access the site. |
nginx.connections.dropped |
Server error rate |
Error |
Your server error rate is equal to the number of 5xx errors divided by the total number of status codes. If your error rate starts to climb over time, investigation may be in order. If it spikes suddenly, urgent action may be required, as clients are likely to report errors to the end user. |
User might get errors on pages |
nginx.server_zone.responses.5xx
nginx.server_zone.responses.total_count
|
Request time |
Performance |
This is the time in seconds used to process the request. Long response times can point to problems in your system / application. |
Slow experience and timeouts |
|
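
The NGINX integration reads the server's ``stub_status`` page. The sketch below fetches the same counters with ``requests`` and derives dropped connections as accepted minus handled connections; the status URL is an assumption for your configuration.

.. code-block:: python

   import re
   import requests

   STATUS_URL = "http://127.0.0.1/nginx_status"  # assumed stub_status location

   def nginx_status():
       text = requests.get(STATUS_URL, timeout=5).text
       active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
       accepts, handled, total_requests = map(
           int, re.search(r"\n\s*(\d+)\s+(\d+)\s+(\d+)", text).groups()
       )
       return {
           "active_connections": active,
           "total_requests": total_requests,
           # connections accepted but never handled were dropped
           "dropped_connections": accepts - handled,
       }

   if __name__ == "__main__":
       print(nginx_status())
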
PostgreSQL
PostgreSQL has a statistics collector subsystem that collects and reports on information about the server activity.
The Datadog Agent ships with an integration which can be used to collect metrics. See the PostgreSQL Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Sequential scans on table vs. Index scans on table |
Other |
This metric speaks directly to the speed of query execution. If the DB is making more sequential scans than indexed scans you can improve the DB’s performance by creating an index. |
Tasks that require data to be fetched from the DB will take a long time to execute. |
|
Rows fetched vs. returned by queries to DB |
Throughput |
This metric shows how effectively the DB is scanning through its data. If many more rows are constantly fetched vs returned, it means there’s room for optimization. |
Slow experience for tasks that access large parts of the database. |
|
Amount of data written temporarily to disk to execute queries |
Saturation |
If the DB’s temporary storage is constantly used up, you might need to increase it in order to optimize performance. |
Slow experience for tasks that read data from the database. |
|
Rows inserted, updated, deleted (by database) |
Throughput |
This metric gives an indication of what type of write queries your DB serves most. If a high rate of update or delete queries persists, keep an eye out for increasing dead rows, as these will degrade DB performance. |
No direct impact |
|
Locks |
Other |
A high lock rate in the DB is an indication that queries could be long-running and that future queries might start to time out. |
Slow experience for tasks that read data from the database. |
|
Deadlocks |
Other |
The aim is to have no deadlocks as it’s resource intensive for the DB to check for them. Having many deadlocks calls for reevaluating execution logic. Read more |
Slow experience for tasks that read data from the database. Some tasks may even hang and the user will get errors on pages. |
|
Dead rows |
Other |
A constantly increasing number of dead rows shows that the DB’s VACUUM process is not working properly. This will affect DB performance negatively. |
Slow experience for tasks that read data from the database. |
|
Replication delay |
Other |
A higher delay means data is less consistent across replication servers. |
In the worst case, some data may appear missing. |
|
Number of checkpoints requested vs scheduled |
Other |
Having more requested checkpoints than scheduled checkpoints means decreased write performance for the DB. `Read more <https://www.cybertec-postgresql.com/en/postgresql-what-is-a-checkpoint/>`__ |
Slow experience for tasks that read data from the database. |
|
Active connections |
Utilization |
If the number of active connections consistently approaches the maximum number of connections, it can indicate that applications are issuing long-running queries and constantly creating new connections for other requests instead of reusing existing ones. Opening a DB connection is typically an expensive operation, so using a connection pool helps ensure that connections are reused as they go idle, instead of placing load on the primary server to frequently open and close connections. |
Users might get errors on pages which need to access the database but cannot due to too many currently active connections. |
|
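
Most of the PostgreSQL metrics above come from the statistics collector views. The sketch below illustrates, using ``psycopg2``, which views back which rows of the table; the connection string is a placeholder, and on newer PostgreSQL versions some checkpoint counters may live in ``pg_stat_checkpointer`` rather than ``pg_stat_bgwriter``.

.. code-block:: python

   import psycopg2

   QUERIES = {
       # sequential vs. index scans per table
       "scans": "SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables",
       # rows fetched vs. returned, temporary bytes and deadlocks per database
       "database": """
           SELECT datname, tup_fetched, tup_returned, temp_bytes, deadlocks
           FROM pg_stat_database
       """,
       # dead rows that VACUUM should be cleaning up
       "dead_rows": "SELECT relname, n_dead_tup FROM pg_stat_user_tables",
       # requested vs. scheduled checkpoints
       "checkpoints": "SELECT checkpoints_req, checkpoints_timed FROM pg_stat_bgwriter",
       # active connections vs. the configured maximum
       "connections": """
           SELECT count(*) FILTER (WHERE state = 'active') AS active,
                  current_setting('max_connections')::int AS max
           FROM pg_stat_activity
       """,
   }

   def collect(dsn="dbname=commcarehq user=postgres host=127.0.0.1"):  # placeholder DSN
       with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
           results = {}
           for name, sql in QUERIES.items():
               cur.execute(sql)
               results[name] = cur.fetchall()
           return results
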
Elasticsearch
The Datadog Agent ships with an integration which can be used to collect metrics. See the Elasticsearch Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Query load |
Utilization |
Monitoring the number of queries currently in progress can give you a rough idea of how many requests your cluster is dealing with at any particular moment in time. |
A high load might slow down any tasks that involve searching users, groups, forms, cases, apps etc. |
elasticsearch.primaries.search.query.current |
Average query latency |
Throughput |
If this metric shows query latency increasing, your queries are becoming slower, which points to bottlenecks or inefficient queries. |
Slow user experience when generating reports, filtering groups or users, etc. |
elasticsearch.primaries.search.query.total
elasticsearch.primaries.search.query.time
|
Average fetch latency |
Throughput |
This should typically take less time than the query phase. If this metric is constantly increasing it could indicate problems with slow disks or requesting of too many results. |
Slow user experience when generating reports, filtering groups or users, etc. |
elasticsearch.primaries.search.fetch.total
elasticsearch.primaries.search.fetch.time
|
Average index latency |
Throughput |
If you notice increasing latency, you may be trying to index too many documents simultaneously. |
Slow user experience when generating reports, filtering groups or users, etc. |
elasticsearch.indexing.index.total
elasticsearch.indexing.index.time
|
Average flush latency |
Throughput |
Data is only persisted on disk after a flush. If this metric increases with time it may indicate a problem with a slow disk. If this problem escalates it may prevent you from being able to add new information to your index. |
Slow user experience when generating reports, filtering groups or users, etc. In the worst case there may be some data loss. |
elasticsearch.primaries.flush.total
elasticsearch.primaries.flush.total.time
|
Percent of JVM heap currently in use |
Utilization |
Garbage collections should initiate around 75% of heap use. When this value is consistently going above 75% it indicates that the rate of garbage collection is not keeping up with the rate of garbage creation which might result in memory errors down the line. |
Users might experience errors on some pages |
jvm.mem.heap_in_use |
Total time spent on garbage collection |
Other |
The garbage collection process halts the node, during which the node cannot complete tasks. If this halting duration exceeds the routine status check (around 30 seconds) the node might mistakenly be marked as offline. |
Users can have a slow experience and in the worst case might even get errors on some pages. |
jvm.gc.collectors.young.collection_time
jvm.gc.collectors.old.collection_time
|
Total HTTP connections opened over time |
Other |
If this number is constantly increasing it means that HTTP clients are not properly establishing persistent connections. Reestablishing adds additional overhead and might result in requests taking unnecessarily long to complete. |
Slow user experience when generating reports, filtering groups or users, etc. |
elasticsearch.http.total_opened |
Cluster status |
Other |
The status will indicate when at least one replica shard is unallocated or missing. If more shards disappear you may lose data. |
Missing data (not data loss, as Elasticsearch is a secondary database) |
elasticsearch.cluster_health |
Number of unassigned shards |
Availability |
When you first create an index, or when a node is rebooted, its shards will briefly be in an “initializing” state before transitioning to a status of “started” or “unassigned”, as the primary node attempts to assign shards to nodes in the cluster. If you see shards remain in an initializing or unassigned state too long, it could be a warning sign that your cluster is unstable. |
Slow user experience when generating reports, filtering groups or users, etc. |
elasticsearch.unassigned_shards |
Thread pool queues |
Saturation |
Large queues are not ideal because they use up resources and also increase the risk of losing requests if a node goes down. |
Slow user experience when generating reports, filtering groups or users, etc. In the worst case, queued requests may be lost. |
elasticsearch.thread_pool.bulk.queue |
Pending tasks |
Saturation |
The number of pending tasks is a good indication of how smoothly your cluster is operating. If your primary node is very busy and the number of pending tasks doesn’t subside, it can lead to an unstable cluster. |
Slow user experience when generating reports, filtering groups or users, etc. |
elasticsearch.pending_tasks_total |
Unsuccessful GET requests |
Error |
An unsuccessful get request means that the document ID was not found. You shouldn’t usually have a problem with this type of request, but it may be a good idea to keep an eye out for unsuccessful GET requests when they happen. |
User might get errors on some pages |
elasticsearch.get.missing.total |
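
Several of the cluster-level values above are also exposed directly by Elasticsearch's cluster-health API. The sketch below polls it with ``requests``; the host and port are assumptions for a local node.

.. code-block:: python

   import requests

   ES_URL = "http://127.0.0.1:9200"  # assumed Elasticsearch HTTP endpoint

   def cluster_health():
       health = requests.get(f"{ES_URL}/_cluster/health", timeout=5).json()
       return {
           "status": health["status"],                       # green / yellow / red
           "unassigned_shards": health["unassigned_shards"],
           "pending_tasks": health["number_of_pending_tasks"],
       }

   if __name__ == "__main__":
       state = cluster_health()
       if state["status"] != "green" or state["unassigned_shards"] > 0:
           print("cluster needs attention:", state)
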
CouchDB
The Datadog Agent ships with an integration which can be used to collect metrics. See the CouchDB Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Open databases |
Availability |
If the number of open databases is too low, database requests might start to pile up. |
Slow user experience if the requests start to pile up high. |
couchdb.couchdb.open_databases |
File descriptors |
Utilization |
If this number reaches the max number of available file descriptors, no new connections can be opened until older ones have closed. |
The user might get errors on some pages. |
couchdb.couchdb.open_os_files |
Data size |
Utilization |
This indicates the relative size of your data. Keep an eye on this as it grows to make sure your system has enough disk space to support it. |
Data loss |
couchdb.by_db.file_size |
HTTP Request Rate |
Throughput |
Gives an indication of how many requests are being served. |
Slow performance |
couchdb.couchdb.httpd.requests |
Request with status code of 2xx |
Performance |
Statuses in the 2xx range are generally indications of successful operation. |
No negative impact |
couchdb.couchdb.httpd_status_codes |
Request with status code of 4xx and 5xx |
Performance |
Statuses in the 4xx and 5xx ranges generally tell you something is wrong, so you want this number as low as possible, preferably zero. However, if you constantly see requests yielding these statuses, it might be worth looking into the matter. |
Users might get errors on some pages. |
couchdb.couchdb.httpd_status_codes |
Workload - Reads & Writes |
Performance |
These numbers will depend on the application, but having this metric gives an indication of how busy the database generally is. In the case of a high workload, consider ramping up the resources. |
Slow performance |
couchdb.couchdb.database_reads
couchdb.couchdb.database_writes
|
Average request latency |
Throughput |
If the average request latency is rising, there is a bottleneck somewhere that needs to be addressed. |
Slow performance |
couchdb.couchdb.request_time.arithmetic_mean |
Cache hits |
Other |
CouchDB stores a fair amount of user credentials in memory to speed up the authentication process. Monitoring usage of the authentication cache can alert you for possible attempts to gain unauthorized access. |
A low number of hits might mean slower performance |
couchdb.couchdb.auth_cache_hits |
Cache misses |
Error |
If CouchDB reports a high number of cache misses, then either the cache is undersized to service the volume of legitimate user requests, or a brute force password/username attack is taking place. |
Slow performance |
couchdb.couchdb.auth_cache_misses |
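
These counters are read from CouchDB's HTTP API. The sketch below assumes CouchDB 2.x or later, where node-level statistics live under ``/_node/_local/_stats``; the URL, credentials and database name are placeholders.

.. code-block:: python

   import requests

   COUCH_URL = "http://127.0.0.1:5984"   # assumed CouchDB address
   AUTH = ("admin", "change-me")         # placeholder credentials

   def couch_overview(db_name="commcarehq"):  # example database name
       # node-level counters that back metrics such as the httpd request total
       stats = requests.get(f"{COUCH_URL}/_node/_local/_stats", auth=AUTH, timeout=5).json()
       # per-database info, including on-disk file size
       db_info = requests.get(f"{COUCH_URL}/{db_name}", auth=AUTH, timeout=5).json()
       return {
           "request_count": stats["couchdb"]["httpd"]["requests"]["value"],
           "db_file_size_bytes": db_info["sizes"]["file"],
           "doc_count": db_info["doc_count"],
       }
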
Kafka
The Datadog Agent ships with a Kafka Integration to collect various Kafka metrics. Also see Integrating Datadog, Kafka and Zookeper.
Broker Metrics
Metric |
Metric type |
Why care |
User impact |
How to measure |
UnderReplicatedPartitions |
Availability |
If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods. |
Fewer in-sync replicas means the reports might take longer to show the latest values. |
kafka.replication.under_replicated_partitions
|
IsrShrinksPerSec |
Availability |
The rate at which the in-sync replicas shrinks for a particular partition. This value should remain fairly static. You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter. |
As the in-sync replicas become fewer, the reports might take longer to show the latest values. |
kafka.replication.isr_shrinks.rate
|
IsrExpandsPerSec |
Availability |
The rate at which the in-sync replicas expands. |
As the in-sync replicas become fewer, the reports might take longer to show the latest values. |
kafka.replication.isr_expands.rate
|
TotalTimeMs |
Performance |
This metric reports the total time taken to service a request. |
Longer servicing times mean data-updates take longer to propagate to the reports. |
kafka.request.produce.time.avg
kafka.request.fetch_consumer.time.avg
kafka.request.fetch_follower.time.avg
|
ActiveControllerCount |
Error |
The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. You should alert on any other value that lasts for longer than one second. In the case that no controller is found, Kafka might become unstable and new data might not be updated. |
Reports might not show new updated data, or even break. |
kafka.replication.active_controller_count
|
Broker network throughput |
Throughput |
This metric indicates the broker throughput. |
If throughput drops, the user might find that reports take longer to reflect updated data. |
kafka.net.bytes_in.rate
kafka.net.bytes_out.rate
|
Clean vs unclean leaders elections |
Error |
When a partition leader dies, an election for a new leader is triggered. New leaders should only come from replicas that are in sync with the previous leader; however, a configuration setting can allow unclean elections. An unclean leader is not completely in sync with the previous leader, so electing one loses any data produced to Kafka before the full sync happened. You should alert on any unclean leader elections. |
Data might be missing in reports. (the data will not be lost, as the data is already stored in PostgreSQL or CouchDB, but the reports will not reflect the latest changes) |
kafka.replication.leader_elections.rate
kafka.replication.unclean_leader_elections.rate
|
Fetch/request purgatory |
Other |
Produce and fetch requests are held in purgatory until they can be satisfied, for example until the requested data is available or the required acknowledgements have been received. A steadily growing purgatory size usually shows up as higher request latency and is worth investigating alongside TotalTimeMs. |
Reports might take longer to reflect the latest data. |
kafka.request.producer_request_purgatory.size
kafka.request.fetch_request_purgatory.size
|
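
Since a non-zero UnderReplicatedPartitions value warrants investigation, it is a natural candidate for an alert. The sketch below uses the ``datadog`` (datadogpy) client to create such a monitor; the API/app keys, evaluation window, threshold and notification handle are placeholders to adapt.

.. code-block:: python

   from datadog import initialize, api

   initialize(api_key="<DATADOG_API_KEY>", app_key="<DATADOG_APP_KEY>")  # placeholder keys

   # Alert whenever any broker reports under-replicated partitions.
   api.Monitor.create(
       type="metric alert",
       query="max(last_5m):max:kafka.replication.under_replicated_partitions{*} > 0",
       name="Kafka: under-replicated partitions",
       message="A broker may be down or falling behind on replication. @ops-team",
       tags=["service:kafka"],
   )
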
Producer Metrics
Metric |
Metric type |
Why care |
User impact |
How to measure |
Request rate |
Throughput |
The request rate is the rate at which producers send data to brokers. Keeping an eye on peaks and drops is essential to ensure continuous service availability. |
Reports might take longer to reflect the latest data. |
kafka.producer.request_rate |
Response rate |
Throughput |
Average number of responses received per second from the brokers after the producers sent the data to the brokers. |
Reports might take longer to reflect the latest data. |
kafka.producer.response_rate |
Request latency average |
Throughput |
Average request latency (in ms). Read more |
Reports might take longer to reflect the latest data. |
kafka.producer.request_latency_avg |
Outgoing byte rate |
Throughput |
Monitoring producer network traffic will help to inform decisions on infrastructure changes, as well as to provide a window into the production rate of producers and identify sources of excessive traffic. |
High network throughput might cause reports to take a longer time to reflect the latest data, as Kafka is under heavier load. |
kafka.net.bytes_out.rate |
Batch size average |
Throughput |
To use network resources more efficiently, Kafka producers attempt to group messages into batches before sending them. The producer will wait to accumulate an amount of data defined by the batch size. Read more |
If the batch size average is too low, reports might take a longer time to reflect the latest data. |
kafka.producer.batch_size_avg |
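
The batch-size metric above is driven by producer configuration. The sketch below uses the third-party ``kafka-python`` client to show the relevant settings; the broker address, topic and values are examples only.

.. code-block:: python

   from kafka import KafkaProducer

   producer = KafkaProducer(
       bootstrap_servers=["127.0.0.1:9092"],  # assumed broker address
       batch_size=32 * 1024,   # bytes to accumulate before a batch is sent
       linger_ms=10,           # how long to wait for a batch to fill up
       acks="all",             # wait for in-sync replicas to acknowledge
   )
   producer.send("example-topic", b"payload")  # hypothetical topic and payload
   producer.flush()
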
Consumer Metrics
Metric |
Metric type |
Why care |
User impact |
How to measure |
Records lag |
Performance |
Number of messages consumers are behind producers on this partition. The significance of these metrics’ values depends completely upon what your consumers are doing. If you have consumers that back up old messages to long-term storage, you would expect records lag to be significant. However, if your consumers are processing real-time data, consistently high lag values could be a sign of overloaded consumers, in which case both provisioning more consumers and splitting topics across more partitions could help increase throughput and reduce lag. |
Reports might take longer to reflect the latest data. |
kafka.consumer_lag |
Records consumed rate |
Throughput |
Average number of records consumed per second for a specific topic or across all topics. |
Reports might take longer to reflect the latest data. |
kafka.consumer.records_consumed |
Fetch rate |
Throughput |
Number of fetch requests per second from the consumer. |
Reports might take longer to reflect the latest data. |
kafka.request.fetch_rate |
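
Records lag can also be estimated from the consumer side. The sketch below uses ``kafka-python`` to compare the latest offsets with the committed offsets per partition; the broker address, topic and consumer group are placeholders.

.. code-block:: python

   from kafka import KafkaConsumer, TopicPartition

   def records_lag(topic="example-topic", group="example-group"):  # placeholder names
       consumer = KafkaConsumer(
           bootstrap_servers=["127.0.0.1:9092"],  # assumed broker address
           group_id=group,
           enable_auto_commit=False,
       )
       partitions = [
           TopicPartition(topic, p)
           for p in (consumer.partitions_for_topic(topic) or set())
       ]
       end_offsets = consumer.end_offsets(partitions)   # latest offset per partition
       lag = {}
       for tp in partitions:
           committed = consumer.committed(tp) or 0       # last committed offset
           lag[tp.partition] = end_offsets[tp] - committed
       consumer.close()
       return lag
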
Zookeeper
The Datadog Agent ships with an integration which can be used to collect metrics. See the Zookeeper Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Outstanding requests |
Saturation |
This shows the number of requests still to be processed. Tracking both outstanding requests and latency can give you a clearer picture of the causes behind degraded performance. |
Reports might take longer to reflect the latest data. |
zookeeper.outstanding_requests |
Average latency |
Throughput |
This metric records the amount of time it takes to respond to a client request (in ms). |
Reports might take longer to reflect the latest data. |
zookeeper.latency.avg |
Open file descriptors |
Utilization |
Linux has a limited number of file descriptors available, so it’s important to keep an eye on this metric to ensure ZooKeeper can continue to function as expected. |
Reports might not reflect new data, as ZooKeeper will be getting errors. |
zookeeper.open_file_descriptor_count |
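
The ZooKeeper values above map to counters reported by ZooKeeper's four-letter ``mntr`` command. The sketch below queries it over a plain socket; the host and port are assumptions, and on ZooKeeper 3.5+ the command must be whitelisted via ``4lw.commands.whitelist``.

.. code-block:: python

   import socket

   def zk_mntr(host="127.0.0.1", port=2181):
       with socket.create_connection((host, port), timeout=5) as sock:
           sock.sendall(b"mntr")
           data = b""
           while chunk := sock.recv(4096):  # ZooKeeper closes the connection when done
               data += chunk
       stats = dict(
           line.split("\t", 1) for line in data.decode().splitlines() if "\t" in line
       )
       return {
           "avg_latency": stats.get("zk_avg_latency"),
           "outstanding_requests": stats.get("zk_outstanding_requests"),
           "open_file_descriptors": stats.get("zk_open_file_descriptor_count"),
       }
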
Celery
The Datadog Agent ships with an HTTP Check integration to collect various network metrics. In addition, CommCare HQ reports many custom metrics for Celery; it might be worth having a look at Datadog’s Custom Metrics page. Celery Flower can also be used to monitor tasks and workers.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Celery uptime |
Availability |
The uptime rating is a measure of service availability. |
Background tasks will not execute (sending of emails, periodic reporting to external partners, report downloads, etc) |
network.http.can_connect |
Celery uptime by queue |
Availability |
The uptime rating as per queue. |
Certain background or asynchronous tasks will not get executed. The user might not notice this immediately. |
CommCare HQ custom metric |
Time to start |
Other |
This metric shows the time (seconds) it takes a task in a specific queue to start executing. If a certain task consistently takes a long time to start, it might be worth looking into. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
CommCare HQ custom metric |
Blockage duration by queue |
Throughput |
This metric indicates the estimated time (seconds) a certain queue was blocked. It might be worth alerting if the blockage lasts longer than a specified time. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
CommCare HQ custom metric |
Task execution rate |
Throughput |
This metric gives a rough estimate of the number of tasks executed within a certain time bracket. This can be an important metric, as it will indicate when more and more tasks take longer to execute, in which case an investigation might be appropriate. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
CommCare HQ custom metric |
Celery tasks by host |
Throughput |
Indicates the running time (seconds) for celery tasks by host. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
CommCare HQ custom metric |
Celery tasks by queue |
Throughput |
Indicates the running time (seconds) for celery tasks by queue. This way you can identify slower queues. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
CommCare HQ custom metric |
Celery tasks by task |
Throughput |
Indicates the running time (seconds) for celery tasks by each respective task. Slower tasks can be identified. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
CommCare HQ custom metric |
Tasks queued by queue |
Saturation |
Indicates the number of tasks queued by each respective queue. If this becomes increasingly large, keep an eye out for blockages. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
|
Tasks failing by worker |
Error |
Indicates tasks that failed to execute. Increasing numbers indicate problems with the respective worker(s). |
If certain background or asynchronous tasks fail, certain features become unusable, for example sending emails, SMS’s, periodic reporting etc. |
|
Tasks by state |
Other |
This metric shows the number of tasks by their celery state. If the number of failed tasks increases for instance, it might be worth looking into. |
If certain background or asynchronous tasks fail, certain features become unusable, for example sending emails, SMS’s, periodic reporting etc. |
|
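
Most of the metrics above are reported as CommCare HQ custom metrics. As a generic complement, Celery's own inspect API can give a quick picture of worker availability and in-flight work. The sketch below is an illustration only; ``app`` stands in for your existing Celery application and the broker URL is a placeholder.

.. code-block:: python

   from celery import Celery

   app = Celery(broker="amqp://guest:guest@127.0.0.1:5672//")  # placeholder broker URL

   def celery_snapshot():
       inspector = app.control.inspect(timeout=5)
       return {
           "online_workers": sorted((inspector.ping() or {}).keys()),  # uptime check
           "active": inspector.active() or {},      # tasks currently executing
           "reserved": inspector.reserved() or {},  # tasks received but not yet started
       }
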
RabbitMQ
The Datadog Agent ships with an integration which can be used to collect metrics. See the RabbitMQ Integration for more information.
Metric |
Metric type |
Why care |
User impact |
How to measure |
Queue depth |
Saturation |
Queue depth is the total number of messages in a queue. Watched together with messages ready and messages unacknowledged, it shows whether a growing backlog is caused by consumers falling behind or by messages not being acknowledged. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
rabbitmq.queue.messages |
Messages ready |
Other |
Messages ready is the number of messages available for delivery to consumers. A steadily growing value suggests that consumers are not keeping up with producers. |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
rabbitmq.queue.messages_ready |
Messages unacknowledged |
Error |
Messages unacknowledged is the number of messages delivered to consumers but not yet acknowledged. A growing value can point to consumers that are stuck or failing before acknowledging their work. |
Certain background tasks will fail to execute, like sending emails, SMS’s, alerts, etc. |
rabbitmq.queue.messages_unacknowledged |
Queue memory |
Utilization |
RabbitMQ keeps messages in memory for faster access, but if queues handle a lot of messages you could consider using lazy queues in order to preserve memory. Read more |
For the most part this might go unnoticed for the user, but there will be a delay in the execution of background tasks, like sending emails, SMS’s, alerts, etc. |
rabbitmq.queue.memory |
Queue consumers |
Other |
The number of consumers is configurable, so a lower-than-expected number of consumers could indicate failures in your application. |
Certain background tasks might fail to execute, like sending emails, SMS’s, alerts, etc. |
rabbitmq.queue.consumers |
Node sockets |
Utilization |
As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the available number of file descriptors helps you keep your system running. |
Background tasks might take longer to execute or, in the worst case, might not execute at all. |
rabbitmq.node.sockets_used |
Node file descriptors |
Utilization |
As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the available number of file descriptors helps you keep your system running. |
Background tasks might take longer to execute or, in the worst case, might not execute at all. |
rabbitmq.node.fd_used |
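
The queue metrics above are also available from the RabbitMQ management HTTP API, which the Datadog integration relies on. The sketch below lists them per queue; the management plugin must be enabled (default port 15672), and the URL and credentials are placeholders.

.. code-block:: python

   import requests

   RABBIT_URL = "http://127.0.0.1:15672"  # assumed management API address
   AUTH = ("guest", "guest")              # placeholder credentials

   def queue_overview():
       queues = requests.get(f"{RABBIT_URL}/api/queues", auth=AUTH, timeout=5).json()
       return [
           {
               "name": q["name"],
               "depth": q.get("messages", 0),                   # queue depth
               "ready": q.get("messages_ready", 0),             # deliverable now
               "unacked": q.get("messages_unacknowledged", 0),  # delivered, not acked
               "memory_bytes": q.get("memory", 0),
               "consumers": q.get("consumers", 0),
           }
           for q in queues
       ]
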