Metrics

Metrics collection

This package exposes functions and utilities to record metrics in CommCare. These metrics are exported to the configured metrics providers. Supported providers are:

  • Datadog

  • Prometheus

Providers are enabled using the METRICS_PROVIDERS setting. Multiple providers can be enabled concurrently:

METRICS_PROVIDERS = [
    'corehq.util.metrics.prometheus.PrometheusMetrics',
    'corehq.util.metrics.datadog.DatadogMetrics',
]

If no metrics providers are configured, CommCare will log all metrics to the commcare.metrics logger at the DEBUG level.

Metric tagging

Metrics may be tagged by passing a dictionary of tag names and values. Tags should be used to add dimensions to a metric e.g. request type, response status.

Tags should not originate from unbounded sources or sources with high dimensionality such as timestamps, user IDs, request IDs etc. Ideally a tag should not have more than 10 possible values.
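One way to keep dimensionality bounded is to collapse a high-cardinality value into a small fixed set before using it as a tag. As a sketch (the helper below is hypothetical, not part of corehq), a raw HTTP status code can be reduced to its class:

```python
def status_class(status_code):
    """Collapse an HTTP status code into one of a handful of tag values."""
    if 100 <= status_code < 600:
        return '{}xx'.format(status_code // 100)
    return 'unknown'

# A bounded value like this is safe to use as a tag, e.g.:
# metrics_counter('commcare.request.count', tags={'status': status_class(503)})
```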

Metric Types

Counter metric

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge.

metrics_counter('commcare.case_import.count', 1, tags={'domain': domain})

Gauge metric

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of concurrent requests.

metrics_gauge('commcare.case_import.queue_length', queue_length)

For regular reporting of a gauge metric there is the metrics_gauge_task function:

corehq.util.metrics.metrics_gauge_task(name, fn, run_every, multiprocess_mode='all')[source]

Helper for easily registering gauges to run periodically

To update a gauge on a schedule based on the result of a function just add to your app’s tasks.py:

my_calculation = metrics_gauge_task(
    'commcare.my.metric', my_calculation_function, run_every=crontab(minute=0)
)
kwargs:

multiprocess_mode: See PrometheusMetrics._gauge for documentation.

Histogram metric

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.

metrics_histogram(
    'commcare.case_import.duration', timer_duration,
    bucket_tag='size', buckets=[10, 50, 200, 1000], bucket_unit='s',
    tags={'domain': domain}
)

Histograms are recorded differently in the different providers.

DatadogMetrics._histogram(name: str, value: float, bucket_tag: str, buckets: List[int], bucket_unit: str = '', tags: Optional[Dict[str, str]] = None, documentation: str = '')[source]

This implementation of histogram uses tagging to record the buckets. It does not use the Datadog Histogram metric type.

The metric itself will be incremented by 1 on each call. The value passed to metrics_histogram will be used to create the bucket tag.

For example:

h = metrics_histogram(
    'commcare.request.duration', 1.4,
    bucket_tag='duration', buckets=[1, 2, 3], bucket_unit='ms',
    tags=tags
)

# resulting metrics
# commcare.request.duration:1|c|#duration:lt_2ms
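The bucket tag seen in the output above can be derived with logic along these lines. This is a sketch for illustration only, not the actual DatadogMetrics implementation (the exact boundary comparison is an assumption):

```python
def make_bucket_tag(value, buckets, unit=''):
    """Return a bucket tag value like 'lt_2ms' or 'over_3ms' for a measurement."""
    for bound in buckets:
        if value < bound:
            return 'lt_{}{}'.format(bound, unit)
    return 'over_{}{}'.format(buckets[-1], unit)

# The value 1.4 falls below the boundary 2, giving the tag shown above:
# make_bucket_tag(1.4, [1, 2, 3], 'ms') -> 'lt_2ms'
```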

PrometheusMetrics._histogram(name: str, value: float, bucket_tag: str, buckets: List[int], bucket_unit: str = '', tags: Optional[Dict[str, str]] = None, documentation: str = '')[source]

A cumulative histogram with a base metric name of <name> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <name>_bucket{le="<upper inclusive bound>"}

  • the total sum of all observed values, exposed as <name>_sum

  • the count of events that have been observed, exposed as <name>_count (identical to <name>_bucket{le="+Inf"} above)

For example:

h = metrics_histogram(
    'commcare.request_duration', 1.4,
    bucket_tag='duration', buckets=[1, 2, 3], bucket_unit='ms',
    tags=tags
)

# resulting metrics
# commcare_request_duration_bucket{...tags..., le="1.0"} 0.0
# commcare_request_duration_bucket{...tags..., le="2.0"} 1.0
# commcare_request_duration_bucket{...tags..., le="3.0"} 1.0
# commcare_request_duration_bucket{...tags..., le="+Inf"} 1.0
# commcare_request_duration_sum{...tags...} 1.4
# commcare_request_duration_count{...tags...} 1.0

See https://prometheus.io/docs/concepts/metric_types/#histogram
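The cumulative-bucket behaviour can be illustrated with a small sketch (for illustration only; the real PrometheusMetrics implementation delegates to the prometheus_client library):

```python
def cumulative_buckets(observations, buckets):
    """Compute Prometheus-style cumulative histogram series from raw values."""
    series = {'le_{}'.format(b): sum(1 for v in observations if v <= b)
              for b in buckets}
    series['le_+Inf'] = len(observations)  # every observation counts toward +Inf
    series['sum'] = sum(observations)
    return series

# A single 1.4s observation against buckets [1, 2, 3] reproduces the series
# shown above: le_1 -> 0, le_2 -> 1, le_3 -> 1, le_+Inf -> 1, sum -> 1.4
```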

Utilities

corehq.util.metrics.create_metrics_event(title: str, text: str, alert_type: str = 'info', tags: Optional[Dict[str, str]] = None, aggregation_key: Optional[str] = None)[source]

Send an event record to the monitoring provider.

Currently only implemented by the Datadog provider.

Parameters
  • title – Title of the event

  • text – Event body

  • alert_type – Event type. One of ‘success’, ‘info’, ‘warning’, ‘error’

  • tags – Event tags

  • aggregation_key – Key to use to group multiple events

corehq.util.metrics.metrics_histogram_timer(metric: str, timing_buckets: Iterable[int], tags: Optional[Dict[str, str]] = None, bucket_tag: str = 'duration', callback: Optional[Callable] = None)[source]

Create a context manager that times and reports to the metric providers as a histogram

Example Usage:

timer = metrics_histogram_timer('commcare.some.special.metric', tags={
    'type': type,
}, timing_buckets=(.001, .01, .1, 1, 10, 100))
with timer:
    some_special_thing()

This will result in a call to metrics_histogram with the timer value.

Note: Histograms are implemented differently by each provider. See documentation for details.

Parameters
  • metric – Name of the metric (must start with ‘commcare.’)

  • tags – metric tags to include

  • timing_buckets – sequence of numbers representing time thresholds, in seconds

  • bucket_tag – The name of the bucket tag to use (if used by the underlying provider)

  • callback – a callable which will be called when exiting the context manager with a single argument: the timer duration

Returns

A context manager that will perform the specified timing and send the specified metric
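The timing pattern behind this helper can be sketched with a plain context manager (hypothetical, for illustration; the real helper reports the duration to the metric providers via metrics_histogram):

```python
import time
from contextlib import contextmanager


@contextmanager
def histogram_timer(record, callback=None):
    """Time the enclosed block, then hand the duration to `record`
    (a stand-in for the metrics_histogram call) and an optional callback."""
    start = time.monotonic()
    try:
        yield
    finally:
        duration = time.monotonic() - start
        record(duration)
        if callback is not None:
            callback(duration)  # mirrors the `callback` parameter above


# Usage: collect the measured duration instead of reporting it
durations = []
with histogram_timer(durations.append):
    pass  # some_special_thing()
```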

class corehq.util.metrics.metrics_track_errors(name)[source]

Record when something succeeds or errors in the configured metrics provider

Eg: This code will log to commcare.myfunction.succeeded when it completes successfully, and to commcare.myfunction.failed when an exception is raised.

@metrics_track_errors('myfunction')
def myfunction():
    pass

Other Notes

  • All metrics must use the prefix ‘commcare.’

CommCare Infrastructure Metrics

CommCare uses Datadog and Prometheus for monitoring various system, application and custom metrics. Datadog supports a variety of applications and is easily extendable.

Below are a few tables tabulating various metrics of the system and service infrastructure used to run CommCare. The list is not absolute nor exhaustive, but it will provide you with a basis for monitoring the following aspects of your system:

  • Performance

  • Throughput

  • Utilization

  • Availability

  • Errors

  • Saturation

Each table has the following format:

Metric

Metric type

Why care

User impact

How to measure

Name of metric

Category or aspect of system the metric speaks to

Brief description of why metric is important

Explains the impact on user if undesired reading is recorded

A note on how the metric might be obtained. Please note that it is assumed that Datadog will be used as a monitoring solution unless specified otherwise.

General Host

The Datadog Agent ships with an integration which can be used to collect metrics from your base system. See the System Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

CPU usage (%)

Utilization

Monitoring server CPU usage helps you understand how much your CPU is being used, as a very high load might result in overall performance degradation.

Lagging experience

system.cpu.idle
system.cpu.system
system.cpu.iowait
system.cpu.user

Load averages 1-5-15

Utilization

Load average (CPU demand) over 1 min, 5 min and 15 min, which includes the sum of running and waiting threads.

User might experience trouble connecting to the server

system.load.1
system.load.5
system.load.15

Memory

Utilization

It shows the amount of memory used over time. Running out of memory may result in killed processes or more swap memory used, which will slow down your system. Consider optimizing processes or increasing resources.

Slow performance

system.mem.usable
system.mem.total

Swap memory

Utilization

This metric shows the amount of swap memory used. Swap memory is slow, so if your system depends too much on swap, you should investigate why RAM usage is so high. Note that it is normal for systems to use a little swap memory even if RAM is available.

Server unresponsiveness

system.swap.free
system.swap.used

Disk usage

Utilization

Disk usage is important to prevent data loss in the event that the disk runs out of available space.

Data loss

system.disk.in_use

Disk latency

Throughput

The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. High disk latency will result in slow response times for things like reports, app installs and other services that read from disk.

Slow performance

system.io.await

Network traffic

Throughput

This indicates the amount of incoming and outgoing traffic on the network. This metric is a good gauge on the average network activity on the system. Low or consistently plateauing network throughput will result in poor performance experienced by end users, as sending and receiving data from them will be throttled.

Slow performance

system.net.bytes_rcvd
system.net.bytes_sent

Gunicorn

The Datadog Agent ships with an integration which can be used to collect metrics. See the Gunicorn Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Requests per second

Throughput

This metric shows the rate of requests received. This can be used to give an indication of how busy the application is. If you’re constantly getting a high request rate, keep an eye out for bottlenecks on your system.

Slow user experience or trouble accessing the site.

gunicorn.requests

Request duration

Throughput

Long request duration times can point to problems in your system / application.

Slow experience and timeouts

gunicorn.request.duration.*

Http status codes

Performance

A high rate of error codes can either mean your application has faulty code or some part of your application infrastructure is down.

User might get errors on pages

gunicorn.request.status.*

Busy vs idle Gunicorn workers

Utilization

This metric can be used to give an indication of how busy the gunicorn workers are over time. If most of the workers are busy most of the time, it might be necessary to start thinking of increasing the number of workers before users start having trouble accessing your site.

Slow user experience or trouble accessing the site.

gunicorn.workers

Nginx

The Datadog Agent ships with an integration which can be used to collect metrics. See the Nginx Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Total requests

Throughput

This metric indicates the number of client requests your server handles. High rates means bigger load on the system.

Slow experience

nginx.requests.total

Requests per second

Throughput

This metric shows the rate of requests received. This can be used to give an indication of how busy the application is. If you’re constantly getting a high request rate, keep an eye out for services that might need additional resources to perform optimally.

Slow user experience or trouble accessing the site.

nginx.net.request_per_s

Dropped connections

Errors

If NGINX starts to incrementally drop connections it usually indicates a resource constraint, such as NGINX’s worker_connections limit has been reached. An investigation might be in order.

Users will not be able to access the site.

nginx.connections.dropped

Server error rate

Error

Your server error rate is equal to the number of 5xx errors divided by the total number of status codes. If your error rate starts to climb over time, investigation may be in order. If it spikes suddenly, urgent action may be required, as clients are likely to report errors to the end user.

User might get errors on pages

nginx.server_zone.responses.5xx
nginx.server_zone.responses.total_count
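The error-rate arithmetic described above, in terms of the two nginx counters listed (a minimal sketch):

```python
def server_error_rate(responses_5xx, responses_total):
    """Fraction of responses that were 5xx errors, computed from
    nginx.server_zone.responses.5xx / nginx.server_zone.responses.total_count."""
    if responses_total == 0:
        return 0.0
    return responses_5xx / responses_total

# e.g. 12 errors out of 4800 responses gives a 0.25% error rate
```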

Request time

Performance

This is the time in seconds used to process the request. Long response times can point to problems in your system / application.

Slow experience and timeouts

Need to include in NGINX configuration file

PostgreSQL

PostgreSQL has a statistics collector subsystem that collects and reports on information about the server activity.

The Datadog Agent ships with an integration which can be used to collect metrics. See the PostgreSQL Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Sequential scans on table vs. Index scans on table

Other

This metric speaks directly to the speed of query execution. If the DB is making more sequential scans than indexed scans you can improve the DB’s performance by creating an index.

Tasks that require data to be fetched from the DB will take a long time to execute.

PostgreSQL:
pg_stat_user_tables
Datadog integration:
postgresql.seq_scans
postgresql.index_scans

Rows fetched vs. returned by queries to DB

Throughput

This metric shows how effectively the DB is scanning through its data. If many more rows are constantly fetched vs returned, it means there’s room for optimization.

Slow experience for tasks that access large parts of the database.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.rows_fetched
postgresql.rows_returned

Amount of data written temporarily to disk to execute queries

Saturation

If the DB’s temporary storage is constantly used up, you might need to increase it in order to optimize performance.

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.temp_bytes

Rows inserted, updated, deleted (by database)

Throughput

This metric gives an indication of what type of write queries your DB serves most. If a high rate of updated or deleted queries persist, you may want to keep an eye out for increasing dead rows as this will degrade DB performance.

No direct impact

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.rows_inserted
postgresql.rows_updated
postgresql.rows_deleted

Locks

Other

A high lock rate in the DB is an indication that queries could be long-running and that future queries might start to time out.

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_locks
Datadog integration:
postgresql.locks

Deadlocks

Other

The aim is to have no deadlocks, as it is resource intensive for the DB to check for them. Having many deadlocks calls for reevaluating execution logic.

Slow experience for tasks that read data from the database. Some tasks may even hang and the user will get errors on pages.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.deadlocks

Dead rows

Other

A constantly increasing number of dead rows show that the DB’s VACUUM process is not working properly. This will affect DB performance negatively.

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_stat_user_tables
Datadog integration:
postgresql.dead_rows

Replication delay

Other

A higher delay means data is less consistent across replication servers.

In the worst case, some data may appear missing.

PostgreSQL:
pg_xlog
Datadog integration:
postgresql.replication_delay

Number of checkpoints requested vs scheduled

Other

Having more requested checkpoints than scheduled checkpoints means decreased writing performance for the DB. Read more: https://www.cybertec-postgresql.com/en/postgresql-what-is-a-checkpoint/

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_stat_bgwriter
Datadog integration:
postgresql.bgwriter.checkpoints_timed
postgresql.bgwriter.checkpoints_requested

Active connections

Utilization

If the number of active connections consistently approaches the maximum number of connections, it can indicate that applications are issuing long-running queries and constantly creating new connections for other requests, instead of reusing existing connections. A connection pool helps ensure that connections are reused as they go idle, instead of placing load on the primary server to frequently open and close connections. Typically, opening a DB connection is an expensive operation.

Users might get errors on pages which need to access the database but cannot due to too many currently active connections.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.connections
postgresql.max_connections

Elasticsearch

The Datadog Agent ships with an integration which can be used to collect metrics. See the Elasticsearch Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Query load

Utilization

Monitoring the number of queries currently in progress can give you a rough idea of how many requests your cluster is dealing with at any particular moment in time.

A high load might slow down any tasks that involve searching users, groups, forms, cases, apps etc.

elasticsearch.primaries.search.query.current

Average query latency

Throughput

If this metric shows the query latency is increasing it means your queries are becoming slower, meaning either bottlenecks or inefficient queries.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.primaries.search.query.total
elasticsearch.primaries.search.query.time

Average fetch latency

Throughput

This should typically take less time than the query phase. If this metric is constantly increasing it could indicate problems with slow disks or requesting of too many results.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.primaries.search.fetch.total
elasticsearch.primaries.search.fetch.time

Average index latency

Throughput

If you notice increasing latency, you may be trying to index too many documents simultaneously. Increasing latency may slow down the user experience.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.indexing.index.total
elasticsearch.indexing.index.time

Average flush latency

Throughput

Data is only persisted on disk after a flush. If this metric increases with time it may indicate a problem with a slow disk. If this problem escalates it may prevent you from being able to add new information to your index.

Slow user experience when generating reports, filtering groups or users, etc. In the worst case there may be some data loss.

elasticsearch.primaries.flush.total
elasticsearch.primaries.flush.total.time

Percent of JVM heap currently in use

Utilization

Garbage collections should initiate around 75% of heap use. When this value is consistently going above 75% it indicates that the rate of garbage collection is not keeping up with the rate of garbage creation which might result in memory errors down the line.

Users might experience errors on some pages

jvm.mem.heap_in_use

Total time spent on garbage collection

Other

The garbage collection process halts the node, during which the node cannot complete tasks. If this halting duration exceeds the routine status check (around 30 seconds) the node might mistakenly be marked as offline.

Users can have a slow experience and in the worst case might even get errors on some pages.

jvm.gc.collectors.young.collection_time
jvm.gc.collectors.old.collection_time

Total HTTP connections opened over time

Other

If this number is constantly increasing it means that HTTP clients are not properly establishing persistent connections. Reestablishing adds additional overhead and might result in requests taking unnecessarily long to complete.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.http.total_opened

Cluster status

Other

The status will indicate when at least one replica shard is unallocated or missing. If more shards disappear you may lose data.

Missing data (not data loss, as Elasticsearch is a secondary database)

elasticsearch.cluster_health

Number of unassigned shards

Availability

When you first create an index, or when a node is rebooted, its shards will briefly be in an “initializing” state before transitioning to a status of “started” or “unassigned”, as the primary node attempts to assign shards to nodes in the cluster. If you see shards remain in an initializing or unassigned state too long, it could be a warning sign that your cluster is unstable.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.unassigned_shards

Thread pool queues

Saturation

Large queues are not ideal because they use up resources and also increase the risk of losing requests if a node goes down.

Slow user experience when generating reports, filtering groups or users, etc. In the worst case, queued requests may be lost if a node goes down.

elasticsearch.thread_pool.bulk.queue

Pending tasks

Saturation

The number of pending tasks is a good indication of how smoothly your cluster is operating. If your primary node is very busy and the number of pending tasks doesn’t subside, it can lead to an unstable cluster.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.pending_tasks_total

Unsuccessful GET requests

Error

An unsuccessful get request means that the document ID was not found. You shouldn’t usually have a problem with this type of request, but it may be a good idea to keep an eye out for unsuccessful GET requests when they happen.

User might get errors on some pages

elasticsearch.get.missing.total

CouchDB

The Datadog Agent ships with an integration which can be used to collect metrics. See the CouchDB Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Open databases

Availability

If the number of open databases is too low you might have database requests starting to pile up.

Slow user experience if the requests start to pile up high.

couchdb.couchdb.open_databases

File descriptors

Utilization

If this number reaches the max number of available file descriptors, no new connections can be opened until older ones have closed.

The user might get errors on some pages.

couchdb.couchdb.open_os_files

Data size

Utilization

This indicates the relative size of your data. Keep an eye on this as it grows to make sure your system has enough disk space to support it.

Data loss

couchdb.by_db.file_size

HTTP Request Rate

Throughput

Gives an indication of how many requests are being served.

Slow performance

couchdb.couchdb.httpd.requests

Request with status code of 2xx

Performance

Statuses in the 2xx range are generally indications of successful operation.

No negative impact

couchdb.couchdb.httpd_status_codes

Request with status code of 4xx and 5xx

Performance

Statuses in the 4xx and 5xx ranges generally tell you something is wrong, so you want this number as low as possible, preferably zero. However, if you constantly see requests yielding these statuses, it might be worth looking into the matter.

Users might get errors on some pages.

couchdb.couchdb.httpd_status_codes

Workload - Reads & Writes

Performance

These numbers will depend on the application, but having this metric gives an indication of how busy the database generally is. In the case of a high workload, consider ramping up the resources.

Slow performance

couchdb.couchdb.database_reads

Average request latency

Throughput

If the average request latency is rising, there is a bottleneck somewhere that needs to be addressed.

Slow performance

couchdb.couchdb.request_time.arithmetic_mean

Cache hits

Other

CouchDB stores a fair amount of user credentials in memory to speed up the authentication process. Monitoring usage of the authentication cache can alert you for possible attempts to gain unauthorized access.

A low number of hits might mean slower performance

couchdb.couchdb.auth_cache_hits

Cache misses

Error

If CouchDB reports a high number of cache misses, then either the cache is undersized to service the volume of legitimate user requests, or a brute force password/username attack is taking place.

Slow performance

couchdb.couchdb.auth_cache_misses

Kafka

The Datadog Agent ships with a Kafka Integration to collect various Kafka metrics. Also see Integrating Datadog, Kafka and Zookeeper.

Broker Metrics

Metric

Metric type

Why care

User impact

How to measure

UnderReplicatedPartitions

Availability

If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods.

Fewer in-sync replicas means the reports might take longer to show the latest values.

kafka.replication.under_replicated_partitions

IsrShrinksPerSec

Availability

The rate at which the in-sync replicas shrinks for a particular partition. This value should remain fairly static. You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter.

As the in-sync replicas become fewer, the reports might take longer to show the latest values.

kafka.replication.isr_shrinks.rate

IsrExpandsPerSec

Availability

The rate at which the in-sync replicas expands.

As the in-sync replicas become fewer, the reports might take longer to show the latest values.

kafka.replication.isr_expands.rate

TotalTimeMs

Performance

This metric reports the total time taken to service a request.

Longer servicing times mean data-updates take longer to propagate to the reports.

kafka.request.produce.time.avg
kafka.request.consumer.time.avg
kafka.request.fetch_follower.time.avg

ActiveControllerCount

Error

The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. You should alert on any other value that lasts for longer than one second. In the case that no controller is found, Kafka might become unstable and new data might not be updated.

Reports might not show new updated data, or even break.

kafka.replication.active_controller_count

Broker network throughput

Throughput

This metric indicates the broker throughput.

If the throughput becomes less, the user might find that reports take longer to reflect updated data.

kafka.net.bytes_in.rate
kafka.net.bytes_out.rate

Clean vs unclean leaders elections

Error

When a partition leader dies, an election for a new leader is triggered. New leaders should only come from replicas that are in-sync with the previous leader; however, a configuration setting can allow for unclean elections. An unclean leader is not completely in-sync with the previous leader, so electing one loses any data produced before the full sync happened. You should alert on any unclean leaders elected.

Data might be missing in reports. (the data will not be lost, as the data is already stored in PostgreSQL or CouchDB, but the reports will not reflect the latest changes)

kafka.replication.leader_elections.rate
kafka.replication.unclean_leader_elections.rate

Fetch/request purgatory

Other

Purgatory holds requests that cannot be satisfied yet: fetch requests wait there until new data is available, and produce requests wait until all required acknowledgements are received. A steadily growing purgatory size can indicate slow replicas or consumers falling behind.

Reports might take longer to reflect the latest data.

kafka.request.producer_request_purgatory.size
kafka.request.fetch_request_purgatory.size

Producer Metrics

Metric

Metric type

Why care

User impact

How to measure

Request rate

Throughput

The request rate is the rate at which producers send data to brokers. Keeping an eye on peaks and drops is essential to ensure continuous service availability.

Reports might take longer to reflect the latest data.

kafka.producer.request_rate

Response rate

Throughput

Average number of responses received per second from the brokers after the producers sent the data to the brokers.

Reports might take longer to reflect the latest data.

kafka.producer.response_rate

Request latency average

Throughput

Average request latency (in ms).

Reports might take longer to reflect the latest data.

kafka.producer.request_latency_avg

Outgoing byte rate

Throughput

Monitoring producer network traffic will help to inform decisions on infrastructure changes, as well as to provide a window into the production rate of producers and identify sources of excessive traffic.

High network throughput might cause reports to take a longer time to reflect the latest data, as Kafka is under heavier load.

kafka.net.bytes_out.rate

Batch size average

Throughput

To use network resources more efficiently, Kafka producers attempt to group messages into batches before sending them. The producer will wait to accumulate an amount of data defined by the batch size.

If the batch size average is too low, reports might take a longer time to reflect the latest data.

kafka.producer.batch_size_avg

Consumer Metrics

Metric

Metric type

Why care

User impact

How to measure

Records lag

Performance

Number of messages consumers are behind producers on this partition. The significance of these metrics’ values depends completely upon what your consumers are doing. If you have consumers that back up old messages to long-term storage, you would expect records lag to be significant. However, if your consumers are processing real-time data, consistently high lag values could be a sign of overloaded consumers, in which case both provisioning more consumers and splitting topics across more partitions could help increase throughput and reduce lag.

Reports might take longer to reflect the latest data.

kafka.consumer_lag

Records consumed rate

Throughput

Average number of records consumed per second for a specific topic or across all topics.

Reports might take longer to reflect the latest data.

kafka.consumer.records_consumed

Fetch rate

Throughput

Number of fetch requests per second from the consumer.

Reports might take longer to reflect the latest data.

kafka.request.fetch_rate

Zookeeper

The Datadog Agent ships with an integration which can be used to collect metrics. See the Zookeeper Integration for more information.

| Metric | Metric type | Why care | User impact | How to measure |
| --- | --- | --- | --- | --- |
| Outstanding requests | Saturation | The number of requests still to be processed. Tracking both outstanding requests and latency can give you a clearer picture of the causes behind degraded performance. | Reports might take longer to reflect the latest data. | zookeeper.outstanding_requests |
| Average latency | Throughput | The amount of time it takes to respond to a client request (in ms). | Reports might take longer to reflect the latest data. | zookeeper.latency.avg |
| Open file descriptors | Utilization | Linux has a limited number of file descriptors available, so it’s important to keep an eye on this metric to ensure ZooKeeper can continue to function as expected. | Reports might not reflect new data, as ZooKeeper will be getting errors. | zookeeper.open_file_descriptor_count |
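A common way to act on the file-descriptor metric is to alert on utilization, i.e. open descriptors as a fraction of the process limit, well before ZooKeeper hits the hard limit and starts erroring. A minimal sketch; the 0.85 threshold and the 1024 limit are illustrative values, not ZooKeeper defaults:

```python
def fd_utilization(open_fds, fd_limit):
    """Fraction of the file-descriptor limit currently in use."""
    return open_fds / fd_limit

def should_alert(open_fds, fd_limit, threshold=0.85):
    """Alert while there is still headroom to react, not at the hard limit."""
    return fd_utilization(open_fds, fd_limit) >= threshold

# e.g. zookeeper.open_file_descriptor_count = 900 against a 1024 ulimit
assert should_alert(900, 1024)       # ~88% used: alert
assert not should_alert(400, 1024)   # plenty of headroom
```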

Celery

The Datadog Agent ships with an HTTP Check integration that collects various network metrics. In addition, CommCareHQ reports many custom metrics for Celery; Datadog’s Custom Metrics page is worth a look. Celery Flower is also used to monitor tasks and workers.

| Metric | Metric type | Why care | User impact | How to measure |
| --- | --- | --- | --- | --- |
| Celery uptime | Availability | The uptime rating is a measure of service availability. | Background tasks will not execute (sending of emails, periodic reporting to external partners, report downloads, etc.) | network.http.can_connect |
| Celery uptime by queue | Availability | The uptime rating per queue. | Certain background or asynchronous tasks will not get executed. The user might not notice this immediately. | CommCareHQ custom metric |
| Time to start | Other | The time (in seconds) it takes a task in a specific queue to start executing. If a certain task consistently takes a long time to start, it might be worth looking into. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Blockage duration by queue | Throughput | The estimated time (in seconds) a certain queue was blocked. It might be worth alerting if a blockage lasts longer than a specified time. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Task execution rate | Throughput | A rough estimate of the number of tasks executed within a certain time bracket. This is an important metric, as it shows when tasks take progressively longer to execute, in which case an investigation might be appropriate. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Celery tasks by host | Throughput | The running time (in seconds) of Celery tasks by host. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Celery tasks by queue | Throughput | The running time (in seconds) of Celery tasks by queue, useful for identifying slower queues. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Celery tasks by task | Throughput | The running time (in seconds) of Celery tasks by individual task, useful for identifying slower tasks. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Tasks queued by queue | Saturation | The number of tasks queued in each respective queue. If this grows steadily, keep an eye out for blockages. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | Celery Flower |
| Tasks failing by worker | Error | Tasks that failed to execute. Increasing numbers indicate problems with the respective worker(s). | If certain background or asynchronous tasks fail, certain features become unusable, for example sending emails, SMSs, periodic reporting, etc. | Celery Flower |
| Tasks by state | Other | The number of tasks by their Celery state. If the number of failed tasks increases, for instance, it might be worth looking into. | If certain background or asynchronous tasks fail, certain features become unusable, for example sending emails, SMSs, periodic reporting, etc. | Celery Flower |
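The "time to start" idea can be sketched as follows: record the gap between when a task was published and when a worker began executing it, tagged by queue (a bounded tag, in line with the tagging guidance above). The `metrics_histogram` recorder here is a hypothetical in-memory stand-in mirroring the shape of the `metrics_*` API from this package, not CommCareHQ's actual implementation.

```python
recorded = []  # stand-in for a real metrics backend

def metrics_histogram(name, value, tags=None):
    """Hypothetical recorder mirroring the metrics_* API shape."""
    recorded.append((name, value, tags or {}))

def record_time_to_start(enqueued_at, started_at, queue):
    """Emit how long a task waited in its queue before executing.
    The queue name is a bounded tag; never tag by task ID."""
    metrics_histogram(
        'commcare.celery.task.time_to_start',   # illustrative metric name
        started_at - enqueued_at,
        tags={'queue': queue},
    )

enqueued = 1000.0   # timestamp attached when the task was published
record_time_to_start(enqueued, 1042.5, queue='background_queue')
name, wait, tags = recorded[0]
# wait == 42.5: seconds the task spent queued before a worker picked it up
```

A consistently high value for one queue with normal values elsewhere usually points at that queue being blocked or under-provisioned rather than at the workers as a whole.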

RabbitMQ

The Datadog Agent ships with an integration which can be used to collect metrics. See the RabbitMQ Integration for more information.

| Metric | Metric type | Why care | User impact | How to measure |
| --- | --- | --- | --- | --- |
| Queue depth | Saturation | Queue depth, messages ready, and messages unacknowledged are best monitored together: they show whether consumers are keeping up with producers. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.messages |
| Messages ready | Other | Queue depth, messages ready, and messages unacknowledged are best monitored together: they show whether consumers are keeping up with producers. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.messages_ready |
| Messages unacknowledged | Error | Queue depth, messages ready, and messages unacknowledged are best monitored together: they show whether consumers are keeping up with producers. | Certain background tasks will fail to execute, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.messages_unacknowledged |
| Queue memory | Utilization | RabbitMQ keeps messages in memory for faster access, but if queues handle a lot of messages you could consider using lazy queues in order to preserve memory. Read more | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.memory |
| Queue consumers | Other | The number of consumers is configurable, so a lower-than-expected number of consumers could indicate failures in your application. | Certain background tasks might fail to execute, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.consumers |
| Node sockets | Utilization | As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the available number of file descriptors helps you keep your system running. | Background tasks might take longer to execute, or in the worst case might not execute at all. | rabbitmq.node.sockets_used |
| Node file descriptors | Utilization | As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the available number of file descriptors helps you keep your system running. | Background tasks might take longer to execute, or in the worst case might not execute at all. | rabbitmq.node.fd_used |
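In RabbitMQ, queue depth (`rabbitmq.queue.messages`) is the sum of messages ready and messages unacknowledged, which is why the three counters are read together. A minimal sketch of that interpretation; the threshold and the two diagnosis labels are illustrative heuristics, not RabbitMQ terminology:

```python
def classify_backlog(ready, unacked, depth_threshold=1000):
    """Interpret the three queue counters together.
    Queue depth = messages ready + messages unacknowledged."""
    depth = ready + unacked
    if depth < depth_threshold:
        return 'healthy'
    if unacked >= ready:
        # consumers received messages but have not acked them:
        # slow or stuck consumers
        return 'slow consumers'
    # backlog is mostly undelivered messages: not enough consumer capacity
    return 'too few consumers'

assert classify_backlog(50, 10) == 'healthy'
assert classify_backlog(5000, 20) == 'too few consumers'
assert classify_backlog(200, 900) == 'slow consumers'
```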