Metrics

Metrics collection

This package exposes functions and utilities to record metrics in CommCare. These metrics are exported to the configured metrics providers. Supported providers are:

  • Datadog

  • Prometheus

Providers are enabled using the METRICS_PROVIDERS setting. Multiple providers can be enabled concurrently:

METRICS_PROVIDERS = [
    'corehq.util.metrics.prometheus.PrometheusMetrics',
    'corehq.util.metrics.datadog.DatadogMetrics',
]

If no metrics providers are configured, CommCare will log all metrics to the commcare.metrics logger at the DEBUG level.

Metric tagging

Metrics may be tagged by passing a dictionary of tag names and values. Tags should be used to add dimensions to a metric e.g. request type, response status.

Tags should not originate from unbounded sources or sources with high dimensionality such as timestamps, user IDs, request IDs etc. Ideally a tag should not have more than 10 possible values.
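One way to keep dimensionality bounded is to collapse a high-cardinality value into a small fixed set before using it as a tag. As a sketch (the helper below is hypothetical, not part of corehq), a raw HTTP status code can be reduced to its class:

```python
def status_class(status_code):
    """Collapse an HTTP status code into one of a handful of tag values."""
    if 100 <= status_code < 600:
        return '{}xx'.format(status_code // 100)
    return 'unknown'

# A bounded value like this is safe to use as a tag, e.g.:
# metrics_counter('commcare.request.count', tags={'status': status_class(503)})
```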

Metric Types

Counter metric

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge.

metrics_counter('commcare.case_import.count', 1, tags={'domain': domain})

Gauge metric

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of concurrent requests.

metrics_gauge('commcare.case_import.queue_length', queue_length)

For regular reporting of a gauge metric there is the metrics_gauge_task function:

corehq.util.metrics.metrics_gauge_task(name, fn, run_every, multiprocess_mode='all')[source]

Helper for easily registering gauges to run periodically

To update a gauge on a schedule based on the result of a function just add to your app’s tasks.py:

my_calculation = metrics_gauge_task(
    'commcare.my.metric', my_calculation_function, run_every=crontab(minute=0)
)
kwargs:

multiprocess_mode: See PrometheusMetrics._gauge for documentation.

Histogram metric

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.

metrics_histogram(
    'commcare.case_import.duration', timer_duration,
    bucket_tag='size', buckets=[10, 50, 200, 1000], bucket_unit='s',
    tags={'domain': domain}
)

Histograms are recorded differently in the different providers.

DatadogMetrics._histogram(name: str, value: float, bucket_tag: str, buckets: List[int], bucket_unit: str = '', tags: Optional[Dict[str, str]] = None, documentation: str = '')[source]

This implementation of histogram uses tagging to record the buckets. It does not use the Datadog Histogram metric type.

The metric itself will be incremented by 1 on each call. The value passed to metrics_histogram will be used to create the bucket tag.

For example:

h = metrics_histogram(
    'commcare.request.duration', 1.4,
    bucket_tag='duration', buckets=[1, 2, 3], bucket_unit='ms',
    tags=tags
)

# resulting metrics
# commcare.request.duration:1|c|#duration:lt_2ms
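The bucket tag seen in the output above can be derived with logic along these lines. This is a sketch for illustration only, not the actual DatadogMetrics implementation (the exact boundary comparison is an assumption):

```python
def make_bucket_tag(value, buckets, unit=''):
    """Return a bucket tag value like 'lt_2ms' or 'over_3ms' for a measurement."""
    for bound in buckets:
        if value < bound:
            return 'lt_{}{}'.format(bound, unit)
    return 'over_{}{}'.format(buckets[-1], unit)

# The value 1.4 falls below the boundary 2, giving the tag shown above:
# make_bucket_tag(1.4, [1, 2, 3], 'ms') -> 'lt_2ms'
```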

PrometheusMetrics._histogram(name: str, value: float, bucket_tag: str, buckets: List[int], bucket_unit: str = '', tags: Optional[Dict[str, str]] = None, documentation: str = '')[source]

A cumulative histogram with a base metric name of <name> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <name>_bucket{le="<upper inclusive bound>"}

  • the total sum of all observed values, exposed as <name>_sum

  • the count of events that have been observed, exposed as <name>_count (identical to <name>_bucket{le="+Inf"} above)

For example:

h = metrics_histogram(
    'commcare.request_duration', 1.4,
    bucket_tag='duration', buckets=[1, 2, 3], bucket_unit='ms',
    tags=tags
)

# resulting metrics
# commcare_request_duration_bucket{...tags..., le="1.0"} 0.0
# commcare_request_duration_bucket{...tags..., le="2.0"} 1.0
# commcare_request_duration_bucket{...tags..., le="3.0"} 1.0
# commcare_request_duration_bucket{...tags..., le="+Inf"} 1.0
# commcare_request_duration_sum{...tags...} 1.4
# commcare_request_duration_count{...tags...} 1.0

See https://prometheus.io/docs/concepts/metric_types/#histogram
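The cumulative-bucket behaviour can be illustrated with a small sketch (for illustration only; the real PrometheusMetrics implementation delegates to the prometheus_client library):

```python
def cumulative_buckets(observations, buckets):
    """Compute Prometheus-style cumulative histogram series from raw values."""
    series = {'le_{}'.format(b): sum(1 for v in observations if v <= b)
              for b in buckets}
    series['le_+Inf'] = len(observations)  # every observation counts toward +Inf
    series['sum'] = sum(observations)
    return series

# A single 1.4s observation against buckets [1, 2, 3] reproduces the series
# shown above: le_1 -> 0, le_2 -> 1, le_3 -> 1, le_+Inf -> 1, sum -> 1.4
```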

Utilities

corehq.util.metrics.create_metrics_event(title: str, text: str, alert_type: str = 'info', tags: Optional[Dict[str, str]] = None, aggregation_key: Optional[str] = None)[source]

Send an event record to the monitoring provider.

Currently only implemented by the Datadog provider.

Parameters
  • title – Title of the event

  • text – Event body

  • alert_type – Event type. One of ‘success’, ‘info’, ‘warning’, ‘error’

  • tags – Event tags

  • aggregation_key – Key to use to group multiple events

corehq.util.metrics.metrics_histogram_timer(metric: str, timing_buckets: Iterable[int], tags: Optional[Dict[str, str]] = None, bucket_tag: str = 'duration', callback: Optional[Callable] = None)[source]

Create a context manager that times and reports to the metric providers as a histogram

Example Usage:

timer = metrics_histogram_timer('commcare.some.special.metric', tags={
    'type': type,
}, timing_buckets=(.001, .01, .1, 1, 10, 100))
with timer:
    some_special_thing()

This will result in a call to metrics_histogram with the timer value.

Note: Histograms are implemented differently by each provider. See documentation for details.

Parameters
  • metric – Name of the metric (must start with ‘commcare.’)

  • tags – metric tags to include

  • timing_buckets – sequence of numbers representing time thresholds, in seconds

  • bucket_tag – The name of the bucket tag to use (if used by the underlying provider)

  • callback – a callable which will be called when exiting the context manager with a single argument: the timer duration

Returns

A context manager that will perform the specified timing and send the specified metric
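The timing pattern behind this helper can be sketched with a plain context manager (hypothetical, for illustration; the real helper reports the duration to the metric providers via metrics_histogram):

```python
import time
from contextlib import contextmanager


@contextmanager
def histogram_timer(record, callback=None):
    """Time the enclosed block, then hand the duration to `record`
    (a stand-in for the metrics_histogram call) and an optional callback."""
    start = time.monotonic()
    try:
        yield
    finally:
        duration = time.monotonic() - start
        record(duration)
        if callback is not None:
            callback(duration)  # mirrors the `callback` parameter above


# Usage: collect the measured duration instead of reporting it
durations = []
with histogram_timer(durations.append):
    pass  # some_special_thing()
```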

class corehq.util.metrics.metrics_track_errors(name)[source]

Record when something succeeds or errors in the configured metrics provider

Eg: This code will log to commcare.myfunction.succeeded when it completes successfully, and to commcare.myfunction.failed when an exception is raised.

@metrics_track_errors('myfunction')
def myfunction():
    pass

Other Notes

  • All metrics must use the prefix ‘commcare.’

CommCare Infrastructure Metrics

CommCare uses Datadog and Prometheus for monitoring various system, application and custom metrics. Datadog supports a variety of applications and is easily extendable.

Below are a few tables tabulating various metrics of the system and service infrastructure used to run CommCare. The list is not absolute nor exhaustive, but it will provide you with a basis for monitoring the following aspects of your system:

  • Performance

  • Throughput

  • Utilization

  • Availability

  • Errors

  • Saturation

Each table has the following format:

Metric

Metric type

Why care

User impact

How to measure

Name of metric

Category or aspect of system the metric speaks to

Brief description of why metric is important

Explains the impact on user if undesired reading is recorded

A note on how the metric might be obtained. Please note that it is assumed that Datadog will be used as a monitoring solution unless specified otherwise.

General Host

The Datadog Agent ships with an integration which can be used to collect metrics from your base system. See the System Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

CPU usage (%)

Utilization

Monitoring server CPU usage helps you understand how much your CPU is being used, as a very high load might result in overall performance degradation.

Lagging experience

system.cpu.idle
system.cpu.system
system.cpu.iowait
system.cpu.user

Load averages 1-5-15

Utilization

Load average (CPU demand) over 1 min, 5 min and 15 min, which includes the sum of running and waiting threads.

User might experience trouble connecting to the server

system.load.1
system.load.5
system.load.15

Memory

Utilization

It shows the amount of memory used over time. Running out of memory may result in killed processes or more swap memory used, which will slow down your system. Consider optimizing processes or increasing resources.

Slow performance

system.mem.usable
system.mem.total

Swap memory

Utilization

This metric shows the amount of swap memory used. Swap memory is slow, so if your system depends too much on swap, you should investigate why RAM usage is so high. Note that it is normal for systems to use a little swap memory even if RAM is available.

Server unresponsiveness

system.swap.free
system.swap.used

Disk usage

Utilization

Disk usage is important to prevent data loss in the event that the disk runs out of available space.

Data loss

system.disk.in_use

Disk latency

Throughput

The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. High disk latency will result in slow response times for things like reports, app installs and other services that read from disk.

Slow performance

system.io.await

Network traffic

Throughput

This indicates the amount of incoming and outgoing traffic on the network. This metric is a good gauge on the average network activity on the system. Low or consistently plateauing network throughput will result in poor performance experienced by end users, as sending and receiving data from them will be throttled.

Slow performance

system.net.bytes_rcvd
system.net.bytes_sent

Gunicorn

The Datadog Agent ships with an integration which can be used to collect metrics. See the Gunicorn Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Requests per second

Throughput

This metric shows the rate of requests received. This can be used to give an indication of how busy the application is. If you’re constantly getting a high request rate, keep an eye out for bottlenecks on your system.

Slow user experience or trouble accessing the site.

gunicorn.requests

Request duration

Throughput

Long request duration times can point to problems in your system / application.

Slow experience and timeouts

gunicorn.request.duration.*

Http status codes

Performance

A high rate of error codes can either mean your application has faulty code or some part of your application infrastructure is down.

User might get errors on pages

gunicorn.request.status.*

Busy vs idle Gunicorn workers

Utilization

This metric can be used to give an indication of how busy the gunicorn workers are over time. If most of the workers are busy most of the time, it might be necessary to start thinking of increasing the number of workers before users start having trouble accessing your site.

Slow user experience or trouble accessing the site.

gunicorn.workers

Nginx

The Datadog Agent ships with an integration which can be used to collect metrics. See the Nginx Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Total requests

Throughput

This metric indicates the number of client requests your server handles. High rates means bigger load on the system.

Slow experience

nginx.requests.total

Requests per second

Throughput

This metric shows the rate of requests received. This can be used to give an indication of how busy the application is. If you’re constantly getting a high request rate, keep an eye out for services that might need additional resources to perform optimally.

Slow user experience or trouble accessing the site.

nginx.net.request_per_s

Dropped connections

Errors

If NGINX starts to incrementally drop connections it usually indicates a resource constraint, such as NGINX’s worker_connections limit has been reached. An investigation might be in order.

Users will not be able to access the site.

nginx.connections.dropped

Server error rate

Error

Your server error rate is equal to the number of 5xx errors divided by the total number of status codes. If your error rate starts to climb over time, investigation may be in order. If it spikes suddenly, urgent action may be required, as clients are likely to report errors to the end user.

User might get errors on pages

nginx.server_zone.responses.5xx
nginx.server_zone.responses.total_count
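The error-rate arithmetic described above, in terms of the two nginx counters listed (a minimal sketch):

```python
def server_error_rate(responses_5xx, responses_total):
    """Fraction of responses that were 5xx errors, computed from
    nginx.server_zone.responses.5xx / nginx.server_zone.responses.total_count."""
    if responses_total == 0:
        return 0.0
    return responses_5xx / responses_total

# e.g. 12 errors out of 4800 responses gives a 0.25% error rate
```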

Request time

Performance

This is the time in seconds used to process the request. Long response times can point to problems in your system / application.

Slow experience and timeouts

Need to include in NGINX configuration file

PostgreSQL

PostgreSQL has a statistics collector subsystem that collects and reports on information about the server activity.

The Datadog Agent ships with an integration which can be used to collect metrics. See the PostgreSQL Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Sequential scans on table vs. Index scans on table

Other

This metric speaks directly to the speed of query execution. If the DB is making more sequential scans than indexed scans you can improve the DB’s performance by creating an index.

Tasks that require data to be fetched from the DB will take a long time to execute.

PostgreSQL:
pg_stat_user_tables
Datadog integration:
postgresql.seq_scans
postgresql.index_scans

Rows fetched vs. returned by queries to DB

Throughput

This metric shows how effectively the DB is scanning through its data. If many more rows are constantly fetched vs returned, it means there’s room for optimization.

Slow experience for tasks that access large parts of the database.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.rows_fetched
postgresql.rows_returned

Amount of data written temporarily to disk to execute queries

Saturation

If the DB’s temporary storage is constantly used up, you might need to increase it in order to optimize performance.

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.temp_bytes

Rows inserted, updated, deleted (by database)

Throughput

This metric gives an indication of what type of write queries your DB serves most. If a high rate of updated or deleted queries persist, you may want to keep an eye out for increasing dead rows as this will degrade DB performance.

No direct impact

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.rows_inserted
postgresql.rows_updated
postgresql.rows_deleted

Locks

Other

A high lock rate in the DB is an indication that queries could be long-running and that future queries might start to time out.

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_locks
Datadog integration:
postgresql.locks

Deadlocks

Other

The aim is to have no deadlocks, as it is resource intensive for the DB to check for them. Having many deadlocks calls for reevaluating execution logic.

Slow experience for tasks that read data from the database. Some tasks may even hang and the user will get errors on pages.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.deadlocks

Dead rows

Other

A constantly increasing number of dead rows show that the DB’s VACUUM process is not working properly. This will affect DB performance negatively.

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_stat_user_tables
Datadog integration:
postgresql.dead_rows

Replication delay

Other

A higher delay means data is less consistent across replication servers.

In the worst case, some data may appear missing.

PostgreSQL:
pg_xlog
Datadog integration:
postgresql.replication_delay

Number of checkpoints requested vs scheduled

Other

Having more requested checkpoints than scheduled checkpoints means decreased writing performance for the DB. Read more: https://www.cybertec-postgresql.com/en/postgresql-what-is-a-checkpoint/

Slow experience for tasks that read data from the database.

PostgreSQL:
pg_stat_bgwriter
Datadog integration:
postgresql.bgwriter.checkpoints_timed
postgresql.bgwriter.checkpoints_requested

Active connections

Utilization

If the number of active connections consistently approaches the maximum number of connections, it can indicate that applications are issuing long-running queries and constantly creating new connections for other requests, instead of reusing existing connections. A connection pool helps ensure that connections are reused as they go idle, instead of placing load on the primary server to frequently open and close connections. Typically, opening a DB connection is an expensive operation.

Users might get errors on pages which need to access the database but cannot due to too many currently active connections.

PostgreSQL:
pg_stat_database
Datadog integration:
postgresql.connections
postgresql.max_connections

Elasticsearch

The Datadog Agent ships with an integration which can be used to collect metrics. See the Elasticsearch Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Query load

Utilization

Monitoring the number of queries currently in progress can give you a rough idea of how many requests your cluster is dealing with at any particular moment in time.

A high load might slow down any tasks that involve searching users, groups, forms, cases, apps etc.

elasticsearch.primaries.search.query.current

Average query latency

Throughput

If this metric shows the query latency is increasing it means your queries are becoming slower, meaning either bottlenecks or inefficient queries.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.primaries.search.query.total
elasticsearch.primaries.search.query.time

Average fetch latency

Throughput

This should typically take less time than the query phase. If this metric is constantly increasing it could indicate problems with slow disks or requesting of too many results.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.primaries.search.fetch.total
elasticsearch.primaries.search.fetch.time

Average index latency

Throughput

If you notice increasing latency, you may be trying to index too many documents simultaneously. Increasing latency may slow down the user experience.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.indexing.index.total
elasticsearch.indexing.index.time

Average flush latency

Throughput

Data is only persisted on disk after a flush. If this metric increases with time it may indicate a problem with a slow disk. If this problem escalates it may prevent you from being able to add new information to your index.

Slow user experience when generating reports, filtering groups or users, etc. In the worst case there may be some data loss.

elasticsearch.primaries.flush.total
elasticsearch.primaries.flush.total.time

Percent of JVM heap currently in use

Utilization

Garbage collections should initiate around 75% of heap use. When this value is consistently going above 75% it indicates that the rate of garbage collection is not keeping up with the rate of garbage creation which might result in memory errors down the line.

Users might experience errors on some pages

jvm.mem.heap_in_use

Total time spent on garbage collection

Other

The garbage collection process halts the node, during which the node cannot complete tasks. If this halting duration exceeds the routine status check (around 30 seconds) the node might mistakenly be marked as offline.

Users can have a slow experience and in the worst case might even get errors on some pages.

jvm.gc.collectors.young.collection_time
jvm.gc.collectors.old.collection_time

Total HTTP connections opened over time

Other

If this number is constantly increasing it means that HTTP clients are not properly establishing persistent connections. Reestablishing adds additional overhead and might result in requests taking unnecessarily long to complete.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.http.total_opened

Cluster status

Other

The status will indicate when at least one replica shard is unallocated or missing. If more shards disappear you may lose data.

Missing data (not data loss, as Elasticsearch is a secondary database)

elasticsearch.cluster_health

Number of unassigned shards

Availability

When you first create an index, or when a node is rebooted, its shards will briefly be in an “initializing” state before transitioning to a status of “started” or “unassigned”, as the primary node attempts to assign shards to nodes in the cluster. If you see shards remain in an initializing or unassigned state too long, it could be a warning sign that your cluster is unstable.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.unassigned_shards

Thread pool queues

Saturation

Large queues are not ideal because they use up resources and also increase the risk of losing requests if a node goes down.

Slow user experience when generating reports, filtering groups or users, etc. In the worst case, queued requests may be lost if a node goes down.

elasticsearch.thread_pool.bulk.queue

Pending tasks

Saturation

The number of pending tasks is a good indication of how smoothly your cluster is operating. If your primary node is very busy and the number of pending tasks doesn’t subside, it can lead to an unstable cluster.

Slow user experience when generating reports, filtering groups or users, etc.

elasticsearch.pending_tasks_total

Unsuccessful GET requests

Error

An unsuccessful get request means that the document ID was not found. You shouldn’t usually have a problem with this type of request, but it may be a good idea to keep an eye out for unsuccessful GET requests when they happen.

User might get errors on some pages

elasticsearch.get.missing.total

CouchDB

The Datadog Agent ships with an integration which can be used to collect metrics. See the CouchDB Integration for more information.

Metric

Metric type

Why care

User impact

How to measure

Open databases

Availability

If the number of open databases is too low you might have database requests starting to pile up.

Slow user experience if the requests start to pile up high.

couchdb.couchdb.open_databases

File descriptors

Utilization

If this number reaches the max number of available file descriptors, no new connections can be opened until older ones have closed.

The user might get errors on some pages.

couchdb.couchdb.open_os_files

Data size

Utilization

This indicates the relative size of your data. Keep an eye on this as it grows to make sure your system has enough disk space to support it.

Data loss

couchdb.by_db.file_size

HTTP Request Rate

Throughput

Gives an indication of how many requests are being served.

Slow performance

couchdb.couchdb.httpd.requests

Request with status code of 2xx

Performance

Statuses in the 2xx range are generally indications of successful operation.

No negative impact

couchdb.couchdb.httpd_status_codes

Request with status code of 4xx and 5xx

Performance

Statuses in the 4xx and 5xx ranges generally tell you something is wrong, so you want this number as low as possible, preferably zero. However, if you constantly see requests yielding these statuses, it might be worth looking into the matter.

Users might get errors on some pages.

couchdb.couchdb.httpd_status_codes

Workload - Reads & Writes

Performance

These numbers will depend on the application, but having this metric gives an indication of how busy the database generally is. In the case of a high workload, consider ramping up the resources.

Slow performance

couchdb.couchdb.database_reads

Average request latency

Throughput

If the average request latency is rising, there is a bottleneck somewhere that needs to be addressed.

Slow performance

couchdb.couchdb.request_time.arithmetic_mean

Cache hits

Other

CouchDB stores a fair amount of user credentials in memory to speed up the authentication process. Monitoring usage of the authentication cache can alert you for possible attempts to gain unauthorized access.

A low number of hits might mean slower performance

couchdb.couchdb.auth_cache_hits

Cache misses

Error

If CouchDB reports a high number of cache misses, then either the cache is undersized to service the volume of legitimate user requests, or a brute force password/username attack is taking place.

Slow performance

couchdb.couchdb.auth_cache_misses

Kafka

The Datadog Agent ships with a Kafka Integration to collect various Kafka metrics. Also see Integrating Datadog, Kafka and Zookeeper.

Broker Metrics

Metric

Metric type

Why care

User impact

How to measure

UnderReplicatedPartitions

Availability

If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods.

Fewer in-sync replicas means the reports might take longer to show the latest values.

kafka.replication.under_replicated_partitions

IsrShrinksPerSec

Availability

The rate at which the in-sync replicas shrinks for a particular partition. This value should remain fairly static. You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter.

As the in-sync replicas become fewer, the reports might take longer to show the latest values.

kafka.replication.isr_shrinks.rate

IsrExpandsPerSec

Availability

The rate at which the in-sync replicas expands.

As the in-sync replicas become fewer, the reports might take longer to show the latest values.

kafka.replication.isr_expands.rate

TotalTimeMs

Performance

This metric reports the total time taken to service a request.

Longer servicing times mean data-updates take longer to propagate to the reports.

kafka.request.produce.time.avg
kafka.request.consumer.time.avg
kafka.request.fetch_follower.time.avg

ActiveControllerCount

Error

The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. You should alert on any other value that lasts for longer than one second. In the case that no controller is found, Kafka might become unstable and new data might not be updated.

Reports might not show new updated data, or even break.

kafka.replication.active_controller_count

Broker network throughput

Throughput

This metric indicates the broker throughput.

If the throughput becomes less, the user might find that reports take longer to reflect updated data.

kafka.net.bytes_in.rate
kafka.net.bytes_out.rate

Clean vs unclean leaders elections

Error

When a partition leader dies, an election for a new leader is triggered. New leaders should only come from replicas that are in-sync with the previous leader; however, a configuration setting can allow for unclean elections. An unclean leader is not completely in-sync with the previous leader, so electing one loses any data produced before the full sync happened. You should alert on any unclean leaders elected.

Data might be missing in reports. (the data will not be lost, as the data is already stored in PostgreSQL or CouchDB, but the reports will not reflect the latest changes)

kafka.replication.leader_elections.rate
kafka.replication.unclean_leader_elections.rate

Fetch/request purgatory

Other

Purgatory holds requests that cannot be satisfied yet: fetch requests wait there until new data is available, and produce requests wait until all required acknowledgements are received. A steadily growing purgatory size can indicate slow replicas or consumers falling behind.

Reports might take longer to reflect the latest data.

kafka.request.producer_request_purgatory.size
kafka.request.fetch_request_purgatory.size

Producer Metrics

Metric

Metric type

Why care

User impact

How to measure

Request rate

Throughput

The request rate is the rate at which producers send data to brokers. Keeping an eye on peaks and drops is essential to ensure continuous service availability.

Reports might take longer to reflect the latest data.

kafka.producer.request_rate

Response rate

Throughput

Average number of responses received per second from the brokers after the producers sent the data to the brokers.

Reports might take longer to reflect the latest data.

kafka.producer.response_rate

Request latency average

Throughput

Average request latency (in ms).

Reports might take longer to reflect the latest data.

kafka.producer.request_latency_avg

Outgoing byte rate

Throughput

Monitoring producer network traffic will help to inform decisions on infrastructure changes, as well as to provide a window into the production rate of producers and identify sources of excessive traffic.

High network throughput might cause reports to take a longer time to reflect the latest data, as Kafka is under heavier load.

kafka.net.bytes_out.rate

Batch size average

Throughput

To use network resources more efficiently, Kafka producers attempt to group messages into batches before sending them. The producer will wait to accumulate an amount of data defined by the batch size.

If the batch size average is too low, reports might take a longer time to reflect the latest data.

kafka.producer.batch_size_avg

Consumer Metrics

Metric

Metric type

Why care

User impact

How to measure

Records lag

Performance

Number of messages consumers are behind producers on this partition. The significance of these metrics’ values depends completely upon what your consumers are doing. If you have consumers that back up old messages to long-term storage, you would expect records lag to be significant. However, if your consumers are processing real-time data, consistently high lag values could be a sign of overloaded consumers, in which case both provisioning more consumers and splitting topics across more partitions could help increase throughput and reduce lag.

Reports might take longer to reflect the latest data.

kafka.consumer_lag

Records consumed rate

Throughput

Average number of records consumed per second for a specific topic or across all topics.

Reports might take longer to reflect the latest data.

kafka.consumer.records_consumed

Fetch rate

Throughput

Number of fetch requests per second from the consumer.

Reports might take longer to reflect the latest data.

kafka.request.fetch_rate

Zookeeper

The Datadog Agent ships with an integration which can be used to collect metrics. See the Zookeeper Integration for more information.

| Metric | Metric type | Why care | User impact | How to measure |
| --- | --- | --- | --- | --- |
| Outstanding requests | Saturation | The number of requests still to be processed. Tracking both outstanding requests and latency can give you a clearer picture of the causes behind degraded performance. | Reports might take longer to reflect the latest data. | zookeeper.outstanding_requests |
| Average latency | Throughput | The amount of time it takes to respond to a client request (in ms). | Reports might take longer to reflect the latest data. | zookeeper.latency.avg |
| Open file descriptors | Utilization | Linux has a limited number of file descriptors available, so it’s important to keep an eye on this metric to ensure ZooKeeper can continue to function as expected. | Reports might not reflect new data, as ZooKeeper will be getting errors. | zookeeper.open_file_descriptor_count |
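A common way to act on the file-descriptor metric is to alert on utilization, i.e. open descriptors as a fraction of the process limit, well before ZooKeeper hits the hard limit and starts erroring. A minimal sketch; the 0.85 threshold and the 1024 limit are illustrative values, not ZooKeeper defaults:

```python
def fd_utilization(open_fds, fd_limit):
    """Fraction of the file-descriptor limit currently in use."""
    return open_fds / fd_limit

def should_alert(open_fds, fd_limit, threshold=0.85):
    """Alert while there is still headroom to react, not at the hard limit."""
    return fd_utilization(open_fds, fd_limit) >= threshold

# e.g. zookeeper.open_file_descriptor_count = 900 against a 1024 ulimit
assert should_alert(900, 1024)       # ~88% used: alert
assert not should_alert(400, 1024)   # plenty of headroom
```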

Celery

The Datadog Agent ships with an HTTP Check integration that collects various network metrics. In addition, CommCareHQ reports many custom metrics for Celery; Datadog’s Custom Metrics page is worth a look. Celery Flower is also used to monitor tasks and workers.

| Metric | Metric type | Why care | User impact | How to measure |
| --- | --- | --- | --- | --- |
| Celery uptime | Availability | The uptime rating is a measure of service availability. | Background tasks will not execute (sending of emails, periodic reporting to external partners, report downloads, etc.) | network.http.can_connect |
| Celery uptime by queue | Availability | The uptime rating per queue. | Certain background or asynchronous tasks will not get executed. The user might not notice this immediately. | CommCareHQ custom metric |
| Time to start | Other | The time (in seconds) it takes a task in a specific queue to start executing. If a certain task consistently takes a long time to start, it might be worth looking into. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Blockage duration by queue | Throughput | The estimated time (in seconds) a certain queue was blocked. It might be worth alerting if a blockage lasts longer than a specified time. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Task execution rate | Throughput | A rough estimate of the number of tasks executed within a certain time bracket. This is an important metric, as it shows when tasks take progressively longer to execute, in which case an investigation might be appropriate. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Celery tasks by host | Throughput | The running time (in seconds) of Celery tasks by host. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Celery tasks by queue | Throughput | The running time (in seconds) of Celery tasks by queue, useful for identifying slower queues. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Celery tasks by task | Throughput | The running time (in seconds) of Celery tasks by individual task, useful for identifying slower tasks. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | CommCareHQ custom metric |
| Tasks queued by queue | Saturation | The number of tasks queued in each respective queue. If this grows steadily, keep an eye out for blockages. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | Celery Flower |
| Tasks failing by worker | Error | Tasks that failed to execute. Increasing numbers indicate problems with the respective worker(s). | If certain background or asynchronous tasks fail, certain features become unusable, for example sending emails, SMSs, periodic reporting, etc. | Celery Flower |
| Tasks by state | Other | The number of tasks by their Celery state. If the number of failed tasks increases, for instance, it might be worth looking into. | If certain background or asynchronous tasks fail, certain features become unusable, for example sending emails, SMSs, periodic reporting, etc. | Celery Flower |
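The "time to start" idea can be sketched as follows: record the gap between when a task was published and when a worker began executing it, tagged by queue (a bounded tag, in line with the tagging guidance above). The `metrics_histogram` recorder here is a hypothetical in-memory stand-in mirroring the shape of the `metrics_*` API from this package, not CommCareHQ's actual implementation.

```python
recorded = []  # stand-in for a real metrics backend

def metrics_histogram(name, value, tags=None):
    """Hypothetical recorder mirroring the metrics_* API shape."""
    recorded.append((name, value, tags or {}))

def record_time_to_start(enqueued_at, started_at, queue):
    """Emit how long a task waited in its queue before executing.
    The queue name is a bounded tag; never tag by task ID."""
    metrics_histogram(
        'commcare.celery.task.time_to_start',   # illustrative metric name
        started_at - enqueued_at,
        tags={'queue': queue},
    )

enqueued = 1000.0   # timestamp attached when the task was published
record_time_to_start(enqueued, 1042.5, queue='background_queue')
name, wait, tags = recorded[0]
# wait == 42.5: seconds the task spent queued before a worker picked it up
```

A consistently high value for one queue with normal values elsewhere usually points at that queue being blocked or under-provisioned rather than at the workers as a whole.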

RabbitMQ

The Datadog Agent ships with an integration which can be used to collect metrics. See the RabbitMQ Integration for more information.

| Metric | Metric type | Why care | User impact | How to measure |
| --- | --- | --- | --- | --- |
| Queue depth | Saturation | Queue depth, messages ready, and messages unacknowledged are best monitored together: they show whether consumers are keeping up with producers. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.messages |
| Messages ready | Other | Queue depth, messages ready, and messages unacknowledged are best monitored together: they show whether consumers are keeping up with producers. | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.messages_ready |
| Messages unacknowledged | Error | Queue depth, messages ready, and messages unacknowledged are best monitored together: they show whether consumers are keeping up with producers. | Certain background tasks will fail to execute, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.messages_unacknowledged |
| Queue memory | Utilization | RabbitMQ keeps messages in memory for faster access, but if queues handle a lot of messages you could consider using lazy queues in order to preserve memory. Read more | For the most part this might go unnoticed by the user, but there will be a delay in the execution of background tasks, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.memory |
| Queue consumers | Other | The number of consumers is configurable, so a lower-than-expected number of consumers could indicate failures in your application. | Certain background tasks might fail to execute, like sending emails, SMSs, alerts, etc. | rabbitmq.queue.consumers |
| Node sockets | Utilization | As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the available number of file descriptors helps you keep your system running. | Background tasks might take longer to execute, or in the worst case might not execute at all. | rabbitmq.node.sockets_used |
| Node file descriptors | Utilization | As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the available number of file descriptors helps you keep your system running. | Background tasks might take longer to execute, or in the worst case might not execute at all. | rabbitmq.node.fd_used |
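In RabbitMQ, queue depth (`rabbitmq.queue.messages`) is the sum of messages ready and messages unacknowledged, which is why the three counters are read together. A minimal sketch of that interpretation; the threshold and the two diagnosis labels are illustrative heuristics, not RabbitMQ terminology:

```python
def classify_backlog(ready, unacked, depth_threshold=1000):
    """Interpret the three queue counters together.
    Queue depth = messages ready + messages unacknowledged."""
    depth = ready + unacked
    if depth < depth_threshold:
        return 'healthy'
    if unacked >= ready:
        # consumers received messages but have not acked them:
        # slow or stuck consumers
        return 'slow consumers'
    # backlog is mostly undelivered messages: not enough consumer capacity
    return 'too few consumers'

assert classify_backlog(50, 10) == 'healthy'
assert classify_backlog(5000, 20) == 'too few consumers'
assert classify_backlog(200, 900) == 'slow consumers'
```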