Thriftly Metrics

Prometheus supports four types of metrics: counters, gauges, histograms, and summaries. To learn more about each metric type, see https://prometheus.io/docs/concepts/metric_types/. The metrics provided by Thriftly are:

  • process_seconds_total - This is a counter that reports the number of seconds a process has been running. It doesn’t indicate that the process was doing anything, just the lifetime of the process.

  • process_cpu_seconds_total - This is a counter that reports the number of full seconds of CPU time consumed by the process. By itself it doesn’t have much value, but when viewed as a portion of process_seconds_total it can be used to determine if a process is using too much CPU time or if it is not being used at all.

  • process_startup_time_seconds - This is a counter that reports the number of seconds a process takes from creation until it reports to Thriftly that it is ready to receive API events.

  • process_resident_memory_bytes - This is a gauge measuring the amount of actual memory that has been allocated to a process and has not yet been reclaimed by Windows.

  • process_virtual_memory_bytes - This is a gauge measuring the amount of virtual address space the process currently has allocated. The virtual memory allocation will always be greater than or equal to the resident memory allocation. Windows limits virtual memory allocation to the amount of physical RAM plus the pagefile size.

  • thriftly_runtime_call_duration_seconds - This is a histogram with the default buckets (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10). It measures the time between an event being dispatched to a process and the process returning a result. This metric can give you a good indication of how fast your actual application is, independent of Thriftly. If thriftly_runtime_request_duration is trending high and thriftly_runtime_response_duration is also trending high, it could indicate a bottleneck within your application, such as an overloaded database server. The labels associated with this call can help you determine whether a specific API is often taking longer than expected.

  • thriftly_runtime_calls_total - This is a counter of the total number of calls coming in to Thriftly. It doesn’t differentiate between calls that throw an error versus those that did not. It is a total count only. The labels associated with this call could be used to break this total up into more granular values.

  • thriftly_runtime_call_duration_seconds_total - This is a counter that keeps a running total of the amount of time all calls have taken. The labels associated with this call could be used to break this total up into more granular values.

  • thriftly_runtime_request_bytes_total - This is a counter of the total bytes received by Thriftly. As a total it isn’t that useful, but when filtered based on labels it can provide insight into calls that are becoming too large, either due to an error in the client or a design flaw.

  • thriftly_runtime_response_bytes_total - This is a counter of the total bytes sent by Thriftly. This can be used to determine which calls are “expensive” from the standpoint of the amount of data transferred by filtering based on the labels associated with the metric.

  • thriftly_worker_pid - This is a gauge, although from a use standpoint it’s just an informational indicator of what process IDs are currently active.

  • thriftly_worker_status - This is a gauge indicating the state of a specific worker. There are three valid states: 0) The worker is down. This means the dispatcher is unable to communicate with the worker process. 1) The worker is started. This means that a worker process has been spawned but it has not yet connected to the dispatcher. 2) The worker is connected. This indicates that the worker process has a successful communication channel to the dispatcher.

  • thriftly_worker_starts_total - This is a counter. Whenever a worker process is spawned for a pool, this metric will be incremented. This doesn’t indicate that the worker process is successfully connected to the dispatcher, just that it was spawned.

  • thriftly_worker_restarts_total - This is a counter. If a worker process is no longer communicating with the dispatcher, it will eventually be killed and a new worker process spawned. When this occurs, this metric will be incremented.

  • thriftly_worker_connects_total - This is a counter. When a started worker process finally connects to the dispatcher, this metric will be incremented. If this metric is significantly lower than thriftly_worker_starts_total, it can indicate that there is a problem with the process starting up.

  • thriftly_worker_connect_timeouts_total - This is a counter. When a worker process is sent a message and doesn’t respond to that message within the allotted time, based on the Kill Process Timeout setting, this metric will be incremented. When a worker process is initially spawned and doesn’t connect within the allotted time, based on the Startup Timeout setting, this metric will be incremented.

  • thriftly_worker_execute_timeouts_total - This is a counter. When a worker process is sent an API event by the dispatcher and takes too long to respond to that event, based on the Execute Timeout setting, this metric will be incremented. A worker process that takes too long isn’t necessarily killed; it may still finish processing even though the API call has been aborted by the dispatcher. If this metric is increasing, it may indicate that the current Execute Timeout setting is insufficient or that certain API calls are taking longer than expected.

  • thriftly_worker_disconnects_total - This is a counter. A worker process may decide on its own to close, or a pool of worker processes could be stopped intentionally. When this happens gracefully, the workers properly disconnect from the dispatcher before exiting. When this occurs, this metric will be incremented.

  • thriftly_worker_exits_total - This is a counter. When a worker process shuts down gracefully, this metric will be incremented.

  • thriftly_dispatch_call_duration_seconds - This is a histogram with the default buckets (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10). Rather than providing details about a specific occurrence, this metric reports the duration that a call takes from the time it is put into the queue until a response is sent to the client. Using this histogram, an Apdex score could be approximated, or the 0.95 quantile could be estimated over a time window for an SLO.

  • thriftly_dispatch_call_overhead_seconds_total - This is a counter that accumulates the difference between thriftly_dispatch_call_duration_seconds and thriftly_runtime_call_duration_seconds. It is a measure of how much time is being spent in Thriftly’s internal logic versus network time and application time. It should grow very slowly over time.

  • thriftly_dispatch_running_workers - This is a gauge. This indicates the current total running workers in any state.

  • thriftly_dispatch_idle_workers - This is a gauge. This indicates the current total running workers that are sitting idle, meaning they aren’t processing an event.

  • thriftly_dispatch_pending_requests - This is a gauge. This indicates the number of API calls waiting for a worker process to become idle so they can be processed.

  • thriftly_dispatch_executing_requests - This is a gauge. This indicates the number of API calls currently being processed.
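
Several of the descriptions above suggest combining metrics: CPU time as a portion of process lifetime, a running duration total against a call count, and an Apdex score approximated from histogram buckets. The following is a minimal sketch of that arithmetic, not part of Thriftly or Prometheus. The function names, the sample readings, and the choice of 0.25 seconds as the Apdex target are all assumptions made for illustration; in practice the inputs would be scraped values, usually taken as deltas over a time window rather than raw totals.

```python
from typing import Mapping


def ratio(numerator: float, denominator: float) -> float:
    """Divide two counter readings, returning 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def apdex(cumulative: Mapping[float, float], total: float, target: float) -> float:
    """Approximate an Apdex score from cumulative histogram bucket counts.

    `cumulative` maps a bucket upper bound in seconds to the number of
    observations at or below that bound. Both `target` and `4 * target`
    must be bucket bounds (for example 0.25 and 1 in the default buckets).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    satisfied = cumulative[target]
    tolerating = cumulative[4 * target] - satisfied
    return (satisfied + tolerating / 2) / total


# Hypothetical readings, invented for this example.
# process_cpu_seconds_total / process_seconds_total
cpu_share = ratio(90.0, 3600.0)
# thriftly_runtime_call_duration_seconds_total / thriftly_runtime_calls_total
mean_call_seconds = ratio(42.0, 840.0)
# A few cumulative bucket counts from thriftly_dispatch_call_duration_seconds
buckets = {0.25: 80, 0.5: 90, 1.0: 95, 2.5: 99}
score = apdex(buckets, total=100, target=0.25)

print(cpu_share, mean_call_seconds, score)
```

Because every input here is a counter or a cumulative bucket count, the same functions can be applied to the difference between two scrapes to get values for a recent window instead of the whole process lifetime.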