In this document we'll try to come up with a glossary for the Telemetry.Metrics
project.
In order to create a glossary, it's helpful to look at how various metric systems name different entities. Note that this comparison aims to highlight only the differences related to the data model and metric types.
StatsD

StatsD is not a full-fledged metric system, but an agent which aggregates metrics and forwards them to another system (Graphite by default) at a specified interval. StatsD implements a one-dimensional data model, i.e. a metric has a name and a value. DogStatsD, DataDog's implementation of the agent, supports optional tagging.
Since it's the StatsD agent that aggregates the metrics, each sample (or measurement, if you will) needs to be sent to it. This means that UDP might become a bottleneck under bigger workloads.
The main issue with StatsD is that it has many implementations, and there are small differences in metric behaviour between those implementations. The overview below is based on the original Etsy implementation.
Counter

A StatsD counter can be incremented and decremented. The StatsD agent publishes both the total count and the rate. After publishing, the counter is reset. You can specify the counter's sample rate.
Gauge

A gauge can be set, incremented and decremented. The StatsD agent publishes only the gauge value. The gauge is not reset when published.
Timer

The timer metric produces summary statistics of the measured value, i.e. mean, maximum, minimum, quantiles, etc. Optionally, it can maintain a histogram of measured values. As with the counter, you can specify the sample rate.
Set

A set, as the name suggests, is a collection of unique measurements.
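The four metric types above all travel in StatsD's plain-text datagram format, "name:value|type", optionally suffixed with a sample rate. A minimal sketch (the metric names and the agent address are made up; 8125 is the conventional default port):

```python
import socket

def statsd_packet(name, value, metric_type, sample_rate=None):
    """Build a StatsD plain-text datagram: "<name>:<value>|<type>[|@<rate>]"."""
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        packet += f"|@{sample_rate}"
    return packet

# One datagram per metric type described above (names are hypothetical):
counter = statsd_packet("page.views", 1, "c", sample_rate=0.1)  # sampled counter
gauge   = statsd_packet("queue.size", 42, "g")                  # gauge set to 42
timer   = statsd_packet("db.query", 320, "ms")                  # timing in ms
unique  = statsd_packet("users", "alice", "s")                  # set member

# Delivery is fire-and-forget over UDP, which is where the bottleneck
# mentioned above comes from -- every single sample is a datagram:
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(counter.encode(), ("127.0.0.1", 8125))
```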
InfluxDB

InfluxDB is a time-series database. It doesn't have a notion of a metric. In InfluxDB, values are organized into measurements, which are conceptually similar to relational tables. Each point (row) in a measurement consists of a timestamp, a set of fields, and a set of tags. Field values can be numbers, strings or booleans, while tag values are always strings. A series is a collection of points within a single measurement that share the same tags.
Note: the only difference between fields and tags is that tags are indexed, so they are usually used to break down the collection of points by some feature.
More information can be found in the InfluxDB key concepts documentation.
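This data model maps onto InfluxDB's line protocol. A rough sketch of how a point is rendered (measurement, tag and field names are made up; the real line protocol additionally requires escaping and marks integer fields with an `i` suffix, which this sketch omits):

```python
def line_protocol(measurement, tags, fields, timestamp_ns):
    """Render a point in InfluxDB line protocol:
    <measurement>,<tag_set> <field_set> <timestamp>"""
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_set = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_set} {field_set} {timestamp_ns}"

point = line_protocol(
    "http_requests",                      # measurement (like a table)
    {"method": "GET", "status": "200"},   # tags: indexed, identify the series
    {"duration_ms": 14.3, "bytes": 512},  # fields: the actual values
    1465839830100400200,                  # nanosecond timestamp
)
```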
Prometheus

Prometheus is a pull-based metric system. A Prometheus time series is uniquely identified by a metric name and a set of labels. Each sample in the series has a timestamp and a single, floating-point value. As you can see, the model is very similar to the one used by InfluxDB, except that it doesn't support multiple values per sample and the value needs to be a number.
Unfortunately, Prometheus is not entirely consistent in its use of the word "metric": e.g. a histogram is a "metric", but it produces multiple time series, each with a different "metric" name.
Counter

A Prometheus counter is monotonically increasing. Only its value is published, but the Prometheus query language allows you to calculate the rate at which the counter is counting things over a selected time window.
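The idea behind that rate calculation can be sketched as follows. This is a simplification of PromQL's `rate()` (the real function also extrapolates to the window boundaries), but it shows the two essential points: only the counter's value is stored, and resets must be compensated for:

```python
def simple_rate(samples):
    """Per-second increase of a monotonic counter over a window.

    `samples` is a list of (timestamp_seconds, value) pairs, oldest first.
    When the value drops, the counter must have reset (restarted from ~0),
    so the whole new value counts as an increase.
    """
    (t0, _), (tn, _) = samples[0], samples[-1]
    increase = 0.0
    prev = samples[0][1]
    for _, v in samples[1:]:
        increase += v - prev if v >= prev else v
        prev = v
    return increase / (tn - t0)

# Counter scraped every 15 s, resetting after the second sample
# (hypothetical data): increase is 30 + 10 + 30 = 70 over 45 s.
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
```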
Gauge

Very similar to the StatsD gauge: it can be set, but also incremented or decremented.
Histogram

A standard histogram of observations, i.e. it tracks the number of observations which fall into configurable buckets. The histogram metric produces three series:
<metric_name>_sum with the sum of observations
<metric_name>_count with the count of observations
<metric_name>_bucket with the le ("less than or equal") label for the actual distribution of values
You can calculate the mean of observations using the Prometheus query language. This metric allows for more advanced calculations, e.g. the percentage of requests served in under X milliseconds.
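What a histogram actually exports can be sketched as follows (series name and bucket bounds are hypothetical; note that Prometheus bucket counts are cumulative, and an implicit +Inf bucket always equals the total count, which is what makes the mean `_sum / _count` and the under-X-ms percentage `_bucket{le="X"} / _count` computable):

```python
import bisect

def histogram_series(name, observations, buckets):
    """Produce the three series a Prometheus histogram exports."""
    bounds = sorted(buckets)
    counts = [0] * (len(bounds) + 1)          # last slot is the +Inf bucket
    for obs in observations:
        # bisect_left gives the first bound >= obs, matching "le" semantics.
        counts[bisect.bisect_left(bounds, obs)] += 1
    series = {f"{name}_sum": sum(observations),
              f"{name}_count": len(observations)}
    labels = [str(b) for b in bounds] + ["+Inf"]
    cumulative = 0
    for label, c in zip(labels, counts):
        cumulative += c                        # bucket counts are cumulative
        series[f'{name}_bucket{{le="{label}"}}'] = cumulative
    return series

s = histogram_series("req_ms", [5, 20, 20, 300], buckets=[10, 100])
```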
Summary

Tracks the quantiles, sum and number of observations. The summary metric produces three series:
<metric_name>_sum with the sum of observations
<metric_name>_count with the count of observations
<metric_name> with the quantile label for quantiles
You can calculate the mean of observations using the Prometheus query language.
Note: a histogram allows estimating quantiles from multiple instances exposing the same metric. With a summary, we get almost correct quantiles, but aggregating them across multiple instances doesn't make statistical sense.
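The reason histograms aggregate across instances and summaries don't can be sketched in a few lines (bucket data is hypothetical):

```python
def merge_histograms(a, b):
    """Merge two instances' cumulative bucket counts by adding them
    bucket-by-bucket -- counts of observations add up, so this is
    statistically sound and quantiles can be re-estimated afterwards."""
    return {le: a[le] + b[le] for le in a}

# Two instances exporting the same bucket bounds:
inst1 = {"10": 50, "100": 80, "+Inf": 100}
inst2 = {"10": 10, "100": 70, "+Inf": 100}
merged = merge_histograms(inst1, inst2)

# By contrast, there is no sound way to combine two instances' p99
# summary values: the combined p99 depends on the full distributions,
# which the summaries have already thrown away. Averaging them is wrong.
```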
OpenCensus

OpenCensus is not a metric system per se, but rather a standard for instrumenting code across multiple languages and technology stacks. You can plug in an exporter to expose the data to an external metric system, like Prometheus, Zipkin (since OpenCensus supports tracing as well), etc. In short, OpenCensus tries to do for all programming languages what Telemetry tries to do for Elixir.
OpenCensus has a very detailed glossary around its data model. I think that we can learn much from it and piggyback on it a little.
All information here is taken from the OpenCensus metrics documentation.
Measure

A measure is a type of metric to be recorded. A measure has a name, a description and a unit. A measure does not describe how the values are aggregated. For example, a library in some programming language could expose a measure, and the user of the library could choose to aggregate it later; by itself, a measure is just a logical stream of measurements (see below).
Measurement

A measurement is a data point/value collected for a measure. Each measurement has a value and a set of tags. OpenCensus has a dedicated API for recording measurements; an important fact about it is that it doesn't support sampling.
View

A view describes how the data is aggregated. A view takes measurements from the specified measure and aggregates them with the selected aggregation method. Aggregations are broken down by a selected set of tags, much like in Prometheus.
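The measure/measurement/view relationship can be sketched as follows. This is a toy model, not the OpenCensus API; the measure name, tag keys and data are all made up:

```python
from collections import defaultdict

class View:
    """Toy OpenCensus-style view: aggregates measurements from one measure,
    broken down by a chosen subset of tag keys."""

    def __init__(self, measure, tag_keys, aggregate):
        self.measure = measure
        self.tag_keys = tag_keys
        self.aggregate = aggregate          # e.g. sum, len (count), max
        self.data = defaultdict(list)

    def record(self, measure, value, tags):
        if measure != self.measure:
            return                          # this view ignores other measures
        # Only the view's tag keys participate in the breakdown;
        # other tags on the measurement are dropped.
        key = tuple(tags.get(k) for k in self.tag_keys)
        self.data[key].append(value)

    def export(self):
        return {key: self.aggregate(values) for key, values in self.data.items()}

# A "sum" view over a hypothetical latency measure, broken down by method:
view = View("http/latency_ms", ["method"], aggregate=sum)
view.record("http/latency_ms", 12.0, {"method": "GET", "host": "a"})
view.record("http/latency_ms", 30.0, {"method": "GET", "host": "b"})
view.record("http/latency_ms", 8.0, {"method": "POST", "host": "a"})
```

Note how the library author only defines the measure and records measurements; the choice of aggregation and tag breakdown stays with the user, via the view.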
Aggregations

OpenCensus aggregations are the closest entity to metrics in other systems/standards.

Count

Counts the number of measurements.

Distribution

Tracks a histogram distribution of measurement values.

Sum

Sums up the measurement values.

LastValue

Keeps track of the last measurement value.
Telemetry glossary
Metrics are responsible for aggregating Telemetry events with the same name in order to gain any useful knowledge about the events. A single metric may generate multiple aggregations, each aggregation being bound to a unique set of tag values. Tags are key-value pairs derived from event metadata; in the simplest case, tags are a subset of the metadata. Based on the tag values, the value of the event is used to update one of the aggregations. The metric type defines how the values are aggregated (e.g. a sum or a distribution). Each aggregation may itself contain many values, depending on the metric type.
Event
A Telemetry event, with a name, a numerical value and metadata.
Metric
Consumes events from the stream and aggregates them according to each unique set of tags derived from those events. A metric has a name, a description, a type and a unit. You also need to specify the name of the events consumed by the metric.
Tags
A collection of key-value pairs derived from event metadata.
Metric type
Telemetry supports the following metric types:
Counter
Aggregation value is the number of emitted events, regardless of their values. In multi-node deployments, the counter values can be safely merged (by adding) without losing statistical correctness.
Sum
Aggregation value is the sum of event values. Sum values can be safely merged without losing correctness.
LastValue
Aggregation value is the value carried by the most recent event in the stream. Values of this metric cannot be merged in the general case without losing correctness.
Distribution
Aggregation is a histogram distribution of event values, i.e. how many events were emitted with values falling into defined buckets. Aggregation contains a value for each bucket. Values of this metric can be safely merged by summing up per-bucket entries.
Summary

The aggregation contains summary statistics of the values of events in the stream, like mean, minimum, maximum, count and selected quantiles. Values of this metric can't be aggregated without losing correctness.

Note: summaries won't be available in the first version of Telemetry.Metrics, but may be added later if there is a need for them.
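The event/metric/tags model above can be sketched as follows. This is a toy illustration of the glossary, not the Telemetry.Metrics API; the event name, tag keys and data are all hypothetical:

```python
class Metric:
    """A metric consumes events with a given name and keeps one
    aggregation per unique set of tag values derived from metadata."""

    def __init__(self, event_name, tag_keys, metric_type):
        self.event_name = event_name
        self.tag_keys = tag_keys
        self.metric_type = metric_type      # "counter", "sum" or "last_value"
        self.aggregations = {}

    def handle_event(self, name, value, metadata):
        if name != self.event_name:
            return
        # Tags in the simplest case: a subset of the event metadata.
        tags = tuple((k, metadata[k]) for k in self.tag_keys)
        current = self.aggregations.get(tags, 0)
        if self.metric_type == "counter":
            self.aggregations[tags] = current + 1       # ignores the value
        elif self.metric_type == "sum":
            self.aggregations[tags] = current + value
        elif self.metric_type == "last_value":
            self.aggregations[tags] = value

# Two metrics of different types consuming the same event stream:
events = [
    ("db.query", 120, {"table": "users"}),
    ("db.query", 80, {"table": "users"}),
    ("db.query", 33, {"table": "posts"}),
]
counter = Metric("db.query", ["table"], "counter")
total = Metric("db.query", ["table"], "sum")
for name, value, meta in events:
    counter.handle_event(name, value, meta)
    total.handle_event(name, value, meta)
```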