Skip to content

Instantly share code, notes, and snippets.

@jrhemstad
Last active May 1, 2019 23:01
Show Gist options
  • Save jrhemstad/f6bb5d4c4598a6813493e1c37f01ea02 to your computer and use it in GitHub Desktop.
Save jrhemstad/f6bb5d4c4598a6813493e1c37f01ea02 to your computer and use it in GitHub Desktop.
Nomenclature for the different kinds of groupby aggregations

In order to facilitate conversation about the different kinds of aggregations that can be performed via groupby, it is important to be clear on the following terms:

Distributive

An aggregate function is distributive when it can be computed in a distributed manner. For example, assume that a data set is partitioned into n sets. The distributive aggregate function is applied to each set resulting in n aggregate values. If the aggregate value of the whole data set can now be computed by applying the same aggregate function to all the set aggregate values then the aggregate function is distributive. Examples of distributive aggregate functions are sum(), count(), min(), and max().

Algebraic

An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer) whereby each argument can be obtained with a distributive aggregate function. An example of an algebraic aggregate function is the average() function, which can be computed with the following algebraic expression: sum()/count() , with both sum() count() and count() being distributive.

There are sub-classes of algebraic aggregations:

Single-pass vs. Multi-Pass

Single-Pass

These algebraic aggregations require only a single pass through the input series. For example, sum and count can be computed in a single pass, and then the avg is computed via the scalar operation sum/count.

Multi-Pass

These algebraic aggregations require two or more passes through the input series. For example, stddev first requires computing the avg, and then revisiting ever element in the series to compute the sum of squares of the elementwise difference with the mean.

Scalar Result vs Series Result

Scalar Result

The result of these algebraic aggregations are a single, scalar value. For example, avg returns just a single value.

Series Result

The result of these algebraic aggregations return a set of 2 or more scalar values. For example, a prefix sum returns a list of N scalar values, where N is the size of the input series.

Holistic

An aggregate function is holistic if it cannot be described with an algebraic function with M arguments. An example of a holistic aggregate function is the median() function, which cannot be described by an algebraic expression or calculated in a distributive manner.

Sources

https://pdfs.semanticscholar.org/9542/f80403ef8093a8df142a8148f6685de781cb.pdf https://cs.stanford.edu/people/chrismre/cs345/rl/olap.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment