In order to facilitate conversation about the different kinds of aggregations that can be performed via groupby, it is important to be clear on the following terms:
An aggregate function is distributive when it can be computed in a distributed manner. For example, assume that a data set is partitioned into n sets. The distributive aggregate function is applied to each set resulting in n aggregate values. If the aggregate value of the whole data set can now be computed by applying the same aggregate function to all the set aggregate values then the aggregate function is distributive. Examples of distributive aggregate functions are sum(), count(), min(), and max().
An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer) whereby each argument can be obtained with a distributive aggregate function. An example of an algebraic aggregate function is the average() function, which can be computed with the following algebraic expression: sum()/count() , with both sum() count() and count() being distributive.
There are sub-classes of algebraic aggregations:
These algebraic aggregations require only a single pass through the input series. For example, sum
and count
can be computed in a single pass, and then the avg
is computed via the scalar operation sum/count
.
These algebraic aggregations require two or more passes through the input series. For example, stddev
first requires computing the avg
, and then revisiting ever element in the series to compute the sum of squares of the elementwise difference with the mean.
The result of these algebraic aggregations are a single, scalar value. For example, avg
returns just a single value.
The result of these algebraic aggregations return a set of 2 or more scalar values. For example, a prefix sum returns a list of N
scalar values, where N
is the size of the input series.
An aggregate function is holistic if it cannot be described with an algebraic function with M arguments. An example of a holistic aggregate function is the median() function, which cannot be described by an algebraic expression or calculated in a distributive manner.
https://pdfs.semanticscholar.org/9542/f80403ef8093a8df142a8148f6685de781cb.pdf https://cs.stanford.edu/people/chrismre/cs345/rl/olap.pdf