An aggregate function or clause takes a group of records (which could be an entire table or a subgroup in that table) and from all the values for a field in that group computes one or more resulting values. For example, the aggregate function Sum(<field>) takes all values for a given field in the group, sums them and returns the sum. The aggregate function Avg(<field>) returns the average of values for a field in the group. Aggregates ignore NULL values.
Using aggregates is easy. Given a table called Expenses that has a field called Payment, if we want to know the sum of all payments and the average of all payments we can write:
SELECT Sum(Payment), Avg(Payment)
FROM Expenses;
Aggregates include Sum, Avg, Count, Max and Min. Max and Min also work for dates and strings, for uuid and binary values, and for xN vector values. See also the SQL Functions and INLINE topics for examples.
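For instance, applied to a text field Max and Min return the last and first values in lexicographic order. A minimal sketch using the mfd_meta system table, which has a text field called name:
SELECT Min(name), Max(name)
FROM mfd_meta;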
Aggregates are often used with the GROUP BY clause, which creates subgroups from a table. The aggregate can then operate on each subgroup to create the desired aggregate summary of the record values in that subgroup. If our Expenses table has a field called Payment and a field called Job that give payments made for each job that was done, we can get a list of each job category and the sum of payments for each job subgroup with:
SELECT Job, Sum(Payment)
FROM Expenses GROUP BY Job;
If there is no GROUP BY clause, the aggregate function will operate on the entire table. See the GROUP BY, COLLECT, and SPLIT Tutorial topic for aggregate examples using COLLECT and Sum together with GROUP BY.
Most aggregates, such as Sum, are simple and obvious enough that they do not require additional commentary. The following notes discuss aggregates that are less obvious or have nuances in their use.
Depending on the argument, the Count aggregate function counts either the number of non-NULL values in a field or expression within a group, or it counts the total number of records within a group:
Count(field) - will count the number of non-NULL values in field.
Count(expression) - will count the number of non-NULL values of the expression.
Count(*) - will always count the total number of records.
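The difference matters when a field contains NULLs. A minimal sketch, using the Expenses table introduced above and assuming some Payment values are NULL:
-- Count(Payment) skips NULL payments; Count(*) counts every record
SELECT Count(Payment), Count(*)
FROM Expenses;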
A classic example is finding the number of duplicates in a City field within a table, using Count(*):
SELECT [City], Count(*) FROM [Employees]
GROUP BY [City]
HAVING Count(*) > 1;
See the discussion of the above example in the Queries topic.
The COLLECT aggregate collects values from a subgroup into a table. COLLECT returns a table with one or more fields and zero or more records. The values in that table normally are just regular values and are not tables.
Modern databases such as Oracle or PostgreSQL often allow control over the values seen by aggregates. For example, a user might write a query that contains several Sum aggregate functions on the same field, with each Sum computing a partial sum over only those values that pass a filter specified for that Sum, for example:
SUM(x) FILTER (WHERE ...)
The COLLECT aggregate provides a generic aggregate for performing such operations and other operations as well. For example:
SELECT name, Count(*),
SPLIT (COLLECT property, value)
FROM mfd_meta GROUP BY name;
In the above example SPLIT is used merely to show which values are in each COLLECT.
The table returned by COLLECT is typically fed into a function. This essentially allows creating custom aggregates, including with scripts.
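For example, a minimal sketch, assuming a hypothetical helper function with a TABLE parameter to which the table returned by COLLECT is passed:
-- countcollected is a hypothetical helper returning a one-record table
FUNCTION countcollected(@t TABLE) TABLE AS
(SELECT Count(*) FROM @t) END;
SELECT name, SPLIT CALL countcollected((COLLECT value))
FROM mfd_meta GROUP BY name;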
We can also COLLECT expressions similar to how SELECT can use expressions:
SELECT name, Count(*),
SPLIT (COLLECT property, StringToUppercase(value))
FROM mfd_meta GROUP BY name;
One way to understand what COLLECT does is to start with what GROUP does. GROUP takes a table, sorts the records within that table into groups, and then applies aggregates to each group to derive one record for that group.
Each aggregate takes a group, which is a subset of records from the original table, and from the values in that group computes one or more resulting values for the resulting record. For example, the aggregate Sum(f) takes all values in the group, sums them and returns the sum.
COLLECT works the same way: it is like a SELECT which runs on a group. COLLECT takes a table and returns a table without requiring us to write a FROM section as we would with a SELECT.
COLLECT supports ORDER BY using the same syntax as ORDER BY in SELECT statements, including ordering by expression.
COLLECT supports DISTINCT, for example as in
(COLLECT DISTINCT value)
COLLECT supports WHERE. A sketch combining all three appears below.
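For example, a minimal sketch, assuming that within a COLLECT the WHERE clause precedes ORDER BY just as it does in a SELECT:
SELECT name,
SPLIT (COLLECT DISTINCT value WHERE value IS NOT NULL ORDER BY value)
FROM mfd_meta GROUP BY name;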
See also the GROUP BY, COLLECT, and SPLIT Tutorial topic for a tutorial introduction to COLLECT.
In addition to the aggregates mentioned above, other aggregate functions are available:
Diversity - Takes a set of numbers and computes the total number of different values.
DiversityIndex - Takes a set of numbers and computes a measure of diversity using the formula 1 - Sum(individualcount^2) / totalcount^2. For example, for the values 1, 1, 2 the individual counts are 2 and 1 out of a total count of 3, so the index is 1 - (2^2 + 1^2) / 3^2 = 4/9, approximately 0.44. A diversity index of 0 means that all values are the same.
GeomMergeAreas - Takes a set of areas and returns an area geom that contains branches. Curves and Z values are removed.
GeomMergeLines - Takes a set of lines and returns a line geom that contains branches. Curves and Z values are removed.
GeomMergePoints - Takes a set of points and returns a multipoint geom that contains branches. Z values are removed.
GeomUnionAreas - Takes a set of areas and returns their union.
GeomUnionRects - Takes a set of x4 values and returns their union.
First and Last - Return the first and last value respectively. Unlike most other aggregates, First and Last do not skip NULLs.
JoinAnd, JoinOr and JoinXor - Combine boolean values using the And, Or and Xor operators.
JoinBitAnd, JoinBitOr and JoinBitXor - Combine numeric values using the BitAnd, BitOr and BitXor operators.
Median - Returns the median value; works for values of arbitrary type.
Major - Takes a set of numbers and returns the most frequently occurring value.
Percentile - Returns a value for the specified percentile. Percentile 0 returns the minimum value, percentile 100 returns the maximum value, and percentile 50 returns the median value. Works for values of any type.
StDev and StDevPop - Compute sample standard deviation and population standard deviation.
StringJoinTokens - Takes a set of string values and joins them into a single string with the specified separator character(s).
Var and VarPop - Compute sample variance and population variance.
Covar and CovarPop - Compute sample covariance and population covariance, taking two parameters.
Corr - Computes correlation, taking two parameters.
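As a quick illustration, a hedged sketch using the Expenses table from earlier in this topic, assuming Payment is numeric and that the aggregate StringJoinTokens takes the value first and the separator second:
SELECT Job,
Median(Payment),
Diversity(Payment),
StDev(Payment),
StringJoinTokens(CAST(Payment AS NVARCHAR), ', ')
FROM Expenses GROUP BY Job;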
See examples in the Aggregate SQL Functions topic.
Do aggregates require using GROUP BY? No. We can use aggregates without a GROUP BY, in which case the aggregates apply to the entire table. For example, given a table called Expenses that has a field called Payment we can write:
SELECT Sum(Payment), Avg(Payment)
FROM Expenses;
Using any aggregate in a SELECT list tells the query engine the SELECT is about aggregates even if there is no GROUP BY. If there is a GROUP BY, the aggregates in the SELECT list will be applied to each subgroup created by the GROUP BY, but if there is no GROUP BY the SELECT will take the entire table as a group, applying the aggregate to values in the entire table.
Geared up for aggregates, the SELECT will process the field list insisting that every field in that list is either the result of an aggregate or the subject of a GROUP BY. Therefore, if we had some other field in our table called Description we could not write:
-- NOT a valid query
SELECT Description, Sum(Payment), Avg(Payment)
FROM Expenses;
The above will fail with an error message saying that Description must be either part of the GROUP BY or part of an aggregate. See the GROUP BY, COLLECT, and SPLIT Tutorial topic for examples.
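To make the query valid, Description must either be added to a GROUP BY or wrapped in an aggregate. For example, the following grouped variant works:
SELECT Description, Sum(Payment), Avg(Payment)
FROM Expenses GROUP BY Description;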
When implementing aggregates, any query engine must decide how far an aggregate can travel from the GROUP that produces it, whether that GROUP is explicit or implicit. For example, considering the fragment
SELECT ... Max(a) ... FROM t GROUP BY b
the query engine must decide what is allowed or disallowed within the ellipses. Manifold allows everything except crossing into nested SELECTs or across function boundaries.
Nested Selects - An aggregate may not go into a nested SELECT. So, for example, the following works:
FUNCTION f(@t TABLE, @u NVARCHAR) TABLE AS
(SELECT mfd_id, type FROM @t WHERE type>@u) END;
SELECT name, SPLIT CALL f(mfd_root, Max(value))
FROM mfd_meta GROUP BY name;
But the following does not work:
SELECT name, SPLIT
(SELECT mfd_id, type FROM mfd_root
WHERE type>Max(value))
FROM mfd_meta GROUP BY name;
The above does not work because the inner SELECT treats Max(value) as its own aggregate, not one belonging to the outer SELECT. If the inner SELECT accepted aggregates from the outer SELECT, we could end up with something like the following, which does not work:
SELECT name, SPLIT
(SELECT mfd_id, type FROM mfd_root
WHERE Max(name) > Max(value)
GROUP BY mfd_id, type)
FROM mfd_meta GROUP BY name;
In the above it would be difficult for a reader to tell which Max in the inner SELECT belongs to that SELECT and which belongs to the outer SELECT. In such cases it would be easy to create queries that compile and run but do not work as intended.
To prevent such problems, in Manifold a SELECT acts as a fence for aggregates. All aggregates inside a SELECT belong to that SELECT.
Functions - An aggregate may not cross over between the outside and the inside of the body of a FUNCTION. That does not mean we cannot use aggregates with functions: aggregates may be used with functions, but each aggregate must be either completely inside a function or completely outside it. The boundary between the inside and outside of a function is also a fence that an aggregate may not cross.
For example, the following works because the aggregate is completely outside the function:
FUNCTION f(@a INT32) INT32 AS @a+100 END;
SELECT name, f(Count(*))
FROM mfd_meta GROUP BY name;
The following works as well (aggregate completely inside):
FUNCTION f(@t TABLE) TABLE AS
(SELECT name, Count(*)+100 FROM @t GROUP BY name) END;
TABLE CALL f(mfd_meta);
The following works too with aggregates inside and outside the function but separated from each other:
FUNCTION f(@t TABLE, @n NVARCHAR) TABLE AS
(SELECT Count(*) FROM @t WHERE name=@n) END;
SELECT SPLIT CALL f(mfd_meta, Max(name)) FROM mfd_root;
This works:
SELECT Min(mfd_id) + Max(mfd_id) FROM mfd_meta;
But the following does not work:
FUNCTION f(@a INT32) INT32 AS Min(@a) + Max(@a) END;
SELECT f(mfd_id) FROM mfd_meta;
But this does work:
FUNCTION f(@a INT32, @b INT32) INT32 AS @a + @b END;
SELECT f(Min(mfd_id), Max(mfd_id)) FROM mfd_meta;
In general, everything else apart from nested SELECTs and mixing between the inside and outside of functions is fair game. For example, the following works:
SELECT SPLIT (VALUES (Min(mfd_id)), (Max(mfd_id)))
FROM mfd_meta;
...and this works as well:
SELECT SPLIT
(EXECUTE WITH (@n NVARCHAR=Max(name))
[[ SELECT * FROM mfd_root WHERE name=@n ]])
FROM mfd_meta;
Merge and Union aggregates usually also have a non-aggregate "Pair" version of the function that operates on two values of the same type. For example, the aggregate GeomMergeLines function has a GeomMergeLinesPair equivalent that merges two line objects.
We will try passing a non-line to GeomMergeLinesPair:
--SQL
TABLE CALL GeomToCoords(GeomMergeLinesPair(
GeomConvertToArea(GeomMakeRect(VectorMakeX4(3, 3, 4, 4))),
GeomConvertToLine(GeomMakeRect(VectorMakeX4(7, 7, 8, 8)))
))
...it ignored the non-line and returned branches for the line, the second argument. If we had passed two non-lines, it would have returned a NULL value.
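To verify the two-non-lines case we can convert both rectangles to areas. A minimal sketch, assuming a bare VALUES statement may be run as a query:
--SQL
VALUES (GeomMergeLinesPair(
GeomConvertToArea(GeomMakeRect(VectorMakeX4(3, 3, 4, 4))),
GeomConvertToArea(GeomMakeRect(VectorMakeX4(7, 7, 8, 8)))
))
The result is a NULL value, since neither argument is a line.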
The non-aggregate merge and union functions only take two arguments. If we have three objects, how do we merge them?
We can do it like this:
--SQL
SELECT GeomMergePoints(f) FROM
(VALUES
(GeomMakePoint(VectorMakeX2(0, 0))),
(GeomMakePoint(VectorMakeX2(1, 1))),
(GeomMakePoint(VectorMakeX2(2, 2)))
AS (f))
Or like this in the case of five objects (the same approach works for more than five):
--SQL
FUNCTION MakeSeveralPoints(@p GEOM, @q GEOM, @r GEOM, @s GEOM, @t GEOM) TABLE AS
(VALUES (@p), (@q), (@r), (@s), (@t) AS (f))
END;
SELECT GeomMergePoints(f) FROM CALL MakeSeveralPoints(
GeomMakePoint(VectorMakeX2(0, 0)),
GeomMakePoint(VectorMakeX2(1, 1)),
GeomMakePoint(VectorMakeX2(2, 2)),
GeomMakePoint(VectorMakeX2(3, 3)),
GeomMakePoint(VectorMakeX2(4, 4)))
Finally, we take a look at the difference between GeomMergeAreas and GeomUnionAreas. We run the following one by one:
--SQL
SELECT SPLIT CALL GeomToCoords(GeomMergeAreas(f)) FROM
(VALUES
(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A
(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B
(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C
AS (f))
--SQL
SELECT SPLIT CALL GeomToCoords(GeomUnionAreas(f)) FROM
(VALUES
(GeomMakeRect(VectorMakeX4(0, 0, 7, 7))), -- A
(GeomMakeRect(VectorMakeX4(2, 2, 3, 3))), -- B
(GeomMakeRect(VectorMakeX4(8, 8, 9, 9))) -- C
AS (f))
GeomMergeAreas produces three branches while GeomUnionAreas produces two branches because B is completely covered by A and disappears in the union.
Percentile is a useful aggregate for quickly reporting the values found at given percentiles. Percentile works for values of any type that can be ordered. The syntax is:
Percentile(<value>, <percent>): <value>
A casual way of explaining percentiles is that, given a value, the percentile of that value says what percent of the values in the table are less than or equal to that value. For example, if we have a table of invoices that gives the value of each invoice and the 20th percentile value is $1.98, that means that 20 percent of the invoices were for amounts less than or equal to $1.98.
Given that definition Percentile using 0 returns the minimum value (because zero percent of the values in the table are lower than or equal to the minimum value), Percentile using 100 returns the maximum value (because 100 percent of the values in the table are lower than or equal to the maximum value), and Percentile using 50 returns the median value (because the definition of median means that half, or 50%, of the values are lower than or equal to the median value).
Some simple examples follow that use a table called Invoices that lists invoices and the Total amount for each invoice:
SELECT Percentile(Total, 50)
FROM Invoices;
Will return the median value (50th percentile) of numbers in the Total field of the Invoices table.
SELECT Percentile(Total, 25)
FROM Invoices;
Will return the value such that 25% of the numbers in the Total field of the Invoices table will be less than or equal to the returned value.
Note that percentile values are not interpolated values. They are a report of which actual values in a table occur in various percentage bins. For example, suppose we have a table of 100 invoices where 60 of the invoices have a Total of $1.98 and 40 of the invoices have a Total of $13.50. The results for Percentile using percents of 0, 25, or 50 will all be $1.98. The median, that is, a Percentile using 50, will be $1.98, and not some interpolated value between $1.98 and $13.50.
Why? Because half of the records in the table have a Total value less than or equal to $1.98. No record has an interpolated value: every Total is either $1.98 or $13.50.
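A small sketch of the same effect, using an inline table of four invoices, three for $1.98 and one for $13.50:
SELECT Percentile(Total, 50) FROM
(VALUES
(1.98), (1.98), (1.98), (13.50)
AS (Total))
The result is 1.98, an actual value from the table, not an interpolation.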
Because Percentile reports the percentages of actual values in tables that occur in different percentage bins, Percentile can be used with any data type that can be ordered, that is, sorted.
For example, family names in English can be ordered lexicographically from A to Z, with family names beginning with D, such as Davis, Denison, or Dodge, being lower than family names beginning with E, such as Edwards or Evans. If we have a table with very many names and we want to see what the median family name is in that table, we could use Percentile with a percent value of 50 to report the family name for which half the family names in the table would be "lower than or equal" to that name.
Why might we want to do that? Suppose we want to process very many people, for example, for picking up ID tags at a marathon or other athletic event, and we wanted to set up tables for picking up ID tags by family name. We could use Percentile to split up names into various bins such that each table, using bins such as from A to F, from F to L, and so on, would have approximately the same number of participants.
Since last names are not evenly distributed from A to Z, using Percentile to set up tables by even percentile bins would help avoid a situation where some tables had very few tags to pick up while others had too many for staffers to handle quickly.
For example, suppose we have a table with 30000 records called TaxParcel that contains a field called Last Name that gives the last names of owners of tax parcels in Naperville, Illinois. We would like to set up four tables at a city event where each owner of a tax parcel can consult with a clerk operating a terminal, and we would like to have the likely number of owners equally split between tables.
To find the "lowest" last name from A to Z, we can use:
SELECT Percentile([Last Name], 0)
FROM [TaxParcel];
The result is Aabram.
To find the 25th percentile value, we use:
SELECT Percentile([Last Name], 25)
FROM [TaxParcel];
The result is Farrington.
For the 50th percentile value, we use:
SELECT Percentile([Last Name], 50)
FROM [TaxParcel];
The result is Ledvora.
For the 75th percentile value, we use:
SELECT Percentile([Last Name], 75)
FROM [TaxParcel];
The result is Roberts.
To find the "highest" last name from A to Z, we can use:
SELECT Percentile([Last Name], 100)
FROM [TaxParcel];
The result is Zywiec.
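The same five boundary values can be fetched in a single pass, since a SELECT list may contain several aggregates:
SELECT Percentile([Last Name], 0), Percentile([Last Name], 25),
Percentile([Last Name], 50), Percentile([Last Name], 75),
Percentile([Last Name], 100)
FROM [TaxParcel];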
To set up our four tables so each has approximately the same number of people to service, we arrange them so the first table serves last names from Aabram to Farrington, the second from Farrington to Ledvora, the third from Ledvora to Roberts, and the last from Roberts to Zywiec. If we do not want to double up the same name for the end of one table and the beginning of the next, we could simply start the next table with whatever name comes after the name of the prior table.
Note that the distribution of names is not the same as simply dividing the letter order from A to Z into four parts. Roberts, for example, is not in the middle of the alphabet order from N to Z. This example is based on actual data from Naperville, taken from the Naperville GDB tutorial data set published by Esri.
2 billion / 1 billion record limitations - Median, Diversity, DiversityIndex and Major are currently limited to 2 billion values, as a practical matter limiting their use to aggregates of no more than 2 billion records. Corr, Covar and CovarPop are currently limited to 1 billion pairs of values. Future builds will remove those limits. StDev, StDevPop, Var and VarPop were also limited to 2 billion values in the past, but no longer have that limit.
Median aggregate - The behavior of Median on an even number of values is an implementation choice. Consider taking the median of two integers, 1 and 2. What is the median? Manifold 8 selects the lower of the pair of central values. Manifold 9 selects the higher value, primarily for historical reasons. Manifold code could be adjusted to select the lower value as 8 does, simply to avoid breaking compatibility for little reason; that might happen in future builds. However, it could be that what is really needed here is a variant of Median that takes an average of the two central values.
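A quick illustration using an inline table, with the result following the Manifold 9 behavior described above:
SELECT Median(v) FROM
(VALUES
(1), (2)
AS (v))
The result is 2, the higher of the two central values.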
Nulls not skipped - First and Last do not skip NULLs because if they did, a construction like
SELECT First(a), First(b)
could return values from different records.
Unused arguments - A function that does not use one of its arguments will not fail to compile when passed an aggregate. Instead, the aggregate is optimized away and is not computed. For example:
FUNCTION f(@t NVARCHAR) INT32 AS 5 END;
SELECT name, f(Max(property))
FROM mfd_meta GROUP BY name;
Will compile and run OK.
Streaming mode - Aggregates operate in streaming mode, spilling excess data to disk if required. This allows handling bigger amounts of data at predictable speed and also protects against exhausting memory when the system is low on memory. That protection is not guaranteed, as it is still possible to run out of memory, but streaming makes doing so much more difficult.
GROUP BY, COLLECT, and SPLIT Tutorial