GROUP BY

The GROUP BY  clause can be used within SELECT statements that contain an aggregate function.   The GROUP BY clause creates subsets of records to feed into the aggregate function.  

 

The GROUP BY clause is often used together with the COLLECT statement and the SPLIT clause.  After reviewing this topic and the COLLECT and SPLIT topics, the GROUP BY, COLLECT, and SPLIT Tutorial topic will help explain how those three SQL constructs work together.  

 

GROUP BY is not some sort of sorting mechanism for results.  Instead, it is a helper clause that bundles and feeds groups of records into some other aggregate function.  Attempting to use GROUP BY without an aggregate function is an error, with the exception being that we can SELECT the field used for the GROUP BY without using an additional aggregate.

 

GROUP BY <field>, ..., HAVING <condition>

 

<field> -  When used with SELECT each <field> can be an expression.  

 

GROUP BY takes a table, bundles the records within that table into groups based on the specified <field> and then produces one record for each group by applying the aggregate function to derive that one record for the group.    If just the field used for the GROUP BY is used in the SELECT with no aggregate function, we get a list of the distinct occurrences of that field.

 

GROUP BY can be confusing for two reasons:  

 

 

 

Because GROUP BY is a "feeder" for aggregate functions, to understand GROUP BY we must understand aggregate functions.

Aggregate Functions and GROUP BY

The English word "aggregate" means "to combine" or "to put together."  Aggregate functions take a set of records and based on the values in those records produce a single resulting record.    

 

For example, the aggregate function Sum(<field>) takes all the values for <field> in the set of records that is fed to Sum and produces a single record where <field> contains the sum of those values.   If we feed all of the records in a table to Sum it returns the sum total of the indicated field for all of the records.  If we feed a  subset of records from a table to Sum it returns the sum total of the field for that subset of records.

 

Suppose we have a table called Expenses.   For each record the table has three fields: Vendor, Job and Payment.    The table records the amount of payments made to different vendors for various categories of jobs in a construction project, such as carpentry (wood work) or masonry (work with brick and stone).  Categories typically have more than one vendor who provides that job.

 

If our table has many records with a field called Payment then Sum(Payment) applied to a set of records from the table will return a single record where the Payment field contains the sum of payments in the set.  

 

The simplest case is to apply Sum to all records in the table:

 

SELECT Sum(Payment) FROM Expenses;

 

 

The result is a table with one record that gives the Sum of all payments.    That is why Sum is called an aggregate function: it combines many records into a single record.

 

More interesting would be to see the sum of payments for different categories of jobs.  For that we can use GROUP BY to create subsets of records for each category of job:

 

SELECT Job, Sum(Payment) FROM Expenses GROUP BY Job;

 

 

The result is a table where each record is a job category together with the total sum of payments made in that category.   

 

To create that table, the Manifold engine first applied  GROUP BY to look through the entire table and to arrange all the records into different bundles, that is, groups, where all the records in each group had the same Job category.  Each bundle was then processed by the Sum aggregate function to get the sum of Payment in that group.    

 

So, for example, GROUP BY scanned the table and found all records with a Job description of Carpentry, and put those into a Carpentry group.  It then fed that group of Carpentry records into Sum(Payment) to get the sum total of all payments for Carpentry.   That became the first record in the results table.   GROUP BY then found all records with a Job description of Electrical and fed that bundle of records into Sum(Payment) to get the sum total of all payments for Electrical, and so on.  

 

 If we do not like the system-provided name for the result field, we can use AS to name the result whatever we like:

 

SELECT Job, Sum(Payment) AS Total FROM Expenses GROUP BY Job;

 

 

The above query, as it expands, shows one effect that often confuses beginners:  The longer the query gets and the more SQL stuff that appears within the query between the aggregate function at the beginning of the query and the GROUP BY far away at the end of the query, the easier it is to forget that the GROUP BY works together with the aggregate function.  The two are inseparable, despite the mass of SQL stuff between them.    

 

It would be easier for beginners if SQL syntax kept the GROUP BY together with the aggregate function in a construction such as:

 

SELECT Job, {Sum(Payment) GROUP BY Job} AS Total FROM Expenses;

 

Alas, that is not how SQL syntax works and there are no magic curly { } brackets in Manifold.    We just need to remember that despite coming last the GROUP BY is evaluated first and applies to some aggregate function that appears earlier in the statement.

HAVING

The HAVING qualifier works for GROUP BY similar to how WHERE works for SELECT.  The HAVING condition is evaluated for each record that would participate in the GROUP BY.  If the HAVING condition evaluates to True the record participates in the GROUP BY, otherwise it does not participate.  The condition in HAVING is evaluated in the same context as fields in the FROM list.   It can, and often does, use aggregates.

 

SELECT Job, Max(Payment) FROM Expenses

  GROUP BY Job HAVING Job = 'Masonry';

 

 

In the above, the HAVING condition caused GROUP BY to send only records for Masonry jobs to the Max aggregate.  The Max aggregate picks out the largest payment and reports that payment.

 

SELECT Job, Payment FROM Expenses

  WHERE Job = 'Masonry';

 

By taking a look at a similar WHERE condition for SELECT we can see by analogy how HAVING operates with GROUP BY.  Looking only at those records in the Expenses table for Masonry jobs, we see that the largest of them has a payment of 58, as was reported by the Max function in the GROUP BY with HAVING example.

Criteria

The criteria for a GROUP BY clause can be either fields or expressions with no limit on the number of criteria. All fields used in expressions in the SELECT list must be either part of a GROUP BY clause or appear inside of an aggregate.

 

Grouping by an expression and then using the same expression verbatim in the SELECT list by design does not work.   For example, not allowed are constructions such as

 

SELECT f+1+2 FROM t GROUP BY f+1+2;

 

or

 

SELECT f+(1+2) FROM t GROUP BY f+1+2;

 

or

 

SELECT 1+2+f FROM t GROUP BY f+1+2;

 

Note that because the Manifold query engine allows replacing a table with a nested SELECT anywhere, without restrictions, in cases such as the above disallowed examples we can simply replace t with a nested SELECT that will  compute the desired grouping criteria as a field.

 

Implicit grouping is supported. For example, we can write

 

SELECT Max(f) FROM t;

 

...using the Max aggregate function.

 

Examples

Following are a series of examples with comments to provide discussion on GROUP BY and the use of aggregates.

 

SELECT name,

(SELECT mfd_id FROM mfd_root

WHERE mfd_root.name = mfd_meta.name) + sum(mfd_meta.mfd_id)

FROM mfd_meta GROUP BY name;

 

In the above example there are two SELECT statements with the outer SELECT doing grouping and the inner SELECT not doing grouping.  The aggregate belongs to the outer SELECT.

 

The following is not acceptable:

 

SELECT name,

(SELECT mfd_id FROM mfd_root

WHERE mfd_root.mfd_id = sum(mfd_meta.mfd_id))

FROM mfd_meta GROUP BY name;

 

This fails with 'sum': Unexpected aggregate. The aggregate is in the WHERE section of the inner SELECT so it should belong to the inner SELECT but the inner SELECT does no grouping, hence the failure.

 

The following is not acceptable:

 

SELECT name,

(SELECT mfd_id + sum(mfd_meta.mfd_id)

FROM mfd_root

WHERE mfd_root.name = mfd_meta.name)

FROM mfd_meta GROUP BY name;

 

This fails with: 'sum': Unexpected aggregate. The logic is the same as above: the aggregate is in the SELECT list of the inner SELECT so it should belong to the inner SELECT but the inner SELECT does no grouping, hence the failure.

 

The following is not acceptable:

 

SELECT name, mfd_id

FROM mfd_meta GROUP BY name;

 

This, of course, fails with: 'mfd_meta.mfd_id': Field should be either part of GROUP BY or part of aggregate.

 

The following is not acceptable:

 

SELECT name,

(SELECT type FROM mfd_root

WHERE mfd_root.mfd_id = mfd_meta.mfd_id)

FROM mfd_meta GROUP BY name;

 

This fails more interestingly with: 'mfd_meta.mfd_ID': Field should be either part of GROUP BY or part of aggregate of higher-level SELECT.

 

SELECT name, sum(mfd_id)

FROM mfd_meta GROUP BY name;

 

The above works, as expected.

 

SELECT name,

sum((SELECT StringLength(name)

FROM mfd_root

WHERE mfd_root.mfd_id = mfd_meta.mfd_id))

FROM mfd_meta GROUP BY name;

 

This works, too. The aggregate goes through different values of mfd_meta.mfd_id for each group with the same mfd_meta.name and does a SELECT for each value of mfd_meta.mfd_id to return a table.   If the table contains a single record it is converted to the value of that record and the resulting values are summed.

 

SELECT name,

sum((SELECT sum(mfd_id+10)

FROM mfd_root

WHERE name = mfd_meta.name GROUP BY name))

FROM mfd_meta GROUP BY name;

 

This works as well, providing an example of nested groups.

 

The following is not acceptable:

 

SELECT name,

(SELECT result FROM

CALL ValueSequence(sum(mfd_meta.mfd_id),  sum(mfd_meta.mfd_id), 1))

FROM mfd_meta GROUP BY name;

 

This fails because it seems unnatural that the same SELECT can have aggregates that use groups both for that SELECT (for example, in the SELECT list) as well as groups for some outer SELECT (for example, in the FROM section).

 

If we want to do something like the above, we can simply use a function:

 

FUNCTION f(x INT32) INT32 AS

(SELECT result FROM CALL ValueSequence(x, x, 1))

END;

 

SELECT name, f(sum(mfd_meta.mfd_id))

FROM mfd_meta GROUP BY name;

Other examples

A function that makes no use of one of its arguments will work correctly when passed an aggregate.   The aggregate is optimized away and is not computed.   An example:

 

FUNCTION f(t NVARCHAR) INT32 AS 5 END;

SELECT name, f(Max(property)) FROM mfd_meta GROUP BY name;

 

A function with a SELECT with GROUP will also work correctly.   An example:

 

FUNCTION f(t TABLE) INT32 AS

  (SELECT Count(*) FROM t

   WHERE StringStartsWith(property, 'Path')

   GROUP BY TRUE) END;

SELECT name, f((COLLECT property)) FROM mfd_meta GROUP BY name;

 

See also the COLLECT topic for examples using GROUP BY.

 

See also the Notes at the end of the SQL Example: Learn SQL from Edit Query - Merging Areas topic for a classic use of GROUP BY

 

See an example in the Find Percentages of Open Space in ZIP Code Areas  video.

 

See the examples in the GROUP BY, COLLECT, and SPLIT Tutorial topic.

Notes

No brackets - In this topic to reduce visual clutter we do not use square [ ] brackets around simple field and table names.    Square brackets are optional if the name is unambiguous.

 

Order of execution -  When we described above how GROUP BY scanned a table to find all carpentry records and then fed those as a bundle into the Sum function to produce the first record of the results table, that is a conceptual description to help us understand that the job of GROUP BY is to work as a bundler and feeder for some aggregate function.   What actually happens inside the query engine is more complex, but the net result is what the conceptual narrative describes.

 

HAVING and constant conditions - Using HAVING with a constant filter only evaluates the filter once.  "Constant" here means "not changing for the duration of the corresponding GROUP." Such "constants" can be significantly complex expressions in the case of nested queries, frequently involving fields from outer queries as well as fields from the inner query, and not just those fields by which the GROUP BY corresponding to the HAVING iterates.   If the condition includes both constant and non-constant parts, HAVING will still evaluate constant parts only once, for improved performance.

Videos

Find Percentages of Open Space in ZIP Code Areas  -  Uses GROUP BY:  Given a layer of polygons representing ZIP codes and a layer of polygons showing open spaces like parks and green spaces, find the percentage of open space in each ZIP code area.   This video shows how to do that start to finish in a few simple steps, from initial importing of shape files to final results, in just five minutes, with an additional six minutes of explanation what each step does.  Works in Manifold Release 9 or using the free Manifold Viewer.

See Also

Queries

 

Command Window

 

Query Builder

 

Aggregates

 

SQL Statements

 

COLLECT

 

ORDER BY

 

SPLIT Clause

 

GROUP BY, COLLECT, and SPLIT Tutorial

 

SQL Example: Learn SQL from Edit Query - Merging Areas - We learn how to write an SQL query that does a Merge : area (dissolve) operation by cutting and pasting from what the Edit Query button automatically generates.