GROUP BY, COLLECT, and SPLIT Tutorial

GROUP BY, COLLECT, and SPLIT are often used together in SELECT queries that work with data based on subgroups of interest.   This topic provides a tutorial discussion to help knit together concepts presented in the topics for those three SQL constructs.   

 

The GROUP BY clause tells the overall query the data is to be handled in subgroups, using aggregate functions that process each subgroup in some way.   The COLLECT statement provides a general purpose way to use an SQL expression as an aggregate to process each subgroup into a results table.  The SPLIT clause converts the result table from the COLLECT statement into fields that can be processed by the SELECT statement.    

 

Although GROUP BY comes last in a SELECT statement, it is the first clause to take effect: 

 

 

 

If we have a SELECT statement of the form

 

SELECT <aggregate function> FROM <table> GROUP BY <field>

 

the first thing that happens is that the GROUP BY creates subgroups of the table, and then the aggregate functions operate on each of the subgroups.  

 

For example, if we have a GROUP BY Jobs clause for our example Expenses table, the GROUP BY clause will create subgroups of the table where each subgroup has records with the same Job values.  It is as if the table were reorganized into separate tables for each job.    The data is not actually split onto separate tables, but the internal processing of the query engine operates as though they were.

 

 

In the original table with fourteen records there are four types of jobs: Carpentry, Electrical, Masonry, and Plumbing.   The GROUP BY therefore on the fly groups the fourteen records into four sub-groups where each sub-group has the same Job value in each record.  Adding up the total number of records in each of the subgroups we still have only fourteen records.   The GROUP BY has not added or deleted any records.  It has only internally organized them into subgroups based on the Job field.

 

Next, each of those four subgroups is independently operated upon by the aggregate function in the SELECT clause.   If the result of the aggregate function is a single row, the result of each such aggregation goes into one of the rows in the final result table.   If we have four subgroups, the resulting result table will have four records in it.   If the aggregate function generates more than one row (as a COLLECT aggregate statement might generate) then there could be more than one row for each subgroup in the result table.

 

 

The aggregate function operates in turn on teach of the subgroups.   For example, if the Carpentry subgroup has five records in it, the result for that subgroup will be whatever is the result of the aggregate function using those five records in the Carpentry subgroup.   The aggregate function then will operate on the two records in the Electrical subgroup, and so on.

 

Suppose the aggregate function we use is Sum(Payment).   That takes the values of the Payment field for all of the records in the subgroup and it sums them to get a total.    We can write a query that SELECTS the Job field as well as the Sum(Payment) aggregate, since the Job field is the field used by the GROUP BY clause:

 

SELECT Job, Sum(Payment)

   FROM Expenses GROUP BY Job;

 

Running that query gives us a results table with four records in it (as expected, one for each subgroup).

 

 

The result column in the results table gives the result of the Sum(Payment) aggregate for each subgroup.   If we manually add up all the Payment values for the five records in the Carpentry subgroup, for example, we see that they add up to a value of 248.    

 

While it certainly is convenient to use the various aggregate functions, like Sum( ), that are provided as built ins, we often would like to use an SQL expression as if it was an aggregate, applying it to each of the subgroups in turn.   

 

For example, we might want to pick out only the two largest Payment values from each subgroup.   There is no built in aggregate to pick out the two largest values from a subgroup of values, but a simple SQL expression can order a list in descending order and pick out the first two values.    The COLLECT aggregate allows us to use such an SQL expression as an aggregate.

 

If we want to pick out the two highest payments in each job category, we can use COLLECT to create an aggregate of our own that uses an SQL expression as follows:

 

SELECT Job, SPLIT  (COLLECT Payment ORDER BY Payment DESC FETCH 2)

   FROM Expenses GROUP BY Job;

 

 

In the above we are using COLLECT to aggregate Payment values within each subgroup using a simple, but useful, SQL expression:

 

COLLECT Payment ORDER BY Payment DESC FETCH 2

 

The COLLECT statement expression uses the Payment field, with an ORDER BY payment in descending order (bigger values first) and using FETCH 2 to take only the highest two values.   It is an aggregate command even though it does not result in a single value, but two records in a table.  The SPLIT clause splits out the field values from the table so they can be used by the SELECT to build the SELECT statement's results table.

 

If we remove the clauses that order the output and fetch only two records, we get even less of an "aggregate", in that it reports all of the Payment values:

 

SELECT Job, SPLIT  (COLLECT Payment)

   FROM Expenses GROUP BY Job;

 

 

Can we use COLLECT without GROUP BY?   Yes, although that misses the point of using COLLECT, since using it without groups is redundant, ending up being the same as a simple SELECT without using COLLECT.

 

SELECT SPLIT  (COLLECT Payment)

   FROM Expenses;

 

 

When a GROUP BY is part of the SELECT, the query engine creates subgroups based on the GROUP BY and applies the COLLECT on each subgroup.

 

SELECT Job, SPLIT  (COLLECT Payment)

   FROM Expenses GROUP BY Job;

 

 

When the SELECT has no GROUP BY, the COLLECT still acts as an aggregate, but it acts on the entire table, since there is no GROUP BY that organizes records into subgroups.  It is as if there was a single, implied subgroup that is the entire table:

 

SELECT SPLIT  (COLLECT Payment)

   FROM Expenses;

 

The use of COLLECT without a GROUP BY is an example of an implied group, the entire table, because using any aggregate in a SELECT list tells the query engine the SELECT is about aggregates even though there is no GROUP BY.  Geared up for aggregates, the SELECT will process the field list insisting that any fields in that list are either the result of an aggregate or are the subject of a GROUP BY.   

 

If there is no GROUP BY, we cannot use any fields but the result of aggregates like COLLECT, Sum, or other aggregates.  For example, since there is no GROUP BY we cannot write

 

-- NOT a valid query

 

SELECT Job, SPLIT  (COLLECT Payment)

   FROM Expenses;

 

That will throw the error message

 

'Expenses.Job': Field should be either part of GROUP BY or part of aggregate.

 

If we want to add a Job field to the output of a SELECT that uses COLLECT but does not use GROUP BY, we can add the Job field to the COLLECT:

 

SELECT SPLIT  (COLLECT Job, Payment)

   FROM Expenses;

 

 

The COLLECT pulls all the values from the entire table, treating the entire table as a subgroup.   But that is what SELECT does without needing to use COLLECT - it also pulls values from the entire table.  Using COLLECT without GROUP BY just accomplishes the same thing as a simple SELECT without using COLLECT:   

 

SELECT Job, Payment  

   FROM Expenses;

 

 

Therefore, while it is possible write valid queries that use COLLECT but which do not use GROUP BY, it does not make sense to do so.

Finding the Top N Values in Each Group

In the examples above we found the top two Payments in each Job group.   Finding the top n values by subgroup is a very common task in both GIS and database.   We may want to find the five largest cities in each country, the ten tallest mountains in each state, or the three highest paid employees by county in a government organization.   Enter

 

SQL find top N values in groups

 

as a search term into a search engine like Google and you will get a seemingly endless variety of pages showing how to do that with various databases, with usually several different ways of solving the problem.  What almost all of them have in common is they are way more complicated than simple and clear queries that use SPLIT, COLLECT, and GROUP BY.

Example

We have a table called Summits that lists all mountain summits in the US.   For each mountain it has a Name, the Elevation in meters, and the State in which the mountain is located.   We would like to find the ten tallest mountains in each state.

 

 

The query for that is simple:

 

SELECT State,

   SPLIT  (COLLECT Name, Elevation ORDER BY Elevation DESC FETCH 10)

   FROM Summits

   GROUP BY State

   ORDER BY State;

 

The result table is a list of US states with the name and elevation of each mountain summit:

 

How the Query Works

Consider the query without the final ORDER BY clause, which simply puts the table in order by state for display.

 

SELECT State,

   SPLIT  (COLLECT Name, Elevation ORDER BY Elevation DESC FETCH 10)

   FROM Summits

   GROUP BY State;

 

The FROM clause says to use the Summits  table as the source of data.  

 

   GROUP BY State;

 

The first thing that happens is the last clause, the GROUP BY.   The GROUP BY tells the SELECT statement to configure itself for working with aggregates on subgroups, and it tells the query engine to organize the one, big table into 51 subgroups by State, one for each of the US states plus a subgroup for the District of Columbia.

 

   COLLECT Name, Elevation ORDER BY Elevation DESC FETCH 10

 

The COLLECT aggregate is then applied to each of the 51 subgroups.   It creates a table from each subgroup by harvesting the Name and Elevation field values, arranging n descending elevation order (highest values first) and then fetching the first 10 results.

 

   SPLIT (  )

 

SELECT works with fields, not tables, so we have to convert the tables that COLLECT creates for each subgroup into fields.  SPLIT does that, extracting the Name and Elevation fields and their values from the COLLECT results, and combining them with the State field value for each subgroup that the SELECT State part of the query acquires from the GROUP BY State clause.

Writing to a Table

If we would like to save the results to a table, we can adjust the query to write:

 

SELECT State,

   SPLIT  (COLLECT Name, Elevation ORDER BY Elevation DESC FETCH 10)

   INTO [Top 10 Summits per State]

   FROM Summits

   GROUP BY State;

 

We cannot use ORDER BY since tables are not ordered.  Therefore, it is a mistake to use ORDER BY when writing records into a table using a SELECT ... INTO statement.   That will throw an error.

Similar Tasks

We can adjust the above query for a very wide range of applications to find the top N or lowest N values of interest by group.  For example, if we want to find the 5 most populated cities in each country we might write a query like:

 

SELECT Country,

   SPLIT  (COLLECT Name, Population ORDER BY Population DESC FETCH 5)

   FROM Cities

   GROUP BY Country

   ORDER BY Country;

 

If we want to find the 20 lowest cost product SKUs (an identification number) in each category in a big online electronics catalog we might write:

 

SELECT Category,

   SPLIT  (COLLECT SKU, Price ORDER BY Price ASC FETCH 20)

   FROM Products

   GROUP BY Category

   ORDER BY Category;

 

We can use and reuse the same pattern of query for many tasks.

 

Notes

No brackets - In this topic to reduce visual clutter we do not use square [ ] brackets around simple field and table names.    Square brackets are optional if the name is unambiguous.

 

Order of execution -  When we described above how GROUP BY scanned a table to find all carpentry records and then fed those as a bundle into the Sum function to produce the first record of the results table, that is a conceptual description to help us understand that the job of GROUP BY is to work as a bundler and feeder for some aggregate function.   What actually happens inside the query engine is more complex, but the net result is what the conceptual narrative describes.

 

Videos

Find Percentages of Open Space in ZIP Code Areas  -  Uses GROUP BY:  Given a layer of polygons representing ZIP codes and a layer of polygons showing open spaces like parks and green spaces, find the percentage of open space in each ZIP code area.   This video shows how to do that start to finish in a few simple steps, from initial importing of shape files to final results, in just five minutes, with an additional six minutes of explanation what each step does.  Works in Manifold Release 9 or using the free Manifold Viewer.

See Also

Queries

 

Command Window

 

Query Builder

 

Aggregates

 

SQL Statements

 

COLLECT

 

GROUP BY

 

ORDER BY

 

SPLIT Clause

 

SQL Example: Learn SQL from Edit Query - Merging Areas - We learn how to write an SQL query that does a Merge : area (dissolve) operation by cutting and pasting from what the Edit Query button automatically generates.