Aggregates, Grouping, Sorting

This chapter shows how to calculate aggregates, partition data, and sort rows.

In many scenarios, we are not interested in specific entries in a database. Instead, we want to derive insights about a data set as a whole. SQL offers various functions to calculate aggregates over large numbers of rows. More precisely, SQL supports five aggregation functions:

COUNT counts the number of rows.
MIN calculates a minimum value over all rows.
MAX calculates a maximum value over all rows.
SUM sums up values over all rows.
AVG calculates an average over all rows.

We start by discussing the Count function. In the simplest case, we are interested in counting all rows that remain after performing joins and applying all filter conditions in the Where clause. To clarify that we want to count all rows (instead of only counting rows satisfying certain conditions) we use the Count function with an asterisk parameter (Count(*)). Let's make it more concrete by an example. We might be interested in counting the number of students enrolled for a specific course, e.g., the course with ID 5. The following SQL query counts the number of rows in the Enrollments table satisfying this condition:

SELECT Count(*) FROM Enrollments WHERE courseID = 5;

We use the Count aggregate in the Select clause. The result of this query will contain only one single column with the count. As the Count aggregates all input rows into a single number, the query result also has only one single row. It is important to keep in mind the number of rows associated with different items in the Select clause. All items in the Select clause must result in the same number of rows. Otherwise, our query result would not correspond to a valid table (which does not allow having different columns with a different number of rows). Concretely, this means, as we're using an aggregate that condenses the entire input into one single row, we cannot add items in the Select clause that do not have this property. For instance, the following query would be invalid:

SELECT Count(*), courseID FROM Enrollments WHERE courseID = 5;

The query is invalid since it mixes, in the Select clause, an aggregate (which always results in one single result row) with a reference to the courseID column (for which the number of result rows depends on the number of input rows). Trying to issue this query will raise an error in the database management system. It does not correspond to valid SQL. On the other hand, adding multiple aggregates as Select items is valid since each aggregate must result in one single result row.

Sometimes, we do not want to count all rows but only rows that satisfy certain conditions. Of course, we can always specify those conditions in the Where clause. Alternatively, the Count aggregate offers convenient options to restrict the count to columns that have no NULL values in certain columns. Simply substitute the asterisk (*) with the name of the corresponding column. For instance, the following query counts the number of courses for which a name was already entered in the database (as opposed to a NULL value):

SELECT Count(name) FROM Courses;

As an additional option, we might be interested in counting not all rows but the number of distinct values in a certain column. For instance, we might be interested in counting the number of distinct students who already enrolled for courses (perhaps to compare that number to the total number of students in the database). The following SQL query accomplishes that:

SELECT Count(DISTINCT studentID) FROM Enrollments;

By adding the DISTINCT keyword before the column name, we count the number of rows with distinct (and not NULL) values in the studentID column.

For maximum, minimum, sum, and average aggregates, we must specify an expression used for aggregation. In the simplest case, this expression corresponds to the name of a column (note that, for average and sum, that column must contain numbers). For instance, we might be interested in retrieving statistics on the GPA over all students. The following query retrieves, at the same time, the minimum, maximum, and average GPA:

SELECT Min(GPA), Max(GPA), Avg(GPA) FROM Students;

Instead of single columns, we can also aggregate over more complex expressions. For instance, let's assume we want to calculate all of the aforementioned aggregates for a GPA on a 4.3 scale, assuming that database entries are consistent with a 4.0 scale. We can achieve that by multiplying all GPAs by the factor 1.075. This multiplication can happen directly in the aggregates, leading to the following query:

SELECT Min(GPA*1.075), Max(GPA*1.075), Avg(GPA*1.075) FROM Students;

NULL values are not considered for aggregation (except for the Count(*) aggregate which counts the number of rows). In the query above, if the GPA of a student is missing (i.e., set to NULL), the aggregates are calculated while discarding the corresponding row. If the Min, Max, Avg, or Sum aggregates are applied to an empty table (or to one in which values for the relevant columns are set to NULL for each row) the aggregates evaluate to NULL. The Count aggregate will return the value 0 instead.

Calculating aggregates over the entire data is not always what we want. Sometimes, we want more fine-grained information. SQL offers various features to partition data into row groups and calculate aggregates for each of those groups separately. We use the GROUP BY clause in order to group rows. After the GROUP BY keyword, we specify how groups should be formed. In general, SQL forms row groups by grouping all rows together for which one or multiple expressions evaluate to the same value. In the simplest case, those expression simply correspond to column references. Let's make all of this more concrete by an example. For instance, assume we want to count the number of enrolled students for each course (we simplify by assuming that each course has at least one enrolled student). The following SQL query accomplishes that:

SELECT Count(*), courseID FROM Enrollments GROUP BY courseID;

The query works with the Enrollments table, containing pairs of students and courses representing enrollments. The query groups rows from that table based on the courseID column. Then, as specified in the Select clause, we calculate row counts for each of those groups. Without grouping, aggregates aggregate all data. With grouping, aggregates are calculated for each group. As before, using aggregates in the Select clause implies constraints on which other items can appear in the Select clause. Without grouping, only other aggregates may appear in the Select clause since only those are guaranteed to return the same number of rows (i.e., one row). With grouping, we have more choices. Aggregates return one single value for each of the groups. Hence, we can combine them with other expressions in the Select clause that are also guaranteed to return only one single value per group. This applies to other aggregates but it also applies to the columns used for grouping! The query above exploits that fact and returns not only counts but also the associated course IDs. This makes sense, of course, since reading only counts, without any course attribution, is not very informative. In summary, the result of the query above contains two columns with the course ID and the associated enrollment counts.

To take another example, let's assume we want to count not the number of enrollments per course but the number of enrollments per student. Previously, we assumed that each course has at least one enrolled student. Now, however, we will not assume that each student is enrolled in at least one course. Some students may not have chosen their courses yet. We will not find references to those students in the Enrollments table. Instead, we have to combine information in the Students and Enrollments table. To count enrollments per student, we have to match students with their enrollments. At the same time, to account for students without enrollments, we have to keep students without matching rows in the Enrollments table. This is precisely a use case for an outer join operation. The following query counts the number of enrollments per student:

SELECT S.studentID, Count(courseID) FROM Students S LEFT OUTER JOIN Enrollments E ON (S.studentID = E.studentID) GROUP BY S.studentID;

Using a left outer join with the Students table on the left-hand side, the query matches students with their enrollments but preserves unmatched students in the join result. For unmatched students (and only for those) the courseID column is set to NULL since no corresponding information is available in the Enrollments table. Then, we group by the studentID column. The query result contains two columns: the student ID and the associated enrollment count. It may be tempting to use the Count(*) aggregate here but this approach does not lead to an accurate result. With Count(*), we count any row. Specifically for students without associated enrollments, that means we would still count the single, associated row in the join result. That, however, is incorrect: students without enrollments should have a count of zero. Using the Count(courseID) aggregate instead ensures that only rows with a non-NULL value in the courseID column are counted. Those are precisely the rows that correspond to actual enrollment entries.

We have seen that we can filter rows using the Where clause. Sometimes, we want to filter not single rows but row groups (i.e., after grouping rows). For that, we can use SQL's HAVING clause. The HAVING clause can only appear in combination with a GROUP BY clause. The HAVING clause (which comes after the GROUP BY clause) is followed by a condition that is evaluated on row groups. Hence, the condition must be consistent with the GROUP BY specification. For instance, the condition cannot refer to columns that have different values across different rows in a group (since that would imply a condition that cannot be evaluated on the group as a whole). On the other hand, the condition following the HAVING clause can refer to columns for which all rows in a given group are guaranteed to have the same value (in particular columns that are explicitly referenced in the GROUP BY clause) or to aggregates (which also evaluate to a single value for the group as a whole).

For instance, let's assume we want to count enrollments for courses but want to restrict our scope to large courses with at least 100 enrolled students. The following SQL query accomplishes that using a HAVING clause:

SELECT courseID, Count(*) FROM Enrollments GROUP BY courseID HAVING Count(*) >= 100;

This query first groups enrollment entries by the course ID, then filters groups to the ones having at least 100 rows (using the HAVING clause), and finally selects the courseID column and the enrollment count for the query result.

Another important SQL feature for data analysis is the capability to sort query results according to various criteria. This is accomplished using SQL's ORDER BY clause. The ORDER BY clause has generally the following structure:

ORDER BY [Sort-Item-List]

The [Sort-Item-List] is a comma-separated list of items used to sort rows. In the simplest case, each item is a column reference. In that case, SQL uses the standard sort order for the data type associated with the column (e.g., alphabetical order for text columns). Otherwise, sort items can correspond to arbitrary expressions, referencing one or multiple columns as well as constants. By default, items are sorted in ascending order. To reverse the ordering for specific sort items, users can append the suffix DESC (short for "descending") to the corresponding item. If multiple sort items are specified, SQL prioritizes items that appear earlier in the list (i.e., the order in which sort items are specified does matter). More precisely, when comparing rows to determine which one should be ordered first, the database system starts comparisons using the first item in the sort list. Only if both rows have the same value for that item, it considers the next item in the list. In general, an item in the sort list is only considered if values for all previous items have been the same when comparing two rows.

Let's make this more concrete using an example.

Assume we want to sort courses by the number of enrollments. For courses with the same number of enrollments, we want to sort them by the course name (in alphabetical order). The following query accomplishes that:

SELECT C.courseID, Count(*) FROM Enrollments E NATURAL JOIN Courses C GROUP BY C.courseID ORDER BY Count(*), C.name;

Note that the ORDER BY clause follows the GROUP BY (and HAVING) clause. If a GROUP BY clause is present, the ORDER BY clause refers to sorting of the resulting groups, i.e., it must reference expressions that are defined for groups as a whole (e.g., we cannot reference columns whose values may differ across multiple rows in a group).

The ORDER BY clause is often combined with the LIMIT clause, enabling users to limit the number of result rows. For instance, by expanding the query above with a limit clause, we can retrieve the five courses with the highest enrollments counts (or fewer courses if the database does not contain five courses):

SELECT C.courseID, Count(*) FROM Enrollments E NATURAL JOIN Courses C GROUP BY C.courseID ORDER BY Count(*), C.name LIMIT 5;

Taken together, grouping, aggregation, and sorting often enable users to extract interesting insights from various data sets. In the next chapter, we discuss sub-queries that enable even more complex types of analysis.