Sub-Queries

This chapter shows how to use correlated and uncorrelated sub-queries.

Nested queries are a very powerful feature of SQL. It means, essentially, that we can include SQL queries within other SQL queries. Let's make that more concrete by a simple example. Assume we want to find all students with the maximal GPA. We could sort students by their GPA (starting with the highest GPA numbers) and select the first few students. However, it is not clear how many students with maximal GPA exist in the database. That motivates a different approach. Nested queries offer an elegant solution:

SELECT * FROM Students WHERE GPA = (SELECT Max(GPA) FROM Students);

This solution composes two SQL queries. The outer query is the following:

SELECT * FROM Students WHERE GPA = [Inner-Query]

It selects all students who satisfy the condition in the Where clause. This condition compares the GPA to the result of the inner query. The inner query, in turn, is the following:

SELECT Max(GPA) FROM Students;

This query uses an aggregate to calculate the maximal GPA over all students. This number is used to filter students in the outer query, only keeping the students who have the maximal GPA. Note that both, the inner and outer query, refer to the same table (Students).

The inner query (we also call the inner query a sub-query, relative to the outer query) can appear in various query clauses. In the example query above, the inner query appears in the Where clause, as part of a predicate. However, SQL is quite flexible in terms of where sub-queries appear. For instance, we could use a sub-query in the From clause as well:

SELECT * FROM (SELECT * FROM Students WHERE GPA > 3.0);

This example is not a particularly useful application of sub-queries, admittedly (we can easily write an equivalent query without any sub-queries). However, it demonstrates that sub-queries can be used in the From clause as well. The query above is processed on PostgreSQL, starting from Version 16. For earlier versions of PostgreSQL, as well as for some other database management systems, you will have to add an alias for the sub-query. While base tables in the database have names, the table resulting from a sub-query does not have a name a-priori. By requiring you to specify an alias for the sub-query result, the database system makes sure that each item in the From clause has a name (which can be referenced in the rest of the query). The following query variant uses an alias and executes on earlier versions of PostgreSQL as well:

SELECT * FROM (SELECT * FROM Students WHERE GPA > 3.0) AS StudentsWithHighGPA;

Here, we assign the alias StudentsWithHighGPA to the result of the sub-query. Sub-queries are not restricted to the From and Where clauses. In general, sub-queries can substitute any SQL expression, as long as the expression type is consistent with the sub-query result type. For instance, items in the From clause generally corresponds to SQL tables. As any SQL query results in a table, any sub-query can be used as an item in the From clause. However, the previous example (retrieving students with maximal GPA) is more restrictive. Here, the sub-query appears in a binary comparison (equality). On the left-hand side of that expression is a reference to the GPA column, i.e., single numbers. Both operands of the equality comparison must have the same type. This means, as the left-hand side is a number, we need a (scalar) number on the right-hand side as well (not a general table). The sub-query satisfies this constraints as it calculates one single aggregate (the maximal GPA). As the sub-query does not use any grouping, it results in a "table" with one single row and one single column. This type of result is indeed compatible with a number. For instance, the following query variant does not work:

SELECT * FROM Students WHERE GPA = (SELECT Max(GPA) FROM Students GROUP BY name);

PostgreSQL 16, for instance, returns the following (quite explanatory) error message when trying the query above:

ERROR:  more than one row returned by a subquery used as an expression

This is indeed the problem: the sub-query may now return not one but multiple aggregate rows, due to the presence of the Group By clause. Note that this error appears independently of the actual number of students in the database. While having one single student in the database would imply one single result row, the query is still considered as incorrect since it does not intrinsically enforce a restriction to one single result row.

To find out the student with maximal GPA, we have used a predicate that compares the GPA column to one single number (the maximal GPA). With SQL, we can write an equivalent query in a different way. The ALL operator can be used to perform a comparison between one value and all values from a given set. The operation evaluates to true if, and only if, each single comparison evaluates to true. For instance, instead of selecting students whose GPA equals the maximum GPA, we can select students whose GPA is greater or equal to the set of GPAs of all students. The following query follows those principles:

SELECT * FROM Students WHERE GPA >= ALL(SELECT GPA FROM Students);

Here, the ALL operator is used to compare the GPA of the currently considered student to the GPAs of all other students, i.e., the result of the sub-query.

The ANY operator relates to the ALL operator. While the ALL operator evaluates to true only if the comparison holds for all available values, the ANY operator evaluates to true if the comparison evaluates to true for at least one single value. Consider the following query:

SELECT * FROM Students WHERE GPA > ANY(SELECT GPA FROM Students);

Here, the GPA of the current student is, again, compared to the GPAs of all students. The ANY operator evaluates to true if there is at least one student whose GPA is below the currently considered student. In other words, this query retrieves all students except for the students with minimal GPA (as, for those students, we cannot find any other students whose GPA is strictly smaller).

So far, all sub-queries we considered have the property that we could have executed them independently, all by themselves, without raising any errors. There is nothing within the sub-queries that connects them to the surrounding query. Hence, they can "stand" by themselves. We also call such queries uncorrelated sub-queries. Next, we discuss the complement: correlated sub-queries.

A correlated sub-query is connected to the surrounding query. Typically, the correlated sub-query cannot be executed all by itself. Due to references to the surrounding query, trying to execute the correlated sub-query alone would raise errors. Let's make it more concrete by an example. For instance, let's say we want to find students who did not register for any courses yet. Correlated sub-queries offer an elegant solution:

SELECT * FROM Students S WHERE NOT EXISTS (SELECT * FROM Enrollments E WHERE S.studentID = E.studentID);

The outer query selects students for which no enrollment entries exist:

SELECT * FROM Students S WHERE NOT EXISTS ([Correlated-Subquery]);

The goal of the [Correlated-Subquery] is to retrieve enrollment entries that are associated with the student currently examined in the outer query. Since the sub-query refers to the current student in the outer query, it is a correlated sub-query. The NOT EXISTS operator checks whether or not the result of the query in parentheses (the correlated sub-query in this case) is empty. If it is empty, the NOT EXISTS operator returns true (otherwise false). Only if the sub-query, retrieving associated enrollment entries, is empty, the NOT EXISTS operator returns true and the current student is included in the query result.

The correlated sub-query is the following:

SELECT * FROM Enrollments E WHERE S.studentID = E.studentID

Trying to execute this query in PostgreSQL (or in any other database management system) will raise an error. More specifically, the system will complain that S is not defined. This makes sense since S is the alias for the Students table that was, however, introduced in the outer query. S is not defined in the sub-query, leading to the error.

Conceptually, the query is evaluated as follows. Starting from the outer query, we are iterating over the rows of the Students table (with alias S). For each row in that table, we evaluate all applicable predicates in the Where clause. More precisely, we evaluate the NOT EXISTS operator for each row. To evaluate this operator, we must check whether the result of the sub-query is empty or not. The sub-query retrieves entries from the Enrollments table. It filters out all retrieved rows, except for the ones that satisfy the predicate in the Where clause. This predicate (S.studentID = E.studentID) refers to the studentID of the currently selected row in S (referring to the Students table in the outer query). Only if the studentID matches, i.e., we find enrollment rows associated with the current studentID, will we keep the associated row. This means the sub-query result is empty if no enrollment rows are associated with the current student. Such students are included in the result of the outer query. The database management system only shows the results of the outer-most query.

The complement to the NOT EXISTS operator is, of course, the EXISTS operator. It refers to a sub-query and evaluates to true if the result of the sub-query is non-empty. For instance, we can find students who have enrolled for at least one course by using this operator:

SELECT * FROM Students S WHERE EXISTS (SELECT * FROM Enrollments E WHERE S.studentID = E.studentID);

The only difference to the prior query is the use of EXISTS, rather than NOT EXISTS. Therefore, the query selects the complement of students, compared to the result of the previous query. Instead of selecting students without enrollment entries, we only select students with at least one enrollment entry. To see one more variant, we can select students who are enrolled in at least two courses with the following query:

SELECT * FROM Students S WHERE (SELECT Count(*) FROM Enrollments E WHERE S.studentID = E.studentID) >= 2;

Here, we use a correlated sub-query that returns a single number. Due to that, we can compare the result of the sub-query to a number (2) and retrieve only the students with at least two enrollments.

Finally, note that sub-queries can have sub-queries as well. This means the number of nesting levels is, a-priori, unlimited. As a final example, let's assume we want to find courses in which all students are enrolled. As we will see next, this problem can be solved using two nesting levels.

To find courses in which all students are enrolled, we need to find courses such that there does not exist a student who is not enrolled. More precisely, for each course, we need to ensure that no student exists for which no enrollment entry exists. Note the fact that this sentence uses "not exists" twice. Indeed, we will see that this problem can be solved using two nesting levels and two instances of the NOT EXISTS operator. Consider the following SQL query:

SELECT * FROM Courses C WHERE NOT EXISTS (
	SELECT * FROM Students S WHERE NOT EXISTS (
		SELECT * FROM Enrollments E WHERE NOT EXISTS
			C.courseID = E.courseID AND 
			S.studentID = E.studentID
	)
);

The outermost query has a sub-query, connected via the NOT EXISTS operator. In turn, the latter sub-query has a sub-query itself, again connected via the NOT EXISTS operator. In the outermost query, we consider all courses. We only add courses to the query result if the next-inner sub-query (called middle query from now on) returns an empty result. This next-inner sub-query considers all students and keeps only the one for which the innermost sub-query returns an empty result. This innermost sub-query analyzes the Enrollments table and retrieves enrollments that refer to the current student (in the middle query) and the current course (in the outermost query). If the innermost sub-query has a non-empty result then there is an enrollment for the current course and student. Hence, the student is not selected in the middle query. If no student has missing enrollment entries for the current course, the result of the middle query is empty. If so, the associated course is included in the result of the outer query.