Joins

This chapter discusses the join operator, an operation that combines data from multiple tables.

All queries considered in the last chapter used only one single table in the From clause. However, to gain interesting insights, we often need to combine data from multiple tables. Fortunately, SQL offers various options to analyze data across tables.

The primary operation used to combine data from multiple tables is a join operation. Conceptually, a join between two tables means we perform the following two steps:

We form all possible pairs of rows from the input tables.
We filter row pairs using a condition: the join predicate.

In other words: we find all matching pairs of rows from the two input tables (where a match is defined by the join condition). The result of the join operation combines the columns from both input tables. For instance, think back to our database storing information on Students and their Enrollments in two different tables. We might want to combine enrollment rows with the associated students. This means we want to retrieve all pairs of a row from the Enrollments table (storing pairs of student and course IDs) and the row from the Students table that contains all information about the associated student. This means we want to match each row from the Enrollments table with the row from the students table that has the same value for the student ID. Below are the two input tables with example data.

Enrollments:

studentID	courseID
1	3
2	3
3	1

Students:

studentID	name
1	Marc
2	Isabel
3	Robert

A join between those two tables, requiring matching values in the studentID column (which appears in both tables) results in the following output:

studentID	courseID	name
1	3	Marc
2	3	Isabel
3	1	Robert

Note that the studentID column (which appears in both tables) does not appear twice in the output (depending on the way the join operation is formulated in SQL, redundant columns may or may not be automatically eliminated). How can we perform joins using SQL?

We will see a few different ways of writing join operations in SQL. In many scenarios, we are free to choose between multiple options and it comes down to personal preferences. The join in the example above corresponds to a common special case: we want to match pairs of rows that have the same value in columns that have the same name. This is also called a Natural Join and SQL provides specialized syntax for this common case:

SELECT * FROM Enrollments NATURAL JOIN Students;

Here, we use the keyword Natural Join in the From clause, separating the two tables we want to join. In the example above, we only have a single column with the same name across both tables. In general, we might have multiple pairs of columns with the same name. The natural join finds row pairs that have the same values across all pairs of columns with matching names. Sometimes, we are not interested in matching values across all of these columns. To match rows using only a subset of columns, we can use the following join syntax:

SELECT * FROM Enrollments JOIN Students USING (studentID);

This syntax makes it clear that we want to find row pairs that have the same value in the studentID column (but do not require the same value in other columns that appear in both input tables). For our example database, both versions are equivalent since there is only one single column that appears in both tables in any case. However, in a more realistic database, we might well have tables with several tens or even hundreds of tables. In those cases, it becomes difficult to keep track of all columns that may have the same name in both tables. Hence, specifying the names of columns to match explicitly via the USING keyword is the safer approach.

While matching rows with the same values in certain columns is a common case, it does not cover every possible scenario. SQL enables you to specify arbitrary join conditions, determining which pairs of rows should be matched. To write arbitrary conditions, we can use the following syntax:

SELECT * FROM Enrollments JOIN Students ON (Enrollments.studentID = Students.studentID);

Here, we can specify arbitrary conditions as the join condition. The query above has the same semantics as the queries before with the following exception. The join syntax presented before (both, the NATURAL JOIN and USING syntax) are both specialized to scenarios where one or multiple columns appear in both tables and matching rows have the same values for those columns. In those cases, having the columns from both input tables appear in the join result is redundant. Hence, SQL eliminates one of the two columns in the join result automatically. When specifying arbitrary conditions using the ON syntax, we may not have any redundant columns in the join result. Therefore, SQL does not automatically eliminate any columns using the join syntax discussed last.

Let's use a different example to demonstrate the flexibility of the last join variant:

SELECT * FROM Enrollments JOIN Students ON (Enrollments.studentID <> Students.studentID);

Here, we match rows from the Enrollments and Students table where the student ID is different (essentially the opposite of our prior join condition). In general, using this syntax, we can use arbitrary conditions to match rows from the input tables.

You might have noticed that we prefix both references to the studentID column by the name of the table we are referring to (Enrollments or Students). This is necessary since the column appears in both input tables. Hence, a reference to the column name alone is ambiguous. The database system cannot know which column (among all columns with the same name) we are referring to. If column names appear only once in all input tables, we can omit the name of the table. Writing out the names of longer tables can be tedious. Hence, SQL enables us to introduce shortcuts for tables in the From clause. Below is the query joining both input tables while introducing shortcuts:

SELECT * FROM Enrollments E JOIN Students S ON (E.studentID = S.studentID);

We introduce the shortcut E for the Enrollments table and the shortcut S for the Students table. Shortcuts can be used in all other parts of the query, including the Where clause (as in the example query) as well as the Select clause.

Yet another alternative to the join syntax above is the following:

SELECT * FROM Enrollments E, Students S WHERE E.studentID = S.studentID;

This syntax does not refer to joins explicitly. Instead, it simply enumerates all tables to join as a comma-separated list in the From clause. Then, it specifies the join condition as part of the Where clause. The semantics of a comma-separated list of tables in the From clause is a Cartesian Product. This means, we obtain all possible pairs of rows from the input tables. The join condition is specified separately as a condition in the Where clause.

At this point, you might be wondering about whether or not the join syntax impacts efficiency of data processing. In particular, the last option might seem suspect: do we really pair up all rows in a first step (leading to a possibly large, intermediate result that is expensive to process) and filter only in the second step? Fortunately, that is not the case. SQL is a declarative language, meaning that the SQL syntax should not determine the way in which the query is processed by the system. Therefore, any of the aforementioned syntax variants lead to the same performance and you can choose whichever one is most intuitive.

The queries seen so far join only two tables. However, you may join any number of tables using the same syntax. As an example, the following query joins all tables from our example database:

SELECT * FROM Students S, Enrollments E, Courses C WHERE S.studentID = E.studentID AND E.courseID = C.courseID;

Similarly, the following syntax could be used (which also eliminates redundant columns automatically):

SELECT * FROM Students JOIN Enrollments USING (studentID) JOIN Courses USING (courseID);

All joins discussed so far only return row pairs that satisfy the join condition. This means that rows from any of the two input tables that do not match with any rows in the other table are not part of the join output. In some cases, this behavior is undesirable. For instance, imagine we want to match students with courses they enrolled for but also retrieve students who did not enroll in any courses yet. When joining Students and Enrollments tables as in the prior queries, students who do not match any rows in the Enrollments table do not appear in the join result.

To guarantee that each row from a specific input table appears at least once in the join result, independently of whether or not it satisfies the join condition with at least one row in the other table, we can use an Outer Join. We need to explicitly request an outer join in SQL. By default, joins behave as Inner Joins, meaning that unmatched rows are not part of the join result.

For the outer join, we can choose between three variants:

Left outer join.
Right outer join.
Full outer join.

The three variants differ with regard to the tables from which rows are guaranteed to be preserved in the join result. For a left outer join, all rows from the left-hand join table are guaranteed to be preserved. For the right outer join, rows from the right table are always preserved. Finally, for the full outer join, rows from both input tables are preserved, even if no matching rows are found in the other table.

Whenever a row is matched to a row in the other table, the join result fills result columns with values from both matched rows. However, what values should we use for rows that do not match any other rows in the other table? For unmatched rows, we fill values of columns that belong to the other table with SQL NULL values. The following example illustrates the difference between the inner and outer join variants.

Enrollments:

studentID	courseID
1	3
2	3
1	1

Students:

studentID	name
1	Marc
2	Isabel
3	Robert

Note that the table content has changed, compared to the previous example with the same two tables. Now, the third student, Robert, is not enrolled in any classes. Consider the following query (which uses an inner join):

SELECT * FROM Enrollments JOIN Students USING (studentID);

The result of this query is the following:

studentID	courseID	name
1	3	Marc
2	3	Isabel
1	1	Marc

The result contains one row for each enrollment entry, matching it to the associated student. Since the third student, Robert, is currently not enrolled in any courses, the student does not appear in the join result. To guarantee that each student, independently of matching enrollment rows, appears in the join result, we have to use the following query instead (which uses an outer join):

SELECT * FROM Enrollments RIGHT OUTER JOIN Students USING (studentID);

Note that the same shortcuts for specifying join conditions are available for outer joins, compared to inner joins (equivalently, we could have used a NATURAL RIGHT OUTER JOIN). The outer join is not symmetrical and it matters which version (RIGHT versus LEFT versus FULL) we use. In this example, our goal is to keep each students, even students without matching rows in the Enrollments table. The Students table appears on the right of the join operator. Hence, we need to use a RIGHT OUTER JOIN. The join result is the following:

studentID	courseID	name
1	3	Marc
2	3	Isabel
1	1	Marc
3	NULL	Robert

This join result contains all students, even Robert who has not yet enrolled in any courses. Since there are no matching entries in the Courses table for Robert, the columns that are specific to the Courses table are filled with NULL values (instead of values taken from any rows in Enrollments). If trying it out you may, depending on the database management system used, see a different representation of NULL values (e.g., when using PostgreSQL from the terminal, you would simply see an empty field representing the NULL value).

The courseID column is part of the primary key for the Enrollments table. This means no NULL values are allowed for this column. That, in turn, means that having a NULL value in the courseID column is a sure way to recognize students who have not enrolled for any courses. We can exploit that to write the following query, retrieving all students who did not yet enroll for any courses (perhaps to send out email reminders encouraging them to sign up for courses soon):

SELECT studentID FROM Enrollments RIGHT OUTER JOIN Students USING (studentID) WHERE courseID is NULL;

This query first performs the outer join between Enrollments and Students, then filters the resulting rows to the ones with a NULL value in the courseID column (which means we're selecting precisely the students without enrollments) and finally selects the studentID column for those students.

In the next chapter, we will see several SQL functions that can be combined with joins and enable even more sophisticated data analysis in SQL.