top of page

Simple Analysis

This chapter introduces simple data analysis with SQL, including filtering and projections.

Simple Analysis

SQL enables us to answer complex questions about relational data. In this and the following chapters, we will see some of the most important SQL features for data analysis. A simple SQL query for data analysis has the following structure:

SELECT [Select-Clause]
FROM [From-Clause]
WHERE [Where-Clause]

We discuss the semantics of each of those clauses next. SQL queries operate on tables that are stored in the database. The [From] clause describes the source data for the SQL query, meaning the tables that are used as input. In general, the From clause may refer to multiple tables. We discuss this case in more detail in the next chapter. For the moment, we will focus on the simplest possible case of one single table in the From clause.

Not all rows in the input table may be relevant for the current query. Therefore, the Where clause allows users to specify conditions. Only rows satisfying those conditions will affect the query output. The query output is, in general, another relational table. The Select clause describes the columns of that table. It is a comma-separated list of column definitions. In the simplest case, each of those definitions simply refers to the name of a column from the input table. In more advanced cases, those column definitions may contain complex expressions that refer to columns from the input table (as well as, possibly, constants).

Let's make all of this more concrete by an example query. Let's assume that we want to retrieve the GPA of the student named "Marc". The following SQL query accomplishes that:

SELECT GPA FROM Students WHERE name = 'Marc';

This query contains the three clauses discussed before. In the From clause, it refers to one single table, the Students table. This is the source data used for query processing. It contains all relevant information, namely the GPAs and names of all students. We are not interested in all students but only in a specific student named "Marc" (technically, as the name column is not guaranteed to contain unique values, we might retrieve GPAs for multiple students with the same name). Hence, the Where clause filters rows to the ones with the relevant value in the name column. The condition is satisfied only by the rows that refer to students named "Marc".

Also, we are not interested in all columns from the source table. Instead, we only want to retrieve the student's GPA. In the Select clause, we generally specify the columns that should be part of our result table. In this case, we only want the result table to have one single column, the GPA column. As this column already exists in the source table, we merely specify the name of that column in the Select clause.

The example query uses the simplest possible version for each of the three clauses. Next, we discuss more complex cases, starting with the Where clause.

The SQL Where clause generally formulates conditions used to filter the input rows. First of all, we discuss the types of conditions supported by SQL. Coming from a different language than SQL, you might be thinking of Boolean conditions, i.e., conditions that evaluate either to True or to False. However, SQL supports ternary logic instead. This means that conditions may evaluate to either True, False, or SQL NULL. The NULL value means that we do not have enough information to decide whether or not a condition is true or not. This is a powerful feature that allows handling cases where information in the database is missing.

Consider the following scenario: we want to retrieve the GPA of students whose name is "Marc". However, perhaps we previously created an entry in the Students table that has a student ID but the name has yet to be entered in the database. In that case, the name field for the associated Students row is set to SQL NULL. However, this implies that we cannot be sure whether the name will ultimately be set to "Marc" or to a different name. That means, it is unclear whether the condition in our Where clause (restricting the name column to the value "Marc") is satisfied or not. In that case, the condition evaluates neither to True or False but to NULL. By convention, rows for which the condition in the Where clause evaluates to NULL are not part of the query result.

SQL conditions can use binary operators such as equality (=), not equal (<>), inequalities (e.g., <, >, <=, and >=), as well as arithmetic operators (+, -, /, *). Those operators apply to columns, constants, and, of course, the results of other SQL expressions. For instance, the following query is perfectly valid SQL:

SELECT GPA FROM Students WHERE studentID * 2 + 5 <= 7;

In some cases, we want to retrieve rows that currently have no value in specific columns (i.e., the corresponding columns have the SQL NULL value). For instance, this is useful if we want to identify courses where the precise name was unclear when the associated row was inserted in the database (allowing us to substitute NULL with the final course name). A common mistake when writing SQL queries is to use a condition such as the following to test for SQL NULL values:

SELECT courseID FROM Courses WHERE name = NULL;

While this query is processed, it will generally not return a single row, even if several courses have a NULL value for the name! It is important to understand the reason behind this behavior as it sheds more light on the handling of NULL values in SQL conditions. The SQL NULL value essentially represents a lack of knowledge, an unknown value. Hence, using NULL in an equality means that we are comparing a value (in this case, the value of the name column) to a value we don't know. How should we tell whether or not the unknown value happens to be equal to the value we compare against? Of course, there is no way to decide whether or not the quality is satisfied or not. Accordingly, the Where condition in the query above generally evaluates to NULL, whether or not the left-hand side of the equality condition (the course name) is set to NULL or not. In essence, the only question is whether we compare one known value to an unknown value or compare two unknown values with each other. This distinction does not matter for the final outcome: as soon as at least one operand of the quality is unknown, we cannot know the final outcome anymore.

The correct way to check for SQL null values is the following:

SELECT courseID FROM Courses WHERE name IS NULL;

The operator IS NULL evaluates to True if the value on its left-hand side is an SQL NULL value. Similarly, using IS NOT NULL, we check for the absence of NULL values.

Another important operator, specifically applied to string values, is the LIKE predicate. The LIKE predicate compares string values to a regular expression and evaluates to True if the string matches the expression. When formulating LIKE expressions, the percentage symbol (%) abbreviates a string of arbitrary length whereas the underscore (_) abbreviates one single, arbitrary character. For instance, we can use the following query to retrieve the IDs of all courses that contain the word "Database", somewhere in the course name:

SELECT courseID FROM Courses WHERE name LIKE '%Database%';

Here, the LIKE expression can contain arbitrary prefixes (including the empty one) and suffixes, as long as we find the word "Database" somewhere in between.

SQL enables users to compose atomic conditions into more complex ones, using logical operators such as NOT, AND, and OR. E.g., the following query exploits those features to find courses with specific names and IDs:

SELECT courseID FROM Courses WHERE name LIKE '%Database%' AND courseID > 10;

To take just another example, the following query retrieves all courses except for the database courses:

SELECT courseID FROM Courses WHERE NOT (name LIKE '%Database%');

We now shift our focus from the Where clause to the Select clause. In the simplest case, the Select clause merely contains the names of columns that appear in the input table. If so, the values in those columns (for all rows satisfying the condition in the Where clause) will simply be copied to the output. In more complex cases, we might compute expressions that involve multiple input columns and/or constants. All expressions that can be used in the Where clause can appear in the Select clause as well, including comparisons, logical, and arithmetic operators. The scope of admissible expressions in the Select clause is broader, compared to the Where clause, since expressions do not have to correspond to predicates. Instead, arithmetic expressions (evaluating to numbers) or string expressions (e.g., a concatenation between multiple strings, using the || operator in PostgreSQL) are perfectly fine.

Next, we discuss a few short-hand notations that apply specifically to the Select clause. A common scenario is the following: we want to include not one but multiple (if not all) columns of the input table. If we are interested in a small subset of columns, we can simply enumerate all of them in the Select clause, separated by commata. If we are interested in all columns, we can use the asterisk (*) symbol to avoid having to enumerate all of them. Similarly, if multiple tables appear in the From clause (a case we discuss in more detail in the following chapter), using the expression [Table-Name].* will select all columns from that specific table. For instance, the following query retrieves all information for students named "Marc":

SELECT * FROM Students WHERE name = 'Marc';

Another feature, specific to the Select clause, is the possibility to rename the columns that appear in the result table. For instance, we might assign a new name to the name column in the Students table for better readability:

SELECT name as NameOfStudent FROM Students;

This feature will become particularly important when referencing the result of an SQL query from a surrounding query, a feature discussed in later sections.

The queries discussed in this chapter enable simple analysis of data contained in a single relational table. In the following chapter, we broaden our scope to data analysis spanning data in multiple tables.

© 2025 by Immanuel Trummer. All rights reserved.

bottom of page