Defining a Database

This chapter shows how we can define a database schema with associated constraints.

Before loading data into a database, we first need to define the database structure (also called: the Database Schema). In the simplest case, this means we define for each table in the database a table name, along with a list of columns with their names and data types. The following SQL command creates a single table:

CREATE TABLE [Table-Name] ([Column-List]);

In this and the following commands, placeholders are marked up via square brackets. In the command above, [Table-Name] represents the name we want to give to our table. [Column-List] is a comma-separated list of columns. For each column, we first define the column name and then the data type. Note that the names of tables, within a given database, as well as the names of columns, within a given table, must be unique (of course, different tables may have columns with the same name and different databases may have tables of the same name).

Let's take an example to make things more concrete. Let's say we want to define a table to store information on students at a university. We probably want to store a unique ID for each student, the name of the student, and the student's GPA. We need to assign each column to one of SQL's data types. Some of the most important SQL data types are the following:

Int Integer values.
Numeric Numerical values.
Text Text strings.
Date Date values.
Timestamp Timestamp values.

We will focus on those data types for the remaining tutorial. To get a complete list of data types supported by your database management system (the list of supported types may differ across different systems), consult the documentation of the specific database management system you're using.

For our example table, we want to use type Int for the student ID, the Numeric type for the GPA, and the Text type for the student name. The following SQL command creates the corresponding table:

CREATE TABLE Students(
	studentID int, name text, GPA numeric);

Perhaps we also want to include some information on the courses students are currently enrolled in. First of all, it makes sense to create a table that stores information on courses:

CREATE TABLE Courses(
	courseID int, name text);

Again, we introduce a course ID, along with a course name. Note that different courses sometimes have the same name (e.g., if different departments offer a course in "Statistics"). Hence, having the ID as a reliable way to distinguish different courses seems like a good idea. Now that we have information on students and courses, we might also want to keep track of which students are currently enrolled in which courses. For that, we introduce yet another table:

CREATE TABLE Enrollments(
	studentID int, courseID int);

This table requires a bit more explanations. It refers to two columns we have already encountered: the course ID (introduced in the Courses table) and the student ID (introduced in the Students table). The idea is that this table links together pairs of students and courses with the semantics that the corresponding student is taking the corresponding course. To avoid confusing different students or courses that happen to have the same name, we refer to IDs, rather than names. Note that the same student may appear multiple times in this table if one student takes several courses (which is realistic). Also, the same course may appear multiple times in this table as many students typically take the same course.

So far, we have defined the basic structure for all tables, including the names and types of columns. At the very least, this will prevent users from inserting data rows with values of incorrect types. However, there are various other ways to insert incorrect data into the database. When defining the structure of an SQL database, the philosophy is typically to minimize possibilities for users to accidentally enter data that "does not make sense". For that purpose, SQL offers rich options to define various types of constraints on the data. If users try entering data that does not comply with constraints (or try to change existing data in a way that makes it violate constraints), the database management system throws an error and prevents the corresponding change. This is a good way of making sure that your data is as correct and consistent as possible.

One of the most popular types of constraints is a Primary Key Constraint. A primary key constraint designates a set of columns in a table with the semantics that value combinations in those columns must be unique (when comparing different table rows). SQL generally offers the option to mark values in columns as "unknown" by specifying a special value, the SQL NULL value. However, when specifying primary key constraints on columns, no row can use SQL null values for any of those columns. In summary, the primary key constraint requires setting concrete values for all of the corresponding columns and the value combinations must be unique across rows.

Let's see where we can use primary key constraints in our example database. First of all, we introduced ID values for students and courses precisely to guarantee that we can distinguish different students and courses in the same table. This means we want our ID values to be unique and we require an ID value for each student and course. This is why a primary key constraint for student and course IDs makes sense. The following SQL command introduces a primary key constraint:

ALTER TABLE [Table-Name] 
	ADD PRIMARY KEY ([Column-List]);

Here, [Table-Name] is the table for which we want to introduce a primary key constraint. [Column-List] is a comma-separated list of column names to which the primary key constraint refers to. The following SQL commands introduce primary key constraints for the student and the course ID:

ALTER TABLE Students	ADD PRIMARY KEY (studentID);
ALTER TABLE Courses ADD PRIMARY KEY (courseID);

Together, those commands ensure that users cannot insert two students with the same ID or two courses with the same ID. While the IDs must differ, different students may still have the same name (and different courses may have the same name as well).

We also want to avoid redundant entries in the Enrollment table. Storing the information that the same student is enrolled in the same course repeatedly is useless and wastes storage space. Hence, we introduce a constraint ensuring that enrollment entries are unique:

ALTER TABLE Enrollments
	ADD PRIMARY KEY (studentID, courseID);

This is the first time, we define a primary key constraint on multiple columns (studentID and courseID). It means that having two entries with the same values in both of those columns is inadmissible. However, we can still have two table rows that have the same value for only one of those two columns. This is important since we still want to allow cases where the same student appears twice, having enrolled in multiple courses. Also, we need to enable the case where the same course appears multiple times, to represent multiple students who are enrolled in the same course.

We have prevented users from inserting redundant enrollments or inserting multiple courses and students with the same idea. Does that cover any possible mistake? Of course not. Next, we discuss another common type of data entry error that can be fixed using a second type of constraint.

The Enrollments table captures connections between students and courses, meaning that a specific course is enrolled in a specific course. For the moment, the names of the columns in the Enrollments table make it clear to users that one should refer to the course ID and the other one to the student ID. Of course, we want to refer the student ID to students who are stored in our Students table. The course ID should equally refer to courses that we have stored in our Courses data. However, for the moment, there is no mechanism that prevents users from entering enrollments for students that do not exist or to enroll students in courses that we do not have in our database! Clearly, that's undesirable (and mistakenly using an incorrect ID for students or courses seems like a mistake that could happen indeed).

To prevent such cases, we can use another popular type of constraint, a Foreign Key Constraint. A foreign key constraint connects columns across multiple tables. Its semantics is the following: any value that appears in a column of the first table must also exists as a value in the column of a second table. The constraint generalizes to multiple columns, meaning that value combinations in a group of columns in a first table must also exist as value combination in the columns of a second table. Specifically, as the name of the constraint suggests, the columns in the first table must link to primary key columns in the second table.

In our example database, we want to link from the columns in the Enrollment table to key columns in the Students and Courses tables. The general syntax for introducing foreign key constraints is the following:

ALTER TABLE [Table-1]
	ADD FOREIGN KEY ([Table-1-Columns])
	REFERENCES [Table-2] ([Table-2-Columns]);

This command constraints values that can appear in the table named [Table-1]. More precisely, it constrains values that can appear in the columns specified as [Table-1-Columns] (a comma-separated list of constrained columns). The values that appear in [Table-1-Columns] must match values that appear in the columns [Table-2-Columns] in the table named [Table-2]. If users try to insert data into [Table-1] whose values in [Table-1-Columns] do not appear in [Table-2-Columns] in [Table-2], the database system will reject that change with an error message.

Let's apply this type of constraint to our example database. First of all, we want to make sure that all enrollment entries refer to students who are stored in the Students table. We can accomplish that with the following command:

ALTER TABLE Enrollments ADD FOREIGN KEY (studentID) REFERENCES Students (studentID);

Also, we want to make sure that all Enrollment rows refer to existing courses in our database. The following command takes care of that:

ALTER TABLE Enrollments ADD FOREIGN KEY (courseID) REFERENCES Courses (courseID);

Now, trying to add enrollments for non-existing courses will produce an error message. Note that we can have multiple constraints, as well as multiple constraints of the same type (foreign key constraints in this case) on the same table.

Adding constraints to tables via separate commands is one option. Another option, slightly shorter, is to integrate constraints directly in the table definition command. For instance, adding PRIMARY KEY directly behind a table column definition (after the data type) turns that column into a primary key column. In case of multiple primary key columns, adding the statement PRIMARY KEY([Column-List]) in the table definition does the trick. Similarly, foreign key constraints can be added directly in the table definition. For instance, the following command creates the Enrollments table directly with all of the aforementioned constraints:

CREATE TABLE Enrollments(
	studentID int, courseID int,
	PRIMARY KEY (studentID, courseID),
	FOREIGN KEY (studentID) 
		REFERENCES Students(studentID),
	FOREIGN KEY (courseID)
		REFERENCES Courses(courseID)
);

Next, we will see how we can fill our newly created database schema with data.