Article updated 2025-02-05.
Written by Christofer Sandin.

A Bit of Basic Database Theory

It may sound dry, but it's not entirely wrong to know a bit of theory about how and why certain things are done if you want to work with database development in a serious manner. Especially if you aim to do real jobs. The theory below is somewhat simplified, and I know I haven't covered everything. The intention was to summarize the most important points, and sometimes that requires leaving out details to get the message across. I hope you can bear with this.

In reality, as it seems one must adapt to from time to time, there are often reasons that make it impossible to be as strict as one might wish. But, as with everything else, the better you understand a subject, the easier it is to "break the rules" when necessary.

With these words, here follows what I, subjectively, consider to be some of the most important concepts in a bit of basic database theory.

Introduction and Some Terminology

The English term DataBase Management System (DBMS), or as it is called in Swedish, a databashanterare, is a collection of data that is interdependent, along with all the programs needed to read/edit/delete this data. A database is the aggregated dataset and is usually described with one or more tables.

These tables consist of a number of columns with names (attributes) where data is stored in each row (tuple). Of course, there are different types of database management systems, but for those of us working with web technology, it is almost always a Relational DataBase Management System (RDBMS) from MySQL, Oracle, PostgreSQL, or Microsoft that is used.

A database design is referred to as its schema, and the information stored in the database at a specific point in time is called an instance of the database.

Another term that is used quite often, and that can be good to know, is entity. It is an "object" in the real world that can be distinguished from all other objects. An entity has a set of properties that describe it, and sometimes one or more of these properties can uniquely describe the object, i.e., serve as a way to refer to this specific "object."

An entity set is, as it sounds, a collection of entities of the same type and with the same properties. An example of this is all the records in your CD collection or all the people in Sweden.

id	title	artist	website	date
1	Out of Time	R.E.M.	www.remhq.com	1991-03-12
2	Reveal	R.E.M.	www.remhq.com	2001-05-07
3	Mental Jewelry	Live	www.friendsoflive.com	1992-03-09

Table 1: An example of an entity set with the attributes id, title, artist, and date. It is presented in table form with three tuples of data.

As mentioned above, certain attributes can uniquely determine a tuple of data in a table. In our example above, for instance, id is such an attribute; this is then referred to as a key.

In other cases, we may need to use multiple attributes to identify a unique tuple. This is also allowed and works in principle the same way, but then these common attributes are referred to as a superkey.

That was a lot of terms in a relatively short time, so it might be good to have another example:

The entire population of Sweden can be described with a lot of different attributes where several different people can be described as blonde, brown-eyed, or tall, but only one unique person can be referred to using a personal identification number.

Normalization

When designing a database that will be used professionally (and that is a bit more complex than the table with CDs above), there are a few things you should know and consider. One of these things is normalization.

The major benefits of normalization are that you avoid redundant data and that it often becomes faster to locate data in the database. Redundant data means that the same information is stored in many different places in the database, which is not recommended as it can lead to problems during updates since you have to update in many different places.

Thus, you should (often) normalize the database as much as possible. It is worth noting that the various normal forms below build upon each other, so to meet the conditions of the third normal form, the conditions for the first two must also be met.

First Normal Form (1NF)

Each row in a column must not contain more than one value. If one of the bands above has two official websites, these should not be stored as “http://www.address1.com, http://www.address2.com” but in two columns that should be named something like website1 and website2 instead.

There must be a primary key in each table. This is a column that uniquely identifies an instance of data (a row in the table).

A primary key can also consist of multiple columns, with the condition that the combination of these values uniquely identifies a row in the table.

Second Normal Form (2NF)

If multiple rows share common data, this should be placed in a separate table.

The relationship between the old and new table is defined with a foreign key.

We use our first example to show how the database storing the CD collection can be rewritten into two separate tables to comply with 2NF:

id	title	artist	date
1	Out of Time	R.E.M.	1991-03-12
2	Reveal	R.E.M.	2001-05-07
3	Mental Jewelry	Live	1992-03-09

Table 2: Here we have split the database into two tables with the attribute artist as a foreign key. This is to extract rows with common data and place them in their own table.

artist	website
R.E.M.	www.remhq.com
Live	www.friendsoflive.com

Table 3: In this case, there was common data in the column website. It has been extracted, and thus the database now follows 2NF.

As you can see, it becomes more and more strict as the level of normalization increases, but it is actually not a bad idea to follow some form of normalization when designing a database. When dealing with large datasets, the advantages become particularly clear, as this means that each change going forward only needs to be made in one table.

Third Normal Form (3NF)

Place all columns that do not depend solely on the primary key in their own table.

There are also other normalization forms. In addition to the fourth and fifth normal forms, there is one called Boyce–Codd Normal Form (BCNF). However, the form that is most commonly used in practice is often 3NF, as the others become a bit too cumbersome unless it is a major project being developed.

Want More?

This was a brief summary of the basics to be able to move on to SQL and "do something for real." If you want more theory, there is plenty to read in professional literature or on the web.

If you are studying or are interested in further education, there are also opportunities to take a course at a YH program, a university, or a technical college.