Practical Data Science: Relational Data

  • 1 year ago
  • Gritinai

The basic relation (i.e. the table)

The term technical term “relation” can be interchanged with the standard notion we have of “tabular data,” say an instance of a “Person” relation

Rows

Rows are called tuples (or records), represent a single instance of this relation, and must be unique

Columns

Columns are called attributes, specify some element contained by each of the
tuples

Multiple tables and relations

Of course, a single table isn’t particular interesting as a rich data source; we could have done (and in fact do) store the exact same information in a CSV file. Where relational data becomes interesting is when we have multiple tables and explicit relationships between them. For example, considering the case above, the “Role” attribute in our Person table is not particularly well defined. What is the data type of this attribute? A string? If so, we could enter anything we want here, which would not be ideal for maintaining a large collection of data, say if each student has to remember to type their role exactly as “Student” (with capitalization intact, of course). A better alternative is to create a separate “Role” table, that lists the allowable roles for the course:

Primary keys

A primary key is a unique identifier for each row in the table. It is common to have a single column (like we do here as the “ID” column) serve as the primary key, but that is not required; the primary key can consist of multiple columns so long as they are unique in every row. Every relation (table) in the database must have exactly one primary key. Its unique ID for every tuple in a relation (i.e. every row in the table), each relation must have exactly one primary key

Foreign keys

A foreign key is an attribute that “points” to the primary key of another table. Thus, in the above example, the “Role ID” attribute in the Person table is a foreign key, pointing to the primary key of the Role table.

Indexes (not indices)

Indexes are created as ways to “quickly” access elements of a table
For example, consider finding people with last name “Gates”: no option but just scan through the whole dataset: 𝑂 𝑛 operation

Indexes are created to “quickly” look up rows by some of their attributes. For example, suppose we wanted to find all the people in our Person table with the last name of “Gates”. Naively, there would be no way to do this except search over the entire table to see every column that matched this last name. Instead of doing this, we can build an index on the “Last Name” attribute to provide an efficient means for retreiving tuples based upon last name.

Think of an index as a separate sorted table containing the indexed column and the tuple location: searching for value takes 𝑂(log 𝑛) time
In practice, use data structure like a B-tree(a type of search tree that in made to be particularly efficient on disk) or several others.

The primary key always has an index associated with it (so you can think of primary keys themselves as always being a fast way to access data)
Indexes don’t have to be on a single column, can have an index over multiple
columns (with some ordering)

Entity relationships

Several types of inter-table relationships
1. One-to-one
2. (One-to-zero/one)
3. One-to-many (and many-to-one)
4. Many-to-many
These relate one (or more) rows in a table with one (or more) rows in another table, via a foreign key
Note that these relationships are really between the “entities” that the tables
represent, but we won’t formalize this beyond the basic intuition

One-to-many relationship

We have already seen a one-to-many relationship: one role can be shared by
many people, denoted as follows

One-to-one relationships

In a true one-to-one relationship spanning multiple tables, each row in a table has exactly one row in another table
Not very common to break these across multiple tables, as you may as well just add another attribute to an existing table, but it is possible

One-to-zero/one relationships

More common in databases is to find “one-to-zero/one” relationships broken
across multiple tables
Consider adding a “Grades” table to our database: each person can have at most one tuple in the grades table

Many-to-many relationships

Creating a grades table as done before is a bit cumbersome, because we need to keep adding columns to the table, null entries if someone doesn’t do the homework
Alternatively, consider adding two tables, a “homework” table that represents
information about each homework, and an associative table that links homeworks to people

Setups like this encode many-to-many relationships: each person can have
multiple homeworks, and each homework can be done by multiple people

We could also write this in terms of relationships specified by the associative table, but this is not really correct, as it is mixing up the underlying relationships with how they are stored in a database