Principles of Database Systems (H) Lecture 4 Notes
Advanced Queries
DISTINCT
To obtain a valid relation, we must ensure that there are no duplicate identifiers in the result set. For example, the naive query below
1 | SELECT country FROM movies WHERE year_released=2000 |
Tip
If we need to do some operations with the query result, always try to accomplish it by using database.
If we are only interested in the different countries, we can use the
keyword DISTINCT
1 | SELECT DISTINCT country FROM movies WHERE year_released=2000 |
If there are multiple columns after the keyword
DISTINCT, them it will eliminate those rows where all
selected fields are identical.
1 | SELECT DISTINCT country, year_released FROM movies WHERE year_released IN (2000,2001) |
(country, year_released) will be identical.
Aggregate Functions
The aggregate function will aggregate all rows that share a feature and return a characteristic of each group of aggregated rows. It will be clearer with an example. To compute an aggregate result, we’ll first retrieve data. And then regroup all the data according to the values in one or several columns.
For example, if we want to group by country, and for each country the
aggregate function COUNT(*) says how many movies we have.
1 | SELECT country, |
1 | SELECT country, year_released |
SELECT must also appear after GROUP BY.
Caution
Beware of some performance implication. When we apply a simple
WHERE filter, the DBMS can start returning rows as soon as
a match is found. With a GROUP BY, the rows must be
regrouped before the aggregate can take place. In other words, we need a
preparatory phase that may take time, even if few rows are returned in
the end.
Common aggregate functions: -
COUNT(*)/COUNT(col) - MIN(col) -
MAX(col) - AVG(col) -
STDDEV()
For example, to find the earliest release year by country:
1 | SELECT country, |
MIN will be a
relation, we can apply another relational operation to the result:
1 | SELECT * |
HAVING: 1 | SELECT country, |
1 | SELECT country, |
Caution
The query optimizers in different DBMS are very likely different. We must not rely the optimization that depends on a specifc DBMS. The most efficient query should always be chosen.
The aggregate ignores NULL. In the following query, the
WHERE condition changes nothing to th result:
1 | SELECT MAX(died) most_recent_death |
1 | SELECT country, |
A complex example: find out how many people that are both actors
(represented by the kind 'A') and directors (represented by
the kind 'D').
1 | SELECT peopleid, COUNT(*) AS number_of_roles |
Retrieving Data from Multiple Tables
In the building of complex queries, a important invariant that we need to maintain is that the result set return by those queries are valid relations, with no duplicates and a column that could be used as a key. This must hold at every stage in a query.
To relate data from multiple tables, we need to use
JOIN. For example, to retrieve country names by using the
country column in the table movies, we can
use:
1 | SELECT title, country_name, year_released |
JOIN operation will create a
virtual table with all combinations between rows in the two tables. If
table 1 has In the query above, we use an ON filter to filter out
unrelated rows to make a much smaller virtual table.
If the two tables joined contain columns that has the same name, then it will be ambiguous to join on these columns. There is something called natural join (unsupported by SQLServer) that basically means “if a column has the same name, then we should join on it”. But It can be quite counterintuitive because it is based purely on names.
We can solve the issue by USING (not supported by
SQLServer either):
1 | SELECT DISTINCT first_name, surname |
1 | SELECT DISTINCT first_name, surname |
1 | SELECT DISTINCT first_name, surname |
Caution
Most DBMS accept people AS p instead of
people p, except Oracle.
More about JOIN
A simple example of self join is if, for actor families, each row were containing the identifiers of the father and mother if they are in the database. You can display child and father.
1 | SELECT c.first_name || ' ' || c.surname as person, |
JOIN can as well be applied to a virtual table returned
from a query, as long as the result is a valid relation.
We can also chain joins the same way we chain filtering conditions
with AND. JOINs between 10 or 15 tables aren’t
uncommon, and queries generated by programs often do much worse.
1 | SELECT ... FROM table1 |
SELECT: 1 | SELECT m.title, p.surname |
title comes from movies, and the
surname comes from people. But is it a key? -
Consider when two brothers co-direct a film, then the title and surname
are definitely not unique. How to handle the situation of a movie with
two directors? A easy and naive solution is to use
DISTINCT: 1 | SELECT DISTINCT m.title, p.surname |
1 | SELECT m.year_released, m.title, p.first_name, p.surname |
- Title: Principles of Database Systems (H) Lecture 4 Notes
- Author: Zhige Chen
- Created at : 2025-09-29 14:00:24
- Updated at : 2025-10-10 13:10:58
- Link: https://nofe1248.github.io/2025/09/29/dbh-04/
- License: This work is licensed under CC BY-NC-SA 4.0.