Principles of Database Systems (H) Lecture 4 Notes

Zhige Chen Lv4

Advanced Queries

DISTINCT

To obtain a valid relation, we must ensure that there are no duplicate identifiers in the result set. For example, the naive query below

1
SELECT country FROM movies WHERE year_released=2000
This query will return a result set that contains many identical rows. In other words, we may obtaining a table, but it’s not a relation because many rows cannot be distinguished.

Tip

If we need to do some operations with the query result, always try to accomplish it by using database.

If we are only interested in the different countries, we can use the keyword DISTINCT

1
SELECT DISTINCT country FROM movies WHERE year_released=2000
Now the result set is a valid relation.

If there are multiple columns after the keyword DISTINCT, them it will eliminate those rows where all selected fields are identical.

1
SELECT DISTINCT country, year_released FROM movies WHERE year_released IN (2000,2001)
The selected combination (country, year_released) will be identical.

Aggregate Functions

The aggregate function will aggregate all rows that share a feature and return a characteristic of each group of aggregated rows. It will be clearer with an example. To compute an aggregate result, we’ll first retrieve data. And then regroup all the data according to the values in one or several columns.

For example, if we want to group by country, and for each country the aggregate function COUNT(*) says how many movies we have.

1
2
3
4
SELECT country,
COUNT(*) number_of_movies
FROM movies
GROUP BY country
We can also group on several columns.
1
2
3
4
SELECT country, year_released
COUNT(*) number_of_movies
FROM movies
GROUP BY country, year_released
Every column that isn’t an aggregate function and appears after SELECT must also appear after GROUP BY.

Caution

Beware of some performance implication. When we apply a simple WHERE filter, the DBMS can start returning rows as soon as a match is found. With a GROUP BY, the rows must be regrouped before the aggregate can take place. In other words, we need a preparatory phase that may take time, even if few rows are returned in the end.

Common aggregate functions: - COUNT(*)/COUNT(col) - MIN(col) - MAX(col) - AVG(col) - STDDEV()

For example, to find the earliest release year by country:

1
2
3
4
SELECT country,
MIN(year_released) oldest_movie
FROM movies
GROUP BY country
Since the result returned by MIN will be a relation, we can apply another relational operation to the result:
1
2
3
4
5
6
7
8
SELECT *
FROM (
SELECT country,
MIN(year_released) oldest_movie
FROM movies
GROUP BY country
) earliest_movies_per_country
WHERE oldest_movie < 1940
There is a short-hand that makes nesting queries unnecessary. We can have a condition on the result of an aggregate with HAVING:
1
2
3
4
5
SELECT country,
MIN(year_released) oldest_movie
FROM movies
GROUP BY country
HAVING MIN(year_released) < 1940
The two following queries are both valid in SQL and behaviorally equivalent:
1
2
3
4
5
6
7
8
9
10
11
SELECT country,
MIN(year_released) oldest_movie
FROM movies
GROUP BY country,
HAVING country='us'

SELECT country,
MIN(year_released) oldest_movie
FROM movies
WHERE country='us'
GROUP BY country,
But the efficient way is the second one. This is because it will first filter the result then aggregate them.

Caution

The query optimizers in different DBMS are very likely different. We must not rely the optimization that depends on a specifc DBMS. The most efficient query should always be chosen.

The aggregate ignores NULL. In the following query, the WHERE condition changes nothing to th result:

1
2
3
SELECT MAX(died) most_recent_death
FROM people
WHERE died IS NOT NULL
We can also count distinct columns:
1
2
3
4
SELECT country,
COUNT(DISTINCT year_released) number_of_years
FROM movies
GROUP BY country

A complex example: find out how many people that are both actors (represented by the kind 'A') and directors (represented by the kind 'D').

1
2
3
4
5
6
7
8
SELECT peopleid, COUNT(*) AS number_of_roles
FROM(
SELECT DISTINCT peopleid, credited_as
FROM credits
WHERE credited_as IN ('A', 'D')
)
GROUP BY peopleid
HAVING count(*)=2

Retrieving Data from Multiple Tables

In the building of complex queries, a important invariant that we need to maintain is that the result set return by those queries are valid relations, with no duplicates and a column that could be used as a key. This must hold at every stage in a query.

To relate data from multiple tables, we need to use JOIN. For example, to retrieve country names by using the country column in the table movies, we can use:

1
2
3
SELECT title, country_name, year_released
FROM movies JOIN countries
ON country_code=country
The JOIN operation will create a virtual table with all combinations between rows in the two tables. If table 1 has rows, and table 2 has rows, then the result virtual table will have rows.

In the query above, we use an ON filter to filter out unrelated rows to make a much smaller virtual table.

If the two tables joined contain columns that has the same name, then it will be ambiguous to join on these columns. There is something called natural join (unsupported by SQLServer) that basically means “if a column has the same name, then we should join on it”. But It can be quite counterintuitive because it is based purely on names.

We can solve the issue by USING (not supported by SQLServer either):

1
2
3
SELECT DISTINCT first_name, surname
FROM people JOIN credits USING (peopleid)
WHERE credited_as = 'D'
A much better solution is to explicitly write out the table names
1
2
3
4
SELECT DISTINCT first_name, surname
FROM people JOIN credits
ON credits.peopleid = people.peopleid
WHERE credited_as = 'D'
We can create short alias to shorten the query:
1
2
3
4
SELECT DISTINCT first_name, surname
FROM people AS p JOIN credits AS c
ON c.peopleid = p.peopleid
WHERE credited_as = 'D'

Caution

Most DBMS accept people AS p instead of people p, except Oracle.

More about JOIN

A simple example of self join is if, for actor families, each row were containing the identifiers of the father and mother if they are in the database. You can display child and father.

1
2
3
4
SELECT c.first_name || ' ' || c.surname as person,
f.first_name || ' ' || f.surname as father
FROM people AS c JOIN people AS f
ON f.peopleid = c.fatherid
A JOIN can as well be applied to a virtual table returned from a query, as long as the result is a valid relation.

We can also chain joins the same way we chain filtering conditions with AND. JOINs between 10 or 15 tables aren’t uncommon, and queries generated by programs often do much worse.

1
2
3
4
5
6
SELECT ... FROM table1 
JOIN table2
ON ...
...
JOIN tablen
ON ...
Let’s write a relatively simple query, say we want to retrieve all British movie titles with director names. Finding all the tables involved is simple enough. But trouble begins as soon as we start writing column names after SELECT:
1
SELECT m.title, p.surname
- The title comes from movies, and the surname comes from people. But is it a key? - Consider when two brothers co-direct a film, then the title and surname are definitely not unique. How to handle the situation of a movie with two directors? A easy and naive solution is to use DISTINCT:
1
SELECT DISTINCT m.title, p.surname
But it will lose much information. A better solution is to return everything that is required to be certain about uniqueness:
1
2
3
4
5
6
7
SELECT m.year_released, m.title, p.first_name, p.surname
FROM movies AS m
INNER JOIN credits AS c
ON c.movieid = m.movieid
INNER JOIN people AS p
ON p.peopleid = c.peopleid
WHERE c.credited_as = 'D' AND m.country='gb'
One important thing is that it better to start with the table for which we can select efficiently fewer rows before starting joining. Though in most situations this would not cause any performance differences, but we cannot rely on the specific behavior of the DBMS.

  • Title: Principles of Database Systems (H) Lecture 4 Notes
  • Author: Zhige Chen
  • Created at : 2025-09-29 14:00:24
  • Updated at : 2025-10-10 13:10:58
  • Link: https://nofe1248.github.io/2025/09/29/dbh-04/
  • License: This work is licensed under CC BY-NC-SA 4.0.
Comments
On this page
Principles of Database Systems (H) Lecture 4 Notes