Advanced Queries

`DISTINCT`

To obtain a valid relation, we must ensure that there are no duplicate identifiers in the result set. For example, the naive query below

1	SELECT country FROM movies WHERE year_released=2000

This query will return a result set that contains many identical rows. In other words, we may obtaining a table, but it’s not a relation because many rows cannot be distinguished.

Tip

If we need to do some operations with the query result, always try to accomplish it by using database.

If we are only interested in the different countries, we can use the keyword DISTINCT

1	SELECT DISTINCT country FROM movies WHERE year_released=2000

Now the result set is a valid relation.

If there are multiple columns after the keyword DISTINCT, them it will eliminate those rows where all selected fields are identical.

1	SELECT DISTINCT country, year_released FROM movies WHERE year_released IN (2000,2001)

The selected combination (country, year_released) will be identical.

Aggregate Functions

The aggregate function will aggregate all rows that share a feature and return a characteristic of each group of aggregated rows. It will be clearer with an example. To compute an aggregate result, we’ll first retrieve data. And then regroup all the data according to the values in one or several columns.

For example, if we want to group by country, and for each country the aggregate function COUNT(*) says how many movies we have.

SELECT country,
	COUNT(*) number_of_movies
	FROM movies
	GROUP BY country

We can also group on several columns.

SELECT country, year_released
	COUNT(*) number_of_movies
	FROM movies
	GROUP BY country, year_released

Every column that isn’t an aggregate function and appears after SELECT must also appear after GROUP BY.

Caution

Beware of some performance implication. When we apply a simple WHERE filter, the DBMS can start returning rows as soon as a match is found. With a GROUP BY, the rows must be regrouped before the aggregate can take place. In other words, we need a preparatory phase that may take time, even if few rows are returned in the end.

Common aggregate functions: - COUNT(*)/COUNT(col) - MIN(col) - MAX(col) - AVG(col) - STDDEV()

For example, to find the earliest release year by country:

SELECT country,
	MIN(year_released) oldest_movie
	FROM movies
	GROUP BY country

Since the result returned by MIN will be a relation, we can apply another relational operation to the result:

SELECT *
	FROM (
		SELECT country,
			MIN(year_released) oldest_movie
			FROM movies
			GROUP BY country
	) earliest_movies_per_country
	WHERE oldest_movie < 1940

There is a short-hand that makes nesting queries unnecessary. We can have a condition on the result of an aggregate with HAVING:

SELECT country,
	MIN(year_released) oldest_movie
	FROM movies
	GROUP BY country
	HAVING MIN(year_released) < 1940

The two following queries are both valid in SQL and behaviorally equivalent:

SELECT country,
	MIN(year_released) oldest_movie
	FROM movies
	GROUP BY country,
	HAVING country='us'
	
SELECT country,
	MIN(year_released) oldest_movie
	FROM movies
	WHERE country='us'
	GROUP BY country,

But the efficient way is the second one. This is because it will first filter the result then aggregate them.

Caution

The query optimizers in different DBMS are very likely different. We must not rely the optimization that depends on a specifc DBMS. The most efficient query should always be chosen.

The aggregate ignores NULL. In the following query, the WHERE condition changes nothing to th result:

1
2
3

SELECT MAX(died) most_recent_death
	FROM people
	WHERE died IS NOT NULL

We can also count distinct columns:

SELECT country,
	COUNT(DISTINCT year_released) number_of_years
	FROM movies
	GROUP BY country

A complex example: find out how many people that are both actors (represented by the kind 'A') and directors (represented by the kind 'D').

SELECT peopleid, COUNT(*) AS number_of_roles
	FROM(
		SELECT DISTINCT peopleid, credited_as
			FROM credits
			WHERE credited_as IN ('A', 'D')
	)
	GROUP BY peopleid
	HAVING count(*)=2

Retrieving Data from Multiple Tables

In the building of complex queries, a important invariant that we need to maintain is that the result set return by those queries are valid relations, with no duplicates and a column that could be used as a key. This must hold at every stage in a query.

To relate data from multiple tables, we need to use JOIN. For example, to retrieve country names by using the country column in the table movies, we can use:

1
2
3

SELECT title, country_name, year_released
	FROM movies JOIN countries
	ON country_code=country

The JOIN operation will create a virtual table with all combinations between rows in the two tables. If table 1 has

rows, and table 2 has

rows, then the result virtual table will have

rows.

In the query above, we use an ON filter to filter out unrelated rows to make a much smaller virtual table.

If the two tables joined contain columns that has the same name, then it will be ambiguous to join on these columns. There is something called natural join (unsupported by SQLServer) that basically means “if a column has the same name, then we should join on it”. But It can be quite counterintuitive because it is based purely on names.

We can solve the issue by USING (not supported by SQLServer either):

1
2
3

SELECT DISTINCT first_name, surname
	FROM people JOIN credits USING (peopleid)
	WHERE credited_as = 'D'

A much better solution is to explicitly write out the table names

SELECT DISTINCT first_name, surname
	FROM people JOIN credits
	ON credits.peopleid = people.peopleid
	WHERE credited_as = 'D'

We can create short alias to shorten the query:

SELECT DISTINCT first_name, surname
	FROM people AS p JOIN credits AS c
	ON c.peopleid = p.peopleid
	WHERE credited_as = 'D'

Caution

Most DBMS accept people AS p instead of people p, except Oracle.

More about `JOIN`

A simple example of self join is if, for actor families, each row were containing the identifiers of the father and mother if they are in the database. You can display child and father.

SELECT c.first_name || ' ' || c.surname as person,
	   f.first_name || ' ' || f.surname as father
	FROM people AS c JOIN people AS f
	ON f.peopleid = c.fatherid

A JOIN can as well be applied to a virtual table returned from a query, as long as the result is a valid relation.

We can also chain joins the same way we chain filtering conditions with AND. JOINs between 10 or 15 tables aren’t uncommon, and queries generated by programs often do much worse.

SELECT ... FROM table1 
	JOIN table2
		ON ...
	...
	JOIN tablen
		ON ...

Let’s write a relatively simple query, say we want to retrieve all British movie titles with director names. Finding all the tables involved is simple enough. But trouble begins as soon as we start writing column names after SELECT:

1	SELECT m.title, p.surname

- The title comes from movies, and the surname comes from people. But is it a key? - Consider when two brothers co-direct a film, then the title and surname are definitely not unique. How to handle the situation of a movie with two directors? A easy and naive solution is to use DISTINCT:

1	SELECT DISTINCT m.title, p.surname

But it will lose much information. A better solution is to return everything that is required to be certain about uniqueness:

SELECT m.year_released, m.title, p.first_name, p.surname
	FROM movies AS m
		INNER JOIN credits AS c
			ON c.movieid = m.movieid
		INNER JOIN people AS p
			ON p.peopleid = c.peopleid
		WHERE c.credited_as = 'D' AND m.country='gb'

One important thing is that it better to start with the table for which we can select efficiently fewer rows before starting joining. Though in most situations this would not cause any performance differences, but we cannot rely on the specific behavior of the DBMS.

Principles of Database Systems (H) Lecture 4 Notes

Advanced Queries

DISTINCT

Aggregate Functions

Retrieving Data from Multiple Tables

More about JOIN

`DISTINCT`

More about `JOIN`