Principles of Database Systems (H) Lecture 6 Notes

Zhige Chen Lv4

Postprocessing Queries

Sorting

We have a single construct in SQL that sorts the result set:

1
2
3
SELECT title, year_released
FROM movies
ORDER BY year_released
This will return all movies, starting with the oldest one. The default order is ascending, we can also use the keywords DESC and ASC to control it:
1
2
3
4
5
6
7
8
9
10
11
12
SELECT c.country_name, m.title, m.year_released
FROM movies m
INNER JOIN countries c
ON c.country_code = m.country
WHERE m.movie IN
(SELECT DISTINCT c.movieid
FROM credits c
INNER JOIN people p
ON p.peopleid = c.peopleid
WHERE c.credited_as = 'A'
AND p.birth_year >= 1970)
ORDER BY c.country_name ASC, m.year_released DESC, m.title ASC

The specific ordering depends on the data type, e.g. the strings are sorted lexicographically. But it can be troublesome when NULLs are in the comparison. In this case, the ordering depends on specific behaviour of DBMS:

  • NULL is smaller than everything: SQL Server, MySQL, SQLite
  • NULL is greater than everything: DB2, Oracle, PostgreSQL

Sometimes we may found the orderings of texts vary across different languages or cultures. To solve this, we need to use collation:

1
2
3
4
5
6
7
8
-- For PostgreSQL, SQL Server, MySQL
CREATE TABLE table_name(
some_text_column varchar(100)
COLLATE collation_name NOT NULL
)

-- For Oracle
ORDER BY nls_sort(some_text_column, 'collation_name')

Another problem is that when we want orderings other than the ascending and descending behavior. For example, to order the credit of people in movies:

1
2
3
4
5
6
ORDER BY
CASE credited_as
WHEN 'D' THEN 1
WHEN 'P' THEN 2
WHEN 'A' THEN 3
END

Another problem is that we may need to retrieve only a “slice” of sorted data, e.g. successive pages on websites. We can leverage the LIMIT construct to achieve this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- PostgreSQL, MySQL, SQLite
SELECT title, country, year_released
FROM movies
ORDER BY title
LIMIT 10

-- DB2, Oracle, PostgreSQL
SELECT title, country, year_released
FROM movies
ORDER BY title
FETCH FIRST 10 ROWS ONLY

-- SQLServer
SELECT TOP 10 title, country, year_released
FROM movies,
ORDER BY title
If we want to “skip” the first few elements, we can use the OFFSET keyword:
1
2
3
4
5
6
7
8
9
10
11
12
-- PostgreSQL, MySQL, SQLite
SELECT title, country, year_released
FROM movies
ORDER BY title
LIMIT 10 OFFSET 20

-- DB2, Oracle, PostgreSQL
SELECT title, country, year_released
FROM movies
ORDER BY title
OFFSET 20
FETCH FIRST 10 ROWS ONLY

There are also many cases when plain ordering isn’t satisfying. For example, messages in a forum thread can form a very complex hierarchy, i.e. tree-like data. DBMS is known to struggle when dealing with data like this. This is the well-known BOM (Bill Of Materials) problem.

One way to solve this problem is by using the “materialized path”, turing the “ancestor” into a attribute.

Oracle has implemented another solution by dynamic ordering:

1
2
3
4
5
6
SELECT message
FROM forum_posts
CONNECT BY answered_postid = prior postid
START WITH answered_postid IS NULL
AND topicid = ...
ORDER siblings BY postid

Most DBMSs handle hierarchies through recursive queries.

Window Functions

We have seen so far two kinds of functions:

  • Scalar functions that operate on values in the current row
  • Aggregate function that operate on sets of rows.

The problem with aggregate functions is that they ignore details of data. If I ask for the year of the oldest movie per country, the database will only return a country and a year, and nothing else.

If we want more details with aggregate functions, the only option is to join their output to the same table:

1
2
3
4
5
6
7
8
SELECT a.country, a.title, a.year_released
FROM movies a
INNER JOIN
(SELECT country, min(year_released) minyear
FROM movies
GROUP BY country) b
ON b.country = a.country
AND b.minyear = a.year_released
Window functions solve this problem by returning a result for a single row and computing the result on several rows. With DBMS products that support window functions, every aggregate function can be used as a window function:
1
2
3
4
SELECT country, title, year_released,
min(year_released)
OVER (PARTITION BY country) earliest_year
FROM movies
Thus, this query returns two years for every movie:

  • the release year
  • earliest movie for the same country was released

But this will return many unnecessary results. We need to limit output to those movies for which the year of release happens to be the same as the earliest one for their country:

1
2
3
4
5
6
7
SELECT m.country, m.title, m.year_released
FROM
(SELECT country, title, year_released,
min(year_released)
OVER(PARTITION BY country) earliest_year
FROM movies) m
WHERE m.year_released = m.earliest_year
Be careful that if we use filters in the inner query:
1
2
3
4
5
6
7
8
SELECT m.country, m.title, m.year_released
FROM
(SELECT country, title, year_released,
min(year_released)
OVER(PARTITION BY country) earliest_year
FROM movies
WHERE title != 'example') m
WHERE m.year_released = m.earliest_year
it will return the second earliest movie! This is because the window functions are a postprocessing function on the result set. The corrected version is
1
2
3
4
5
6
7
8
SELECT m.country, m.title, m.year_released
FROM
(SELECT country, title, year_released,
min(year_released)
OVER(PARTITION BY country) earliest_year
FROM movies) m
WHERE m.year_released = m.earliest_year
AND title != 'example'

Also note that we cannot “flatten” the nested query into one query. This is mainly because the logical processing order of SQL:

  1. FROM/JOIN
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. Window Functions
  6. SELECT
  7. ORDER BY

Then if we try to move the window function to the outer query:

1
2
3
SELECT country, title, year_released
FROM movies
WHERE year_released = MIN(year_released) OVER(PARTITION BY country)
Since window functions are processed after the WHERE clause, this will cause the WHERE clause depends on unavailable data!

Window functions always operate against rows that belong to a result set. One related characteristics is that they can only appear after the SELECT, not in the WHERE clause, and there is nothing with them similar to HAVING with aggregate functions.

Window functions+OVER can be rewrite by using GROUP BY+JOIN:

1
2
3
4
5
6
7
8
9
SELECT a.country, a.title, a.year_released
FROM movies a
INNER JOIN
(SELECT country, min(year_released) earliest_year
FROM movies
GROUP BY country) b
ON b.country = a.country
AND b.earliest_year = a.year_released
WHERE a.title != 'example'

Similar to that we can have an aggregate function without a GROUP BY clause when we want only one result for the whole table, we can have an empty OVER clause to indicate that the window function should compute the result over all rows selected. This is frequently used in operations such as computing a value as a percentage of the total:

1
2
3
4
5
6
7
8
9
10
11
SELECT country_name, cnt as number_of_movies,
round(100 * cnt / sum(cnt) over (), 0) as percentage
FROM
(SELECT c.country_name, coalesce(m.cnt, 0) cnt
FROM countries c
LEFT OUTER JOIN
(SELECT country, count(*) cnt
FROM movies
GROUP BY country) m
ON m.country = c.country_code) q
ORDER BY country_name

Tip

When there is an ORDER BY clause we cannot start returning rows before you have seen all of them, so we may count them too when sorting, and the marginal cost of the window function is near zero.

We can rewrite the query by using a CROSS JOIN:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT country_name, cnt as number_of_movies,
round(100 * cnt / t.movie_count, 0) as percentage
FROM
(SELECT c.country_name, coalesce(m.cnt, 0) cnt
FROM countries c
LEFT OUTER JOIN
(SELECT country, count(*) cnt
FROM movies
GROUP BY country) m
ON m.country = c.country_code) q
CROSS JOIN
(SELECT count(*) movie_count
FROM movies) t
ORDER BY country_name

There are a special window functions, called the ranking window functions:

  • row_number()
  • rank()
  • dense_rank()

With a ranking window function you must have a ORDER BY clause in the OVER(). You can combine it with a PARTITION BY to order with groups.

The three functions mainly differs in the ways of dealing with ties:

  • row_number() assigns distinct, sequential numbers to everyone
  • rank() assigns the same number to ties, but leave a corresponding gap in ranks
  • dense_rank() also assigns the same number to ties, with no gap in ranks
  • Title: Principles of Database Systems (H) Lecture 6 Notes
  • Author: Zhige Chen
  • Created at : 2025-10-21 22:03:57
  • Updated at : 2025-11-01 20:46:10
  • Link: https://nofe1248.github.io/2025/10/21/dbh-06/
  • License: This work is licensed under CC BY-NC-SA 4.0.
Comments
On this page
Principles of Database Systems (H) Lecture 6 Notes