Postprocessing Queries

Sorting

We have a single construct in SQL that sorts the result set:

1
2
3

SELECT title, year_released
	FROM movies
	ORDER BY year_released

This will return all movies, starting with the oldest one. The default order is ascending, we can also use the keywords DESC and ASC to control it:

SELECT c.country_name, m.title, m.year_released
	FROM movies m
		INNER JOIN countries c
			ON c.country_code = m.country
	WHERE m.movie IN
		(SELECT DISTINCT c.movieid
			FROM credits c
				INNER JOIN people p
					ON p.peopleid = c.peopleid
			WHERE c.credited_as = 'A'
				AND p.birth_year >= 1970)
	ORDER BY c.country_name ASC, m.year_released DESC, m.title ASC

The specific ordering depends on the data type, e.g. the strings are sorted lexicographically. But it can be troublesome when NULLs are in the comparison. In this case, the ordering depends on specific behaviour of DBMS:

NULL is smaller than everything: SQL Server, MySQL, SQLite
NULL is greater than everything: DB2, Oracle, PostgreSQL

Sometimes we may found the orderings of texts vary across different languages or cultures. To solve this, we need to use collation:

-- For PostgreSQL, SQL Server, MySQL
CREATE TABLE table_name(
	some_text_column varchar(100)
		COLLATE collation_name NOT NULL
)

-- For Oracle
ORDER BY nls_sort(some_text_column, 'collation_name')

Another problem is that when we want orderings other than the ascending and descending behavior. For example, to order the credit of people in movies:

ORDER BY
	CASE credited_as
		WHEN 'D' THEN 1
		WHEN 'P' THEN 2
		WHEN 'A' THEN 3
	END

Another problem is that we may need to retrieve only a “slice” of sorted data, e.g. successive pages on websites. We can leverage the LIMIT construct to achieve this:

-- PostgreSQL, MySQL, SQLite
SELECT title, country, year_released
	FROM movies
	ORDER BY title
	LIMIT 10
	
-- DB2, Oracle, PostgreSQL
SELECT title, country, year_released
	FROM movies
	ORDER BY title
	FETCH FIRST 10 ROWS ONLY
	
-- SQLServer
SELECT TOP 10 title, country, year_released
	FROM movies,
	ORDER BY title

If we want to “skip” the first few elements, we can use the OFFSET keyword:

-- PostgreSQL, MySQL, SQLite
SELECT title, country, year_released
	FROM movies
	ORDER BY title
	LIMIT 10 OFFSET 20
	
-- DB2, Oracle, PostgreSQL
SELECT title, country, year_released
	FROM movies
	ORDER BY title
	OFFSET 20
	FETCH FIRST 10 ROWS ONLY

There are also many cases when plain ordering isn’t satisfying. For example, messages in a forum thread can form a very complex hierarchy, i.e. tree-like data. DBMS is known to struggle when dealing with data like this. This is the well-known BOM (Bill Of Materials) problem.

One way to solve this problem is by using the “materialized path”, turing the “ancestor” into a attribute.

Oracle has implemented another solution by dynamic ordering:

SELECT message
	FROM forum_posts
	CONNECT BY answered_postid = prior postid
	START WITH answered_postid IS NULL
		AND topicid = ...
	ORDER siblings BY postid

Most DBMSs handle hierarchies through recursive queries.

Window Functions

We have seen so far two kinds of functions:

Scalar functions that operate on values in the current row
Aggregate function that operate on sets of rows.

The problem with aggregate functions is that they ignore details of data. If I ask for the year of the oldest movie per country, the database will only return a country and a year, and nothing else.

If we want more details with aggregate functions, the only option is to join their output to the same table:

SELECT a.country, a.title, a.year_released
	FROM movies a
		INNER JOIN
			(SELECT country, min(year_released) minyear
				FROM movies
				GROUP BY country) b
			ON b.country = a.country
				AND b.minyear = a.year_released

Window functions solve this problem by returning a result for a single row and computing the result on several rows. With DBMS products that support window functions, every aggregate function can be used as a window function:

SELECT country, title, year_released,
	min(year_released)
		OVER (PARTITION BY country) earliest_year
	FROM movies

Thus, this query returns two years for every movie:

the release year
earliest movie for the same country was released

But this will return many unnecessary results. We need to limit output to those movies for which the year of release happens to be the same as the earliest one for their country:

SELECT m.country, m.title, m.year_released
	FROM 
		(SELECT country, title, year_released,
			min(year_released)
				OVER(PARTITION BY country) earliest_year
			FROM movies) m
	WHERE m.year_released = m.earliest_year

Be careful that if we use filters in the inner query:

SELECT m.country, m.title, m.year_released
	FROM 
		(SELECT country, title, year_released,
			min(year_released)
				OVER(PARTITION BY country) earliest_year
			FROM movies
			WHERE title != 'example') m
	WHERE m.year_released = m.earliest_year

it will return the second earliest movie! This is because the window functions are a postprocessing function on the result set. The corrected version is

SELECT m.country, m.title, m.year_released
	FROM 
		(SELECT country, title, year_released,
			min(year_released)
				OVER(PARTITION BY country) earliest_year
			FROM movies) m
	WHERE m.year_released = m.earliest_year
		AND title != 'example'

Also note that we cannot “flatten” the nested query into one query. This is mainly because the logical processing order of SQL:

FROM/JOIN
WHERE
GROUP BY
HAVING
Window Functions
SELECT
ORDER BY

Then if we try to move the window function to the outer query:

1
2
3

SELECT country, title, year_released
	FROM movies
	WHERE year_released = MIN(year_released) OVER(PARTITION BY country)

Since window functions are processed after the WHERE clause, this will cause the WHERE clause depends on unavailable data!

Window functions always operate against rows that belong to a result set. One related characteristics is that they can only appear after the SELECT, not in the WHERE clause, and there is nothing with them similar to HAVING with aggregate functions.

Window functions+OVER can be rewrite by using GROUP BY+JOIN:

SELECT a.country, a.title, a.year_released
	FROM movies a
		INNER JOIN
			(SELECT country, min(year_released) earliest_year
				FROM movies
				GROUP BY country) b
			ON b.country = a.country
				AND b.earliest_year = a.year_released
	WHERE a.title != 'example'

Similar to that we can have an aggregate function without a GROUP BY clause when we want only one result for the whole table, we can have an empty OVER clause to indicate that the window function should compute the result over all rows selected. This is frequently used in operations such as computing a value as a percentage of the total:

SELECT country_name, cnt as number_of_movies,
	   round(100 * cnt / sum(cnt) over (), 0) as percentage
	FROM 
		(SELECT c.country_name, coalesce(m.cnt, 0) cnt
			FROM countries c
				LEFT OUTER JOIN 
					(SELECT country, count(*) cnt
						FROM movies
						GROUP BY country) m
					ON m.country = c.country_code) q
	ORDER BY country_name

Tip

When there is an ORDER BY clause we cannot start returning rows before you have seen all of them, so we may count them too when sorting, and the marginal cost of the window function is near zero.

We can rewrite the query by using a CROSS JOIN:

SELECT country_name, cnt as number_of_movies,
	   round(100 * cnt / t.movie_count, 0) as percentage
	FROM 
		(SELECT c.country_name, coalesce(m.cnt, 0) cnt
			FROM countries c
				LEFT OUTER JOIN 
					(SELECT country, count(*) cnt
						FROM movies
						GROUP BY country) m
					ON m.country = c.country_code) q
		CROSS JOIN
			(SELECT count(*) movie_count
				FROM movies) t
	ORDER BY country_name

There are a special window functions, called the ranking window functions:

row_number()
rank()
dense_rank()

With a ranking window function you must have a ORDER BY clause in the OVER(). You can combine it with a PARTITION BY to order with groups.

The three functions mainly differs in the ways of dealing with ties:

row_number() assigns distinct, sequential numbers to everyone
rank() assigns the same number to ties, but leave a corresponding gap in ranks
dense_rank() also assigns the same number to ties, with no gap in ranks

Principles of Database Systems (H) Lecture 6 Notes

Postprocessing Queries

Sorting

Window Functions