Sophisticated Search and Modification

Fuzzy Search

When we search things like the movie title, we often want to use the fuzzy search. For example, we want that a search of “2001, space odyssey” can return “2001: a space odyssey”. The common approach is the “full-text search”, which split the text in words (removing the punctuations), and eliminate the common words, then associate each word with the records in database.

These words are stored in a special table.

CREATE TABLE movie_title_ft_index(
	title_word varchar(30) NOT NULL,
	movieid int NOT NULL,
	primary key(title_word, movieid),
	foreign key (movieid)
		references movies(movieid)
)

Note that a true full-text search engine “stems” words, i.e., the singular/plural, infinitive/preterit/past participle are considered as the same word.

But the issue is the same word will probably occur in several titles, so we need a way to qualify the search matches. A naive solution may be to sort the number of matched words:

SELECT movieid FROM
	(SELECT movieid, rank() OVER (ORDER BY hits DESC) AS rnk
		FROM
			(SELECT movieid, count(*) AS hits
				FROM movie_title_ft_index
				WHERE title_word IN ('SPACE', 'ODYSSEY', '2001')
				GROUP BY movieid) q2) q1
	WHERE rnk=1

But what about the ties when the numbers of matches are the same? We may then use the length of unmatched part to rate these search. And to deal with typo, we can extend our index with common misspellings and typos.

Transaction

Often in financial operations, we may want that the series of translated database operations must all succeed or fail at once (like the atomic operation in the concurrent programming). We can do this by putting them into one transaction. Somce products require a special command to start a transaction:

1 2	BEGIN TRANSACTION ...

Other products such as Oracle or DB2 automatically start aa transaction if you aren’t already in one when you start modifying data. A transaction ends when you issue either COMMIT or ROLLBACK:

BEGIN;
DELETE FROM movies WHERE movieid=25;
ROLLBACK;

BEGIN;
DELETE FROM movies WHERE movieid=25;
COMMIT;

One important thing is concurrency. While there is one active transaction that is modifying the databse, the DBMS will prevent other users from changing the same data.

Every COMMIT defines a consistent state of data, which is the “official” state of data at that very moment. A ROLLBACK will takes you back to the last consistent state known.

[!caution] Many products (MySQL is one of them) start in “autocommit” mode, which means that every change is automatically commited and cannot be undone otherwise than by running the reverse operation (not always easy). Some interfaces such as JDBC (Java DataBase Connectivity) always start in autocommit mode, even when accessing products such as Oracle that never natively work in such a mode.

Caution

For products such as Oracle and MySQL, any change to the structure of the database (DDL operations) automatically commits all pending changes to data; DDL operations cannot usually be rolled back. This isn’t the case with SQL Server or PostgreSQL, for which a transaction can contain DDL statements (they can be rolled back). Switching to another DBMS is a mine field.

In real world bank applications, a batch program may run periodically to recompute the new balance, and a withdraw/deposit will only insert a new record in a operation table. In between, balances can be recomputed from the old balance and by aggregating the latest operations.

Another case will be the like/dislike of websites, we can use a primary key in the operation table and that will ensure that everybody can only vote once for a video. Under this scenario, what appears as an update to the database is in fact an insert.

On the other hand, deletes often are virtual deletes, i.e. they never really remove the record from the database, instead the update a “active” flag. One good reason for this is foreign keys, which need a cascading delete when you remove the key they rely on.

Insert

Most products (except Oracle and SQLite) allow inserting several rows in one statement, with a comma separated list of row data between parentheses:

INSERT INTO table_name
		(col1, col2, ..., coln)
	VALUES
		(value1, value2, ..., valuen),
		...,
		(value1, value2, ..., valuen)

If the columns are not specified, the DBMS will treat it as inserting into all columns in the same order as they are displayed when running SELECT *. This is extremely dangerous since the column order may change!

If one column is omitted from the INSERT, it will be the default one if sepcified, otherwise it will be NULL. To specify a default value for a column we need to do this at table creation:

CREATE TABLE table_name
(
	column_name data_type DEFAULT default_value NOT NULL
)

Concurrency

On a busy database it is common that two sessions will read the same value and try to insert the same value simultaneously. The solution is to let the system manage the generation of new identifiers, and there are two ways of doing it.

Sequence

We can use special database objects called sequences, which are simply number generated. By default they start with 1 and increase by 1:

1	CREATE SEQUENCE movie_seq

We can obtain a new number that guarateed to be unique from a sequence, and retrieve the last number obtained for this sequence and this session: Oracle:

INSERT INTO movies(movieid, ...)
VALUES(movie_seq.nextval, ...)
INSERT INTO credits(movieid, ...)
VALUES(movie_seq.currval, ...)

DB2, SQLServer:

INSERT INTO movies(movieid, ...)
VALUES(NEXT VALUE FOR movie_seq, ...)
INSERT INTO credits(movieid, ...)
VALUES(PREVIOUS VALUE FOR movie_seq, ...)

PostgreSQL:

INSERT INTO movies(movieid, ...)
VALUES(nextval('movie_seq'), ...)
INSERT INTO credits(movieid, ...)
VALUES(currval('movie_seq'), ...)

Auto-numbered Column

SQLServer:

1 2	CREATE TABLE movies (movieid int NOT NULL IDENTITY PRIMARY KEY)

DB2:

1 2	CREATE TABLE movies (movieid int GENERATED AS IDENTITY PRIMARY KEY)

PostgreSQL, MySQL:

1 2	CREATE TABLE movies (movieid SERIAL PRIMARY KEY)

MySQL:

1 2	CREATE TABLE movies (movieid int NOT NULL AUTO_INCREMENT PRIMARY KEY)

SQLite:

1 2	CREATE TABLE movies (movieid integer PRIMARY KEY)

Now if we insert a film with an auto-numbered column, just omit the movieid from the INSERT statement, and it will get automatically populated. To retrieve the last value generated in your session:

SQLServer: @@identity
PostgreSQL: lastval()
MySQL: last_insert_id()
IBM DB2: identity_val_local()
SQLite: last_insert_rowid()

Loading Data from a File

MySQL:

LOAD DATA INFILE '/tmp/us_movie_info.txt'
INTO TABLE us_movie_info
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'

In the file is on the local file system:

LOAD DATA LOCAL INFILE '/tmp/us_movie_info.txt'
INTO TABLE us_movie_info
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'

SQLServer:

1
2
3

BULK INSERT us_movie_info
FROM 'C:\temp\us_movie_info.txt'
WITH (rowterminator = '0x0a')

For a CSV file, a format file must be used:

1
2
3

BULK INSERT us_movie_info
FROM 'C:\temp\us_movie_info.csv'
WITH (formatfile = 'C:\temp\us_movie_info.fmt')

SQLServer has a special function openrowset() that allows us to see the file as a table:

SELECT REPLACE(title, '"', '') title,
	   year_released,
	   duration,
	   CASE color
	     WHEN 'B' THEN 'N'
	     WHEN 'C' THEN 'Y'
	   END color
FROM openrowset(BULK 'C:\temp\us_movie_info.csv',
				formatfile='C:\temp\us_movie_info.fmt') AS virtual_table

Oracle can do similar things to SQLServer but requires registering a directory from the database, and your account must have special rights given by a Database Administrator on this directory:

CREATE TABLE virtual_us_movie_info
	(title varchar2(150),
	 year_released number(4)
	 duratoin number(3),
	 color char)
ORGANIZATION EXTERNAL (DEFAULT DIRECTORY input_dir
	ACCESS PARAMETERS
		(RECORDS DELIMITED BY '\n'
		 FILEDS TERMINATED BY ' ')
	LOCATION ('us_movie_info.txt'))

PostgreSQL uses the non-standard SQL COPY command to load a file located on the server. The default format is tab-separated:

1
2
3

COPY us_movie_info
FROM '/tmp/us_movie_info.csv'
WITH (FORMAT CSV)

1
2
3

\COPY us_movie_info
FROM '/tmp/us_movie_info.csv'
WITH (FORMAT CSV)

Principles of Database Systems (H) Lecture 7 Notes