When we search things like the movie title, we often want to use the
fuzzy search. For example, we want that a search of “2001,
space odyssey” can return “2001: a space odyssey”. The common approach
is the “full-text search”, which split the text in words (removing the
punctuations), and eliminate the common words, then associate each word
with the records in database.
Note that a
true full-text search engine “stems” words, i.e., the singular/plural,
infinitive/preterit/past participle are considered as the same word.
But the issue is the same word will probably occur in several titles,
so we need a way to qualify the search matches. A naive solution may be
to sort the number of matched words:
1 2 3 4 5 6 7 8
SELECT movieid FROM (SELECT movieid, rank() OVER (ORDERBY hits DESC) AS rnk FROM (SELECT movieid, count(*) AS hits FROM movie_title_ft_index WHERE title_word IN ('SPACE', 'ODYSSEY', '2001') GROUPBY movieid) q2) q1 WHERE rnk=1
But what about the
ties when the numbers of matches are the same? We may then use the
length of unmatched part to rate these search. And to deal with typo, we
can extend our index with common misspellings and typos.
Transaction
Often in financial operations, we may want that the series of
translated database operations must all succeed or fail at once (like
the atomic operation in the concurrent programming). We can do this by
putting them into one transaction. Somce products require a
special command to start a transaction:
1 2
BEGINTRANSACTION ...
Other products
such as Oracle or DB2 automatically start aa transaction if you aren’t
already in one when you start modifying data. A transaction ends when
you issue either COMMIT or ROLLBACK:
1 2 3 4 5 6 7
BEGIN; DELETEFROM movies WHERE movieid=25; ROLLBACK;
BEGIN; DELETEFROM movies WHERE movieid=25; COMMIT;
One important thing is concurrency. While there is one
active transaction that is modifying the databse, the DBMS will prevent
other users from changing the same data.
Every COMMIT defines a consistent state of data, which
is the “official” state of data at that very moment. A
ROLLBACK will takes you back to the last consistent state
known.
[!caution] Many products (MySQL is one of them) start in “autocommit”
mode, which means that every change is automatically commited and cannot
be undone otherwise than by running the reverse operation (not always
easy). Some interfaces such as JDBC (Java DataBase Connectivity) always
start in autocommit mode, even when accessing products such as Oracle
that never natively work in such a mode.
Caution
For products such as Oracle and MySQL, any change to the structure of
the database (DDL operations) automatically commits all pending changes
to data; DDL operations cannot usually be rolled back. This isn’t the
case with SQL Server or PostgreSQL, for which a transaction can contain
DDL statements (they can be rolled back). Switching to another DBMS is a
mine field.
In real world bank applications, a batch program may run periodically
to recompute the new balance, and a withdraw/deposit will only insert a
new record in a operation table. In between, balances can be recomputed
from the old balance and by aggregating the latest operations.
Another case will be the like/dislike of websites, we can use a
primary key in the operation table and that will ensure that everybody
can only vote once for a video. Under this scenario, what appears as an
update to the database is in fact an insert.
On the other hand, deletes often are virtual deletes, i.e. they never
really remove the record from the database, instead the update a
“active” flag. One good reason for this is foreign keys, which need a
cascading delete when you remove the key they rely on.
Insert
Most products (except Oracle and SQLite) allow inserting several rows
in one statement, with a comma separated list of row data between
parentheses:
If the columns are not specified, the DBMS
will treat it as inserting into all columns in the same order as they
are displayed when running SELECT *. This is extremely
dangerous since the column order may change!
If one column is omitted from the INSERT, it will be the
default one if sepcified, otherwise it will be NULL. To
specify a default value for a column we need to do this at table
creation:
On a busy database it is common that two sessions will read the same
value and try to insert the same value simultaneously. The solution is
to let the system manage the generation of new identifiers, and there
are two ways of doing it.
Sequence
We can use special database objects called sequences, which
are simply number generated. By default they start with 1 and increase
by 1:
1
CREATESEQUENCE movie_seq
We can obtain a new number that guarateed to be
unique from a sequence, and retrieve the last number obtained for this
sequence and this session: Oracle:
1 2 3 4
INSERT INTO movies(movieid, ...) VALUES(movie_seq.nextval, ...) INSERT INTO credits(movieid, ...) VALUES(movie_seq.currval, ...)
CREATETABLE movies (movieid intNOTNULLIDENTITYPRIMARY KEY)
DB2:
1 2
CREATETABLE movies (movieid intGENERATEDASIDENTITYPRIMARY KEY)
PostgreSQL, MySQL:
1 2
CREATETABLE movies (movieid SERIALPRIMARY KEY)
MySQL:
1 2
CREATE TABLE movies (movieid int NOT NULL AUTO_INCREMENT PRIMARY KEY)
SQLite:
1 2
CREATE TABLE movies (movieid integer PRIMARY KEY)
Now if we insert a film with an auto-numbered column, just omit the
movieid from the INSERT statement, and it will
get automatically populated. To retrieve the last value generated in
your session:
SQLServer: @@identity
PostgreSQL: lastval()
MySQL: last_insert_id()
IBM DB2: identity_val_local()
SQLite: last_insert_rowid()
Loading Data from a File
MySQL:
1 2 3 4
LOAD DATA INFILE '/tmp/us_movie_info.txt' INTO TABLE us_movie_info FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
In the file is on the local file system:
1 2 3 4
LOAD DATA LOCAL INFILE '/tmp/us_movie_info.txt' INTO TABLE us_movie_info FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
SQLServer:
1 2 3
BULK INSERT us_movie_info FROM'C:\temp\us_movie_info.txt' WITH (rowterminator = '0x0a')
For a CSV file, a format file must be used:
1 2 3
BULK INSERT us_movie_info FROM'C:\temp\us_movie_info.csv' WITH (formatfile = 'C:\temp\us_movie_info.fmt')
SQLServer has a special function openrowset() that
allows us to see the file as a table:
1 2 3 4 5 6 7 8 9
SELECT REPLACE(title, '"', '') title, year_released, duration, CASE color WHEN'B'THEN'N' WHEN'C'THEN'Y' END color FROM openrowset(BULK 'C:\temp\us_movie_info.csv', formatfile='C:\temp\us_movie_info.fmt') AS virtual_table
Oracle can do similar things to SQLServer but requires registering a
directory from the database, and your account must have special rights
given by a Database Administrator on this directory: