Databases and SQL

Download Report

Transcript Databases and SQL

Databases and SQL
Storing Data
• The simplest way to store data is in a “flat” file.
That is, a file written in plain ASCII text, that you
can open and read. For relatively small amounts
of data, this simple approach is a good idea.
• However, flat files are not very efficient: they
must be searched linearly, from beginning to
end. Also, flat files take up a lot of space due to
redundancy and to blank spaces used to make
them more readable.
• A better approach for moderate to large amounts
of data is to use a relational database such as
MySQL (“my ess-que-ell”) or Oracle.
Basic Database Theory
• In 1970, IBM researcher E. F. Codd published the
seminal paper “A relational model of data for large
shared databases”, which specified the basic design
principles still used today for designing databases.
• Database design can be very complicated: it is a whole
specialized field that we are going to touch on only very
lightly.
• Databases are composed of tables of data. Each row is
an “entity”, a single object such as a gene. Entities are
also called “records”.Each column is an “attribute”, a
property that the entities have, for example the length of
the gene or the protein it produces. Attributes are also
called “fields”.
Normalization
• The basic principle in database design is that we
want to enter each unique piece of data a single
time only. That way, if it changes we need to
only change a single copy instead of having
multiple copies that will get out of synch with
each other. The process of designing tables to
accomplish this is called “normalization”.
• There are 3 “normal forms”, or normalization
principals.
First Normalization Principal
• A database is said to be in “first normal form” if
every attribute has exactly one value. “NULL” is
an acceptable value.
• For example, if you are creating a database of
chromosomes, you cannot have a column
(attribute) named “genes”, because there are
many genes on each chromosome. Instead, you
would need to create a separate “genes” table.
The genes table would have a column named
“chromosome”.
Skip a bunch of theory...
Accessing MySQL Databases
•
At the command line prompt, type “mysql -u z012345 -p” . The “-u” expects your
MySQL user name, and “-p” causes it to prompt you for your password. Note that
your MySQL user name and password are not necessarily the same as your regular
biolinx password. If all goes well, MySQL will then give you some kind of welcome
message.
•
You need to use a specific database. For the examples here, we will be using the
“GL” database or the “bios546db” database. You also have your own personal
database, called something like “z012345db”. It takes special permission to create a
new database, so I have done that for you.
•
To access a particular database type “use GL” (or whatever) at the MySQL prompt.
•
MySQL commands are NOT case-sensitive. Often, command keywords are written
in all caps, just to distinguish them from your user-specific values such as table and
column names. However, this isn’t necessary.
MySQL commands have to end with a semicolon (;). If you forget this, the system
will put an arrow on the next line: ->, which means it is waiting for you to finish the
command.
•
Accessing Data Tables
• A database consists of one or more tables. To see which tables are
present in the database, type “SHOW TABLES;” for the current
database, or “SHOW TABLES FROM bios546db;” if you want to
see the tables in bios546db and are in another database. Note that
all commands must end in a semi-colon.
• The GL database has tables “pets”, “clients”, and “charges”.
• Tables contain columns of data. To see the columns and the type of
data they contain, type “SHOW COLUMNS FROM table_name;”.
Or, you can use the equivalent command “DESCRIBE table_name;”.
• The “pets” table has columns pet_id, owner_id, name, species, age,
sex, breed, color, temperment, and notes. Note that pet_id is listed
as the primary key. This means that pet_id is a unique identifier for
each entity (row, individual pet) in this table.
Accessing Data
• The basic keyword used to access data is “SELECT”.
• The basic syntax is: “SELECT column1, column2, ...
FROM table_name WHERE qualifying_clause;”
• A typical select statement is something like: “SELECT
name FROM pets WHERE owner_id = 3;” This extracts
data from the “name” column in the “pets” table, and only
gets the data rows in which the “owner_id” column has
the value 3.
• The wildcard character is “*”, which will give you all the
data. Thus, “SELECT * FROM pets;” extracts all data
from the table.
WHERE clauses
•
•
•
•
•
•
•
SELECT statements are based on properly qualifying which data you want
to extract.
The equals sign ( = ) matches rows. Automatic conversion of types occurs,
so strings and numbers can all be compared with the same symbol.
Not equals is either <> or !=
<, <=, >, >= all work
AND or &&, OR or ||, NOT or !: MySQL is very flexible.
You can also use LIKE instead of =, and use “%” as a wildcard character.
For example, SELECT name FROM pets WHERE name LIKE ‘B%’; will
extract all pet names that start with a B and are followed by any number of
other characters. Similarly, SELECT name FROM pets WHERE name LIKE
‘%b’; will find any name that ends in a “b”.
All string values must be quoted. I usually use single quotes, but double
quotes also work. In contrast, numerical values are not quoted. On the
third hand (the gripping hand), database names, table names, and column
names are not quoted. Very annoying, and MySQL is very picky about such
things.
Functions in SELECT statements
• The WHERE clause can do mathematical, string, and logical
operations on either side of the comparison operator (equals sign).
• Parentheses can be used to group statements appropriately.
• Simple math operations: +, -, *, /, %. For example: SELECT name
FROM pets WHERE owner_id % 2 = 0; will give you all the pets
whose owners have ID numbers that are even.
• MySQL has quite a number of built in functions as well. A few
examples: ABS (absolute value), CONCAT (concatenate strings),
CEILING (smallest integer larger than the operand), GREATEST
(returns largest value from the list following the operator), LENGTH
(number of characters in string), LOG (natural logarithm), RIGHT
(returns rightmost characters in the string). See the list in the book
handout.
• IF statements can also be used as part of the WHERE clause. The
format is: IF(test, value1, value2). If the “test” is true, value1 is
returned; if false, value2 is returned.
Aggregate Functions
• SELECT statements can also be used to
perform operations on the returned data. The
function is written into the list of columns to be
used, with the column name in parentheses.
• For example: SELECT AVG(age) FROM pets
WHERE species = ‘feline’; gives a single value,
the average age of all the cats.
• Other useful aggregate functions: COUNT, MAX,
MIN, SUM, STD (standard deviation).
• COUNT is especially useful is determining the
size of the database: SELECT COUNT(*) FROM
pets;
Joining Data from Two of More
Tables
•
•
•
•
•
•
The whole point of relational databases is to separate data into different
tables so updates need only be done in one place.
The basic syntax needed to join data from 2 tables is how to refer to a
specific column in a specific table. This is done by writing the table name
followed by a dot followed by the column name, For example, “pets.name”
refers to the “name” column in the “pets” table.
If a column name is unambiguous (i.e. only found in 1 table), you can
usually get away with not qualifying it with the table name.
You need to list all the tables that you are using in the FROM clause.
An example: SELECT pets.name, clients.first_name, clients.last_name
FROM pets, clients WHERE pets.owner_id = clients.client_id AND species
= ‘canine’; The WHERE clause compares every entry in the pets.id column
to the entries in clients.client_id, and returns only those rows with a match.
MySQL chooses which table to read first based on efficiency.
There are other, fancier ways to join tables, such as the “LEFT JOIN”. I’m
not going to talk about them.
Creating Tables
• Creating a new database requires special permission, so I have
created you each your own database, with name something like
“z012345db”.
• The basic commands for tables are: CREATE, ALTER, and DROP.
• For creating a table, the syntax is: CREATE TABLE name
(column1_name column1_type, column2_name column2_type, ...);
Note the position of the commas in the column list: each column is
given a name and a type, followed by a comma, then the next
column name and type.
• The statement in parentheses following the table name is the “create
clause”, It consists of the column name followed by the column type,
followed optionally by modifiers to the type.
Column Types
• Column types determine what kind of data is stored and
how much space is allocated for it. The basic types are
integer, floating point numbers, dates, and text. There
are a number of variations of each of these, depending
on how long the values are expected to be. The table in
the book helps.
• In many cases you need to add the length (number of
characters) in parentheses following the type. For
instance CHAR(20) would set aside 20 characters of
space for each data row. Note that for numerical values,
the decimal point, minus sign, and “E” (for exponent)
each take up a character.
More Column Types
•
•
You can get a long way on just a few types.
INT will cover all but very large integers ( between -2,147,483,648 and
2,147,483,647).
–
•
FLOAT or DOUBLE will cover floating point numbers (single or double-precision,
respectively).
–
•
•
•
•
•
the optional parameter “length” gives the number of characters displayed, right-justified and
padded with spaces. Thus, INT(5) would print the number 17 as
“ 17”.
you can add optional parameters in the form (M, D), where M is the total number of digits
displayed, and D is the number after the decimal point. Note that, unlike printf, the M
parameter doesn’t count the decimal point or a “-” at the beginning. Thus, FLOAT(5,2) would
print out a value such as “ 4.32”.
CHAR(length) works for text. By default it is case-insensitive. If you want casesensitive fields, use BINARY(length) instead. The length can be up to 255
characters.
TEXT and BLOB are variable length text fields, case-insensitive and case-sensitive,
respectively. They hold up to 64 KB of data. If you need more, MEDIUMTEXT (or
BLOB) holds up to 16 MB, and LONGTEXT (or BLOB) holds up to 4 GB.
DATE stores dates in a YYYYMMDD format: 4 digits of the year, followed by a 2 digit
month, followed by a 2 digit day.
TIMESTAMP stores the last time that data row was modified. It has quite a few
formats. The default stores the time to the second, which I find worthless. I like
TIMESTAMP(8), which stores YYYYMMDD format.
As usual, there are other data types--see the book.
Primary Key
• Each table needs a primary key, a column with a unique
entry for each row. Although you can use anything you
like for this, it is usually easiest and safest to use
sequentially numbered integers as the ID numbers of the
primary key.
• The primary key is designated as such in the “create
clause”, using the keyword “PRIMARY KEY” after the
column type. Also, if you use “AUTO_INCREMENT”
after the column type, MySQL will automatically put in
the next available value.
• For example: CREATE TABLE table_name (id INT
PRIMARY KEY AUTO_INCREMENT, name CHAR(20),
age INT(3) );
Drop
• If you want to remove a table, use DROP
TABLE table_name;
Altering Tables
•
“ALTER TABLE table_name” can be used to add, change, or remove the
columns. Various keywords follow the table name:
•
Adding a new column: ALTER TABLE name ADD (create clause); Use the
same type of clauses as in CREATE: column_name column_type.
•
Dropping a column: ALTER TABLE name DROP column_name;
•
Changing a column type: ALTER TABLE name MODIFY column_name
new_type
•
Changing a column name: ALTER TABLE name CHANGE old_name
new_name column_type. Note that you have to put in the column type as
well as the new name, even if the type stays the same!
•
So, MODIFY is just for changing the column type, leaving the column name
the same. CHANGE is for changing the column name as well as its type:
with CHANGE you must specify the type as well as the name.
Adding, Altering, and Removing
Data
• The basic commands for individual data items are
INSERT, UPDATE, and DELETE.
• If you want to insert a value into every column of a table
in the proper order, the command is INSERT INTO
table_name VALUES (col1_value, col2_value,
col3_value, ...); The values are simply inserted left to
right at the next available position in the table.
• If your primary key column is an INT with
AUTO_INCREMENT (which I highly recommend), use
“null” as the value for that column when inserting data.
This tells MySQL to invoke the AUTO_INCREMENT
function and put in the next integer.
• Similarly, the value of a TIMESTAMP column should be
“null”, so MySQL will automatically insert the current
time.
• The “INTO” word isn’t really necessary.
More INSERTs
• You can insert values into multiple rows by separating the values for
each row (enclosed in parentheses) with commas: INSERT INTO
table VALUES (val1a, val2a, val3a), (val1b, val2b, val2c), (val3a,
val3b, val3c);
• You can take the results of a SELECT statement and insert them
into a table: INSERT INTO table (column_name1, column_name2,
...) SELECT statement;
• You can insert into specific columns: the values of the other columns
will be set to NULL or to a pre-determined default value. There are
two ways of doing this:
– INSERT INTO table_name (col1_name, col2_name) VALUES
(col1_value, col2_value);
– INSERT INTO table_name SET col1_name=col_value,
col2_name=col2_value;
UPDATE
• You can alter data in specific rows using
UPDATE. It uses a WHERE clause like
the ones used in SELECT statements, and
it uses the “SET” keyword as in the
previous INSERT statement.
• UPDATE table_name SET
col3_name=col3_value WHERE
col2_name = whatever;
DELETE
• Data rows are removed with the DELETE
command. WHERE clauses are used
here as well.
• DELETE FROM table_name WHERE
col1_name = some_value;
• Note that the entire row is deleted. IF you
just want to remove the value of a
particular column, use UPDATE and set
that column to NULL.
LOAD
• To load large amounts of data into a database, the LOAD command
can be used. It reads from an external file.
• By default, the external file should have each data field (column)
separated by tabs and each data entity (row) on a separate line.
• You can modify this default behavior by adding the phrases “FIELDS
TERMINATED BY ‘termination character’ “, or FIELDS ENCLOSED
BY ‘enclosing character’ “, or LINES TERMINATED BY ‘termination
character’ “. Note that you can only use one character in these
phrases. I often use FIELDS TERMINATED BY ‘,’, a comma as
termination character, instead of the default tab character.
• Basic syntax: LOAD DATA LOCAL INFILE ‘path_to_file’ INTO
TABLE table_name;
• Be sure you have permission to read the file!
Indexes
•
•
•
•
•
•
The whole point of using a database, as opposed to a flat file, is to speed up
searches. One way to markedly increase search speed is to make an index
based on your most common search criteria.
For instance, I have a table of rice genes, containing the gene name, the
chromosome it is on, the start and end positions on that chromosome, and
its orientation. nearly all searches are based on chromosome and start
position. Thus my main index uses these two criteria.
You can make as many indexes as you like.
An index is made from one or more columns, using the syntax CREATE
INDEX index_name ON table_name (column_name1, column_name2, ...).
If you use more than one column, you automatically get additional indexes
based on all subsets of the columns form left to right. For example, if your
created an index with the columns name, rank, and serial_num, you also
get an index based on name only, and another one based and name and
rank.
When you create a table, an index based on the primary key is
automatically constructed.
Commands Used in Class
• The commands all are written with ">" in front of them. They relate
to the “bios546db” database.
• >show databases;
• >create table student (id int auto_increment primary key, name
char(30), year char(9), gender char(1), up_date timestamp(8) );
• >show columns from student;
• >create table bios308 (id int not null, name char(30), quiz1 int, quiz2
int, grade char(1) );
• show columns from bios308;
• >alter table student change year class char(9);
• >alter table student modify class char(10);
• >alter table bios308 add quiz3 int;
More Commands Used in Class
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
> insert into student values (null, 'fred flintstone', 'junior', 'm' , null);
>select * from student;
>insert into student values (null, 'wilma flintstone', 'senior', 'f', null);
>insert into student values (null, 'betty rubble', 'sophomore', 'f', null), (null, ‘barney
rubble’, 'sophomore', 'm', null);
> insert into student (name, gender) values ('bam bam', 'm');
>update student set name = 'bamm bamm' where name = 'bam bam';
> update student set class = 'freshman' where name = 'bamm bamm';
>update student set class = 'freshperson' where class = 'freshman';
>alter table student modify class char(11);
update student set class = 'freshperson' where class = 'freshperso';
> delete from student where class = 'senior';
>insert into bios308 (id, name) select id, name from student where class='junior' OR
class='sophomore';
> update bios308 set quiz1 = 14 where name= 'fred flintstone';
> update bios308 set quiz2 = 25, quiz3 = 31, grade = 'C' where name = 'fred
flintstone';
> update bios308 set quiz1 = 34, quiz2 = 27, quiz3 = 38, grade = 'B' where name like
‘%tty%';
LOAD DATA LOCAL INFILE 'mysql_new_students.txt' INTO TABLE student;
The Perl DBI
• SQL is a whole language unto itself, and database specialists
become masters of it. On the other hand, we are trying to give you
the basics of SQL only, and then have you use Perl to deal with
many of the fancier manipulations that an SQL master could
perform.
• The Perl database interface (DBI) uses object-oriented syntax. You
need to create objects, mainly database and statement handles,
then use their associated methods with the arrow (->) operator.
Note this arrow uses the hyphen ( - ). It is different from the =>
operator that substitutes for the comma when creating hashes.
• Although object-oriented programming is a whole philosophy, it can
easily be used without any further knowledge if you just consider it
to be a slightly altered syntax.
Connecting to MySQL from Perl
• You need to use the DBI module: put “use DBI;” at the top of your
program, under “use strict;”.
• You need to create a database handle, which we will call $dbh. This
is a scalar variable that holds a reference to the data in the database
object.
my $dbh = DBI->connect(database, name, password);
• Here, the “database” must be fully qualified: “dbi:mysql:bios546db”
would connect you to the bios546db database.
• Your personal MySQL databases don’t have a password, so just use
“” for that field.
• So, to access the z01235db database through Perl, use the
command:
my $dbh = DBI->connect(“dbi:mysql:z012345db”, “z012345”, “”);
• After you are done using the database, you should disconnect from
it with the command $dbh->disconnect;
Do Statements
• For all commands except SELECT and SHOW, the “do”
statement can be used. The syntax is: $dbh->do(“your
SQL statement”); .Note that the SQL statement inside
the parentheses doesn’t need to end in a semicolon. As
usual, the line of Perl code must still end in a semicolon.
• For example: $dbh->do(“INSERT INTO student VALUES
(null, ‘Mr. Ed., ‘horse’, ‘m’, null)”); The return value of the
statement is the number of rows affected. The NULL
values here are for the auto_increment primary key and
for the timestamp.
• MySQL is annoyingly picky about quotes. Table and
column names are NOT quoted, but string values are
quoted. Also, numerical values are not quoted.
SELECT Statements
• For each new SELECT statement you need to prepare a
statement handle, then execute it, then fetch the results.
You can use the same statement handle repeatedly once
it is prepared.
my $sth = $dbh->prepare(“SELECT * from student”);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
print “@row\n”;
}
• The fetchrow_array command returns an array
consisting of all the fetched values in the data row. It
does this repeatedly for each row that matches the
select statement. You can of course access individual
elements of the array.
More on SELECT
• A quick alternative to fetchrow_array is “dump_results”,
which just prints all the fetched data at once, with each
data row on a new line.
$sth->dump_results;
• Statement handles should be closed when you are done
with them:
$sth->finish;
• SHOW statements work the same way as SELECT
statements: first prepare them, then execute, then fetch
with fetchrow_array or dump_results.
Binding Parameters
•
•
•
•
•
It is possible to prepare a statement without knowing all the values in
advance. You can add the values when you execute it. With this technique,
different values can be put into the same statement handle during different
executions.
To do this, use question marks (?) in place of the values in the “prepare”
statement.
Then, put the necessary values as a list of parameters in the execute
statement. The values must appear in the same order that the question
marks are in.
This is the easiest way to do multiple INSERT statements into the same
table. Instead of using “do”, prepare a statement handle for the INSERT
statement once, then execute it repeatedly for different values of the data.
For example, assume that the INFILE consists of lines of data, separated by
commas:
my $sth = $dbh->prepare(“INSERT INTO student VALUES (null, ?, ?, ?,
null)”);
while (my $line = <INFILE>) {
my ($name, $class, $gender) = split /,/, $line;
$sth->execute($name, $class, $gender);
}