Transcript Document

Databases
A/Prof Alan Fekete
University of Sydney
Overview
• Data, databases, and database
management systems
• Querying a database with SQL
• Creating a database with a RDBMS
• Tradeoffs
Data
• Facts about the world that are worth knowing
– Eg NGC 450 is at 18.876,-0.861
– Eg The AAT has a 3.9 m reflector
– The redshift of NGC 450 was measured with the
AAT’s 2dF spectrograph
• Data is very valuable
– Data can be analysed to produce theories, or to
test them
– It takes a lot of effort to gather data
Persistence
• Values of variables in a running program are
stored in the memory (RAM)
– They are lost when the program ends, or the
machine crashes
• Computers have persistent storage (hard
disk, flash memory etc) whose values persist
• The operating system organizes the
persistent storage into files
– So store valuable data in files
Sharing
• Once someone has collected data and got it
into a computer, they can use it for many
purposes, and many others can use it
• The naïve approach: copy the file around
– Or email, or set up an ftp site, etc
• This is very bad, if the data might change
– More data (more galaxies, more facts about each
galaxy!)
– Corrections to existing data
• Copies can become inconsistent; so share
one store instead
A Database
• A store of data relevant to some
domain, stored in a computer, and
shared (between application programs
or between users)
• Typically, the data is structured
– Many facts in each of a few types
– Eg many galaxies, each has name, ra, dec,
redshift etc
Clients and servers
• The database server is the machine where the database is
stored
• Each user must run some presentation client code on their
machine, to display the data
– Usually this is just a web browser
• There may be other machines in-between
– Eg web server, application server
– Note that a machine may be a server which receives some
requests, and also a client of others as it tries to respond!
• We focus on the database server, and the “database client”
where the application code is running that understands the data
Two approaches
• Purpose-written
application code,
accessing data
stored in files, using
knowledge of the
structure of the files
This is common in science
• Use a database
management system
(DBMS) which
manages the data and
provides higher-level
facilities to access it
– You may write
applications that call the
DBMS, or even use it
directly, interactively
This is standard in industry,
and is becoming more frequent
in science
DBMS Features
• A query language
– Describe the data to retrieve
– Can also describe updates etc
• A data definition language
– Describe the logical structure of the data
• Storage management
– Control the layout in files, and with facilities to
improve performance
• Access control
Different Data Models
• Each database has a model or schema
which tells what sort of data is stored
– Eg each galaxy has id, name, ra, dec, etc
• Using a DBMS means that the data
model must fit the style supported by
the DBMS
• The most common DBMSs all support
the relational style of data model
Relational Data Models
• Database consists of tables, each with a
name
• Each table has many rows, each with exactly
the same structure
– several attributes, each with a name and type
(integer, float, string of length 20, etc)
– Each row has a value for each attribute
• Rows are distinguished by the values of a
primary key (one or more attributes)
Example
Note use of “foreign key” to connect information in different tables
• Schema:
– galaxy(gal_id INTEGER,
gal_ra FLOAT, gal_decl
FLOAT, gtype STRING)
– observation(obs_id
INTEGER, obs_ra
FLOAT, obs_decl FLOAT,
obs_flux FLOAT, gal_id
INTEGER, source
STRING)
• An Instance
1
2
3
4
1 13.1
7.2 S
2 14.7
5.3 C
3 11.2
4.8 S
4 15.3
4.6 S
14.6
11.2
14.5
11.1
13.4
9.6
13.3
9.7
5.4
4.9
4.3
4.8
2
3
2
3
A
A
B
C
Views
• You can define a view on a database
• Just like a table, but the data isn’t
inserted or stored explicitly
– It is computed when needed from the other
tables
• A view can be used in queries just like a
table
Queries
• User can write a query, and run it
– Answer is a collection of relevant data
• Query may be typed interactively, or
written into a program for repeated use
– Perhaps with parameters that vary
• Answer may be viewed on the screen,
imported into a program, printed, or
stored back in the database
SQL
• A standardized language in which queries
can be written against databases in relational
DBMSs
– But each platform has slight variations
• SQL is declarative
– Query describes what data is needed
– RDBMS has the job of working out how to trawl
through the database to compute the answer
Select-from-where
SELECT gal_id, gtype
FROM galaxy
WHERE gal_ra
between 13.0 and
16.0
AND gal_decl between
4.0 and 6.0;
1 13.1
7.2 S
2 14.7
5.3 C
3 11.2
4.8 S
4 15.3
4.6 S
Answer is:
2 C
4 S
Remember
• FROM tells which table to use
• WHERE tells which rows you are
interested in
– A boolean condition that is evaluated for
each row
• SELECT says which attributes
(columns) you want in the answer
Note query is stated SELECT-FROM-WHERE
Limiting the answers
• Real scientific databases have tables with
many many rows
• A query which returns too many is expensive
to run
– And it’s hard to understand the answer
• Especially when exploring the data, get a
small sample first
– Ask for the top few rows, perhaps with respect to
some order on the selected rows
• Syntax details are not standard
– MySQL has “LIMIT n” clause at end of query
Joins
• A query can
combine data from
several tables
SELECT observation.obs_id,
galaxy.gal_ra, galaxy.gal_decl,
galaxy.gtype
FROM galaxy JOIN observation
– use the foreign keys ON observation.gal_id = galaxy.gal_id
to pick values from WHERE observation.source=‘A’;
the related row
From observation
1
2
3
4
14.6
11.2
14.5
11.1
5.4
4.9
4.3
4.8
13.4
9.6
13.3
9.7
2
3
2
3
From galaxy
A
A
B
C
14.7
11.2
14.7
11.2
5.3
4.8
5.3
4.8
C
S
C
S
Unstructured joins
• Sometimes you want to take all
combinations of rows from tables, not
just the rows related through foreign key
– Then filter by a condition
SELECT galaxy.gal_id, observation.obs_id
FROM galaxy, observation
WHERE ABS(galaxy.gal_ra - observation.obs_ra)<0.2
AND ABS(galaxy.gal_decl - observation.obs_decl)<0.3;
Aggregation
• Combine the values
from a column, among
all the rows that satisfy
WHERE clause
– MAX, MIN, SUM, AVG,
COUNT
– COUNT(DISTINCT
colname) is usefully
different if values are
duplicated in the column
SELECT MAX(obs_flux)
FROM observation
WHERE gal_id = 3;
Find the largest flux in
any observation of
galaxy 3
Grouping
• Divide the rows into
groups, based on the
values in some
column(s)
• Aggregate the values in
some other column(s),
for each group
• Display the grouping
column along with the
aggregate
SELECT source,
COUNT(DISTINCT
obs_id)
FROM observation
WHERE obs_flux > 10.0
GROUP BY source;
For each source, report
how many observations
that have high flux were
made by that source.
Subqueries
• SQL is block-structured
• A subquery can be used
in the FROM clause,
instead of an existing
table
• A subquery can be used
to produce a value used
in the WHERE clause
– eg in a comparison
SELECT obs_id
FROM observation
WHERE obs_flux =
(SELECT
MAX(obs_flux)
FROM observation);
Find the observation
with the highest flux
Creating a database
• Decide on the platform to use
• Decide on the schema that will
represent your data
• Create a database and tables
• Load the data
• Allow users to access it
Relational DBMS platforms
• Enterprise solutions (Oracle, IBM DB2,
SQL Server) are very expensive
– And require professional administration too
• Vendors also offer light-weight
cheap/free variants
• Free open-source platforms:
PostreSQL, MySQL, Cloudscape
Table Design
• One table for each type of object in the
domain
– Columns for each attribute of importance
– Make sure there is a primary key
• Invent a new identifier for this purpose if necessary
• Use foreign keys to represent relationships
between objects
– Or have a separate “association table” where one
object can be related to many, in each direction
Normalisation
• It is important that the schema doesn’t allow
the same facts to be stored repeatedly in
different rows, or different tables
– Eg observation(obs_id, obs_ra,…,gal_id, gal_ra,
…, gtype, …) would have same gal_ra stored in
several rows of observations of this galaxy
– This would risk inconsistency when updates occur
• There is theory about this, but sensibly
defining one table per object will avoid the
problems
Declaring tables
• Syntax isn’t quite standard
– Details of types etc
• MySQL variant is
CREATE TABLE galaxy (
gal_id INT UNSIGNED,
gal_ra FLOAT,
gal_decl FLOAT,
gtype VARCHAR(3),
PRIMARY KEY (gal_id)
);
Indices
• The performance of queries can vary
greatly depending on how the data is
stored in the DBMS
• Having an index on a column usually
speeds up queries that select rows with
given values (or ranges of values) in
that column
– SQL has CREATE INDEX statement
Permissions
• Each DBMS can control which users are
allowed to run different sorts of statements on
each table separately
• Typically, for scientific data one allows
SELECT access to all users (or all registered
users)
– INSERT access to a few trusted users
– DELETE, UPDATE only for the administrators
• Usually have a separate account, so even the admin
people can’t do this accidentally while acting as scientists
Transactions
• Nasty problems can occur if some users are
modifying data while others are looking at it
– Or worse, if several modify concurrently
• This is serious for enterprise applications, but
not common in scientific situations, as
updates are rare
• A transaction is a whole collection of SQL
statements that should be done as a single
block
– Not interrupted by others, and completely undone
unless the whole collection succeeds
Loading data
• SQL has INSERT statement which adds rows
to a table
INSERT INTO galaxy
(gal_id, gal_ra, gal_decl, gtype)
VALUES
(1, 13.1, 7.2, 'S'),
(2, 14.7, 5.3, 'C'),
(3, 11.2, 4.8, 'S'),
(4, 15.3, 4.6, 'S');
Bulk load
• If data is already in a file in a wellknown format (eg csv), then DBMS has
non-standard commands to import it
directly into a table
• MySQL
LOAD DATA LOCAL
INFILE “gal.csv”
INTO TABLE galaxy
Trade-offs
Compared to writing applications against data
stored directly in files, using DBMS to
manage the data:
• Allows users to perform unpredicted (ad-hoc) queries
– without being programmers, and without knowing the
structure in files
• Gives better support as data schema evolves
– Existing queries can continue to run
• Has more predictable performance
– Easier to avoid very slow execution
– But you may not be able to get very fast execution
• Has better security
– But considerable overhead in administration