Introduction - Tufts Computer Science

Download Report

Transcript Introduction - Tufts Computer Science

1/60
Lecture 8:
Databases
November 3, 2015
COMP 150-2
Visualization
Slides courtesy of Stanford, Toronto, and UIUC
2/60
Quick Note about VIS
• http://ieeevis.org/year/2015/info/overviewamp-topics/papers-sessions
• Animated Transitioning
• http://www3.cs.stonybrook.edu/~pruchikachor/a
nalogy/
• Memorability
• Database Workshop
3/60
Why Databases?
• In-memory visualization systems are not scalable.
•
When datasets become large, maintaining interactivity is
not easy.
• Why do we care about databases?
•
Different databases have different “properties”
•
•
i.e. things that they are good at and things that they are not good
at.
Consider the library example
•
•
•
Current structure – “write once, read often”
What if, “write often, read rarely”?
Twitter?
4/60
Overview
1. Relational algebra
2. Writing a SQL query
i.
ii.
iii.
iv.
Insert, update, delete
Select, from, where
(Inner) join
Other syntax
3. Stages of a query:
i.
ii.
iii.
Parsing
(query optimization)
PreparedStatement
4. Technical Stuff
a)
Database (ODBC) connection
i.
b)
Using the “device driver” model
ResultSet
5. Using Databases with Visualizations
5/60
Relational Algebra
• Definitions
•
•
•
Relational Schema
Relational Instance
A Relation
• Bags vs. Sets
• Operations
•
Operation tree
• Limitations
6/60
Different Types of Models
7/60
The Relational Model: Schemata
Part (I) Relational Schema:
Students(sid: string, name: string, gpa: float)
Relation name
Attributes
string, float, int are the domains of the attributes
8/60
The Relational Model II: Data
Students(sid: string, name: string, gpa: float)
Attributes
Tuples / Records
# of tuples is the
cardinality
Sid
101
121
123
135
Name Gpa
Bob
3.2
Joe
2.8
Mary 3.8
Alice
3.7
# of attributes is the arity
NB: In practice, DB systems relax
the set requirement. Why?
A relational instance is a set of tuples all
conforming to the same schema
9/60
To reiterate
A relational schema describes the data that is
contained in a relational instance
A relational instance is a set of tuples of the same type.
Let R(f1:D1,…,fm:Dm) be a relational schema then,
an instance of R is a subset of Dom1 x Dom2 x … x Domn
A tuple viewed as a total function from attribute names to types
(names important)
10/60
One more time
A relational schema describes the data that is
contained in a relational instance
A relation R is of arity t is a function:
R : D1 x … x Dt  {0,1}
The schema is simply the signature
of the function. Note that the 0, 1
represents whether the instance
exists
11/60
A relational database
A relational database schema is a set of
relational schemata, one for each relation
A relational database instance is a set of
relational instances, one for each relation
Two conventions:
1. We call instances as simply databases
2. We assume all instances are valid, i.e., satisfy
the domain constraints
12/60
Operations on Bags
A bag = a set with repeated elements
All operations need to be defined carefully on bags
{a,b,b,c}{a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f}
{a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b}
sC(R): preserve the number of occurrences
PA(R): no duplicate elimination
Cartesian product, join: no duplicate elimination
Important ! Relational Engines work on bags, not sets !
13/60
Relational Algebra
Five operators:
Union: 
Difference: Selection: s
Projection: P
Cartesian Product: 
Derived or auxiliary operators:
Intersection, complement
Joins (natural, equi-join, theta join, semi-join)
Renaming: r
Division
14/60
Cartesian Product Example
Employee
Name
John
Tony
SSN
999999999
777777777
Dependents
EmployeeSSN
999999999
777777777
Dname
Emily
Joe
Employee x Dependents
Name
SSN
EmployeeSSN
John
999999999 999999999
John
999999999 777777777
Tony
777777777 999999999
Tony
777777777 777777777
Dname
Emily
Joe
Emily
Joe
15/60
Natural Join
Notation: R1 || R2
Meaning: R1 || R2 = PA(sC(R1  R2))
Where:
The selection sC checks equality of all common
attributes
The projection eliminates the duplicate common
attributes
16/60
Natural Join Example
Employee
Name
SSN
John
999999999
Tony
777777777
Dependents
SSN
Dname
999999999
Emily
777777777
Joe
Employee
Dependents =
PName, SSN, Dname(s SSN=SSN2(Employee x rSSN2, Dname(Dependents))
Name
SSN
Dname
John
999999999
Emily
Tony
777777777
Joe
17/60
Natural Join
R=
R || S=
S=
A
B
B
C
X
Y
Z
U
X
Z
V
W
Y
Z
Z
V
Z
V
A
B
C
X
Z
U
X
Z
V
Y
Z
U
Y
Z
V
Z
V
W
18/60
Virtues of the model
Physical independence (logical too), Declarative
Simple, elegant clean: Everything is a relation
Why did it take multiple years?
Doubted it could be done efficiently.
19/60
What is an Algebra?
20/60
Logical Equivalence of Plans
R(A,B) S(B,C)
s[A=5] (p[A] R) = p[A] (s[A=5] R)
Here, projection and selection commute
p[B] (s[A=5] R)?
Can we play the same game here?
21/60
A simple plan
p[B]
R(A,B)
S(B,C)
What SQL query does
this correspond to?
22/60
Pushing down projection
p[B]
R(A,B)
S(B,C)
p[B]
p[B]
R(A,B)
S(B,C)
Why might we
prefer this plan?
23/60
Example Limitations
• SQL/RA has its problems
• Reusability
• Modularity
• Readability, etc.
• “Computation” or math (e.g., average, min,
max, stdev, etc.) is not a part of RA
• But there are also “real” limitations
24/60
Finally: RA has Limitations !
Cannot compute “transitive closure”
Name1
Name2
Relationship
Fred
Mary
Father
Mary
Joe
Cousin
Mary
Bill
Spouse
Nancy
Lou
Sister
Find all direct and indirect relatives of Fred
Cannot express in RA !!!
Need to write C program, use a graph engine, or modern SQL…
25/60
SQL Query
• Four Types of Operations
•
•
•
•
INSERT
UPDATE
DELETE
SELECT
26/60
Ok, but you said this was operational…
SQL Data Definition Language (DDL) is one
method of defining relations
In SQL, Relation → table
Students(sid: string, name: string, gpa: float)
CREATE TABLE Students ( sid CHAR(20),
Name CHAR(50),
Gpa REAL);
27/60
Table name
Tables in SQL
Attribute names
Product
PName
Price
Category
Manufacturer
Gizmo
$19.99
Gadgets
GizmoWorks
Powergizmo
$29.99
Gadgets
GizmoWorks
SingleTouch
$149.99
Photography
Canon
MultiTouch
$203.99
Household
Hitachi
Tuples or rows
27
28/60
Tables Explained
The schema of a table is the table name and its
attributes:
Product(PName, Price, Category, Manfacturer)
A key is an attribute whose values are unique;
we underline a key
Product(PName, Price, Category, Manfacturer)
28
29/60
Data Types in SQL
Atomic types:
Characters: CHAR(20), VARCHAR(50)
Numbers: INT, BIGINT, SMALLINT, FLOAT
Others: MONEY, DATETIME, …
Every attribute must have an atomic type
Hence tables are flat
Why ?
29
30/60
Tables Explained
A tuple = a record
Restriction: all attributes are of atomic type
A table = a (multi)-set of tuples
Like a list…
…but it is unordered:
no first(), no next(), no last().
30
31/60
Outline
Data in SQL (Review)
Simple Queries in SQL
Queries with more than one relation
31
32/60
SQL Query
Basic form: (there are many many more bells and whistles)
SELECT <attributes>
FROM <one or more relations>
WHERE <conditions>
Call this: An SFW query.
32
33/60
Simple SQL Query
Product
PName
Gizmo
Price
$19.99
Powergizmo
SingleTouch
MultiTouch
$29.99
$149.99
$203.99
Category
Gadgets
Manufacturer
GizmoWorks
Gadgets
GizmoWorks
Photography
Canon
Household
Hitachi
SELECT *
FROM Product
WHERE category=‘Gadgets’
“selection”
PName
Price
Category
Manufacturer
Gizmo
$19.99
Gadgets
GizmoWorks
Powergizmo
$29.99
Gadgets
33
GizmoWorks
34/60
Simple SQL Query
Product
PName
Gizmo
Price
$19.99
Powergizmo
SingleTouch
MultiTouch
$29.99
$149.99
$203.99
Category
Gadgets
Manufacturer
GizmoWorks
Gadgets
GizmoWorks
Photography
Canon
Household
Hitachi
SELECT Pname, Price,Manufacturer
FROM Product
WHERE category=‘Gadgets’
“selection”
“projection”
PName
Price
Manufacturer
Gizmo
$19.99
GizmoWorks
Powergizmo
$29.99
GizmoWorks
34
35/60
The IN operator
SELECT *
FROM Products
WHERE PName in (‘Gadgets’, ‘Photography’);
This is functionally equivalent to:
SELECT *
FROM Products
WHERE PName = ‘Gadgets’ OR
PName =‘Photography’;
35
36/60
The LIKE operator
SELECT *
FROM Products
WHERE PName LIKE ‘%gizmo%’
s LIKE p: pattern matching on strings
p may contain two special symbols:
% = any sequence of characters
_ = any single character
36
37/60
Eliminating Duplicates
SELECT DISTINCT category
FROM Product
Category
Gadgets
Photography
Household
Compare to:
SELECT category
FROM Product
Category
Gadgets
Gadgets
Photography
Household
37
38/60
Ordering the Results
SELECT pname, price, manufacturer
FROM Product
WHERE category=‘gizmo’ AND price > 50
ORDER BY price, pname
Ties are broken by the second attribute on the ORDER BY list, etc.
Ordering is ascending, unless you specify the DESC keyword.
38
39/60
Employee table
LastName
DepartmentID
Rafferty
31
Jones
33
Heisenberg
33
Robinson
34
Smith
34
Williams
NULL
Department table
DepartmentID
DepartmentName
31
Sales
33
Engineering
34
Clerical
35
Marketing
40/60
Inner Join
SELECT * FROM employee INNER JOIN
department ON employee.DepartmentID =
department.DepartmentID;
SELECT * FROM employee, department
WHERE employee.DepartmentID =
department.DepartmentID;
41/60
Inner Join
Employee.LastNam
e
Employee.Departm
entID
Department.Depart Department.Depart
mentName
mentID
Robinson
34
Clerical
34
Jones
33
Engineering
33
Smith
34
Clerical
34
Heisenberg
33
Engineering
33
Rafferty
31
Sales
31
42/60
(Left) Outer Join
SELECT * FROM employee LEFT OUTER JOIN
department ON employee.DepartmentID =
department.DepartmentID;
Employee.LastNam
e
Employee.Departm
entID
Department.Depart Department.Depart
mentName
mentID
Jones
33
Engineering
33
Rafferty
31
Sales
31
Robinson
34
Clerical
34
Smith
34
Clerical
34
Williams
NULL
NULL
NULL
Heisenberg
33
Engineering
33
43/60
Subqueries Returning Relations
Company(name, city)
Product(pname, maker)
Purchase(id, product, buyer)
Return cities where one can find companies that
manufacture products bought by Joe Blow
SELECT Company.city
FROM Company
WHERE Company.name IN
(SELECT Product.maker
FROM Purchase , Product
WHERE Product.pname=Purchase.product
AND Purchase .buyer = ‘Joe Blow‘);
43
44/60
Aggregation
SELECT avg(price)
FROM Product
WHERE maker=“Toyota”
SELECT count(*)
FROM Product
WHERE year > 1995
SQL supports several aggregation operations:
sum, count, min, max, avg
Except count, all aggregations apply to a single attribute
44
45/60
Aggregation: Count
COUNT applies to duplicates, unless otherwise stated:
SELECT Count(category)
FROM Product
WHERE year > 1995
same as Count(*)
We probably want:
SELECT Count(DISTINCT category)
FROM Product
WHERE year > 1995
45
46/60
Stages of a Query
• Parsing
• Query Plan Generation
• Query Optimization
• (Using PreparedStatement)
47/60
Example Query Plan / Optimizer
Given a database with two tables:
dept
(dno, floor)
emp (name, age, sal, dno)
Consider the following SQL query:
select name, floor
from employ, dept
where employ.dno = dept.dno
and employ.sal > 100k
Example taken from “Query Optimization” by Ioannidis, 1997
48/60
Possible Query Plans
49/60
Cost of the Query
For a database with 100,000 employees (stored
across 20,000 page files), the three query
plans can have significantly different execution
time (in 1997):
T3: <1 sec
T1: >1 hour
T2: ~1 day
50/60
PreparedStatement
• For repeated queries
java.sql.PreparedStatement stmt =
connection.prepareStatement(
"SELECT * FROM users
WHERE USERNAME = ? AND ROOM = ?");
stmt.setString(1, ‘Tom’);
stmt.setInt(2, 24);
stmt.executeQuery();
51/60
Technical Stuff
• Think of ODBC (Open Database Connectivity)
and JDBC as a device driver / API
• Just like a printer driver:
1. For each printer (database software), you need to
install a specific driver
2. Once the driver is installed, your software talks it the
printer (database) via a common API / interface
3. The printer (database) takes the command and
converts it to “native” functions and calls
52/60
Example Database Connection Code
import java.sql.*;
public class FirstExample {
// JDBC driver name and database URL
static final String JDBC_DRIVER =
"com.mysql.jdbc.Driver";
static final String DB_URL =
"jdbc:mysql://localhost/EMP";
// Database credentials
static final String USER = "username";
static final String PASS = "password";
public static void main(String[] args) {
Connection conn = null;
Statement stmt = null;
try{
//STEP 2: Register JDBC driver
Class.forName("com.mysql.jdbc.Driver");
//STEP 3: Open a connection
System.out.println("Connecting to
database...");
conn =
DriverManager.getConnection(DB_URL,USER,PAS
S);
//STEP 4: Execute a query
System.out.println("Creating statement...");
stmt = conn.createStatement();
String sql;
sql = "SELECT id, first, last, age FROM
Employees";
ResultSet rs = stmt.executeQuery(sql);
:
:
}
53/60
ResultSet
String sql;
sql = "SELECT id, first, last, age FROM Employees";
ResultSet rs = stmt.executeQuery(sql);
//STEP 5: Extract data from result set
while(rs.next()){
//Retrieve by column name
int id = rs.getInt("id");
int age = rs.getInt("age");
String first = rs.getString("first");
String last = rs.getString("last");
}
//STEP 6: Clean-up environment
rs.close();
54/60
Database Background
• SQL Query. Example:
SELECT person
FROM datasetTable
WHERE (weight > 130) AND
(weight < 180) AND
(height > 5.8) AND
(height < 6)
55/60
Database and Visualization
• This is the equivalent of a selection box in a
scatter plot visualization
Height = 6
Height = 5.8
weight = 130
weight = 180
56/60
Database and Visualization
• For a specific type of interaction (e.g. select,
filter, zoom, etc.), there is often a 1:1
mapping between the interactions and
database queries
• Homefinder Example
57/60
Database and Visualization
• There is a nice mapping between
each visual element with an
attribute in the database
SELECT latitude, longitude
FROM homesTable
WHERE (distance > 5) AND
(distance < 10) AND
(price > 100,000) AND
(price < 300,000) AND
(bedrooms > 3) AND
(garage = TRUE) AND
(fireplace = FALSE)
58/60
Query Relaxation
• “Generalized Selection via Interactive Query
Relaxation” (Heer et al. CHI 2008)
• http://vis.berkeley.edu/papers/generalized_s
election/
59/60
Database Strength and Weakness
• Strengths:
•
•
•
Generalizable
Separates data from visualization cleanly
SQL is a very stable language that is very complete for
interacting with data stored in relational databases
• Weaknesses:
•
•
•
Dynamic query construction is tricky
DB overhead
Speed is a problem
•
•
•
Network speed
DB searching speed
Parsing DB returned tuples
60/60
Conjunctive Form
• A selection / filtering / zooming interaction ==
database query
• It can also be thought of as a conjunctive
form
!(A1 V A2) ^ A3 V (A4 V A5 ^ A6) V …
where A1 could be the clause (price > 200)
61/60
Conjunctive Form
• Very similar to SQL query, but can be built in software
• Has some of the similar limitations as SQL in that the
clauses need to be pre-built
•
Such that a data attribute needs to be “hard-coded” to a
visual element
• Consider a slider that connects to two or more
attributes
(http://gravis.cs.unibas.ch/publications/2007/VIS07_S
mith.pdf)