Why do we even care?
Download
Report
Transcript Why do we even care?
Why do we care about databases in a
course about Web?
Administrivia
• Midterm next class (in class)
– Openbook/Open Notes/open course web (not general
web)
– Everything done until Sep 28th (and covered in
homework 2)
– Questions?
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Adapting old disciplines for Web-age
• Information (text) retrieval
– Scale of the web
– Hyper text/ Link structure
– Authority/hub computations
• Databases
– Multiple databases
• Heterogeneous, access limited, partially overlapping
– Network (un)reliability
• Datamining [Machine Learning/Statistics/Databases]
– Learning patterns from large scale data
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Why do we care about databases?
• Three reasons
– Deep web is all databases…
– We can do better with structured data…
– Exposing databases on web changes their clientele..
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Deep Web is databases..
• The crawlable web pages are just the tip of a huge ice berg that is
deep web
– Many web sites have huge backend databases that generate pages dynamically
in response to queries
• Airline fare databases; News paper classifieds etc.
– By some estimates, deep web is 2 orders of magnitude bigger than the shallow
(“html page”) web
• We need to exploit deep web
– Crawl/index deep web
– Select databases relevant to a query
– Provide information aggregation/integration services over deep web databases
• ..and all the big kids are trying to gobble up anyone who is even going through the
motions of doing these..
• …which leads to several DB challenges not addressed in
traditional DBs
–
–
–
–
Wrapper generation
Schema mapping
(automated) form filling
Query optimization
• Learning source profiles
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Web brings unwashed masses, unreliable
medium as well as dirty data to databases..
• Web accessibility changes the user/data/medium profile
significantly
– from SQL gurus supporting financial data on dedicated DBMS to
“2.1 keyword query” instant gratification seekers working with
dirty/inconsistent data over unreliable web.
• Challenges
–
–
–
–
How does one support keyword queries in databases?
How does one support imprecise queries in databases?
How do we handle incompleteness/inconsistency in databases?
Does it make sense to focus on total response time minimization
• As against a multi-objective cost/benefit optimization?
The DB community has embraced these challenges
--see Lowell Report
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Databases offer lessons on exploiting
structure
• We argued that structure (and semantics) help querying
– If there is structure (as in databases) we can exploit it
• Databases is an existing technology for exploiting some forms of structure
– SQL may not look like much, but it is more expressive than keyword queries!
– If not, we can extract structure and then exploit it
• Challenges
– Techniques for extracting information (NLP-lite)
– Languages for representing/handling “Semi-structured” data
– Standards for supporting/exploiting semantic tagging
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Before we play havoc with
databases, let’s quickly review the
traditional art of db management
so we know all that needs to change
Concepts covered so far …
• Information Retrieval
– Text retrieval
– Hyper-linked text retrieval
– Improvements…
• Information Mining
– Clustering techniques to improve result presentation
– Classification and filtering techniques
Slides adapted from Rao (ASU) & Franklin (Berkeley)
What Is a Database System?
• Database:
a very large, integrated collection of data.
• Models a real-world enterprise
– Entities (e.g., teams, games)
– Relationships
(e.g., The Patriots are playing in The Superbowl)
– More recently, also includes active components , often called
“business logic”. (e.g., the BCS ranking system)
• A Database Management System (DBMS) is a
software system designed to store, manage, and
facilitate access to databases.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Functionality of a DBMS
• Data Dictionary Management
• Storage management
– Data storage Definition Language (DDL)
• High level query and data manipulation language
– SQL/XQuery etc.
– May tell us what we are missing in text-based search
• Efficient query processing
– May change in the internet scenario
• Transaction processing
• Resiliency: recovery from crashes,
• Different views of the data, security
– May be useful to model a collection of databases together
• Interface with programming languages
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Traditional Database Architecture
Database
(relational)
Services
Source Trust
Ontologies;
Source/Service
Descriptions
Probing
Queries
Webpages
Structured
data
Sensors
(streaming
Data)
od
M
ty
til
i
ce
/U
Executor
Answers
Needs to handle
Source/network
Interruptions,
Runtime uncertainity,
replanning
all
ce
C
cs
tisti
g Sta
atin
Upd
ing
nn
pla ts
Re ques
Re
e
Qu
Slides adapted from Rao (ASU) & Franklin (Berkeley)
ur
el
ry
So
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
s
Source Fusion/
Query Planning
re
n
Answer
(relation)
Database Manager
(DBMS)
-Storage mgmt
-Query processing
-View management
-(Transaction processing)
Pr
ef
e
Query
(SQL)
Monitor
Building an Application with a Database
System
• Requirements modeling (conceptual, pictures)
– Decide what entities should be part of the application and
how they should be linked.
• Schema design and implementation
– Decide on a set of tables, attributes.
– Define the tables in the database system.
– Populate database (insert tuples).
• Write application programs using the DBMS
– Now much easier, with data management API
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Conceptual Modeling
name
category
name
ssn
Takes
Course
Student
quarter
Advises
Teaches
Professor
address
name
Slides adapted from Rao (ASU) & Franklin (Berkeley)
field
Data Models
• A data model is a collection of concepts for
describing data.
• A schema is a description of a particular collection
of data, using a given data model.
• The relational model of data is the most widely used
model today.
– Main concept: relation, basically a table with rows and
columns.
– Every relation has a schema, which describes the columns, or
fields.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Levels of Abstraction
• Views describe how
users see the data.
• Conceptual schema
defines logical structure
• Physical schema
describes the files and
indexes used.
View 1
View 2
View 3
Conceptual Schema
Physical Schema
Slides adapted from Rao (ASU) & Franklin (Berkeley)
DB
Example: University Database
• Conceptual schema:
– Students(sid: string, name: string,
login: string, age: integer, gpa:real)
– Courses(cid: string, cname:string, View 1
credits:integer)
• External Schema (View):
– Course_info(cid:string,enrollment:in
teger)
• Physical schema:
– Relations stored as unordered files.
– Index on first column of Students.
View 2
View 3
Conceptual Schema
Physical Schema
DB
If fiveadapted
peoplefrom
are Rao
asked
to come
up with
a schema for the data,
Slides
(ASU)
& Franklin
(Berkeley)
what are the odds that they will come up with the same schema?
Data Independence
• Applications insulated from
how data is structured and stored.
• Logical data independence:
Protection from changes in
logical structure of data.
View 1
View 2
View 3
Conceptual Schema
Physical Schema
• Physical data independence:
Protection from changes in
physical structure of data.
• Q: Why are these particularly
important for DBMS?
Slides adapted from Rao (ASU) & Franklin (Berkeley)
DB
Schema Design & Implementation
• Table Students
Student
Course
Quarter
Charles
CS 444
Fall, 1997
Dan
CS 142
…
…
Winter,
1998
…
• Separates the logical view from the physical
view of the data.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Terminology
Attribute names
tuples
Students
Student
Course
Quarter
Charles
CS 444
Fall, 1997
Dan
CS 142
…
…
Winter,
1998
…
(Arity=3)
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Querying a Database
• Find all the students taking CSE594 in Q1, 2004
• S(tructured) Q(uery) L(anguage)
select E.name
from Enroll E
where E.course=CS490i and
E.quarter=“Winter, 2000”
• Query processor figures out how to answer the query
efficiently.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Defining Views
(Virtual) Views are relations, except that they are not
physically stored.
They are used mostly in order to simplify complex queries and
to define conceptually different views of the database to different
classes of users.
View: purchases of telephony products:
CREATE VIEW telephony-purchases AS
SELECT product, buyer, seller, store
FROM Purchase, Product
WHERE Purchase.product = Product.name
AND Product.category = “telephony”
Slides adapted from Rao (ASU) & Franklin (Berkeley)
A Different View
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = “Seattle” AND
Person.name = Purchase.buyer
We can later use the views:
SELECT name, store
FROM
Seattle-view, Product
WHERE Seattle-view.product = Product.name AND
Product.category = “shoes”
What’s really happening when we query a view??
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Updating Views
How can I insert a tuple into a table that doesn’t exist?
CREATE VIEW bon-purchase AS
SELECT store, seller, product
FROM
Purchase
WHERE store = “The Bon Marche”
If we make the following insertion:
INSERT INTO bon-purchase
VALUES (“the Bon Marche”, Joe, “Denby Mug”)
We can simply add a tuple
(“the Bon Marche”, Joe, NULL, “Denby Mug”)
to relation Purchase.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Non-Updatable Views
Given
Purchase (buyer, seller, store, product)
Person( name, phone-num, city)
CREATE VIEW Seattle-view AS
SELECT seller, product, store
FROM Person, Purchase
WHERE Person.city = “Seattle” AND
Person.name = Purchase.buyer
How can we add the following tuple to the view?
(Joe, “Shoe Model 12345”, “Nine West”)
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Materialized Views
• Views whose corresponding queries have been executed
and the data is stored in a separate database
– Uses: Caching
• Issues
– Using views in answering queries
• Normally, the views are available in addition to database
– (so, views are local caches)
• In information integration, views may be the only things we have access to.
– An internet source that specializes in woody allen movies can be seen as a view
on a database of all movies. Except, there is no database out there which
contains all movies..
– Maintaining consistency of materialized views
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Query Optimization
Goal:
Declarative SQL query
SELECT S.buyer
FROM Purchase P, Person Q
WHERE P.buyer=Q.name AND
Q.city=‘seattle’ AND
Q.phone > ‘5430000’
Imperative query execution plan:
buyer
City=‘seattle’
phone>’5430000’
Buyer=name
Inputs:
• the query
• statistics about the data
(indexes, cardinalities,
selectivity factors)
• available memory
Purchase
(Table scan)
(Simple Nested Loops)
Person
(Index scan)
Ideally: Want to find best plan.
Practically: Avoid worst plans!
Slides adapted from Rao (ASU) & Franklin (Berkeley)