physical schema - Computer Science at Rutgers

Download Report

Transcript physical schema - Computer Science at Rutgers

CS541: Database Systems
Spring 2008
Computer Science Department
Rutgers University
Rutgers University
Administration

Instructor: Amélie Marian
[email protected]
CoRE 324
(732) 445 6450 x0636
Office Hours: Mondays 3-4pm or by
appointment

TA: Minji Wu
[email protected]
Office Hours: TBA
Rutgers University
Class Information

Web page:
http://www.cs.rutgers.edu/~amelie/courses/541Spring2008.html


Meets Thursday 3:20-6:20pm in
CoRE A
Prerequisites:
CS513 and working knowledge of C or
Java or instructor’s permission
Rutgers University
Grading (subject to small changes)

15% Homework (3-4)


Due at beginning of class on due date
30% Programming Project


In teams of two (same project)
In three parts






In class project presentation and demonstration
More details later
25% Midterm Exam


Find data source and scenario
Implementation of standard index structures for query
processing
Extend project to non-standard query processing
(e.g., IR-style text retrieval, nearest-neighbor, top-k)
Tentatively scheduled for March 13
30% Final Exam
Rutgers University
Collaboration Policy



Check DCS Academic Integrity
Policy
Homework and exams are to be
done individually
Project is done only with your team
partner
Rutgers University
Supporting Material

Textbook:
Raghu Ramakrishnan, Johannes Gehrke:
Database Management Systems, 3rd edition,
McGraw-Hill, 2002

Class website:


Lecture Notes
Research Papers (for advanced topics)
Rutgers University
Communication

Please send me email, come to my
office hour, or contact Minij if you
have questions on the material,
complaints, or feedback on how to
improve the course
Rutgers University
Class Organization

Basics of Database Systems





Information Management




What is a DBMS?
Why do we need one?
How do we design one?
What are the common problems in DBMS?
Text documents
Structure and content
Approximate querying
Advanced Topics in Data Management




What is new and exciting in DB Research?
How do we deal with huge amounts of data?
What are the new challenges brought by the
internet?
How should DBMS evolve?
Rutgers University
Short History of Data Management

Evolved from file systems (1960’s)




Relational DB systems (1970’s)





Airline reservation systems
Banking systems
Corporate data
Data organized in tables and relations that model realworld
Storage structure transparent to user
High-level query language
Widely used today
New challenges






Distributed Data (e.g., internet)
Parallel Computing
Bigger systems
Multimedia Data
Data Analysis
Information Integration
Rutgers University
What is a DBMS?

Powerful tool to efficiently manage
large amounts of data



Persistent storage (more flexible than a
file system)
Data manipulation (complex query
language)
Transaction management
(simultaneous access to data)
Rutgers University
Why Use a DBMS?





Data independence and efficient
access.
Reduced application development
time.
Data integrity and security.
Uniform data administration.
Concurrent access, recovery from
crashes.
Rutgers University
Why Study Databases?

Shift from computation to information



Datasets increasing in diversity and
volume.



at the “low end”: user-input information (a
mess!)
at the “high end”: scientific applications
Digital libraries, interactive video, Human
Genome project, EOS project
... need for DBMS exploding
DBMS encompasses most of CS

OS, languages, theory, AI, multimedia, logic
Rutgers University
Basics of Database Systems:
The ER Model


Conceptual design of database
Models real-world:





Entities (Students, Professor, and Classes)
Relationships (Amélie Marian teaches 541)
Attributes are associated with entities (the
room for 541 is CoRE A)
Constraints of the data
Logical schema of the data
Rutgers University
Basics of Database Systems:
The Relational Model



A data model is a collection of concepts for
describing data.
A schema is a description of a particular collection
of data, using the a given data model.
The relational model of data is the most widely
used model today.



Two formal query languages



Main concept: relation, basically a table with rows
and columns.
Every relation has a schema, which describes the
columns, or fields.
Relational algebra
Relational calculus
Powerful and widely used query language: SQL
Rutgers University
Levels of Abstraction

Many views, single
conceptual (logical)
schema and physical
schema.



View 1
Views describe how users
see the data.
Conceptual schema defines
logical structure
Physical schema describes
the files and indexes used.
View 2
View 3
Conceptual Schema
Physical Schema
 Schemas are defined using DDL; data is modified/queried using DML.
Rutgers University
Example: University Database

Conceptual schema:




Physical schema:



Students(sid: string, name: string, login: string,
age: integer, gpa:real)
Courses(cid: string, cname:string,
credits:integer)
Enrolled(sid:string, cid:string, grade:string)
Relations stored as unordered files.
Index on first column of Students.
External Schema (View):

Course_info(cid:string,enrollment:integer)
Rutgers University
Data Independence *



Applications insulated from how data is
structured and stored.
Logical data independence: Protection from
changes in logical structure of data.
Physical data independence: Protection
from changes in physical structure of data.
 One of the most important benefits of using a DBMS!
Rutgers University
Basics of Database Systems:
Physical Storage and Index Structures
Many alternatives exist, each ideal for some
situations, and not so good in others:



Heap (random order) files: Suitable when
typical access is a file scan retrieving all
records.
Sorted Files: Best if records must be retrieved
in some order, or only a `range’ of records is
needed.
Indexes: Data structures to organize records
via trees or hashing.
 Like sorted files, they speed up searches for a
subset of records, based on values in certain
(“search key”) fields
 Updates are much faster than in sorted files.
Rutgers University
Basics of Database Systems:
Query Processing

What are the best algorithms to evaluate queries
on data


Algorithms for evaluating relational operators use
some simple ideas extensively:




Performance issues: space/time
Indexing: to retrieve small set of data
Iteration: Sometimes, faster to scan all tuples even
if there is an index. (And sometimes, we can scan
the data entries in an index instead of the table
itself.)
Partitioning: By using sorting or hashing, we can
partition the data and replace an expensive
operation by similar operations on smaller inputs.
Ideally: Want to find best plan. Practically: Avoid
worst plans!
Rutgers University
Basics of Database Systems:
Transaction Processing

Concurrent execution of user programs
is essential for good DBMS performance.



Because disk accesses are frequent, and
relatively slow, it is important to keep the cpu
humming by working on several user programs
concurrently.
Interleaving actions of different user
programs can lead to inconsistency: e.g.,
check is cleared while account balance is
being computed.
DBMS ensures such problems don’t arise:
users can pretend they are using a singleuser system.
Rutgers University
Basics of Database Systems:
Logical Data Management

Redundancy is at the root of several problems
associated with relational schemas:




redundant storage, insert/delete/update anomalies
Integrity constraints, in particular functional
dependencies, can be used to identify schemas
with such problems and to suggest refinements.
Main refinement technique: decomposition
(replacing ABCD with, say, AB and BCD, or ACD
and ABD).
Decomposition should be used judiciously:


Is there reason to decompose a relation?
What problems (if any) does the decomposition
cause?
Rutgers University
Advanced Topics in Data Management:
Information Retrieval and Web Search


Keyword search over text (unstructured)
data
User Expectations:




Many say “The first item shown should be what
I want to see!”
This works if the user has the most
popular/common notion in mind, not
otherwise.
Widely used today
Top-k query model
Rutgers University
Advanced Topics in Data Management:
Advanced Query Processing

New challenges:





Proactive (and reactive) optimization
Smart statistics collection to cope with
fast changes
Approximate query answering
Online query processing
Important answers first (top-k queries,
skyline queries)
Rutgers University
Advanced Topics in Data Management:
XML and Web Data

No application interoperability in the web
today:



HTML not understood by applications
 screen scraping brittle
Database technology: client-server
 still vendor specific
New Universal Data Exchange Format:
XML




XML = semi-structured data
XML generated by applications
XML consumed by applications
Easy access: across platforms, organizations
Rutgers University
Advanced Topics in Data Management:
Data Mining
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable
patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern
beforehand.
Useful: We can devise actions from the
patterns.
Understandable: We can interpret and
comprehend the patterns.
Rutgers University
Advanced Topics in Data Management:
and more…







Distributed Databases
Parallel Databases
ORDBMS
Data Cleaning
Data Warehousing
Data Streams
…
Rutgers University
If you are interested in advanced DB
topics…

For-credit research projects available



Top-k query processing
Scoring XML data
Web data management
Contact me for more information!
Rutgers University