LN1 - WSU EECS
CPT-S 580-06
Advanced Databases
Yinghui Wu
EME 49
Welcome!
Instructor: Yinghui Wu
Office: EME 49
Email: [email protected]
Office hours: Wed/Fri (1 PM to 3 PM) or by
appointment
Course website: TBF
Motivation
This course helps you become familiar with current developments in
database research
We discuss problems and topics in advanced database research, and
introduce solutions that are
– new and currently making their way into database management
systems and applications
– not yet fully developed, including open problems
Outcome: possible starting points for your research project, master's
or PhD thesis, or technical report
– Prepares you for positions in academia and industry
Database Management Systems
Database concepts
– Database
• A database represents some aspect of the real world
• A database is a logically coherent collection of data with
some inherent meaning.
• A database is designed, built, and populated with data for a
specific purpose.
• It has an intended group of users and some preconceived
applications in which these users are interested.
– Database Management System
• A database management system (DBMS) is a collection of
programs that enables users to Define, Construct,
Manipulate and Share a database.
Typical DBMS Functionality
Define a particular database in terms of its data types,
structures, and constraints
Construct or Load the initial database contents on a secondary
storage medium
Manipulate the database:
– Retrieval: Querying, generating reports
– Modification: Insertions, deletions and updates to its content
– Accessing the database through Web applications
Share a database: allow multiple users and programs to access
the database simultaneously
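The Define, Construct and Manipulate steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the `student` table, its columns and the sample rows are hypothetical, and sharing among concurrent users is not shown because the database here lives in memory.

```python
import sqlite3

# Define: declare structure, data types, and constraints.
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE student ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " grade REAL CHECK (grade BETWEEN 0 AND 100))"
)

# Construct: load the initial database contents (hypothetical rows).
conn.executemany(
    "INSERT INTO student (id, name, grade) VALUES (?, ?, ?)",
    [(1, "Ann", 91.0), (2, "Bob", 78.5), (3, "Cat", 84.0)],
)
conn.commit()

# Manipulate: modification (update, delete) followed by retrieval (query).
conn.execute("UPDATE student SET grade = grade + 1.5 WHERE id = ?", (2,))
conn.execute("DELETE FROM student WHERE id = ?", (3,))
conn.commit()

rows = conn.execute("SELECT name, grade FROM student ORDER BY id").fetchall()
print(rows)
```

The same three steps apply to any relational DBMS; only the connection call would differ.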
Database Management System (DBMS)
DBMS contains information about a particular enterprise
– Collection of interrelated data
– Set of programs to access the data
– An environment that is both convenient and efficient to use
Database Applications:
– Banking: transactions; Airlines: reservations, schedules; Universities:
registration, grades; Sales: customers, products, purchases
– Online retailers: order tracking, customized recommendations
– Manufacturing: production, inventory, orders, supply chain; Internet of
Things
– Human resources: employee records, salaries, tax deductions
– GIS; scientific computing
Databases can be large and of any complexity.
Databases touch all aspects of our lives
File systems to manage data?
Data redundancy and inconsistency
– Multiple file formats, duplication of information in different
files
Difficulty in accessing data
– Need to write a new program to carry out each new task
Data isolation
– Multiple files and formats
Integrity problems
– Integrity constraints (e.g., account balance > 0) become
“buried” in program code rather than being stated explicitly
– Hard to add new constraints or change existing ones
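For contrast with constraints buried in program code, here is a minimal sketch of the same rule stated explicitly in the schema, using Python's sqlite3 module. The `account` table and the `withdraw` helper are hypothetical names chosen for illustration; the constraint mirrors the slide's example (account balance > 0).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The integrity constraint is declared once, in the schema,
# instead of being re-checked in every program that touches balances.
conn.execute(
    "CREATE TABLE account ("
    " id TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance > 0))"
)
conn.execute("INSERT INTO account VALUES ('A', 100)")
conn.commit()


def withdraw(conn, account_id, amount):
    """Return True if the withdrawal is accepted, False if the DBMS rejects it."""
    try:
        with conn:  # commit on success, roll back on exception
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by SQLite when the CHECK constraint is violated.
        return False


accepted = withdraw(conn, "A", 30)   # balance would become 70: allowed
rejected = withdraw(conn, "A", 70)   # balance would become 0: violates balance > 0
balance = conn.execute("SELECT balance FROM account WHERE id = 'A'").fetchone()[0]
```

Changing the rule later means altering one schema definition rather than hunting through application code.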
File systems to manage data?
Atomicity of updates
– Failures may leave database in an inconsistent state with partial
updates carried out
– Example: Transfer of funds from one account to another should
either complete or not happen at all
Concurrent access by multiple users
– Concurrent access needed for performance
– Uncontrolled concurrent accesses can lead to inconsistencies
• Example: Two people reading a balance (say 100) and
updating it by withdrawing money (say 50 each) at the same
time
Security problems
– Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems
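As one example of such a solution, the funds-transfer case can be sketched as a single transaction that either completes or leaves no trace. This is a minimal sketch with Python's sqlite3 module; the `account` table, the `make_bank`, `transfer` and `balances` helpers, and the `balance >= 0` rule used to provoke a failure are all assumptions made for illustration.

```python
import sqlite3


def make_bank():
    """Create a tiny in-memory bank with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        " id TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # assumed rule
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 20)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # one transaction: commit on success, roll back on exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # If this debit fails, the credit above is rolled back as well.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account"))
```

A call such as `transfer(conn, "A", "B", 500)` fails on the debit, and the credit that was already applied inside the transaction is undone, so no partial update survives.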
Basic concepts in Database
Database concepts
– Data model
• A collection of concepts that can be used to describe the
structure of a database
– Schema
• The description of a database is called the database
schema, which is specified during database design and is
expected not to change frequently
– The three-schema architecture
• Internal schema
• Conceptual schema
• External schema
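The three levels can be loosely illustrated in SQL, here run through Python's sqlite3 module. Treat the mapping as a rough analogy rather than a definition: a table stands in for the conceptual schema, an index for one internal (physical) design choice, and a view for an external schema. All table, index and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual schema: the logical structure shared by all users.
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY, name TEXT NOT NULL,"
    " dept TEXT NOT NULL, salary INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ann", "R&D", 120), (2, "Bob", "Sales", 90)],
)

# Internal schema (rough stand-in): a physical access path.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

# External schema: a view for users who should not see salaries.
conn.execute("CREATE VIEW employee_directory AS SELECT name, dept FROM employee")

query = "SELECT name, dept FROM employee_directory ORDER BY name"
directory = conn.execute(query).fetchall()

# Changing the physical level does not change what the view returns.
conn.execute("DROP INDEX idx_employee_dept")
directory_after = conn.execute(query).fetchall()
```

Dropping the index alters how rows are located but not the answer, which is the idea behind physical data independence.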
Basic concepts in Database
Database concepts
– Data independence: logical and physical
– Database languages
• DBMS languages
– Database interfaces
Data modeling
– Conceptual
– Logical
– Physical
Database design
– Normalization
Databases: a classification
Logical organization of data
– Records-based database systems
– Object-oriented database systems
– Object-relational database systems
– Deductive/logic database systems
– Functional database systems
Physical organization of data
– Centralized database systems
– Distributed database systems
• Homogeneous and heterogeneous
– Client-server database systems
– Mobile database systems
Databases: a classification
Contents
– Symbolic databases
– Textual databases
– Multi-media databases
– Image databases
– Spatial databases
– Temporal databases
Application domain
– Engineering databases
– Scientific databases
– Statistical databases
– Manufacturing databases
– Business
Databases: a classification
Data usage
– Operational databases
– Decision-support databases
– Data warehousing
– Data mining
– Tactical and planning databases
Nature of data
– Structured databases
– Semi-structured (like XML data)
– Unstructured (like Web)
Self modifiability
– Passive databases
– Active databases (Triggers)
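An active database reacts to events on its own. A minimal sketch of a trigger, using SQLite through Python's sqlite3 module, is shown here; the `account` and `audit_log` tables and the trigger name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE audit_log (account_id TEXT, old_balance INTEGER, new_balance INTEGER);

    -- The trigger fires automatically after every balance update;
    -- no application code has to remember to write the log entry.
    CREATE TRIGGER log_balance_change
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES ('A', 100);
    """
)

conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 'A'")
conn.commit()

log = conn.execute("SELECT * FROM audit_log").fetchall()
```

In a passive database the same audit entry would have to be written explicitly by every program that updates a balance.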
Yesterday’s DBMS Landscape
[Figure: layered stack of applications (“Banking, SAP, …”) over a DBMS (“Server”) over a database (“Disk”)]
Yesterday’s Data
Structured data
Centralized data
Homogeneous data
Small
Cleaned
Static
Yesterday’s DBMS Hardware
Small main memory
Disk-based systems
Assumptions of yesterday’s DBMSs
Structured data with well-defined schema
Capacity of main memory <1% of the stored data
Central control to manage transactions
Pipelining is always beneficial (no storage of intermediate
results)
Cleaned data
A repository of data with simple query support
Today’s DBMS Landscape
Today’s DBMS Topics
Today’s Data (“Big Data”)
Tables
temporal/streams
distributed data
networks
scientific data
Web/text
Natural language
Today’s DBMS Hardware
Large main memory
Solid state disks
Multi-core CPUs
Co-processors (GPUs)
Today’s DBMS information needs
Traditional DB
– Users need to know the complete schema for querying
– Users must write SQL queries
– Database system does not help in searching
– Lacks semantic value
– Stores only facts
Intelligent DB
– Users need not know the complete database schema
– Users can simply use query expressions
– Provides help to make searching effective
– Semantic information is stored via knowledge graphs and other data structures in the database itself
– Stores facts and rules
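To make "stores facts and rules" concrete, here is a small sketch in plain Python of deriving new facts from stored facts plus a recursive rule, in the spirit of a deductive database. The `parent` facts, the `ancestors` function and the naive fixpoint loop are illustrative assumptions, not a description of how any particular system evaluates rules.

```python
# Stored facts: parent(x, y) pairs. The names are hypothetical.
parent = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}


def ancestors(parent_facts):
    """Derive ancestor(x, y) from the rules

        ancestor(x, y) :- parent(x, y)
        ancestor(x, z) :- parent(x, y), ancestor(y, z)

    by applying them repeatedly until no new fact appears (naive evaluation).
    """
    derived = set(parent_facts)  # first rule: every parent is an ancestor
    while True:
        # Second rule: join parent(x, y) with ancestor(y, z).
        new = {
            (x, z)
            for (x, y) in parent_facts
            for (y2, z) in derived
            if y == y2
        } - derived
        if not new:
            return derived
        derived |= new


all_ancestors = ancestors(parent)
```

A database that stores only facts could answer "who is ann's child?", while the rule also lets it answer "is ann an ancestor of dan?" without that pair ever being stored.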
Assumptions of yesterday’s DBMSs
Yesterday’s assumptions:
– Structured data with well-defined schema
– Capacity of main memory <1% of the stored data
– Preprocessed, cleaned data
– Pipelining is always beneficial (no storage of intermediate results)
– Central scheduler to schedule transactions
– A repository of data with simple query support
Today’s counterparts:
– Semi-structured/unstructured, schemaless data, ...
– DB in main memory
– Large-scale parallelism
– Effective distributed processing, job scheduling and balancing
– Manage dirty, noisy, incomplete data
– Complex analytical queries; intelligent DBMS
Question of this course
What do we have to change in
traditional DBMS to meet
tomorrow’s challenges?
Overview of topics
DBMS beyond relational databases (weeks 2-3)
– noSQL and newSQL
– In-situ processing
– Data stream management systems
Main-memory databases (week 4)
– Architecture and design principles
– Query and indexing strategy
Overview of topics
Advanced query techniques (weeks 5-6)
– Indexing techniques
– Query optimization
– Approximate querying
Parallel and distributed databases (weeks 7-8)
– Partition techniques
– Parallel and distributed models
– Fault tolerance and concurrency control
– Distributed stream processing
Overview of topics
Database and knowledge discovery (weeks 9-11)
– DBMS and IR
– DBMS and scalable DM/ML
– Intelligent DBMS
Data quality (weeks 12-13)
– Dirty data: issues and problems
– Data cleaning and repairing
Other new trends and applications (weeks 14-15)
– Crowdsourcing
– Data warehouse and datacenters
– Privacy and security
Course format
A seminar-style course: there will be no exam!
– Lectures: background
– Paper presentation
– 6 homework sets
– 1 final course project
– Textbook: no official textbook
• References:
• Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, 2008
• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)
• Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al.
Online Tutorials and Papers
– Research papers or chapters related to the topics (3-4 each)
• At the end of lecture notes from Ln2
Grading
Reviews and Presentation: 40%
Course Project: 45%
Final Project report and presentation: 15%
Homework:
Six sets of homework, starting from week 2; deadlines:
• 5 pm, Thursday, 1/28, week 3
• 5 pm, Thursday, 2/16, week 5
• 5 pm, Thursday, 2/25, week 7
• 5 pm, Thursday, 3/22, week 10
• 5 pm, Thursday, 4/5, week 12
• 5 pm, Thursday, 4/14, week 14
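As a worked example of the weighting above, here is a small sketch that combines component scores into a final percentage. The weights come from the slide; the function name, the dictionary keys and the sample scores are hypothetical. Homework has no separate weight listed, so it is left out.

```python
# Weights as listed on the grading slide (percent of the final grade).
WEIGHTS = {
    "reviews_and_presentation": 40,
    "course_project": 45,
    "final_report_and_presentation": 15,
}


def final_grade(scores):
    """Combine per-component scores (each 0-100) into a final percentage."""
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS) / 100


# Hypothetical student: 90 on reviews, 80 on the project, 100 on the final report.
example = final_grade(
    {
        "reviews_and_presentation": 90,
        "course_project": 80,
        "final_report_and_presentation": 100,
    }
)
```

For the sample scores the contributions are 36 + 36 + 15, giving 87.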
Review Evaluation
Each time, pick 2 research papers from the lecture notes to be
discussed in the next two weeks, starting from week 2.
Write a one-page review for each of the papers (10 marks)
Summary: 2 marks
• A clear problem statement: input, question/output
• The need for this line of research: motivation
• A summary of key ideas, techniques and contributions
Evaluation: 5 marks
– Criteria for the line of research (e.g., expressive power,
complexity, accuracy, scalability, etc.)
– Evaluation based on your criteria; justify your evaluation
• 3 strong points
• 3 weak points
Suggest possible extensions: 3 marks
Presentation Evaluation
Presentation (15 minutes + 2-3 minutes Q&A)
– Background and motivation
• Challenges
• Why the problem needs to be solved
– Problem formulation
– Algorithm description
– Experimental study
– Conclusion and Future work
Project – Research and development
(recommended)
Research and development:
– Topic: pick one from a suggested project list (will be published
on course website)
Example: Association rule mining over temporal networks
Development
– Pick a research paper from the reading list of Ln3 – Ln11
– Implement its main algorithms
– Conduct its experimental study
You are encouraged to come up with your own project – talk to me first
Start early!
Project – Research and development
Evaluation
Distribution:
– Algorithms: technical depth, performance guarantees (20%)
– Prove the correctness, complexity analysis and performance
guarantees of your algorithms (15%)
– Justification (experimental evaluation) (10%)
Report: in the form of technical report/research paper
– Introduction: problem statement, motivation
– Related work: survey
– Techniques: algorithms, illustration via intuitive examples
– Correctness/complexity/property/proofs
– Experimental evaluation
– Possible extensions
Project – survey
Topic: pick one topic from a suggested list
Example: distributed graph query engines; temporal/streaming
querying approaches.
Distribution:
– Select 5-6 representative papers, independently (10%)
– Develop a set of criteria: the most important issues in that
line of research, based on your own understanding; justify
your criteria (10%)
– Evaluate each of the papers based on your criteria (15%)
– A table to summarize the assessment, based on your
criteria; draw and justify your conclusion and
recommendation for various applications (10%)
Your understanding of the topic
Project report and presentation – 15%
A clear problem statement
Motivation and challenges
Key ideas, techniques/approaches
Key results – what you have got, intuitive examples
Findings/recommendations for different applications
Demonstration: a must if you do a development project
Presentation: question handling (show that you have developed
a good understanding of the line of work)
Learn how to present your work
Readings for this week and next week
Overview of traditional databases
– What goes around comes around, by Michael Stonebraker,
https://mitpress.mit.edu/sites/default/files/titles/content/9780262693141_sch_0001.pdf
– Database Management Systems, by Raghu
Ramakrishnan and Johannes Gehrke
http://pages.cs.wisc.edu/~dbbook/index.html
Preview of noSQL and newSQL
– “NoSQL Databases”, by Christof Strauch, sections 1-3
– Scalable SQL and noSQL data stores, Rick Cattell,
http://www.sigmod.org/publications/sigmodrecord/1012/pdfs/04.surveys.cattell.pdf