LN1 - WSU EECS

CPT-S 580-06
Advanced Databases
Yinghui Wu
EME 49
Welcome!
Instructor: Yinghui Wu
Office: EME 49
Email: [email protected]
Office hours: Wed/Fri (1 PM to 3 PM) or by appointment
Course website: TBF
Motivation
 This course familiarizes you with current developments in database research
 We discuss problems and topics in advanced database research, and introduce solutions that are
– new and currently making their way into database management systems and applications
– not yet fully developed, or still open problems
 Outcome: possible starting points for your research project, master’s or PhD thesis, or technical report
– Prepares you for positions in academia and industry
Database Management Systems
 Database concepts
– Database
• A database represents some aspect of the real world
• A database is a logically coherent collection of data with
some inherent meaning.
• A database is designed, built, and populated with data for a
specific purpose.
• It has an intended group of users and some preconceived
applications in which these users are interested.
– Database Management System
• A database management system (DBMS) is a collection of
programs that enables users to Define, Construct,
Manipulate and Share a database.
Typical DBMS Functionality
 Define a particular database in terms of its data types,
structures, and constraints
 Construct or Load the initial database contents on a secondary
storage medium
 Manipulate the database:
– Retrieval: Querying, generating reports
– Modification: Insertions, deletions and updates to its content
– Accessing the database through Web applications
 Share a database: allow multiple users and programs to access the database simultaneously
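The Define / Construct / Manipulate steps above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the `student` table, its columns, and the sample rows are made up for this example and do not come from the course material.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# Define: data types, structure, and constraints.
conn.execute(
    """
    CREATE TABLE student (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        gpa  REAL CHECK (gpa BETWEEN 0.0 AND 4.0)
    )
    """
)

# Construct: load the initial database contents.
conn.executemany(
    "INSERT INTO student (id, name, gpa) VALUES (?, ?, ?)",
    [(1, "Ann", 3.9), (2, "Bob", 3.1), (3, "Eve", 3.6)],
)

# Manipulate: modification (update, delete) followed by retrieval (query).
conn.execute("UPDATE student SET gpa = 3.3 WHERE id = 2")
conn.execute("DELETE FROM student WHERE id = 3")
rows = conn.execute("SELECT name, gpa FROM student ORDER BY id").fetchall()
print(rows)
```

Sharing is not shown here: an in-memory SQLite database is private to one connection, so a multi-user setup would need a file-based or client-server DBMS.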
Database Management System (DBMS)
 DBMS contains information about a particular enterprise
– Collection of interrelated data
– Set of programs to access the data
– An environment that is both convenient and efficient to use
 Database Applications:
– Banking: transactions; Airlines: reservations, schedules; Universities:
registration, grades; Sales: customers, products, purchases
– Online retailers: order tracking, customized recommendations
– Manufacturing: production, inventory, orders, supply chain; Internet of
Things
– Human resources: employee records, salaries, tax deductions
– GIS; scientific computing
 Databases can be large and of arbitrary complexity
 Databases touch all aspects of our lives
File systems to manage data?
 Data redundancy and inconsistency
– Multiple file formats, duplication of information in different
files
 Difficulty in accessing data
– Need to write a new program to carry out each new task
 Data isolation
– Multiple files and formats
 Integrity problems
– Integrity constraints (e.g., account balance > 0) become
“buried” in program code rather than being stated explicitly
– Hard to add new constraints or change existing ones
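As a contrast to constraints "buried" in program code, the sketch below states the balance > 0 rule once, in the schema, so every program that touches the table is subject to it. It assumes SQLite via Python's sqlite3 module; the `account` table and the `try_update` helper are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The integrity constraint is declared explicitly in the schema,
# not re-implemented inside each application program.
conn.execute(
    """
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance > 0)
    )
    """
)
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()


def try_update(new_balance):
    """Attempt to set the balance; return False if the DBMS rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "UPDATE account SET balance = ? WHERE id = 1", (new_balance,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_update(50)    # satisfies balance > 0
rejected = try_update(-20)   # violates the CHECK constraint
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(accepted, rejected, balance)
```

Changing the rule later means changing one schema definition rather than hunting through every program that writes to the file.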
File systems to manage data?
 Atomicity of updates
– Failures may leave database in an inconsistent state with partial
updates carried out
– Example: Transfer of funds from one account to another should
either complete or not happen at all
 Concurrent access by multiple users
– Concurrent access needed for performance
– Uncontrolled concurrent accesses can lead to inconsistencies
• Example: Two people reading a balance (say 100) and
updating it by withdrawing money (say 50 each) at the same
time
 Security problems
– Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems
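The two examples above (the funds transfer and the concurrent withdrawals) can be sketched as follows. This is a simplified illustration, assuming SQLite through Python's sqlite3 module: the `account` table, the `transfer` helper, and the simulated failure are invented for the example, and the lost update is replayed as a fixed interleaving in plain Python rather than with real concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()


def transfer(src, dst, amount, fail_midway=False):
    """Move amount from src to dst; both updates take effect or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            if fail_midway:
                # Hypothetical failure between the two updates.
                raise RuntimeError("simulated crash")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances():
    return [b for (b,) in conn.execute("SELECT balance FROM account ORDER BY id")]


transfer(1, 2, 30, fail_midway=True)
after_failure = balances()   # the partial debit was rolled back
transfer(1, 2, 30)
after_success = balances()

# Lost update without concurrency control, as a fixed interleaving:
shared_balance = 100
read_a = shared_balance        # user A reads 100
read_b = shared_balance        # user B reads 100 before A writes
shared_balance = read_a - 50   # A writes 50
shared_balance = read_b - 50   # B overwrites with 50; A's withdrawal is lost
print(after_failure, after_success, shared_balance)
```

After the failed transfer the balances are still [100, 0], and after the successful one they are [70, 30]. In the interleaving, 100 was withdrawn in total but the stored balance ends at 50 instead of 0, which is the inconsistency a DBMS's concurrency control is meant to prevent.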
Basic concepts in Database
 Database concepts
– Data model
• A collection of concepts that can be used to describe the
structure of a database
– Schema
• The description of a database is called the database
schema, which is specified during database design and is
expected not to change frequently
– The three-schema architecture
• Internal schema
• Conceptual schema
• External schema
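One loose way to picture the three levels is with SQLite objects, as in the sketch below. The mapping is an approximation made for illustration only (a table as the conceptual schema, an index as an internal-level choice, a view as an external schema); the `employee` table, `idx_employee_dept`, and `employee_directory` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual schema: the logical structure of the whole database.
conn.execute(
    """
    CREATE TABLE employee (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        dept   TEXT NOT NULL,
        salary INTEGER NOT NULL
    )
    """
)

# Internal schema: a physical storage/access-path decision.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

# External schema: what one group of users sees (salary is hidden).
conn.execute(
    "CREATE VIEW employee_directory AS SELECT name, dept FROM employee"
)

conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ann", "R&D", 120), (2, "Bob", "Sales", 90)],
)

query = "SELECT name, dept FROM employee_directory ORDER BY name"
directory = conn.execute(query).fetchall()

# Changing the internal level does not change what the view's users see.
conn.execute("DROP INDEX idx_employee_dept")
directory_after = conn.execute(query).fetchall()
print(directory, directory_after)
```

The query against the view returns the same rows before and after the index is dropped, which is the intuition behind keeping the levels separate.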
Basic concepts in Database
 Database concepts
– Data independence: logical and physical
– Database languages
• DBMS languages
– Database interfaces
 Data modeling
– Conceptual
– Logical
– Physical
 Database design
– Normalization
Databases: a classification
 Logical organization of data
– Records-based database systems
– Object-oriented database systems
– Object-relational database systems
– Deductive/logic database systems
– Functional database systems
 Physical organization of data
– Centralized database systems
– Distributed database systems
• Homogeneous and heterogeneous
– Client-server database systems
– Mobile database systems
Databases: a classification
 Contents
– Symbolic databases
– Textual databases
– Multi-media databases
– Image databases
– Spatial databases
– Temporal databases
 Application domain
– Engineering databases
– Scientific databases
– Statistical databases
– Manufacturing databases
– Business
Databases: a classification
 Data usage
– Operational databases
– Decision-support databases
– Data warehousing
– Data mining
– Tactical and planning databases
 Nature of data
– Structured databases
– Semi-structured (like XML data)
– Unstructured (like Web)
 Self modifiability
– Passive databases
– Active databases (Triggers)
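To make the passive/active distinction concrete, the sketch below defines a trigger so the database itself reacts to an update. It is a minimal example assuming SQLite through Python's sqlite3 module; the `account` and `audit_log` tables and the `log_balance_change` trigger are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE audit_log (
        account_id  INTEGER,
        old_balance INTEGER,
        new_balance INTEGER
    );

    -- The "active" part: the DBMS runs this on every balance update,
    -- without any application code asking for it.
    CREATE TRIGGER log_balance_change
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES (1, 100);
    """
)

# The application only issues the update; the trigger writes the log row.
conn.execute("UPDATE account SET balance = 50 WHERE id = 1")
log = conn.execute("SELECT * FROM audit_log").fetchall()
print(log)
```

In a passive database the same audit row would have to be written explicitly by each program that changes a balance.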
Yesterday’s DBMS Landscape
[Figure: applications (“Banking, SAP, …”) on top of a DBMS (“Server”), which manages a database (“Disk”)]
Yesterday’s Data
 Structured data
 Centralized data
 Homogeneous data
 Small
 Cleaned
 Static
Yesterday’s DBMS Hardware
Small main memory
Disk-based systems
Assumptions of yesterday’s DBMSs
 Structured data with well-defined schema
 Capacity of main memory <1% of the stored data
 Central control to manage transactions
 Pipelining is always beneficial (no storage of intermediate results)
 Cleaned data
 A repository of data with simple query support
Today’s DBMS Landscape
Today’s DBMS Topics
Today’s Data (“Big Data”)
 Tables
 Temporal data/streams
 Distributed data
 Networks
 Scientific data
 Web/text
 Natural language
Today’s DBMS Hardware
Large main memory
Solid state disks
Multi-core CPUs
Co-processors (GPUs)
Today’s DBMS information need
Traditional DB
 Users need to know complete schema for querying
 Users must write SQL queries
 Database system does not help in searching
 Lacks semantic value
 Stores only facts
Intelligent DB
 Users need not know complete database schema
 Users can simply use query expressions
 Provides help to make searching effective
 Semantic information is stored via knowledge graphs and other data structures in the database itself
 Stores facts and rules
Assumptions of yesterday’s DBMSs
 Structured data with well-defined schema
• Today: semi-structured/unstructured, schemaless data, ...
 Capacity of main memory <1% of the stored data
• Today: DB in main memory
 Central scheduler to schedule transactions
• Today: effective distributed processing, job scheduling and balancing
 Pipelining is always beneficial (no storage of intermediate results)
• Today: large-scale parallelism
 Preprocessed, cleaned data
• Today: manage dirty, noisy, incomplete data
 A repository of data with simple query support
• Today: complex analytical queries; intelligent DBMS
Question of this course
What do we have to change in traditional DBMSs to meet tomorrow’s challenges?
Overview of topics
 DBMS beyond relational databases (week 2-3)
– noSQL and newSQL
– In-situ processing
– Data stream management systems
 Main-memory databases (week 4)
– Architecture and design principles
– Query and indexing strategy
Overview of topics
 Advanced query techniques (week 5-6)
– Indexing techniques
– Query optimization
– Approximate querying
 Parallel and distributed databases (week 7-8)
– Partition techniques
– Parallel and distributed models
– Fault tolerance and concurrency control
– Distributed stream processing
Overview of topics
 Database and knowledge discovery (week 9-11)
– DBMS and IR
– DBMS and scalable DM/ML
– Intelligent DBMS
 Data quality (week 12-13)
– Dirty data: issues and problems
– Data cleaning and repairing
 Other new trends and applications (week 14-15)
– Crowdsourcing
– Data warehouse and datacenters
– Privacy and security
Course format
 A seminar-style course: there will be no exam!
– Lectures: background
– Paper presentation
– 6 homework assignments
– 1 final course project
– Textbook: no official textbook
• References:
• Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, 2008
• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)
• Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al.
 Online tutorials and papers
– Research papers or chapters related to the topics (3-4 each)
• Listed at the end of the lecture notes, starting from LN2
Grading
 Reviews and Presentation: 40%
 Course Project: 45%
 Final Project report and presentation: 15%
Homework:
 Six sets of homework, starting from week 2; deadlines:
• 5 pm, Thursday, 1/28, week 3
• 5 pm, Thursday, 2/16, week 5
• 5 pm, Thursday, 2/25, week 7
• 5 pm, Thursday, 3/22, week 10
• 5 pm, Thursday, 4/5, week 12
• 5 pm, Thursday, 4/14, week 14
Review Evaluation
Each time, pick 2 research papers from the lecture notes that will be discussed in the next two weeks, starting from Week 2.
Write a one-page review of each paper; each review is worth 10 marks:
 Summary: 2 marks
• A clear problem statement: input, question/output
• The need for this line of research: motivation
• A summary of key ideas, techniques and contributions
 Evaluation: 5 marks
– Criteria for the line of research (e.g., expressive power,
complexity, accuracy, scalability, etc)
– Evaluation based on your criteria; justify your evaluation
• 3 strong points
• 3 weak points
 Suggest possible extensions: 3 marks
Presentation Evaluation
 Presentation (15 minutes + 2-3 minutes Q&A)
– Background and motivation
• Challenges
• Why the problem matters
– Problem formulation
– Algorithm description
– Experimental study
– Conclusion and Future work
Project – Research and development
(recommended)
 Research and development:
– Topic: pick one from a suggested project list (will be published
on course website)
Example: Association rule mining over temporal networks
 Development
– Pick a research paper from the reading list of LN3 to LN11
– Implement its main algorithms
– Conduct its experimental study
You are encouraged to come up with your own project – talk to me first
Start early!
Project – Research and development
Evaluation
 Distribution:
– Algorithms: technical depth, performance guarantees (20%)
– Proofs of correctness, complexity analysis and performance guarantees of your algorithms (15%)
– Justification (experimental evaluation) (10%)
 Report: in the form of a technical report/research paper
– Introduction: problem statement, motivation
– Related work: survey
– Techniques: algorithms, illustration via intuitive examples
– Correctness/complexity/property/proofs
– Experimental evaluation
– Possible extensions
Project – survey
 Topic: pick one topic from a suggested list
Example: distributed graph query engines; temporal/streaming querying approaches.
 Distribution:
– Select 5-6 representative papers, independently (10%)
– Develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria (10%)
– Evaluate each of the papers based on your criteria (15%)
– A table to summarize the assessment, based on your criteria; draw and justify your conclusions and recommendations for various applications (10%)
Your understanding of the topic
Project report and presentation – 15%
 A clear problem statement
 Motivation and challenges
 Key ideas, techniques/approaches
 Key results – what you have got, intuitive examples
 Findings/recommendations for different applications
 Demonstration: a must if you do a development project
 Presentation: question handling (show that you have developed
a good understanding of the line of work)
Learn how to present your work
Readings for this week and next week
 Overview of traditional databases
– What Goes Around Comes Around, by Michael Stonebraker and Joseph M. Hellerstein,
https://mitpress.mit.edu/sites/default/files/titles/content/9780262693141_sch_0001.pdf
– Database Management Systems, by Raghu
Ramakrishnan and Johannes Gehrke
http://pages.cs.wisc.edu/~dbbook/index.html
 Preview of noSQL and newSQL
– “noSQL databases”, by Christof Strauch –sections 1-3
– Scalable SQL and noSQL data stores, Rick Cattell,
http://www.sigmod.org/publications/sigmodrecord/1012/pdfs/04.surveys.cattell.pdf