CS 440: Database Management Systems

CS 540
Database Management Systems
Course overview
Welcome to CS540!
• Instructor: Arash Termehchy
• Assistant Professor at EECS
• Research on data management and analytics
• Information & Data Management and Analytics (IDEA) Lab
The Era of Big Data
• Technological shifts, e.g.,
mobile devices, have created a
staggering number of
enormous data sets.
• Both opportunities and
challenges.
Opportunities: unreasonable effectiveness of data
• A. Halevy, et al. The unreasonable effectiveness of data,
IEEE Intelligent Systems, 2009.
• Observation from working with large datasets at Google.
– More data generally outperforms complex statistical
models in data-centric prediction and discovery.
• Conclusion:
– Usually, no need for overly complex statistical models.
Opportunities are priceless!
The story of John Snow
“In the mid-1850s, Dr. John Snow plotted cholera deaths on a
map, and in the corner of a particularly hard-hit buildings was a
water pump. A 19th-century version of Big Data, which suggested
an association between cholera and the water pump.”
Integrating data sets has saved millions of lives!
Paradigm-shifting influence on scientific discovery
• “The Fourth Paradigm: Data-Intensive Scientific Discovery”,
Jim Gray
– Empirical
– Theoretical
– Computational
– Data-centric
• The Sloan Sky Server database is a top-cited
resource in the field of astronomy.
– Astronomical observation => database query
Challenges: data volume
• Sloan Sky Server will soon store 30 terabytes per day.
• The Large Hadron Collider can generate 500 exabytes per day.
• 90% of the world's data was generated in the last two years (2013)
– Every two years: ten times more data
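The two figures in the last bullet are consistent with each other: if 90% of all data is at most two years old, the older data is the remaining 10%, so the total is ten times what existed two years earlier. A minimal sketch of that arithmetic (the function name is illustrative, not from the slides):

```python
from fractions import Fraction

def growth_factor(recent_share):
    """Ratio of all data to the data that existed before the recent period,
    given the share of all data generated during that period."""
    older_share = 1 - recent_share
    return 1 / older_share

# 90% of the data is recent, so the older data is 1/10 of the total.
factor = growth_factor(Fraction(9, 10))
print(factor)  # 10
```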
Challenges: data variety/ diversity
• Database systems used to deal with
a single static database.
• Need to transform and/or
integrate a large number
of evolving data sets.
• Impossible to do manually.
“A data integration expert is never without a job”
Challenges: usability
“... (in the next few years) we project a need for 1.5 million
additional analysts in the United States who can analyze data
effectively ...”
-- McKinsey Big Data Study, 2012
Current systems are not built for scientists and normal
users.
“It may take a PhD in computer science to successfully
deploy a data analytics algorithm!”
The notion of database management system (DBMS)
• Data processing used to be mostly ad-hoc programming.
• W. McGee, Generalization: Key to Successful Electronic
Data Processing, Journal of the ACM, 1959.
• Generalization, aka abstraction/ data modeling
– File: A sequence of records.
– Operation: sort, select part of the file, …
• Makes data management and processing usable.
– People can learn and use the abstraction instead of
developing new data processing programs.
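To make the "file as a sequence of records" abstraction concrete, here is a minimal Python sketch. It is not taken from McGee's paper: the record fields, the sample data, and the helper names `sort_file` and `select` are all assumptions made for illustration. The point is only that generic operations, written once, work on any file of records.

```python
from operator import itemgetter

# A "file" here is just a sequence of records (dicts).
# The records and field names are made up for illustration.
payroll = [
    {"emp_id": 3, "name": "Lee", "dept": "sales", "salary": 52000},
    {"emp_id": 1, "name": "Kim", "dept": "ops", "salary": 61000},
    {"emp_id": 2, "name": "Roy", "dept": "sales", "salary": 48000},
]

def sort_file(records, field):
    """Return a new file with the records ordered by one field."""
    return sorted(records, key=itemgetter(field))

def select(records, predicate):
    """Return the part of the file whose records satisfy the predicate."""
    return [r for r in records if predicate(r)]

# Compose the generic operations instead of writing a one-off program.
sales = select(payroll, lambda r: r["dept"] == "sales")
by_salary = sort_file(sales, "salary")
names = [r["name"] for r in by_salary]
print(names)  # ['Roy', 'Lee']
```

The same two functions apply unchanged to any other list of records, which is the usability gain the slide describes.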
Abstraction is the key
• How to develop usable abstractions for our data?
– Data models, query languages,
– Relational data model, graph data model, …
• How to implement these abstractions efficiently?
– Database system internals
– Storage management, indexing, ...
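The split between the abstraction and its implementation can be sketched with Python's built-in `sqlite3` module. The table, column names, and sample rows below are assumptions chosen for illustration. The query is written once against the relational abstraction; adding an index changes how the system may evaluate it, not what it returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE star (id INTEGER PRIMARY KEY, name TEXT, magnitude REAL)"
)
# Illustrative rows only; the values are approximate.
conn.executemany(
    "INSERT INTO star (name, magnitude) VALUES (?, ?)",
    [("Sirius", -1.46), ("Vega", 0.03), ("Polaris", 1.98), ("Betelgeuse", 0.5)],
)

# The query is expressed against the abstraction (a relation).
QUERY = "SELECT name FROM star WHERE magnitude < ? ORDER BY name"

def bright_stars(limit):
    """Names of stars brighter (numerically smaller magnitude) than limit."""
    return [row[0] for row in conn.execute(QUERY, (limit,))]

before = bright_stars(1.0)

# An implementation-level change: add an index on the filtered column.
conn.execute("CREATE INDEX idx_star_magnitude ON star (magnitude)")

after = bright_stars(1.0)
print(before == after)  # True
```

Whether the index actually speeds up this tiny query is beside the point; what matters is that the query text did not have to change.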
What this course is not about
• We do not discuss the basic concepts
– ER model, relational model, relational algebra,
SQL, database design, database programming
• You should know them already
– If you do not, drop the course and take CS 340
• We do not discuss how to tune or implement an
application using MySQL, Oracle, …
Topics
• How to develop usable abstractions for our data?
– data independence principle
– relational data model
– graph data model
• How to implement these abstractions efficiently?
– storage management and indexing
– query processing algorithms
– query optimization
– concurrency control and recovery
– parallel and distributed data processing
– data transformation & integration
This is a research-oriented course
• Learn & discuss the concepts and algorithms
– Read and summarize classic and new research
papers.
– Discuss them in our lectures.
• Develop data systems
– Apply the lessons learned to interesting data
problems.
Learning the fundamentals: paper review
• Read and summarize the papers before the lecture:
– What is the main problem discussed in the paper?
– Why is it important?
– What are the main ideas of the proposed solutions?
– What are the final results of the paper?
• References on the course website on “how to read
scientific papers”.
• Post a 300-word summary of the paper on Piazza by
12:00 pm of the day of the class.
– Private posts
• One paper per lecture, marked by * on the course website
– You can skip two reviews.
• 10% of the total grade.
Learning the fundamentals: Lectures
• Review and discuss the papers.
• Slides will be available on the website after the class.
• Provide the road map for studying.
• Attendance is not required but encouraged.
• Participate and ask questions!
Apply your understanding: Assignments
• Six assignments.
• Announced on Piazza, posted on the course website.
• Both written and programming.
• Submit using TEACH.
• Write using word processors and submit in pdf.
• Start early!
• 25% of the overall grade.
Learning the fundamentals: Exam
• Midterm exam in class.
– Closed books and notes
• Tests your knowledge of the papers and subjects
discussed in the class.
• 30% of the total grade.
• No final exam (instead you work on your projects).
Apply your understanding: project
• System/ research project on data management / analytics
– System: build a rather complex system using available
methods
• Solve a real problem over large data sets
• More challenging than a well-defined implementation
• Identify and resolve design choices and tradeoffs.
– Research: build a system with some novel ideas
• Identify an interesting problem and read the state-of-the-art
papers on the problem.
• Propose and implement some new ideas to solve the problem.
• Groups of 1 – 3 students.
– Larger groups are not allowed.
• 35% of the total grade.
Project themes
• You should pick a project on one of the following themes.
• Data interaction
– An interactive query interface
• an interface that learns from previous interactions.
– An interactive and usable query interface
• easier to use than SQL.
• effective and efficient keyword, visual, or natural language
interface.
• Data cleaning & transformation
– Most datasets are not clean: missing values, …
– Most data sets are not in relational format:
• Online posts, spreadsheets, ...
– A system that cleans or restructures data and loads it into a
relational database to get interesting insights.
– Reduce the manual cleaning/ restructuring burden
Project themes
• Data integration
– A system that integrates multiple datasets into one
relational database.
– Reduce the amount of manual work in integration.
– Sample: how to integrate online posts from different
websites.
• Predictive modeling
– A system that learns predictive models over large
relational databases.
• automatic feature extraction, relational learning, deep
learning
– A system that efficiently performs probabilistic
inference over relational databases.
Project themes
• Your project may combine multiple themes
– A system that learns a predictive model over multiple
relational databases.
– Combines data integration and predictive modeling
themes.
• A lecture to go over sample projects next week.
Project milestones
• Project proposal:
– Group members
– What do you want to solve?
– Relevant references
– Which tools, data sets, and systems will you use?
• Midterm presentation: 7 minutes (3 + 4 Q&A)
– Detailed description of the problem
– Your approach to solving it
– Review of the related work
– Your progress, challenges, and your plan to solve them
• Presentation in class: 15 minutes (11 + 4 Q&A)
• Final report:
– Problem & solution
– Detailed comparison with the related work
– Analysis of empirical studies
– Conclusion
Project
• Discuss your ideas and progress with the course staff
during the term.
• Graded based on technical depth and presentation.
• Check out the course website for more information.
• Start early!
Communication
• Communicate with the course staff
– TAs: Jose Picado, Parisa Ataie
– Piazza
• preferred method of communication
– Office hours
• Arash: Thursday 4:30 – 5:30 pm
• Jose: Tuesday/ Thursday 10 – 11 am
• Parisa: Wednesday 9 – 10 am
– Email the staff for other types of questions
• Use [cs540] tag in the subject line.
• Communicate with your peers on course materials and lectures.
• Check Piazza and the course website for announcements,
course policies, and schedule.
What is next?
• A classic paper on the relational model by its
inventor.
• Physical and logical data independence and
how they have evolved.
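As a preview of logical data independence, the sketch below uses Python's `sqlite3` module to reorganize a schema while keeping an application query unchanged. The tables, columns, and rows are hypothetical, and using a view to preserve the old relation name is one common technique, not necessarily the one the upcoming readings focus on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO student VALUES (1, 'Ada', 'ada@example.org');
    INSERT INTO student VALUES (2, 'Edgar', 'edgar@example.org');
""")

# The application only knows the logical relation "student".
APP_QUERY = "SELECT name, email FROM student ORDER BY id"
before = conn.execute(APP_QUERY).fetchall()

# Logical reorganization: split the table in two, then expose the
# original relation again as a view with the same name and columns.
conn.executescript("""
    CREATE TABLE student_name AS SELECT id, name FROM student;
    CREATE TABLE student_contact AS SELECT id, email FROM student;
    DROP TABLE student;
    CREATE VIEW student AS
        SELECT n.id AS id, n.name AS name, c.email AS email
        FROM student_name n JOIN student_contact c ON c.id = n.id;
""")

after = conn.execute(APP_QUERY).fetchall()
print(before == after)  # True
```

The application query is untouched even though the stored tables changed; physical data independence is the analogous guarantee for storage-level changes such as adding an index.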