CS345: Advanced Databases
Chris Ré
What this course is
Database fundamentals:
– Theory
– Old Crusty, Good SQL stuff
– No/New/Not-Yet SQL
New stuff: Knowledge bases & Inference
Databases is a strange and beautiful area:
Theory, Algorithms, Systems, & Applications
It’s a bit scattered, and I love it.
A Brief, Biased
Database History
Charles Bachman
Edgar Codd
Jim Gray
Three Turing Award Winners
Seminal contributions made in Industry
The Birth of the Relational Model
(1970)
database: a handful of relations
(tables) with fixed schema.
WorksIn(Employee,Dept)
Query with small # of operations:
Selection (filter),
Projection, Join, Union.
Basically, an operational finite model theory.
Data and Query Model
R(A,B) = { (a1,b1),…,(an,bn) }
S(B,C,D) = { (b’1,c1,d1),…,(b’m,cm,dm) }
Data
π_A(R) = { a : exists b. (a,b) in R }
Projection
σ_F(R) = { (a,b) in R : F( (a,b) ) }
F : D(R) -> {True, False}
Selection
Join(R,S) =
{ (a,b,c,d) : (a,b) in R & (b,c,d) in S}
Join
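A minimal sketch of the three operators above, assuming relations are modeled as Python sets of tuples with attribute positions fixed by convention. The sample tuples and the function names (project_A, select, join) are illustrative and not from the slides.

```python
# R(A, B) and S(B, C, D) as sets of tuples (sample data is made up).
R = {("a1", "b1"), ("a2", "b2"), ("a3", "b1")}
S = {("b1", "c1", "d1"), ("b2", "c2", "d2")}


def project_A(r):
    """Projection onto A: keep a whenever some (a, b) is in r."""
    return {a for (a, b) in r}


def select(r, pred):
    """Selection: keep the tuples t of r for which pred(t) is True."""
    return {t for t in r if pred(t)}


def join(r, s):
    """Join of R(A, B) and S(B, C, D) on the shared attribute B."""
    return {(a, b, c, d) for (a, b) in r for (b2, c, d) in s if b == b2}


if __name__ == "__main__":
    print(sorted(project_A(R)))
    print(sorted(select(R, lambda t: t[1] == "b1")))
    print(sorted(join(R, S)))
```

The nested-loop join here is the naive evaluation straight from the definition. It says nothing about how a real system would choose to execute the join.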
Key idea of the Relational Model
Declarative
User says what they want--not how to get it.
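To make "what, not how" concrete, here is a small sketch using Python's built-in sqlite3 module and the WorksIn(Employee, Dept) relation from earlier. The rows and the "sales" department are made-up sample data. The first query states the desired result and leaves the access path to the engine. The second spells out the scan and filter by hand.

```python
import sqlite3

# In-memory database holding WorksIn(Employee, Dept); rows are sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WorksIn (Employee TEXT, Dept TEXT)")
conn.executemany(
    "INSERT INTO WorksIn VALUES (?, ?)",
    [("alice", "sales"), ("bob", "research"), ("carol", "sales")],
)

# Declarative: say what we want; the engine decides how to get it.
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT Employee FROM WorksIn WHERE Dept = ? ORDER BY Employee",
        ("sales",),
    )
]

# Imperative: we write the scan, the filter, and the sort ourselves.
imperative = []
for employee, dept in conn.execute("SELECT Employee, Dept FROM WorksIn"):
    if dept == "sales":
        imperative.append(employee)
imperative.sort()

print(declarative, imperative)
```

Both versions return the same employees. Only the declarative one stays unchanged if an index is added or the storage layout changes.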
Key question: Can one
implement the Relational
Model efficiently?
System R
Pat Selinger
In 1974, System R shows it is possible
to get good performance.
1st Implementation of SQL.
IBM didn’t push it,
worried about IMS cannibalization, but…
Others Come on to the Scene…
Larry Ellison hears about IBM’s Research
prototype and founds a company….
Fast Forward to Today
The relational model is the dominant model of data.
Takeaways about Database Research
Started with mathematical
elegance and with close ties to
industry.
Improve runtime performance as a
proxy to increase programmer
productivity.
The Big Ideas
Independence
Declarative languages can improve
productivity
– Different team members work independently
• Backend, Storage, UI, BI, Etc.
– Transactional model.
– Challenge: Support efficient concurrent
access?
Performance
Parallel programming is hard; SQL is the most
popular parallel programming language.
– How do you deal with asymmetry of memory
hierarchy (Disk/MM/Cache)?
– How do you structure parallel optimization?
– Concurrency?
Manageability
Systems live over time, and the system
should automate many routine tasks.
– Maintain derived data products (views)
– Self-monitoring systems (autonomic)
Course Topics
A user says what they want—
not how to get it.
Topic 1: QP Fundamentals
Query Processing Fundamentals
1. Empirical join evaluation from the 70s!
2. System R: The Archetype (Cardinality)
3. Formal Query Languages
4. Acyclic Query Evaluation (Structure)
5. Worst-case Optimal Join Algorithms (S + C)
This will be the most
formal part of the
course.
Analyzing your data before it was big
(when it was just very large…)
Topic 2: OLAP-Style Analytics
Building new and old data systems:
1. Theory of Materialized Views
2. Gamma (Parallel DBs)
3. MapReduce & the Rise of NoSQL
(2000s)
4. NewSQL & Optimizing Joins on MR
(theory)
5. Fagin’s Algorithm (theory)
6. Statistical Analytic Systems
My biased view of the future…
Topic 3: Next-Generation Systems
1. Information Extraction
2. Probabilistic Query Evaluation
(Theory)
3. Scalable Inference
4. Knowledge Bases
Transactions.
Topic 4: OLTP Style
Transactional Systems
1. The rise of Key-Value Stores
2. The case for determinism
3. CALM & CAP
4. The Return of Main Memory DBs.
5. Spanner, F1, and Data Centers
Course Logistics
Grading
• Course Project (More next)
– Do something interesting with data.
– Teams OK
– Form teams soon and email me by Jan 12.
• Midterm Exam
Projects in each topic
1. Knowledge Base Construction
– Pick a domain and build a KBC system for it with DeepDive
2. Join Algorithms
– Certificate versions (see me)
– MapReduce? GraphLab? Spark?
3. Analytics Systems
4. Transactional Systems.
You are free to
choose other
projects
Datasets
• Snapshot of the web marked up with NLP
tools and structured data (KBP and KBA
challenges)
• 500k+ docs used by PaleoBiologists and
structured data.
• We can mark up even more stuff.
• Benchmark ML, graphs if you want to work
on analytics or join evaluation.
Wednesday
• Wednesday we begin the ancient art of
join evaluation. All who pass this way must
pass through this ancient topic!
• Read: Shapiro.
– not too carefully, we’ll go through details