Transcript Notes
Wrapup
Amol Deshpande
CMSC424
“Inventing the Future”
Wednesday at 3:30pm
1115 CSIC
http://www.cs.umd.edu/projects/ITF/
Exam
DBMS at a glance
Data Models
Conceptual representation of the data
Data Retrieval
How to ask questions of the database
How to answer those questions
Data Storage
How/where to store data, how to access it
Data Integrity
Manage crashes, concurrency
Manage semantic inconsistencies
Not a fully disjoint categorization!
DBMS at a glance
Data Models
E/R Model, Relational model
Very simple and hence effective
Easy to make things complicated, very hard to keep them simple
No other data model has survived for so long
What is the future of XML?
DBMS at a glance
Data Retrieval
How to ask questions of the database
Declarative languages are great
Hide complexity from users, can optimize things, can evolve easily
SQL
– More or less declarative
How to answer those questions
Parsing --> Optimization --> Processing
Operators: Hashing, sorting, joins, aggregation
Data structures
– Hash indexes: Good for equality queries
– Tree indexes: For everything else
Optimization: Complex, but key piece of a database system
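A minimal sketch of the hash-versus-tree index point above. The table, column choice, and function names are hypothetical, and a sorted list searched with `bisect` stands in for a real tree index (a B+-tree would also support efficient inserts). It only illustrates why a hash index serves equality lookups while an ordered structure also serves range queries.

```python
import bisect
from collections import defaultdict

# Hypothetical table: (name, age) rows, addressed by list position.
rows = [("alice", 31), ("bob", 25), ("carol", 31), ("dave", 40)]

def build_hash_index(rows, col):
    """Map each key to the positions of rows holding it (equality lookups only)."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[col]].append(pos)
    return index

def build_sorted_index(rows, col):
    """Sorted (key, position) pairs; a stand-in for a tree index."""
    return sorted((row[col], pos) for pos, row in enumerate(rows))

def range_lookup(sorted_index, lo, hi):
    """Positions of rows with lo <= key <= hi, found by binary search."""
    start = bisect.bisect_left(sorted_index, (lo,))
    end = bisect.bisect_right(sorted_index, (hi, float("inf")))
    return [pos for _, pos in sorted_index[start:end]]

hash_index = build_hash_index(rows, col=1)
sorted_index = build_sorted_index(rows, col=1)

equal_31 = hash_index[31]                         # equality: one hash probe
between = range_lookup(sorted_index, 25, 31)      # range: needs key order
```

The hash index has no notion of key order, so answering the range query with it would mean probing every possible key or scanning the whole table.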
DBMS at a glance
Data Storage
How/where to store data, how to access it
Need to be cognizant of the memory hierarchy
Memory is cheap to access, disk is very expensive to access
Further, disk is cheap to access sequentially, much more expensive to access randomly
– Many of our decisions are influenced by this
RAID: Surviving failures
Accessing data: Indexes
What happens if a new form of storage comes along with different properties (say, holographic storage)?
We will need to rethink the tradeoffs, but we now know the approach
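A rough cost model for the sequential-versus-random point above. The seek and transfer times are made-up illustrative defaults, not measurements of any real device; the sketch only shows how the tradeoff can be reasoned about, and how the same reasoning would be redone with different parameters for a new storage medium.

```python
def sequential_read_ms(n_blocks, seek_ms=10.0, transfer_ms=0.1):
    """One seek to reach the first block, then transfer blocks back to back."""
    return seek_ms + n_blocks * transfer_ms

def random_read_ms(n_blocks, seek_ms=10.0, transfer_ms=0.1):
    """Every block pays its own seek before it can be transferred."""
    return n_blocks * (seek_ms + transfer_ms)

# Assumed workload: 1000 blocks read either way.
seq = sequential_read_ms(1000)
rnd = random_read_ms(1000)
ratio = rnd / seq

# A hypothetical medium with near-zero seek cost narrows the gap,
# which is why the tradeoffs would need to be rethought.
ratio_no_seek = random_read_ms(1000, seek_ms=0.01) / sequential_read_ms(1000, seek_ms=0.01)
```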
DBMS at a glance
Data Integrity
Manage crashes, concurrency
Transactions, 2-phase locking
Write-ahead logging
DBMS pretty much the last word on concurrency/recovery
OSs don’t come close to supporting anything like that
Manage semantic inconsistencies
Normalization, FDs
Not easy to identify tools, but we have learned how to think about them
– Try to capture them in the E/R diagram as much as possible
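One way to think about FDs concretely: a functional dependency X -> Y holds in a table if no two rows agree on X but differ on Y. The sketch below checks that over an in-memory list of rows. The helper name `fd_holds` and the sample address data are hypothetical, and a check on one instance can only refute an FD, not prove it holds for all future data.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds in this particular set of rows."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        val = tuple(row[attr] for attr in rhs)
        # First time we see this lhs value, remember its rhs value;
        # any later row with the same lhs must carry the same rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical sample data.
addresses = [
    {"zip": "20742", "city": "College Park", "street": "Paint Branch Dr"},
    {"zip": "20742", "city": "College Park", "street": "Campus Dr"},
    {"zip": "20740", "city": "College Park", "street": "Baltimore Ave"},
]

zip_determines_city = fd_holds(addresses, ["zip"], ["city"])
city_determines_zip = fd_holds(addresses, ["city"], ["zip"])
```

Here `zip -> city` survives the check while `city -> zip` is refuted, which is the kind of observation that drives a normalization decision.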
Motivation: Data Overload
We began the first lecture by discussing data overload
Huge amounts of data generated every day
Much faster than our ability to process it
Increasing ability to capture more enterprise data
Web, blogs, RSS Feeds etc
Multimedia
– Flickr and cellphone cameras have led a revolution in how people take pictures
– Videos will be next
– Not hard to imagine capturing every moment of your life
Sensor/RFID data
– Tiny sensors/RFID just beginning to become ubiquitous
– Billions of these generating a tiny amount of data every second is still too much
Biological/Scientific data
Motivation: Data Overload
Relational databases help for structured data
But increasingly not sufficient
The things we want to do with data can’t be expressed in SQL
E.g. with biological data, web
Too much unstructured data
Distributed data generation creates additional headaches
Almost impossible to try to collect the data in one location
Making sense of this requires not only advances in data processing, but also in data understanding/mining
Interdisciplinary efforts
Some Lessons from RDBMS
But we can use the lessons learned from developing RDBMSs
Data independence / abstraction is good
Hide details, even if initially it leads to inefficiency
Look for structure
Even seemingly highly unstructured data might have structure
Look for patterns in usage
Relational databases are fast because query processing is predictable
– Unlike, say, OS workloads, which are very hard to optimize for
If you can identify patterns, you can probably optimize them
Declarative languages are great
Say what you want, not how to get it
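A small illustration of "what, not how", using Python's built-in `sqlite3` module with an in-memory database. The `enrolled` table, its columns, and the sample rows are hypothetical. The SQL query states only the desired result; the hand-written loop spells out one particular way to compute it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrolled (student TEXT, course TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO enrolled VALUES (?, ?, ?)",
    [
        ("ann", "CMSC424", 4.0),
        ("bo", "CMSC424", 3.0),
        ("ann", "CMSC420", 2.0),
        ("cy", "CMSC420", 3.0),
    ],
)
conn.commit()

# Declarative: say what you want (average grade per course).
declarative = conn.execute(
    "SELECT course, AVG(grade) FROM enrolled GROUP BY course ORDER BY course"
).fetchall()

# Imperative: say how to get it (scan, accumulate, divide, sort).
totals = {}
for _student, course, grade in conn.execute(
    "SELECT student, course, grade FROM enrolled"
):
    running_sum, count = totals.get(course, (0.0, 0))
    totals[course] = (running_sum + grade, count + 1)
imperative = sorted((c, s / n) for c, (s, n) in totals.items())
```

Both produce the same answer, but only the declarative form leaves the system free to pick a different plan (an index, a hash aggregate, a sort) without the query changing.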