Transcript Notes

Wrapup
Amol Deshpande
CMSC424
“Inventing the Future”
 Wednesday at 3:30pm
 1115 CSIC
 http://www.cs.umd.edu/projects/ITF/
 Exam
DBMS at a glance
 Data Models
 Conceptual representation of the data
 Data Retrieval
 How to ask questions of the database
 How to answer those questions
 Data Storage
 How/where to store data, how to access it
 Data Integrity
 Manage crashes, concurrency
 Manage semantic inconsistencies
 Not a fully disjoint categorization!
DBMS at a glance
 Data Models
 E/R Model, Relational model
 Very simple and hence effective
 Easy to make things complicated, very hard to keep them simple
 No other data model has survived for so long
 What is the future of XML ?
DBMS at a glance
 Data Retrieval
 How to ask questions of the database
 Declarative languages are great
 Hide complexity from users, can optimize things, can evolve easily
 SQL
– More or less declarative
 How to answer those questions
 Parsing --> Optimization --> Processing
 Operators: Hashing, sorting, joins, aggregation
 Data structures
– Hash indexes: Good for equality queries
– Tree indexes: For everything else
 Optimization: Complex, but key piece of a database system
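As a rough illustration of why hash indexes suit equality queries while tree indexes cover range queries, here is a minimal sketch. It assumes a tiny in-memory table; the names `rows`, `lookup_eq`, and `lookup_range` are hypothetical, and a sorted list with binary search stands in for a real tree index such as a B+-tree.

```python
import bisect

# Hypothetical table of (id, name) rows.
rows = [(17, "Ann"), (42, "Bob"), (8, "Cy"), (23, "Di")]

# Hash index: direct lookup by key, but no useful ordering.
hash_index = {key: name for key, name in rows}

def lookup_eq(key):
    """Equality query: a single hash probe."""
    return hash_index.get(key)

# Stand-in for a tree index: keys kept in sorted order.
sorted_keys = sorted(key for key, _ in rows)

def lookup_range(lo, hi):
    """Range query lo <= key <= hi: binary search, then a sequential scan."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return [(k, hash_index[k]) for k in sorted_keys[start:end]]
```

The hash index cannot answer `lookup_range` without scanning every key, which is why ordered structures are the default for anything beyond equality.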
DBMS at a glance
 Data Storage
 How/where to store data, how to access it
 Need to be cognizant of the memory hierarchy
 Memory is cheap to access, disk is very expensive to access
 Further, disk is cheap to access sequentially, much more
expensive to access randomly
– Many of our decisions are influenced by this
 RAID: Surviving failures
 Accessing data: Indexes
 What happens if a new form of storage comes along with
different properties (say holographic storage ?)
 We will need to rethink the tradeoffs, but we now know the
approach
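The sequential versus random access tradeoff can be made concrete with a back-of-the-envelope cost model. This is only a sketch: the default page size, seek time, and transfer rate below are illustrative assumptions, not measurements of any real device, and the function names are hypothetical. Rethinking the tradeoffs for a new storage medium would amount to plugging in different parameters.

```python
def transfer_ms(num_pages, page_kb, transfer_mb_per_s):
    """Time to move the bytes themselves, ignoring positioning cost."""
    return num_pages * page_kb / 1024.0 / transfer_mb_per_s * 1000.0

def sequential_read_ms(num_pages, page_kb=4.0, seek_ms=10.0, transfer_mb_per_s=100.0):
    """One seek, then all pages are streamed back to back."""
    return seek_ms + transfer_ms(num_pages, page_kb, transfer_mb_per_s)

def random_read_ms(num_pages, page_kb=4.0, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Every page pays its own seek (assumed parameters, see above)."""
    return num_pages * seek_ms + transfer_ms(num_pages, page_kb, transfer_mb_per_s)
```

Under these assumed parameters the per-page seek dominates the random case, which is the reason so many design decisions favor sequential scans and clustered layouts.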
DBMS at a glance
 Data Integrity
 Manage crashes, concurrency
 Transactions, 2-phase locking
 Write-ahead logging
 DBMS pretty much the last word on concurrency/recovery
 OSs don’t come close to supporting anything like that
 Manage semantic inconsistencies
 Normalization, FDs
 Not easy to identify tools, but we have learned how to think
about them
– Try to capture them in the E/R diagram as much as
possible
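One way to think about functional dependencies is to test whether a given FD holds on a concrete relation instance. The sketch below assumes rows are represented as dictionaries; `fd_holds` and the sample `enrollment` data are hypothetical. Note that passing the check on one instance does not prove the FD holds in general, since an FD is a statement about all legal instances.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds on this instance:
    rows that agree on lhs must also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        val = tuple(row[attr] for attr in rhs)
        # Remember the first rhs value seen for this lhs value.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical relation instance.
enrollment = [
    {"course": "CMSC424", "instructor": "Deshpande", "student": "s1"},
    {"course": "CMSC424", "instructor": "Deshpande", "student": "s2"},
    {"course": "CMSC351", "instructor": "Smith", "student": "s1"},
]
```

Here `course -> instructor` holds while `student -> course` does not, which suggests splitting course information from enrollment information.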
Motivation: Data Overload
 We began the first lecture by discussing data overload
 Huge amounts of data generated every day
 Much faster than our ability to process it
 Increasing ability to capture more enterprise data
 Web, blogs, RSS feeds, etc.
 Multimedia
– Flickr and cellphone cameras have led a revolution in how
people take pictures
– Videos will be next
– Not hard to imagine capturing every moment of your life
 Sensor/RFID data
– Tiny sensors/RFID tags are just beginning to become ubiquitous
– Billions of these generating a tiny amount of data every
second is still too much
 Biological/Scientific data
Motivation: Data Overload
 Relational databases help for structured data
 But increasingly not sufficient
 The things we want to do with data can’t be expressed in SQL
 E.g. with biological data, web
 Too much unstructured data
 Distributed data generation creates additional headaches
 Almost impossible to collect the data in one location
 Making sense of this requires not only advances in data
processing, but also in data understanding/mining
 Interdisciplinary efforts
Some Lessons from RDBMS
 But we can use the lessons learned from developing RDBMSs
 Data independence / abstraction is good
 Hide details, even if initially it leads to inefficiency
 Look for structure
 Even seemingly highly unstructured data might have structure
 Look for patterns in usage
 Relational databases are fast because query processing is
predictable
– Unlike say OS workloads which are very hard to optimize for
 If you can identify patterns, you can probably optimize them
 Declarative languages are great
 Say what you want, not how to get it
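To make "say what you want, not how to get it" concrete, the sketch below answers the same question twice using Python's built-in `sqlite3` module: once declaratively in SQL, and once with hand-written filtering and sorting. The `student` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, dept TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("Ann", "CS", 3.9), ("Bob", "EE", 3.7), ("Cy", "CS", 3.2), ("Di", "CS", 3.6)],
)

# Declarative: state the result wanted; the system picks the plan.
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM student WHERE dept = 'CS' AND gpa > 3.5 ORDER BY name"
    )
]

# Imperative: spell out the scan, the filter, and the sort by hand.
all_rows = conn.execute("SELECT name, dept, gpa FROM student").fetchall()
imperative = sorted(
    name for name, dept, gpa in all_rows if dept == "CS" and gpa > 3.5
)
```

Both produce the same names, but only the declarative form leaves the system free to use an index or a different plan later without any change to the query.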