What's Next for Database?
Jim Gray
Microsoft
http://research.microsoft.com/~Gray
Outline
Looking at the past:
old problems now look easy
Looking forward:
data avalanche here
integrate ALL kinds of data
Watershed: The new world
Programs + data: Info Ecosystem
All data classes (Objectifying Information)
Approximate answers
Keynote ▪ 30 September 2005 ▪ 9:00
Old Problems Now Look Easy
1985 goal: 1,000
transactions per second
Couldn’t do it at the time
At the time:
100 transactions/second
50 M$ for the computer
(y2005 dollars)
Now: easy
Laptop does 8,200 DebitCredit tps
~$400 desktop
Thousands of DebitCredit Transactions-Per-Second:
Easy and Inexpensive, Gray & Levine,
MSR-TR-2005-39, ftp://ftp.research.microsoft.com/pub/tr/TR-2005-39.doc
Hardware & Software Progress
Throughput 2x per 2 years
Throughput/$ 2x per 1.5 years
tracks MHz
40%/y hardware, 20%/y software
[Chart: X86 & X64 tpmC per CPU over time. 30x in 10 years, 41%/year, double every 2 years]
[Chart: X86 & X64 tpmC per MHz over time]
[Chart: TPC-A and TPC-C tps/$ trends (throughput per k$). ~100x in 10 years, ~2x per 1.5 years]
No obvious end in sight!
A Measure of Transaction Processing 20 Years Later ftp://ftp.research.microsoft.com/pub/tr/TR-2005-57.doc
IEEE Data Engineering Bulletin, V. 28.2, pp. 3-4, June 2005
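The growth figures on this slide are compound rates, and a few lines of arithmetic show that they agree with each other. The sketch below is illustrative only; the helper names are invented for this note and do not come from the talk or the cited papers.

```python
import math


def decade_factor(annual_rate: float, years: float = 10.0) -> float:
    """Total improvement after `years` at a compound annual rate (0.41 means 41%/year)."""
    return (1.0 + annual_rate) ** years


def doubling_time(annual_rate: float) -> float:
    """Years needed to double at a compound annual rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


def factor_from_doubling(doubling_years: float, years: float = 10.0) -> float:
    """Total improvement after `years` when the quantity doubles every `doubling_years`."""
    return 2.0 ** (years / doubling_years)


if __name__ == "__main__":
    print(f"41%/year over 10 years: {decade_factor(0.41):.1f}x")
    print(f"41%/year doubles every {doubling_time(0.41):.2f} years")
    print(f"2x per 1.5 years over 10 years: {factor_from_doubling(1.5):.1f}x")
```

At 41%/year the improvement compounds to roughly 30x in a decade and doubles about every 2 years; doubling every 1.5 years gives roughly 100x per decade, which is where the next slide's "100x" comes from.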
100x Improvement Every Decade
$1B job becomes $10M job
$1M job becomes $10K job
Terabytes common now (~$500 today)
Petabytes in a decade.
Challenge:
We can capture & store everything.
What’s interesting?
What can you tell me about X?
Q: How Much is “Everything”
A: About 15 Exabytes
Q: How much is digital?
A: 70% and growing
Q: Where does it come from?
A: Video, voice, sensors,
Q: How fast is it growing?
A: Growing 10%/y now,
55%/y when ALL digital
Information Growth vs Storage Media
Media      PB/y    Growth/y
print      0.2     2%
film       427     4%
video      300     5%
computer   1,693   55%
Source: Lyman & Varian, “How Much Information”, as of 2003
http://www.sims.berkeley.edu/research/projects/how-much-info/
Where is the Data?
Smart Objects Everywhere
Phones, PDAs, Cameras,… have small DBs.
Disk drives have enough CPU and memory
to run a full-blown DBMS.
All these devices want-need to share data.
Need a simple-but-complete DBMS
They need an Esperanto:
a data exchange language and paradigm.
Billions of Clients Millions of Servers
The Perfect System
Knows everything
Knows what you want to know
Tells you the answer…
in an easy-to-understand way;
just before you ask
Tells you what you should have asked
And…
It is inexpensive to buy
It is inexpensive to own.
Well, maybe not everyone wants this…
but every organization does.
Oh! And the PEOPLE COSTS are HUGE!
People costs have always exceeded IT capital.
But now that hardware is “free” …
Self-managing, self-configuring, self-healing, self-organizing, and … is the key goal.
No DBAs for cell phones or cameras.
Requires
Clear and simple knobs on modules
Software manages these knobs
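One way to picture "clear and simple knobs" that software manages: each module exposes a setting with a safe range, and a small controller adjusts it from observed behavior. This is only a sketch; the `Knob` class, the cache example, and the 0.90 / 0.99 thresholds are all assumptions made for illustration, not anything from the talk.

```python
from dataclasses import dataclass


@dataclass
class Knob:
    """One tunable setting with a safe range (hypothetical structure)."""
    name: str
    value: int
    low: int
    high: int

    def set(self, new_value: int) -> None:
        # Clamp so the manager can never push the module outside its safe range.
        self.value = max(self.low, min(self.high, new_value))


def manage_cache_knob(knob: Knob, hit_rate: float) -> None:
    """Grow the cache when hits are scarce; shrink it when memory is clearly spare."""
    if hit_rate < 0.90:      # assumed threshold
        knob.set(knob.value * 2)
    elif hit_rate > 0.99:    # assumed threshold
        knob.set(knob.value // 2)


if __name__ == "__main__":
    cache_mb = Knob("cache_mb", value=64, low=16, high=1024)
    for observed_hit_rate in (0.70, 0.80, 0.95, 0.995):
        manage_cache_knob(cache_mb, observed_hit_rate)
        print(observed_hit_rate, cache_mb.value)
```

The point of the design is that no person (no DBA) is in the loop: the knob is simple enough that a few lines of policy can own it.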
Our Challenge
Capture, Store, Organize, Search, Display
All information.
Personal
Organizational
Societal
There is a huge gap between
what we have today and
what we need.
Data capture is relatively easy
Curate, Organize, Search, Display still too hard.
DBMS Re-conceptualization
Re-Unification of Programs & Data
Allows Objectification of Information
e.g.: What is a gene? What properties & methods?
What is a person? What properties & methods?
What is an X? What properties & methods?
Need to “glue” all these models together
Time, Space, text, … are core types
Person, event, document, gene, … are extensions.
The “Action” is in these extensions.
Code and Data: Separated at Birth
COBOL
IDENTIFICATION: document
AUTHOR, PROGRAM-ID, INSTALLATION,
SOURCE-COMPUTER, OBJECT-COMPUTER,
SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL,
DATE-WRITTEN, DATE-COMPILED,
SECURITY.
ENVIRONMENT: OS
CONFIGURATION SECTION.
INPUT-OUTPUT SECTION.
DATA: Files/Records
FILE SECTION.
WORKING-STORAGE SECTION.
LINKAGE SECTION.
REPORT SECTION.
SCREEN SECTION.
“data”
PROCEDURE: code
“knowledge”
CODASYL - DBTG
COnference on DAta SYstems Languages
Data Base Task Group
Defined DDL for a network data model
Set-Relationship semantics
Cursor Verbs
Isolated from procedures.
No encapsulation
Niklaus Wirth: Programs = Algorithms + Data Structures
The Object-Relational World
marry programming languages and DBMSs
Stored procedures evolve to “real” languages
VB, Java, C#, … with real object models.
Data encapsulated: a class with methods
Tables are enumerable & indexable
record sets with foreign keys
Records are vectors of objects
Opaque or transparent types
Set operators on transparent classes
[Diagram label: Business Objects]
Transactions:
Preserve invariants
A composition strategy
An exception strategy
Ends Inside-DB Outside-DB dichotomy
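The three roles listed for transactions (preserve invariants, compose steps, handle exceptions) can be seen in a few lines. The sketch below uses Python's standard `sqlite3` module as a stand-in for a DBMS; the `account` table, the helper names, and the balances are assumptions made up for illustration.

```python
import sqlite3


def make_bank() -> sqlite3.Connection:
    """Create a tiny in-memory bank with two accounts (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Debit and credit as one unit; invariant: no balance goes negative."""
    with conn:  # commits if the block finishes, rolls back if an exception escapes
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # undoes both updates


def total(conn: sqlite3.Connection) -> int:
    """Sum of all balances; a transfer should never change it."""
    return conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]


if __name__ == "__main__":
    bank = make_bank()
    transfer(bank, 1, 2, 30)
    try:
        transfer(bank, 1, 2, 1000)
    except ValueError as err:
        print("rolled back:", err)
    print(bank.execute("SELECT id, balance FROM account ORDER BY id").fetchall(), total(bank))
```

The two updates compose into one action, the raised exception is the only error path the caller has to think about, and the total across accounts is the same before and after either call.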
Ask not “How to add objects to databases?”,
Ask “What kind of object is a database?”
Q: Given an object model, what is a DB?
A: DataSet class and methods
(nested relation with metadata)
The basis for the ecosystem
Distributed DB
Extensible DB
Interoperable DB
….
implicit in ODBC, OLE DB
explicit within the DBMS ecosystem
Input:
Command (any language)
Output:
Dataset
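To make "Input: Command, Output: Dataset" concrete, here is a minimal sketch of a database seen as an object whose one method takes a command and returns rows together with their metadata. The `DataSet` and `Database` classes are hypothetical names for this note, and `sqlite3` merely stands in for whatever engine sits underneath.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class DataSet:
    """Rows plus the metadata that describes them (hypothetical class)."""
    columns: list  # column names: the metadata
    rows: list     # list of tuples; a value could itself be a DataSet (nesting)

    def __iter__(self):
        # Enumerable: a DataSet can be looped over row by row.
        return iter(self.rows)

    def __getitem__(self, index):
        # Indexable: a DataSet can be addressed by row position.
        return self.rows[index]

    def __len__(self):
        return len(self.rows)


class Database:
    """The database as an object: a command goes in, a DataSet comes out."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")

    def execute(self, command: str) -> DataSet:
        cursor = self._conn.execute(command)
        # description is None for statements that return no rows.
        columns = [d[0] for d in cursor.description] if cursor.description else []
        return DataSet(columns, cursor.fetchall())


if __name__ == "__main__":
    db = Database()
    db.execute("CREATE TABLE book (id INTEGER, title TEXT)")
    db.execute("INSERT INTO book VALUES (1, 'Transaction Processing')")
    result = db.execute("SELECT id, title FROM book")
    print(result.columns, list(result))
```

Because every command returns the same kind of object, callers (local or remote) do not need to know whether the answer came from tables, text, or a cube.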
[Diagram: a Question goes in and a Dataset comes out, drawn from tables, or text, or cube, or …]
DB System Architecture
The classic DBMS model
[Diagram: the classic stack: utilities, sets, records, os]
but applications need to query
other data types
Added:
+ Text, Time, Space
+ Triggers and queues
+ Replication, Pub/sub
+ Extract-Transform-Load
+ Cubes, Data mining
+ XML, XQuery
+ Programming Languages
+ Many more extensions coming
A Mess?
[Diagram: the classic stack surrounded by added components: Notification, Space, Time, Data Mine, Cubes, Text, ETL, Replication, XML, Queues, Procedures]
Evolving to be
Information Services Container
development, deployment, and execution environment
+ Programming Languages
+ Triggers and queues
+ Replication, Pub/sub
+ Extract-Transform-Load
+ Text, Time, Space
+ Cubes, Data mining
+ XML, XQuery
+ Many more extensions coming
[Diagram: “Classic ++”: the classic stack (utilities, sets, records, os) plus the extensions]
DBMS is an ecosystem
OO is the key structuring strategy:
Everything is a class
Database is a complex object
Core object is DataSet
Classes publish/consume them
Depends on strong Object Model
What’s Outside?
[Diagram labels: DataSet, Applications, Our API, Query Processor, iterators, catalogs, Buffer Pool, data, Remote Node, Internet, Other us]
Classic: What’s Outside?
Three Tier Computing
Clients gather input, do presentation
do some workflow (script)
Send high-level requests to ORB
(Object Request Broker)
ORB dispatches workflows,
orchestrates flows & queues
Workflows invoke business objects
Business objects read/write the database
[Diagram tiers: Presentation, workflows, Business Objects, Databases]
DBMS is Web Service!
Client/server is back; the revenge of TP-lite
Web servers and runtimes (Apache, IIS, J2EE, .NET)
displaced TP monitors & ORBs
Give persistent objects
Holistic programming model & environment
Web services (SOAP, WSDL, XML)
are displacing current brokers
DBMS listening to Port 80
publishing WSDL, DISCO, WS-Sec
Servicing SOAP calls.
DBMS is a web service
Basis for distributed systems.
A consequence of OR DBMS
[Diagram tiers: Presentation, workflows, Business Objects, DBMS, Databases]
Queues & Workflows
Apps are loosely connected via
Queued messages
Queues are databases.
Basis for workflow
Queues: the first class to add to
an OR DBMS
Queues fire triggers.
Active databases
Synergy with DBMS
security, naming, persistence, types, query, …
Workflow:
Script
Execute
Administer &
Expedite
all built on queues
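"Queues are databases" and "queues fire triggers" can both be shown with an ordinary table. The sketch below uses Python's standard `sqlite3` module; the `queue` and `audit` tables, the trigger, and the helper names are assumptions made up for illustration, not a description of any product's queueing feature.

```python
import sqlite3


def open_queue() -> sqlite3.Connection:
    """A queue is just a table; a trigger reacts to every enqueue."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE queue (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL);
        CREATE TABLE audit (queue_id INTEGER, note TEXT);
        CREATE TRIGGER on_enqueue AFTER INSERT ON queue
        BEGIN
            INSERT INTO audit VALUES (NEW.id, 'enqueued ' || NEW.body);
        END;
        """
    )
    return conn


def enqueue(conn: sqlite3.Connection, body: str) -> None:
    with conn:  # the insert and the trigger's work commit together
        conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))


def dequeue(conn: sqlite3.Connection):
    """Remove and return the oldest message, or None if the queue is empty."""
    with conn:  # read and delete happen in one transaction
        row = conn.execute("SELECT id, body FROM queue ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        return row[1]


if __name__ == "__main__":
    q = open_queue()
    enqueue(q, "ship order 17")
    enqueue(q, "bill order 17")
    print(dequeue(q), dequeue(q), dequeue(q))
    print(q.execute("SELECT note FROM audit ORDER BY queue_id").fetchall())
```

The queue inherits persistence, naming, typing, and query from the database for free, which is the synergy the slide lists.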
What’s new here?
DBMSs have tight integration with
language classes (Java, C#, VB, …)
The DB is a class
You can add classes to DB.
Adding indices is “easy”
If you have a new idea.
Now have solid queue systems
Adding workflow is “easy”
If you have a new idea.
This is a vehicle for publishing data
on the Web.
[Diagram: a Question travels over the Internet to a Web service, which returns a Dataset drawn from tables, or text, or cube, or …]
Text, Temporal, and Spatial
Data Access
Q: What comes after queues?
A: Basic types: text, time, space,…
Great application of OR technology
Key idea:
table valued functions == indices
An index is a table, organized differently
Query executor uses index to map:
Key → set (aka sequence of rows)
Table valued function can do this map
Optimizer can use it.
+extras: cost function, cardinality,…
select Title, Abstract, T.Rank
from Books join
     FreeTextTable(Title, Abstract, 'XML semistructured') T
  on BookID = T.Key

select galaxy, distance
from GetNearbyObjEQ(22,37)

select store, holiday, sum(sales)
from Sales join
     HolidayDates(2004) T
  on Sales.day = T.day
group by store, holiday

BIG DEAL:
Approximate answers: Rank and Support
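The "table valued functions == indices" idea can be mimicked outside SQL: a function that returns rows is joined like a table, and a key-to-row map built from it plays the role of the index. The Python sketch below mirrors the HolidayDates query; the function names, the two holidays listed, and the sales rows in the usage example are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date


def holiday_dates(year: int):
    """Table-valued function: yields (day, holiday) rows for one year.

    Only two fixed-date holidays are listed, purely for illustration.
    """
    yield (date(year, 1, 1), "New Year's Day")
    yield (date(year, 12, 25), "Christmas Day")


def sales_by_store_and_holiday(sales, year: int) -> dict:
    """Rough equivalent of:
        select store, holiday, sum(sales)
        from Sales join HolidayDates(year) T on Sales.day = T.day
        group by store, holiday
    where `sales` is an iterable of (store, day, amount) rows.
    """
    holiday_of = dict(holiday_dates(year))  # key -> row: an index on day
    totals = defaultdict(float)
    for store, day, amount in sales:
        holiday = holiday_of.get(day)       # index lookup replaces a scan
        if holiday is not None:
            totals[(store, holiday)] += amount
    return dict(totals)


if __name__ == "__main__":
    made_up_sales = [
        ("A", date(2004, 12, 25), 10.0),
        ("A", date(2004, 12, 25), 20.0),
        ("B", date(2004, 1, 1), 5.0),
        ("B", date(2004, 7, 9), 99.0),  # not a listed holiday, so not counted
    ]
    print(sales_by_store_and_holiday(made_up_sales, 2004))
```

A real optimizer would also want the extras the slide mentions, such as a cost function and a cardinality estimate for the function, before choosing it as an access path.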
Data Mining
and Machine Learning
Tasks: classification, association, prediction
Tools: Decision trees, Bayes, A Priori,
clustering, regression, Neural net,…
now unified with DBs
Create table T (x,y,z,u,v,w)
Learn “x,y,z” from “u,v,w” using <algorithm>
Train T with data.
Then can ask:
Probability x,y,z,u,v,w
What are the u,v,w probabilities given x,y,z
Example: Learn height from age.
Anyone with a data mining algorithm has
full access to the DBMS infrastructure.
Challenge: Better learning algorithms.
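The "learn height from age" example follows the train-then-ask pattern described above. As a sketch, the `<algorithm>` slot is filled here with ordinary least-squares linear regression, which is this note's choice rather than anything the talk prescribes; the function names and the data are likewise made up.

```python
def train(rows):
    """Fit height = intercept + slope * age by ordinary least squares.

    `rows` is a list of (age, height) pairs, playing the role of table T.
    """
    n = len(rows)
    mean_age = sum(age for age, _ in rows) / n
    mean_height = sum(height for _, height in rows) / n
    sxy = sum((age - mean_age) * (height - mean_height) for age, height in rows)
    sxx = sum((age - mean_age) ** 2 for age, _ in rows)
    slope = sxy / sxx
    intercept = mean_height - slope * mean_age
    return intercept, slope


def predict(model, age):
    """Ask the trained model for the expected height at a given age."""
    intercept, slope = model
    return intercept + slope * age


if __name__ == "__main__":
    # Made-up training data: age in years, height in centimetres.
    training_rows = [(2, 62), (4, 74), (6, 86), (8, 98)]
    model = train(training_rows)
    print(predict(model, 10))
```

Inside a DBMS the training rows would come from a table and the prediction would be exposed as a query, so the algorithm author gets storage, security, and query processing without building them.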
Notification:
Stream and Sensor Processing
Traditionally:
Query billions of facts
Streams:
millions of queries, one new fact
New protein: compare to all DNA
Change in price or time
Implications
New aggregation operators (extension)
New programming style
Streams in products:
Queries represented as records
New query optimizations.
Sensor networks
push queries out to sensors.
Simpler programming model
Optimizes power & bandwidth
[Diagrams: one question (Q?) over many facts yields an answer (A!); a stream of facts (fact, fact, fact…) flows past many standing queries (Q)]
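The inversion described on this slide (many stored queries, one arriving fact) is easy to sketch once queries are treated as records. In the example below the `PriceQuery` record, its fields, and the subscriber names are all hypothetical; a real system would index the stored queries rather than scan them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceQuery:
    """A standing query stored as a plain record (hypothetical schema)."""
    subscriber: str
    symbol: str
    threshold: float  # notify when the price rises above this value


def notify(queries, fact):
    """Match one new fact, a (symbol, price) pair, against all standing queries."""
    symbol, price = fact
    return [
        query.subscriber
        for query in queries
        if query.symbol == symbol and price > query.threshold
    ]


if __name__ == "__main__":
    standing = [
        PriceQuery("ann", "MSFT", 25.0),
        PriceQuery("bob", "MSFT", 30.0),
        PriceQuery("cy", "IBM", 80.0),
    ]
    print(notify(standing, ("MSFT", 27.0)))
```

Because the queries are data, they can themselves be stored, indexed, and optimized as a set, which is the "new query optimizations" point on the slide.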
Semi-Structured Data
“Everyone starts with the same schema:
<stuff/>.
Then they refine it.” J. Widom
“Strong schema” has pros-and-cons.
Files <stuff/> and XML <<foo/> <bar/>>
are here to stay. Get over it!
File directories are databases;
Pivot on any attribute
Folders are standing queries.
Freetext+schema search (better precision/recall)
Cohabit with row-stores
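"File directories are databases" can be read quite literally: treat each file as a record with attributes, pivot on any of them, and define a folder as a query that is re-evaluated on demand. The sketch below is illustrative only; the record fields, file names, and helper names are made up.

```python
from collections import defaultdict


def pivot(files, attribute):
    """Group file records by any attribute, like re-foldering a directory."""
    groups = defaultdict(list)
    for record in files:
        groups[record.get(attribute)].append(record["name"])
    return dict(groups)


def folder(files, predicate):
    """A folder as a standing query: its contents are whatever matches right now."""
    return [record["name"] for record in files if predicate(record)]


if __name__ == "__main__":
    # Made-up file records with a loose schema.
    files = [
        {"name": "talk.ppt", "type": "ppt", "author": "alice", "year": 2005},
        {"name": "report.doc", "type": "doc", "author": "alice", "year": 2005},
        {"name": "notes.txt", "type": "txt", "author": "bob", "year": 2004},
    ]
    print(pivot(files, "author"))
    print(folder(files, lambda record: record["year"] == 2005))
```

Nothing here requires a strong schema up front: a record missing the pivot attribute simply lands in the `None` group, which is the semi-structured trade-off the slide describes.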
Publish-Subscribe, Replication
Extract-Transform-Load (ETL)
Data has many users
Replicas for availability and/or performance
Mobile users do local updates, synchronize later.
Classic Warehouse
Replicate to data warehouse
Data marts subscribe to publications
Disaster Recovery geoplex
ETL is a major application & component
Data loading
Data scrubbing
Publish/subscribe workflows.
Key to data integration (capture / scrub)
Restatement: DB Systems evolved to be
containers for information services
development, deployment, and execution environment
DBMS is an ecosystem
Key structuring strategy:
Everything is a class
Database is a complex object
Core object is DataSet
Approximate answers
This architecture lets you
add your new ideas.
[Diagram labels: utilities, sets, records, os, DataSet]
Summary:
Looking at the past:
old problems now look easy
Looking forward:
data avalanche here
integrate ALL kinds of data
Watershed: The new world
Programs + data: Info Ecosystem
All data classes (Objectifying Information)
Approximate answers
Additional Resources
Papers at: http://research.microsoft.com/~gray/JimGrayPublications.htm
Talks at:
http://research.microsoft.com/~gray/JimGrayTalks.htm
Basis for this talk:
“The Revolution in Database Architecture”
http://research.microsoft.com/research/pubs/view.aspx?tr_id=735
Very interesting & related: David Campbell
“Service Oriented Database Architecture:
App Server-Lite?”
http://research.microsoft.com/research/pubs/view.aspx?tr_id=983
Thank you!
Thank you for attending this session and the 2005 PASS
Community Summit in Grapevine!