My Current DB Projects and General Database

Download Report

Transcript My Current DB Projects and General Database

Database Activities and Trends
Jim Gray
Microsoft Research
2 June 2006, Microsoft, TechNet, London
Outline
• What I have been doing
• BIG Changes in DBs
• Q&A
15 minutes
15 minutes
20 minutes
Scalability Projects
• TerraServer: Geospatial data online
– Now part of Virtual Earth
http://local.live.com/
• SATA disk evaluation
– Copy 1.5 Petabytes (count types of errors)
MSR-TR-2005-166
• Disk and Network performance
move 1GB/s from CERN to Pasadena
• Bricks
MSR-TR-2004-62
– BI-Bricks: cheap boxes/disks for BI
– Server Bricks: TerraServer Bricks: MSR-TR-2004-107
DB Projects
• Spatial data access inside SQL
– Gives a good example of using CLR to extend SQL
– Sample is part of SQL 2005 programming samples.
– Many papers
…, MSR-TR-2005-122, MSR-TR-2006-52
• To Blob or NOT to Blob?
– Explored what is the break-even point of Blobs vs Files.
Guess what! Almost all files should be blobs.
MSR-TR-2006-45
• GPU TeraSort:
– You have been hearing about Many-Core from Intel
– Nvidia & ATI give you 100 cores today (2x next year)
Not me, but
10x the operations per second than the CPU
very cool!
10x the memory bandwidth of the CPU
– How to program them?
– Sort represents IO, memory, processing.
– GPU TerraSort demos this
MSR-TR-2005-183
– Accelerator: C# extension is a GPU compiler.
MSR-TR-2005-184
eScience Projects
• SkyServer: Astronomy data online
http://skyserver.sdss.org/
– A real Data Grid app – Web services are popular
– SkyQuery and CasJobs use web services.
http://casjobs.sdss.org/CasJobs/
– Spatial access built as SQL 2005 C# extensions.
• Doing Finite Element Analysis with a DB and Vis tools
Supporting Finite Element Analysis with a Relational Database Backend;
Part I: There is Life beyond Files MSR-TR-2005-49
• Ecological sensors (soil, water, ocean,…)
– Only public thing so far: http://lifeunderyourfeet.org/
– Many papers coming
• Starting BioInfo efforts (Portable PubMed Central, ….)
Portable PubMedCentral
• “Information at your fingertips”
• Helping build PortablePubMedCentral
• Deployed US, China, England, Italy, South
Africa, (Japan soon).
• Each site can accept documents
• Archives replicated
• Federate thru web services
• Working to integrate Word/Excel/…
with PubmedCentral – e.g. WordML, XSD,
• To be clear: NCBI is doing 99% of the work,
but it is very cool and very significant.
Outline
• What I have been doing
• BIG Changes in DBs
• Q&A
15 minutes
15 minutes
20 minutes
DB System Architecture
sets
records
os
Worked, but applications wanted
to query other data types
Added:
sets
utilities
Notification
Space
Time
Data Mine
Cubes
Text
ETL
Replication
XML
Queues
Procedures
+Text, Time, Space
+ Cubes, Data mining
+ XML, XQuery
+ Programming Languages
+ Triggers and queues
+ Replication, Pub/sub
+ Extract-Transform-Load
+ Many more extensions coming
utilities
• The classic DBMS model
…
records
os
A Mess?
DB Systems evolved to be
containers for information services
develop, deploy, and execution environment
• Classic ++
–
–
–
–
–
–
–
+ Triggers and queues
+ Replication, Pub/sub
+ Extract-Transform-Load
+ Text, Time, Space
+ Cubes, Data mining
+ XML, XQuery
+ Many more extensions coming
sets
records
os
utilities
– + Programming Languages
• DBMS is an ecosystem
OO is the key structuring strategy:
–
–
–
–
–
Everything is a class
Database is a complex object
Core object is DataSet
Classes publish/consume them
Depends on strong Object Model
DataSet
The Object-Relational World
marry programming languages and DBMSs
• Stored procedures evolve to “real” languages
Java, C#,.. With real object models.
• Data encapsulated: a class with methods
• Classes may be persistent
Business
• Tables are enumerable & index-able
Objects
record sets with foreign keys
• Records are vectors of objects
• Opaque or transparent types
• Set operators on transparent classes
• Transactions:
– Preserve invariants
– A composition strategy
– An exception strategy
• Ends Inside-DB Outside-DB dichotomy
Ask not “How to add objects to databases?”,
Ask “What
kind of object is a database?”
Q: Given an object model, what is a DB?
A: DataSet class and methods
(nested relation with metadata)
This is the basis for the ecosystem
Distributed DB
Extensible DB
Interoperable DB
Tables
Question
or Text
….
or cube
Dataset
Or…..
This was implicit in ODBC
but is now explicit within the DBMS ecosystem
Input:
Command (any language)
Output: Dataset Entity Set in ADO.NET 3.0
Queues & Workflows
SODA (Service Oriented Data Architecture)
Service Oriented Database Architecture: App Server-Lite? MSR TR 2005 129
• Apps are loosely connected
via Queued messages
Workflow:
• Queues are databases.
Script
Execute
• Basis for workflow
Administer &
• Queues: the first class to
Expedite
add to an OR DBMS
all built on queues
• Queues fire triggers.
Active databases
• Synergy with DBMS
security, naming, persistence, types, query,…
Text, Temporal, and Spatial
Data Access
select Title, Abstract, Rank
• Q: What comes after queues?
from Books join
(Title,
A: Basic types: text, time, space,… FreeTextTable
Abstract,
‘XML semistructured') T
• Great application of OR technology on BookID = T.Key
• Key idea:
table valued functions == indices
An index is a table, organized differently
Query executor uses index to map: select galaxy, distance
from GetNearbyObjEQ(22,37)
Key → set (aka sequence of rows)
• Table valued function can do this map
Optimizer can use it.
select store, holiday, sum(sales)
• +extras: cost function, cardinality,… from Sales join
HolidayDates(2004) T
on Sales.day = T.day
group by store, holiday
• BIG DEAL:
Approximate answers: Rank and Support
What’s new here?
Question
Dataset
• DBMS have tight-integration with
language classes (Java, C#, VB,.. )
• The DB is a class
• You can add classes to DB.
• Adding indices is “easy”
If you have a new idea.
• Now have solid Queue systems
Adding workflow is “easy”
If you have a new idea.
• This is a vehicle for publishing data
on the Web.
Tables
or Text
or cube
Or…..
Cubes
• Data cubes now standard
• MDX is very powerful
(Multi-Dimensional eXpressions)
SELECT <axis_spec>
FROM <cube_spec>
WHERE <slicer_spec>
• Dimension, Measure, Operator
concepts highly evolved beyond
snowflake schema
• Cube stores cohabit with row stores
ROLAP + MOLAP + (x xOLAP)
(relational +multidimensional online analytic processing)
• Very sophisticated algorithms
• A big part of the ecosystem
RED
WHITE
BLUE
Semi-Structured Data
• “Everyone starts with the same schema:
<stuff/>.”
Then they refine it.” J. Widom
• We are a “strong schema” community
• That has pros-and-cons.
• Files <stuff/> and XML <<foo/> <bar/>>
are here to stay. Get over it!
• File directories are becoming databases;
– Pivot on any attribute
– Folders are standing queries.
– Freetext+schema search (better precision/recall)
• XSD (xml schema) and xQuery are transitional;
But we have to do them to get to the real answer.
• Cohabit with row-stores.
• Challenge: figure out what comes after XSD+xQuery
Data Mining
and Machine Learning
• Tasks: classification, association, prediction
• Tools: Decision trees, Bayes, A Priori,
clustering, regression, Neural net,…
• now unified with DBs
– Create table T (x,y,z,u,v,w)
Learn “x,y,z” from “u,v,w” using <algorithm>
– Train T with data.
– Then can ask:
• Probability x,y,z,u,v,w
• What are the u,v,w probabilities given x,y,z
– Example: Learn height from age.
• Anyone with a data mining algorithm has
full access to the DBMS infrastructure.
• Challenge: Better learning algorithms.
DM – DB Synergy
Create the model:
CREATE MINING MODEL HeightFromAgeSex
( ID long key,
Gender text discrete,
Age long continuous,
Height long continuous PREDICT)
USING Decision_Trees
Train a data mining model:
INSERT INTO Height
SELECT ID, Gender, Age, Height
FROM People
learn height
from Gender + Age
DB verbs to
drive Modeler
Predict height from model:
SELECT height,
PredictProbability(height)
FROM Height PREDICTION JOIN New
ON New.Gender = Height.Gender
AND New.Age
= Height.Age
Probabilistic
Reasoning
Notification,
Stream Processing, and
Sensor Processing
• Traditionally:
Query billions of facts
• Streams:
millions of queries one new fact
Q?
– New protein compare to all DNA
– Change in price or time
A!
facts
• Implications
– New aggregation operators (extension)
– New programming style
– Streams in products:
• Queries represented as records
• New query optimizations.
fact, fact, fact…
Q Q
Q QQ
Q Q
• Sensor networks
– push queries out to sensors.
– Simpler programming model
– Optimizes power & bandwidth
Notification
Restatement: DB Systems evolved to be
containers for information services
develop, deploy, and execution environment
records
os
sets
records
os
utilities
– Everything is a class
– Database is a complex object
– Core object is DataSet
sets
utilities
• DBMS is an ecosystem
Key structuring strategy:
• The architecture lets you
add your new ideas.
DataSet
Language + DB Integration
(the Microsoft contribution)
• LINQ are a BIG deal (SQL and XML)
http://msdn.microsoft.com/data
• Entity Sets are next step in Data Sets
ADO.NET V3 automates entities
Data access today
Explicit DB
connections
command text
Connection
Rows
void EmpsByDate(DateTime date) {
using( SqlConnection con = new SqlConnection(
Settings.Default.AdventureWorksSQL)) {
con.Open();
Opaque
Command
DataReader
SqlCommand cmd = con.CreateCommand();
cmd.CommandText = @"
SELECT SalesPersonID, FirstName, HireDate
Relational Engine
Customer SalesPerson
FROM SalesPerson sp
INNER JOIN Employee e ON
sp.SalesPersonID = e.EmployeeID
INNER JOIN Contact c ON
e.EmployeeID = c.ContactID
Entities ≠ Rows
WHERE e.HireDate < @date";
cmd.Parameters.AddWithValue("@date", date);
DbDataReader r = cmd.ExecuteReader();
Untyped resultsets
while(r.Read()) {
Console.WriteLine(
"{0:d}:\t{1}", r["HireDate"], r["FirstName"]);
}
}
void EmpsByDate(DateTime date) {
Auto-Gen
classes
No explicit
connections
Query<T>
Connection
Command
DataReader
Relational Engine
foreach (SalesPerson p in people) {
Console.WriteLine("{0:d}\t{1}",
p.HireDate, p.FirstName );
}
}
ObjectContext
MapDataReader
Rows
Strongly typed
commands
Order
MapCommand
using (AdventureWorksDB aw =
new AdventureWorksDB()) {
var people = from p in aw.SalesPeople
where p.HireDate < date
select p;
SalesData
MapConnection
Entities
public partial class AdventureWorksDB :
System.Data.Objects.ObjectContext {
public System.Data.Objects.
Query<SalesOrder> SalesOrders
{ … }
public System.Data.Objects.
Query<SalesPerson> SalesPeople
{ … }
}
Objects Domain
Objects
Data access tomorrow
Customer
Strongly typed
results
SalesPerson
Outline
• What I have been doing
• BIG Changes in DBs
• Q&A
15 minutes
15 minutes
20 minutes