Distributed Computing Economics - Research

Distributed Computing Economics
Jim Gray
Microsoft Research [email protected]
Talk at SD Forum: http://www.sdforum.org/
18 Sept 2003, PARC Auditorium, Palo Alto, CA.
Slides at: http://research.microsoft.com/~gray/talks
1
Two (?) Talks
• Distributed Computing Economics
• Online Science (what I have been doing).
2
Distributed Computing Economics
• Why is Seti@Home a great idea?
• Why is Napster a great deal?
• Why is the Computational Grid uneconomic?
• When does computing on demand work?
• What is the “right” level of abstraction?
• Is the Access Grid the real killer app?
Based on: Distributed Computing Economics,
Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24
http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
3
Computing is Free
• Computers cost 1k$ (if you shop right)
(yes, there are 1μ$ to 1M$ computers, but..)
• So 1 cpu day == 1$ (computers last 3 years)
• If you pay the phone bill,
  Internet bandwidth costs 50…500 $/Mbps/month
  (not including routers and management).
• So 1 GB costs 1$ to send and 1$ to receive
Caveat: All numbers rounded to nearest factor of 3.
4
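The slide's unit costs follow from simple amortization arithmetic. A minimal sketch, assuming the talk's 2003 round numbers (a 1k$ computer over 3 years, 50…500 $/Mbps/month WAN rent) and a fully utilized link:

```python
# Unit-cost arithmetic from the slide (2003 round numbers assumed).
COMPUTER_PRICE = 1000.0        # $ for a commodity computer
LIFETIME_DAYS = 3 * 365        # computers last ~3 years (~1k days)

def dollars_per_cpu_day():
    """Amortized cost of one cpu-day: ~1 $."""
    return COMPUTER_PRICE / LIFETIME_DAYS

def dollars_per_gb_sent(rent_per_mbps_month):
    """Cost of pushing 1 GB through a fully utilized link."""
    seconds_per_month = 30 * 24 * 3600
    bytes_per_month = 1e6 / 8 * seconds_per_month   # one Mbps, fully used
    return rent_per_mbps_month / (bytes_per_month / 1e9)
```

At 50 $/Mbps/month this gives ~0.15 $/GB, at 500 about 1.5 $/GB; with the talk's factor-of-3 rounding, 1 GB ≈ 1 $ each way.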
Why is Seti@Home a Good Deal?
• Send 300 KB: costs 3e-4$
• User computes for ½ day: benefit 5e-1$
• ROI: 1500:1
5
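The ROI above is just the ratio of compute bought to bytes shipped. A quick check using the talk's ~1 $/GB and ~1 $/cpu-day unit costs:

```python
# Seti@Home ROI check, using the talk's unit costs.
NET_COST_PER_GB = 1.0       # $ to send one GB
CPU_COST_PER_DAY = 1.0      # $ per cpu-day

cost = 300e3 / 1e9 * NET_COST_PER_GB    # send 300 KB: 3e-4 $
benefit = 0.5 * CPU_COST_PER_DAY        # user computes for half a day: 5e-1 $
roi = benefit / cost                    # ~1700:1; the slide rounds to 1500:1
```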
Seti@Home:
The world’s most powerful computer
• 61 TF is the sum of the top 4 of the Top 500.
• 61 TF is 9x the number 2 system.
• 61 TF is more than the sum of systems 2..10.

Seti@Home totals, http://setiathome.ssl.berkeley.edu/totals.html, 20 May 2003:

                       Total          Last 24 Hours
Users                  4,493,731      1,900
Results received       886 M          1.4 M
Total CPU time         1.5 M years    1,514 years
Floating Point Ops     3 E+21 ops     5 E+18 FLOPS/day
                       (3 zetta ops)  (61.3 TeraFLOPs)
6
Why was Napster a Good Deal?
• Send 5 MB: costs 5e-3$, ½ a penny per song
• Both sender and receiver can afford it.
• Same logic powers web sites (Yahoo!...):
– 1e-3$/page view advertising revenue
– 1e-5$/page view cost of serving web page
– 100:1 ROI
8
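The same arithmetic covers both the Napster and the ad-funded web-site cases, with all figures taken from the slide:

```python
# Napster: a 5 MB song at ~1 $/GB.
song_cost = 5e6 / 1e9 * 1.0          # 5e-3 $: about half a penny

# Ad-funded web sites: revenue vs cost per page view.
ad_revenue_per_view = 1e-3           # $ advertising revenue
serving_cost_per_view = 1e-5         # $ cost of serving the page
page_roi = ad_revenue_per_view / serving_cost_per_view   # 100:1
```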
The Cost of Computing:
Computers are NOT free!
• IBM, HP, Dell make billions
• Capital cost of a TpcC system is mostly storage and storage software (database)
• IBM 32 cpu, 512 GB ram, 2,500 disks, 43 TB
  TpcC cost components (DB2/AIX): software 10%, cpu/mem 29%, storage 61%
  (680,613 tpmC @ 11.13 $/tpmC, available 11/08/03)
  http://www.tpc.org/results/individual_results/IBM/IBMp690es_05092003.pdf
• A 7.5M$ super-computer
• Total data center cost:
  40% capital & facilities
  60% staff (includes app development)
9
Computing Equivalents
1 $ buys:
• 1 day of cpu time
• 4 GB (fast) ram for a day
• 1 GB of network bandwidth
• 1 GB of disk storage for 3 years
• 10 M database accesses
• 10 TB of disk access (sequential)
• 10 TB of LAN bandwidth (bulk)
• 10 KWhrs == 4 days of computer time
Depreciating over 3 years, and there are about 1k days in 3 years.
10
Some consequences
• Beowulf networking is 10,000x cheaper than WAN networking;
  factors of 10^5 matter.
• The cheapest and fastest way to move Terabytes cross country is sneakernet:
  24 hours = 4 MB/s
  50$ shipping vs 1,000$ WAN cost.
• Sending 10PB CERN data via network is silly:
buy disk bricks in Geneva,
fill them,
ship them.
TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange,
Jim Gray, Wyman Chong, Tom Barclay, Alex Szalay, Jan vandenBerg,
Microsoft Technical Report, May 2002, MSR-TR-2002-54.
http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
11
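The sneakernet arithmetic can be made explicit. The 350 GB box capacity below is an illustrative assumption (a small box of 2002-era disks) chosen to reproduce the slide's ~4 MB/s figure; the shipping and WAN prices are the slide's:

```python
# Sneakernet: ship disks overnight instead of using the WAN.
def effective_mb_per_s(capacity_gb, transit_hours=24):
    """MB/s delivered by a package in transit for transit_hours."""
    return capacity_gb * 1e3 / (transit_hours * 3600)

sneaker_rate = effective_mb_per_s(350)   # ~4 MB/s for a 350 GB box
ship_cost, wan_cost = 50.0, 1000.0       # $ per TB: shipping vs ~1 $/GB WAN
```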
How Do You Move A Terabyte?
Context      Speed (Mbps)  Rent ($/month)  $/Mbps  $/TB Sent  Time/TB
Home phone   0.04          40              1,000   3,086      6 years
Home DSL     0.6           50              117     360        5 months
T1           1.5           1,200           800     2,469      2 months
T3           43            28,000          651     2,010      2 days
OC3          155           49,000          316     976        14 hours
OC192        9600          1,920,000       200     617        14 minutes
100 Mbps     100           -               -       -          1 day
Gbps         1000          -               -       -          2.2 hours

Source: TeraScale Sneakernet, Microsoft Research, Jim Gray et al.
12
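The last two columns of the table follow from the first two, assuming a fully utilized link and a 30-day month (the table's figures are rounded):

```python
# Derive Time/TB and $/TB-sent from link speed and monthly rent.
def days_per_tb(mbps):
    """Transmission time for 1 TB at full link utilization."""
    return 1e12 * 8 / (mbps * 1e6) / 86400

def dollars_per_tb_sent(rent_per_month, mbps):
    """Monthly rent divided by TB deliverable per month."""
    tb_per_month = mbps * 1e6 / 8 * (30 * 24 * 3600) / 1e12
    return rent_per_month / tb_per_month

t3_days = days_per_tb(43)                   # ~2 days
t3_cost = dollars_per_tb_sent(28000, 43)    # ~2,010 $
oc3_cost = dollars_per_tb_sent(49000, 155)  # ~976 $
```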
Computational Grid Economics
• To the extent that computational grid is like
Seti@Home or ZetaNet or Folding@home
or… it is a great thing
• To the extent that the computational grid is MPI
  or data analysis, it fails on economic grounds:
  move the programs to the data, not the data to
  the programs.
• The Internet is NOT the cpu backplane.
• An alternate reality: nearly free networking
  – Telcos go bankrupt and price = cost = 0
  – Taxpayers pay your phone bill so price = 0,
    and telcos get a BIG government subsidy
13
When to Export a Task
IF instruction density >
100,000 instructions/byte
AND remote computer is free (costs you nothing)
THEN ROI > 0
ELSE ROI < 0
14
Computing on Demand
• Was called outsourcing / service bureaus
in my youth. CSC and IBM did it.
• It is not a new way of doing things: think payroll.
Payroll is standard outsource.
• Now Hotmail, Salesforce.com, Oracle.com, ….
• Works for standard apps.
• COD works for commoditized services.
• Airlines outsource reservations.
Banks outsource ATMs.
• But Amazon, Amex, Wal-Mart, eTrade, eBay...
Can’t outsource their core competence.
15
What do you Outsource?
• Disk blocks?  Ø
• Files?        Xdrive
• SQL?          SkyServer
• RPC?          TerraServer
• Application?  AOL, Google, Hotmail, Yahoo!, ….
16
What’s the right abstraction level for
Internet Scale Distributed Computing?
• Disk block?   No, too low.
• File?         No, too low.
• Database?     No, too low.
• Application?  Yes, of course.
– Blast search
– Google search
– Send/Get eMail
– Portals that federate astronomy archives
(http://skyQuery.Net/)
• Web Services (.NET, EJB, OGSA)
give this abstraction level.
17
Access Grid
• Q: What comes after the telephone?
• A: eMail?
• A: Instant messaging?
• Both seem retro: text & emoticons.
• Access Grid could revolutionize human communication.
• But, it needs a new idea.
• Q: What comes after the telephone?
18
Distributed Computing Economics
• Why is Seti@Home a great idea?
• Why is Napster a great deal?
• Why is the Computational Grid uneconomic?
• When does computing on demand work?
• What is the “right” level of abstraction?
• Is the Access Grid the real killer app?
Based on: Distributed Computing Economics,
Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24
http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
19
Two (?) Talks
• Distributed Computing Economics
• Online Science (what I have been doing).
– The World Wide Telescope
– I have been looking for a distributed DB
for most of my career.
– I think I found one! (sort of).
20
The World Wide Telescope
• I have been looking for a distributed DB
for most of my career.
• I think I found one! (sort of).
21
The Evolution of Science
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes Information
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions.
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Science-Informatics: Information Exploration Science
  – Information captured by instruments,
    or Information generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files
22
Computational Science Evolves
• Historically, Computational Science = simulation.
• New emphasis on informatics:
– Capturing,
– Organizing,
– Summarizing,
– Analyzing,
– Visualizing
• Largely driven by
observational science, but
also needed by simulations.
• Too soon to say if
comp-X and X-info
will unify or compete.
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]
23
Both comp-X and X-info
Generating Petabytes
• Science-Info: dealing with an Information avalanche
  (bio-info, astro-info, text-info)
• Comp-Science: generating an Information avalanche
  (comp-chem, comp-physics, comp-bio, comp-astro,
  comp-linguistics, comp-music, comp-entertainment, comp-warfare)
24
Information Avalanche Stories
• Turbulence: 100 TB simulation,
  then mine the Information
• BaBar: grows 1 TB/day
  (2/3 simulation Information, 1/3 observational Information)
• CERN: LHC will generate 1 GB/s, 10 PB/y
• VLBA (NRAO) generates 1 GB/s today
• NCBI: “only ½ TB” but doubling each year; a very rich dataset.
• Pixar: 100 TB/Movie
25
Astro-Info:
World Wide Telescope
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
• Premise: Most data is (or could be) online
• The Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio..
  – As deep as the best instruments (2 years ago).
  – It is up when you are up.
    The “seeing” is always great
    (no working at night, no clouds, no moons, no..).
  – It’s a smart telescope:
    links objects and data to literature on them.
26
Why Astronomy Data?
• It has no commercial value
  – No privacy concerns
  – Can freely share results with others
  – Great for experimenting with algorithms
• It is real and well documented
  – High-dimensional data (with confidence intervals)
  – Spatial data
  – Temporal data
• Many different instruments from many different places and many different times
• But, it’s the same universe, so comparisons make sense & are interesting.
• Federation is a goal
• There is a lot of it (petabytes)
• Great sandbox for data mining algorithms
  – Can share cross-company
  – University researchers
• Great way to teach both Astronomy and Computational Science
[Image strip: IRAS 25μ, 2MASS 2μ, DSS Optical, IRAS 100μ, WENSS 92cm, NVSS 20cm, ROSAT ~keV, GB 6cm]
27
What X-info Needs from us (cs)
(not drawn to scale)
[Diagram: Scientists and Miners pose Science Data & Questions; Data Mining Algorithms and Visualization Tools support Question & Answer; Plumbers build the Database to store data and execute queries]
28
Show Maria’s 5-minute PPT
SDSS Image Cutout slide show
by Maria A. Nieto-Santisteban of JHU
http://www.research.microsoft.com/~Gray/talks/FDIS_ImgCutoutPresentation.ppt
29
Data Access is hitting a wall
FTP and GREP are not adequate
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years.
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (= 1 $/GB)
  … 2 days and 1K$ per TB
  … 3 years and 1M$ per PB
• Oh!, and 1 PB ~ 5,000 disks
• At some point you need
  – indices to limit search
  – parallel data search and analysis
• This is where databases can help
30
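The GREP numbers are a sequential scan at a fixed rate. A sketch, assuming the slide's 1 GB/minute (~17 MB/s, a 2003-era disk); the answers land within the talk's factor-of-3 rounding:

```python
# Sequential-scan (GREP) times at a fixed scan rate.
def scan_seconds(nbytes, bytes_per_second=1e9 / 60):
    """Time to scan nbytes at 1 GB/minute."""
    return nbytes / bytes_per_second

tb_days = scan_seconds(1e12) / 86400            # ~0.7 days (slide rounds to 2)
pb_years = scan_seconds(1e15) / (86400 * 365)   # ~2 years (slide rounds to 3)
```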
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and processing grow at the same rate,
  we can only keep up with N log N
• A way out?
– Discard notion of optimal (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
31
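The N log N claim is easy to quantify: if the dataset and the hardware both grow 1000x, an N log N algorithm barely slows down while an N^2 algorithm slows down 1000x. A sketch:

```python
import math

# Wall-clock growth when data scales n_old -> n_new and hardware gets
# hw_speedup times faster.
def slowdown(cost, n_old, n_new, hw_speedup):
    """Factor by which running time grows despite the faster hardware."""
    return cost(n_new) / cost(n_old) / hw_speedup

quadratic = slowdown(lambda n: n ** 2, 1e6, 1e9, 1e3)        # 1000x slower
nlogn = slowdown(lambda n: n * math.log(n), 1e6, 1e9, 1e3)   # ~1.5x: keeps up
```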
Analysis and Databases
• Statistical analysis deals with
  – Creating uniform samples
  – Data filtering & censoring bad data
  – Assembling subsets
  – Estimating completeness
  – Counting and building histograms
  – Generating Monte-Carlo subsets
  – Likelihood calculations
  – Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database, close to the data.
• Move Mohamed to the mountain, not the mountain to Mohamed.
32
Goal:
Easy Data Publication & Access
• Augment FTP with data query:
Return intelligent data subsets
• Make it easy to
– Publish: Record structured data
– Find:
• Find data anywhere in the network
• Get the subset you need
– Explore datasets interactively
• Realistic goal:
– Make it as easy as
publishing/reading web sites today.
33
Data Federations of Web Services
• Massive datasets live near their owners:
– Near the instrument’s software pipeline
– Near the applications
– Near data knowledge and curation
– Super Computer centers become Super Data Centers
• Each Archive publishes a web service
– Schema: documents the data
– Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives → Federation
  – A common global schema
34
Web Services: The Key?
• Web SERVER:
  – Given a url + parameters
  – Returns a web page (often dynamic)
  [Diagram: Your program ↔ Web Server]
• Web SERVICE:
  – Given an XML document (soap msg)
  – Returns an XML document
  – Tools make this look like an RPC:
    F(x,y,z) returns (u, v, w)
  – Distributed objects for the web.
  – + naming, discovery, security,..
• Internet-scale distributed computing
  [Diagram: Your program, with data in your address space ↔ Web Service]
35
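The "XML in, XML out, tools make it look like an RPC" idea can be sketched end-to-end. This uses Python's stdlib XML-RPC rather than the SOAP/WSDL stack the talk means, but the abstraction is the same: F(x,y,z) returns (u,v,w) across the network, and the caller never sees the XML:

```python
# A toy web service: the client makes an ordinary-looking function call,
# and XML documents travel underneath.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def f(x, y, z):
    # a toy service method: F(x,y,z) returns (u, v, w)
    return [x + y + z, x * y * z, max(x, y, z)]

server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
port = server.server_address[1]          # let the OS pick a free port
server.register_function(f, "f")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The caller sees an RPC; the tooling marshals to and from XML.
u, v, w = ServerProxy("http://localhost:%d" % port).f(2, 3, 4)
server.shutdown()
```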
The Challenge
• This has failed several times before; understand why.
• Develop
– Common data models (schemas),
– Common interfaces (class/method)
• Build useful prototypes (nodes and portals)
• Create a community
that uses the prototypes and
evolves the prototypes.
36
Grid and Web Services Synergy
• I believe the Grid will be many web services
• IETF standards provide
– Naming
– Authorization / Security / Privacy
– Distributed Objects
Discovery, Definition, Invocation, Object Model
– Higher level services: workflow, transactions, DB,..
• Synergy: commercial Internet & Grid tools
37
Some Interesting Things
We are Doing in SDSS
(what’s new)
• SkyServer is “done.”
Now it is 99% perspiration to
load 25 TB (many times)
and manage it.
• I’m using it as a research vehicle to
explore new DB ideas.
• Others are cloning it for other surveys.
Some doing DB2 & Oracle variants.
38
SkyServer Overview (10 min)
• 10 minute SkyServer tour
  – Pixel space: http://skyserver.sdss.org/en/
  – Record space: http://skyserver.sdss.org/en/tools/explore/obj.asp?id=2255030989160697
  – Doc space: Ned
  – Set space, Web & Query Logs:
      select top 10 * from weblog..weblog
      where yy = 2003 and mm = 7 and dd = 25
      order by seq desc
      select top 10 * from weblog..sqlLog
      order by theTime desc
  – Dr1 WebService: http://skyserver.pha.jhu.edu/dr1/en/tools/chart/navi.asp
• You can download (thanks to Cathan Cook)
  – Data + Database code
  – Website
• Data Mining the SDSS SkyServer Database, MSR-TR-2002-01
  http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm
39
Cutout Service (10 min)
A typical web service
• Show it
• Show WSDL
• Show fixing a bug
• Rush through code.
• You can download it.
Maria A. Nieto-Santisteban did most of this (Alex and I started it)
http://research.microsoft.com/~gray/SDSS/personal_skyserver.htm
40
SkyQuery: http://skyquery.net/
• Distributed Query tool using a set of web services
• Four astronomy archives from
Pasadena, Chicago, Baltimore, Cambridge (England).
• Feasibility study, built in 6 weeks
– Tanu Malik (JHU CS grad student)
– Tamas Budavari (JHU astro postdoc)
– With help from Szalay, Thakar, Gray
• Implemented in C# and .NET
• Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 and (o.I - t.m_j)>2
41
SkyQuery Structure
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
• Portal is
– Plans Query (2 phase)
– Integrates answers
– Is itself a web service
[Diagram: SkyQuery Portal federating SkyNodes SDSS, INT, FIRST, and 2MASS, plus the Image Cutout service]
42
SkyQuery and The Grid
• This is a DataGrid
• It works today
• SkyQuery is being used as a vehicle to explore OGSA and DAIS requirements.
• It is challenging for OGSA-DAIS
  (hello world in OGSI-DAI is complex)
[Diagram: SkyQuery Portal federating SDSS, INT, FIRST, and 2MASS, plus the Image Cutout service]
43
MyDB added to SkyQuery
• Let users add a personal DB, 1 GB for now.
• Moves analysis to the data
• Users can cooperate (share MyDB)
• Use it as a workbook.
• Online and batch queries.
• Still exploring this
[Diagram: SkyQuery Portal with MyDB, federating SDSS, INT, FIRST, and 2MASS, plus the Image Cutout service]
44
Some Database Topics
• Sparse tables:
  – column vs row store
  – tag and index tables
  – pivot
• Maplist (cross apply)
• Dealing with bad statistics
• Object Relational has arrived.
45
Column Store Pyramid
• Users see fat base tables (universal relation)
• Define popular columns index: tag table, 10% ~ 100 columns
• Make many skinny indices: 1% ~ 10 columns
• Query optimizer picks right plan
• Automate definition & use
• Fast read, slow insert/update
• Data warehouse
• Note: prior to Yukon, an index had a 16-column limit.
  A bane of my existence.
[Diagram: pyramid of BASE, TAG, and INDICES; an obese query hits BASE, a fat query hits TAG, a typical semi-join uses the INDICES]
46
Simple Examples
create table base (
  id bigint primary key,
  f1 int, f2 int, …, f1000 int)

create index tag on base(id)
  include (f1, …, f100)

create index skinny on base(f2, …, f17)
[Diagram: BASE / TAG / INDICES pyramid, as before]
47
A Semi-Join Example
create table fat(a int primary key, b int, c int, fat char(988))
declare @i int, @j int; set @i = 0
again:
insert fat values(@i, cast(100*rand() as int), cast(100*rand() as int), ' ')
set @i = @i + 1; if (@i < 1000000) goto again
create index ab on fat(a,b)
create index ac on fat(a,c)

-- b=c by scanning the 1 GB base table: 137 K IO, 31 sec
dbcc dropcleanbuffers with no_infomsgs
select count(*) from fat with(index(0)) where c = b
-- Table 'fat'. Scan 3, reads 137,230, CPU: 1.3 s, elapsed 31.1 s.

-- b=c by semi-joining the two 8 MB indices ab and ac: 3.4 K IO, 1.4 sec
dbcc dropcleanbuffers with no_infomsgs
select count(*) from fat where b = c
-- Table 'fat'. Scan 2, reads 3,482, CPU 1.1 s, elapsed 1.4 s.
48
Moving From Rows to Columns:
Pivot & UnPivot
What if the table is sparse?
LDAP has 7 mandatory and 1,000 optional attributes.
Store (object, attribute, value):

create table Features (
  object varchar,
  attribute varchar,
  value varchar,
  primary key (object, attribute))

Features:
  object   attribute  value
  4PNC450  year       2000
  4PNC450  color      white
  4PNC450  make       Ford
  4PNC450  model      Taurus
  ●●●●

select *
from (features pivot value
      on attribute
      in (year, color)) as T
where object = ‘4PNC450’

T:
  object   year  color
  4PNC450  2000  white
49
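The pivot above can be mimicked on any sparse (object, attribute, value) store. A sketch mirroring the slide's Features example (the `pivot` helper is mine, not a library function):

```python
# Sparse row store: (object, attribute, value) triples, as in the slide.
features = [
    ("4PNC450", "year", "2000"),
    ("4PNC450", "color", "white"),
    ("4PNC450", "make", "Ford"),
    ("4PNC450", "model", "Taurus"),
]

def pivot(rows, obj, attrs):
    """select * from (features pivot value on attribute in (attrs))
    where object = obj"""
    lookup = {(o, a): v for o, a, v in rows}
    row = {"object": obj}
    for a in attrs:
        row[a] = lookup.get((obj, a))   # None plays the role of NULL
    return row

t = pivot(features, "4PNC450", ["year", "color"])
```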
Maplist Meets SQL – cross apply
select p.*, q.*
from parent as p cross apply f(p.a, p.b, p.c) as q
where p.type = 1
• Your table-valued function f(a,b,c) returns all objects related to a,b,c:
  – spatial neighbors,
  – sub-assemblies,
  – members of a group,
  – items in a folder,…
• Apply this function to each row
• Classic drill-down
• Use outer apply if f() may be null
[Diagram: p1 → f(p1), p2 → f(p2), …, pn → f(pn)]
50
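Cross apply is maplist (flat-map) in SQL clothing: expand f(p) for each parent row and pair the results with p. A sketch with toy parent rows and a made-up f, both of which are illustrative assumptions:

```python
# Toy data: parent rows with a type column, as in the slide's query.
parents = [{"id": 1, "type": 1}, {"id": 2, "type": 1}, {"id": 3, "type": 2}]

def f(p):
    # table-valued function: rows "related to" p (two children each, here)
    return [{"child": "%da" % p["id"]}, {"child": "%db" % p["id"]}]

# select p.*, q.* from parent as p cross apply f(...) as q where p.type = 1
result = [{**p, **q} for p in parents if p["type"] == 1 for q in f(p)]

def outer_apply(rows, fn):
    # outer apply: keep the parent row even when fn(p) returns no rows
    for p in rows:
        for q in fn(p) or [{}]:
            yield {**p, **q}
```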
When SQL Optimizer Guesses Wrong,
Life is DREADFUL
• SQL is a non-procedural language.
• The compiler/optimizer picks the procedure
based on statistics.
• If the stats are wrong or missing….
Bad things happen.
Queries can run VERY slowly.
• Strategy 1: allow users to specify plan.
• Strategy 2: make the optimizer smarter
(and accept hints from the user.)
51
An Example of the Problem
• A query selects some fields of an index and of a huge table.
• Bookmark plan:
  – Look in the index for a subset.
  – Look the subset up in the fat table.
• This is
  – great if subset << table.
  – terrible if subset ~ table.
• If statistics are wrong, or if predicates are not independent,
  you get the wrong plan.
[Diagram: a small Index pointing into a Huge table]
52
A Fix: Let user ask for stats
• Create Statistics on View(f1,…,fn)
• Then the optimizer has the right data and picks the right plan.
  Statistics on Views, C. Galindo-Legaria, M. Joshi, F. Waas, M. Wu, VLDB 2003.
• Q3:
  Select count(*) from Galaxy
  where r < 22 and r_extinction > 0.120
  – Bookmark plan: 34 M random IO, 520 minutes
  – After Create Statistics on Galaxy(objID), scan plan: 5 M sequential IO, 18 minutes
• Ultimately this should be automated,
  but for now… it’s a step in the right direction.
53
Object Relational Has Arrived
• VMs are moving inside the DB
• Yukon includes the Common Language Runtime
  (Oracle & DB2 have similar mechanisms).
• So C++, VB, C# and Java are co-equal with TransactSQL.
• You can define classes and methods;
  SQL will store the instances
  and access them via methods.
• You can put your analysis code INSIDE the database.
• Minimizes data movement.
  You can’t move petabytes to the client,
  but we will soon have petabyte databases.
54
Spatial Data Search
The Pre-CLR design:
Transact SQL → sp_HTM (20 lines) → 469 lines of “glue” → the HTM code body.
The glue looks like:

// Get Coordinates param datatype, and param length information of
if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL)
    ErrorExit("srv_paraminfo failed...");
// Is Coordinate param a character string
if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR &&
    bType1 != SRVVARCHAR && bType1 != SRVCHAR)
    ErrorExit("Coordinate param should be a string.");
// Is Coordinate param non-null
if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1)
    ErrorExit("Coordinate param is null.");
// Get pointer to Coordinate param
pzCoordinateSpec = (char *) srv_paramdata(pSrvProc, 1);
if (pzCoordinateSpec == NULL)
    ErrorExit("Coordinate param is null.");
pzCoordinateSpec[cbActualLen1] = 0;
// Get OutputVector datatype, and param length information
if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL)
    ErrorExit("Failed to get type info on HTM Vector param...");
55
The “glue” CLR design
Discard 450 lines of UGLY code:
C# SQL → sp_HTM (50 lines) → the HTM code body.
Thanks!!! To Peter Kukol (who wrote this)

using System;
using System.Data;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace HTM {
  public class HTM_wrapper {
    [DllImport("SQL_HTM.dll")] static extern unsafe void * xp_HTM_Cover_get (byte *str);
    public static unsafe void HTM_cover_RS(string input) {
      // convert the input from Unicode (array of 2 bytes) to a byte array s (not shown)
      byte * s; byte * output;
      // invoke the HTM routine
      output = (byte *)xp_HTM_Cover_get(s);
      // Convert the array to a table
      SqlResultSet outputTable = SqlContext.GetReturnResultSet();
      if (output[0] == 'O') {               // if Output is “OK”
        uint c = *(UInt32 *)(output + 4);   // count of result pairs
        Int64 * r = (Int64 *)(output + 8);  // Int64 r[c-1,2]
        for (int i = 0; i < c; ++i) {
          SqlDataRecord newRecord = outputTable.CreateRecord();
          newRecord.SetSqlInt64(0, r[0]);
          newRecord.SetSqlInt64(1, r[1]); r++; r++;
          outputTable.Insert(newRecord);
        }
      }  // return outputTable;
    }
  }
}
56
The Clean CLR design
Discard all glue code: return an array cast as a table.

CREATE ASSEMBLY HTM_A
FROM '\\localhost\HTM\HTM.dll'
CREATE FUNCTION HTM_cover( @input NVARCHAR(100) )
RETURNS @t TABLE (
  HTM_ID_START BIGINT NOT NULL PRIMARY KEY,
  HTM_ID_END BIGINT NOT NULL
)
AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover

using System;
using System.Data;
using System.Data.Sql;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace HTM_NS {
  public class HTM_C {
    public static Int64[,2] HTM_cover(string input) {
      // invoke the HTM routine
      // (the actual HTM C# or C++ or Java or VB code goes here: your/my code)
      return (Int64[,2]) xp_HTM_Cover(input);
    }
  }
}
57
Performance (Beta1)
On a 2.2 GHz Xeon:
• Call a Transact SQL function:        33 μs
• Call a C# function:                  50 μs
• Table valued function:               not good in β1
• Array (== table) valued function:    200 μs + 27 μs per row
58
The Code
-- Program in the DB, in a different language (Tsql), calling the function:
CREATE ASSEMBLY ReturnOneA
FROM '\\localhost\C:\ReturnOne.dll'
GO
CREATE FUNCTION ReturnOne_Int( @input INT) RETURNS INT
AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int
GO
---------------------------------------------- time echo an integer
declare @i int, @j int, @cpu_seconds float, @null_loop float
declare @start datetime, @end datetime
set @j = 0
set @i = 10000
set @start = current_Timestamp
while (@i > 0)
begin
  set @j = @j + 1
  set @i = @i - 1
end
set @end = current_Timestamp
set @null_loop = datediff(ms, @start, @end) / 10.0
set @i = 10000
set @start = current_Timestamp
while (@i > 0)
begin
  select @j = dbo.ReturnOne_Int(@i)
  set @j = @j + 1
  set @i = @i - 1
end
set @end = current_Timestamp
set @cpu_seconds = datediff(ms, @start, @end) / 10.0 - @null_loop
print 'average cpu time per call to ReturnOne_Int was '
    + str(@cpu_seconds, 8, 2) + ' micro seconds'

-- Function written in C#, inside the DB:
using System;
using System.Data;
using System.Data.SqlServer;
using System.Data.SqlTypes;
using System.Runtime.InteropServices;
namespace ReturnOneNS {
  public class ReturnOneC {
    public static int ReturnOne_Int(int input) {
      return input;
    }
  }
}
59
What Is the Significance?
• No more inside/outside DB dichotomy.
• You can put your code near the data.
• Indeed, we are letting users put personal
databases near the data archive.
• This avoids moving large datasets.
• Just move questions and answers.
60
Meta-Message
• We are trying to fit science data into databases.
• When it does not fit, something is wrong.
• Look for solutions
– Many solutions come from OR extensions
– Some are fundamental engine changes
• More structure in DB
• Richer operator sets
• Better statistics
61
Two (?) Talks
• Distributed Computing Economics
• Online Science (what I have been doing).
62