Using SQL Server For Large Databases and Microsoft Grid Computing

Download Report

Transcript Using SQL Server For Large Databases and Microsoft Grid Computing

Microsoft
Large Databases
and
Grid Computing
Jim Gray
Microsoft Research
[email protected]
http://research.Microsoft.com/~gray
21 May 2003
About me
• in Microsoft research (located in San Francisco)
• A database researcher
– IBM, Tandem, DEC, Microsoft
• Work on Scalable Systems
– Building supercomputers
from commodity components.
• Do academic/government things too
– PITAC, GriPhyn TAB, NSF/CISE,
Library of Congress, …
• For the last 4 years,
been working with the astronomy community
to build the World Wide Telescope.
Agenda
• TerraServer
– What it is
– What we learned
– What we are doing now.
• SkyServer / WWT
– What it is
– What we learned
– What we are doing now
• Grid Computing
– General comments
– Build a web service
TerraServer
TerraService.net
• A photo of the United States
–
–
–
–
–
1 meter resolution (photographic/topographic)
USGS data
Some demographic data (BestPlaces.net)
Home sales data
Linked to Encarta Encyclopedia
• 15 TB raw, 6 TB cooked (grows 10GB/w)
• Point, Pan, zoom interface
• Among top 1,000 websites
– 40k visitors/day
– 4M queries/day
– 3 B page views (in 5 years)
• All in an SQL database
TerraServer Statistics
June ‘98
Unique Users
Page Views
Image Tiles
Db Queries
Bytes Xfered
Jan ‘99
Jan ‘00
Daily
Peak
Average
Day
40,011
277,292
1,266,838 2,401,209
3,735,789 10,475,674
4,484,089 12,388,104
70 gb
163 gb
May ‘00
June 1998 Oct, 2002
63,656,904
2,015,539,605
5,943,641,024
7,134,186,170
108tb
Sept ’01
900 m Rows
755m
Rows
SQL 2000
298 m Rows
231 m Rows
SQL 2000
217 m Rows
173 m Rows
SQL 7.0
.75 TB Db
.8 TB Db
SQL 7.0
SQL 7.0
1.0 TB Db
1.0 TB Db
1 Server / Win NT 4.0 EE
SQL 2000
1 Server
.8 TB Db
SQL 7.0
1 Server
1.5 TB Db
2nd Server / Win 2k DataCenter
Dec ‘02
1.4 TB Db
SQL 2000
SQL 2000
2.0 TB Db
SQL 2000
2.0 TB Db
1.2 TB Db
SQL 2000
1.0 TB Db
SQL 2000
2.0 TB Db
4 Node / Win2k Datacenter
Failover Cluster
TerraServer Cluster
8 Compaq DL360 “Photon” Web Servers
One SQL database per rack
Each rack contains 4.5 TB
1 rack not in picture
18.0 TB total
Meta Data
Stored on 101 GB
“Fast, Small Disks”
(18 x 18.2 GB)
Imagery Data
Stored on 4 339 GB
“Slow, Big Disks”
(15 x 73.8 GB)
Added 90 72.8 GB
Disks in Feb 2001
to create 18 TB SAN
Fiber SAN
Switches
O O
E E
J J
SQL\Inst1
P Q
K
L
F
G
SQL\Inst2
R S
M N
H
4 Compaq ProLiant 8500 Db Servers
I
Cluster Configuration
TerraServer SAN
Compaq
StorageWorks
1
Cisco 12000
Internet Router
Internet
Database
Compaq DL360 (10)
Cluster
MA8000/HSG80
Controllers (3)
Summit 7i
Switch (2)
2
Compaq
SAN
Internet
switch
by Brocade
Communications
ADIC
LTO
Tape
Library
Compaq DL360 (6)
(Windows 2000 Web Servers)
TerraServer.microsoft.com
3
Compaq
ProLiant 8500
(4)
Extreme
Networks
Summit 48
Switch
Microsoft
Corporat
e LAN
TerraServer Becomes a Web Service
TerraServer.net -> TerraService.Net
• Web server is for people.
• Web Service is for programs
– The end of screen scraping
– No faking a URL:
pass real parameters.
– No parsing the answer:
data formatted into your
address space.
• Hundreds of users but a
specific example:
– US Department of Agriculture
And now.. 4 slides from the “customer”
who built a portal using TerraService
Data Gateway Functional
Overview
ITC - Fort Collins, Colorado
NCGC - Fort Worth, Texas
Customer Orders
Data
Terra
Service
Navigation
Service
Soil Data
Viewer
XML
XML
Billing
Services
Rimage
CD
Service
XML
ASP
Catalog
Service
FTP
Services
Ship
Service
<<Requests Products>>
Send order info
validate (dtd)
Insert into SQL
@@Identity / GUID to client
return est time
raise OrderMgr.event
Order
Placer
Package
Service
Product Catalog Updates
Order
Database
Called by anyone
rasies to stats svc'
XML Request for data
Logger
Selects from
Listen for OrderPlacer Raised
Event
Select sequenced Item
Output XML
rasie event : stats.delivery start
Geospatial
Data
Data
Services
B
u
f
f
e
r
A
r
e
a
W
i
t
h
i
n
F
i
e
l
d
s
Item
Broker
1
0
9
9
A
5
A
1
0
B
6
A
2
7
.
5
2
1
.
7
1
1
1
3
9
C
1
2
1
6
1
7
1
4
1
6
A
1
3
A
1
9
2
2
0
.
9
1
8
2
7
.
3
2
0 6
.
9
2
.
9
2
4
1
5
.
9
2
9
2
2
A
2
8
A
3
9
.
3
2
8
1
1
.
5
2
7
2
6
4
.
6
3
2
3
3
8
.
9
3
1
3
1
A
1
.
6
3
0
A
1
1
.
3
4
.
8
2
5
3
0
5
.
3
2
5
A
1
2
.
0
2
1
Acknowledges item ready for delivery
N
2
0
0
0
P
i
p
e
l
i
n
e
s
9
7
B
u
f
f
e
r
A
r
e
a
W
i
t
h
i
n
F
i
e
l
d
s
5
A
6
A
1
0
B
1
8
2
0
2
4
2
5
2
6
2
7
2
8
2
9
3
0
3
0
A
3
1
0
2
0
0
0
3
1
A
4
0
0
0
F
e
e
t
3
2
F
i
e
l
d
s
W
i
t
h
i
n
B
u
f
f
e
r
L
a
n
d
u
n
i
t
s
U
S
D
A
1
:
1
5
8
4
0
N
R
C
S
Custom End Product
WebInterpretation
XML
Soil
Soil Report
Data Viewer
Map
Web Server - COM+ Applications
ArcIMS Connector
WebSDV
Connects to
ArcIMS;
communication is
done through
ArcIMS XML (AXL)
Retrieves and
processes Soils
Data from the
NASIS relational
Database
IMSNavigator
Generates maps
(JPGs) using
ArcIMS
Image Retriever
Retrieves
imagery from
the Microsoft
TerraServer
Database Server - ESRI Spatial Data Server
ESRI
Spatial Data Engine
Database Server - Microsoft SQL Server
Business
Rules
National Soils
Data
Geospatial
Data
Microsoft Terraserver
Terraserver
Brief tour of TerraService
• Show map service
• Show some methods
• See
TerraService.NET:
An Introduction to Web Services
Tom Barclay; Jim Gray; Eric Strand; Steve Ekblad; Jeffrey Richter,
MSR TR 2002-53, pp 13, June 2002
What We Learned
• You can build and manage a very popular website
with relatively little effort
(if you do it right and have Tom Barclay)
• Loading 20 TB takes a lot of energy
• And you get to do it many times -- automate
• Tape and tape software are problematic
• Triplex and snap-shot disks works
(we have never had to use it, but..)
• The internet gives you 2-9’s
Servers can run at 4 9’s easily, 5 9’s with effort.
What we are doing now.
•
•
•
•
•
•
•
Building with 3K$ 2TB bricks
4 bricks = 1 backend
Triplexing systems
Duplexing sites.
4*3*2 = 24k$ for Geoplex
Very simple operations model
See:
• “TeraScale SneakerNet:
Using Inexpensive Disks for Backup, Archiving, and Data Exchange,”
Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan Vandenberg, pp. 1-8, May 2002
Agenda
• TerraServer
– What it is
– What we learned
– What we are doing now.
• SkyServer / WWT
– What it is
– What we learned
– What we are doing now
• Grid Computing
– General comments
– Build a web service
SkyServer
SkyServer.SDSS.org
• Like the TerraServer,
but looking the other way:
a picture of ¼ of the
universe
• Pixels +
Data Mining
• Astronomers get about 400
attributes for each “object”
• Get Spectrograms
for 1% of the objects
Why Astronomy Data?
IRAS 25m
•It has no commercial value
–No privacy concerns
–Can freely share results with others
–Great for experimenting with algorithms
2MASS 2m
•It is real and well documented
–High-dimensional data (with confidence intervals)
–Spatial data
–Temporal data
•Many different instruments from
many different places and
many different times
•Federation is a goal
•The questions are interesting
DSS Optical
IRAS 100m
WENSS 92cm
NVSS 20cm
–How did the universe form?
•There is a lot of it (petabytes)
ROSAT ~keV
GB 6cm
Demo of SkyServer
•
•
•
•
•
Shows standard web server
Pixel/image data
Point and click
Explore one object
Explore sets of objects (data mining)
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
Premise: Most data is (or could be online)
So, the Internet is the world’s best telescope:
–
–
–
–
It has data on every part of the sky
In every measured spectral band: optical, x-ray, radio..
As deep as the best instruments (2 years ago).
It is up when you are up.
The “seeing” is always great
(no working at night, no clouds no moons no..).
– It’s a smart telescope:
links objects and data to literature on them.
Time and Spectral Dimensions
The Multiwavelength Crab Nebulae
Crab star
1053 AD
X-ray,
optical,
infrared, and
radio
views of the nearby
Crab Nebula, which is
now in a state of
chaotic expansion after
a supernova explosion
first sighted in 1054
A.D. by Chinese
Astronomers.
Slide courtesy of Robert Brunner @ CalTech.
Data Federations of Web Services
• Massive datasets live near their owners:
–
–
–
–
Near the instrument’s software pipeline
Near the applications
Near data knowledge and curation
Super Computer centers become Super Data Centers
• Each Archive publishes a web service
– Schema: documents the data
– Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple ArchivesFederation
– A common global schema
Grid and Web Services Synergy
• I believe the Grid will be many web services
share data (computrons are free)
• IETF standards Provide
– Naming
– Authorization / Security / Privacy
– Distributed Objects
Discovery, Definition, Invocation, Object Model
– Higher level services: workflow, transactions, DB,..
• Synergy: commercial Internet & Grid tools
Web Services: The Key?
• Web SERVER:
– Given a url + parameters
– Returns a web page (often dynamic)
Your
program
Web
Server
• Web SERVICE:
– Given a XML document (soap msg)
– Returns an XML document
– Tools make this look like an RPC.
• F(x,y,z) returns (u, v, w)
– Distributed objects for the web.
– + naming, discovery, security,..
• Internet-scale
distributed computing
Your
program
Data
In your
address
space
Web
Service
SkyQuery: a prototype
• Defining Astronomy Objects and Methods.
• Federated 3 Web Services (fermilab/sdss, jhu/first, Cal Tech/dposs)
multi-survey cross-match
Distributed query optimization (T. Malik, T. Budavari, Alex Szalay @ JHU)
http://SkyQuery.net/
• My first web service (cutout + annotated SDSS images) online
– http://skyservice.pha.jhu.edu/devel/ImgCutout/chart.asp
• WWT is a great Web Services (.Net) application
– Federating heterogeneous data sources.
– Cooperating organizations
– An Information At Your Fingertips challenge.
Demo of Image Cutout Service
•
•
•
•
Shows image cutout
Show project and debugging project
Show hello World
Show “theAnswer” method
SkyQuery (http://skyquery.net/)
• Distributed Query tool using a set of services
• Feasibility study, built in 6 weeks from scratch
– Tanu Malik (JHU CS grad student)
– Tamas Budavari (JHU astro postdoc)
• Implemented in C# and .NET
• Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 and (o.I - t.m_j)>2
SkyNode Basic Web Services
• Metadata information about resources
– Waveband
– Sky coverage
– Translation of names to universal dictionary (UCD)
• Simple search patterns on the resources
– Cone Search
– Image mosaic
– Unit conversions
• Simple filtering, counting, histogramming
• On-the-fly recalibrations
Portals: Higher Level Services
• Built on Atomic Services
• Perform more complex tasks
• Examples
–
–
–
–
–
Automated resource discovery
Cross-identifications
Photometric redshifts
Outlier detections
Visualization facilities
• Goal:
– Build custom portals in days from existing building blocks
(like today in IRAF or IDL)
Architecture
Image cutout
SkyNode
First
Web Page
SkyQuery
SkyNode
2Mass
SkyNode
SDSS
Summary So Far
•
•
•
•
•
Some real web services deployed today
Easy to build & deploy
Services publish data, Portals unify it
Tools really work!
I’m using C# and foundation classes of
VisualStudio, a great! Tool
• A nice book explaining the ideas:
(.Net Framework Essentials, Thai, Lam isbn 0-596-00302-1)
Possible Relevance to You
• This web service stuff is REAL
• If you have a class,
It is a way to publish data:
Internet
Web
Intranet
Service
• It is a way to find data
data comes with schema
no more screen scraping/parsing
• Business model unclear
– Your ideas go here.
Your
program
Data
In your
address
space
What We Learned
• Web services really are a breakthrough.
• Data mining worked beautifully. See
Data Mining the SDSS SkyServer Database,”
J. Gray, D. Slutz, A. Szalay, A. Thakar, P. Kuntz, C. Stoughton, MSR TR 2002-1, pp1-40,
2002.
• You can operate a system in Chicago
from San Francisco –
Terminal Server is wonderful.
• The Internet gives you 2 9’s of availability
• TeraScale SneakerNet works well
What we are doing now.
•
•
•
•
Loading more data (next data release)
Preparing for the next generation
Building the WWT
Web Services for the Virtual Observatory,
Alexander S. Szalay, Tamás Budavária, Tanu Malika, Jim Gray, and Ani Thakar,
SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa,
Hawaii,
• Petabyte Scale Data Mining: Dream or Reality?,
Alexander S. Szalay; Jim Gray; Jan vandenBerg, SIPE Astronomy Telescopes
and Instruments, 22-28 August 2002, Waikoloa, Hawaii,
• Online Scientific Data Curation, Publication, and Archiving
Jim Gray; Alexander S. Szalay; Ani R. Thakar; Christopher Stoughton; Jan
vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002,
Waikoloa, Hawaii,
Agenda
• TerraServer
– What it is
– What we learned
– What we are doing now.
• SkyServer / WWT
– What it is
– What we learned
– What we are doing now
• Grid Computing
– General comments
– Build a web service
The Grid
•
•
•
•
Computation Grid: harvest Internet cpus.
Data Grid: Share files
Application Grid: Web services
Access Grid: teleconferencing
The Microsoft View
• Web Services will subsume the Grid
–The Grid will be data and services
not renting cycles
• OGSA: evolution of Globus Toolkit to Web
services concepts and technologies…
• Lots of encouragement from
Microsoft, IBM, Oracle, Sun
• GGF as forum for discussion
Engagement with Grid Community
• Goal: GXA as infrastructure for Grids
• Working with Globus & GGF
– Funding work at Argonne National Lab (Globus)
– Globus Toolkit 3, and CondorG on Windows
• http://www.globus.org/win-alpha/ (we sponsored this)
– OGSA for .NET (prototyping)
• http://www.globus.org/ogsa/
– Also OGSI.NET at U. VA is very interesting
• http://www.cs.virginia.edu/~gsw2c/ogsi.net.html
– GGF
• Active membershp
• HPC .net kit – see http://www.microsoft.com/HPC
– Part of .net server scale out development
– Includes MPI-CH 1.2.4, distributed job scheduler,…
– Thomas Sterling, Beowulf on Windows, MIT Press 2001
What’s Microsoft Doing
• Mostly .NET, W3C standards, web services, …
• I think SkyQuery is the best web service (grid
app) in GriPhyN today.
• My stuff is grid computing
• But…
• Globus (GT3), OGSA, and CondorG ported to
Windows (we sponsored it)
• We have a HPC toolkit: MPI-CH 1.2.4
• See
http://www.microsoft.com/windows2000/hpc/ for
many useful links
I Can Talk About Computing on
Demand But… Best to read
• Distributed Computing Economics,
Jim Gray, MSR-TR-2003-24, March 2003
• The slides that follow are based on that
paper.
Distributed Computing Economics
•
•
•
•
•
•
Why is Seti@Home a great idea
Why is Napster a great deal?
Why is the Computational Grid uneconomic
When does computing on demand work?
What is the “right” level of abstraction
Is the Access Grid the real killer app?
Computing is Free
• Computers cost 1k$ (if you shop right)
• So 1 cpu day == 1$
• If you pay the phone bill (and I do)
Internet bandwidth costs 50 … 500$/mbps/m
(not including routers and management).
• So 1GB costs 1$ to send and 1$ to receive
Why is Seti@Home a Good Deal?
• Send 300 KB for
• User computes for ½ day:
• ROI: 1500:1
costs 3e-4$
benefit .5e-1$
Why is Napster a Good Deal?
• Send 5 MB
costs 5e-3$
• ½ a penny per song
• Both sender and receiver can afford it.
• Same logic powers web sites (Yahoo!...):
– 1e-3$/page view advertising revenue
– 1e-5$/page view cost of serving web page
– 100:1 ROI
The Cost of Computing:
Computers are NOT free!
• Capital Cost of a TpcC
system is mostly
storage and
storage software (database)
• IBM 32 cpu, 512 GB ram
2,500 disks, 43 TB
(680,613 tpmC @ 11.13 $/tpmc available 11/08/03)
http://www.tpc.org/results/individual_results/IBM/IBMp690es_05092003.pdf
• A 7.5M$ super-computer
• Total Data Center Cost:
40% capital &facilities
60% staff
(includes app development)
TpcC Cost Components DB2/AIX
http://www.tpc.o rg/results/individual_results/IB M /IB M p690es_05092003.pdf
software
10%
storage
61%
cpu/mem
29%
Computing Equivalents
1 $ buys
•
•
•
•
•
•
•
1 day of cpu time
4 GB ram for a day
1 GB of network bandwidth
1 GB of disk storage
10 M database accesses
10 TB of disk access (sequential)
10 TB of LAN bandwidth (bulk)
Some consequences
• Beowulf networking is
10,000x cheaper than WAN networking
factors of 105 matter.
• The cheapest and fastest way to move a
Terabyte cross country is sneakernet.
24 hours = 4 MB/s
50$ shipping vs 1,000$ wan cost.
• Sending 10PB CERN data via network
is silly:
buy disk bricks in Geneva,
fill them,
ship them – one way.
TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange
Jim Gray; Wyman Chong; Tom Barclay; Alex Szalay; Jan vandenBerg
Microsoft Technical Report may 2002, MSR-TR-2002-54
http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
How Do You Move A Terabyte?
Context
Speed
Rent
$/TB
$/Mbps
Mbps $/month
Sent
Time/TB
Home phone 0.04
40
1,000
3,086
6 years
Home DSL
0.6
70
117
360
5 months
T1
1.5
1,200
800
2,469
2 months
T3
43
28,000
651
2,010
2 days
OC3
155
49,000
316
976
14 hours
OC 192
9600
1,920,000
200
617
14 minutes
100 Mpbs
100
1 day
Gbps
1000
2.2 hours
Computational Grid Economics
• To the extent that computational grid is like
Seti@Home or ZetaNet or Folding@home
or… it is a great thing
• The extent that the computational grid is MPI
or data analysis, it fails on economic grounds:
move the programs to the data, not the data to
the programs.
• The Internet is NOT the cpu backplane.
• The USG should not hide this economic fact
from the academic/scientific research
community.
Computing on Demand
• Was called outsourcing / service bureaus in my
youth. CSC and IBM did it.
• Payroll is standard outsource.
• Now we have Hotmail, Salesforce.com,
Oracle.com,….
• Works for standard apps.
• Airlines outsource reservations.
Banks outsource ATMs.
• But Amazon, Amex, Wal-Mart, ...
Can’t outsource their core competence.
• So, COD works for commoditized services.
• It is not a new way of doing things: think payroll.
What’s the right abstraction level for
Internet Scale Distributed Computing?
•
•
•
•
Disk block?
File?
Database?
Application?
No too low.
No too low.
No too low.
Yes, of course.
– Blast search
– Google search
– Send/Get eMail
– Portals that federate astronomy archives
(http://skyQuery.Net/)
• Web Services (.NET, EJB, OGSA)
give this abstraction level.
Access Grid
•
•
•
•
•
Q: What comes after the telephone?
A: eMail?
A: Instant messaging?
Both seem retro technology: text & emotons.
Access Grid
could revolutionize human communication.
• But, it needs a new idea.
• Q: What comes after the telephone?
Distributed Computing Economics
•
•
•
•
•
•
Why is Seti@Home a great idea?
Why is Napster a great deal?
Why is the Computational Grid uneconomic
When does computing on demand work?
What is the “right” level of abstraction?
Is the Access Grid the real killer app?
Based on: Distributed Computing Economics,
Jim Gray, Microsoft Tech report, March 2003, MSR-TR-2003-24
http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
Agenda
• TerraServer
– What it is
– What we learned
– What we are doing now.
• SkyServer / WWT
– What it is
– What we learned
– What we are doing now
• Grid Computing
– General comments
– Build a web service