Data Challenges I`m Struggling With
Download
Report
Transcript Data Challenges I`m Struggling With
Data Challenges I'm
Struggling With
Jim Gray, Microsoft Research
1. Sneakernet is probably the best way to moving WAN data at 1GBps
File transfer efforts are currently 550MBps via Internet2.
How to manage the multi-petybyte file repository we are about to generate.
2. The TerraServer has evolved from a mainframe to a bunch of bricks.
The new design has been operating for a year and we are quite pleased with it.
But we face "how-do-you-manage a bunch?" and what the best geoplex strategy?.
3. The SkyServer website is built using database technology and web services.
Now moving the web services inside the database.
Others are working to design a scale-out version of the server.
There are several interesting data challenges in these changes.
4. Using relational tuples to represent spatial volumes as constraints.
Point-in-polygon and polygon-overlap queries can then be quickly evaluated.
I will briefly describe this idea.
How Do You Move A Terabyte?
Context
Speed
Rent
$/TB
$/Mbps
Mbps $/month
Sent
Time/TB
Home phone 0.04
40
1,000
3,086
6 years
Home DSL
0.6
50
117
360
5 months
T1
1.5
1,200
800
2,469
2 months
T3
43
28,000
651
2,010
2 days
OC3
155
49,000
316
976
14 hours
OC 192
9600
1,920,000
200
617
14 minutes
100 Mpbs
100
1 day
Gbps
1000
2.2 hours
Source: TeraScale Sneakernet, Microsoft Technical Report May 2002, MSR-TR-2002-54
http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
Moving Data Bricks
• WAN costs >> 100$/Mbps/month
>> 1$/GB
• Beowulf networking
10,000x cheaper than WAN
factors of 105 matter.
• The cheapest and fastest way
to move a Terabyte cross country
is sneakernet.
24 hours = 4 MB/s
50$ shipping vs 1,000$ wan cost.
Giga Byte Per Second File Mover
• CERN to Pasadena
– Windows TCP/IP stack improvements
– Opteron demo
– Disk-to-Disk at 550MBps now (~2 TB/Hour)
• What we learned:
– Linux tcp stack is good/better at high perf
we are catching up.
– NTFS is better than various Linux FS
– Near the PCI-X limit
tcp limit
– Good way to engage the community.
CERN-Caltech Trasfer Speeds
Newisys->Newisys
PCI -X limit
MBps
• GOAL: 1GBps disk-to-disk.
1000
900
800
700
600
500
400
300
200
100
0
Mar-04
File Transfer MBps
1 Stream tcp MBps
May-04
Jun-04
Aug-04
Sep-04
But then what?
Managing Petabytes
•
•
•
•
•
•
CERN files are 30MB
They produce 1 B files/year.
How name them?
How manage them?
Depends on workload: how use them.
It’s a DB problem.
Data Challenges I'm
Struggling With
Jim Gray, Microsoft Research
1. Sneakernet is probably the best way to moving WAN data at 1GBps
File transfer efforts are currently 550MBps via Internet2.
How to manage the multi-petybyte file repository we are about to generate.
2. The TerraServer has evolved from a mainframe to a bunch of bricks.
The new design has been operating for a year and we are quite pleased with it.
But we face "how-do-you-manage a bunch?" and what the best geoplex strategy?.
3. The SkyServer website is built using database technology and web services.
Now moving the web services inside the database.
Others are working to design a scale-out version of the server.
There are several interesting data challenges in these changes.
4. Using relational tuples to represent spatial volumes as constraints.
Point-in-polygon and polygon-overlap queries can then be quickly evaluated.
I will briefly describe this idea.
TerraServer – What’s new
• Web Service and Web Server
• New ~1 ft2/pixel full color image
of 120 urban areas
• Storage Bricks
–
–
–
–
–
–
–
Commodity servers”
4 TB raw / 2 TB Raid1 SATA storage
Dual 2 GHz + 4GB RAM
3 Bricks = TerraServer data
Data partitioned
Moving to Yukon
Working on low TCO
auto-manage
• Low Cost Availability Pair & Spare
–
–
–
–
RAID1 Mirroring
Mirrored Bunches (Yukon log ship?)
Spare Brick
Web Application
• Load balances mirrors
• Uses surviving database on failure
KVM / IP
TerraServer Challenges
• Best Geoplex strategy?
• Moving Web Services into the DB?
• Managing bunches (lower TCO).
Data Challenges I'm
Struggling With
Jim Gray, Microsoft Research
1. Sneakernet is probably the best way to moving WAN data at 1GBps
File transfer efforts are currently 550MBps via Internet2.
How to manage the multi-petybyte file repository we are about to generate.
2. The TerraServer has evolved from a mainframe to a bunch of bricks.
The new design has been operating for a year and we are quite pleased with it.
But we face "how-do-you-manage a bunch?" and what the best geoplex strategy?.
3. The SkyServer website is built using database technology and web services.
Now moving the web services inside the database.
Others are working to design a scale-out version of the server.
There are several interesting data challenges in these changes.
4. Using relational tuples to represent spatial volumes as constraints.
Point-in-polygon and polygon-overlap queries can then be quickly evaluated.
I will briefly describe this idea.
Skyserver
• Personal DB
Quick Overview (Services)
• SkyServer (skyserver.sdss.org)
– Web site delivers Sloan Digital Sky Survey data
– Also has education
– 1,000x less popular than Terraserver,
but HUGE for a science website.
• A Batch Job System with Personal DBs
– Lets users run jobs http://casjobs.sdss.org/CasJobs/
– Parameters & Answers to & from Personal DB
– Simple batch job scheduler.
• Web Services: http://www.voservices.org/
–
–
–
–
Photographic objects
Spectrographic objects
Transformation functions
7 out of the 8 are .NET.
Quick Overview (Integration)
• SkyQuery (http://skyquery.net/)
Distributed Query portal using web services
• Archives: Pasadena, Chicago, Baltimore, Cambridge (England)
• Has grown from 4 to 20 archives,
½ day to add new one (adapt DB and register with Portal)
becoming international standard
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
• Portal
– Plans Query (2 phase)
– Integrates answers
Image
– Is itself a web service
Cutout
SDSS
2MASS
FIRST
SkyQuery
Portal
INT
Challenges
• Parallel data search (data pump).
How to partition?
How manage load
• Moving web services to DB
What is the right approach?
• Move objects into DB
Spatial access methods
Data analysis in the DB.
Data Challenges I'm
Struggling With
Jim Gray, Microsoft Research
1. Sneakernet is probably the best way to moving WAN data at 1GBps
File transfer efforts are currently 550MBps via Internet2.
How to manage the multi-petybyte file repository we are about to generate.
2. The TerraServer has evolved from a mainframe to a bunch of bricks.
The new design has been operating for a year and we are quite pleased with it.
But we face "how-do-you-manage a bunch?" and what the best geoplex strategy?.
3. The SkyServer website is built using database technology and web services.
Now moving the web services inside the database.
Others are working to design a scale-out version of the server.
There are several interesting data challenges in these changes.
4. Using relational tuples to represent spatial volumes as constraints.
Point-in-polygon and polygon-overlap queries can then be quickly evaluated.
I will briefly describe this idea.
A Detail: 3 Ways We Do Spatial?
• Hierarchical mesh (extension to SQL)
– Uses table valued stored procedures
– Acts as a new “spatial access method”
– Porting to Yukon CLR for a 10x speedup.
• Zones: fits SQL like a glove
– Amazingly simple, amazingly good.
• Constraints: a really novel idea
– Lets us do algebra on regions.
• Paper:There Goes the Neighborhood: Relational Algebra for Spatial Data Search
• Idea in backup slides.
Equations Define Subspaces
y
• For (x,y) above the line
ax+by > c
• Reverse the space by
-ax + -by > -c
• Intersect a 3 volumes:
a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3
x=c/a
x
y=c/b
y
x
Domain is Union of Convex Hulls
Not a
convex hull
+
• Simple volumes are
unions of convex hulls.
• Higher order curves also
work
• Complex volumes have
holes and their holes
have holes. (that is
harder).
Now in Relational Terms
create table HalfSpace (
domainID
int not null
-- domain name
foreign key references Domain(domainID),
convexID int not null,
-- grouping a set of ½ spaces
halfSpaceID int identity(), -- a particular ½ space
x
float not null,
-- the (a,b,..) parameters
y
float not null,
-- defining the ½ space
z
float not null,
c
float not null,
-- the constant (“c” above)
primary key (domainID, convexID, halfSpaceID)
(x,y,z) inside a convex if it is inside all
lines of the convex
(x,y,z) inside a convex if it is NOT OUTSIDE ANY line of the convex
select convexID
from HalfSpace
where @x * x + @y * y + @x * z <
group by all convexID
having count(*) = 0
-- return the convex hulls
-- from the constraints
l -- point outside the line?
-- consider all the lines of a convexID
-- count outside == 0
The Algebra is Simple (Boolean)
@domainID = spDomainNew (@type varchar(16), @comment varchar(8000))
@convexID = spDomainNewConvex (@domainID int)
@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int,
@x float, @y float, @z float, @l float)
@returnCode = spDomainDrop(@domainID)
select * from fDomainsContainPoint(@x float, @y float, @z float)
Once constructed they can be manipulated with the Boolean operations.
@domainID = spDomainOr (@domainID1 int, @domainID2 int,
@type varchar(16), @comment varchar(8000))
@domainID = spDomainAnd (@domainID1 int, @domainID2 int,
@type varchar(16), @comment varchar(8000))
@domainID = spDomainNot (@domainID1 int,
@type varchar(16), @comment varchar(8000))
What! No Bounding Box?
• Bounding box limits search.
A subset of the convex hulls.
• If query runs at 3M halfspace/sec then no
need for bounding box,
unless you have more than 10,000 lines.
• But, if you have a lot of half-spaces
then bounding box is good.
HTM Approach
• Table-valued function
find points near a point
– Select * from fGetNearbyEq(ra,dec,r)
• Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/
– Space filling curve, bounding triangles…
– Standard approach
• 13 ms/call… So 70 objects/second.
• Too slow, so precompute neighbors:
Materialized view.
• At 70 objects/sec
it takes 6 months to compute a billion objects.
Zone Approach
• Divide space into zones
• Key points by Zone, offset
(on the sphere this need wrap-around margin.)
• Point search
look in a few zones
at a limited offset: ra ± r
a bounding box that has
1-π/4 false positives
• All inside the relational engine
• Avoids “impedance mismatch”
r
• Can “batch” all-all comparisons
• 33x faster and parallel
√(r +(ra-zoneMax) )
cos(radians(zoneMax))
6 days, not 6 months!
2
ra-zoneMax
x
2
zoneMax
Ra ± x
In SQL
select o1.objID
from zone o1
where o1.zoneID between
floor((@dec-@r)/@zoneHeight) and
floor((@dec+@r)/@zoneHeight)
and o1.ra between @ra - @r and @ra + @r
and o1.dec between @dec-@r and @dec+@r
-- find objects
-- in the zoned table
-- where zone #
-- overlaps the circle
-- quick filter on ra
-- quick filter on dec
Bounding
box
and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))))
< @r
-- careful filter on distance
Eliminates the
~ 21% = 1-π/4
False positives
Summary
• SQL is a set oriented language
• You can express constraints as rows
• Then You
– Can evaluate LOTS of predicates per second
– Can do set algebra on the predicates.
• Benefits from SQL parallelism
• SQL == Prolog?