Building Peta Byte Data Stores
Jim Gray
Microsoft Research
Research.Microsoft.com/~Gray
The Asilomar Report on Database Research
Phil Bernstein, Michael Brodie, Stefano Ceri, David DeWitt, Mike Franklin, Hector Garcia-Molina, Jim Gray, Jerry Held, Joe Hellerstein, H. V. Jagadish, Michael Lesk, Dave Maier, Jeff Naughton, Hamid Pirahesh, Mike Stonebraker, and Jeff Ullman
September 1998
… the field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data.
… -- broadening the definition of database management to embrace all the content of the Web and other online data stores, and rethinking our fundamental assumptions in light of technology shifts.
… encouraging more speculative and long-range work, moving conferences to a poster format, and publishing all research literature on the Web.
http://research.microsoft.com/~gray/Asilomar_DB_98.html
So, how are we doing?
• Capture, store, analyze, present terabytes?
• Making web data accessible?
• Publishing on the web (CoRR?)
• Posters-Workshops vs Conferences-Journals?
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
So: You can store everything,
Anywhere in the world
Online everywhere
• Research driven by apps:
– TerraServer
– National Virtual Astronomy Observatory.
How Much Information Is There?
• Soon everything can be recorded and indexed
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search are key technologies
[Chart (www.lesk.com/mlesk/ksg97/ksg.html): how much information is there? A scale from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, marking a book, a photo, a movie, all LoC books (words), all books (multimedia), and everything ever recorded]
(Small-end prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli)
Trends:
ops/s/$ Had Three Growth Phases
• 1890-1945: mechanical, relay; 7-year doubling
• 1945-1985: tube, transistor, ...; 2.3-year doubling
• 1985-2000: microprocessor; 1.0-year doubling
[Chart: ops per second per $, 1880-2000 (log scale), doubling every 7.5 years, then every 2.3 years, then every 1.0 years]
Storage capacity beating Moore’s law
• Disk TB shipped per year growing 112%/y vs Moore's Law at 58.7%/y
• 4 k$/TB today (raw disk)
Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf
[Chart: disk TB shipped per year, 1988-2000, log scale approaching an exabyte/year, vs. the Moore's-Law line]
  Moore's law:    58.70%/year
  Revenue:         7.47%
  TB growth:     112.30% (since 1993)
  Price decline:  50.70% (since 1993)
Cheap Storage and/or Balanced System
• Low-cost storage (2 x 3 k$ servers): 6 K$/TB
  – 2 x (800 MHz, 256 MB + 8 x 80 GB disks + 100 MbE)
• Balanced server (5 k$ / 0.64 TB):
  – 2 x 800 MHz (2 k$)
  – 512 MB
  – 8 x 80 GB drives (2.4 k$)
  – Gbps Ethernet + switch (500 $/port)
  – 10 k$/TB, 20 k$/RAIDed TB
Hot Swap Drives for Archive or Data Interchange
• 35 MBps write (so can write N x 80 GB in 40 minutes)
• 80 GB shipped overnight = ~N x 3 MB/second @ 19.95 $/night
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 access per second / 5 GB (VERY cold data)
• It’s a tape!
[Figure: a 1 TB drive with 100 MB/s bandwidth and 200 Kaps]
Disk vs Tape
Disk:
– 80 GB
– 35 MBps
– 5 ms seek time
– 3 ms rotate latency
– 4 $/GB for drive, 3 $/GB for ctlrs/cabinet
– 4 TB/rack
– 1 hour scan
Tape:
– 40 GB
– 10 MBps
– 10 sec pick time
– 30-120 second seek time
– 2 $/GB for media, 8 $/GB for drive+library
– 10 TB/rack
– 1 week scan
Guestimates (photo): CERN: 200 TB; 3480 tapes; 2 columns = 50 GB; rack = 1 TB = 12 drives
The price advantage of tape is narrowing, and the performance advantage of disk is growing.
At 10 K$/TB, disk is competitive with nearline tape.
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1GBps it takes 12 days!
• Store it in two (or more) places online (on disk?).
A geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the two copies differently
(e.g.: one by time, one by space)
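A quick check of that restore time (my arithmetic, not on the slide):
\( 1\,\mathrm{PB} \div 1\,\mathrm{GB/s} = 10^{15}\,\mathrm{B} \div 10^{9}\,\mathrm{B/s} = 10^{6}\,\mathrm{s} \approx 11.6\ \mathrm{days} \)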
Next step in the Evolution
• Disks become supercomputers
– Controller will have 1 Bips (a billion instructions per second), 1 GB RAM, 1 GBps net
– And a disk arm.
• Disks will run full-blown app/web/db/os stack
• Distributed computing
• Processors migrate to transducers.
Terabyte (Petabyte) Processing
Requires Parallelism
Parallelism: use many little devices in parallel.
• 1 terabyte at 10 MB/s: 1.2 days to scan
• 1,000-way parallel (100 processors & 1,000 disks): 100-second scan
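The scan arithmetic behind those two numbers (my arithmetic, consistent with the slide):
\( 1\,\mathrm{TB} \div 10\,\mathrm{MB/s} = 10^{5}\,\mathrm{s} \approx 1.2\ \mathrm{days}; \qquad 10^{5}\,\mathrm{s} \div 1{,}000 = 100\,\mathrm{s} \)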
Parallelism Must Be Automatic
• There are thousands of MPI programmers.
• There are hundreds of millions of people using parallel database search.
• Parallel programming is HARD!
• Find design patterns and automate them.
• Data search/mining has parallel design patterns.
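A minimal illustration of that design pattern (my example, not from the talk; the Galaxies table and the u, g, r columns echo the SDSS queries later in the deck, and the stripe column is hypothetical): the query is purely declarative, so a parallel DBMS can spread the scan and the partial aggregates across disks and processors without the user writing any parallel code.

-- Declarative aggregation: the DBMS may scan partitions in parallel,
-- compute partial counts/averages per partition, then combine them.
select stripe,                      -- hypothetical sky-stripe column
       count(*) as n_objects,
       avg(r)   as mean_r_mag
from   Galaxies
where  (u - g) > 1 and r < 21.5     -- same color/brightness cut used later in the talk
group by stripe
order by stripe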
Gilder’s Law:
3x bandwidth/year for 25 more years
• Today:
  – 10 Gbps per channel
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
• In the lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth
• Aggregate bandwidth doubles every 8 months!
Sense of scale
• How fat is your pipe?
• The fattest pipe on the MS campus is the WAN!
  – 94 MBps coast to coast
  – 300 MBps OC48 = G2 (or memcpy())
  – 90 MBps PCI
  – 20 MBps disk / ATM / OC3
[Map: HSCC (high speed connectivity consortium) / DARPA testbed spanning Redmond/Seattle, WA (Microsoft, University of Washington, Pacific Northwest Gigapop), San Francisco, CA, New York, and Arlington, VA (DARPA, Information Sciences Institute), over Qwest: 5626 km, 10 hops]
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
So: You can store everything,
Anywhere in the world
Online everywhere
• Research driven by apps:
– TerraServer
– National Virtual Astronomy Observatory.
Interesting Apps
• EOS/DIS
• TerraServer
• Sloan Digital Sky Survey
[Scale bar: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, with a "today, we are here" marker]
The Challenge -- EOS/DIS
• Antarctica is melting -- 77% of fresh water liberated
  – sea level rises 70 meters
  – Chico & Memphis become beach-front property
  – New York, Washington, SF, LA, London, Paris are under water
• Let’s study it! Mission to Planet Earth
• EOS: Earth Observing System (17 B$ => 10 B$)
  – 50 instruments on 10 satellites, 1999-2003
  – Landsat (added later)
• EOS DIS: Data Information System:
  – 3-5 MB/s raw, 30-50 MB/s processed
  – 4 TB/day
  – 15 PB by year 2007
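A rough consistency check of those volumes (my arithmetic, not on the slide):
\( 50\,\mathrm{MB/s} \times 86{,}400\,\mathrm{s/day} \approx 4.3\,\mathrm{TB/day}; \qquad 4\,\mathrm{TB/day} \times 365 \times 10\,\mathrm{yr} \approx 15\,\mathrm{PB} \)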
The Process Flow
• Data arrives and is pre-processed.
  – instrument data is calibrated, gridded, averaged
  – geophysical data is derived
• Users ask for stored data OR to analyze and combine data.
• Can make the pull-push split dynamically.
[Diagram: push processing on ingest, pull processing on demand, plus other data sources]
Key Architecture Features
• 2+N data center design
• Scaleable OR-DBMS
• Emphasize Pull vs Push processing
• Storage hierarchy
• Data Pump
• Just-in-time acquisition
2+N data center design
• Duplex the archive (for fault tolerance)
• Let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways); a sketch follows below.
• Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
• Clients and partitions interact via standard protocols
  – HTTP+XML, …
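A minimal sketch of the time/space partitioning idea in SQL (my illustration under assumed table and column names, not the EOS DIS design): each granule of data lives in its own free-standing table with a CHECK constraint on its range, and a partitioned view unions them, so a query that constrains time touches only the relevant partitions.

-- Hypothetical per-year partitions of the observation archive.
create table Obs_1999 (
    obs_time datetime not null
        check (obs_time >= '1999-01-01' and obs_time < '2000-01-01'),
    lat      float    not null,
    lon      float    not null,
    value    float    not null
)
create table Obs_2000 (
    obs_time datetime not null
        check (obs_time >= '2000-01-01' and obs_time < '2001-01-01'),
    lat      float    not null,
    lon      float    not null,
    value    float    not null
)
go
-- Partitioned view: callers query Observations; the optimizer can prune
-- partitions whose CHECK constraint excludes the requested time range.
create view Observations as
    select obs_time, lat, lon, value from Obs_1999
    union all
    select obs_time, lat, lon, value from Obs_2000
go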
Data Pump
• Some queries require reading ALL the data (for reprocessing)
• Each data center scans ALL the data every 2 days.
  – Data rate 10 PB/day = 10 TB/node/day = 120 MB/s (rate check below)
• Compute-on-demand small jobs:
  – less than 100 M disk accesses
  – less than 100 TeraOps
  – (less than 30 minute response time)
• For BIG JOBS, scan the entire 15 PB database
• Queries (and extracts) “snoop” this data pump.
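The per-node rate check (my arithmetic, not on the slide):
\( 10\,\mathrm{TB/node/day} = 10^{13}\,\mathrm{B} \div 86{,}400\,\mathrm{s} \approx 116\,\mathrm{MB/s} \approx 120\,\mathrm{MB/s} \)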
Just-in-time acquisition: 30%
• Hardware prices decline 20%-40%/year
• So buy at the last moment
• Buy the best product that day: commodity
• Depreciate over 3 years so that the facility is fresh
  (after 3 years, cost is 23% of original)
• A 60% decline peaks at 10 M$
EOS DIS Disk Storage Size and Cost
(assume 40% price decline/year)
[Chart: data need (TB, log scale) and storage cost (M$) vs. year, 1994-2008, reaching about 2 PB @ 100 M$]
Problems
• Management (and HSM)
• Design and Meta-data
• Ingest
• Data discovery, search, and analysis
• Auto-parallelism
• Reorg / reprocess
What this system taught me
• Traditional storage metrics
– KAPS: KB objects accessed per second
– $/GB: Storage cost
• New metrics:
– MAPS: megabyte objects accessed per second
– SCANS: Time to scan the archive
– Admin cost dominates (!!)
– Auto parallelism is essential.
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
So: You can store everything,
Anywhere in the world
Online everywhere
• Research driven by apps:
– TerraServer
– National Virtual Astronomy Observatory.
Microsoft TerraServer:
http://TerraServer.Microsoft.com/
• Build a multi-TB SQL Server database
• Data must be
  – 1 TB
  – Unencumbered
  – Interesting to everyone everywhere
  – And not offensive to anyone anywhere
• Loaded:
  – 1.5 M place names from Encarta World Atlas
  – 7 M sq km USGS DOQ (1 meter resolution)
  – 10 M sq km USGS topos (2 m)
  – 1 M sq km from the Russian Space Agency (2 m)
• On the web (world’s largest atlas)
• Sell images with commerce server.
Background
• The Earth’s surface is about 500 tera-square-meters (tm²)
  – the USA is 10 tm²
• 100 tm² of land lies between 70°N and 70°S
• We have pictures of 9% of it:
  – 7 tm² from USGS
  – 1 tm² from the Russian Space Agency
• Someday: multi-spectral images of everywhere, once a day / hour
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks (200x200 pixels)
• Store chunks in the DB (a sketch follows below)
• Navigate with the Encarta™ Atlas
  – globe
  – gazetteer
[Image pyramid: 0.2 x 0.2 km² tile, 0.4 x 0.4 km² image, 0.8 x 0.8 km² image, 1.6 x 1.6 km² image]
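A minimal sketch of the chunk store (my illustration with hypothetical names; the real TerraServer schema shown later is richer): each 200x200-pixel JPEG chunk is a row keyed by theme, pyramid level, and grid position, so rendering a map page is a handful of keyed lookups.

-- Hypothetical tile table: one row per ~10 KB JPEG chunk.
create table Tile (
    ThemeId int   not null,  -- aerial photo, topo map, relief, ...
    Scale   int   not null,  -- level in the image pyramid
    TileX   int   not null,  -- grid column at this scale
    TileY   int   not null,  -- grid row at this scale
    Jpeg    image not null,  -- the compressed 200x200-pixel chunk
    primary key (ThemeId, Scale, TileX, TileY)
)

-- Fetch the 4x4 block of tiles that covers one screenful of map.
select TileX, TileY, Jpeg
from   Tile
where  ThemeId = 1 and Scale = 12
  and  TileX between 4710 and 4713
  and  TileY between 2880 and 2883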
TerraServer 4.0 Configuration
• 3 active database servers (Compaq ProLiant 8500s) plus a passive spare server:
  – SQL\Inst1: topo & relief data
  – SQL\Inst2: aerial imagery
  – SQL\Inst3: aerial imagery
• Web servers: 8 two-processor “Photon” DL360s
• Logical volume structure:
  – One rack per database
  – All volumes triple mirrored (3x)
  – MetaData (101 GB) on 15k rpm 18.2 GB drives
  – Image data on 10k rpm 72.8 GB drives: Image1-Image4, 339 GB each
  – 2 spare volumes allocated per cluster
  – 6 additional 339 GB volumes to be added by year end (2 per DB server)
[Diagram: per-rack storage controllers and volume drive letters omitted]
File Group     Rows (millions)   Total Size (GB)   Data Size (GB)   Index Size (GB)
Admin                        1              0 GB           0.1 GB             0 GB
Gazetteer                   17              5 GB             1 GB             3 GB
Image                      254          2,237 GB         2,220 GB            17 GB
Meta                       254             70 GB            53 GB            17 GB
Search                      46             10 GB             5 GB             5 GB
Grand Total                572          2,322 GB         2,280 GB            42 GB
TerraServer 4.0 Schema
[Schema diagram: the Terra database is organized into Search, Imagery, Gazetteer, Admin, and LoadMgmt areas. Tables shown include: Country Name / AltCountry, State Name / AltState, Place Name / AltPlace, SmallPlaceName, FamousPlace, FamousCategory, FeatureType, ImageSearch, SearchJob, SearchDest, SearchJobLog, ExternalGroup, ExternalLink, ExternalGeo, ImageSource, SourceMeta, ImageMeta, Scale, Imagery, ImageType, Pyramid, NoImage, Media, TerraServer MediaFile, Job, LoadJob, JobQueue, JobSystem.]
BAD OLD Load
[Diagram: imagery arrives on DLT tape and is “tar”-ed onto NT (\Drop’N’ DoJob); an AlphaServer 4100 runs ImgCutter over \Drop’N’\Images; a 100 Mbit EtherSwitch connects to an AlphaServer 8400 with an Enterprise Storage Array (ESA: 60 x 4.3 GB drives plus 3 shelves of 108 x 9.1 GB drives) and an STC DLT tape library; a LoadMgr database tracks the pipeline steps 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place, …, plus Wait 4 Load and Backup LoadMgr stages.]
Remote Management
[Diagram: the current load process. Tapes are mounted and tar archives read at the Executive Briefing Center in Redmond, WA, where a Compaq ProLiant 8500 with a 450 GB staging area runs TerraCutter and TerraScale over the image files; an Active Server Pages loading/scheduling system coordinates the work across the corporate network; cut tiles are loaded through stored procedures into the three 2 TB SQL Server databases at the Internet Data Center in Tukwila, WA; a terminal server provides remote management of the load process.]
After a Year:
• 15 TB of data (raw); ~3 B records
• 2.3 billion hits
• 2.0 billion DB queries
• 1.7 billion images sent (2 TB of download)
• 368 million page views
• 99.93% DB availability
• 4th design now online
• Built and operated by a team of 4 people
[Chart: TerraServer daily traffic, Jun 22, 1998 thru June 22, 1999: sessions, hits, page views, DB queries, and images per day, with daily counts up to ~30M]
[Chart: availability over the year: ~8,640 total hours of up time vs. a few hours of down time (hours:minutes scale), broken out into Operations, Scheduled, and HW+Software outages, consistent with the 99.93% availability above]
TerraServer.Microsoft.NET
A Web Service
• Before .NET: a web browser fetches HTML pages and image tiles over the Internet from the TerraServer web site, which in turn queries the TerraServer SQL Db.
• With .NET: an application program calls the TerraServer web service over the Internet, and the service queries the TerraServer SQL Db. Methods include:
  – GetAreaByPoint, GetAreaByRect
  – GetPlaceListByName, GetPlaceListByRect
  – GetTileMetaByLonLatPt, GetTileMetaByTileId
  – GetTile
  – ConvertLonLatToNearestPlace, ConvertPlaceToLonLatPt
  – …
TerraServer Recent/Current Effort
• Added USGS topographic maps (4 TB)
• High availability (4-node cluster with failover)
• Integrated with Encarta Online
• The other 25% of the US DOQs (photos)
• Adding digital elevation maps
• Open architecture: publish SOAP interfaces
• Adding multi-layer maps (with UC Berkeley)
• Geo-spatial extension to SQL Server
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
So: You can store everything,
Anywhere in the world
Online everywhere
• Research driven by apps:
– TerraServer
– National Virtual Astronomy Observatory.
Astronomy is Changing
(and so are other sciences)
• Astronomers have a few PB
• Doubles every 2 years
• Data is public after 2 years
• So: everyone has ½ the data
  (if the archive doubles every 2 years, the part older than 2 years -- the public part -- is always half the total)
• Some people have 5% more “private data”
• So, it’s a nearly level playing field:
  – most accessible data is public.
(inter) National Virtual Observatory
• Almost all astronomy datasets will be online
• Some are big (>>10 TB)
• Total is a few petabytes
• Bigger datasets coming
• Data is “public”
• Scientists can mine these datasets
• Computer Science challenge:
  – organize these datasets
  – provide easy access to them.
The Sloan Digital Sky Survey
SLIDES BY Alex Szalay
A project run by the Astrophysical Research Consortium (ARC)
The University of Chicago
Princeton University
The Johns Hopkins University
The University of Washington
Fermi National Accelerator Laboratory
US Naval Observatory
The Japanese Participation Group
The Institute for Advanced Study
SLOAN Foundation, NSF, DOE, NASA
Goal: To create a detailed multicolor map of the Northern Sky
over 5 years, with a budget of approximately $80M
Data Size: 40 TB raw, 1 TB processed
Features of the SDSS
Special 2.5m telescope, located at Apache Point, NM
3 degree field of view.
Zero distortion focal plane.
Two surveys in one:
Photometric survey in 5 bands.
Spectroscopic redshift survey.
Huge CCD Mosaic
30 CCDs 2K x 2K (imaging)
22 CCDs 2K x 400 (astrometry)
Two high resolution spectrographs
2 x 320 fibers, with 3 arcsec diameter.
R=2000 resolution with 4096 pixels.
Spectral coverage from 3900Å to 9200Å.
Automated data reduction
Over 70 man-years of development effort.
(Fermilab + collaboration scientists)
Very high data volume
Expect over 40 TB of raw data.
About 3 TB processed catalogs.
Data made available to the public.
Scientific Motivation
Create the ultimate map of the Universe:
  → The Cosmic Genome Project!
Study the distribution of galaxies:
  → What is the origin of fluctuations?
  → What is the topology of the distribution?
Measure the global properties of the Universe:
  → How much dark matter is there?
Local census of the galaxy population:
  → How did galaxies form?
Find the most distant objects in the Universe:
  → What are the highest quasar redshifts?
Cosmology Primer
The Universe is expanding:
  the galaxies move away from us
  spectral lines are redshifted
  v = H0 r   (Hubble’s law)
The fate of the universe depends on the balance between gravity and the expansion velocity:
  Ω = density / critical density
  if Ω < 1, expand forever
Most of the mass in the Universe is dark matter, and it may be cold (CDM):
  Ω_d > Ω_*
The spatial distribution of galaxies is correlated, due to small ripples in the early Universe:
  P(k): power spectrum
The ‘Naught’ Problem
What are the global parameters of the Universe?
  H0   the Hubble constant           55-75 km/s/Mpc
  Ω0   the density parameter         0.25-1
  Λ0   the cosmological constant     0-0.7
Their values are still quite uncertain today...
Goal: measure these parameters with an accuracy of a few percent
High Precision Cosmology!
The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail than any other measurement before.
[Figure: redshift-survey slices compared: Gregory and Thompson 1978; deLapparent, Geller and Huchra 1986; daCosta et al. 1995; SDSS Collaboration 2002]
Area and Size of Redshift Surveys
[Chart: number of objects (10^3 to 10^9, log scale) vs. survey volume in Mpc^3 (10^4 to 10^11, log scale) for CfA+SSRS, SAPM, LCRS, QDOT, 2dF, 2dFR, and the SDSS samples (main, red, abs line, photo-z)]
The Spectroscopic Survey
Measure redshifts of objects → distance
SDSS Redshift Survey:
1 million galaxies
100,000 quasars
100,000 stars
Two high throughput spectrographs
spectral range 3900-9200 Å.
640 spectra simultaneously.
R=2000 resolution.
Automated reduction of spectra
Very high sampling density and completeness
Objects in other catalogs also targeted
First Light Images
Telescope:
First light May 9th 1998
Equatorial scans
The First Stripes
Camera:
5 color imaging of >100 square degrees
Multiple scans across the same fields
Photometric limits as expected
NGC 6070
The First Quasars
Three of the four highest-redshift quasars have been found in the first SDSS test data!
SDSS Data Products
  Object catalog:          parameters of >10^8 objects         400 GB
  Redshift Catalog:        parameters of 10^6 objects            2 GB
  Atlas Images:            5 color cutouts of >10^9 objects    1.5 TB
  Spectra:                 in a one-dimensional form, 10^6      60 GB
  Derived Catalogs:        clusters, QSO absorption lines       60 GB
  4x4 Pixel All-Sky Map:   heavily compressed, 5 x 10^5          1 TB
All raw data saved in a tape vault at Fermilab
Parallel Query Implementation
• Getting 200 MBps/node thru SQL today
• = 4 GB/s on a 20-node cluster
[Diagram: user interface and analysis engine talk to a master SX engine; a DBMS federation of slave nodes, each running a DBMS over RAID storage, executes the query in parallel]
Who will be using the archive?
Power Users
sophisticated, with lots of resources
research is centered around the archive data
moderate number of very intensive queries
mostly statistical, large output sizes
General Astronomy Public
frequent, but casual lookup of objects/regions
the archives help their research, but not central to it
large number of small queries
a lot of cross-identification requests
Wide Public
browsing a ‘Virtual Telescope’
can have large public appeal
need special packaging
could be a very large number of requests
How will the data be analyzed?
The data are inherently multidimensional
=> positions, colors, size, redshift
Improved classifications result in complex N-dimensional volumes
=> complex constraints, not ranges
Spatial relations will be investigated
=> nearest neighbors
=> other objects within a radius
Data Mining: finding the ‘needle in the haystack’
=> separate typical from rare
=> recognize patterns in the data
Output size can be prohibitively large for intermediate files
=> import output directly into analysis tools
Different Kind of Spatial Data
• All objects are on the celestial sphere surface
  – Position a point by 2 spherical angles (RA, Dec).
  – Position by Cartesian {x,y,z} -- easier to search ‘within 1 arc-minute’ (see the sketch below).
• Hierarchy of spherical triangles for indexing
  – SDSS tree is 5 levels deep: 8192 triangles
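A minimal sketch of the Cartesian trick (my illustration with a hypothetical Obj table holding unit-vector columns cx, cy, cz; the HTM triangle index would prune candidates before this test): two directions are within an angle θ exactly when their dot product exceeds cos θ, so ‘within 1 arc-minute’ becomes a simple predicate.

-- 1 arc-minute = 1/60 degree; cos(1') is just below 1.
declare @cosTheta float
set @cosTheta = cos(radians(1.0/60.0))

-- Hypothetical: find everything within 1 arc-minute of object 12345.
select o2.obj_id
from   Obj o1, Obj o2
where  o1.obj_id = 12345
  and  o2.obj_id <> o1.obj_id
  and  (o1.cx*o2.cx + o1.cy*o2.cy + o1.cz*o2.cz) > @cosTheta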
Experiment with Relational DBMS
• See if SQL’s good indexing and scanning compensates for poor object support.
• Leverage fast/big/cheap commodity hardware.
• Ported a 40 GB sample database (from the SDSS sample scan) to SQL Server 2000.
• Building a public web site and data server.
20 Astronomy Queries
• Implemented a spatial access extension to SQL (HTM).
• Implemented the 20 astronomy queries in SQL (see paper for details).
• 15 M rows, 378 cols, 30 GB. Can scan it in 8 minutes (disk-IO limited).
• Many queries run in seconds.
• Create covering indexes on queried columns.
• Create a ‘Neighbors’ table listing objects within 1 arc-minute (5 neighbors on average) for spatial joins (sketch below).
• Install some more disks!
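A sketch of those two speed-ups (my illustration; sxTag, neighbors, and UObj_id appear in the query on the next slide, while the index name, column types, and the distance column are assumptions): a covering index lets a color-cut query be answered from the index alone, and the precomputed neighbor pairs turn the spatial self-join into an ordinary equijoin.

-- Covering index: color-cut queries read only the index, not the base table.
create index ix_sxTag_colors on sxTag (UObj_id, u, g, r, i)

-- Precomputed pairs of objects within 1 arc-minute of each other.
create table neighbors (
    UObj_id          bigint not null,
    neighbor_UObj_id bigint not null,
    dist_arcmin      float  not null,   -- assumed; handy for tighter radii
    primary key (UObj_id, neighbor_UObj_id)
)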
Query to Find Gravitational Lenses
Find all objects within 1 arc-minute of each other whose colors are very similar (the colors u-g, g-r, r-i agree to within 0.05 magnitudes).
SQL Query to Find
Gravitational Lenses
select count(*) from sxTag T, sxTag U, neighbors N
where T.UObj_id = N.UObj_id
and U.UObj_id = N.neighbor_UObj_id
and N.UObj_id < N.neighbor_UObj_id -- no dups
and T.u>0 and T.g>0 and T.r>0 and T.i>0
and U.u>0 and U.g>0 and U.r>0 and U.i>0
and ABS((T.u-T.g)-(U.u-U.g))<0.05 -- similar color
and ABS((T.g-T.r)-(U.g-U.r))<0.05
and ABS((T.r-T.i)-(U.r-U.i))<0.05
Finds 5223 objects, executes in 6 minutes.
SQL Results so far.
• Have run 17 of 20 queries so far.
• Most queries are IO bound, scanning at 80 MB/sec on 4 disks in 6 minutes (at the PCI bus limit).
• Covering indexes reduce execution to < 30 secs.
• Common to get grid distributions:
select
    convert(int, ra*30)/30.0,    -- ra bucket (2 arc-minute cells)
    convert(int, dec*30)/30.0,   -- dec bucket
    count(*)                     -- bucket count
from Galaxies
where (u-g) > 1 and r < 21.5
group by convert(int, ra*30)/30.0, convert(int, dec*30)/30.0
Distribution of Galaxies
[Chart: galaxy counts in 2-arcmin cells over the scanned stripe, RA ≈ 212°-217°, Dec ≈ -1.2° to +1.2°; counts range from 0 to about 30 per cell]
Outline
• Technology:
– 1M$/PB: store everything online (twice!)
• End-to-end high-speed networks
– Gigabit to the desktop
So: You can store everything,
Anywhere in the world
Online everywhere
• Research driven by apps:
– TerraServer
– National Virtual Astronomy Observatory.