Data Challenges I'm Struggling With
TerraServer Web Site
USGS Image Data
• Digital OrthoQuads
– 15 TB, 280,000 files uncompressed
– Digitized aerial imagery
– 96% coverage of conterminous US
– 1 meter resolution
– < 15 years old
• Urban Area
– 1 foot resolution
– Natural Color
– 133 major U.S. cities
– 30 available 2004
– 2001 or later
– Produced by NIMA for Homeland Security
• Digital Raster Graphics
– 1 TB compressed TIFF, 65,000 files
– Scanned topo maps
– 100% U.S. coverage
– 1:24,000, 1:100,000 and 1:250,000 scale maps
– Maps vary in age
Image Coverage
• 100% U.S.: Topo Maps (light green), 2m to 1024m resolution
• 96% of 48 Conterminous States (dark green): Ortho Imagery, 1m to 1024m resolution
What is TerraServer?
• A remote sensing image store
– Images are “shredded” into small, uniform sized tiles
– Multiple input images are mosaicked into a single “scene”
• An HTML based web application
– Navigate by point-n-click, gazetteer or coordinates
– Pan-n-zoom by clicking on simple navigation buttons
– (No fancy Java, ActiveX, or Flash controls)
• An XML based web service
– Meta-data and tiles accessible via W3C web service API
• An OpenGIS compliant Web Map Server
– Tiles merged into single “map” image
– Re-projection from UTM to “geographic” supported
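The OpenGIS Web Map Server interface can be exercised with a plain HTTP GetMap request. Below is a minimal sketch in Python; the endpoint path and layer name are placeholders rather than TerraServer's actual ones, while the query parameters are the standard WMS 1.1.1 set.

    # Minimal sketch of an OGC WMS GetMap request URL.
    # The endpoint and layer name are hypothetical placeholders; the
    # parameters are the standard WMS 1.1.1 GetMap set.
    from urllib.parse import urlencode

    def getmap_url(bbox, width=600, height=400,
                   layer="DOQ",                                  # placeholder layer name
                   endpoint="http://example.org/ogcmap.ashx"):   # placeholder endpoint
        params = {
            "SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
            "LAYERS": layer, "STYLES": "",
            "SRS": "EPSG:4326",   # "geographic" lat/lon, per the re-projection bullet above
            "BBOX": ",".join(f"{v:.6f}" for v in bbox),  # minlon,minlat,maxlon,maxlat
            "WIDTH": width, "HEIGHT": height, "FORMAT": "image/jpeg",
        }
        return endpoint + "?" + urlencode(params)

    print(getmap_url((-122.52, 37.70, -122.35, 37.83)))   # San Francisco area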
Design Objectives
User/App Goals
• Public: Access to remote sensing data with no GIS expertise required
• Ubiquitous: No special hw/sw required by client
• Delivery: All online/Internet based, no tape or CD distribution
• Simple: Designed to be used by a “6th grade geography student”
Technology Goals
• Scale-up: creating multi-TB PC Server
• Availability: Test software and ops in a 24x7 situation
• Lights out: all operations & maintenance occurs remotely
• Easy: Minimal ops and dev staff
• Programmable: Meta & imagery data accessible as a “web service”
TerraServer Concepts
• Tile: An “addressable” chunk extracted from a “Theme”
• Theme: An image asset/data-set in a single projection “class”, i.e. UTM NAD83, Geographic, etc.
• Scene: A grouping of related Tiles that can be shown together as a “seamless mosaic” without any image processing tricks, e.g. a UTM zone, a single satellite swath, etc.
• Scale: An ID# assigned to the ground resolution of pixels in a Tile
• Image Pyramid: The set of Scales a Theme’s Tiles are rendered in and stored at, e.g. DOQ scales 10..21 (1 m/pixel to 2048 m/pixel)
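A minimal sketch of these concepts as Python data classes; the field names are illustrative, not the actual TerraServer schema.

    # Illustrative data model for the concepts above (not the actual TerraServer schema).
    from dataclasses import dataclass

    @dataclass
    class Theme:
        name: str            # e.g. "DOQ" or "DRG"
        projection: str      # single projection "class", e.g. "UTM NAD83"
        scales: range        # the image pyramid, e.g. range(10, 22) for DOQ

    @dataclass
    class Scene:
        theme: Theme
        scene_id: int        # e.g. a UTM zone or a single satellite swath

    @dataclass
    class Tile:
        scene: Scene
        scale: int           # Scale ID (10 = 1 m/pixel)
        x: int               # grid X within the Scene, counted from the lower-left corner
        y: int               # grid Y within the Scene
        data: bytes = b""    # the 200x200 compressed image payload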
More Concepts
• Scale System: Begins at 10 for 1 meter/pixel resolution, +/- 1 for each factor of 2 in resolution, i.e. 11 = 2m, 12 = 4m, 9 = 0.5m
• Grid System: X,Y identifies a Tile’s location within a Scene. Positive, scale-based integers starting from the lower-left corner of the Scene
• Tile Shape: Square
• Tile Size: 200, i.e. 200x200 pixels (scale and grid arithmetic sketched below)
[Figure: tile addressing within a UTM Zone scene, showing 200x200 pixel tiles on a scale-based X,Y grid that starts at the lower-left corner; eastings run from 0 to 1,000,000 m and northings up to 10,000,000 m.]
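A small sketch of the scale and grid arithmetic above: scale 10 is 1 m/pixel, each step halves or doubles the ground resolution, and a tile spans 200 pixels. The function names and the assumption that the scene grid starts at easting/northing zero are for illustration only.

    # Sketch of the Scale and Grid arithmetic (illustrative, not the actual TerraServer code).
    TILE_PIXELS = 200            # tiles are 200x200 pixels

    def meters_per_pixel(scale: int) -> float:
        # Scale 10 = 1 m/pixel; each +1 doubles the ground size of a pixel (11 = 2m, 9 = 0.5m).
        return 2.0 ** (scale - 10)

    def tile_xy(easting_m: float, northing_m: float, scale: int):
        # Grid X,Y of the tile containing a UTM point, counted from the
        # lower-left corner of the Scene (assumed here to sit at easting/northing 0).
        tile_meters = TILE_PIXELS * meters_per_pixel(scale)
        return int(easting_m // tile_meters), int(northing_m // tile_meters)

    # Example: at scale 10 a tile covers 200 m on a side.
    print(meters_per_pixel(10), tile_xy(565_000.0, 5_276_000.0, 10))   # 1.0 (2825, 26380)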
Why 200 x 200 Pixel Tiles
• Fit different size monitors (using html tables)
– 2x1 = 800 x 600 monitors
– 3x2 = 1024 x 768 monitors
– 4x3 = 1280 x 1024 and larger monitors
• Progressive display over slow speed lines
– Tiles average < 10 KB
– Panning takes advantage of browser cache
• Maintains image quality
– Without storing uncompressed full res imagery
– USGS DOQ imagery overlaps by 300 pixels
– USGS DOQ images arrive in random order
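A sketch of how the HTML-table display and cache-friendly panning can work: the page is a grid of <img> cells whose tile coordinates shift by one step when the user pans, so most tiles are served from the browser cache. The tile URL follows the image.aspx?t=..&s=..&x=..&y=..&z=.. pattern quoted later in this deck; the parameter meanings assumed here (theme, scale, grid X, grid Y, UTM zone) and the endpoint name are illustrative.

    # Sketch of a 200x200 tile grid rendered as an HTML table (illustrative only).
    def tile_table(theme, scale, zone, x0, y0, cols, rows,
                   base="http://example.org/tile.ashx"):   # placeholder tile endpoint
        html = ["<table cellspacing=0 cellpadding=0>"]
        for r in range(rows):                      # top row shows the largest grid Y
            cells = []
            for c in range(cols):
                src = f"{base}?t={theme}&s={scale}&x={x0 + c}&y={y0 + rows - 1 - r}&z={zone}"
                cells.append(f'<td><img width=200 height=200 src="{src}"></td>')
            html.append("<tr>" + "".join(cells) + "</tr>")
        html.append("</table>")
        return "\n".join(html)

    # Panning one tile east just re-renders with x0+1: all but one column of
    # tiles is already in the browser cache.
    print(tile_table(theme=1, scale=10, zone=10, x0=2825, y0=26380, cols=3, rows=2))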
TerraServer Hardware
• Storage Bricks
– “Commodity servers”
– 4 TB raw / 2 TB RAID1 SATA storage
– Dual 2 GHz + 4 GB RAM
– 3 Bricks = TerraServer data
– Data partitioned
– Moving to Yukon
– Working on low-TCO auto-manage
• Low Cost Availability: Pair & Spare
– RAID1 Mirroring
– Mirrored Bunches (Yukon log ship?)
– Spare Brick
– Web Application: load balances mirrors, uses surviving database on failure
– KVM / IP
The TerraServer Story
http://terraserver-usa.com/image.aspx?t=4&s=9&x=5655&y=52766&z=10&w=2
• 1997: 8 DEC Alpha
– 8 GB RAM
– 480 x 18 GB disks
– ~1 TB
• 2000: 4x8 Pentium III 600 MHz
– 16 GB RAM
– 540 x 36 GB FC SCSI disks
– FC SAN
– ~18 TB
• 2004: 7x2 Xeon
– ~100 x 250 GB SATA disks
– 28 TB
– 70 k$
– NO TAPE
• Next:
– Geo-Plex
SAN: 2 M$ capital, 200 k$/y rent
[Cost breakdown: Rack Units (U) 57%, Servers 14%, LAN Support & Gear 12%, Egress (Mbps) 9%, Headcount Direct 8%]
Bricks: 100 k$ capital, 100 k$/y rent
[Cost breakdown: Rack Units (U) 30%, Servers 20%, LAN Support & Gear 20%, Egress (Mbps) 18%, Headcount Direct 12%]
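Using the capital and rent figures above, a quick back-of-the-envelope comparison; the three-year horizon is an assumption for illustration.

    # Rough SAN-vs-Bricks cost comparison over an assumed horizon, using the deck's figures.
    def total_cost(capital_dollars, rent_per_year, years):
        return capital_dollars + rent_per_year * years

    years = 3   # assumed horizon, for illustration only
    san    = total_cost(2_000_000, 200_000, years)   # SAN: 2 M$ capital, 200 k$/y rent
    bricks = total_cost(  100_000, 100_000, years)   # Bricks: 100 k$ capital, 100 k$/y rent
    print(san, bricks, san / bricks)                 # 2600000 400000 6.5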
Soon: The Virtual Observatory
• Many new surveys are coming
– SDSS is a dry run for the next ones
– LSST will be 5TB/night
• All the data will be on the Internet
– ftp, web services…
• Data and applications will be
associated with the projects
– Distributed world wide, cross-indexed
– Federation is a must
• Will be the best telescope in the world
– World Wide Telescope
• Finds the “needle in the haystack”
• Successful demonstrations NOW!
SkyServer, Virtual Observatory
• The Problem:
– Bring Astronomy data online
– Integrate the online archives
• Our/my focus
– Database and architecture help
– Algorithms
– Implementation
– Connect to CS community
The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail than any previous measurement.
[Earlier redshift survey maps shown for comparison: da Costa et al. 1995; de Lapparent, Geller & Huchra 1986; Gregory & Thompson 1978]
Features of the SDSS
2.5m telescope, located at Apache Point, NM
3 degree field of view.
Zero distortion focal plane.
Two surveys in one:
Photometric survey in 5 bands.
Spectroscopic redshift survey.
Huge CCD Mosaic
30 CCDs 2K x 2K (imaging)
22 CCDs 2K x 400 (astrometry)
Two high resolution spectrographs
2 x 320 fibers, with 3 arcsec diameter.
R=2000 resolution with 4096 pixels.
Spectral coverage from 3900Å to 9200Å.
Automated data reduction
Over 70 man-years of development effort.
(Fermilab + collaboration scientists)
Very high data volume
Expect over 20 TB of raw data.
About 1 TB processed catalogs.
Data made available to the public.
The Telescope
Special 2.5m telescope
3 degree field of view
Zero distortion focal plane
Wind screen moved separately
How Good is the Telescope?
• SDSS telescope has 120 Million CCD pixels:
– 55 second photometric exposure.
– 8 MB/sec data rate.
– 0.4 arc-sec pixel size.
– 0.4 arc-sec is about 1/8 inch (3 mm) seen from one mile (1.7 km) away
• Also a spectroscopic survey of 1 million objects
• As good as the atmosphere allows; best seeing about 20 nights per year
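A quick small-angle check of that comparison:

    # Small-angle check: how big does 0.4 arc-sec look at a distance of one mile?
    import math

    ARCSEC = math.pi / (180 * 3600)        # radians per arc-second
    distance_m = 1609.0                    # one statute mile (the deck rounds it to 1.7 km)
    size_m = distance_m * 0.4 * ARCSEC     # s = d * theta
    print(f"{size_m * 1000:.1f} mm")       # ~3.1 mm, i.e. about 1/8 inch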
The Photometric Survey
Northern Galactic Cap
5 broad-band filters (u', g', r', i', z')
limiting magnitudes (22.3, 23.3, 23.1, 22.3, 20.8)
drift scan of 10,000 square degrees
55 sec exposure time
40 TB raw imaging data -> pipeline -> 100,000,000 galaxies, 50,000,000 stars
calibration to 2% at r'=19.8
only done in the best seeing (20 nights/yr)
pixel size is 0.4 arcsec,
astrometric precision is 60 milliarcsec
Southern Galactic Cap
multiple scans (> 30 times) of the same stripe
Continuous data rate of 8 Mbytes/sec
The Spectroscopic Survey
Measure redshifts of objects -> distance
SDSS Redshift Survey:
1 million galaxies
100,000 quasars
100,000 stars
Two high throughput spectrographs
spectral range 3900-9200 Å.
640 spectra simultaneously.
R=2000 resolution.
Automated reduction of spectra
Very high sampling density and completeness
Objects in other catalogs also targeted
The Mosaic Camera
Photometric Calibrations
The SDSS will create a new photometric system: u' g' r' i' z'
Primary standards: observed with the USNO 40-inch telescope in Flagstaff
Secondary standards: observed with the SDSS 20-inch telescope at Apache Point – calibrating the SDSS imaging data
The Spectrographs
Two double spectrographs
very high throughput
two 2048x2048 CCD detectors
mounted on the telescope
light fed through slithead
The Fiber Feed System
Galaxy images are captured by optical fibers lined up on the spectrograph slit
Manually plugged during the day into Al plugboards
640 fibers in each bundle
The largest fiber system today
Spectrograph Status
Spectrographs:
Laboratory observations of solar spectrum
First astronomical observations March 1999
First Light Images
Telescope:
First light May 9th 1998
Equatorial scans
The First Stripes
Camera:
5 color imaging of >100 square degrees
Multiple scans across the same fields
Photometric limits as expected
[Example fields: NGC 2068, UGC 3214, NGC 6070]
The First Quasars
Three of the four highest redshift quasars have been found in the first SDSS test data!
The Stripes
• 25 stripes over the SDSS area, covering about 2800 square degrees
• Resolution: 0.4 arc seconds.
• About 20% lost due to bad seeing
• Masks: seeing, bright stars, etc.
• Images generated from query by web service
The Masks
• A Stripe - masks
• Masks are derived from the database
– Search and intersect extended objects with boundaries
Major Changes in Astronomy
• Visual Observation --> Photographic Plates --> Massive Scans of the Sky collecting Terabytes
• A Practice Scan of the SDSS Telescope Discovered 3 of the 4 most Distant Quasars!
• SDSS plus other Surveys will yield a Digital Sky
– Telescope Quality Data available Online
– Spatial Data Mining will find new objects
– New research areas: Study Density Fluctuations
Next Generation
• 50 TB/night
• First light 2010
• Survey sky every 2 weeks
• Survey time domain
• Lots of activity outside the optical domain
Moving Data Bricks
• The cheapest and fastest way to move a Terabyte cross country is sneakernet
– 24 hours = 4 MB/s
– 50$ shipping vs 1,000$ WAN cost
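A sketch of the effective-bandwidth arithmetic behind the comparison; payload size and door-to-door transit time are parameters, since the quoted 4 MB/s figure depends on what is shipped and how long it spends in transit.

    # Effective bandwidth of shipping disks ("sneakernet") = bytes moved / door-to-door time.
    def sneakernet_mbps(payload_tb: float, transit_hours: float) -> float:
        payload_mb = payload_tb * 1_000_000          # TB -> MB (decimal units)
        return payload_mb / (transit_hours * 3600)

    # 1 TB delivered in 24 hours is ~11.6 MB/s; roughly three days door-to-door
    # works out to about the 4 MB/s quoted above.
    print(sneakernet_mbps(1.0, 24), sneakernet_mbps(1.0, 72))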
Giga Byte Per Second File Mover
• CERN to Pasadena
– Windows TCP/IP stack improvements
– Opteron demo
– Disk-to-Disk at 550MBps now (~2 TB/Hour)
• What we learned:
– Linux TCP stack is good/better at high performance; we are catching up
– NTFS is better than various Linux file systems
– Near the PCI-X limit and the TCP limit
– Good way to engage the community
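The ~2 TB/hour figure follows directly from the 550 MBps rate:

    # Check: 550 MB/s of sustained disk-to-disk throughput over one hour.
    rate_mbps = 550
    tb_per_hour = rate_mbps * 3600 / 1_000_000   # MB -> TB (decimal units)
    print(tb_per_hour)                           # 1.98, i.e. ~2 TB/hour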
[Chart: CERN to Caltech transfer speeds (MBps), Mar-04 through Sep-04, showing "File Transfer MBps" and "1 Stream tcp MBps" series (Newisys -> Newisys) climbing toward the PCI-X limit. GOAL: 1 GBps disk-to-disk.]