Where The Rubber Meets the Sky
Giving Access to Science Data
Jim Gray
Microsoft Research
[email protected]
Http://research.Microsoft.com/~Gray
Alex Szalay
Johns Hopkins University
[email protected]
1
Outline
• Want to build a TerraServer for Hungary?
• My view of eScience
2
TerraServer / TerraService
http://terraService.Net/
http://TerraServer-USA.com/
• USGS Photo of US
• Online since June 1998
• Operated by Microsoft
• 20 TB data source
• 10 M web hits/day
• A web service
• Our laboratory
• I recommend you clone it for Hungary
• 100x less data (92k km²), very useful
– Education, land management, science
Info framework.
3
TerraServer – Today – LOW TCO
• Storage Bricks
– “Commodity servers”
– 4 TB raw / 2 TB RAID1 SATA storage
– Dual 2 GHz + 4 GB RAM
• Bunch
– 3 Bricks = TerraServer data
– Data partitioned
KVM / IP
• Low Cost Availability Pair & Spare
– RAID1 Mirroring
– Mirrored Bunches
– Spare Brick
– Web Application
• Load balances mirrors
• Uses surviving database on failure
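A minimal sketch of the "load balances mirrors, uses surviving database on failure" behaviour described above. The class name, the method names, and the database names `db_a` and `db_b` are hypothetical; the slide does not say how the TerraServer web application actually does this.

```python
import itertools


class MirroredPair:
    """Round-robin reads over mirrored databases, skipping failed ones.

    Illustrative only: names and structure are assumptions, not the
    TerraServer implementation.
    """

    def __init__(self, mirrors):
        self.mirrors = list(mirrors)
        self.healthy = set(self.mirrors)
        self._cycle = itertools.cycle(self.mirrors)

    def mark_failed(self, name):
        # Stop routing requests to a mirror that has failed.
        self.healthy.discard(name)

    def mark_repaired(self, name):
        # Put a repaired (or spare) mirror back into rotation.
        if name in self.mirrors:
            self.healthy.add(name)

    def pick(self):
        # Alternate between mirrors; fall back to whichever one survives.
        if not self.healthy:
            raise RuntimeError("no surviving mirror")
        for _ in range(len(self.mirrors)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no surviving mirror")
```

With both mirrors healthy, successive calls to `pick()` alternate between them; after `mark_failed("db_a")`, every call returns `db_b`.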
4
Outline
• Want to build a TerraServer for Hungary?
• My view of eScience
5
New Science Paradigms
• Thousand years ago:
science was empirical
describing natural phenomena
• Last few hundred years:
theoretical branch
using models, generalizations
$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
• Last few decades:
a computational branch
simulating complex phenomena
• Today:
data exploration (eScience)
unify theory, experiment, and simulation
using data management and statistics
– Data captured by instruments
Or generated by simulator
– Processed by software
– Scientist analyzes database / files
6
Information Avalanche and eScience
• In science, industry, government, …
– better observational instruments, and
– better simulations
are producing a data avalanche
• New emphasis on informatics:
– Capturing, Organizing,
Summarizing, Analyzing, Visualizing
• Each science is objectifying itself
– Defining core concepts
– Integrating all data and literature online
– Hungary could be a leader in this
(you have the Martians – great tech education)
[Images: courtesy C. Meneveau & A. Szalay @ JHU; BaBar, Stanford; P&E Gene Sequencer from http://www.genome.uci.edu/; Space Telescope]
7
The Big Picture
[Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in, answers come out]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Data Query and Visualization tools
• Support/training
• Performance
– Execute queries in a minute
– Batch (big) query scheduling
8
The Virtual Observatory
• Premise: most data is (or could be) online
• The Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band:
optical, x-ray, radio..
– As deep as the best instruments (2 years ago).
– It is up when you are up
– The “seeing” is always great
– It’s a smart telescope:
links objects and data to literature
• Software is the capital expense
– Share, standardize, reuse..
9
What X-info Needs from us (cs)
(not drawn to scale)
[Diagram: Scientists bring Science Data & Questions; Miners supply Data Mining Algorithms; Plumbers supply the Database to store data and execute queries; plus Question & Answer and Visualization Tools]
10
Data Access Hitting a Wall
Current science practice based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1$)
• … 2 days and 1K$
• … 3 years and 1M$
• Oh!, and 1PB ~5,000 disks
• At some point you need
indices to limit search
parallel data search and analysis
• This is where databases can help
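The arithmetic behind the wall can be sketched in a few lines. The sustained scan rate of 10 MB/s and the index fanout of 100 are assumptions chosen for illustration; the slide gives only round numbers, and this rate matches them in order of magnitude rather than exactly.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

ASSUMED_SCAN_RATE = 10 * 10**6  # bytes/second; an assumption, not a measurement


def scan_seconds(size_bytes, bytes_per_second=ASSUMED_SCAN_RATE):
    """Time to read every byte once, as a GREP-style search must."""
    return size_bytes / bytes_per_second


def index_levels(n_records, fanout=100):
    """Levels of a tree index (assumed fanout) needed to reach one record."""
    levels, reach = 0, 1
    while reach < n_records:
        reach *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for label, size in [("1 GB", 10**9), ("1 TB", 10**12), ("1 PB", 10**15)]:
        days = scan_seconds(size) / SECONDS_PER_DAY
        print(f"{label}: full scan takes {days:.3g} days at the assumed rate")
    print("index levels for 10^12 records:", index_levels(10**12))
```

At the assumed rate a petabyte scan takes a little over three years, while an index over a trillion records needs only a handful of page reads per lookup. That contrast is the case for indices and parallel search.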
11
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: dark matter, dark energy,
turbulence, ecosystem dynamics
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N², likelihood techniques N³
• As data and computers grow at Moore’s Law,
we can only keep up with N logN
• A way out?
– Relax optimal notion (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
12
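The scaling argument on this slide can be made concrete with a pair count, the core of a correlation function. The sketch below is a one-dimensional toy with integer positions, not an astronomy algorithm: it counts pairs closer than `r` by the brute-force N² method and by an N log N method using a sort plus binary search. Function names are hypothetical.

```python
import bisect


def pairs_brute(xs, r):
    """Count pairs with |xi - xj| <= r by checking every pair: O(N^2)."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )


def pairs_sorted(xs, r):
    """Same count via sort + binary search: O(N log N)."""
    s = sorted(xs)
    count = 0
    for i, x in enumerate(s):
        # Index one past the last point whose value is <= x + r.
        hi = bisect.bisect_right(s, x + r)
        # Points strictly after position i and before hi are within r of x.
        count += hi - i - 1
    return count
```

Both functions return the same count; only the second keeps pace as N grows. In higher dimensions the same idea needs a spatial index instead of a sort.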
Smart Data: Unifying DB and Analysis
• There is too much data to move around
Do data manipulations at database
Move Mohamed to the mountain, not the mountain to Mohamed.
– Build custom procedures and functions into DB
– Unify data Access & Analysis
– Examples
• Statistical sampling and analysis
• Temporal and spatial indexing
• Pixel processing
• Automatic parallelism
• Auto (re)organize
• Scalable to Petabyte datasets
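A minimal sketch of "build custom procedures and functions into DB", using Python's built-in `sqlite3` module as a stand-in for the science database. The table `photo_obj`, its columns, and the function `ug_color` are hypothetical. The point is that the filter runs inside the database and only a single number comes back.

```python
import sqlite3


def count_color_cut(rows):
    """Count objects with u-g > 1 and r < 21.5 inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photo_obj (obj_id INTEGER, u REAL, g REAL, r REAL)"
    )
    conn.executemany("INSERT INTO photo_obj VALUES (?, ?, ?, ?)", rows)

    # Register a user-defined function so the analysis runs next to the data.
    conn.create_function("ug_color", 2, lambda u, g: u - g)

    (n,) = conn.execute(
        "SELECT COUNT(*) FROM photo_obj WHERE ug_color(u, g) > 1 AND r < 21.5"
    ).fetchone()
    conn.close()
    return n
```

Only the count crosses the database boundary; the rows themselves never move.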
13
Experiment Budgets ¼…½ Software
Software for
• Instrument scheduling
• Instrument control
• Data gathering
• Data reduction
• Database
• Analysis
• Visualization
Millions of lines of code
Repeated for experiment after experiment
Not much sharing or learning
Let’s work to change this
Identify generic tools
• Workflow schedulers
• Databases and libraries
• Analysis packages
• Visualizers
• …
Simulations (computational science) are > ½ software
14
How to Help?
• Can’t learn the discipline before you start
(takes 4 years.)
• Can’t go native – you are a CS person
not a bio,… person
• Have to learn how to communicate
Have to learn the language
• Have to form a working relationship with
domain expert(s)
• Have to find problems that leverage your skills
15
Working Cross-Culture
A Way to Engage With Domain Scientists
• Find someone who is desperate for help
• Communicate in terms of scenarios
• Work on a problem that gives 100x benefit
– Weeks/task vs hours/task
• Solve 20% of the problem
– The other 80% will take decades
• Prototype
• Go from working-to-working; always have
– Something to show
– Clear next steps
– Clear goal
• Avoid death-by-collaboration-meetings.
16
Working Cross-Culture -- 20 Questions:
A Way to Engage With Domain Scientists
• Astronomers proposed 20 questions
• Typical of things they want to do
• Each would require a week or more the old way
(programming in Tcl / C++ / FTP)
• Goal: make it easy to answer questions
• This goal motivates DB and tools design
17
The 20 Queries
Q1: Find all galaxies without saturated pixels within 1' of a
given point of ra=75.327, dec=21.023
Q2: Find all galaxies with blue surface brightness between
23 and 25 mag per square arcsecond, and
-10<super galactic latitude (sgb)<10, and declination
less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the
local extinction is >0.75.
Q4: Find galaxies with an isophotal surface brightness (SB)
larger than 24 in the red band, with an ellipticity>0.5, and
with the major axis of the ellipse having a declination of
between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff
of intensity on disk) and the photometric colors
consistent with an elliptical galaxy. The de Vaucouleurs
profile …
Q6: Find galaxies that are blended with a star, output the
deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width >2000 km/s and
2.5<redshift<2.7.
Q10: Find galaxies with spectra that have an equivalent width
in Hα >40Å (Hα is the main hydrogen spectral line.)
Q11: Find all elliptical galaxies with spectra that have an
anomalous emission line.
Q12: Create a gridded count of galaxies with u-g>1 and r<21.5
over 60<declination<70, and 200<right ascension<210,
on a grid of 2', and create a map of masks over the
same grid.
Q13: Create a count of galaxies for each of the HTM triangles
which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25
&& r<21.75, output it in a form adequate for
visualization.
Q14: Find stars with multiple measurements that have
magnitude variations >0.1. Scan for stars that have a
secondary object (observed at a different time) and
compare their magnitudes.
Q15: Provide a list of moving objects consistent with an
asteroid.
Q16: Find all objects similar to the colors of a quasar at
5.5<redshift<6.5.
Q17: Find binary stars where at least one of them has the
colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another
that have very similar colors: that is where the color
ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their
spectra and at least one galaxy within 10 arcseconds.
Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color
galaxy), in 160<right ascension<170, -25<declination<35
count of galaxies within 30" of it that have a photoz
within 0.05 of that galaxy.
Also some good queries at:
http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
18
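As an illustration of what one of these queries asks for, here is a minimal sketch of the counting half of Q12 in plain Python. It assumes each galaxy is a tuple `(ra, dec, u, g, r)` in degrees and magnitudes, and it uses a flat grid in right ascension and declination with no cos(declination) correction. Both are simplifying assumptions, and the "map of masks" part of Q12 is not attempted.

```python
import math
from collections import Counter

CELL_DEG = 2.0 / 60.0  # a 2 arcminute grid cell, in degrees


def gridded_counts(galaxies, ra_range=(200.0, 210.0), dec_range=(60.0, 70.0)):
    """Count galaxies with u-g > 1 and r < 21.5 per (row, col) grid cell."""
    counts = Counter()
    for ra, dec, u, g, r in galaxies:
        in_region = (
            ra_range[0] < ra < ra_range[1]
            and dec_range[0] < dec < dec_range[1]
        )
        if in_region and (u - g) > 1 and r < 21.5:
            col = math.floor((ra - ra_range[0]) / CELL_DEG)
            row = math.floor((dec - dec_range[0]) / CELL_DEG)
            counts[(row, col)] += 1
    return counts
```

In a database the same result is a filter followed by a group-by on the two cell indices, so the scan can use indices and run in parallel.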
http://SkyServer.sdss.org
• Solves the 20 queries
• Has 150 hours of online instruction
– Translated to Hungarian
• Professional astronomers use it as the
SDSS Science Catalog Analysis Service.
• Clone operating in Hungary.
19
SkyQuery (http://skyquery.net/)
• Distributed Query tool using a set of web services
• Many astronomy archives from
Pasadena, Chicago, Baltimore, Cambridge
(England)
• Has grown from 4 to 15 archives,
now becoming international standard
• Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 AND (o.i - t.m_j)>2
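To give a feel for what a cross-match between two archives involves, here is a simplified stand-in: pair every object in one catalog with every object in another that lies within a fixed angular radius. This is not SkyQuery's XMATCH, whose exact semantics the slide does not describe. The function names and the `(id, ra, dec)` tuple layout are hypothetical, and the brute-force loop is only suitable for small inputs.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcseconds (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def crossmatch(cat_a, cat_b, radius_arcsec):
    """Return (id_a, id_b) pairs closer than radius_arcsec; O(N*M) brute force."""
    return [
        (id_a, id_b)
        for id_a, ra_a, dec_a in cat_a
        for id_b, ra_b, dec_b in cat_b
        if angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b) < radius_arcsec
    ]
```

A distributed version would push the area restriction and the per-archive predicates to each archive first, so that only candidate objects travel over the network.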
20
SkyQuery Structure
• Portal
– Plans Query (2 phase)
– Integrates answers
– Is itself a web service
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
[Diagram: SkyQuery Portal connected to SDSS, INT, FIRST, and 2MASS nodes and an Image Cutout service]
21
MyDB: eScience Workbench
• Prototype of bringing analysis to the data
• Everybody gets a workspace (database)
– Executes analysis at the data
– Store intermediate results there
– Long queries run in batch
– Results shared within groups
• Only fetch the final results
• Extremely successful – matches work patterns
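The workspace pattern can be sketched with Python's built-in `sqlite3` module: one database stands in for the shared archive, and a second, attached database plays the role of the personal workspace. The schema name `mydb`, the table names, and the columns are hypothetical. The intermediate table stays on the server side and only the final summary is fetched.

```python
import sqlite3


def bright_galaxy_summary(rows, r_limit=18.0):
    """Store an intermediate result in a workspace, fetch only the summary."""
    conn = sqlite3.connect(":memory:")  # stands in for the shared archive
    conn.execute("ATTACH DATABASE ':memory:' AS mydb")  # personal workspace

    conn.execute(
        "CREATE TABLE main.galaxy (obj_id INTEGER, r REAL, redshift REAL)"
    )
    conn.executemany("INSERT INTO main.galaxy VALUES (?, ?, ?)", rows)

    # The intermediate result lives in the workspace, next to the data.
    conn.execute(
        "CREATE TABLE mydb.bright AS "
        "SELECT obj_id, redshift FROM main.galaxy WHERE r < ?",
        (r_limit,),
    )

    # Only the final, small answer leaves the database.
    summary = conn.execute(
        "SELECT COUNT(*), AVG(redshift) FROM mydb.bright"
    ).fetchone()
    conn.close()
    return summary
```

Later queries can reuse `mydb.bright` without repeating the scan of the archive table, which is how a workspace stores intermediate results near the data.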
22
Summary
• Computational Science
– Simulation
– Data Bases
– Analysis (organization and mining)
• needed by simulations and experiments
– Visualization
• Each Science X
– Has a comp-X branch
– Is getting an X-info branch
– Objectifying that science: defining terms precisely
• This broadening is multi-disciplinary
– Pair: good domain scientist + good computer scientist
– Chemistry is important
• A concrete way to approach Grid-computing.
23
Outline
• Want to build a TerraServer for Hungary?
– Could be done inexpensively (if you have the data)
– Microsoft would license the software to you
• My view of eScience & Hungary
– Hungary can’t lead in hardware
– Hungary CAN lead in software
• Algorithms: data mining, analysis
• Tools: that implement the algorithms
• Systems: learn by doing
• Could start an industry, fits EU agenda.
24