The Data Avalanche
Talk at
HP Labs/MSR: Research Day
July 2004
Jim Gray
Microsoft Research
[email protected]
http://research.microsoft.com/~Gray
How much information is there?
• Almost everything is recorded digitally.
• Most bytes are never seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there:
http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information
http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale of information sizes running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta. Labeled points: A Photo, A Book, A Movie, All books (words), All Books MultiMedia, Everything Recorded!]
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an individual
stores all his books, records, and
communications, and which is mechanized
so that it may be consulted with exceeding
speed and flexibility”
“yet if the user inserted 5000 pages of
material a day it would take him hundreds
of years to fill the repository, so that he can
be profligate and enter material freely”
MyLifeBits The guinea pig
• Gordon Bell is digitizing his life
• Has now scanned virtually all:
– Books written (and read when possible)
– Personal documents (correspondence, memos, email, bills, legal, …)
– Photos
– Posters, paintings, photo of things (artifacts, … medals, plaques)
– Home movies and videos
– CD collection
– And, of course, all PC files
• Recording: phone, radio, TV, web pages… conversations
• Paperless throughout 2002. 12” scanned, 12’ discarded.
• Only 30GB Excluding videos
• Video is 2+ TB and growing fast
25Kday life ~ Personal Petabyte
[Chart: Lifetime Storage in TB, log scale from 0.001 to 1000 (1PB), by media type: Msgs, web pages, Tifs, Books, jpegs, 1KBps sound, music, Videos]
Will anyone look at web pages in 2020?
Probably new modalities & media will dominate then.
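The "25Kday life ~ Personal Petabyte" estimate can be checked with a little arithmetic. The sketch below is an illustration rather than anything from the slides: it assumes decimal units (1 PB = 10^15 bytes), and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
PB = 10**15  # decimal petabyte; an assumption, the slide does not state a convention

def daily_budget_bytes(total_bytes: int, days: int) -> float:
    """Average bytes per day needed to accumulate total_bytes over the given days."""
    return total_bytes / days

per_day = daily_budget_bytes(PB, 25_000)   # bytes per day over a 25,000-day life
per_second = per_day / SECONDS_PER_DAY     # equivalent sustained recording rate
years = 25_000 / 365.25                    # 25Kdays expressed in years

print(f"{years:.1f} years, {per_day / 1e9:.0f} GB/day, {per_second / 1e3:.0f} KB/s sustained")
```

Under these assumptions a personal petabyte corresponds to roughly 40 GB per day for about 68 years, a rate that only continuous video-class recording would reach.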
Challenges
• Capture: Get the bits in
• Organize: Index them
• Manage: No worries about loss or space
• Curate/ Annotate: automate where possible
• Privacy: Keep safe from theft.
• Summarize: Give thumbnail summaries
• Interface: how ask/anticipate questions
• Present: show it in understandable ways.
80% of data is personal / individual.
But, what about the other 20%?
• Business
– Wal-Mart online: 1PB and growing…
– Paradox: most “transaction” systems < 1 PB.
– Have to go to image/data monitoring for big data
• Government
– Government is the biggest business.
• Science
– LOTS of data.
Instruments: CERN – LHC
Peta Bytes per Year
Looking for the Higgs Particle
• Sensors: 1000 GB/s (1TB/s ~ 30 EB/y)
• Events: 75 GB/s
• Filtered: 5 GB/s
• Reduced: 0.1 GB/s ~ 2 PB/y
• Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
CERN Tier 0
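The per-second rates on this slide can be turned into yearly volumes with a one-line conversion. The following is a minimal sketch, assuming decimal units and a detector that streams continuously for 365 days (real runs do not); the function and dictionary names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 86_400  # assumes continuous operation all year

def gb_per_s_to_pb_per_year(gb_per_s: float) -> float:
    """Convert a sustained rate in GB/s to PB/year (decimal units, 1 PB = 10**6 GB)."""
    return gb_per_s * SECONDS_PER_YEAR / 1e6

# Stage rates in GB/s as quoted on the slide.
stages = {"Sensors": 1000.0, "Events": 75.0, "Filtered": 5.0, "Reduced": 0.1}
yearly = {name: gb_per_s_to_pb_per_year(rate) for name, rate in stages.items()}

for name, pb in yearly.items():
    print(f"{name:9s} {stages[name]:8.1f} GB/s  ~ {pb:10.1f} PB/y")
```

With continuous operation the sensor stream comes to about 31,500 PB/y, matching the slide's ~30 EB/y. The reduced stream comes to about 3.2 PB/y, so the slide's ~2 PB/y would correspond to running for part of the year; that reading is an inference, not something the slide states.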
Information Avalanche
• Both
– better observational instruments and
– better simulations
are producing a data avalanche
• Examples
– Turbulence: 100 TB simulation, then mine the Information
– BaBar: Grows 1TB/day (2/3 simulation Information, 1/3 observational Information)
– CERN: LHC will generate 1GB/s, 10 PB/y
– VLBA (NRAO) generates 1GB/s today
– NCBI: “only ½ TB” but doubling each year, very rich dataset.
– Pixar: 100 TB/Movie
Image courtesy of C. Meneveau & A. Szalay @ JHU
One Challenge: Move Data from CERN to Remote Centers @ 1GBps
• Disk-to-Disk
• gigabyte / second data rates
• 80TB/day
• 30 petabytes by 2008
• 1 exabyte by 2014
[Diagram: tiered data distribution from the Experiment through CERN to Tier 1 centers (INP3, RAL, INFN, FNAL, …), Tier 2 centers, Tier 3 institutes (physics data cache), and Tier 4 workstations. Rate labels, top to bottom: ~PBps, Filter ~1 GBps, ~5 GBps, ~1 GBps, ~1 GBps, .1 GBps]
Graphics courtesy of Harvey Newman @ Caltech
Current Status: CERN → Pasadena
• Multi Stream tcp/ip 7.1 Gbps ~900 MBps
– New speed record @ http://ultralight.caltech.edu/lsr-winhec/
• Single Stream tcp/ip 6.5 Gbps ~800 MBps
• File Transfer Speed ~450 MBps
[Chart: transfer speed in Mbps (0 to 7,000) by year, 2000 to 2005]
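The bit-rate and byte-rate figures above are related by a factor of eight. As a rough sketch (decimal prefixes assumed, helper names hypothetical), the conversion and the time to move a petabyte at the quoted file-transfer speed look like this:

```python
def gbps_to_mbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (8 bits per byte)."""
    return gbps * 1000 / 8

def days_to_move(total_bytes: float, mbytes_per_s: float) -> float:
    """Days needed to move total_bytes at a sustained rate given in MB/s."""
    return total_bytes / (mbytes_per_s * 1e6) / 86_400

multi = gbps_to_mbytes_per_s(7.1)   # the slide rounds this to ~900 MBps
single = gbps_to_mbytes_per_s(6.5)  # the slide rounds this to ~800 MBps
pb_days = days_to_move(1e15, 450)   # one decimal petabyte at ~450 MBps

print(f"multi-stream {multi:.1f} MB/s, single-stream {single:.1f} MB/s")
print(f"1 PB at 450 MB/s takes about {pb_days:.1f} days")
```

Even at the record file-transfer rate, a petabyte takes weeks to ship over the network under these assumptions.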
The Evolution of Science
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions.
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Exploration Science
– Data captured by instruments or generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files
e-Science
• Data captured by instruments or generated by simulator
• Processed by software
• Placed in files or a database
• Scientist analyzes files / database
• Virtual laboratories
– Networks connecting e-Scientists
– Strong support from funding agencies
• Better use of resources
– Primitive today
The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; arrows labeled facts, questions, answers; a central “?”]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling
FTP - GREP
• Download (FTP and GREP) are not adequate
– You can GREP 1 MB in a second
– You can GREP 1 GB in a minute
– You can GREP 1 TB in 2 days
– You can GREP 1 PB in 3 years.
• Oh!, and 1PB ~3,000 disks
• At some point we need
– indices to limit search
– parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
– Bring the analysis to the data!
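The GREP figures above follow from dividing data size by scan rate. The sketch below assumes an effective rate of about 10 MB/s per disk, chosen only because it roughly reproduces the slide's TB and PB numbers; it is not a figure from the talk (the "1 GB in a minute" line implies a somewhat faster rate). It also shows what spreading the scan over 3,000 disks would buy.

```python
def scan_seconds(total_bytes: float, bytes_per_s: float, disks: int = 1) -> float:
    """Time to read total_bytes sequentially, split evenly over `disks` parallel scanners."""
    return total_bytes / (bytes_per_s * disks)

RATE = 10e6  # assumed ~10 MB/s effective scan rate per disk; illustrative only

sizes = {"1 MB": 1e6, "1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}
serial = {label: scan_seconds(size, RATE) for label, size in sizes.items()}
parallel_pb_hours = scan_seconds(1e15, RATE, disks=3000) / 3600

for label, secs in serial.items():
    print(f"{label}: {secs:,.1f} s  ({secs / 86_400:,.2f} days)")
print(f"1 PB across 3,000 disks: {parallel_pb_hours:.1f} hours")
```

At the assumed rate a serial scan of a petabyte takes about three years, while a perfectly parallel scan across 3,000 disks takes under ten hours, which is the case for parallel search. Indices go further by avoiding most of the scan entirely.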
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at same rate,
we can only keep up with N log N
• A way out?
– Relax notion of optimal
(data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Combination of statistics & computer science
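The scaling argument can be made concrete. If the data grows from N to kN while compute gets k times faster, an algorithm with cost f(N) changes its wall-clock time by f(kN) / (k f(N)). The values of N and k below are illustrative, not taken from the slides, and the function name is hypothetical.

```python
import math

def runtime_growth(cost, n: float, k: float) -> float:
    """Change in wall-clock time when data grows n -> k*n and compute gets k times faster."""
    return cost(k * n) / (k * cost(n))

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n, k = 1e8, 10.0  # illustrative values only
growth = {name: runtime_growth(f, n, k) for name, f in costs.items()}

for name, factor in growth.items():
    print(f"{name:8s} runtime changes by a factor of {factor:.3f}")
```

For these values the N log N algorithm slows by only about 12%, while the N^2 and N^3 algorithms slow by factors of 10 and 100, so only near-linear methods keep pace.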
Analysis and Databases
• Much statistical analysis deals with
– Creating uniform samples
– Data filtering
– Assembling relevant subsets
– Estimating completeness
– Censoring bad data
– Counting and building histograms
– Generating Monte-Carlo subsets
– Likelihood calculations
– Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
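As a small illustration of doing such tasks inside a database, the sketch below censors bad rows and builds a histogram in a single SQL statement, so only the bin counts leave the engine. It uses Python's built-in sqlite3 module purely for convenience; the table, column names, and sample values are made up for the example and do not come from any survey schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, mag REAL, flag_bad INTEGER)")

# Hypothetical catalog rows: (id, magnitude, bad-data flag).
rows = [(1, 14.2, 0), (2, 15.7, 0), (3, 15.1, 1), (4, 16.4, 0), (5, 16.9, 0), (6, 14.8, 0)]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)

# Drop flagged rows and count the rest in one-magnitude bins, all inside the engine.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM objects
    WHERE flag_bad = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()
conn.close()

print(histogram)
```

The client receives three small rows instead of the whole table, which is the point of moving the analysis to the data.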
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
• Premise: Most data is (or could be) online
• So, the Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, x-ray, radio…
– As deep as the best instruments (2 years ago).
– It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no…).
– It’s a smart telescope: links objects and data to literature on them.
Why Astronomy Data?
• It has no commercial value
– No privacy concerns
– Can freely share results with others
– Great for experimenting with algorithms
• It is real and well documented
– High-dimensional data (with confidence intervals)
– Spatial data
– Temporal data
• Many different instruments from many different places and many different times
• Federation is a goal
• The questions are interesting
– How did the universe form?
• There is a lot of it (petabytes)
[Images: sky views in different bands: IRAS 25µm, 2MASS 2µm, DSS Optical, IRAS 100µm, WENSS 92cm, NVSS 20cm, ROSAT ~keV, GB 6cm]
Time and Spectral Dimensions
The Multiwavelength Crab Nebula
Crab star
1054 AD
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ CalTech.
Estimating Cosmological Constant
CPU Time vs Memory
• CPU time is 5000 x N x log2 N in memory
• For large data sets, split into M disk chunks
=> time goes as M^2
• Have 80M objects now, time is 10 days with 32GB
– 4x1GHz CPU
• Need to run this many times with different DB cuts
• more objects soon!
[Chart: CPU time in hrs (log scale, 1.0 to 100000.0, with reference marks at 1 day, 1 week, 1 month, year, decade) vs. No of galaxies in Millions (0 to 100); curves by Memory in GB, legend values as extracted: 4, 32, 256, 1]
SkyServer.SDSS.org
• A modern archive
– Raw Pixel data lives in file servers
– Catalog data (derived objects) lives in Database
– Online query to any and all
• Also used for education
– 150 hours of online Astronomy
– Implicitly teaches data analysis
• Interesting things
– Spatial data search
– Client query interface via Java Applet
– Query interface via Emacs
– Popular -- 1% of Terraserver
– Cloned by other surveys (a template design)
– Web services are core of it.
Demo of SkyServer
• Shows standard web server
• Pixel/image data
• Point and click
• Explore one object
• Explore sets of objects (data mining)
Data Federations of Web Services
• Massive datasets live near their owners:
– Near the instrument’s software pipeline
– Near the applications
– Near data knowledge and curation
– Super Computer centers become Super Data Centers
• Each Archive publishes a web service
– Schema: documents the data
– Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
– A common global schema
[Diagram label: Federation]
Federation: SkyQuery.Net
• Combine 4 archives initially
• Just added 10 more
• Send query to portal,
portal joins data from archives.
• Problem: want to do multi-step data analysis
(not just single query).
• Solution: Allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server,
Deposits answer in personal database.
SkyQuery Structure
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
• Portal
– Plans Query (2 phase)
– Integrates answers
– Is itself a web service
[Diagram: SkyQuery Portal linked to SDSS, FIRST, 2MASS, INT, and Image Cutout]
SkyQuery: http://skyquery.net/
• Distributed Query tool using a set of web services
• Four astronomy archives from
Pasadena, Chicago, Baltimore, Cambridge (England).
• Feasibility study, built in 6 weeks
– Tanu Malik (JHU CS grad student)
– Tamas Budavari (JHU astro postdoc)
– With help from Szalay, Thakar, Gray
• Implemented in C# and .NET
• Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 and (o.I - t.m_j)>2
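To give a feel for what a positional cross-match between two archives involves, here is a deliberately simplified stand-in: it pairs objects whose great-circle separation is within a tolerance in arcseconds. This is an assumption made for illustration only. The slides do not define the semantics or units of XMATCH, and the catalogs, identifiers, and function names below are hypothetical.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

def cross_match(cat_a, cat_b, max_sep_arcsec):
    """Naive O(len(a) * len(b)) positional match; returns (id_a, id_b) pairs."""
    pairs = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            if angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b) <= max_sep_arcsec:
                pairs.append((id_a, id_b))
    return pairs

# Made-up (id, ra, dec) rows standing in for two archives.
optical = [("o1", 181.3000, -0.7600), ("o2", 181.3500, -0.7000)]
infrared = [("t1", 181.3000, -0.7600 + 1.0 / 3600), ("t2", 182.0000, -0.5000)]

matches = cross_match(optical, infrared, max_sep_arcsec=3.5)
print(matches)
```

The nested loop is quadratic in catalog size, so this approach is only workable for small extracts; restricting the search to a sky region, as the AREA clause in the query does, is one way to keep the candidate sets small.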
MyDB added to SkyQuery
• Moves analysis to the data
• Users can cooperate (share MyDB)
• Still exploring this
• Let users add personal DB, 1GB for now.
• Use it as a workbook.
• Online and batch queries.
[Diagram: SkyQuery Portal with MyDB, linked to SDSS, FIRST, 2MASS, INT, and Image Cutout]
The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; arrows labeled facts, questions, answers; a central “?”]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling