
How much information?
Adapted from a presentation by:
Jim Gray
Microsoft Research
http://research.microsoft.com/~gray
Alex Szalay
Johns Hopkins University
http://tarkus.pha.jhu.edu/~szalay/
How much information is there in the world?
1. What can we store?
2. What is stored?
3. Why are we interested?
Infinite Storage?
• The Terror Bytes are here
  – 1 TB costs 1 k$ to buy
  – 1 TB costs 300 k$/y to own
• Management & curation are expensive
  – Searching 1 TB takes minutes or hours
• Petrified by Peta Bytes?
• But… people can “afford” them, so
  – Even though they can never actually be seen in your lifetime
  – Automate the process
[Figure: byte-scale ladder from Kilo through Mega, Giga, Tera (“we are here”), Peta, Exa, and Zetta, up to Yotta]
How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans
• Data summarization, trend detection, and anomaly detection are key technologies
• See Mike Lesk: How much information is there?
  http://www.lesk.com/mlesk/ksg97/ksg.html
• See Lyman & Varian: How much information
  http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: byte-scale ladder from Kilo to Yotta, locating A Photo, A Book, a Movie, All Books (words), All Books MultiMedia, and Everything Recorded!]
(Sub-unit prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli)
First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50 × 24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer &
accounting software
(tubes not transistors)
Storage capacity beating Moore’s law
• Improvements:
  – Capacity: 60%/y
  – Bandwidth: 40%/y
  – Access time: 16%/y
• 1,000 $/TB today
• 100 $/TB in 2007
• Moore’s law: 58.7%/year
• Disk TB growth: 112.3%/year since 1993
• Price decline: 50.7%/year since 1993
• Most (80%) data is personal (not enterprise); this will likely remain true
[Figure: disk TB shipped per year, 1988–2000, log scale 1E+3 to 1E+7 (an exabyte); disk TB growth 112%/y vs. Moore’s Law 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
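The 2007 price follows from compounding the annual decline. A minimal sketch of that arithmetic in Python (the 50.7%/y rate and 1,000 $/TB starting point are from the slide; treating “today” as roughly 2003 is an assumption for illustration):

```python
# Minimal sketch: project disk price per TB from a constant annual decline.
# The 50.7%/y decline and the 1,000 $/TB start come from the slide;
# anchoring "today" at 2003 is an assumption.
def price_per_tb(start_price, annual_decline, years):
    """Price after compounding a fixed fractional decline each year."""
    return start_price * (1 - annual_decline) ** years

for year in range(2003, 2008):
    print(year, round(price_per_tb(1000.0, 0.507, year - 2003), 1))
# At a full 50.7%/y decline, 100 $/TB arrives before 2007; the slide's
# more conservative 2007 figure corresponds to roughly a 40-45%/y decline.
```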
Disk Storage Cheaper Than Paper
• File Cabinet:
  – Cabinet (4 drawer):         250 $
  – Paper (24,000 sheets):      250 $
  – Space (2×3 ft @ 10 $/ft²):  180 $
  – Total:                      700 $
  – 0.03 $/sheet: 3 pennies per page
• Disk:
  – Disk (250 GB):              250 $
  – ASCII: 100 M pages, 2e-6 $/sheet (10,000× cheaper): a micro-dollar per page
  – Image: 1 M photos, 3e-4 $/photo (100× cheaper): a milli-dollar per photo
• Store everything on disk
  – Note: disk is 100× to 1000× cheaper than RAM
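The per-page numbers are simple division. A minimal sketch of the arithmetic (the ~2.5 KB/page and ~250 KB/photo sizes are what the slide’s counts imply; the slide rounds the final ratios):

```python
# Minimal sketch of the slide's cost-per-item arithmetic.
# Assumed item sizes (implied by the slide's counts): ~2.5 KB per ASCII
# page and ~250 KB per photo, on a 250 GB disk costing 250 $.
DISK_BYTES = 250e9
DISK_COST = 250.0                       # $
PAPER_COST_PER_SHEET = 700.0 / 24000    # file-cabinet total / sheets ~ 0.03 $

ascii_pages = DISK_BYTES / 2.5e3        # ~ 100 M pages
photos      = DISK_BYTES / 250e3        # ~ 1 M photos

disk_cost_per_page = DISK_COST / ascii_pages
print(f"ASCII: {disk_cost_per_page:.1e} $/page "
      f"({PAPER_COST_PER_SHEET / disk_cost_per_page:,.0f}x cheaper than paper)")
print(f"Photo: {DISK_COST / photos:.1e} $/photo")
```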
Trying to fill a terabyte in a year

Item                          Items/TB   Items/day
300 KB JPEG                   3 M        9,800
1 MB Doc                      1 M        2,900
1 hour 256 kb/s MP3 audio     9 K        26
1 hour 1.5 Mb/s MPEG video    290        0.8
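The table is simple division. A minimal sketch of the arithmetic (item sizes as given in the table; note that the printed video row implies a larger per-hour size than a raw 1.5 Mb/s stream yields, so that row will not reproduce exactly):

```python
# Minimal sketch: items of each size per terabyte, and how many per day
# you would need to store to fill a TB in a year. Sizes from the table;
# the slide's rounding differs slightly, and its MPEG row implies roughly
# 3.4 GB/hour rather than the 675 MB a raw 1.5 Mb/s stream gives.
TB = 1e12
items = {
    "300 KB JPEG":             300e3,
    "1 MB Doc":                1e6,
    "1 h 256 kb/s MP3 audio":  256e3 / 8 * 3600,   # ~115 MB
    "1 h 1.5 Mb/s MPEG video": 1.5e6 / 8 * 3600,   # ~675 MB
}
for name, size in items.items():
    per_tb = TB / size
    print(f"{name:26s} {per_tb:12,.0f}/TB {per_tb / 365:10,.1f}/day")
```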
Portable Computer: 2006?
• 100 Gips processor
• 1 GB RAM
• 1 TB disk
• 1 Gbps network
• “Some” of your software
• Finding things is a data mining challenge
80% of data is personal / individual.
But what about the other 20%?
• Business
  – Wal-Mart online: 1 PB and growing…
  – Paradox: most “transaction” systems < 1 PB.
  – Have to go to image/data monitoring for big data
• Government
  – Government is the biggest business.
• Science
  – LOTS of data.
Q: Where will the Data Come From?
A: Sensor Applications
• Earth Observation
  – 15 PB by 2007
• Medical Images & Information + Health Monitoring
  – Potential 1 GB/patient/y → 1 EB/y
• Video Monitoring
  – ~1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y → filtered???
• Airplane Engines
  – 1 GB sensor data/flight
  – 100,000 engine hours/day
  – 30 PB/y
• Smart Dust: ?? EB/y
  http://robotics.eecs.berkeley.edu/~pister/SmartDust/
  http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
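A quick sanity check of the airplane-engine estimate (reading the slide’s “1 GB sensor data/flight” as roughly 1 GB per engine-hour is an assumption, but it is what makes the slide’s numbers line up):

```python
# Minimal sanity check of the airplane-engine figure. Assumes ~1 GB of
# sensor data per engine-hour (an interpretation of "1 GB/flight").
GB_PER_ENGINE_HOUR = 1.0
ENGINE_HOURS_PER_DAY = 100_000

gb_per_year = GB_PER_ENGINE_HOUR * ENGINE_HOURS_PER_DAY * 365
print(f"{gb_per_year / 1e6:.1f} PB/y")   # -> 36.5 PB/y, near the slide's 30 PB/y
```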
Premise: DataGrid Computing
• Store exabytes twice (for redundancy)
• Access them from anywhere
• Implies huge archive/data centers
• Supercomputer centers become super data centers
• Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …
Thesis
• Most new information is digital
(and old information is being digitized)
• An Information Science Grand Challenge:
– Capture
– Organize
– Summarize
– Visualize
this information
• Optimize Human Attention as a resource
• Improve information quality
The Evolution of Science
• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Exploration Science
  – Data captured by instruments, or generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files
Computational Science Evolves
• Historically, Computational Science = simulation.
• New emphasis on informatics:
  – Capturing,
  – Organizing,
  – Summarizing,
  – Analyzing,
  – Visualizing
• Largely driven by observational science, but also needed by simulations.
• Too soon to say if comp-X and X-info will unify or compete.
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]
Next-Generation Data Analysis
• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: dark matter, dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N², likelihood techniques N³
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
  – Discard notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
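A small illustration of the scaling point: if the dataset and the computer both double yearly (an assumed rate, for illustration), N log N work stays roughly feasible while N² work falls behind by a factor of N.

```python
# Minimal illustration: data size N and compute C both double each year.
# N log N work grows about as fast as C (only a log N factor behind),
# while N^2 work outgrows C by a factor of N.
import math

N0 = C0 = 1e9   # arbitrary starting data size and compute budget
for year in range(0, 11, 5):
    N = N0 * 2**year
    C = C0 * 2**year
    nlogn_gap = (N * math.log2(N)) / C   # nearly constant (grows like log N)
    nsq_gap   = N**2 / C                 # grows like N: hopeless
    print(f"year {year:2d}: NlogN/C = {nlogn_gap:6.1f}   N^2/C = {nsq_gap:.1e}")
```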
Smart Data (active databases)
• If there is too much data to move around, take the analysis to the data!
• Do all data manipulations at the database
  – Build custom procedures and functions in the database
• Automatic parallelism guaranteed
• Easy to build in custom functionality
  – Databases & procedures being unified
  – Example: temporal and spatial indexing
  – Pixel processing
• Easy to reorganize the data
  – Multiple views, each optimal for certain types of analyses
  – Building hierarchical summaries is trivial
• Scalable to petabyte datasets
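A minimal sketch of the idea, using Python’s built-in sqlite3 to register a custom function inside the database so the filter runs next to the data and only the subset comes back (the table, columns, and ang_dist function are invented for illustration):

```python
# "Smart data" sketch: a user-defined function registered in the database
# engine, so the analysis executes where the data lives.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER, ra REAL, dec REAL, mag REAL)")
conn.executemany("INSERT INTO objects VALUES (?,?,?,?)",
                 [(1, 10.0, -5.0, 18.2), (2, 10.1, -5.1, 21.7)])

def ang_dist(ra, dec, ra0=10.0, dec0=-5.0):
    """Angular distance (degrees) from a fixed point; small-angle approx."""
    return math.hypot((ra - ra0) * math.cos(math.radians(dec0)), dec - dec0)

conn.create_function("ang_dist", 2, ang_dist)

# The spatial predicate runs inside the engine; only matching rows return.
rows = conn.execute(
    "SELECT id, mag FROM objects WHERE ang_dist(ra, dec) < 0.5 AND mag < 20"
).fetchall()
print(rows)   # -> [(1, 18.2)]
```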
Challenge: Make Data Publication & Access Easy
• Augment FTP with data query: return intelligent data subsets
• Make it easy to
  – Publish: record structured data
  – Find:
    • Find data anywhere in the network
    • Get the subset you need
  – Explore datasets interactively
• Realistic goal:
  – Make it as easy as publishing/reading web sites today.
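A minimal sketch of “FTP plus data query”: instead of fetching a whole file, the client sends a predicate and gets back only the matching subset (the dataset, the max_mag parameter, and the port are invented for illustration):

```python
# Sketch of query-based subsetting over HTTP, using only the stdlib.
import csv, io
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

ROWS = [{"id": "1", "mag": "18.2"}, {"id": "2", "mag": "21.7"}]

class SubsetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /objects?max_mag=20 returns only rows with mag < 20
        qs = parse_qs(urlparse(self.path).query)
        max_mag = float(qs.get("max_mag", ["inf"])[0])
        out = io.StringIO()
        w = csv.DictWriter(out, fieldnames=["id", "mag"])
        w.writeheader()
        w.writerows(r for r in ROWS if float(r["mag"]) < max_mag)
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.end_headers()
        self.wfile.write(out.getvalue().encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SubsetHandler).serve_forever()
```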
Data Federations of Web Services
• Massive datasets live near their owners:
  – Near the instrument’s software pipeline
  – Near the applications
  – Near data knowledge and curation
  – Super Computer centers become Super Data Centers
• Each Archive publishes a web service
  – Schema: documents the data
  – Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
  – A common global schema: Federation
• Challenge:
  – What is the object model for your science?
Web Services: The Key?
• Web SERVER:
  – Given a URL + parameters
  – Returns a web page (often dynamic)
• Web SERVICE:
  – Given an XML document (SOAP msg)
  – Returns an XML document
  – Tools make this look like an RPC.
    • F(x,y,z) returns (u, v, w)
  – Distributed objects for the web.
  – + naming, discovery, security, …
• Internet-scale distributed computing
[Diagram: your program → web server (returns a page); your program → web service (returns data into your address space)]
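A minimal sketch of that pattern using XML-RPC, one concrete “XML document in, XML document out” protocol (the slide itself names SOAP; the function f, its arguments, and the port are invented for illustration):

```python
# XML-RPC sketch: each call is an XML document over HTTP, but the
# client-side proxy makes it look like a local F(x,y,z) call.
from xmlrpc.server import SimpleXMLRPCServer
import threading
import xmlrpc.client

def f(x, y, z):
    # A "method on objects": returns (u, v, w), carried as a list in XML-RPC.
    return [x + y, y * z, z - x]

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(f, "f")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The proxy hides the XML: this reads like an ordinary function call.
proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.f(1, 2, 3))   # -> [3, 6, 2]
```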
Emerging technologies
• Look at science
• High-end computation and storage