How much information is there
How much information?
Adapted from a presentation by:
Jim Gray
Microsoft Research
http://research.microsoft.com/~gray
Alex Szalay
Johns Hopkins University
http://tarkus.pha.jhu.edu/~szalay/
How much information is
there in the world
Infometrics - the measurement of
information
• What can we store?
• What do we intend to store?
• What is stored?
• Why are we interested?
Infinite Storage?
• The Terror Bytes are Here
– 1 TB costs <100$ to buy
– 1 TB costs 300k$/y to own
• Management & curation are expensive
– Searching 1 TB without an index
takes minutes or hours
• Petrified by Peta Bytes?
• But… people can “afford” them so,
– They will be used.
• Solution: Automate processes
[Scale graphic: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo]
Digital Information
Created, Captured, Replicated Worldwide
[Chart: exabytes per year (axis 0 to 1,800) for 2006 through 2011; 10-fold growth in 5 years. Source: IDC, 2008]
Contributing sources: DVD; RFID; digital TV; MP3 players; digital cameras; camera phones, VoIP; medical imaging, laptops; data center applications, games; satellite images, GPS, ATMs, scanners; sensors, digital radio, DLP theaters, telematics; peer-to-peer, email, instant messaging, videoconferencing; CAD/CAM, toys, industrial machines, security systems, appliances
Scale of things to come
• Information:
– In 2002, recorded media and electronic information
flows generated about 22 exabytes (1 EB = 10^18 bytes) of
information
– In 2006, the amount of digital information created,
captured, and replicated was 161 EB
– In 2010, the amount of information added annually to
the digital universe will be about 988 EB (almost 1 ZB)
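The two later figures imply a yearly growth rate, which can be worked out directly. This is a minimal sketch; it assumes the 2006 and 2010 numbers measure the same quantity, and the function name is illustrative.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures quoted above: 161 EB in 2006, about 988 EB in 2010.
rate_2006_2010 = annual_growth_rate(161, 988, 4)

# "Tenfold growth in five years", expressed as a yearly rate.
rate_tenfold = annual_growth_rate(1, 10, 5)

print(f"2006 to 2010: {rate_2006_2010:.1%} per year")
print(f"10x in 5 years: {rate_tenfold:.1%} per year")
```

Both come out a little under 60% per year, close to the Moore's law rate quoted later in the deck.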
Digital Universe Environmental Footprint
• In our physical universe, 98.5% of the known mass is invisible, composed of interstellar dust or what scientists call “dark matter.” In the digital universe, we have our own form of dark matter: the tiny signals from sensors and RFID tags and the voice packets that make up less than 6% of the digital universe by gigabyte, but account for more than 99% of the “units,” information “containers,” or “files” in it.
• Tenfold growth of the digital universe in five years will have a measurable impact on the environment, in terms of both power consumed and electronic waste.
How much information is there?
• Soon most everything will be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Scale graphic, Kilo to Yotta, with markers: Everything! Recorded; All Books MultiMedia; All books (words); A Movie; A Photo; A Book]
Small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
Digital Immortality
Bell, Gray, CACM, ‘01
Requirements for storing various media for a single
person’s lifetime at modest fidelity
What is Digital Immortality?
• Preservation and interaction of digitized
experiences for individuals and/or groups
– Preservation and access
– Active interaction with archives through
queries and/or an avatar (agents)
– Avatar interactions for group experiences
• Issues:
– Archiving
– Indexing
– Veracity
– Access
Information Census
Lesk; Varian & Lyman
• ~10 Exabytes
• ~90% digital
• > 55% personal
• Print: .003% of bytes
5TB/y, but text has lowest entropy
• Email is (10 Bmpd) 4PB/y and is 20% text (estimate by Gray)
• WWW is ~50TB
deep web ~50 PB
• Growth: 50%/y

Media      TB/y        Growth Rate, %
optical    50          70
paper      100         2
film       100,000     4
magnetic   1,000,000   55
total      1,100,150   50

[Chart labels: TB, PB, EB; Internet]
First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50x24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer &
accounting software
(tubes not transistors)
1.6 meters
10 years later
30 MB
Now - Terabytes on your desk
Terabyte external drive for $200: 20 cents a gigabyte.
In 5 years, 1 cent/gigabyte, $10 for a terabyte?
Storage capacity beating Moore’s law
• Improvements:
– Capacity 60%/y
– Bandwidth 40%/y
– Access time 16%/y
• 1000 $/TB today
• 100 $/TB in 2007
Moore's law: 58.70%/year
TB growth: 112.30%/year since 1993
Price decline: 50.70%/year since 1993
Most (80%) data is personal (not enterprise)
This will likely remain true.
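Yearly percentages are easier to compare as doubling (or halving) times. The sketch below converts the three rates quoted above; the function names are illustrative, and it assumes each rate compounds once per year.

```python
import math

def doubling_time_years(annual_growth):
    """Years for a quantity growing at `annual_growth` (0.587 = 58.7%/y) to double."""
    return math.log(2) / math.log(1.0 + annual_growth)

def halving_time_years(annual_decline):
    """Years for a quantity shrinking at `annual_decline` per year to halve."""
    return math.log(2) / -math.log(1.0 - annual_decline)

moore = doubling_time_years(0.587)        # Moore's law rate from the slide
disk_tb = doubling_time_years(1.123)      # disk TB shipped
price = halving_time_years(0.507)         # price decline

print(f"Moore's law doubles every {moore:.2f} years")
print(f"Disk TB shipped doubles every {disk_tb:.2f} years")
print(f"Price halves every {price:.2f} years")
```

The 58.7%/y figure corresponds to the familiar 18-month doubling, while shipped disk capacity doubles in under a year.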
Disk TB Shipped per Year
[Chart: disk TB shipped per year, 1988 to 2000, log scale from 1E+3 to 1E+7 TB (1E+6 TB = 1 exabyte). Disk TB growth: 112%/y; Moore's Law: 58.7%/y]
Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf
Disk Evolution
• Capacity: 100x in 10 years
1 TB 3.5” drive in 2006
20 GB as 1” micro-drive
• System on a chip
• High-speed LAN
• Disk replacing tape
• Disk is super computer!
Disk Storage Cheaper Than Paper
• File Cabinet:
Cabinet (4 drawer)       250$
Paper (24,000 sheets)    250$
Space (2x3 @ 10€/ft²)    180$
Total                    700$
0.03 $/sheet: 3 pennies per page
• Disk:
disk (250 GB)            250$
ASCII: 100 million pages, 2e-6 $/sheet (10,000x cheaper): a micro-dollar per page
Image: 1 million photos, 3e-4 $/photo (100x cheaper): a milli-dollar per photo
• Store everything on disk
Note: Disk is 100x to 1000x cheaper than RAM
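The per-page figures follow from simple division. This sketch redoes the arithmetic using only the totals quoted on the slide; the constant names are illustrative.

```python
# Totals taken from the slide.
CABINET_TOTAL_USD = 700.0        # cabinet + paper + floor space
SHEETS_PER_CABINET = 24_000
DISK_USD = 250.0                 # one 250 GB disk
ASCII_PAGES_PER_DISK = 100e6     # 100 million pages of text
PHOTOS_PER_DISK = 1e6            # 1 million photos

paper_per_sheet = CABINET_TOTAL_USD / SHEETS_PER_CABINET
disk_per_page = DISK_USD / ASCII_PAGES_PER_DISK
disk_per_photo = DISK_USD / PHOTOS_PER_DISK

text_ratio = paper_per_sheet / disk_per_page     # paper vs. disk, text pages
photo_ratio = paper_per_sheet / disk_per_photo   # paper vs. disk, photos

print(f"paper: ${paper_per_sheet:.3f}/sheet")
print(f"disk:  ${disk_per_page:.1e}/page  ({text_ratio:,.0f}x cheaper)")
print(f"disk:  ${disk_per_photo:.1e}/photo ({photo_ratio:,.0f}x cheaper)")
```

The ratios land near the slide's rounded 10,000x and 100x.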
Why Put Everything in Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automate processing: knowbots
[Diagram labels: Point-to-Point OR Broadcast; Immediate OR Time Delayed; Locate, Process, Analyze, Summarize]
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an individual
stores all his books, records, and
communications, and which is mechanized
so that it may be consulted with exceeding
speed and flexibility”
“yet if the user inserted 5000 pages of
material a day it would take him hundreds
of years to fill the repository, so that he can
be profligate and enter material freely”
Trying to fill a terabyte in a year
Item                           Items/TB   Items/day
300 KB JPEG                    3M         9,800
1 MB Doc                       1M         2,900
1 hour 256 kb/s MP3 audio      9K         26
1 hour 1.5 Mb/s MPEG video     290        0.8
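The Items/day column is Items/TB spread over a year. The sketch below recomputes the first three rows. It assumes binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes), an assumption chosen because it reproduces the slide's rounded numbers; the helper names are illustrative.

```python
KB, MB, TB = 2**10, 2**20, 2**40   # ASSUMPTION: binary units
DAYS_PER_YEAR = 365

def items_per_tb(item_bytes):
    """How many items of the given size fit in one terabyte."""
    return TB / item_bytes

def items_per_day(item_bytes):
    """Items you must add every day to fill one terabyte in a year."""
    return items_per_tb(item_bytes) / DAYS_PER_YEAR

def stream_bytes(bits_per_second, seconds):
    """Size of a constant-bitrate stream."""
    return bits_per_second * seconds / 8

jpeg = 300 * KB
doc = 1 * MB
mp3_hour = stream_bytes(256_000, 3600)

for name, size in [("300 KB JPEG", jpeg), ("1 MB Doc", doc), ("1 hour MP3", mp3_hour)]:
    print(f"{name:12s} {items_per_tb(size):12,.0f} per TB {items_per_day(size):10,.1f} per day")
```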
Projected Portable Computer for 2006
• 100 Gips processor
• 1 GB RAM
• 1 TB disk
• 1 Gbps network
• “Some” of your software
finding things
is a data mining challenge
The Personal Terabyte(s)
(All Your Stuff Online)
So you’ve got it – now what do you do with it?
• TREASURED
(what’s the one thing you would save in a fire?)
• Can you find anything?
• Can you organize that many objects?
• Once you find it will you know what it is?
• Once you’ve found it, could you find it again?
• Information Science Goal:
Have GOOD answers for all these Questions
How Will We Find Anything?
• Need Queries, Indexing, Pivoting,
Scalability, Backup, Replication,
Online update, Set-oriented access
If you don’t use a DBMS,
you will implement one!
• Simple logical structure:
– Blob and link is all that is inherent
– Additional properties (facets == extra tables)
and methods on those tables (encapsulation)
• More than a file system
• Unifies data and meta-data
SQL ++
DBMS
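One way to picture the "blob and link, plus facets as extra tables" structure is a small relational schema. This is a sketch only, using Python's built-in sqlite3 module; the table and column names (blobs, links, photo_facet) are hypothetical and not taken from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The only inherent structure: opaque blobs and links between them.
    CREATE TABLE blobs (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
    CREATE TABLE links (src INTEGER NOT NULL REFERENCES blobs(id),
                        dst INTEGER NOT NULL REFERENCES blobs(id));
    -- A facet: extra properties for the blobs that happen to be photos.
    CREATE TABLE photo_facet (blob_id INTEGER PRIMARY KEY REFERENCES blobs(id),
                              taken TEXT, place TEXT);
    CREATE INDEX photo_by_place ON photo_facet(place);
""")

conn.execute("INSERT INTO blobs VALUES (1, ?)", (b"<photo bytes>",))
conn.execute("INSERT INTO blobs VALUES (2, ?)", (b"<trip notes>",))
conn.execute("INSERT INTO links VALUES (1, 2)")
conn.execute("INSERT INTO photo_facet VALUES (1, '2005-07-04', 'Seattle')")

# Find photos by a facet property, then follow links from the first hit.
photo_ids = [row[0] for row in conn.execute(
    "SELECT blob_id FROM photo_facet WHERE place = ? ORDER BY blob_id", ("Seattle",))]
linked_ids = [row[0] for row in conn.execute(
    "SELECT dst FROM links WHERE src = ?", (photo_ids[0],))]
print(photo_ids, linked_ids)
```

Data and metadata live in the same store, so the same query language finds the object and everything attached to it.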
80% of data is personal / individual.
But, what about the other 20%?
• Business
– Wal-Mart online: 1PB and growing….
– Paradox: most “transaction” systems < 1 PB.
– Have to go to image/data monitoring for big data
• Government
– Government is the biggest business.
• Science
– LOTS of data.
Q: Where will the Data Come From?
A: Sensor Applications
• Earth Observation
– 15 PB by 2007
• Medical Images & Information + Health Monitoring
– Potential 1 GB/patient/y → 1 EB/y
• Video Monitoring
– ~1E8 video cameras @ 1E5 MBps
→ 10TB/s → 100 EB/y
→ filtered???
• Airplane Engines
– 1 GB sensor data/flight,
– 100,000 engine hours/day
– 30PB/y
• Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
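The sensor estimates above are back-of-envelope products, sketched below. Two readings are assumptions, not statements from the slide: the per-camera rate is taken as 1E5 bytes per second (the reading that reproduces the 10 TB/s aggregate), and the engine figure is taken as 1 GB per engine hour.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
GB, PB, EB = 10**9, 10**15, 10**18

# Video monitoring. ASSUMPTION: 1E5 bytes/s per camera.
cameras = 10**8
camera_bytes_per_second = 10**5
video_bytes_per_second = cameras * camera_bytes_per_second
video_bytes_per_year = video_bytes_per_second * SECONDS_PER_YEAR

# Medical: number of patients implied by 1 GB/patient/y adding up to 1 EB/y.
implied_patients = EB // GB

# Airplane engines. ASSUMPTION: 1 GB of sensor data per engine hour.
engine_hours_per_day = 100_000
engine_bytes_per_year = engine_hours_per_day * GB * 365

print(f"video:   {video_bytes_per_second / 10**12:.0f} TB/s, "
      f"{video_bytes_per_year / EB:.0f} EB/y unfiltered")
print(f"medical: {implied_patients:.0e} patients")
print(f"engines: {engine_bytes_per_year / PB:.1f} PB/y")
```

These are order-of-magnitude checks; the unfiltered video total lands in the hundreds of exabytes per year, and the engine total is close to the slide's 30 PB/y.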
Instruments: CERN – LHC
Peta Bytes per Year
Looking for the Higgs Particle
• Sensors: 1000 GB/s (1TB/s ~ 30 EB/y)
• Events 75 GB/s
• Filtered 5 GB/s
• Reduced 0.1 GB/s ~ 2 PB/y
• Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
CERN Tier 0
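The stage rates describe a reduction funnel, and the factor removed at each step can be computed from the quoted numbers. In this sketch the yearly totals assume continuous operation all year, which is an assumption and gives an upper bound rather than the slide's rounded figures.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
GB, PB, EB = 10**9, 10**15, 10**18

# Stage rates in GB/s, as quoted on the slide.
stages = [("Sensors", 1000.0), ("Events", 75.0), ("Filtered", 5.0), ("Reduced", 0.1)]

# Reduction factor achieved at each step of the funnel.
reductions = [(later[0], earlier[1] / later[1])
              for earlier, later in zip(stages, stages[1:])]
overall = stages[0][1] / stages[-1][1]

# ASSUMPTION: the instrument runs continuously for a full year.
sensors_eb_per_year = stages[0][1] * GB * SECONDS_PER_YEAR / EB
reduced_pb_per_year = stages[-1][1] * GB * SECONDS_PER_YEAR / PB

for name, factor in reductions:
    print(f"{name:9s} keeps 1 byte in {factor:.1f}")
print(f"overall reduction: {overall:,.0f}x")
print(f"sensors: {sensors_eb_per_year:.1f} EB/y, reduced: {reduced_pb_per_year:.1f} PB/y")
```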
Thesis
• Most new information is digital
(and old information is being digitized)
• An Information Science Grand Challenge:
– Capture
– Organize
– Summarize
– Visualize
this information
• Optimize Human Attention as a resource
• Improve information quality
Access!
The Evolution of Science
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions.
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Exploration Science
– Data captured by instruments
or data generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files
Computational Science Evolves
• Historically, Computational Science = simulation.
• New emphasis on informatics:
– Capturing,
– Organizing,
– Summarizing,
– Analyzing,
– Visualizing
• Largely driven by
observational science, but
also needed by simulations.
• Too soon to say if
comp-X and X-info
will unify or compete.
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at the same rate,
we can only keep up with N log N
• A way out?
– Discard notion of optimal (data is fuzzy, answers are
approximate)
– Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
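The scaling argument can be made concrete: grow the data and the computer by the same factor and see how much longer each class of algorithm takes. A minimal sketch, with an illustrative starting size of 10^9 objects and a growth factor of 10; constant factors are ignored.

```python
import math

def relative_runtime(cost, n_before, n_after, speedup):
    """Runtime after growth, relative to before, on a machine `speedup` times faster."""
    return cost(n_after) / cost(n_before) / speedup

N = 10**9        # illustrative dataset size
GROWTH = 10      # data and compute both grow by this factor

r_nlogn = relative_runtime(lambda n: n * math.log(n), N, N * GROWTH, GROWTH)
r_n2 = relative_runtime(lambda n: n**2, N, N * GROWTH, GROWTH)
r_n3 = relative_runtime(lambda n: n**3, N, N * GROWTH, GROWTH)

print(f"N log N: {r_nlogn:.2f}x longer")
print(f"N^2:     {r_n2:.0f}x longer")
print(f"N^3:     {r_n3:.0f}x longer")
```

Only the N log N algorithm stays roughly level; the quadratic and cubic ones fall behind by the growth factor and its square.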
Smart Data (active databases)
• If there is too much data to move around,
take the analysis to the data!
• Do all data manipulations at database
– Build custom procedures and functions in the database
• Automatic parallelism guaranteed
• Easy to build in custom functionality
– Databases & Procedures being unified
– Example: temporal and spatial indexing
– Pixel processing
• Easy to reorganize the data
– Multiple views, each optimal for certain types of analyses
– Building hierarchical summaries is trivial
• Scalable to Petabyte datasets
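As a small illustration of taking the analysis to the data, the sketch below registers a custom function inside a database and builds a coarse summary with one query, so only the summary leaves the database. It uses Python's built-in sqlite3 module purely for convenience; SQLite does not provide the automatic parallelism mentioned above, and the table, columns, and 10-degree cell size are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, ra REAL, decl REAL, flux REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", [
    (1, 12.0, 5.0, 100.0),
    (2, 15.0, -3.0, 10.0),
    (3, 31.0, 2.0, 1000.0),
])

def magnitude(flux):
    """Illustrative brightness measure: -2.5 * log10(flux)."""
    return -2.5 * math.log10(flux)

# Register the Python function so SQL queries can call it in place.
conn.create_function("magnitude", 1, magnitude)

# One level of a hierarchical summary: per 10-degree cell in `ra`,
# count the objects and report the brightest (smallest) magnitude.
summary = conn.execute("""
    SELECT CAST(ra / 10 AS INTEGER) AS cell, COUNT(*), MIN(magnitude(flux))
    FROM objects
    GROUP BY cell
    ORDER BY cell
""").fetchall()

for cell, count, brightest in summary:
    print(cell, count, round(brightest, 2))
```

Coarser levels of the hierarchy can be built the same way by grouping the summary rows again with a larger cell size.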
Data Mining in the Image Domain: Can We Discover
New Types of Phenomena Using Automated Pattern Recognition?
(Every object detection algorithm has its biases and limitations)
– Effective parametrization of source morphologies and environments
– Multiscale analysis
(Also: in the time/lightcurve domain)
Challenge:
Make Data Publication & Access Easy
• Augment FTP with data query:
Return intelligent data subsets
• Make it easy to
– Publish: Record structured data
– Find:
• Find data anywhere in the network
• Get the subset you need
– Explore datasets interactively
• Realistic goal:
– Make it as easy as
publishing/reading web sites today.
Information Science and Data
Generation Trends
• What do large amounts of information
provide?
– New opportunities for search!
– New discoveries
• Business opportunities?
• Research opportunities?
• Problems?