Query Processing, Resource Management and Approximate in a

1. The Age of Infinite Storage
Section 1
#1
1. The Age of Infinite Storage has begun
Many of us have enough money in our pockets right now to buy all
the storage we will be able to fill for the next 5 years.
So having the storage capacity is no longer a problem.
Managing it is a problem (especially when the volume gets large).
How much data is there?
Section 1
#2
Tera Bytes (TBs) are Here
• 1 TB costs ~1k$ to buy
• 1 TB costs ~300k$/year to own
...
• Management and curation are the expensive part
• Searching 1 TB takes hours
• I’m Terrified by TeraBytes (We are here)
• I’m Petrified by PetaBytes
• I’m completely Exafied by ExaBytes
• I’m too old to ever be Zettafied by ZettaBytes, but you may be in your lifetime.
• You may be Yottafied by YottaBytes.
• You may not be Googified by GoogiBytes, but the next generation may be?

Scale:
Googi 10^100
Yotta 10^24
Zetta 10^21
Exa 10^18
Peta 10^15
Tera 10^12
Giga 10^9
Mega 10^6
Kilo 10^3
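The claim that searching 1 TB takes hours follows from simple arithmetic on sequential read rate. A minimal sketch, assuming a single disk read end to end; the 50 MB/s and 100 MB/s sustained rates are illustrative assumptions, not figures from the slides:

```python
TB = 10**12  # bytes in a terabyte (decimal prefix)
MB = 10**6   # bytes in a megabyte (decimal prefix)

def scan_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to read size_bytes sequentially at the given rate."""
    return size_bytes / rate_bytes_per_s / 3600

# Assumed sustained read rates in MB/s (hypothetical values for illustration).
for rate_mb in (50, 100):
    print(f"{rate_mb} MB/s -> {scan_hours(TB, rate_mb * MB):.1f} hours")
```

Under these assumed rates a full scan takes between roughly three and six hours, which is why indexing matters at this scale.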
Section 1
#3
How much information is there?
• Soon everything can be recorded and indexed.
• Most of it will never be seen by humans.
• Data summarization, trend detection, anomaly detection, data mining, are key technologies

Scale (largest to smallest), with examples placed along it:
Yotta
Zetta
Exa
  Everything! Recorded
Peta
  All Books MultiMedia
Tera
  All books (words)
  .Movie
Giga
  A Photo
Mega
  A Book
Kilo

10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli
Section 1
#4
First Disk, in 1956
• IBM 305 RAMAC
• 4 MB
• 50 24” disks
• 1200 rpm (revolutions per minute)
• 100 milli-seconds (ms) access time
• 35k$/year to rent
• Included computer & accounting software (tubes not transistors)
Section 1
#5
1.6 meters
10 years later
30 MB
Section 1
#6
In 2003, the Cost of Storage was about 1K$/TB.
It’s gone steadily down since then.
[Figure: Price vs disk capacity charts for IDE and SCSI disks, with panels dated 12/1/1999, 9/1/2000, 9/1/2001, 4/1/2002, and 11/4/2003. Each panel plots price ($) and raw price (k$/TB) against raw disk unit size (GB), with linear fits for the IDE and SCSI series.]
Section 1
#7
Disk Evolution
Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
Section 1
#8
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an
individual stores all his books, records,
and communications, and which is
mechanized so that it may be consulted
with exceeding speed and flexibility”
“yet if the user inserted 5000 pages of
material a day it would take him
hundreds of years to fill the repository,
so that he can enter material freely”
Section 1
#9
Can you fill a terabyte in a year?

Item                                 Items/TB   Items/day
a 300 KB JPEG image                  3M         9,800
a 1 MB Document                      1M         2,900
a 1 hour, 256 kb/s MP3 audio file    9K         26
a 1 hour 1 MPEG video                290        0.8
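The table's columns can be rechecked with a few lines of arithmetic. This is a rough sketch, assuming decimal units (1 TB = 10^12 bytes) and a 365-day year; the dictionary keys and function names are illustrative. The video row is omitted because its bit rate is not stated clearly enough to derive a size.

```python
TB = 10**12  # decimal terabyte, in bytes

# Item sizes in bytes, derived from the descriptions in the table.
items = {
    "300 KB JPEG image": 300 * 10**3,
    "1 MB document": 10**6,
    "1 hour, 256 kb/s MP3": 256_000 * 3600 // 8,  # bits/s * seconds / 8 bits per byte
}

def items_per_tb(size_bytes):
    """How many items of this size fit in one terabyte."""
    return TB / size_bytes

def items_per_day(size_bytes, days=365):
    """Items that must be created each day to fill 1 TB in `days` days."""
    return items_per_tb(size_bytes) / days

for name, size in items.items():
    print(f"{name}: {items_per_tb(size):,.0f} per TB, "
          f"{items_per_day(size):,.1f} per day")
```

The results land close to, though not exactly on, the table's rounded figures; the order of magnitude is the point.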
Section 1
# 10
On a Personal Terabyte, How Will We Find Anything?
• Need Queries, Indexing, Data Mining, Scalability, Replication…
• If you don’t use a DBMS, you will implement one of your own!
• Need for Data Mining, Machine Learning is more important than ever!
Of the digital data in existence today,
• 80% is personal/individual
• 20% is Corporate/Governmental
Section 1
# 11
We’re awash with data!

Data collections:
• Network data
• WWW (and other text collections)
• Sensor data (including micro- and nano-sensor networks)
• National Virtual Observatory (aggregated astronomical data)
• US EROS Data Center archives, Earth Observing System (near Sioux Falls, SD): remotely sensed satellite and aerial imagery data
• Genomic/Proteomic/Metabolomic data (microarrays, genechips, genome sequences): 10 gazillabytes by 2030 (~10^28 Bytes?)
• Stock Market prediction data (prices + all the above?): 10 supragazillabytes by 2040 (~10^31 Bytes?)

Projected sizes:
• 10 terabytes by 2004 (~10^13 Bytes)
• 15 petabytes by 2007 (~10^16 Bytes)
• 10 exabytes by 2010 (~10^19 Bytes)
• 10 zettabytes by 2015 (~10^22 Bytes)
• 10 yottabytes by 2020 (~10^25 Bytes)

I made up these names! Projected data sizes are overrunning our ability to name their orders of magnitude!

Useful information must be teased out of these large volumes of raw data.
AND these are some of the 1/5th of Corporate or Governmental data collections.
The other 4/5ths of data sets are personal!
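The point about sizes outrunning their names can be made concrete. The sketch below is illustrative only: it uses just the decimal prefixes listed on the scale in these slides (Kilo through Yotta), and the function name `name_magnitude` is a hypothetical helper, not part of any library.

```python
# Decimal prefixes from the scale in these slides (exponent of 10 -> name).
PREFIXES = {3: "Kilo", 6: "Mega", 9: "Giga", 12: "Tera",
            15: "Peta", 18: "Exa", 21: "Zetta", 24: "Yotta"}

def name_magnitude(num_bytes):
    """Return (mantissa, prefix) using the largest listed prefix that fits,
    or None when the value is 1000 Yotta or more, so no listed name fits."""
    if num_bytes < 10**3:
        return (num_bytes, "")
    for exp in sorted(PREFIXES, reverse=True):
        if num_bytes >= 10**exp:
            mantissa = num_bytes // 10**exp
            if mantissa >= 1000:
                return None  # past the top of the listed scale
            return (mantissa, PREFIXES[exp])

print(name_magnitude(10**19))  # 10 exabytes
print(name_magnitude(10**28))  # beyond Yotta on this scale
```

With this scale, 10^19 bytes is named as 10 Exa, while 10^28 bytes has no listed name, which is where the made-up names come in.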
Section 1
# 12
• Parkinson’s Law (for data)
  - Data expands to fill available storage
• Disk-storage version of Moore’s Law
  - Available storage doubles every 9 months!
• How do we get the information we need from the massive volumes of data we will have?
  - Querying (for the information we know is there)
  - Data mining (for the answers to questions we don't know to ask precisely).
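The 9-month doubling claim implies exponential growth, capacity(t) = capacity(0) × 2^(t / 9 months). A minimal sketch of what that means over a few years, assuming the doubling time stays constant (the function names are illustrative):

```python
import math

def growth_factor(months, doubling_months=9):
    """Factor by which capacity grows over `months` at a fixed doubling time."""
    return 2 ** (months / doubling_months)

def months_to_grow(factor, doubling_months=9):
    """Months needed for capacity to grow by `factor`."""
    return doubling_months * math.log2(factor)

print(f"5 years: x{growth_factor(60):.0f}")
print(f"1000x (one prefix step): {months_to_grow(1000) / 12:.1f} years")
```

At this rate, five years brings roughly a hundredfold increase, and each step up the prefix scale (a factor of 1000) takes about seven and a half years.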
Section 3
# 13
Thank you.
Section 3
#1
