Transcript: 01 big_data_and_the_age_of_infinite_storage_audio

CSCI 765: Big Data and Infinite Storage
One new idea introduced in this course is the emerging idea of structuring
data into vertical structures and processing across those vertical structures.
This is in contrast to the traditional method of structuring data into
horizontal structures and processing down those horizontal structures.
(Horizontal structures are often called records; e.g., an employee file
contains horizontal employee records made up of fields such as Name,
Address, Salary, Phone, etc.)
Thus, horizontal processing of vertical data (HPVD) will be introduced as an
alternative to the traditional vertical processing of horizontal data (VPHD).
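To make the contrast concrete before the formal treatment, here is a tiny sketch (my own illustration, not the course's code; the real payoff comes from the compressed vertical bit slices introduced later):

```python
# VPHD: vertically scan horizontally structured records, one record at a time.
records = [("Ann", 70000), ("Bob", 50000), ("Cal", 70000)]   # made-up rows
vphd_count = sum(1 for _name, salary in records if salary == 70000)

# HPVD: keep each field as its own vertical structure and process
# horizontally across those structures, producing a vertical bit slice.
salaries = [70000, 50000, 70000]                  # one vertical structure
match_bits = [int(s == 70000) for s in salaries]  # bit slice: 1 0 1
hpvd_count = sum(match_bits)

assert vphd_count == hpvd_count == 2
```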
Why do we need to structure and process data differently than we have in the
past?
What has changed?
Data (digital data) has gotten really BIG!!
How big is BIG DATA these days and how big will it get?
An example: the US Library of Congress is storing EVERY
tweet sent since Twitter launched in 2006.
Each tweet record contains fifty fields.
Let's assume each of those horizontal tweet records is about 1,000
bits wide.
Let's estimate approximately 1 trillion tweets, from 1 billion tweeters
to 1 billion tweetees, over 10 years of tweeting.
As a full data file, that's 10^30 data items (10^12 * 10^9 * 10^9).
That's BIG! Is it going to get even bigger?
Yes.
Let’s look at how the definition of “big data” has evolved just over my work lifetime.
My first job in this industry was as THE technician at the St. John's University IBM
1620 Computer Center. I did the following:
1. I turned the 1620 switch on.
2. I waited for the ready light bulb to come on (~15 minutes).
3. I put the Op/Sys punch card stack on the card reader (~4 inches high).
4. I put the FORTRAN compiler card stack on the reader (~3 inches).
5. I put the FORTRAN program card stack on the reader (~2 inches).
6. The 1620 produced an object code stack (~1 inch).
7. I read in the object stack and a 1964 BIG DATA stack (~40 inches).
The first FORTRAN upgrade allowed for a "continue" card so that the data stack could
be read in segments (and I could sit down).
How high would a 2013 BIG DATA stack reach if it were put on punch cards?
Let's be conservative and assume an exabyte (10^18 bytes) of data on cards.
How high is an exabyte punch card stack? Take a guess...
Keep in mind that we're being conservative, because the US LoC tweet
database may be ~10^30 bytes or more soon (if it's fully, losslessly stored).
That exabyte stack of punch cards would reach to JUPITER!
So, in my work lifetime, BIG DATA has gone from 40 inches
high all the way to Jupiter!
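As a rough sanity check on that claim, here is a back-of-the-envelope sketch. The per-card figures are my assumptions, not the lecture's: roughly 80 bytes on a standard 80-column card, and about 0.007 inches of card thickness.

```python
# Rough height of an exabyte (10^18 bytes) punched onto cards.
BYTES = 10**18
BYTES_PER_CARD = 80          # one character per column on an 80-column card
CARD_THICKNESS_IN = 0.007    # standard card stock, in inches

cards = BYTES / BYTES_PER_CARD
height_km = cards * CARD_THICKNESS_IN * 2.54 / 100 / 1000   # inches -> km

print(f"{cards:.1e} cards, stack ~{height_km:.1e} km high")
# ~1.2e16 cards, ~2.2e9 km. The Earth-Jupiter distance is roughly
# 6e8 to 9.7e8 km, so the stack reaches Jupiter with room to spare.
```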
What will happen to BIG DATA over your work lifetime?
I must deal with a data file that would reach Jupiter as a punch card stack, but I can
replace it losslessly by 1000 extendable vertical pTrees and write programs to
process across those 1000 vertical structures horizontally.
You may have to deal with a data file that would reach the end of space (if on cards),
but you can replace it losslessly by 1000 extendable vertical pTrees and write
programs to process across those 1000 vertical structures horizontally.
The next generation may have to deal with a data file that creates new space, but can
replace it losslessly by 1000 extendable vertical pTrees and write programs to
process across those 1000 vertical structures horizontally.
You will be able to use my code!
The next generation will be able to use my code too!
It seems clear that DATA WILL HAVE TO BE COMPRESSED and that data will
have to be VERTICALLY structured.
Let's take a quick look at how one might organize and compress vertical data (more
on that later, too).
Predicate trees (pTrees): slice the file by column (4 vertical structures),
vertically slice off each bit position (12 vertical structures),
then compress each bit slice into a tree using a predicate.
Traditional Vertical Processing of Horizontal Data (VPHD): e.g., find the number
of occurrences of (7, 0, 1, 4) in the relation R(A1, A2, A3, A4) below. For
horizontally structured, record-oriented data, one scans vertically, record by
record. Imagine an excillion records, not just 8 (we need speed!). Here the
count is 2: only the last two records match.

R (base 10)          R (base 2)
A1  A2  A3  A4       A1   A2   A3   A4
 2   7   6   1       010  111  110  001
 3   7   6   0       011  111  110  000
 2   7   5   1       010  111  101  001
 2   7   5   7       010  111  101  111
 3   2   1   4       011  010  001  100
 2   2   1   5       010  010  001  101
 7   0   1   4       111  000  001  100
 7   0   1   4       111  000  001  100

A vertical data structuring: vertically slice off each bit position, giving 12
bit slices R11, R12, ..., R43, where Rij is the jth bit (high to low) of
attribute Ai:

R11 = 00000011   R12 = 11111111   R13 = 01001011
R21 = 11110000   R22 = 11111100   R23 = 11110000
R31 = 11110000   R32 = 11000000   R33 = 00111111
R41 = 00011111   R42 = 00010000   R43 = 10110100

Then compress each bit slice into a pTree: record the truth of the predicate
"purely 1-bits" in a tree, recursively on halves, until the half is pure.
(More typically, we compress strings of bits, not single bits, e.g., 64-bit
strings, or strides.)

We will walk through the compression of R11 = 00000011 into its pTree, P11:
1. Whole thing pure1? false -> 0
2. Left half (0000) pure1? false -> 0. But it's pure0, so this branch ends.
3. Right half (0011) pure1? false -> 0
4. Left half of right half (00) pure1? false -> 0
5. Right half of right half (11) pure1? true -> 1

So P11 has root 0 with children 0 and 0, and the rightmost node has children
0 and 1.

Using vertical pTrees, find the number of occurrences of (7, 0, 1, 4). Its bit
pattern is 111 000 001 100, so AND the twelve pTrees, using the complement P'
wherever the pattern bit is 0:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The result is the bit slice 00000011. Its root count is
0*2^3 + 0*2^2 + 1*2^1 = 2, so (7, 0, 1, 4) occurs twice. The ANDing is done on
the compressed trees themselves, level by level, and a pure0 branch anywhere
ends that branch of the work immediately; that is where the speed comes from.
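To make those mechanics concrete, here is a minimal Python sketch of the example above. It is my own illustration, not the course's pTree library: the names (bit_slice, build_ptree, count_ones) are made up, and where a real implementation would AND the compressed trees stride by stride, this toy expands the slices and only builds the final tree for counting.

```python
def bit_slice(column, bit):
    """Vertical bit slice: the `bit`-th bit (2=high, 0=low) of every value."""
    return [(v >> bit) & 1 for v in column]

def build_ptree(bits):
    """Compress a bit slice by recording the predicate 'purely 1-bits'
    recursively on halves; a pure half (all 0s or all 1s) ends the branch."""
    if all(b == 1 for b in bits):
        return 1                       # pure1 leaf
    if all(b == 0 for b in bits):
        return 0                       # pure0 leaf
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def count_ones(tree, width):
    """Root count: a pure1 node covering `width` leaf bits contributes width."""
    if tree == 1:
        return width
    if tree == 0:
        return 0
    _, left, right = tree
    return count_ones(left, width // 2) + count_ones(right, width // 2)

# The example relation R(A1, A2, A3, A4): 8 records, 3 bits per attribute.
R = [(2,7,6,1), (3,7,6,0), (2,7,5,1), (2,7,5,7),
     (3,2,1,4), (2,2,1,5), (7,0,1,4), (7,0,1,4)]
cols = list(zip(*R))                   # 4 vertical structures

# To count (7,0,1,4) = 111 000 001 100, AND the slices, complementing
# wherever the pattern bit is 0 (the P' terms in the formula above).
target = (7, 0, 1, 4)
result = [1] * len(R)
for col, value in zip(cols, target):
    for bit in (2, 1, 0):
        s = bit_slice(col, bit)
        if (value >> bit) & 1 == 0:
            s = [1 - b for b in s]     # complemented slice, i.e. P'
        result = [a & b for a, b in zip(result, s)]

print(result)                                   # [0, 0, 0, 0, 0, 0, 1, 1]
print(count_ones(build_ptree(result), len(R)))  # 2
```

The tree form pays off at scale: a pure0 node high in one operand tree eliminates its entire stride from every subsequent AND.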
The age of Big Data is upon us, and so is the age of Infinite Storage.
Many of us have enough money in our pockets right now to buy
all the storage we will be able to fill for the next 5 years.
So having adequate storage capacity is no longer much of a
problem.
Managing our storage is a problem (especially managing BIG
DATA storage).
How much data is there?
Kilo (thousand) = 10^3
Mega (million) = 10^6
Giga (billion) = 10^9
Tera (trillion) = 10^12   <- we are here
Peta (quadrillion) = 10^15
Exa (quintillion) = 10^18
Zetta (sextillion) = 10^21
Yotta (septillion) = 10^24
...and beyond: 10^27 (octillion), 10^30 (nonillion), 10^33 (decillion),
10^36 (undecillion), 10^39 (duodecillion), 10^42 (tredecillion), ...
Googol = 10^100, and Googolplex = 10^Googol.

Terabytes (TBs) are certainly here already:
 1 TB may cost << $1k to buy
 1 TB may cost >> $1k to own
 Management and curation are the expensive part
 Searching 1 TB takes a long time

I'm Terrified by TeraBytes.
I'm Petrified by PetaBytes.
I'm Exafied by ExaBytes.
I'm Zettafied by ZettaBytes.
You could be Yottafied by YottaBytes.
You may not be Googified by GoogolBytes, but the next generation may be.
How much information is there?
 Soon everything may be recorded.
 Most of it will never be seen by humans.
 Data summarization, vertical structuring, compression, trend detection,
anomaly detection, and data mining are key technologies.
[Chart: an information scale running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta,
Yotta, with markers for a photo, a book, a movie, all books (words), all books
as multimedia, and "Everything Recorded!"]
(Going down instead of up: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico,
10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto.)
First disk, in 1956: the IBM 305 RAMAC
 4 MB
 50 24-inch disks
 1,200 rpm (revolutions per minute)
 100 millisecond (ms) access time
 1.6 meters tall
 $35k/year to rent
 Included computer & accounting software (tubes, not transistors)
Ten years later: 30 MB.
[Chart: Disk Evolution, capacities climbing from Kilo through Mega, Giga,
Tera, Peta, Exa, and Zetta toward Yotta.]
Memex ("As We May Think," Vannevar Bush, 1945):
"A memex is a device in which an individual stores all his books, records,
and communications, and which is mechanized so that it may be consulted
with exceeding speed and flexibility."
"Yet if the user inserted 5000 pages of material a day it would take him
hundreds of years to fill the repository, so that he can enter material
freely."
Can you fill a terabyte in a year?

Item                                Items/TB   Items/day
a 300 KB JPEG image                 3 M        10,000
a 1 MB document                     1 M        3,000
a 1-hour, 256 kb/s MP3 audio file   10 K       26
a 1-hour video                      300        0.8
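The table's rates are easy to reproduce. A quick sketch, with my own simplifications: 1 TB is taken as 10^12 bytes, and the 1-hour-video size is back-solved from the table's 300-per-TB figure (about 3.3 GB):

```python
# Reproduce the Items/TB and Items/day figures from the table above.
TB = 10**12
DAYS = 365

item_bytes = {
    "300 KB JPEG image":    300 * 10**3,
    "1 MB document":        1 * 10**6,
    "1 hr 256 kb/s MP3":    256_000 // 8 * 3600,   # ~115 MB per hour
    "1 hr video (~3.3 GB)": TB // 300,             # back-solved from table
}

for name, size in item_bytes.items():
    per_tb = TB / size
    print(f"{name:22s} {per_tb:>12,.0f}/TB  {per_tb / DAYS:>8,.1f}/day")
# Matches the table to its rounding: ~3.3M/~9.1K JPEGs, 1M/~2.7K documents,
# ~8.7K/~24 MP3 hours, 300/0.8 video hours.
```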
On a personal terabyte, how will we find anything?
 We need queries, indexing, vertical structuring(?), compression,
data mining, scalability, replication...
 If you don't use a DBMS, you will implement one of your own!
 The need for data mining and machine learning is more important than ever!
Of the digital data in existence today,
 80% is personal/individual
 20% is corporate/governmental
Parkinson's Law (for data): data expands to fill the available storage.
The disk-storage version of Moore's Law: available storage doubles every 9 months!
How do we get the information we need from the massive
volumes of data we will have?
 Vertical structuring and compression
 Querying (for the information we know is there)
 Data mining (for answers to questions we don't know how to ask precisely)
Moore's Law with respect to processor performance seems to be over
(processor performance no longer doubles every x months). Note that the
processors we find in our computers today are much the same as the ones we
found a few years ago. That's because that technology seems to have reached
a limit (miniaturization). Now the direction is to put multiple processors
on the same chip or die, and to use other types of processors to increase
performance.
Thank you.