intro2research - University of Utah

Download Report

Transcript intro2research - University of Utah

Dealing with MASSIVE Data
Feifei Li
[email protected]
Dept Computer Science, FSU
Sep 9, 2008
Brief Bio
• B.A.S. in computer engineering from
Nanyang Technological University in
2002
• Ph.D. in computer science from Boston
University in 2007
• Research Interns/Visitors at AT&T Labs,
IBM T. J. Watson Research Center,
Microsoft Research.
• Now: Assistant Professor in CS
Department at FSU
2
Research Areas
Database Applications
indexing
spatial
databases
query processing
Algorithms and Data structures
computational I/O-efficient
geometry algorithms misc.
Geographic
Information
Systems
Probabilistic Data
data streams
streaming
algorithms
data security and
privacy
3
Massive Data
• Massive datasets are being collected everywhere
• Storage management software is billion-$ industry
Examples (2002):
• Phone: AT&T 20TB phone call
database, wireless tracking
• Consumer: WalMart 70TB
database, buying patterns
• WEB: Web crawl of 200M
pages and 2000M links,
Google’s huge indexes
• Geography: NASA satellites
generate 1.2TB per day
4
Example: LIDAR Terrain Data
• Massive (irregular) point sets (1-10m resolution)
– Becoming relatively cheap and easy to collect
• Appalachian Mountains between 50GB and 5TB
• Exceeds memory limit and needs to be stored on disk
5
Example: Network Flow Data
• AT&T IP backbone generates 500 GB per day
• Gigascope: A data stream management system
– Compute certain statistics
• Can we do computation without storing the data?
6
Traditional Random Access Machine Model
R
A
M
• Standard theoretical model of computation:
– Infinite memory (how nice!)
– Uniform access cost
• Simple model crucial for success of computer industry
7
How to Deal with MASSIVE Data?
when there is not enough memory
Solution 1: Buy More Memory
• Expensive
• (Probably) not scalable
– Growth rate of data is higher than the growth of memory
9
Solution 2: Cheat! (by random sampling)
• Provide approximate solution for some problems
– average, frequency of an element, etc.
• What if we want the exact result?
• Many problems can’t be solved by sampling
– maximum, and all problems mentioned later
10
Solution 3: Using the Right Computation Model
• External Memory Model
• Streaming Model
• Probabilistic Model (brief)
Computation Model for Massive Data (1):
External Memory Model
Internal memory is limited but fast
External memory is unlimited but slow
Memory Hierarchy
L
1
L
2
R
A
M
• Modern machines have complicated memory hierarchy
– Levels get larger and slower further away from CPU
– Block sizes and memory sizes are different!
• There are a few attempts to model the hierarchy but not successful
– They are too complicated!
13
Slow I/O
• Disk access is 106 times slower than main memory access
track
read/write head
read/write arm
“The difference in speed
between modern CPU and
disk technologies is
analogous to the difference
in speed in sharpening a
pencil using a sharpener on
magnetic surface
one’s desk or by taking an
airplane to the other side of
the world and using a
– Disk systems try to amortize large access
time transferring
sharpener
on someonelarge
else’s
contiguous blocks of data (8-16Kbytes)
desk.” (D. Comer)
• Important to store/access data to take advantage of blocks (locality)
14
Puzzle #1: Majority Counting
b a e c a d a a d a a e a b a a f a g b
• A huge file of characters stored on disk
• Question: Is there a character that appears > 50% of the time
• Solution 1: sort + scan
– A few passes (O(logM/B N)): will come to it later
• Solution 2: divide-and-conquer
– Load a chunk in to memory: N/M chunks
– Count them, return majority
– The overall majority must be the majority in >50% chunks
– Iterate until < M
– Very few passes (O(logM N)), geometrically decreasing
• Solution 3: O(1) memory, 2 passes (answer to be posted later)
15
External Memory Model [AV88]
D
Block I/O
M
N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
I/O: Move block between memory and disk
Performance measure: # of I/Os performed by
algorithm
P
We assume (for convenience) that M >B2
16
Sorting in External Memory
•
•
•
•
Break all N elements into N/M chunks of size M each
Sort each chunk individually in memory
Merge them together
Can merge <M/B sorted lists (queues) at once
M/B blocks in main memory
17
Sorting in External Memory
• Merge sort:
– Create N/M memory sized sorted lists
– Repeatedly merge lists together Θ(M/B) at a time
N
( M
)
N M
( M
/ B)
N
( M
/( MB ) 2 )
1
 O(log M B
N
M
) phases using O( N B) I/Os each  O( NB log M
B
N
)
B
I/Os
18
External Searching: B-Tree
•
•
•
•
Each node (except root) has fan-out between B/2 and B
Size: O(N/B) blocks on disk
Search: O(logBN) I/Os following a root-to-leaf path
Insertion and deletion: O(logBN) I/Os
19
Fundamental Bounds
• Scanning:
• Sorting:
• Searching:
•
•
•
•
•
•
Internal
N
N log N
External
N
B
log 2 N
List ranking
Minimal spanning tree
Offline union-find
Interval searching
Rectangle enclosure
R-tree search
N
B
log M B
N
B
log B N
More Results
N
N log N
N
log N + T
log N + T
N T
N
B
N
B
log M B
log M B
N
B
N
B
log log B
N
B
log M B NB
logBN + T/B
log N + T/B
N
B
T
B
20
Does All the Theory Matter?
D

Thrashing!
running time
• Programs developed in RAM-model
still runs even there is not enough memory
– Run on large datasets because
M
OS moves blocks as needed
• OS utilizes paging and prefetching strategies
P
– But if program makes scattered accesses even good OS cannot
take advantage of block access
data size
21
Toy Experiment: Permuting
• Problem:
– Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
* Each element knows its correct position
– Output: Store them on disk in the right order
• Internal memory solution:
– Just scan the original sequence and move every element in the
right place!
– O(N) time, O(N) I/Os
• External memory solution:
– Use sorting
– O(N log N) time, O( NB log M NB ) I/Os
B
22
A Practical Example on Real Data
• Computing persistence on large terrain data
23
•
•
•
•
Takeaways
Need to be very careful when your program’s space
usage exceeds physical memory size
If program mostly makes highly localized accesses
– Let the OS handle it automatically
If program makes many non-localized accesses
– Need I/O-efficient techniques
Three common techniques (recall the majority counting
puzzle):
– Convert to sort + scan
– Divide-and-conquer
– Other tricks
24
Want to know more about I/O-efficient
algorithms?
A course on I/O-efficient algorithms is offered as
CIS5930 (Advanced Topics in Data Management)
Computation Model for Massive Data (2):
Streaming Model
Cannot
Don’tYou
wantgot
to tostore
andelement
do further
processing
look data
at each
only
once!
Can’t wait to
26
Streaming Algorithms: Applications
What are the top (most frequent) 1000 (source, dest)
pairs seen over the last month?
Back-end Data Warehouse
DBMS
(Oracle, DB2)
How many distinct (source, dest) pairs have
been seen?
Off-line analysis –
slow, expensive
Set-Expression Query
Network Operations
Center (NOC)
SELECT COUNT (R1.source, R2.dest)
FROM R1, R2
WHERE R1.dest = R2.source
Peer
SQL Join Query
Enterprise
Networks
DSL/Cable
Networks
PSTN
Other applications:
• Sensor networks
• Network security
• Financial applications
• Web logs and clickstreams
27
Puzzle #2: Find Missing Card
Mahjong tile
• How to find the missing tile by making one pass over everything?
– Assuming you can’t memorize everything (of course)
• Assign a number to each type of tiles:
= 8,
= 14,
= 22
• Compute the sum of all remaining tiles
– (1+…+9+11+…+19+21+…+29)*4 – sum = missing tile!
28
A Research Problem: Count # Distinct Elements
b a e c a d a a d a a e a b a a f a g b
# distinct elements = 7
• Unfortunately, there is a lower bound saying you can’t do this
without using Ω(n) memory
• But if we allow some errors, then can approximate it well
29
Solution: FM Sketch [FM85, AMS99]
• Take a (pseudo) random hash function h : {1,…,n}  {1,…,2d},
where 2d > n
• For each incoming element x, compute h(x)
– e.g., h(5) = 10101100010000
– Count how many trailing zeros
– Remember the maximum number of trailing zeroes in any h(x)
• Let Y be the maximum number of trailing zeroes
– Can show E[2Y] = # distinct elements
* 2 elements, “on average” there is one h(x) with 1 trailing zero
* 4 elements, “on average” there is one h(x) with 2 trailing zeroes
* 8 elements, “on average” there is one h(x) with 3 trailing zeroes
*…
30
Counting Paintballs
• Imagine the following
scenario:
– A bag of n paintballs is
emptied at the top of a long
stair-case.
– At each step, each paintball
either bursts and marks the
step, or bounces to the next
step. 50/50 chance either
way.
Looking only at the pattern of
marked steps, what was n?
Counting Paintballs (cont)
• What does the distribution
1st
of paintball bursts look
like?
– The number of bursts at
each step follows a binomial
distribution.
– The expected number of
bursts drops geometrically.
– Few bursts after log2 n steps
B(n,1/2)
B(n,1/4)
2nd
B(n,1/2 Y)
Y th
B(n,1/2 Y)
Solution: FM Sketch [FM85, AMS99]
• So 2Y is an unbiased estimator for # distinct elements
• However, has a large variance
– Use O(1/ε2 ∙ log(1/δ)) copies to guarantee a good estimator that
has probability 1–δ to be within relative error ε
• Applications:
– How many distinct IP addresses used a given link to send their
traffic from the beginning of the day?
– How many new IP addresses appeared today that didn’t appear
before?
33
Finding Heavy Hitters
• Which elements appeared in the stream more than 10% of the time?
• Applications:
– Networking
* Finding IP addresses sending most traffic
– Databases
* Iceberg queries
– Data mining
* Finding “hot” items (item sets) in transaction data
• Solution
– Exact solution is difficult
– If allow approximation of ε
* Use O(1/ε) space and O(1) time per element in stream
34
Streaming in a Distributed World
Query site
Network
Operations
Center (NOC)
S1
Query
Q(S1 ∪ S2 ∪…)
S3
1
1 1
1
S2
1 0
0
1
0
1
1
0
1
1 0
S6
S4
0
1
1
1
0
S5
1
0
• Large-scale querying/monitoring: Inherently distributed!
– Streams physically distributed across remote sites
E.g., stream of UDP packets through subset of edge routers
• Challenge is “holistic” querying/monitoring
– Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)
– Streaming data is spread throughout the network
35
Streaming in a Distributed World
Query site
Network
Operations
Center (NOC)
S1
Query
Q(S1 ∪ S2 ∪…)
S3
1
1 1
1
S2
1 0
0
1
0
1
1
0
1
1 0
S6
S4
0
1
1
1
0
S5
1
0
• Need timely, accurate, and efficient query answers
• Additional complexity over centralized data streaming!
• Need space/time- and communication-efficient solutions
– Minimize network overhead
– Maximize network lifetime (e.g., sensor battery life)
– Cannot afford to “centralize” all streaming data
36
Want to know more about streaming algorithms?
A graduate-level course on streaming algorithms will
be approximately offered
in the next next next semester with an error guarantee of 5%!
Or, talk to me tomorrow!
Top-k Queries
• Extremely useful in information retrieval
– top-k sellers, popular movies, etc.
– google
tuple score
t1
t2
t3
t4
t5
65
30
100
80
87
top-2 = {t3, t5}
tuple score
t3
100
t5
87
t4
80
t1
65
t2
30
Threshold Alg
RankSQL
Top-k Queries on Uncertain Data
tuple score confidence
t3
100
0.2
t5
87
0.8
t4
80
0.9
t1
65
0.5
t2
30
0.6
top-k answer depends on
the interplay between
score and confidence
(sensor reading, reliability)
(page rank, how well match query)
Top-k Definition: U-Topk
The k tuples with the maximum probability
of being the top-k
tuple score confidence
t3
100
0.2
t5
87
0.8
t4
80
0.9
t1
65
0.5
t2
30
0.6
{t3, t5}:
0.2*0.8 = 0.16
{t3, t4}:
0.2*(1-0.8)*0.9 = 0.036
{t5, t4}:
(1-0.2)*0.8*0.9 = 0.576
...
Potential problem: top-k could be very different from top-(k+1)
Top-k Definition: U-kRanks
The i-th tuple is the one with the maximum
probability of being at rank i, i=1,...,k
Rank 1:
t3: 0.2
tuple score confidence
t5: (1-0.2)*0.8 = 0.64
t3
100
0.2
t4: (1-0.2)*(1-0.8)*0.9 = 0.144
t5
87
0.8
...
t4
80
0.9
Rank 2:
t1
65
0.5
t3: 0
t2
30
0.6
t5: 0.2*0.8 = 0.16
t4: 0.9*(0.2*(1-0.8)+(1-0.2)*0.8)
= 0.612
Potential problem: duplicated tuples in top-k
Uncertain Data Models
• An uncertain data model represents a probability distribution of
database instances (possible worlds)
• Basic model: mutual independence among all tuples
• Complete models: able to represent any distribution of possible worlds
– Atomic independent random Boolean variables
– Each tuple corresponds to a Boolean formula, appears iff the
formula evaluates to true
– Exponential complexity
Uncertain Data Model: x-relations
Each x-tuple represents a discrete probability distribution of tuples
x-tuples are mutually independent, and disjoint
single-alternative
multi-alternative
U-Top2: {t1,t2}
U-2Ranks: (t1, t3)
Want to know more about uncertainty data
management?
A graduate-level course on uncertainty data management
will be (likely probably) offered
in the next next next next next semester
Or, talk to me tomorrow!
Recap
• External memory model
– Main memory is fast but limited
– External memory slow but unlimited
– Aim to optimize I/O performance
• Streaming model
– Main memory is fast but small
– Can’t store, not willing to store, or can’t wait to store data
– Compute the desired answers in one pass
• Probabilistic data model
– Can’t store, query exponential possible instances of possible
worlds
– Compute the desired answers in the succinct representation of
the probabilistic data (efficiently!! Possibly allow some errors)
45
Thanks!
Questions?