Mining Data Streams

Download Report

Transcript Mining Data Streams

Mining of Massive Datasets
Ch4. Mining Data Streams
Outline
• What is data stream
• The stream data model
• Example of stream sources
• Stream queries : standing queries & ad-hoc queries
• Sampling data in a stream
• Obtaining a Representative Sample
• Varying the sample size
• Filtering streams
• Bloom filtering
What is data stream
• Database
• Data is available when and if we want it
• Data stream
• Data arrives in a stream
• Stream is composed of elements/tuples
• If it is not processed immediately, then it is lost forever
(0, 7, k),(1,5,n),(0,3,d)
The stream data model
• Streams need not have the same
data rates or data types
• Working storage : main memory/disk
• Problem : cannot store all the data
from all the streams
Example of stream sources
• Sensor data
• Temperature sensor in the ocean
• Give the sensor a GPS unit : report surface height
• Image data
• Satellites
• Surveillance cameras
• Internet and web traffic
• Google – hundred million search queries per day
• Yahoo! – billion of clicks per day
Stream queries
• Standing queries
• permanently executing
• Ad-hoc queries
• produce outputs at appropriate times
Standing queries
Ad-hoc queries
• A question asked once about the current state of the stream
• We do not store all streams
=>we cannot answer arbitrary queries
• Solution : store a sliding window
• Elements or time
qwertyuiopasdfghjklzxcvbnm
qwertyuiopasdfghjklzxcvbnm
qwertyuiopasdfghjklzxcvbnm
Past
Future
Example
Sampling data in a stream
• We cannot store all streams in main memory
• Solution : to get a approximate answer than an exact solution
• We ask queries about the sampled data
Example
• Scenario: Search engine query stream
• Stream of tuples: (user, query, time)
• Answer questions such as: How often did a user run the same query in a
single days
• Wish to store 1/10th of query stream
• Obvious solution:
• Generate a random integer in [0...9] for each query
• Store the query if the integer is 0, otherwise discard
• This solution is wrong
• Suppose each user issues x queries once and d queries twice
(total of x+2d queries)
• Correct answer: d/(x + d)
• Proposed solution: We keep 10% of the queries
• Sample will contain x/10 of the singleton queries and
2d/10 of the duplicate queries at least once
• But only d/100 pairs of duplicates : d/100 = 1/10 ∙ 1/10 ∙ d
• Of d “duplicates” 18d/100 appear exactly once
• 18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d
• So the sample-based answer is
• d/(x + d) ≠
𝒅
𝟏𝟎𝒙+𝟏𝟗𝒅
𝑑
100
𝑥
𝑑 18𝑑
+
+
10 100 100
=
𝒅
𝟏𝟎𝒙+𝟏𝟗𝒅
Obtaining a Representative Sample
• Pick 1/10th of the users and take all of their searches
1. Store a list of users
• Generate a random integer between 0 and 9
• 0 =>value : in ; others =>value : out
2. Hash function
• Hash each user name to one of ten buckets,0 through 9
The general sampling problem
• The stream consists of tuples with n components
• A subset of components are the key
• If the key consists of more than one component, the hash
function needs to combine the values to make a single hash-value
Stream of tuples: (user, query, time)
Varying the sample size
• Because the sample will grow
• Values 0,1,2,…,B-1
• Threshold t
• We sample the tuples whose key K satisfies h(K) ≦ t
• Lower t to t-1, if the samples exceeds the allotted space
t=4
Filtering streams
• We want to accept those tuples that meet a crierion.
• Accept tuples are passed to another process, while others are dropped
• Email spam filtering
• Suppose we have a set S of one billion allowed email address
• Email address is 20 bytes or more
• We have 1 GB of available main memory
Bloom filtering
• Use the main memory as a bit array B : B = [0,1,0,0,1,0,1,…,0]
• 1 GB main memory => 8 billion bits
• All the bit is 0 in the beginning : B = [0,0,0,0,…,0]
• Hash each member of S to a bit, and set that bit to 1 : B[h(s)] = 1
• Approximately 1/8th of the bits will be 1
• When a element a arrives, we hash its email address
• If B[h(a)] == 1, we let it through ; If B[h(a)] == 0, we drop this email
• Approximately 1/8th of the spam email will get through
Analysis of Bloom filtering
• Suppose we have x targets and y darts
• The probability that a given dart will not hit a given target is (x-1)/x
• The probability that none of the darts will hit a given target is ((𝑥 − 1)/𝑥) 𝑦
1 𝑥(𝑦)
𝑦
• Rewrite ((𝑥 − 1)/𝑥) = (1 − 𝑥) 𝑥
1
𝜖
• Because (1 − 𝜖) =
1
𝑒
for small 𝜖 =>(1
1 𝑥(𝑦)
− ) 𝑥
𝑥
𝑦
−
• The probability of a false positive is 1 - 𝑒
𝑥
𝑦
=𝑒
−𝑥
Example
• Consider the spam email
• 8 billion bits => x = 8 × 109 targets
• 1 billion members of S => y = 109 darts
1
−8
• The probability of a 1 is (1 - 𝑒 ) ≒ 0.1175
0.2
• Set S has m members
• The array has n bits
• There are k hash functions
• The number of targets is x = n
• The number of darts is y = km
0.18
𝑘𝑚
−𝑛
𝑒
• The probability of a 1 is (1 • Optimal” value of k: n/m ln(2)
False positive prob.
0.16
)
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
2
4
6
8
10
12
14
16
18
20
Number of hash functions, k