A Practical Data Sketching Solution for Mining Intersection of Streams

Download Report

Transcript A Practical Data Sketching Solution for Mining Intersection of Streams

Crossroads: A Practical Data
Sketching Solution for Mining
Intersection of Streams
Jun Xu, Zhenglin Yu (Georgia Tech)
Jia Wang, Zihui Ge, He Yan (AT&T Labs Research)
Ashwin Lall (Denison University)
Overview
•Problem statement
•Motivation
•Our approach
•Data collection architecture
•Evaluation
•Conclusions
Problem Statement
• Anonymously mine the logs of cellular data traffic to
rapidly detect network performance anomalies
• Can be converted to an association-rule mining problem if
we store every cellular packet into a static database
Motivation
• Complexity of the network ecosystem
(sophisticated phones, tablets; application server; a tremendous variety of apps and
online services)
• The performance issues can be introduced by different
causes or combinations
(device, app, network, app server)
• Infeasible to store all combinations of such data
(especially when it is collected in real-time)
Problem Statement (detailed)
An example event
“Device = Magic Phone 7” &
“OS = Magic OS 88.8” &
“Application = FunContent.app” &
“Source = Metropolis downtown location” &
“Destination = FunContent.com” &
“Time = τ”
⇒ “unusually long RTT”.
Main challenge
Cannot afford to store all the combinations
(since the number of different attribute combinations is huge)
Our Goal
Asymptotic reduction in space usage while keeping
accuracy loss small when detecting anomalous values
Contribution of Paper
An intersection scheme that can significantly reduce the
storage cost while keep accuracy loss small.
Our approach
● Based on data sketching solution (inspired by the tugof-war sketch)
● A sketch is constructed to succinctly summarize the
performance metrics (e.g., average RTT) of all data
items
Our approach
•Partition attributes to 2 groups
•Example:
group attributes to 2 groups: Ai , Bj
Our approach
•Ai :the set of packets that match
(“Source = the Metropolis downtown location” & “Destination =
FunContent.com” & “Time = τ”)
•Bj:the set of packets that match
“Device = Magic Phone 7” & “OS = Magic OS 88.8” &
“Application = FunContent.app”)
Our approach
•We can compute functions on the intersection of arbitrary
sets Ai and Bj
Our approach
● Use sketches to store summary statistic (e.g., mean,
variance) for Ai and Bj
● Derive the performance metrics of the data by
intersecting the sketches ( Ai ∩ Bj )
Storage saving
• Reduce the storage cost from O(n) to O(√n)
• For example, number is in trillions (∼1012) for joint value combinations of all
these attributes
• Each subset may only be in millions (∼106)
3-Way Intersection
● How about 3-way intersection?
● An impossibility result:
Data Collection Architecture
Real data collection*
* Note: No personally
identifiable information
(PII) was gathered or
used in conducting this
study. To the extent any
data was analyzed, it was
anonymous and/or
aggregated data.
Evaluation
8 different attributes in our real data*. Partition scheme:
● 1st group:
RNC, service category, handheld device speed category, day of week
● 2nd group:
handheld device manufacturer/model, content provider, access point
network, hour
* Note: No personally identifiable information (PII) was gathered or used in conducting this study. To
the extent any data was analyzed, it was anonymous and/or aggregated data.
Evaluation
• 1.4 million distinct combinations for the 1st group
• 1.5 million distinct combinations for the 2st group
• Storage cost of maintaining the value of every combination:
1.4M × 1.5M × 4 bytes= 7.5 TB
Evaluation
• Using our intersection scheme, and using 4096 counters
in each sketch, the space cost is
(1.4M + 1.5M ) × 4096 × 4 bytes = 45 GB
• Much less than 7.5 TB
• Relative error will be about 10 %
Evaluation
● As buckets number(memory usage) increases, average relative error
will decrease.
● As intersection ratio increase, average relative error will decrease.
Evaluation
Results of mean relative errors when varying memory(buckets number) respectively for situations that
intersection ratio = 0.01, 0.02, 0.05, 0.10
Conclusions
• We provide an intersection scheme for estimating
arbitrary summary statistics on large data sets
• We show how to reduce storage cost from O(n) to O(√n)
• We demonstrate efficacy using both synthetic and real
data