Data Mining Algorithms for
Large-Scale Distributed
Systems
Presenter: Ran Wolff
Joint work with Assaf Schuster
2003
What is Data Mining?
Data mining problems all deal with
the automatic analysis of large
databases
The outcome of a data mining
algorithm is a model which
uncovers the nature of the data
Main Data Mining Problems
Association rules
Classification
Clustering
Association rule: Source IP begins with 132.68 => packets per
connection > 1000
Classification: Packets whose source IP begins with 132.68
and whose TTL < 5 will be dropped
Clustering: There are three types of packets
coming from 132.68: Simple,
Heavy load, and Malicious.
In data mining the answer precedes the question
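To make the association rule example concrete, here is a minimal sketch that scores the rule "source IP begins with 132.68 implies packets per connection > 1000" by its support and confidence, the two standard measures for such rules. The slides do not mention these measures; the record layout, the field names `src_ip` and `packets`, and the toy data are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]  # hypothetical connection record


def support_and_confidence(
    records: List[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_ante = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    # Support: fraction of all records that satisfy both sides of the rule.
    support = n_both / len(records) if records else 0.0
    # Confidence: fraction of antecedent matches that also satisfy the consequent.
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Toy data, invented for this example only.
connections = [
    {"src_ip": "132.68.1.7", "packets": 1500},
    {"src_ip": "132.68.2.9", "packets": 2400},
    {"src_ip": "132.68.3.1", "packets": 300},
    {"src_ip": "10.0.0.5", "packets": 50},
]

support, confidence = support_and_confidence(
    connections,
    antecedent=lambda r: r["src_ip"].startswith("132.68."),
    consequent=lambda r: r["packets"] > 1000,
)
```

On this toy data, two of the four connections satisfy both sides, and two of the three connections from 132.68 are heavy, so the rule has support 0.5 and confidence 2/3.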
Why Data Mine an LSD (Large-Scale Distributed)
System?
Data mining is valuable: when properly used,
it yields money
It is otherwise difficult to monitor an LSD
system: lots of data, spread across the
system, impossible to collect
Many interesting phenomena are
inherently distributed (e.g., DDoS), so it is
not enough to monitor just a few nodes
Our Work
We developed an association rule mining
algorithm that works well in LSD Systems
Local and therefore scalable
Asynchronous and therefore fast
Dynamic and therefore incremental and robust
Accurate – you get what you expect
Anytime – you get early results fast
In a Teaspoon
A distributed data mining algorithm can be
described as a series of distributed decisions
Those decisions are reduced to a majority
vote
We developed a majority voting protocol
which has all those good qualities
The outcome is an LSD association rule
mining algorithm (still to come: classification)
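One way to see how such a decision can be reduced to a vote: an itemset is globally frequent when the summed local support counts reach a fraction `min_freq` of the summed local database sizes, which is the same as asking whether the per-node values `count - min_freq * size` sum to at least zero. The sketch below checks this centrally, purely to illustrate the reduction. It is not the local, asynchronous protocol the talk describes, and the function name, the `min_freq` parameter, and the numbers are assumptions.

```python
from fractions import Fraction
from typing import List, Tuple


def is_globally_frequent(
    local_stats: List[Tuple[int, int]], min_freq: Fraction
) -> bool:
    """Decide frequency of an itemset from per-node (support_count, db_size).

    Each node contributes count - min_freq * size; the itemset is frequent
    over the union of the databases exactly when these contributions sum
    to a non-negative value.
    """
    delta = sum(count - min_freq * size for count, size in local_stats)
    return delta >= 0


# Hypothetical per-node statistics: (support count, local database size).
nodes = [(30, 100), (5, 50), (40, 100)]

frequent_at_quarter = is_globally_frequent(nodes, Fraction(1, 4))
frequent_at_third = is_globally_frequent(nodes, Fraction(1, 3))
```

Here the itemset appears in 75 of 250 transactions (30%), so it passes a threshold of 1/4 and fails a threshold of 1/3. `Fraction` is used to keep the comparison exact.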
Main Results
By the time the database is scanned once, in parallel, the average
node has discovered 95% of the rules, and has less than 10% false
rules.
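The two percentages can be read as a recall figure and a false-rule fraction for the interim rule set held by a node. The slide does not define them, so the definitions below are an assumed reading, and the rule sets are invented toy data rather than results from the talk.

```python
from typing import Set, Tuple


def recall_and_false_fraction(
    found: Set[str], true_rules: Set[str]
) -> Tuple[float, float]:
    """Score a node's interim rule set against the correct rule set.

    recall: share of the correct rules the node has already discovered.
    false_fraction: share of the node's rules that are not correct.
    """
    recall = len(found & true_rules) / len(true_rules) if true_rules else 1.0
    false_fraction = len(found - true_rules) / len(found) if found else 0.0
    return recall, false_fraction


# Toy rule sets, for illustration only.
true_rules = {"A=>B", "A=>C", "B=>C", "C=>D"}
found = {"A=>B", "A=>C", "B=>C", "X=>Y"}

recall, false_fraction = recall_and_false_fraction(found, true_rules)
```

In this toy case the node has found three of the four correct rules (recall 0.75), and one of its four rules is false (false fraction 0.25).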