Decision Tree Induction in Hierarchic Distributed Systems
With: Amir Bar-Or, Ran Wolff, Daniel Keren
Motivation
Large distributed computations are costly
• Especially data-intensive and synchronization-intensive ones, e.g., data mining
• Decision tree induction:
– Collect global statistics (thousands) for every attribute (thousands) in every tree node (hundreds)
– Global statistics require global synchronization
Motivation
Hierarchy Helps
• Simplifies synchronization
– Synchronize on each level
• Simplifies communication
• An “industrial strength” architecture
– The way real systems (including grids) are often organized
Motivation
Mining high-dimensional data
• Thousands of sources
• Central control
• Examples:
– Genomically enriched healthcare data
– Text repositories
Objectives of the Algorithm
• Exact results
– Common approaches would instead either
• Collect a sample of the data, or
• Build independent models at each site and then use centralized meta-learning on top of them
• Communication efficiency
– A naive approach that collects exact statistics for each tree node would result in gigabytes of communication
Decision Tree in a Teaspoon
• A tree where, at each level, the learning samples are split according to one attribute’s value
• A hill-climbing heuristic is used to induce the tree
– The attribute that maximizes a gain function is taken
– Gain functions: Gini or Information Gain (see the sketch below)
• No real need to compute the gain exactly, only to identify the attribute that maximizes it
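The gain functions named on the slide are standard; here is a minimal Python sketch (not from the talk) of Gini gain and information gain for one categorical attribute, where `samples` is a hypothetical list of (attribute_value, class_label) pairs:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(samples, impurity=entropy):
    """Impurity reduction from splitting on the attribute's value.

    samples: list of (attribute_value, class_label) pairs.
    """
    labels = [label for _, label in samples]
    by_value = defaultdict(list)
    for value, label in samples:
        by_value[value].append(label)
    split_impurity = sum(
        len(part) / len(samples) * impurity(part) for part in by_value.values()
    )
    return impurity(labels) - split_impurity

def best_attribute(attributes, impurity=entropy):
    """Hill-climbing step: pick the attribute with the largest gain.

    attributes: hypothetical dict mapping attribute name -> its samples.
    """
    return max(attributes, key=lambda a: gain(attributes[a], impurity))
```

The point of the last bullet is that the algorithm never needs these exact values globally; it only needs enough information to rank the best attribute first.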
Main Idea
• Infer deterministic bounds on the gain of each attribute
• Improve the bounds until the best attribute is provably better than the rest (see the sketch below)
• Communication efficiency is achieved because the bounds require only limited data
– Partial statistics for promising attributes
– Rough bounds for irrelevant attributes
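As a minimal sketch of this stopping rule (illustrative names, assuming each attribute carries a deterministic [lower, upper] interval on its gain):

```python
def provably_best(bounds):
    """bounds: dict mapping attribute -> (lower, upper) gain bounds.

    Returns the attribute whose gain lower bound exceeds every other
    attribute's upper bound, or None if there is no clear separation yet.
    """
    candidate = max(bounds, key=lambda a: bounds[a][0])
    lower = bounds[candidate][0]
    others = (upper for a, (_, upper) in bounds.items() if a != candidate)
    return candidate if all(upper < lower for upper in others) else None

def attributes_to_refine(bounds, candidate):
    """Only attributes whose intervals still overlap the candidate's need
    tighter bounds (more data); the rest can be ruled out cheaply."""
    lower = bounds[candidate][0]
    return [a for a, (_, upper) in bounds.items()
            if a != candidate and upper >= lower]
```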
Hierarchical Algorithm
• At each level of the hierarchy:
– Wait for reports from all descendants
• Reports contain upper and lower bounds on the gain of each attribute, plus the number of samples from each class
– Use the descendants’ reports to compute cumulative bounds
– If there is no clear separation, request that descendants tighten their bounds by sending more data (see the sketch below)
– At worst, all the data is gathered
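A sketch of one round at an inner node, reusing `provably_best` and `attributes_to_refine` from above. For illustration it assumes cumulative bounds combine as sample-weighted averages of the descendants' intervals (the paper's exact bounds are on the next slide), and `request_tighter_bounds` is a hypothetical callback to the descendants:

```python
def level_step(reports, request_tighter_bounds):
    """reports: list with one dict per descendant, mapping
    attribute -> (n_samples, lower, upper)."""
    while True:
        # Combine the descendants' intervals into cumulative bounds.
        cumulative = {}
        for attr in reports[0]:
            n = sum(r[attr][0] for r in reports)
            lo = sum(r[attr][0] * r[attr][1] for r in reports) / n
            hi = sum(r[attr][0] * r[attr][2] for r in reports) / n
            cumulative[attr] = (lo, hi)
        winner = provably_best(cumulative)
        if winner is not None:
            # Clear separation: report the winner (and its bounds) upward.
            return winner, cumulative
        # No separation: ask descendants for tighter bounds on the
        # still-overlapping attributes; at worst all data is gathered.
        candidate = max(cumulative, key=lambda a: cumulative[a][0])
        needed = attributes_to_refine(cumulative, candidate)
        reports = request_tighter_bounds(needed)
```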
Deterministic Bounds
• Upper bound (with $n_i$ the number of samples at descendant $i$):
$G(P) \le \frac{\sum_i n_i \, G(P_i)}{\sum_i n_i}$
• Lower bound:
$G(P) \ge \frac{G(P_1)}{1 + n_2/n_1} - \frac{1}{2} \cdot \frac{1}{1 + n_2/\min(n_1^1, n_1^2)}$
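For instance, taking the reconstructed upper bound at face value, a node whose two descendants report $n_1 = 100$, $G(P_1) = 0.3$ and $n_2 = 50$, $G(P_2) = 0.6$ computes the cumulative upper bound

$$G(P) \le \frac{n_1 G(P_1) + n_2 G(P_2)}{n_1 + n_2} = \frac{100 \cdot 0.3 + 50 \cdot 0.6}{150} = 0.4$$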
Performance Figures
• 99% reduction in communication bandwidth
• Out of 1000 SNPs, only ~12 were reported to higher levels of the hierarchy
• The percentage declines with hierarchy level
More Performance Figures
• Larger datasets require lower bandwidth
• Outlier noise is not a big issue
– White noise is even better
Future Work
• Text mining
• Incremental algorithm
• Accommodation of failure
• Testing on a real grid system
• Is this a general framework?