
Learning Relational Probability Trees
Jennifer Neville
David Jensen
Lisa Friedland
Michael Hay
Presented by Andrew Tjang
About the authors

David Jensen – Director of KDL, research assistant professor at UMass
– Research focuses on the statistical aspects and architecture of data mining systems and the evaluation of those systems for government and business applications
Authors, cont’d

Lisa Friedland – graduate student, KDL, interested in databases/data mining
Michael Hay – graduate student, KDL
Jennifer Neville – master’s student at UMass in the Knowledge Discovery Lab
– (KDL investigates how to find useful patterns in large and complex databases.)
Learning Relational Probability Trees (RPTs)

Classification trees are widely used in data mining: easily interpretable
– Conventional trees assume homogeneous, statistically independent data
– Much research has already gone into classification trees, so extending them to relational data means tree variances can be safely “ignored” and relational effects can stay on the front burner
Why is relational data so complicated?

The traditional assumption of instance independence is violated.
3 characteristics of relational data:
– Concentrated linkage
– Degree disparity
– Relational autocorrelation
All 3 potentially produce overly complex models with excess structure
– This leads to factually incorrect relations and more space/computation requirements.
What did they do?

Conventional techniques/models assume independence
– Developed RPTs and a class of algorithms to adjust for relational characteristics; RPTs are the only statistical model that adjusts for these characteristics
Applied these algorithms to a set of databases (IMDb, Cora, WebKB, human genome)
– Then evaluated the accuracy of their RPT applications
Examples of relational data

Social networks, genomic data, and data on interrelated people, places, and things/events from text documents.
First part was to apply RPTs to WebKB
– (a collection of web pages classified into categories – course, faculty, staff, student, research project)
– Approx. 4000 pages, 8000 hyperlinks
– The DB contains attributes associated with each page/hyperlink (path/domain, direction of hyperlink)
Probability (sub)tree of WebKB
RPTs

Estimate probability distributions over attribute values
These algorithms differ from regular trees – they consider the effect of related objects
– Movie example
Aggregation is used to “proportionalize” relational data
– E.g. some movies have 10 actors, some have 1000
– Aggregate as average age (either preprocessed or computed dynamically during the learning process), as in the sketch below
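A minimal sketch of this kind of aggregation (my illustration, not the authors’ code); the movie names, ages, and feature names below are all made up:

```python
from statistics import mean

# Toy data: each movie links to a variable number of actors.
movies = {
    "indie_film": {"actor_ages": [34, 51, 28, 45]},                     # 4 actors
    "blockbuster": {"actor_ages": [22 + i % 40 for i in range(1000)]},  # 1000 actors
}

def aggregate_actor_age(movie):
    """Collapse a variable-size set of related values into fixed-size features.

    Aggregation is what lets a 4-actor movie and a 1000-actor movie
    yield comparable feature vectors.
    """
    ages = movie["actor_ages"]
    return {
        "avg_age": mean(ages),
        "max_age": max(ages),
        "num_actors": len(ages),  # the degree itself can also be a feature
    }

for name, movie in movies.items():
    print(name, aggregate_actor_age(movie))
```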
The Algorithm

The RPT algorithm takes a collection of subgraphs as input (a single target object to be classified + the other objects/links in its relational neighborhood); a hypothetical layout is sketched below
Constructs a probability estimation tree (like the previous fig) to predict the target class label given:
– The attributes of the target object
– The attributes of other objects in the neighborhood
– Degree attributes counting the target’s neighborhood objects
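To make that input concrete, one hypothetical way to lay out a single training subgraph (all field names and values here are my invention, not the paper’s representation):

```python
# One training instance: a target object plus its relational neighborhood.
subgraph = {
    "target": {"type": "movie", "genre": "action", "label": "high_grossing"},
    "neighbors": [
        {"type": "actor", "gender": "female", "age": 34},
        {"type": "actor", "gender": "male", "age": 51},
        {"type": "studio", "country": "US"},
    ],
    # Degree attributes: counts of linked objects, broken out by type.
    "degree": {"actor": 2, "studio": 1},
}

print(subgraph["degree"]["actor"])  # degree attributes are usable as features
```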
Algorithm (cont’d)

Searches over the space of binary relational features to “split” the data
– e.g. MODE(actor.gender)=female
Constructs features based on attributes of the different objects in the subgraphs and aggregations of their values
Feature scores are calculated from the class distribution after the split using chi-squared analysis (see the sketch below)
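A sketch of that scoring step, assuming scipy is available; the 2x2 table construction and the toy data are mine:

```python
from scipy.stats import chi2_contingency

def score_split(feature_values, class_labels):
    """Score a binary relational feature (e.g. MODE(actor.gender)=female)
    by the chi-squared statistic of the 2x2 table it induces on the class."""
    table = [[0, 0], [0, 0]]
    for f, c in zip(feature_values, class_labels):
        table[int(bool(f))][int(bool(c))] += 1
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value

# Toy example: how well does the candidate split separate the classes?
feature = [True, True, False, False, True, False, True, False]
labels  = [1,    1,    0,     0,     1,    0,     0,    0]
print(score_split(feature, labels))  # (chi-squared statistic, p-value)
```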
Algorithm, part trois

The p-value from the chi-squared test reveals which features are significantly correlated with the class; the feature with the maximum score is included in the tree
The algorithm recurses greedily until class distributions no longer change significantly
– A “Bonferroni adjustment” plus a Laplace correction improve the probability estimates (both sketched below)
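A sketch of the two corrections, using their standard textbook forms (the paper’s exact formulas may differ):

```python
def bonferroni_adjust(p_value, num_features_tested):
    """Bonferroni adjustment: testing many candidate features inflates the
    false-positive rate, so inflate each p-value by the number of tests."""
    return min(1.0, p_value * num_features_tested)

def laplace_estimate(num_positive, num_total, num_classes=2):
    """Laplace correction: smooth leaf probability estimates so a leaf with
    few instances never reports a raw 0 or 1."""
    return (num_positive + 1) / (num_total + num_classes)

print(bonferroni_adjust(0.01, 20))  # 0.2 -- no longer significant at 0.05
print(laplace_estimate(3, 3))       # 0.8 rather than 1.0
```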
Relational features

Traditional features
– Genre=comedy
– Identify an attribute, an operator, and a value
Relational features
– Movie(x), Y = {y | ActedIn(x,y)} : Max(Age(Y)) > 65
– Identify a relation that links an object x to a set of other objects Y (evaluated in the sketch below)
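A minimal evaluation of that relational feature over a toy ActedIn table (the movie ids and ages are made up):

```python
# Toy ActedIn relation: movie id -> ages of the linked actors.
acted_in = {
    "m1": [30, 72, 41],
    "m2": [25, 33],
}

def max_age_over_65(movie_id):
    """Movie(x), Y = {y | ActedIn(x, y)} : Max(Age(Y)) > 65."""
    return max(acted_in[movie_id]) > 65  # follow the relation, aggregate, threshold

print(max_age_over_65("m1"))  # True
print(max_age_over_65("m2"))  # False
```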
Feature Selection Biases

Concentrated linkage and autocorrelation combine to reduce effective sample size and increase variance.
Degree disparity can cause spurious elements to be selected and useful elements to be missed.
Randomization Test

Used on relational data sets, where conventional hypothesis testing is not as useful
– Used to account for the biases above
Computationally intensive
– Pseudo-samples are generated by permuting variables
– In this case, attribute values are permuted before aggregation
The permuted vector preserves intrinsic links/relations but removes correlations between attributes and the class label (see the sketch below)
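A simplified sketch of such a permutation test. Note the hedge in the comments: the paper permutes attribute values before aggregation, while this sketch permutes an already-aggregated feature column for brevity; the scoring function and data are stand-ins:

```python
import random

def randomization_pvalue(feature, labels, score_fn, n_perm=1000, seed=0):
    """Estimate a p-value by permutation.

    Shuffling breaks any correlation between the attribute and the class
    label, while each pseudo-sample keeps its intrinsic structure.  (The
    paper permutes attribute values *before* aggregation; this sketch
    permutes the aggregated feature column instead.)
    """
    rng = random.Random(seed)
    observed = score_fn(feature, labels)
    hits = 0
    for _ in range(n_perm):
        permuted = feature[:]
        rng.shuffle(permuted)
        if score_fn(permuted, labels) >= observed:
            hits += 1
    return hits / n_perm

# Stand-in scoring function: how often the feature and the label agree.
def agreement(f, l):
    return sum(int(a == b) for a, b in zip(f, l))

feature = [1, 1, 0, 0, 1, 0, 1, 0]
labels  = [1, 1, 0, 0, 1, 0, 0, 0]
print(randomization_pvalue(feature, labels, agreement))
```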
Tests

Use IMDb to predict whether a movie takes in $2 million in box office receipts on opening weekend:
1) IMDb with class labels correlated only with degree features
2) Use the structure of IMDb as well as attributes
3) Cora: whether a paper is of topic “neural networks”
4) Genome project: whether a gene is in the nucleus
5) WebKB: whether a page is a student web page
Tests (cont’d)

For each of these tasks, a set of classification techniques was applied:
– RPT with chi-squared tests (CT)
– RPT with randomization tests (RT)
– RPT (pseudo) without randomized testing (C4.5)
– Baseline: Relational Bayesian Classifier (RBC)
* marks statistically significant differences
Results
Results explained

Except on the random-label task, their algorithm did not perform any better than the other algorithms
For the random-label task they show:
– RPTs can adjust for linkage and autocorrelation
– RTs perform better than the other 3 models
– Biased models select random attributes that hinder performance
Conclusions (theirs and mine)

Possible to extend conventional methods to relational probability estimation tree algorithms
RPT with randomization and RPT without randomization perform about the same (smaller trees in the former)
Randomization adjusts for biases (but is this so important if they perform the same?)
Non-selective models don’t have these biases, but lose interpretability
Future work: enrichments to RPTs
– Better ways of calculating ranking scores