Document 676237

Download Report

Transcript Document 676237

Statistical Inference for Large
Directed Graphs with
Communities of Interest
Deepak Agarwal
Outline
• Communities of Interest : overview
•
•
•
•
Why a probabilistic model?
Bayesian Stochastic Blockmodels
Example
Ongoing work
Communites of interest
• Goal: understand calling behavior of every TN on ATT
LD network: massive graph
• Corinna, Daryl and Chris invented COI’s to scale
computation using Hancock (Anne Rogers and
Kathleen Fisher)
• Definition: COI of TN X is a subgraph centered around X
– Top k called by X + other
– Top k calling X + other
COI
signature
Other
inbound
X
Other
outbound
• Entire graph union of COI’s
• Extend a COI by recursively growing the spider
– Captures calling behavior more accurately
• Definition for this work:
– Grow the spider till depth 3. Only retain depth 3 edges
that are between depth 2 nodes.
Extended
COI
x
other
me
other
X
Enhancing a COI !!
• Missed calls:
– Local calls where TN’s not ATT local
– Outbound OCC calls
– Calls to/from the bin “other”
• Big outbound and inbound TNs
– Dominate the COI, lot of clutter.
– Need to down weight their calls.
• Other issues
Want to quantify things like tendency to call, tendency of
being called, tendency of returning calls for every TN.
Our approach so far
• COI -> social network
• Want a statistical model that estimates
missing edges, add desired ones and
remove (or down weight) undesired ones.
me
COI from top
probability edges of a
statistical model.
The model adds new
edges. (brown
arrows)
Removes undesired
ones.
Getting a sense of data
Some descriptive statistics
based on a random sample
of 500 residential COI’s.
density = 100*ne/(g(g-1))
ne = number of edges
g = number of nodes
Under random
Average conditional
on out -degrees
Under random:
Conditional on outdegrees
Under random:
Conditional on indegrees
Distribution of “Other"
Representing the Data
• Collection of all edges with activity
• Matrix with no diagonal entries
• Collection of several 2x2 contingency
tables
COI: gxg matrix without diagonal entries
COI: collection of 2x2 tables.
• Data matrix a collection of g(g-1)/2 2x2 tables
(called dyads).
present
present
Column
total
absent
aij
mij
i->j
absent
j-> i
aji
nij
pji
1-pji
Row total
pij
1-pij
1
More probabilities than edges.
Need to express them in terms of fewer
parameters which could be learned from
data.
likelihood  C exp( M  w     i wi     j w j 
i
j
s  si wi   r  rj w j    zij wij )
i
j
i, j
Optimizer goes crazy due to presence of so many zero degrees
Do regularization, known as “shrinkage estimation” in statistics.
Incur bias for small degree nodes but get reduction in variance.
All Greek letters to be estimated from data
Computation: 2 minutes for a typical COI on fry
Likelihood, gradient and Hessian computed using C, optimizer
in R.
Meaning of parameters
• Node i:
– αi: expansiveness (tendency to call)
– βi: attractiveness (tendency of being called)
• Global parameters:
– θ: density of COI (reduces with increasing
sparseness)
– ρ: reciprocity of COI (tendency to return calls)
– λs: “caller” specific effect
– λr: “cal lee” specific effect
– γ: “call” specific effect
Differential reciprocity
• Different reciprocity for each node:
– Add another parameter ηi to node i
– Replace ρM by ρM + Σ iηi Mi in the likelihood
– Called “differential reciprocity” model
– Computationally challenging, have
implemented it.
Missing edges?
• Can estimate all parameters as long as we
have some observed edges in data matrix
– for each row (to estimate expansiveness)
– for each column (to estimate attractiveness)
• Missing local calls -> o.k.
• OCC -> problem, entire row missing.
– Impute data using reasonable assumptions m
times (typically m=3 o.k.) and combine
results. Working on it.
Incorporating edge weights
• Edge weights binned into k bins using a random sample
of 500 COI’s. Weights in ith bin assigned a score i.
tij unknown,
w’s weights on
dyad (i,j). tij
Row total
wij
tij
imputed using
k - wij
Hyper geometric
Column total
wji
k - wji
k
Example
• COI with 117 nodes, 172 edges.
• 14 missing edges, local calls from14 non ATT local
customers to seed node (local list provided by Gus).
• One edge attribute: number of common “buddies”
between TN i and TN j
• Tried Bizocity, “Localness to seed” for caller and cal lee
effects, eventually settled with one caller effect viz
localness to seed, no cal lee effect.
Parameter estimates.
• θ = -6.28; ρ=2.76 (higher side)
• λs=.29 (TN’s local to seed have a higher
tendency to call)
• γ=.41 (common acquaintances between two
TN’s increase their tendency to call each other)
Pruning the big (red) nodes
• Down weight expansiveness/attractiveness based on
proportion of volume going to “other”, higher value get
down weighted more by adding “offset”
– Renormalize the new probability matrix to have the same mass as
the original one.
• Offset function used:
f (other)  0 if other  a
  p log( 1  tan(. 5a)  tan(. 5other) ) if other  a
Matrix obtained by taking
union of top 50 data edges,
top 50 edges from original model,
top 50 edges from pruned model.
Where to from here?
• Estimate missing OCC calls :multiple imputation.
• Scale the algorithm to get parameter estimates
for every TN, maybe on a weekly basis, enrich
customer signature.
• Can compute Hellinger distance between two
COIs in closed form. Could be useful in
supervised learning tasks like tracking Repetitive
debtors.