Mining and modelling character networks

Download Report

Transcript Mining and modelling character networks

Complex Systems Seminar
Queen Mary University of London
Mining and Modeling
Character Networks
Anthony Bonato
Ryerson University
Complex networks in the era of Big Data
• web graph, social networks, biological networks, internet
networks, …
Character networks
2
What is a complex network?
• no precise definition
• however, there is general consensus on the
following observed properties
1. large scale
2. evolving over time
3. power law degree distributions
4. small world properties
Character networks
3
Examples of complex networks
• technological/informational: web graph, router
graph, AS graph, call graph, e-mail graph
• social: on-line social networks (Facebook,
Twitter, LinkedIn,…), collaboration graphs, coactor graph
• biological networks: protein interaction
networks, gene regulatory networks, food
networks, connectomes
Character networks
4
Other properties
• densification power law (Leskovec,Kleinberg,Faloutsos,05):
– |(E(Gt)| ≈ |V(Gt)|a where 1 < a ≤ 2: densification exponent
• community structure
• spectral expansion
Character networks
5
Game of Thrones?
Character networks
6
Network of Thrones
(Beveridge,Shan,16)
• considered social network within A Storm of Swords by G.R.R.
Martin
– 3rd book in A Song of Ice and Fire series (literary origin of the HBO drama Game
of Thrones)
• social network inferred by co-occurrence
–
–
nodes: characters
edges: two characters within 15 words or less in the text
• graph extraction
– automated, with some manual adjustments
• social network analysis
– centrality measures
Character networks
7
Visualization
Character networks
8
Centrality measures
Character networks
9
Main characters revealed
Character networks
10
Quantative methods
in literary analsysis
•
(Reagan,Mitchell,Kiley,Danforth,Dodds,16+)
– emotional arcs of stories are dominated by six basic shapes (cf. Vonnegut)
•
(Ribeiro,Vosgerau,Andruchiw,Pinto,16)
– structural properties, such as assortivity and transitivity, of communities in the
social network of J.R.R. Tolkien’s The Lord of the Rings+The Hobbit+Silmarillion
•
(Agarwal,Corvalan,Jensen,Rambow,12)
– analyze the dynamic network for Alice in Wonderland, defined by the mining of
the ten chapters independently of each other
•
(Sack,12)
– structural balance theory as generative model for characters
Character networks
11
Character networks
•
cultural work:
– fictional works such novels or short stories, movies,
biographies, historical works, religious texts
•
character networks:
– nodes: characters or persona in a cultural work
– edges: co-occurrence
• edges may be weighted
Character networks
12
E.g.: Marvel universe
•
•
•
•
Character networks
10K nodes
diameter 9
10 communities
average degree 41
13
Approaches
• mining
– social network analysis
– community extraction
– top influencers
• modeling
– uncover generative mechanisms
Character networks
14
Our approach
• mining/microscopic:
– analysis of three novel data sets
– community and top influencer extraction
• modeling/macroscopic:
– model selection for 800 cultural works
– contrast and compare existing models
Character networks
15
Novels
• Twilight by Stephanie Meyer
• Harry Potter and the Goblet of Fire by
J.K. Rowlings
• The Stand by Stephen King
Character networks
16
Twilight: visualized
Character networks
17
Twilight: centrality measures
Character networks
18
Observations
• the three Twilight communities can be labeled as:
– vampires
– high school students
– characters close to Charlie
• successfully picked out Edward, Bella, and Alice as top
influencers
Character networks
19
Other visualizations: The Stand
Character networks
20
Other visualizations:
Harry Potter and the Goblet of Fire
Character networks
21
“All models are wrong, but some are more useful.”
– G.P.E. Box
Character networks
22
Complex network models
•
Preferential Attachment (PA)
– nodes born over time, initially with degree m
– new nodes have higher probability to join to high degree nodes
•
Binomial Random Graph G(n,p) / Erdős-Rényi (ER)
– edges chosen independently with probability p among distinct pairs of n nodes
•
Chung-Lu (CL) model
– generalizes the binomial random graph model to non-uniform edge probabilities
•
Configuration Model (CFG)
– select graph uniformly from the set of graphs which exactly match the target
degree distribution.
Character networks
23
Machine learning
• machine learning is a branch of AI
that infers structure from data
• examples:
– spam filters
– Netflix recommender systems
– text and image categorization
• especially useful when the data or
number of decisions are too large for
humans to process
Character networks
24
Model selection in
complex networks
•
(Middendorf,Ziv,Wiggins,05)
– used ADTs and motifs for model selection in protein networks
– predicted duplication/mutation model
•
(Memišević,Milenković,Pržulj,10)
– model selection predicting random geometric graphs as best fit
for protein networks
•
(Janssen,Hurshman,Kalyaniwalla,12)
– ADT with motif classifiers predict PA and SPA models best fit
Facebook 100 graphs
Character networks
25
Sec. 15.1
Support Vector Machine (SVM)
• SVM maximizes the margin around
the separating hyperplane
support vectors
• solving SVMs is a quadratic
programming problem
• successful text and image
classification method
Character networks
maximizes
margin
26
moviegalaxies.com data
moviegalaxies.com,
catalogues the
social networks in
800+ movies
Character networks
27
Motifs/profiles/graphlets
Character networks
28
Results
Character networks
29
Conclusions
• CL best fitting model for 800+ samples
– tested with four ML algorithms
– tested also via comparisons with eigenvalue
spectrum
• presence of small cliques is one separator
Character networks
30
Why CL?
• character networks we considered were small
– less likely to exhibit power law degree distributions or dimensionality found in
OSNs like Facebook
– PA or spatial models may be less relevant for character networks
• author may intuit a hierarchy of character influence (via degrees),
then randomly generate the social ties
– eg Rowlings may have decided the main triad was Harry, Hermione and Ron,
and then gradually added lesser characters revolving around this triad
• CL model has 4-node subgraph counts that more accurately model
character networks
Character networks
31
Future directions
• larger data character network sets, and larger
set of samples
• networks over time
• new models
• predictive tools
Character networks
32
PageRank over time for Dune
character network
Character networks
33
Dominating sets
• dominating set S:
– each node outside S is joined to a node in S
• NP-hard (Kann,92)
•
(B,Lozier,Mitsche,Perez-Gimenez,Prałat,15)
– domination in FB100 graphs
– 10% of nodes dominate (provably true in MGEO-P model)
• dominating sets in character networks appear to reveal top
influencers
Character networks
34
MGEO-P model
(Bonato,Gleich,Mitsche,Prałat,Tian,Young,14)
• place n points u.a.r. in the hypercube
• assign ranks from via a random permutation σ
• for each pair i > j, ij is an edge if j is in the ball
of volume
σ(i)–αn-β
Character networks
35
Properties of the MGEO-P model
(BGMPTY,14)
• a.a.s. the MGEO-P model generates graphs with the
following properties:
– power law degree distribution with exponent
b = 1+1/α
– average degree d = (1+o(1))n(1-α-β)/21-α
• densification
– diameter D = nΘ(1/m)
Character networks
36
Dimension of OSNs
• given the order of the network n and
diameter D, we can calculate m
• gives formula for dimension of OSN:
log n
m
log D
Character networks
37
Logarithmic Dimension
Hypothesis
In an OSN of order n and diameter D, the
dimension of its Blau space is
log n
log D
• posed independently by (Leskovec,Kim,11),
(Frieze, Tsourakakis,11)
Character networks
38
Dimensionality in
FB and LinkedIn
Character networks
39
Dimensionality
of character networks?
Character networks
40