SI/EECS 767
Yang Liu
January 29, 2010
LANGUAGE NETWORKS
INTRODUCTION
Human language described as a complex network
(Solé et al., 2005)
MOTIVATION
Analyzing statistical properties
Building models to explain the patterns
Studying the origins and evolution of human
language
Statistical approaches to natural language
processing
CATEGORIZATION
Words as vertices
Co-occurrence networks
(Dorogovtsev & Mendes, 2001; Masucci & Rodgers, 2008)
Semantic networks
(Steyvers & Tenenbaum, 2005)
Syntactic networks
(Ferrer et al., 2004)
Sentences as vertices
(Erkan & Radev, 2004)
Documents as vertices
(Menczer, 2004)
CO-OCCURRENCE AND SYNTACTIC NETWORKS
SEMANTIC NETWORKS
Language as an evolving word
web
(Dorogovtsev & Mendes, 2001)
INTRODUCTION
Propose a theory of how language evolves
Treat human language as a complex network of
distinct words
Words are connected to their nearest neighbors
(co-occurrence networks)
In the papers of Ferrer & Solé (2001, 2002), the degree
distribution consists of two power-law regions with
different exponents
THE MODEL
Preferential attachment
Produces a power-law degree distribution
In the classical model, the average degree does not change
Here, the total number of connections increases more rapidly
than the number of vertices, so the average degree grows
THE MODEL
At each time step:
a new vertex (word) is added; t, the total number of vertices,
plays the role of time;
the new vertex is connected to some old vertex i with
probability proportional to its degree ki;
ct new edges emerge between old words (c is a constant
coefficient);
these new edges emerge between vertices i and j with
probability p ∝ ki kj
(a simulation sketch of this rule follows below)
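A minimal simulation sketch of this growth rule, written for illustration only (the value of c, the number of steps, and the two-vertex seed are assumptions, not values from the paper):

```python
import random
from collections import defaultdict

def grow_word_web(steps=5000, c=0.01, seed=1):
    """Sketch of the Dorogovtsev-Mendes word-web growth rule:
    each new word attaches to one old word by preferential attachment (PA),
    and about c*t extra edges appear between old words, both endpoints
    chosen by PA (p ~ ki*kj). Multi-edges are not filtered out here."""
    rng = random.Random(seed)
    degree = defaultdict(int)
    # every vertex appears in this list once per unit of degree,
    # so uniform sampling from it implements preferential attachment
    endpoints = [0, 1]
    degree[0] = degree[1] = 1            # seed: a single edge 0-1

    for t in range(2, steps):
        old = rng.choice(endpoints)      # new word t attaches by PA
        degree[t] += 1
        degree[old] += 1
        endpoints += [t, old]
        for _ in range(int(c * t)):      # ~c*t new edges among old words
            i, j = rng.choice(endpoints), rng.choice(endpoints)
            degree[i] += 1
            degree[j] += 1
            endpoints += [i, j]
    return degree

deg = grow_word_web()
print("vertices:", len(deg), "average degree:", sum(deg.values()) / len(deg))
```

With these toy settings the average degree keeps growing with t, which mirrors the point made above.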
DATA
Two word webs by Ferrer and Solé (2001, 2002)
Obtained from ¾ of a million words of the British National
Corpus
470 000 vertices
Average degree = 72
SOLVING THE MODEL
Continuum approximation
k(s,t) : the average degree of the vertices born at time
s and observed at time t
ct ≈ 70 >> 1
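Written out from the growth rule above, the continuum (rate-equation) form is, in simplified notation (the seed's contribution to the total degree is dropped):

```latex
% one endpoint from the new word's edge plus 2*c*t endpoints from the
% old-old edges are distributed by preferential attachment:
\frac{\partial k(s,t)}{\partial t}
  = \bigl(1 + 2ct\bigr)\,\frac{k(s,t)}{\sum_{u} k(u,t)} ,
\qquad
% total degree = twice the number of edges (t new-word edges + c t^2/2 old-old edges)
\sum_{u} k(u,t) \;\approx\; 2t + c\,t^{2}
```

Solving this equation produces the two-region degree distribution described on the next slides.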
SOLVING THE MODEL
The degree distribution has two regions
separated by the crossover point
SOLVING THE MODEL
Below this point: stationary degree distribution
Above this point: non-stationary degree distribution
Empty and filled circles show the degree distributions for
the two word webs of Ferrer & Solé (2001, 2002)
DISCUSSION
The model is concerned only with the degree distribution
The clustering coefficients do not match
The total number of words with degree greater than kcross
does not change
The size of the kernel lexicon does not depend on the total
number of distinct words in the language
Network properties of written
human language
(Masucci & Rodgers, 2008)
TOPOLOGY OF THE NETWORK
Words (including punctuation marks) are vertices, and two
vertices are linked if they are neighbors in the text.
Directed network
NETWORK STATISTICS
8992 vertices, 117687 edges, mean degree
<k> = 13.1
P(k) ∝ k^-1.9
Zipf's law slope -1.2
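A quick sketch of how such a directed network can be built from raw text and its basic degree statistics computed (the naive tokenizer and the toy sentence are illustrative only, not the procedure used in the paper):

```python
import re
from collections import defaultdict

def build_cooccurrence_network(text):
    """Directed network: words (and punctuation marks) are vertices;
    an edge w1 -> w2 is added whenever w2 immediately follows w1.
    Repeated pairs are stored once (unweighted simple digraph)."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    successors = defaultdict(set)
    for w1, w2 in zip(tokens, tokens[1:]):
        successors[w1].add(w2)
    return successors

def degree_stats(successors):
    vertices = set(successors) | {w for ts in successors.values() for w in ts}
    n_edges = sum(len(ts) for ts in successors.values())
    # mean total degree (in + out); each edge contributes one in- and one out-stub
    mean_k = 2 * n_edges / len(vertices)
    return len(vertices), n_edges, mean_k

net = build_cooccurrence_network("the cat sat on the mat. the dog sat, too.")
print(degree_stats(net))
```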
GROWTH PROPERTIES
The number of edges between words grows
faster than the number of vertices.
N(t) ∝ t^1.8
NEAREST NEIGHBORS' PROPERTIES
The mean clustering coefficient <c> = 0.19
REPEATED BINARY STRUCTURES OF WORDS
Reproduced by local preferential attachment (PA)
THE MODELS (D-M MODEL)
Starts with a chain of 20 connected vertices
At each time step, add a new vertex and connect it to some
vertex i with p ∝ ki
m(t) - 1 new edges emerge between old words with p ∝ ki kj
D-M MODEL
<c(k)> = 0.16
Catches the average clustering and the global growth behavior
Misses the internal structure
MODEL 2
Includes local PA
p(t) ≈ 0.1 t^0.16
Start with a chain of 20 connected vertices
At each time step, add a new vertex and connect it to some old
vertex i by global PA (p ∝ ki)
m(t) - 1 times: with probability p(t), link the new vertex to
an old vertex i in its nearest neighborhood through local PA
(p ∝ ki); with probability 1 - p(t), link it to an old vertex i
outside its nearest neighborhood by global PA
(a rough sketch of this rule follows below)
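A rough sketch of this mixed local/global attachment rule; the chain seed and the form of p(t) are taken from the slide, while the constant m = 3 is a placeholder because the slide does not give the form of m(t):

```python
import random
from collections import defaultdict

def grow_model2(steps=2000, m=3, seed=1):
    """Sketch of Model 2: global PA for a new vertex's first link, then a
    mix of local PA (inside the new vertex's nearest neighborhood) and
    global PA for the remaining m-1 links.
    NOTE: m is a fixed placeholder here; self-loops and multi-edges are
    not excluded in this quick sketch."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    endpoints = []                        # degree-weighted list for global PA

    def add_edge(a, b):
        neighbors[a].add(b)
        neighbors[b].add(a)
        endpoints.extend([a, b])

    for v in range(19):                   # start with a chain of 20 vertices
        add_edge(v, v + 1)

    for t in range(20, steps):
        p_local = min(0.1 * t ** 0.16, 1.0)   # p(t) ~ 0.1 t^0.16 (slide value)
        add_edge(t, rng.choice(endpoints))    # first link: global PA
        for _ in range(m - 1):
            hood = [v for v in neighbors[t] if v != t]
            if hood and rng.random() < p_local:
                # local PA: pick inside the nearest neighborhood, degree-weighted
                weights = [len(neighbors[v]) for v in hood]
                target = rng.choices(hood, weights=weights)[0]
            else:
                target = rng.choice(endpoints)  # global PA
            add_edge(t, target)
    return neighbors

g = grow_model2()
print("vertices:", len(g), "edges:", sum(len(v) for v in g.values()) // 2)
```

Linking inside the new vertex's own neighborhood is what creates the repeated local structures that pure global PA misses.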
MODEL 2
<c> = 0.08
Captures the global and nearest-neighbor behavior but not the
average clustering coefficient
MODEL 3
Different words in written human language
display different statistical distributions,
according to their functions
MODEL 3
Start with a chain of 20 connected vertices
At each time step, add a new vertex and connect it to some old
vertex i by global PA (p ∝ ki)
m(t) - 1 times: with probability q = 0.05, link the last linked
vertex to one of the three fixed (preselected) vertices; with
probability p(t), link the new vertex to an old vertex i in its
nearest neighborhood through local PA (p ∝ ki); with probability
1 - p(t) - 3q, link it to an old vertex i outside its nearest
neighborhood by global PA.
MODEL 3
<c> = 0.20
CONCLUSIONS
New growth mechanisms:
1. Local PA
2. The allocation of a set of preselected vertices
The large-scale structure of semantic
networks: Statistical analyses and a
model of semantic growth
(Steyvers & Tenenbaum, 2005)
INTRODUCTION
There are general principles governing the
structure of network representations for natural
language semantics
The small-world structure arises from a scale-free organization
MODEL
Concepts that enter the network early are expected to show
higher connectivity
One aspect of semantic development: growth of semantic
networks by differentiation of existing nodes
The model grows through a process of differentiation
analogous to mechanisms of semantic development, which allows
it to produce both small-world and scale-free structure.
ANALYSIS OF SEMANTIC NETWORKS
Free association norms
WordNet
Roget’s thesaurus
METHODS
Associative networks
Created two networks: directed and undirected
ROGET’S THESAURUS
Bipartite graph
Word
nodes and semantic category nodes
A connection is made between a word and category
node when the word falls into the semantic
category
Converted to a simple graph for calculating the clustering
coefficient (one-mode projection)
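A small sketch of the one-mode projection onto word nodes (the toy words and category names are invented for illustration):

```python
from collections import defaultdict
from itertools import combinations

def one_mode_projection(word_to_categories):
    """Project a word-category bipartite graph onto the word nodes:
    two words are linked if they fall into at least one common category."""
    words_in_cat = defaultdict(set)
    for word, cats in word_to_categories.items():
        for cat in cats:
            words_in_cat[cat].add(word)

    projected = defaultdict(set)
    for members in words_in_cat.values():
        for w1, w2 in combinations(sorted(members), 2):
            projected[w1].add(w2)
            projected[w2].add(w1)
    return projected

# toy bipartite graph with invented categories
bipartite = {"dog": {"animals"}, "cat": {"animals", "agility"},
             "leap": {"agility"}}
print(one_mode_projection(bipartite))
```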
WORDNET
120,000+ word forms
99,000+ word meanings
Links between forms and forms, meanings and meanings, and
forms and meanings
Treated as an undirected graph
RESULTS
ZIPF’S “LAW OF MEANING”
GROWING NETWORK MODEL
Previous models
BA model: low clustering coefficient
WS model: no scale-free structure
MODEL A: UNDIRECTED
At each time step, a new node with M links is
added to the network by randomly choosing
some existing node i for differentiation,
and then connecting the new node to M
randomly chosen nodes in the semantic
neighborhood of node i.
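A minimal undirected sketch of this differentiation process; the uniform-random choices follow the slide's wording, and the seed clique, n, and M values are illustrative defaults rather than the paper's settings:

```python
import random

def grow_semantic_network(n=500, M=3, seed=1):
    """Sketch of Model A (undirected): each new node differentiates an
    existing node and connects to M nodes in that node's neighborhood.
    The uniform-random choices below follow the slide's wording; the
    paper's exact selection probabilities may differ."""
    rng = random.Random(seed)
    # seed network: a small clique of M + 1 fully connected nodes
    neighbors = {i: set(range(M + 1)) - {i} for i in range(M + 1)}

    for new in range(M + 1, n):
        i = rng.choice(list(neighbors))                # node to differentiate
        hood = list(neighbors[i])
        targets = rng.sample(hood, min(M, len(hood)))  # M nodes in i's neighborhood
        neighbors[new] = set()
        for t in targets:
            neighbors[new].add(t)
            neighbors[t].add(new)
    return neighbors

g = grow_semantic_network()
degrees = sorted((len(v) for v in g.values()), reverse=True)
print("max degree:", degrees[0], "mean degree:", sum(degrees) / len(degrees))
```

Because high-degree nodes sit in many neighborhoods, they keep gaining links as the network grows, which is what drives the scale-free structure.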
TWO PROBABILITY DISTRIBUTIONS
Set n equal to the size of the target network
Set M equal to ½ <k>
MODEL B: DIRECTED
Assume the direction of each arc is chosen randomly and
independently of the other arcs
Each arc points toward the old node with probability α and
toward the new node with probability 1 - α
RESULTS
Models A and B are tested only on the association networks
α set to 0.95
Averaged over 50 simulations
Patterns in syntactic
dependency networks
(Ferrer et al., 2004)
INTRODUCTION
Co-occurrence networks fail to capture the characteristic
long-distance correlations of words in sentences
The proportion of incorrect syntactic dependency links is
high
A precise definition of the syntactic link is required
THE SYNTACTIC DEPENDENCY NETWORK
Defined according to the dependency grammar
formalism
Vertices are words, links go from the modifier to
its head
CORPORA
1. Czech corpus
Proportion of annotated links is about 0.65 (the missing
links are between function words)
Annotated by hand
2. Romanian corpus
Annotated by hand
3. German corpus
Proportion of annotated links is about 0.16 (the missing
links obey no regularity)
Annotated automatically
NETWORK PROPERTIES
Small-world structure: small average path length D and high
clustering coefficient C
Heterogeneity: power-law degree distribution
Hierarchical organization: C(k) ~ k^-θ
Betweenness centrality: P(g) ~ g^-η
Assortativeness
RESULTS
GLOBAL VS SENTENCE-LEVEL PATTERNS
DISCUSSIONS
(1) Disassortative mixing tells us that labor is divided in
human language. Linking words tend to avoid connections among
themselves.
(2) Hierarchical organization tells us that syntactic
dependency networks not only define the
syntactically correct links but also a top down
hierarchical organization that is the basis of
phrase structure formalisms.
(3) Small worldness is a necessary condition for
recursion.
LexRank: graph-based lexical centrality as
salience in text summarization
(Erkan & Radev, 2004)
INTRODUCTION
Graph-based methods in NLP
Random walks on sentence-based graphs help
in Text Summarization (TS)
Extractive summarization vs. abstractive summarization
Assess the centrality of each sentence in a cluster and
extract the most important ones to include in the summary
Centrality measures: Degree, LexRank with threshold, and
continuous LexRank
Vertices represent sentences and edges are
defined in terms of the similarity relation
between pairs of sentences
Toolkit: MEAD
Test data: DUC 2003 and 2004
CENTROID-BASED SUMMARIZATION
Centroid of the document cluster in a vector
space
The centroid is a pseudo-document which
consists of words that have tf*idf scores above
a predefined threshold
The sentences that contain more words from
the centroid are considered as central
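A compact sketch of this scoring scheme (the tf*idf threshold and the toy idf table are illustrative placeholders, not values from MEAD):

```python
from collections import Counter

def centroid_scores(sentences, idf, tfidf_threshold=1.0):
    """Centroid-based sentence scoring: build a pseudo-document of words
    whose tf*idf over the whole cluster exceeds a threshold, then score
    each sentence by how much centroid weight its (distinct) words carry."""
    cluster_tf = Counter(w for s in sentences for w in s.lower().split())
    centroid = {w: tf * idf.get(w, 1.0)
                for w, tf in cluster_tf.items()
                if tf * idf.get(w, 1.0) > tfidf_threshold}
    return [sum(centroid.get(w, 0.0) for w in set(s.lower().split()))
            for s in sentences]

# toy idf values, purely for illustration
idf = {"networks": 2.0, "language": 1.5, "the": 0.1}
sents = ["language networks are studied here",
         "the weather was nice"]
print(centroid_scores(sents, idf))
```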
CENTRALITY-BASED SENTENCE SALIENCE
Hypothesis: the sentences that are similar to
many of the other sentences are more
central/salient to the topic
Cosine similarity between two sentences:
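The similarity the paper uses is an idf-modified cosine between sentence term vectors; reconstructed here in standard tf/idf notation, it takes the form:

```latex
\mathrm{idf\text{-}modified\text{-}cosine}(x,y) =
\frac{\sum_{w \in x,\,y} \mathrm{tf}_{w,x}\,\mathrm{tf}_{w,y}\,(\mathrm{idf}_w)^2}
     {\sqrt{\sum_{x_i \in x} (\mathrm{tf}_{x_i,x}\,\mathrm{idf}_{x_i})^2}\;
      \sqrt{\sum_{y_i \in y} (\mathrm{tf}_{y_i,y}\,\mathrm{idf}_{y_i})^2}}
```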
DEGREE CENTRALITY
Significantly similar sentences are connected
to each other
Choice of cosine threshold
EIGENVECTOR CENTRALITY AND LEXRANK
PageRank
Sum of neighbors' prestige, each divided by the neighbor's degree
d: damping factor, set to 0.85
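With d = 0.85 as the damping factor, the damped score of a sentence u over the thresholded similarity graph can be written as follows (the placement of d follows the usual PageRank convention; write-ups differ on which term carries d):

```latex
p(u) \;=\; \frac{1-d}{N} \;+\; d \sum_{v \in \mathrm{adj}[u]} \frac{p(v)}{\deg(v)}
% N: number of sentences, adj[u]: neighbors of u in the similarity graph
```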
CONTINUOUS LEXRANK
Improves on LexRank by using the strength (weight) of the
similarity links instead of a binary threshold
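A short power-iteration sketch of continuous LexRank over a cosine-similarity matrix (the damping convention matches the formula above, and the tiny similarity matrix is made up for illustration):

```python
def continuous_lexrank(sim, d=0.85, tol=1e-6, max_iter=100):
    """Continuous LexRank: a PageRank-style random walk whose transition
    probabilities are proportional to the (unthresholded) similarity weights."""
    n = len(sim)
    # total similarity mass leaving each sentence v
    col_sums = [sum(sim[u][v] for u in range(n)) for v in range(n)]
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new_p = [(1 - d) / n + d * sum(sim[u][v] * p[v] / col_sums[v]
                                       for v in range(n))
                 for u in range(n)]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# toy symmetric similarity matrix, purely for illustration
sim = [[1.0, 0.4, 0.1],
       [0.4, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
print(continuous_lexrank(sim))
```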
CENTRALITY VS CENTROID
1. centrality accounts for information
subsumption among sentences
2. it prevents unnaturally high idf scores from
boosting up the score of a sentence that is
unrelated to the topic
EXPERIMENT
Data set: DUC 2003 and 2004
Evaluation method: ROUGE
MEAD toolkit
Feature extraction: Centroid, Position, and Length features
Combiner: combines the features with relative weights
Reranker
RESULTS AND DISCUSSION
Effects of threshold
COMPARISON OF CENTRALITY MEASURES
EXPERIMENT ON NOISY DATA
Evolution of document
networks
(Menczer, 2004)
BACKGROUND
Content similarity
Link probability approximated by a link similarity metric
(the Jaccard coefficient of the two pages' link neighborhoods)
JOINT DISTRIBUTION MAPS
DEPENDENCY OF THE WEB’S LINK TOPOLOGY
ON CONTENT
Conditional probability that the link similarity between two
web pages is above some threshold λ, given that the two pages
have content similarity κ, as a function of κ
Phase transition around κ*
For κ > κ*, the probability that two pages are neighbors does
not seem to depend on their content similarity; for κ < κ*,
the probability decreases according to a power law
Pr(λ | κ) ~ κ^γ
MODEL
At each step t, one new page t is added, and m new links are
created from t to m existing pages, each selected from
{i : i < t} with a probability that depends on the content
similarity κ
(m, κ*, and γ are constants and c is a normalization factor)
VALIDATING PRIOR MODELS
Look for a model capable of predicting both the degree
distribution and the similarity distributions among linked
documents
DEGREE-SIMILARITY MIXTURE MODEL
At each step, one new document is added, and m new
links or references are created from it to existing
documents.
At time t, the probability that the i-th document is selected
and linked from the t-th document combines preferential
(degree-based) attachment with content similarity
α is a preferential attachment parameter
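The slide's attachment formula is not reproduced in this transcript; as a purely hypothetical illustration of the degree-similarity trade-off it describes, one could mix the two attachment modes through α like this (the functional form below is an assumption, not the paper's equation):

```python
import random

def pick_target(degrees, similarities, alpha, rng=random):
    """Hypothetical degree-similarity mixture: with probability alpha pick an
    existing document by preferential attachment (popularity), otherwise pick
    one in proportion to its content similarity to the new document.
    This functional form is an illustrative assumption, not the paper's formula."""
    candidates = list(degrees)
    if rng.random() < alpha:
        weights = [degrees[i] for i in candidates]        # popular documents
    else:
        weights = [similarities[i] for i in candidates]   # related documents
    return rng.choices(candidates, weights=weights)[0]

# toy example: 3 existing documents with made-up degrees and similarities
degrees = {0: 5, 1: 2, 2: 1}
similarities = {0: 0.1, 1: 0.7, 2: 0.4}
print(pick_target(degrees, similarities, alpha=0.5))
```

This captures the tension noted in the conclusion: with large α new documents cite popular ones, with small α they cite related ones.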
VALIDATE THE MODEL
CONCLUSION
Page content cannot be neglected when we try
to understand the evolution of document
networks.
The tension between referring to popular
versus related documents
Questions?