
SI/EECS 767
Yang Liu
January 29, 2010
LANGUAGE NETWORKS
INTRODUCTION

Human language can be described as a complex network (Solé et al., 2005)
INCENTIVES
• Analyzing statistical properties
• Building models to explain the observed patterns
• Studying the origins and evolution of human language
• Statistical approaches to natural language processing

CATEGORIZATION

• Words as vertices
  • Co-occurrence networks (Dorogovtsev & Mendes, 2001; Masucci & Rodgers, 2008)
  • Semantic networks (Steyvers & Tenenbaum, 2005)
  • Syntactic networks (Ferrer i Cancho et al., 2004)
• Sentences as vertices
  • (Erkan & Radev, 2004)
• Documents as vertices
  • (Menczer, 2004)
CO-OCCURRENCE AND SYNTACTIC NETWORKS
SEMANTIC NETWORKS
Language as an evolving word web
(Dorogovtsev & Mendes, 2001)
INTRODUCTION

• Proposes a theory of how language evolves
• Treats human language as a complex network of distinct words
• Words are connected with their nearest neighbors (co-occurrence networks)
• In the word webs of Ferrer & Solé (2001, 2002), the degree distribution consists of two power-law parts with different exponents
THE MODEL

• Preferential attachment provides a power-law degree distribution
• In the standard preferential-attachment model, the average degree does not change
• In the word web, however, the total number of connections increases more rapidly than the number of vertices, so the average degree grows
THE MODEL

At each time step:
• a new vertex (word) is added; t, the total number of vertices, plays the role of time;
• the new vertex is connected to some old vertex i with probability proportional to its degree ki;
• ct new edges emerge between old words (c is a constant coefficient);
• these new edges emerge between vertices i and j with probability p ∝ ki kj
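
As a concrete illustration, here is a minimal simulation of this growth rule (plain Python; the value of c and the number of steps are illustrative choices, not values from the paper):

```python
import random

def dm_word_web(steps=5000, c=0.01, seed=42):
    """Grow a Dorogovtsev-Mendes-style word web: at step t a new
    vertex attaches to one old vertex by preferential attachment (PA),
    and round(c*t) extra edges appear between old vertices i, j with
    probability ~ ki*kj."""
    rng = random.Random(seed)
    edges = [(0, 1)]              # seed the web with a single edge
    # Sampling uniformly from this endpoint list is equivalent to
    # degree-proportional (preferential) sampling.
    endpoints = [0, 1]
    for t in range(2, steps):
        old = rng.choice(endpoints)           # PA for the new word
        edges.append((t, old))
        endpoints += [t, old]
        # round(c*t) new edges among old words; drawing both ends
        # independently by PA realizes p ~ ki*kj.
        for _ in range(round(c * t)):
            i, j = rng.choice(endpoints), rng.choice(endpoints)
            if i != j:
                edges.append((i, j))
                endpoints += [i, j]
    return steps, edges

n, edges = dm_word_web()
print(n, "vertices,", len(edges), "edges, <k> =", round(2 * len(edges) / n, 1))
```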
DATA

• Two word webs built by Ferrer & Solé (2001, 2002)
• Obtained from about ¾ of a million words of the British National Corpus
• About 470,000 vertices
• Average degree ≈ 72
SOLVING THE MODEL

• Continuum approximation
• k(s, t): the average degree of the vertices born at time s and observed at time t
• ct ≈ 70 ≫ 1
SOLVING THE MODEL

• The degree distribution has two regions separated by a crossover point
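
The equations on this slide did not survive transcription; reconstructed from Dorogovtsev & Mendes (2001), the continuum approximation reads

$$\frac{\partial k(s,t)}{\partial t} = \left(1 + 2ct\right)\frac{k(s,t)}{\int_0^t k(u,t)\,du}, \qquad \int_0^t k(u,t)\,du = 2t + ct^2,$$

so the average degree is $\bar{k}(t) = 2 + ct$ (with $ct \approx 70$ this gives $\bar{k} \approx 72$, matching the data slide). For $ct \gg 1$ the solution yields two power-law regions:

$$P(k) \propto k^{-3/2} \;\; (k \ll k_{\mathrm{cross}}), \qquad P(k) \propto k^{-3} \;\; (k \gg k_{\mathrm{cross}}).$$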
SOLVING THE MODEL

• Below this point, the degree distribution is stationary
• Above this point, the degree distribution is non-stationary

[Figure: empty and filled circles show the degree distributions of the two word webs of Ferrer & Solé (2001, 2002)]
DISCUSSION

• The model is concerned only with the degree distribution
• Clustering coefficients do not match the data
• The total number of words of degree greater than k_cross does not change
• The size of the kernel lexicon does not depend on the total number of distinct words in the language

Network properties of written human language
(Masucci & Rodgers, 2008)
TOPOLOGY OF THE NETWORK

• Words (including punctuation marks) are the vertices, and two vertices are linked if they are neighbors in the text
• The network is directed
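
A minimal sketch of this construction (plain Python; the naive punctuation splitting and the sample sentence are simplifying assumptions on my part):

```python
from collections import Counter

def adjacency_network(text):
    """Build a directed word-adjacency network: each token is a
    vertex, and an arc w1 -> w2 is added whenever w1 immediately
    precedes w2 in the text. Returns vertices and weighted arcs."""
    # Naive tokenization: split punctuation off as separate tokens,
    # since punctuation marks are vertices too.
    for p in ".,;:!?":
        text = text.replace(p, f" {p} ")
    tokens = text.lower().split()
    arcs = Counter(zip(tokens, tokens[1:]))   # (w1, w2) -> count
    vertices = set(tokens)
    return vertices, arcs

v, a = adjacency_network("the cat sat on the mat. the dog sat too.")
print(len(v), "vertices,", len(a), "distinct arcs")
print(a.most_common(3))
```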

NETWORK STATISTICS

• 8,992 vertices, 117,687 edges, mean degree <k> = 13.1
• P(k) ∝ k^-1.9
• Zipf's law slope ≈ -1.2
GROWTH PROPERTIES

• The number of edges between words grows faster than the number of vertices
• N(t) ∝ t^1.8
NEAREST NEIGHBORS' PROPERTIES

• The mean clustering coefficient <c> = 0.19
REPEATED BINARY STRUCTURES OF WORDS

• Reproduced by local preferential attachment (PA)
THE MODELS (D-M MODEL)

• Start with a chain of 20 connected vertices
• At each time step, add a new vertex and connect it to some old vertex i with p ∝ ki
• m(t) - 1 new edges emerge between old words with p ∝ ki kj

D-M MODEL

• <c(k)> = 0.16
• Captures the average clustering and the global growth behavior
• Misses the internal structure
MODEL 2

• Includes local PA (see the sketch after this list)
• p(t) ≈ 0.1 t^0.16
• Start with a chain of 20 connected vertices
• At each time step, add a new vertex and connect it to some old vertex i (not a nearest neighbor) with p ∝ ki
• m(t) - 1 times: with probability p(t), link the last vertex to an old vertex i in its nearest neighborhood through local PA (p ∝ ki); with probability 1 - p(t), link it to an old vertex i outside its nearest neighborhood through global PA
MODEL 2

• <c> = 0.08
• Captures the global and nearest-neighbor behavior but not the average clustering coefficient
MODEL 3

• Different words in written human language display different statistical distributions, according to their functions
MODEL 3

• Start with a chain of 20 connected vertices
• At each time step, add a new vertex and connect it to some old vertex i (not a nearest neighbor) with p ∝ ki
• m(t) - 1 times: with probability q = 0.05, link the last linked vertex to one of three fixed vertices; with probability p(t), link the last vertex to an old vertex i in its nearest neighborhood through local PA (p ∝ ki); with probability 1 - p(t) - 3q, link an old vertex i outside its nearest neighborhood through global PA
MODEL 3

<c> = 0.20
CONCLUSIONS

New growth mechanisms:
• 1. local PA
• 2. the allocation of a set of preselected vertices

The large-scale structure of semantic networks: statistical analyses and a model of semantic growth
(Steyvers & Tenenbaum, 2005)
INTRODUCTION

• There are general principles governing the structure of network representations of natural-language semantics
• The small-world structure arises from a scale-free organization
MODEL

• Concepts that enter the network early are expected to show higher connectivity
• One aspect of semantic development: the growth of semantic networks by differentiation of existing nodes
• The model grows through a process of differentiation analogous to mechanisms of semantic development, which allows it to produce both small-world and scale-free structure
ANALYSIS OF SEMANTIC NETWORKS

• Free-association norms
• WordNet
• Roget's Thesaurus
METHODS

• Associative networks
• Two networks were created: directed and undirected
ROGET'S THESAURUS

• Bipartite graph: word nodes and semantic-category nodes
• A connection is made between a word node and a category node when the word falls into that semantic category
• Converted to a simple graph (a one-mode projection) for calculating the clustering coefficient
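
A sketch of the one-mode projection (using networkx; the tiny word and category lists are invented for illustration):

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: word nodes and semantic-category nodes
# (the words and categories here are invented examples).
B = nx.Graph()
words = ["cold", "chilly", "hot", "warm"]
categories = ["TEMPERATURE_LOW", "TEMPERATURE_HIGH"]
B.add_nodes_from(words, bipartite=0)
B.add_nodes_from(categories, bipartite=1)
B.add_edges_from([("cold", "TEMPERATURE_LOW"), ("chilly", "TEMPERATURE_LOW"),
                  ("hot", "TEMPERATURE_HIGH"), ("warm", "TEMPERATURE_HIGH"),
                  ("warm", "TEMPERATURE_LOW")])

# One-mode projection onto the word nodes: two words are linked
# if they share at least one semantic category.
G = bipartite.projected_graph(B, words)
print(sorted(G.edges()))
print("clustering coefficient:", nx.average_clustering(G))
```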
WORDNET

• 120,000+ word forms
• 99,000+ word meanings
• Links between forms and forms, meanings and meanings, and forms and meanings
• Treated as an undirected graph
RESULTS
ZIPF’S “LAW OF MEANING”
GROWING NETWORK MODEL

Previous models:
• BA model: clustering coefficient too low
• WS model: no scale-free structure
MODEL A: UNDIRECTED

• At each time step, a new node with M links is added to the network by randomly choosing some existing node i for differentiation, and then connecting the new node to M randomly chosen nodes in the semantic neighborhood of node i (see the sketch below)
TWO PROBABILITY DISTRIBUTIONS

• Set n equal to the size of the target network
• Set M equal to ½ <k>
MODEL B: DIRECTED

• Assume the direction of each arc is chosen randomly and independently of the other arcs
• An arc points toward the old node with probability α and toward the new node with probability 1 - α
RESULTS

• Models A and B are tested only on the association networks
• α = 0.95
• Results are averaged over 50 simulations
Patterns in syntactic dependency networks
(Ferrer i Cancho et al., 2004)
INTRODUCTION

• Co-occurrence networks fail to capture the characteristic long-distance correlations of words in sentences
• The proportion of incorrect syntactic dependency links is high
• A precise definition of the syntactic link is required
THE SYNTACTIC DEPENDENCY NETWORK

• Defined according to the dependency grammar formalism
• Vertices are words; links go from a modifier to its head
CORPORA

1. Czech corpus
  • Proportion of links included is about 0.65 (links between function words are missing)
  • Annotation performed by hand
2. Romanian corpus
  • Annotation performed by hand
3. German corpus
  • Proportion of links included is about 0.16 (the included links obey no regularity)
  • Annotation performed automatically
NETWORK PROPERTIES

• Small-world structure: small average path length D and high clustering coefficient C
• Heterogeneity: power-law degree distribution
• Hierarchical organization: C(k) ~ k^-θ
• Betweenness centrality: P(g) ~ g^-η
• Assortativity
RESULTS

[Results figures not included in the transcript]
GLOBAL VS SENTENCE-LEVEL PATTERNS
DISCUSSION

(1) Disassortative mixing tells us that labor is divided in human language: linking words tend to avoid connections among themselves.
(2) Hierarchical organization tells us that syntactic dependency networks define not only the syntactically correct links but also a top-down hierarchical organization that is the basis of phrase-structure formalisms.
(3) Small-worldness is a necessary condition for recursion.
LexRank: graph-based lexical centrality as salience in text summarization
(Erkan & Radev, 2004)
INTRODUCTION

• Graph-based methods in NLP
• Random walks on sentence-based graphs help in text summarization (TS)
• Extractive vs. abstractive summarization
• Assess the centrality of each sentence in a cluster and extract the most important ones to include in the summary
• Centrality measures: degree, LexRank with threshold, and continuous LexRank
• Vertices represent sentences; edges are defined in terms of the similarity relation between pairs of sentences
• Toolkit: MEAD
• Test data: DUC 2003, 2004
CENTROID-BASED SUMMARIZATION

• The centroid of the document cluster is computed in a vector space
• The centroid is a pseudo-document consisting of the words whose tf*idf scores are above a predefined threshold
• Sentences that contain more words from the centroid are considered central
CENTRALITY-BASED SENTENCE SALIENCE

• Hypothesis: sentences that are similar to many of the other sentences are more central/salient to the topic
• Cosine similarity between two sentences:
DEGREE CENTRALITY

• Significantly similar sentences are connected to each other
• The choice of the cosine threshold matters
EIGENVECTOR CENTRALITY AND LEXRANK

• PageRank: a node's score is the sum of its neighbors' prestige, each divided by the neighbor's degree
• d: damping factor, set to 0.85
CONTINUOUS LEXRANK

• Improves LexRank by using the strength of the similarity links instead of a binary threshold
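
A compact sketch of continuous LexRank via power iteration (numpy; the toy similarity matrix stands in for one computed with the idf-modified cosine above):

```python
import numpy as np

def lexrank(S, d=0.85, tol=1e-8):
    """Continuous LexRank: power iteration on a similarity matrix S.
    Each column is normalized so the matrix is column-stochastic,
    then damped as in PageRank."""
    n = len(S)
    M = S / S.sum(axis=0, keepdims=True)      # column-stochastic
    p = np.full(n, 1.0 / n)
    while True:
        p_next = d / n + (1 - d) * M @ p
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy symmetric similarity matrix for 4 sentences (the diagonal is 1,
# since the cosine of a sentence with itself is 1).
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.5],
              [0.0, 0.1, 0.5, 1.0]])
scores = lexrank(S)
print("most central sentence:", int(np.argmax(scores)), scores.round(3))
```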
CENTRALITY VS. CENTROID

• 1. Centrality accounts for information subsumption among sentences
• 2. It prevents unnaturally high idf scores from boosting the score of a sentence that is unrelated to the topic
EXPERIMENT

• Data sets: DUC 2003 and 2004
• Evaluation method: ROUGE
• MEAD toolkit
  • Feature extraction: Centroid, Position, and Length
  • Relative feature weights
  • Combiner
  • Reranker
RESULTS AND DISCUSSION

Effects of threshold
COMPARISON OF CENTRALITY MEASURES
EXPERIMENT ON NOISY DATA
Evolution of document
networks
(Menczer, 2004)
BACKGROUND

• Content similarity
• Link probability approximated by a link-similarity metric (the Jaccard coefficient)
JOINT DISTRIBUTION MAPS
DEPENDENCY OF THE WEB'S LINK TOPOLOGY ON CONTENT

• The conditional probability that the link similarity between two web pages is above some threshold λ, given that the two pages have content similarity κ, as a function of κ
• Phase transition around κ*
• For κ > κ*, the probability that two pages are neighbors does not seem to depend on their content similarity; for κ < κ*, the probability decreases according to a power law Pr(λ|κ) ~ κ^γ

MODEL

• At each step t, one new page t is added, and m new links are created from t to m existing pages, each selected from {i : i < t} with a probability that depends on content similarity (the equation is missing from the transcript)
• (m, κ*, and γ are constants and c is a normalization factor)
VALIDATING PRIOR MODELS

Look for a model capable of predicting both the degree
distribution and the similarity distributions among linked
documents
DEGREE-SIMILARITY MIXTURE MODEL

• At each step, one new document is added, and m new links or references are created from it to existing documents
• At time t, the probability that the i-th document is selected and linked from the t-th document interpolates between degree and content similarity (the equation is missing from the transcript); α is a preferential-attachment parameter
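
Since the transcript omits the attachment equation, the sketch below is only one plausible reading (my assumption, not the paper's exact rule): with probability α the target is chosen by degree-preferential attachment, otherwise in proportion to content similarity.

```python
import random

def mixture_pick(degrees, sims, alpha, rng):
    """Toy degree-similarity mixture: with probability alpha pick a
    target by degree-preferential attachment, otherwise pick it in
    proportion to content similarity. The exact functional form is
    given in Menczer (2004), not here."""
    weights = degrees if rng.random() < alpha else sims
    total = sum(weights)
    r = rng.uniform(0, total)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= r:
            return i
    return len(weights) - 1

rng = random.Random(0)
degrees = [5, 3, 1, 1]               # in-degrees of existing documents
sims = [0.1, 0.9, 0.4, 0.2]          # content similarity to the new document
picks = [mixture_pick(degrees, sims, alpha=0.5, rng=rng) for _ in range(10)]
print(picks)
```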
VALIDATE THE MODEL
CONCLUSION

• Page content cannot be neglected when we try to understand the evolution of document networks
• There is a tension between referring to popular documents and referring to related documents

Questions?