The Topics Model for Semantic Representation


Latent Semantic Analysis, Probabilistic Topic Models & Associative Memory
The Psychological Problem

How do we learn semantic structure?
• Covariation between words and the contexts they appear in (e.g., LSA)

How do we represent semantic structure?
• Semantic spaces (e.g., LSA)
• Probabilistic topics
Latent Semantic Analysis
(Landauer & Dumais, 1997)
[Figure: word-document counts → SVD → high-dimensional space, with example words STREAM, RIVER, BANK, MONEY plotted as points]

• Each word is a single point in semantic space
• Similarity is measured by the cosine of the angle between word vectors
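The LSA pipeline above can be sketched in a few lines. This is a toy illustration, not the Landauer & Dumais setup: the word-document counts, the choice of k = 2, and the vocabulary are all invented for the example.

```python
import numpy as np

# Toy word-document count matrix (rows = words, columns = documents).
# Documents 0-1 are about finance, documents 2-3 about rivers.
words = ["money", "loan", "bank", "river", "stream"]
counts = np.array([
    [5, 4, 0, 0],   # money
    [3, 4, 0, 0],   # loan
    [4, 3, 3, 4],   # bank
    [0, 0, 5, 3],   # river
    [0, 0, 4, 4],   # stream
], dtype=float)

# SVD; keep the top k singular dimensions as the "semantic space".
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]          # each word is a point in k-dim space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i = dict(zip(words, range(len(words))))
print(cosine(word_vecs[i["money"]], word_vecs[i["loan"]]))   # high
print(cosine(word_vecs[i["money"]], word_vecs[i["river"]]))  # low
```

Words that co-occur in the same documents end up near each other in the reduced space even when the raw count vectors barely overlap.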
Critical Assumptions of Semantic Spaces
(e.g. LSA)

Psychological distance should obey three axioms:

• Minimality: d(a, b) ≥ d(a, a) = d(b, b) = 0
• Symmetry: d(a, b) = d(b, a)
• Triangle inequality: d(a, b) + d(b, c) ≥ d(a, c)
For conceptual relations, violations of these distance axioms are often found.

Similarities can often be asymmetric:
• “North Korea” is more similar to “China” than vice versa
• “Pomegranate” is more similar to “Apple” than vice versa

Violations of triangle inequality:
AC
AB
BC
Euclidian distance:
AC  AB + BC
Triangle Inequality in Semantic Spaces might
not always hold
THEATER
w1
w2
PLAY
Euclidian distance:
AC  AB + BC
Cosine similarity:
cos(w1,w3) ≥ cos(w1,w2)cos(w2,w3) – sin(w1,w2)sin(w2,w3)
w3
SOCCER
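The cosine constraint at work here is just the triangle inequality for angles: the angle between w1 and w3 can be at most the sum of the w1-w2 and w2-w3 angles. So if PLAY is close to both THEATER and SOCCER, a spatial model forces THEATER and SOCCER to be somewhat similar. A quick numerical check on random vectors (the dimensionality and vector count are arbitrary):

```python
import numpy as np

def angle(a, b):
    """Angle in radians between two vectors."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
for _ in range(1000):
    w1, w2, w3 = rng.normal(size=(3, 10))   # random 10-d "word vectors"
    t12, t23, t13 = angle(w1, w2), angle(w2, w3), angle(w1, w3)
    # Angular triangle inequality: theta(w1,w3) <= theta(w1,w2) + theta(w2,w3)
    assert t13 <= t12 + t23 + 1e-9
    # Equivalent cosine form from the slide (valid when the angle sum is <= pi),
    # since cos(t12 + t23) = cos(t12)cos(t23) - sin(t12)sin(t23):
    if t12 + t23 <= np.pi:
        assert np.cos(t13) >= np.cos(t12) * np.cos(t23) - np.sin(t12) * np.sin(t23) - 1e-9
```

No counterexample exists in any vector space, which is exactly why human similarity judgments that violate the bound are a problem for spatial representations.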
Nearest neighbor problem
(Tversky & Hutchinson, 1986)
• In similarity data, “Fruit” is the nearest neighbor of 18 out of 20 fruit words
• In a 2D solution, “Fruit” can be the nearest neighbor of at most 5 items
• High-dimensional solutions might solve this, but these are less appealing
Probabilistic Topic Models

• A probabilistic version of LSA: no spatial constraints
• Originated in statistics & machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
• Extracts topics from large collections of text
• Topics are interpretable, unlike the arbitrary dimensions of LSA
Model is Generative

[Diagram: Topic Model ↔ DATA (corpus of text: word counts for each document); fitting means finding parameters that “reconstruct” the data]
Probabilistic Topic Models

• Each document is a probability distribution over topics (distribution over topics = gist)
• Each topic is a probability distribution over words

Document generation as a probabilistic process:
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic

[Diagram: topics mixture → topic → word, repeated for each word slot]
Example

DOCUMENT 1 (mixture weights: .8 topic 1, .2 topic 2):
money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

DOCUMENT 2 (mixture weights: .3 topic 1, .7 topic 2):
river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

TOPIC 1 and TOPIC 2 are the mixture components; the per-document topic proportions are the mixture weights.
Bayesian approach: use priors
Mixture weights ~ Dirichlet(α)
Mixture components ~ Dirichlet(β)
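The three-step generative process, with a Dirichlet prior on the mixture weights, can be sketched as follows. The two topics echo the money/river example, but every number here is illustrative, not taken from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["money", "loan", "bank", "river", "stream"]
# Two hypothetical topics (mixture components): P(word | topic).
phi = np.array([
    [0.35, 0.20, 0.40, 0.03, 0.02],   # topic 1: a "finance" topic
    [0.02, 0.03, 0.30, 0.35, 0.30],   # topic 2: a "river" topic
])
alpha = 1.0   # symmetric Dirichlet prior on the per-document topic mixture

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * len(phi))    # 1. choose a mixture of topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(phi), p=theta)        # 2. sample a topic for the slot
        w = rng.choice(len(vocab), p=phi[z])     # 3. sample a word from the topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(16)
print(np.round(theta, 2), doc)
```

A document drawn with theta ≈ (.8, .2) will look like DOCUMENT 1 above: mostly money/loan/bank, with the occasional river/stream.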
Inverting (“fitting”) the model

DOCUMENT 1: money? bank? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? money? stream? bank? money? bank? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? bank? money? stream?

DOCUMENT 2: river? stream? bank? stream? bank? money? loan? river? stream? loan? bank? river? bank? bank? stream? river? loan? bank? stream? bank? money? loan? river? stream? bank? stream? bank? money? river? stream? loan? bank? river? bank? money? bank? stream? river? bank? stream? bank? money?

Here the topic assignments (?), the mixture components (topics), and the per-document mixture weights are all unknown and must be inferred from the observed words.
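Inversion is typically done by approximate inference. A minimal collapsed Gibbs sampler, in the style of Griffiths & Steyvers (2004), might look like the sketch below; the hyperparameters, iteration count, and toy corpus are all illustrative choices, not values from the talk.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_words, alpha=0.5, beta=0.01, n_iters=200, seed=0):
    """docs: list of lists of word ids. Returns topic assignments and topic-word counts."""
    rng = np.random.default_rng(seed)
    ndz = np.zeros((len(docs), n_topics))   # document-topic counts
    nzw = np.zeros((n_topics, n_words))     # topic-word counts
    nz = np.zeros(n_topics)                 # topic totals
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):          # initialize counts from random assignments
        for i, w in enumerate(doc):
            t = z[d][i]
            ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove this token's current assignment
                ndz[d, t] -= 1; nzw[t, w] -= 1; nz[t] -= 1
                # P(z = t | all other assignments) ∝ (n_dt + alpha)(n_tw + beta)/(n_t + W·beta)
                p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                 # record the new assignment
                ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1
    return z, nzw

# Toy corpus over vocabulary [money, loan, bank, river, stream] (word ids 0-4):
docs = [[0, 1, 2, 0, 2, 1], [3, 4, 2, 3, 2, 4], [0, 2, 1, 0], [4, 3, 2, 4]]
z, nzw = gibbs_lda(docs, n_topics=2, n_words=5)
print(nzw)   # one topic should concentrate on words 0-2, the other on words 2-4
```

Each sweep resamples every token's topic conditioned on all the others; the count matrices play the role of the unknown mixture components and weights.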
Application to corpus data

• TASA corpus: text from first grade to college, a representative sample of text
• 26,000+ word types (stop words removed)
• 37,000+ documents
• 6,000,000+ word tokens

Example: topics from an educational corpus (TASA)
• 37K docs, 26K words
• 1700 topics, e.g.:
Topic (printing): PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS

Topic (theater): PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED

Topic (sports): TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT

Topic (law): JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL

Topic (science): HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION

Topic (school): STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Polysemy

The same word can have high probability under several topics. In the six example topics above: PLAY appears in both the theater and sports topics; CHARACTERS in both the printing and theater topics; COURT in both the sports and law topics; TEST and TRY each appear in two topics as well.
Three documents with the word “play”
(numbers & colors indicate topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...

He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...

Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166...
No Problem of Triangle Inequality

[Diagram: FIELD has high probability in both TOPIC 1 (with SOCCER) and TOPIC 2 (with MAGNETIC), while SOCCER and MAGNETIC share no topic]

Topic structure easily explains violations of the triangle inequality: SOCCER-FIELD and FIELD-MAGNETIC are each similar within a topic, yet SOCCER and MAGNETIC are not similar.
Applications

Enron email data:
• 500,000 emails
• 5,000 authors
• 1999-2002
Enron topics (labels added for readability):

Topic (sports): TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES

Topic (religion): GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT

Topic (environment): ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE

Topic (regulation): FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED

Topic (power markets): POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC STATE

Topic (California crisis): PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
[Figure: topic usage over a 2000-2003 timeline for two anonymized authors, PERSON1 and PERSON2; May 22, 2000 marks the start of the California energy crisis]
Applying the Model to Psychological Data

Network of Word Associations
[Figure: association network linking BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER]
(Association norms by Nelson et al., 1998)
Explaining structure with topics
[Figure: topic 1 covers BAT, BASEBALL, BALL, GAME, PLAY; topic 2 covers PLAY, STAGE, THEATER]
Modeling Word Association

• Word association modeled as prediction: given that a single word is observed, what other words might occur in the future?
• Under a single-topic assumption:

P(wₙ₊₁ | w) = Σ_z P(wₙ₊₁ | z) P(z | w)

where w is the observed cue and wₙ₊₁ the response.
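Given the topic-word distributions, this prediction is a two-step computation: infer P(z | w) by Bayes' rule, then sum over topics. A sketch using the same toy two-topic φ as earlier (all probabilities are invented for illustration; the actual model used topics fit to TASA):

```python
import numpy as np

vocab = ["money", "loan", "bank", "river", "stream"]
phi = np.array([                      # P(word | topic), illustrative values
    [0.35, 0.20, 0.40, 0.03, 0.02],   # "finance" topic
    [0.02, 0.03, 0.30, 0.35, 0.30],   # "river" topic
])
pz = np.array([0.5, 0.5])             # prior P(z), assumed uniform here

def predict_associates(cue):
    w = vocab.index(cue)
    pz_given_w = phi[:, w] * pz       # Bayes: P(z | cue) ∝ P(cue | z) P(z)
    pz_given_w /= pz_given_w.sum()
    # Single-topic assumption: P(response | cue) = sum_z P(response | z) P(z | cue)
    return pz_given_w @ phi

pred = predict_associates("bank")
for word, p in sorted(zip(vocab, pred), key=lambda t: -t[1]):
    print(f"{word:8s} {p:.3f}")
```

Because "bank" is ambiguous between the two topics, its predicted associates mix finance words and river words, which is exactly the behavior the asymmetric association data call for.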
Observed associates for the cue “play”

HUMANS (word, P(word)):
FUN .141; BALL .134; GAME .074; WORK .067; GROUND .060; MATE .027; CHILD .020; ENJOY .020; WIN .020; ACTOR .013; FIGHT .013; HORSE .013; KID .013; MUSIC .013

TOPICS (T=500) (word, P(word)):
BALL .041; GAME .039; CHILDREN .019; ROLE .014; GAMES .014; MUSIC .009; BASEBALL .009; HIT .008; FUN .008; TEAM .008; IMPORTANT .006; BAT .006; RUN .006; STAGE .005

LSA column (truncated in source): KICKB…, VOLLE…, GAM…, COSTU…, DRA…, RO…, PLAYW…, FU…, ACT…, REHEA…, GAM…, ACT…, CHEC…, MOLI…
Model predictions

(Same associates table as above; a “RANK 9” annotation in the source marks the rank at which the first human associate, FUN, appears in a model’s list.)
[Figure: median rank of the first associate (lower is better, y-axis 0-40), comparing best LSA cosine, best LSA inner product, and topic models with 300, 500, 700, 900, 1100, 1300, 1500, and 1700 topics]
Recall: example study list

STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy

FALSE RECALL: “Sleep” 61%
Recall as a reconstructive process

• Reconstruct the study list based on the stored “gist”
• The gist can be represented by a distribution over topics
• Under a single-topic assumption:

P(wₙ₊₁ | w) = Σ_z P(wₙ₊₁ | z) P(z | w)

where w is the study list and wₙ₊₁ the retrieved word.
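The gist of a whole list can be inferred the same way as for a single cue: condition the topic distribution on all the studied words, then predict. A sketch reusing the toy two-topic φ (all numbers hypothetical; the real model conditioned on the fifteen-word sleep list under topics fit to TASA):

```python
import numpy as np

vocab = ["money", "loan", "bank", "river", "stream"]
phi = np.array([                      # P(word | topic), illustrative values
    [0.35, 0.20, 0.40, 0.03, 0.02],   # "finance" topic
    [0.02, 0.03, 0.30, 0.35, 0.30],   # "river" topic
])
pz = np.array([0.5, 0.5])             # prior P(z)

def gist(study_list):
    # P(z | list) ∝ P(z) · prod_w P(w | z), assuming one topic generated the list.
    logp = np.log(pz) + sum(np.log(phi[:, vocab.index(w)]) for w in study_list)
    p = np.exp(logp - logp.max())     # subtract max for numerical stability
    return p / p.sum()

g = gist(["river", "stream", "bank"])   # gist concentrates on the river topic
recall_pred = g @ phi                   # P(word | gist): high for studied words AND
print(recall_pred)                      # for related unstudied words (false recall)
```

Reconstruction from the gist is what produces false recall: an unstudied word like "river" here, or "Sleep" in the experiment, gets high probability simply because it is central to the dominant topic.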
Predictions for the “Sleep” list

[Figure: predicted P(wₙ₊₁ | w), x-axis 0-0.2. Study list words: BED, REST, TIRED, AWAKE, WAKE, NAP, DREAM, YAWN, DROWSY, BLANKET, SNORE, SLUMBER, PEACE, DOZE. Top 8 extralist words: SLEEP, NIGHT, ASLEEP, MORNING, HOURS, SLEEPY, EYES, AWAKENED, with SLEEP the strongest extralist prediction]