Concept and Theme Discovery through Probabilistic Models


Concept and Theme Discovery
through Probabilistic Models
and Clustering
Qiaozhu Mei
Oct. 12, 2005
Concepts and Themes

Language units in biology literature mining:





Terms
Phrases
Entities
Concepts (tight groups of terms/entities
representing semantics: e.g. Gene Synonyms)
Themes (loose groups of terms representing
topic/subtopics)
Theme Discovery

What we’ve got now:

A Generative Model to extract k themes from a collection
$$P(w : d) = \lambda_b \, P(w \mid B) + (1 - \lambda_b) \sum_{i=1}^{k} \pi_{d,i} \, P(w \mid \theta_i)$$



Each theme as a language model, represented by top
probability words in a theme language model
KL Divergence to model the distance/similarity between
themes;
retrieve most similar themes to a term group
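A minimal sketch of how such a mixture model could be fitted with EM, followed by a KL-divergence helper for comparing two theme language models. This is an illustration under stated assumptions, not the implementation behind the slides: the function names, the fixed background weight of 0.9, the random initialisation, and the smoothing constant `eps` are all hypothetical choices.

```python
import math
import random
from collections import Counter

def fit_themes(docs, k, lambda_b=0.9, iters=50, seed=0):
    """EM for a background + k-theme mixture; docs is a list of token lists."""
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    # Background model P(w|B): relative frequency over the whole collection.
    background = {w: sum(c[w] for c in counts) / total for w in vocab}
    # Random initial theme models, each normalised to sum to 1.
    themes = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        norm = sum(raw.values())
        themes.append({w: v / norm for w, v in raw.items()})
    # Uniform initial per-document theme weights.
    weights = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        theme_counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        weight_counts = [[0.0] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                mix = [weights[d][i] * themes[i][w] for i in range(k)]
                mix_sum = sum(mix)
                if mix_sum == 0.0:
                    continue  # word is fully explained by the background
                # E-step: probability that w came from some theme, not the background.
                p_theme = (1 - lambda_b) * mix_sum / (
                    lambda_b * background[w] + (1 - lambda_b) * mix_sum)
                for i in range(k):
                    expected = n * p_theme * mix[i] / mix_sum
                    theme_counts[i][w] += expected
                    weight_counts[d][i] += expected
        # M-step: renormalise the expected counts.
        for i in range(k):
            norm = sum(theme_counts[i].values())
            if norm > 0:
                themes[i] = {w: v / norm for w, v in theme_counts[i].items()}
        for d in range(len(docs)):
            norm = sum(weight_counts[d])
            if norm > 0:
                weights[d] = [v / norm for v in weight_counts[d]]
    return themes, weights

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two word distributions; eps avoids log(0)."""
    return sum(pv * math.log((pv + eps) / (q.get(w, 0.0) + eps))
               for w, pv in p.items() if pv > 0)
```

Sorting a returned theme dictionary by probability gives the top words used to display a theme; `kl_divergence(themes[i], themes[j])` gives one (asymmetric) distance between two themes.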
Theme Discovery (cont.)

What we’ve got now (cont.):



Use HMM to segment the whole collection with the
extracted themes
Use MMR to find the most representative and least redundant
phrases to represent a theme (currently using n-gram probability
as relevance and edit distance as similarity; performance to be tuned)
Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
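A small sketch of MMR phrase selection with an edit-distance similarity, as one way to read the bullet above. The relevance scores stand in for n-gram probabilities, and the `trade_off` parameter, the normalisation of edit distance to [0, 1], and all function names are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Map edit distance to [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def mmr_select(scored_phrases, n, trade_off=0.7):
    """Greedy MMR over a dict mapping phrase -> relevance score in [0, 1]."""
    selected = []
    candidates = dict(scored_phrases)
    while candidates and len(selected) < n:
        def mmr(phrase):
            # Redundancy = similarity to the closest phrase already chosen.
            redundancy = max((similarity(phrase, s) for s in selected), default=0.0)
            return trade_off * candidates[phrase] - (1 - trade_off) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        del candidates[best]
    return selected

# Hypothetical relevance scores; the near-duplicate phrase is penalised.
phrases = {"juvenile hormone": 0.9, "juvenile hormones": 0.85, "queen pheromone": 0.6}
picked = mmr_select(phrases, n=2, trade_off=0.5)
```

With these made-up scores the second pick is "queen pheromone" rather than the higher-scoring near-duplicate "juvenile hormones", which is the behaviour MMR is meant to give. How relevance and similarity are best defined is exactly the tuning question raised on the slide.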
Some justifications

Fly collection:










Cluster 0: circadian
Cluster 1: adh, evolution
Cluster 2: a mixture of two topics, apoptosis and promoters
Cluster 6: brain development
Cluster 8: cell division
Cluster 12: drosophila immunity
Cluster 13: nervous systems
Cluster 14: hedgehog segment Polarity gene
Cluster 16: Histone, Polycomb
Cluster 17: visual system
Theme Discovery (cont.)

Problems:


How to select k? (how many themes do we believe are
there in the collection: bee collection should have smaller k
than fly collection)
Can we find themes in a hierarchical manner?
This can solve the former problem… however, when to cut off?
How to represent a theme?
Top words: sometimes difficult to tell the semantics
Phrases?
Sentences?
Other possible approaches to extract themes? (LDA,
clustering methods)
Hierarchical Theme Discovery

A straightforward approach (top-down splitting):

Discover k themes from the initial collection
Segment the collection by the k themes
For each theme, build a sub-collection from the segments of the previous step
For each sub-collection, extract k’ themes
Repeat these steps iteratively
Problem: when to stop splitting?
[Diagram: Collection splits into Theme1, Theme2, Theme3; Theme2 splits further into Theme2.1, Theme2.2, Theme2.3, ……]
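The top-down splitting loop can be sketched as a short recursion. The theme extractor and the segmenter are passed in as functions, because the slides use a mixture model and an HMM for those steps; `toy_discover` and `toy_assign` below are hypothetical stand-ins so the sketch runs on its own. The stopping rule (a depth limit plus a minimum sub-collection size) is also an assumption, since when to stop is the open problem noted above.

```python
from collections import Counter

def split_top_down(segments, discover, assign, k, max_depth, min_size):
    """Recursively split a collection of text segments into a theme tree."""
    node = {"size": len(segments), "children": []}
    # Assumed stopping rule: depth limit or too few segments to split.
    if max_depth == 0 or len(segments) < min_size:
        return node
    themes = discover(segments, k)
    if not themes:
        return node
    # Build one sub-collection per theme from the segments assigned to it.
    buckets = [[] for _ in themes]
    for seg in segments:
        buckets[assign(seg, themes)].append(seg)
    for theme, bucket in zip(themes, buckets):
        child = split_top_down(bucket, discover, assign, k, max_depth - 1, min_size)
        child["theme"] = theme
        node["children"].append(child)
    return node

def toy_discover(segments, k):
    """Hypothetical stand-in: the k most frequent words act as 'themes'."""
    counts = Counter(w for seg in segments for w in seg)
    return [w for w, _ in counts.most_common(k)]

def toy_assign(segment, themes):
    """Hypothetical stand-in: index of the theme word occurring most often."""
    return max(range(len(themes)), key=lambda i: segment.count(themes[i]))

segments = [["queen", "egg"], ["queen", "worker"], ["mite", "brood"], ["mite", "varroa"]]
tree = split_top_down(segments, toy_discover, toy_assign, k=2, max_depth=2, min_size=2)
```

Passing a different k at deeper levels (the k’ of the slide) would only require threading a per-level value through the recursion.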
Hierarchical Theme Discovery (results)
A bee collection with 929 documents
Level 1: 5 themes
Level 2: 3 sub-themes for each higher-level theme
Hierarchical Theme Discovery (results)
Level 1 themes (top 20 words each):
Theme 1: african, jelly, royal, european, venom, population, africanized, sting, kda, feral, m, reward, subspecies, proteins, patients, discrimination, naja, cue, characters, areas
Theme 2: queen, workers, worker, signal, jh, vibration, pheromone, gland, eggs, signals, hormone, juvenile, anarchistic, queens, egg, iridaceae, policing, ixia, behavioral, age
Theme 3: pollinator, plants, pollination, flowers, plantae, spermatophyta, angiospermae, dicotyledones, pollen, seed, fruit, angiosperms, spermatophytes, vascular, dicots, crop, plant, flower, pollinators, species
Theme 4: learning, brain, conditioning, olfactory, neural, neurons, mushroom, memory, sucrose, nervous, coordination, dopamine, extension, antennal, odor, system, proboscis, bodies, lobe, kenyon
Theme 5: varroa, mite, mites, jacobsoni, acarina, brood, parasite, colonies, host, control, chelicerata, chelicerates, hygienic, viruses, infestation, destructor, pest, infested, parasitology, mortality
Hierarchical Theme Discovery (results)
Level 2: sub-themes of Theme 1 (african, jelly, royal, european, venom, …), top words each:
Theme 1.1: larvae, microorganisms, gram, bacteria, 0, colonies, royal, queen, jelly, eubacteria, non, workers, queens, production, sperm, nest, italian, 5, fraction, nestmates
Theme 1.2: venom, reward, patients, naja, kda, proteins, wasp, protein, diptera, pla2, vespula, primates, hominidae, chordata, vertebrata, mug, sting, 2, dose, quality
Theme 1.3: african, european, population, populations, patterns, pattern, genetic, discrimination, mitochondrial, studies, information, are, contrast, green, two, bees, have, derived, africa, subspecies
Hierarchical Theme Discovery (results)
Level 2: sub-themes of Theme 2 (queen, workers, worker, signal, jh, …), top words each:
Theme 2.1: mammals, vertebrates, venom, nonhuman, l, ml, models, model, chordates, beeswax, mug, omega, embryo, mammalia, vertebrata, has, chordata, nurse, coloured, vg
Theme 2.2: food, foragers, dance, transfer, enzyme, biosynthesis, receivers, contrast, nectar, flight, source, flow, water, information, rates, ddt, rj, caucasian, visual, green
Theme 2.3: queen, worker, workers, colonies, pollen, vibration, eggs, foraging, development, brood, signal, queens, bees, anarchistic, behavioral, iridaceae, larvae, egg, pheromone, may
Hierarchical Theme Discovery (results)
Level 2: sub-themes of Theme 3 (pollinator, plants, pollination, flowers, plantae, …), top words each:
Theme 3.1: pollen, eep, honeybees, mating, bumblebees, sp, hive, bacteria, scent, mimosa, brazil, undertakers, chromatography, marks, recently, gram, eubacteria, caraway, microorganisms, propolis
Theme 3.2: ecology, is, species, environmental, sciences, flowering, floral, terrestrial, pollinator, visiting, reproduction, plants, c, cashew, self, animalia, food, insects, faba, size
Theme 3.3: seed, per, crop, sunflower, number, cruciferae, fruit, hybrid, agriculture, seeds, quality, cultivar, weight, helianthus, oilseed, compositae, annuus, yield, pollination, set
Hierarchical Theme Discovery (results)
Level 2: sub-themes of Theme 4 (learning, brain, conditioning, olfactory, neural, …), top words each:
Theme 4.1: imidacloprid, current, memory, mushroom, neurons, 1, expressed, 4, cells, antennal, mb, bodies, currents, nervous, brain, mv, kinase, receptors, term, protein
Theme 4.2: bees, sucrose, conditioning, response, learning, extension, proboscis, pollen, foragers, performance, between, thresholds, honeybees, solution, discrimination, strain, rate, foraging, concentration, low
Theme 4.3: dopamine, levels, development, age, binding, pupal, brain, octopamine, division, adult, colonies, labor, glass, treated, colony, ryr, pigmentation, chromosomes, arolium, da
Hierarchical Theme Discovery (results)
Level 2: sub-themes of Theme 5 (varroa, mite, mites, jacobsoni, acarina, …), top words each:
Theme 5.1: pollen, bees, foragers, their, or, ta, heat, at, hygienic, foraging, protein, activity, behaviour, increased, response, blood, flight, strips, metabolic, removal
Theme 5.2: mite, varroa, mites, brood, jacobsoni, acarina, colonies, parasite, for, worker, control, a, drone, formic, population, acid, host, 0, cells, treatment
Theme 5.3: viruses, larvae, microorganisms, virus, bacteria, animal, paenibacillus, infection, molecular, pathogen, eubacteria, gram, forming, endospore, positives, p, apv, entomopathogen
Phrase Representations:
[Figure: the five level-1 themes, each shown as its top-20 word list, with phrase boxes attached. Phrase fragments from the boxes, one box line per entry:]
biochemistry and molecular biophysics / endocrine system chemical coordination and homeostasis / genetics biochemistry / and molecular / molecular / biophysics / sense organs sensory reception / animals / arthropods chordates / insects invertebrates / mammals / system / chemical coordination / and homeostasis / vertebrata / chordata animalia / honey / bee / behavior terrestrial ecology / mammalia vertebrata chordata animalia / hormone / juvenile / queen / mammalia vertebrata / rodentia / chordata animalia / worker / laid eggs / vibration / signal / genetics / biochemistry and / molecular biophysics / dufour s gland / mammals nonhuman mammals / workers / egg / laying / queen / mandibular gland / pheromone / nonhuman / vertebrates / iridaceae / ixia / arthropoda invertebrata animalia muridae / aves vertebrata chordata animalia / mug ml
Hierarchical Theme Discovery (cont.)

A bottom up agglomerative approach:



Find many micro-themes
Group similar micro-themes into larger ones
Borrow strategy from data mining:



BIRCH: incrementally form many micro-clusters,
organized in a tree structure
Macro-clustering based on micro-clusters.
Problem: Again, when to stop?
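A sketch of the macro-clustering step only: micro-themes, represented as word distributions with a size, are merged greedily until the closest pair is too far apart. This is plain agglomerative merging, not BIRCH itself (no CF-tree is built), and the use of Jensen-Shannon divergence as the distance and a fixed `max_distance` threshold as the stopping rule are assumptions for illustration.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (dicts)."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    def kl_to_m(a):
        return sum(v * math.log(v / m[w]) for w, v in a.items() if v > 0)
    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)

def merge(p, size_p, q, size_q):
    """Size-weighted average of two word distributions."""
    words = set(p) | set(q)
    return {w: (size_p * p.get(w, 0.0) + size_q * q.get(w, 0.0)) / (size_p + size_q)
            for w in words}

def agglomerate(micro_themes, max_distance):
    """micro_themes: list of (distribution, size). Merge closest pairs greedily."""
    clusters = list(micro_themes)
    while len(clusters) > 1:
        pairs = [(js_divergence(clusters[i][0], clusters[j][0]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > max_distance:  # assumed stopping rule
            break
        (p, size_p), (q, size_q) = clusters[i], clusters[j]
        merged = (merge(p, size_p, q, size_q), size_p + size_q)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical micro-themes: two near-identical ones and one unrelated.
a = ({"queen": 0.5, "worker": 0.5}, 1)
b = ({"queen": 0.6, "worker": 0.4}, 1)
c = ({"mite": 0.5, "varroa": 0.5}, 1)
macro = agglomerate([a, b, c], max_distance=0.1)
```

Recording the distance at each merge, instead of stopping at a threshold, would give a full dendrogram and defer the "when to stop" decision.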
Hierarchical Theme Discovery (cont.)

Model-based approach:



Hofmann, IJCAI 99.
Assume we know the collection is generated from
a hierarchical structure, use a generative model to
learn the themes. (e.g. make use of GO
hierarchies)
Problem: in most cases we don’t know the
hierarchies.
Other Research Problems

Represent a theme:





Using top words: where to cut off?
Using phrases: MMR has to be tuned (many
possible strategies and parameters)
Using sentences? (like summarization)
Themes are interesting… but how to make
use of the themes?
How to evaluate themes?
Concept Extraction

What we have now:




N-gram algorithm (actually 2-gram): iteratively group the pair
of terms that are most likely to be replaceable,
considering the context of one term before/after each.
Time complexity: O(N^3); space complexity: currently O(N^2).
Beespace server can deal with <= 9000 terms now (2.4 GB of
memory). (Performance not evaluated, because only small data
sizes are feasible.)
Problem: based on mutual information, which prefers 2-grams
with low frequency. Doesn’t make use of farther context.
Will removing stop words help or hurt
performance?
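One way to read the 2-gram merging procedure is as a greedy, mutual-information-driven merge in the style of Brown clustering: at each iteration, merge the two term classes whose merge keeps the average mutual information between adjacent classes highest. The sketch below is that reading, not the code running on the server; it recomputes the score for every candidate pair, so it is far slower than the complexity quoted above and only suitable for toy input.

```python
import math
from collections import Counter
from itertools import combinations

def class_mutual_information(bigrams, cls):
    """Average mutual information between the classes of adjacent tokens."""
    pair = Counter()
    for (a, b), n in bigrams.items():
        pair[(cls[a], cls[b])] += n
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (ca, cb), n in pair.items():
        left[ca] += n
        right[cb] += n
    return sum(n / total * math.log(n * total / (left[ca] * right[cb]))
               for (ca, cb), n in pair.items())

def merge_terms(tokens, n_merges):
    """Greedily merge the two term classes whose merge keeps MI highest."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    cls = {w: w for w in set(tokens)}  # every term starts in its own class
    for _ in range(n_merges):
        labels = sorted(set(cls.values()))
        if len(labels) < 2:
            break
        best = None
        for a, b in combinations(labels, 2):
            # Candidate: fold class b into class a and rescore.
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            score = class_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        cls = best[1]
    groups = {}
    for w, c in cls.items():
        groups.setdefault(c, []).append(w)
    return sorted(sorted(g) for g in groups.values())

tokens = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
groups = merge_terms(tokens, n_merges=2)
```

In this toy corpus several pairs (for example "cat"/"dog" and "sat"/"ran") have identical one-word contexts, so they cost no mutual information to merge; which of the tied pairs is merged first depends on floating-point details.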
Some findings:


A small dataset: (200+ abstracts containing gene
synonyms)
Only 600 iterations (merge 600 times)

Most of them are reasonable, but not really useful

E.g. head-to-head tail-to-tail
E.g. within-locus between-locus




FBgn0000017: Dsrc Dabl
FBgn0000078: amylase-null AMY-null
Problem: doc-set too small, n-gram too sparse to
find useful concepts.
Concept Extraction (cont.)

Other Possible strategy:



Lin et al., KDD 02: Use a feature vector to represent each
term; the weights are the mutual information
between the term and each context feature. This is more
flexible than n-grams. (If only 2-grams are used as
context features, this is similar to what we
have.)
Use a committee to represent a cluster, which
ensures the clusters are tight and robust.
Problem: not sure how to select features
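A minimal sketch of the feature-vector idea: each term gets a vector of context features weighted by pointwise mutual information, and terms are compared by cosine similarity. The feature definition (neighbouring word plus its side, within a window of one), the restriction to positive weights, and the function names are assumptions; the committee-based clustering step is not shown.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=1):
    """Map each term to {context feature: positive PMI weight}."""
    pair, term_n, feat_n = Counter(), Counter(), Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Feature = neighbouring word plus its side (left or right).
                    feature = ("L" if j < i else "R", sent[j])
                    pair[(w, feature)] += 1
                    term_n[w] += 1
                    feat_n[feature] += 1
    total = sum(pair.values())
    vectors = defaultdict(dict)
    for (w, feature), n in pair.items():
        pmi = math.log(n * total / (term_n[w] * feat_n[feature]))
        if pmi > 0:
            vectors[w][feature] = pmi
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

sentences = [["the", "queen", "lays", "eggs"],
             ["the", "worker", "lays", "eggs"],
             ["varroa", "mite", "infests", "brood"]]
vecs = context_vectors(sentences)
```

Here "queen" and "worker" share both context features and score close to 1, while "queen" and "mite" share none and score 0. Widening the window or adding syntactic features changes only how `feature` is built, which is where the feature-selection question above comes in.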
Summary

Theme Extraction:





Generally performs well, if we can find a good k.
Hierarchical clustering can solve this problem, but we still
need to find a reasonable stopping criterion.
Representation is an interesting problem: MMR phrase
extraction should be further tuned
Difficult to evaluate other than by expert judgment
Concept extraction:


N-gram has space constraints: haven’t really tested the
performance… Generally, the performance should be
better on large data sets
Other clustering algorithms can be explored.