
The Wortschatz Project
Language-independent Methods for
Enriching Corpora
Chris Biemann
NLP Department, University of Leipzig
[email protected]
Universitetet i Oslo, 11/10/2005
Project Objectives
• Effective processing of large amounts of plain text data
• Availability for virtually any language
- ASCII and UTF-8 encoding
• Applying language-independent methods to enrich the data
• Providing corpora of standard sizes for as many languages as possible via our website and web services
• Project Leader since 1993: Uwe Quasthoff
http://www.wortschatz.uni-leipzig.de
2
http://corpora.informatik.uni-leipzig.de
3
Sources of Texts
• free online newspapers
• literature not protected by copyright
• website content
Ensuring Monolinguality:
• sentence-based language detector
- rejection rate ~5%
- false prediction rate ~0.0x %
Avoiding Copyright problems:
• the sentence is the largest unit in the database
• documents cannot be restored
4
Standard Processing Steps
• Preprocessing
• Language Verification
• Indexing words and multiterms
• Co-occurrences
Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr. (2004): Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea and Springer LNCS 2945, pp. 215-228, Springer Verlag Berlin Heidelberg
5
The Wortschatz Engine
[Architecture diagram: text input (HTML, ASCII, XML, ...) feeds a storage layer (text database, indexing); statistical and linguistic analysis produce the general linguistic database (GLDB) and domain-specific linguistic databases (DLDB); applications include classification and visualisation.]
6
Automatic Dictionary Entries
7
... augmented with Manual Resources
8
Semantic neighbourhoods
9
Poisson Distribution for
Measuring Associations
We can calculate the probability of multiple joint occurrences of independent events as follows:
Given two independent events to be observed in an experiment with probabilities $p_a$ and $p_b$, respectively, the probability of their joint occurrence is $p_a p_b$.
Next we repeat the experiment $n$ times and are interested in $k$ joint occurrences. Using $\lambda = n p_a p_b$ we get the probability
$$p_k = \frac{\lambda^k}{k!} e^{-\lambda}.$$
For at least $k$ joint occurrences we get the probability
$$\sum_{l \ge k} \frac{\lambda^l}{l!} e^{-\lambda}.$$
To measure the surprise for the joint occurrence of non-independent events, we just calculate the probability as if they were independent. Next we are surprised to see such a rare event.
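The tail probability above can be computed directly; a minimal sketch (not from the slides), assuming word frequencies freq_a, freq_b and a corpus of n sentences, using SciPy's Poisson survival function:

```python
from scipy.stats import poisson

def joint_occurrence_probability(freq_a, freq_b, k, n):
    """P(at least k joint occurrences) under the independence assumption."""
    lam = freq_a * freq_b / n          # lambda = n * (freq_a/n) * (freq_b/n)
    # poisson.sf(k - 1, lam) = P(X >= k) for a Poisson-distributed X
    return poisson.sf(k - 1, lam)
```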
10
Significance Measure for Co-occurrences
The co-occurrence measure of the two words A and B is defined as the negative logarithm of the above probability, divided by $\log n$. For $\lambda = ab/n$ (where $a$ and $b$ are the frequencies of A and B) we get

$$\mathrm{sig}(A,B) = \frac{-\log\left(1 - \sum_{i=0}^{k-1} \frac{\lambda^i}{i!} e^{-\lambda}\right)}{\log n}.$$

Approximations: If $(k+1)/\lambda > 10$ (this is typically true) and, moreover, $k > 10$, we get

$$\mathrm{sig}(A,B) \approx \frac{\lambda - k\log\lambda + \log k!}{\log n}$$

and

$$\mathrm{sig}(A,B) \approx \frac{k\,(\log k - \log\lambda - 1)}{\log n}.$$
Quasthoff, U.; Wolff, Chr. (2002): "The Poisson Collocation Measure and its Applications". In: Proc. Second International Workshop on Computational Approaches to Collocations, Vienna, July 2002
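A minimal sketch of the approximated significance (my own illustration, not the project's implementation); it uses the first approximation above, with log k! computed via the log-gamma function:

```python
import math

def sig(freq_a, freq_b, k, n):
    """Approximate Poisson co-occurrence significance of two words.

    freq_a, freq_b: word frequencies, k: joint frequency, n: number of sentences.
    Valid when (k + 1) / lambda > 10 and k > 10.
    """
    lam = freq_a * freq_b / n
    # log k! via lgamma(k + 1) avoids overflow for large k
    return (lam - k * math.log(lam) + math.lgamma(k + 1)) / math.log(n)
```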
11
Norwegian Politics
significant left neighbours of
Stoltenberg:
Jens (1091), Thorvald (137),
Regjeringen (64), statsminister (52),
Fredrikke (26), regjeringen (24),
regjeringa (24), Holmboe (24), Robert
(22), Mathias (20), Nini (17), Vincent
(15), ] (7), sier (6)
significant right neighbours of
Stoltenberg:
har (25), og (15), , (13), snakker (12),
mener (11), forsvarer (11),
understreket (10), skriver (10), sa (9),
skrev (8), g (8), vil (7), støtter (7), . (7),
må (6), forteller (6), holder (5)
significant left neighbours of Bondevik:
Magne (1553), statsminister (167),
regjeringen (165), Odd (137), Statsminister
(128), Regjeringen (69), sier (37), Margreta
(37), Kjell-Magne (31), sa (19), at (18),
Biskop (18), Regjeringa (17), regjeringa (15),
Stein (12), Jarle (9), Dagsavisen (9),
Marianne (8), Hilde (8), ( (8), ] (6), Kanskje
(5)
significant right neighbours of Bondevik:
II (60), har (44), II-regjeringen (34), og (26),
Spre (24), vil (20), delte (19), , (19), sier (15),
mener (15), ( (15), uttalte (14), overtok (13),
orienterte (13), grep (13), Foreleser (13), &
(13), Statsminister (12), senker (11),
oppnevnte (11), gikk (11), varsler (10), dreper
(10), undertegnet (9), Midt (9), var (7), tok (6),
skrev (6), kjem (6), foreslår (6), eller (6), . (6),
stiller (5), måtte (5), føler (5)
12
Visualisation of Co-occurrences
Cluster ambiguous words
Simulated annealing to visualise significant co-occurrences
13
Word Sense Discrimination
• 3 (or more) words are either from one
topic, or do not fit together
14
WSD algorithm
1. Input word w
2. Take the first 15 co-occurrences of w, ordered by co-occurrence significance
3. Generate all combinations of w and two of the 15 co-occurrences (105 different triples of words)
4. Retrieve intersections of co-occurrences
5. Cluster the intersections
6. If there are still co-occurrences above the threshold left, go to step 2.
7. Else stop.
Bordag, S. (2003): Sentence Co-occurrences as Small-World Graphs: A Solution to Automatic Lexical Disambiguation. In: A. Gelbukh (Ed.): CICLing 2003, LNCS 2588, pp. 329-332, Springer-Verlag Berlin Heidelberg
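A heavily simplified sketch of these steps (my own illustration, not Bordag's implementation); cooc maps a word to its co-occurrences ordered by significance, the clustering of step 5 is replaced by a greedy overlap merge, and the iteration of step 6 is omitted:

```python
from itertools import combinations

def discriminate_senses(word, cooc, top_n=15, min_shared=3):
    """Group the shared co-occurrences of triples (word, a, b) into sense clusters."""
    neighbours = cooc.get(word, [])[:top_n]
    clusters = []
    for a, b in combinations(neighbours, 2):        # 105 pairs for 15 neighbours
        shared = set(cooc.get(a, [])) & set(cooc.get(b, [])) & set(cooc.get(word, []))
        if len(shared) < min_shared:                # the triple does not form a topic
            continue
        for cluster in clusters:                    # merge with an overlapping cluster
            if cluster & shared:
                cluster |= shared
                break
        else:
            clusters.append(set(shared))
    return clusters
```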
15
Example output
16
Small World Graphs
• SW graphs as an optimum between two extremes:

  Graph         path length   clustering coefficient
  regular       long          high
  random        short         low
  small-world   short         high

• Clustering coefficient: everywhere in the graph there are local clusters which scale, very much like a fractal
• Short path lengths explain the small-world effect in conversation ("Oh, you know him too!") and why natural searching works so well.
D.J. Watts and S.H. Strogatz (1998): Collective dynamics of 'small-world' networks. Nature (393), pp. 440-442.
A.L. Barabási et al. (2000): Scale-free characteristics of random networks: the topology of the World-Wide Web. Physica A (281), pp. 70-77.
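A small illustrative check of the table above (my own sketch; the graph sizes and the rewiring probability are arbitrary), using networkx:

```python
import networkx as nx

n, k, rewire_p = 1000, 10, 0.1
graphs = {
    "regular":     nx.watts_strogatz_graph(n, k, p=0.0),    # ring lattice, no rewiring
    "small-world": nx.watts_strogatz_graph(n, k, p=rewire_p),
    "random":      nx.gnp_random_graph(n, k / n),
}
for name, g in graphs.items():
    if not nx.is_connected(g):                               # path length needs one component
        g = g.subgraph(max(nx.connected_components(g), key=len))
    print(name,
          "path length:", round(nx.average_shortest_path_length(g), 2),
          "clustering:", round(nx.average_clustering(g), 3))
```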
17
Small World Graphs (ctd)
• SW graphs with the same properties, and even almost the same numbers, can be found in a variety of different applications:
– power line network
– neural net structure of living creatures
– language, e.g. co-occurrence graphs
– world wide web
M. Steyvers, J. B. Tenenbaum (2002): The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 2002
18
Word of the day
• Task: extraction of important keywords from newspapers on a daily basis
• Method: compile a daily corpus and compare it to a large reference corpus
• Needed: 5-10 downloadable online newspapers in the target language. Alternatively, use allTheWeb.com with enhanced options
• Measures used: relative frequency ratio in combination with statistics of surprise
• Classification of keywords into categories: 5 minutes of daily workload
Richter, M. (2005): Analysis and Visualisation for Daily Newspaper Corpora, Proceedings of RANLP-05, Borovets, Bulgaria
19
Frequency bands
• absolute frequency of today
– minimum frequency needed because of Zipf's law
– maximum frequency needed for eliminating stop words
– currently we use a lower threshold of 10
– and an upper threshold of 60
• relative frequency factor compared to the reference corpus
– large factors indicate "importance" in today's news; small fluctuations are noise
– threshold value (empirically determined): 6
• absolute frequency in the reference corpus
– absolute frequency is a measure of "well-knownness". This is a binary decision: people either know about a concept or not
– currently the threshold is a count of 20 in the reference corpus
– new words are excluded from this calculation
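A compact sketch of this frequency-band filter (my own illustration; the thresholds mirror the slide, the dictionaries freq_today and freq_ref are hypothetical inputs, and the corpus-size normalisation of the frequency ratio is omitted for brevity):

```python
def words_of_the_day(freq_today, freq_ref,
                     min_today=10, max_today=60,
                     ratio_threshold=6, min_ref=20):
    """Select keyword candidates by comparing today's corpus with a reference corpus."""
    candidates = []
    for word, f_today in freq_today.items():
        if not (min_today <= f_today <= max_today):   # frequency band of the day
            continue
        f_ref = freq_ref.get(word, 0)
        if f_ref and f_ref < min_ref:                  # below the "well-knownness" threshold
            continue
        # relative frequency factor; new words (f_ref == 0) bypass this check
        ratio = f_today / f_ref if f_ref else float("inf")
        if ratio >= ratio_threshold:
            candidates.append((word, ratio))
    return sorted(candidates, key=lambda x: x[1], reverse=True)
```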
20
Word of the day: 24/09/2005
http://wortschatz.informatik.uni-leipzig.de/wort-des-tages
21
Word stock market for 1-hit-wonders:
Katrina in New Orleans
22
Word stock market for periodic events:
Formula 1
11/09/2005: Alonso did not manage to secure the world championship before the end of the season
23
German Parliament elections 19/09/2005:
predicting the future?
[Chart panels: analyzing text, counting the votes, polls]
24
Co-occurrences of higher Orders
• (sentence-based) co-occurrences of first order:
words that co-occur significantly often together in sentences
• co-occurrences of second order:
words that co-occur significantly often in collocation sets of first order
• co-occurrences of n-th order:
words that co-occur significantly often in collocation sets of (n-1)-th order
When calculating a higher order, the significance values of the preceding order are not relevant. A co-occurrence set consists of the N highest-ranked co-occurrences of a word.
Biemann, C.; Bordag, S.; Quasthoff, U. (2004): Automatic Acquisition of Paradigmatic Relations using
Iterated Co-occurrences, Proceedings of LREC2004, Lisboa, Portugal
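A schematic sketch of the iteration described above (my own illustration, not the project's code); cooc_sets maps each word to its set of top-N n-th order co-occurrents, and sig can be the Poisson measure sketched earlier:

```python
from collections import Counter

def next_order(cooc_sets, sig, top_n=50):
    """Compute (n+1)-th order co-occurrence sets from n-th order sets.

    Each collocation set is treated like a 'sentence' in which words can co-occur.
    """
    freq, joint = Counter(), Counter()
    for members in cooc_sets.values():
        members = sorted(members)
        freq.update(members)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                joint[(a, b)] += 1
    n = len(cooc_sets)
    ranked = {}
    for (a, b), k in joint.items():
        s = sig(freq[a], freq[b], k, n)
        ranked.setdefault(a, []).append((s, b))
        ranked.setdefault(b, []).append((s, a))
    # the next-order co-occurrence set: the N most significant partners of each word
    return {w: {x for _, x in sorted(pairs, reverse=True)[:top_n]}
            for w, pairs in ranked.items()}
```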
25
Examples: Co-occurrences of
higher orders
Order   Reference word   TOP-10 collocations
N2      wine             wines, champagne, beer, water, tea, coffee, Wine, alcoholic, beers, cider
S10     wine             wines, grape, sauvignon, chardonnay, noir, pinot, cabernet, spicy, bottle, grapes
S1      ringing          phone, bells, phones, hook, bell, endorsement, distinctive, ears, alarm, telephone
S2      ringing          rung, Centrex, rang, phone, sounded, bell, ring, FaxxMaster, sound, tolled
S4      ringing          sounded, rung, rang, tolled, tolling, sound, tone, toll, ring, doorbell
S10     pressing         Ctrl, Shift, press, keypad, keys, key, keyboard, you, cursor, menu, PgDn, keyboards, numeric, Alt, Caps, CapsLock, NUMLOCK, NumLock, Scroll
26
Intersection of Co-occurrence Sets:
resolving ambiguity
[Diagram: co-occurrence sets of the ambiguous word "Stich" (card-game trick vs. the tennis player Michael Stich); intersecting them with the sets of "Herz-Bube" and "Becker" separates the two senses.]
Co-occurrences of Herz-Bube: bedient, folgenden, gereizt, Karo-Buben, Karo-Dame, Karo-König, Karte, Karten, Kreuz-Ass, Kreuz-Dame, Kreuz-Hand, Kreuz-König, legt, Mittelhand, Null ouvert, Pik, Pik-Ass, Pik-Dame, schmiert, Skat, spielt, Spielverlauf, sticht, übernimmt, zieht
Co-occurrences of Becker: Agassi, Australian Open, Bindewald, Boris, Break, Chang, Dickhaut, gewann, Ivanisevic, Kafelnikow, Kiefer, Komljenovic, Leimen, Matchball, Michael Stich, Monte Carlo, Prinosil, Sieg, Spiel, spielen, Steeb, Teamchef
Co-occurrences of Stich (both senses mixed): Becker, Courier, Einzel, Elmshorn, French Open, Herz-As, ins, Kafelnikow, Karbacher, Krajicek, Kreuz-As, Kreuz-Bube, Michael Stich, Mittelhand, Pik-As, Pik-Bube, Pik-König
Intersection with Herz-Bube (card-game sense): Alleinspieler, Herz, Herz-Dame, Herz-König, Hinterhand, Karo, Karo-As, Karo-Bube, Kreuz-As, Kreuz-Bube, Pik-As, Pik-Bube, Pik-König, Vorhand
Intersection with Becker (tennis sense): Achtelfinale, Aufschlag, Boris Becker, Daviscup, Doppel, DTB, Edberg, Finale, Graf, Haas, Halbfinale, Match, Pilic, Runde, Sampras, Satz, Tennis, Turnier, Viertelfinale, Weltrangliste, Wimbledon
27
Detection of X-onyms
synonyms, antonyms, (co)-hyponyms...
• Idea: the intersection of the co-occurrence sets of two X-onyms used as reference words should contain further X-onyms
• lexical ambiguity of one reference word does not degrade the result set
• Method (a sketch follows after this list):
- detect the word class of the reference words
- calculate co-occurrences for the reference words
- filter the co-occurrences w.r.t. the word class of the reference words (by means of POS tags)
- intersect the co-occurrence sets
- output the result
• ranking can be realized over the significance values of the co-occurrences
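A minimal sketch of this method (my own illustration); cooc maps a word to its list of (co-occurrent, significance) pairs ordered by significance, and pos is a hypothetical POS lookup:

```python
def xonym_candidates(ref_a, ref_b, cooc, pos, top_n=50):
    """Propose X-onym candidates by intersecting filtered co-occurrence sets."""
    target_pos = pos[ref_a]                # word class of the reference words
    def filtered(word):
        return {w: s for w, s in cooc.get(word, [])[:top_n] if pos.get(w) == target_pos}
    a, b = filtered(ref_a), filtered(ref_b)
    shared = set(a) & set(b)               # intersection of the two co-occurrence sets
    # rank candidates by their combined significance values
    return sorted(shared, key=lambda w: a[w] + b[w], reverse=True)
```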
28
Mini-Evaluation
• Experiments for different data sources, NB-collocations of 2nd and 3rd order
• fraction of X-onyms in the TOP 5 higher than in the TOP 10 → the ranking method makes sense
• intersection of 2nd-order and 3rd-order co-occurrences almost always empty → different orders exhibit different relations
• satisfactory quantity; larger corpora would yield more
• quality: not precise enough for unsupervised extension
29
Word Sets for Thesaurus
Expansion
Application: thesaurus expansion
start set: [warm, kalt] [warm, cold]
result set: [heiß, wärmer, kälter, erwärmt, gut, heißer, hoch,
höher, niedriger, schlecht, frei] [hot, warmer, colder, warmed,
good, hotter, high, higher, lower, bad, free]
start set: [gelb, rot] [yellow, red]
result set: [blau, grün, schwarz, grau, bunt, leuchtend, rötlich,
braun, dunkel, rotbraun, weiß] [blue, green, black, grey, colorful,
bright, reddish, brown, dark, red-brown, white]
start set: [Mörder, Killer] [murderer, killer]
result set: [Täter, Straftäter, Verbrecher, Kriegsverbrecher, Räuber, Terroristen, Mann, Mitglieder, Männer, Attentäter] [offender, delinquent, criminal, war criminal, robber, terrorists, man, members, men, assassin]
30
More Examples in English
Intersection of N2-Order co-occurrence sets
31
Dictionary Acquisition
Using Parallel Text and Co-occurrence Statistics
Given:
• certain amounts of sentence-aligned parallel texts
Not available:
• morphological, grammatical, semantic or other information
• string similarity for cognates
• bilingual dictionary
Wanted:
• bilingual dictionaries
• alignment on word level
Biemann, C. and Quasthoff, U. (2005): Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics, Proceedings of NODALIDA-05, Joensuu, Finland (to appear)
32
Trans-co-occurrences
Translingual co-occurrences
'normal' co-occurrences:
• calculation is performed on a sentence basis
• co-occurrents are frequently found together in sentences
Trans-co-occurrences:
• calculation is performed on bilingual sentence pairs
• co-occurrents are frequently found together in bilingual sentence pairs
• Hypothesis: significant co-occurrences between words of different languages (= trans-co-occurrences) are translation equivalents
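A minimal counting sketch (my own illustration, not the project's implementation); pairs is assumed to be a list of already sentence-aligned (source_tokens, target_tokens) pairs, and sig can be the Poisson measure sketched earlier:

```python
from collections import Counter

def trans_cooccurrences(pairs, sig, top_n=20):
    """Rank translation candidates by significance of joint occurrence in sentence pairs."""
    freq_src, freq_tgt, joint = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in pairs:
        src, tgt = set(src_tokens), set(tgt_tokens)
        freq_src.update(src)
        freq_tgt.update(tgt)
        for s in src:
            for t in tgt:
                joint[(s, t)] += 1
    n = len(pairs)
    ranked = {}
    for (s, t), k in joint.items():
        ranked.setdefault(s, []).append((sig(freq_src[s], freq_tgt[t], k, n), t))
    # keep the top-N most significant candidates per source word
    return {s: [t for _, t in sorted(c, reverse=True)[:top_n]] for s, c in ranked.items()}
```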
33
Example:
Gesellschaft@de society@en
Die@de drogenfreie@de Gesellschaft@de wird@de es@de aber@de nie@de geben@de .@de But@en there@en never@en will@en be@en a@en drug-free@en society@en .@en
Unsere@de Gesellschaft@de neigt@de leider@de dazu@de ,@de Gesetze@de
zu@de umgehen@de .@de Unfortunately@en ,@en our@en society@en
is@en inclined@en to@en skirt@en round@en the@en law@en .@en
Zum@de Glück@de kommt@de das@de in@de einer@de demokratischen@de
Gesellschaft@de selten@de vor@de .@de Fortunately@en ,@en in@en
a@en democratic@en society@en this@en is@en rare@en .@en
Herr@de Präsident@de !@de Wir@de leben@de in@de einer@de paradoxen@de
Gesellschaft@de .@de Mr@en President@en ,@en we@en live@en in@en
a@en paradoxical@en society@en .@en
Ich@de sprach@de vom@de Paradoxon@de unserer@de Gesellschaft@de .@de
I@en mentioned@en what@en is@en paradoxical@en in@en society@en
.@en
Zeit@de ist@de Macht@de in@de unserer@de Gesellschaft@de .@de Time@en is@en power@en in@en our@en society@en .@en
In all sentence pairs, Gesellschaft@de and
society@en occur together.
34
top-ranked trans-co-occurrences
Gesellschaft: society@en (12082), social@en (342), our@en (274), in@en (237),
societies@en (226), Society@en (187), women@en (183), as@en a@en whole@en
(182), of@en our@en (168), open@en society@en (165), democratic@en (159),
company@en (137), modern@en (134), children@en (120), values@en (120),
economy@en (119), of@en a@en (111), knowledge-based@en (110), European@en
(105), civil@en society@en (102)
society: Gesellschaft@de (12082), unserer@de (466), einer@de (379),
gesellschaftlichen@de (328), Wissensgesellschaft@de (312), Menschen@de (233),
gesellschaftliche@de (219), Frauen@de (213), Zivilgesellschaft@de (179),
Gesellschaften@de (173), Informationsgesellschaft@de (161), modernen@de (157),
sozialen@de (155), Wirtschaft@de (132), Leben@de (119), Familie@de (118),
Gesellschaftsmodell@de (108), demokratischen@de (108), soziale@de (98),
Schichten@de (97)
kaum: hardly@en (825), scarcely@en (470), little@en (362), barely@en (278), hardly@en
any@en (254), very@en little@en (186), almost@en (88), difficult@en (68), unlikely@en
(63), virtually@en (53), scarcely@en any@en (51), impossible@en (47), or@en no@en
(40), there@en is@en (38), hardly@en ever@en (37), any@en (32), hardly@en
anything@en (32), surprising@en (31), hardly@en a@en (29), hard@en (28)
hardly: kaum@de (825), wohl@de kaum@de (138), schwerlich@de (64), nicht@de (51),
verwunderlich@de (43), kann@de (37), wenig@de (37), wundern@de (25), man@de
(21), dürfte@de (17), gar@de nicht@de (17), auch@de nicht@de (16), gerade@de (16),
überrascht@de (15), fast@de (14), überraschen@de (14), praktisch@de (13), ist@de
(12), schlecht@de (12), verwundern@de (12)
35
Evaluation for German-English
on Europarl & dict.uni-chemnitz
[Chart: precision of the 1st to 3rd translation candidate (x-axis: candidate rank 1-3, y-axis: precision in %, 0-60), frequency > 100, en->de, comparing sim=1, sim>0.8, sim>0.6 and [Sahlgren 2004].]
36
Comparison with [Sahlgren 2004]
[Chart: precision over 9 frequency ranges (x-axis: frequency from 1 to 1,000,000, log scale; y-axis: precision 0-1), en->de, comparing sim1=1, sim1>0.6 and [Sahlgren 2004].]
Sahlgren, M. (2004): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned
Bilingual Data, Proceedings of LREC-2004, Lisboa, Portugal
37
Alignment
Given:
• Bilingual sentence pair
Wanted:
• Which word corresponds with which?
Method (a sketch follows below):
• Scan sentence 1 word by word and link each word to the highest-ranked word in its trans-co-occurrences that can be found in sentence 2.
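A compact sketch of this greedy linking (my own illustration); trans_cooc maps a source word to its ranked list of trans-co-occurrents, e.g. as produced by the counting sketch above:

```python
def align(sentence1, sentence2, trans_cooc):
    """Link each word of sentence1 to its best trans-co-occurrent found in sentence2.

    Returns (source_word, target_word_or_None, rank) triples; rank 1 = best candidate.
    """
    targets = set(sentence2)
    links = []
    for word in sentence1:
        link, rank = None, None
        for i, candidate in enumerate(trans_cooc.get(word, []), start=1):
            if candidate in targets:               # highest-ranked candidate in sentence 2
                link, rank = candidate, i
                break
        links.append((word, link, rank))
    return links
```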
38
Alignment: Example 1
Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union .
Agriculture only provides 5.5 % of employment in (the Union) .
[Alignment arrows omitted.]
Red words: no alignment
Blue arrows: errors
Arrow index: rank in trans-co-occurrences
39
Alignment: Example 2
Indem wir den Mitgliedstaaten für die Umsetzung der Richtlinie kein spezifisches Datum setzen, sondern ihnen einen Zeitraum von drei Monaten nach Inkrafttreten der Richtlinie zugestehen, führen wir eine Flexibilitätsklausel ein, die eine unverzügliche Umsetzung gewährleistet.
By not setting a specific date (for the) Member States (to implement) the directive and instead giving them a period of three months after its (entry into force), we are introducing a flexibility clause which ensures that the directive will be implemented without delay.
[Alignment arrows omitted.]
Grey arrows: multiple alignments for frequent words
40
Manual Evaluation
on random samples of 1,000 words
[Bar chart: precision (correct and partially correct) for the 1st translation candidate, per language pair: de-en, en-de, da-en, en-da, sv-en, en-sv, nl-en, en-nl.]
[Bar chart: the same for the 2nd translation candidate.]
Better results:
- no domain-dependent deficiency of the dictionary
- no problems with inflection
41
Nextlinks and Findlinks
Applying Co-occurrence Statistics to Co-citation Web Graphs
• Co-occurrence of words: words that appear together in sentences more often than to be expected
• Co-occurrence of (external) links on websites: links that appear together on websites more often than to be expected
• This is called co-citation; in the co-citation web graph, websites are nodes, and an edge means that two websites are linked on another website.
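A small sketch of scoring co-citations with the same machinery (my own illustration); pages is a hypothetical list of the sets of external links found on each crawled page, and sig can again be the Poisson measure from above:

```python
from collections import Counter
from itertools import combinations

def cocitation_scores(pages, sig):
    """Score pairs of external links that appear together on the same pages."""
    link_freq, joint = Counter(), Counter()
    for links in pages:                            # links: set of external URLs on one page
        link_freq.update(links)
        for a, b in combinations(sorted(links), 2):
            joint[(a, b)] += 1
    n = len(pages)
    return {(a, b): sig(link_freq[a], link_freq[b], k, n) for (a, b), k in joint.items()}
```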
Heyer, G. and Quasthoff, U. (2004): Calculating Communities by Link Analysis of URLs. Proceedings of IICS-04, Guadalajara, Mexico and Springer LNCS 3473, pp. 151-156, Springer Verlag Berlin Heidelberg
42
External Links
External links point to a different web server. In contrast, internal links are for navigation inside a web site.
External links are explicit associations:
• They refer to something external.
• Usually, they are optional to follow.
• Usually, they are hand-made and elaborate.
In contrast, internal links are often structural links. They are usually machine-generated for navigation.
About 10% of web pages have more than one external link.
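A tiny helper illustrating this external/internal distinction (my own sketch, using only the standard library):

```python
from urllib.parse import urlparse

def is_external(link_url, page_url):
    """A link is external when it points to a different web server (host)."""
    link_host = urlparse(link_url).netloc.lower()
    page_host = urlparse(page_url).netloc.lower()
    return bool(link_host) and link_host != page_host   # relative links are internal

# e.g. is_external("http://www.uio.no/", "http://wortschatz.uni-leipzig.de/") -> True
```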
43
Application: NextLinks Surf Guide
Input: Current URL shown in the
browser, here:
www.uio.no
www.leipzig.de
Output: Top 10 co-occurrences of the input URL
Often returns the most popular websites of similar interest.
Download it: http://wortschatz.uni-leipzig.de/nextlinks
44
The Crawling Problem
Crawling the Web
Today, Google knows 8 168 684 336 web pages.
Crawling them in 30 days means approx. 280 mill.
pages/day (or 3150 pages/sec).
Observation:
• Crawling yields a lot of input data.
• Extracting only the links produces very little output.
Hence, we try distributed crawling!
45
FindLinks: Distributed Client
• At the moment, we receive 500,000 pages/day per broadband client.
• Hence, we need about 600 clients to crawl the whole web (within the 30 days assumed above).
Help us and download the client:
http://wortschatz.uni-leipzig.de/nextlinks/findlinks.html
46
47
48
Abstract
The goal of the Wortschatz Project (University of Leipzig) is to process and to provide large, unannotated corpora for a variety of languages. The focus is on language-independent methods to enrich those plain text corpora with structure without using manually developed resources or language-dependent preprocessing.
Mainly building on an efficient implementation of co-occurrence statistics, approaches for acquiring knowledge from text range from word sense discrimination over trend mining and time series analysis to thesaurus expansion and bilingual dictionary acquisition. Finally, the framework is applied to web graph analysis.
49