Transcript Slide 1
Googleology is bad science
Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds
1
Web as language resource
Replaceable or replacable?
check
2
Very very large
Most languages
Most language types
Up-to-date
Free
Instant access
3
How to use the web?
Google
or other commercial search engines (CSEs)
not
4
Using CSEs
No setup costs
Start querying today
Methods
Hit counts
‘snippets’
Metasearch engines, WebCorp
Find pages and download
5
Googleology
CSE hit counts for language modelling
36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003)
finding noun-noun relations
“we issue exact phrase Google queries of type noun2 THAT * noun1”
Nakov and Hearst 2006
Small community of researchers
Corpora mailing list
Very interesting work
Intense interest in query syntax
Creativity and person-years
6
The Trouble with Google
not enough instances
max 1000
not enough queries
max 1000 per day with API
not enough context
10-word snippet around search term
ridiculous sort order
search term in titles and headings
untrustworthy hit counts
limited search syntax
No regular expressions
linguistically dumb
not lemmatised
aime/aimer/aimes/aimons/aimez/aiment …
not POS-tagged
not parsed
7
Appeal
Zero-cost entry, just start googling
Reality
High-quality work: high-cost methodology
8
Also:
No replicability
Methods, stats not published
At mercy of commercial corporation
Bad science
10
The 5-grams
A present from Google
All
1-, 2-, 3-, 4-, 5-grams
with fr>=40
in a terabyte of English
A large dataset
11
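As a rough illustration of what such a dataset contains, the sketch below counts all 1- to 5-grams in a tokenised text and keeps those at or above a frequency threshold. The function name `ngram_counts` and the whitespace tokenisation are assumptions made for illustration. Only the n-gram lengths and the threshold of 40 come from the slide; how Google actually built the data is not described here.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5, min_freq=40):
    """Count all 1..max_n-grams and keep those with frequency >= min_freq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy usage: a tiny repeated "corpus" so that some n-grams reach the threshold.
tokens = ("the web is a corpus " * 40).split()
frequent = ngram_counts(tokens)
```

In the toy input, n-grams that straddle the repetition boundary occur only 39 times and are dropped, which shows how a frequency cut-off removes the long tail.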
Prognosis
Next 3 years
Exciting new ideas
Dazzlingly clever uses
Drives progress in NLP
12
Prognosis
Next 3 years
Exciting new ideas
Dazzlingly clever uses
After 5+ years
A chain round our necks
Cf Penn Treebank (others? Brickbats?)
Resource-led vs. ideas-led research
13
How to use the web?
Google
or other commercial search engines (CSEs)
not
14
Language and the web
Web is mostly linguistic
Text on web << whole web (in GB)
Not many TB of text
Special hardware not needed
We are the experts
15
Community-building
ACL SIGWAC
WAC Kool Ynitiative (WaCKY)
Mailing list
Open source
WAC workshops
WAC1, Birmingham 2005
WAC2, Trento (EACL), April 2006
WAC3, Louvain, Sept 15-16 2007
16
Proof of concept: DeWaC, ItWaC
1.5 B words each, German and Italian
Marco Baroni, Bologna (+ AK)
17
What is out there?
What text types?
some are new: chatroom
proportions
is it overwhelmed by porn? How much?
Hard question
18
What is out there
The web
a social, cultural, political phenomenon
new, little understood
a legitimate object of science
mostly language
we are well placed
a lot of people will be interested
Let’s
study the web
source of language data
apply our tools for web use (dictionaries, MT)
use the web as infrastructure
19
How to do it:
Components
1. web crawler
2. filters and classifiers
de-duplication
3. linguistic processing
Lemmatise, POS-tag, parse
4. Database
Indexing
user interface
20
1. Crawling
How big is your hard disk?
When will your sysadmin ban you?
DeWaC/ItWaC
Open source crawler: heritrix
21
1.1 Seeding the crawl
Mid-frequency words
Spread of text types
Formal and informal, not just newspaper
DeWaC
Words from newspaper corpus
Words from list with “kitchen” vocab
Use Google to get seeds for crawls
22
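The seeding step can be sketched as follows: take mid-frequency words from a corpus, combine them into random queries, and send those to a search engine to collect seed URLs. Everything specific here is an assumption for illustration: the function names, the rank band, and the choice of two words per query. The search engine call itself is left out.

```python
import random
from collections import Counter

def mid_frequency_words(tokens, low_rank=1000, high_rank=6000):
    """Return words whose frequency rank falls in [low_rank, high_rank)."""
    ranked = [w for w, _ in Counter(tokens).most_common()]
    return ranked[low_rank:high_rank]

def seed_queries(words, n_queries, words_per_query=2, seed=0):
    """Build random multi-word queries to send to a search engine."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, words_per_query))
            for _ in range(n_queries)]

# Toy usage with a tiny vocabulary and a correspondingly narrow rank band.
tokens = "a a a a b b b c c d d e f".split()
words = mid_frequency_words(tokens, low_rank=1, high_rank=4)
queries = seed_queries(words, n_queries=3)
```

Running the same procedure on word lists from different sources (a newspaper corpus, a list of everyday vocabulary) is one way to get the spread of text types the slide asks for.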
2. Filtering
non ‘running-text’ stripping
Function word filtering
Porn filtering
De-duplication
23
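One common way to approach the de-duplication step is to compare documents by their overlapping word n-grams ("shingles"). The sketch below is a minimal version of that idea, not necessarily the method used for DeWaC/ItWaC; the shingle length, the 0.5 threshold and the function names are assumptions, and the pairwise comparison would need to be replaced by something more scalable for a real crawl.

```python
def shingles(text, n=5):
    """Set of word n-grams ('shingles') in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def deduplicate(docs, threshold=0.5, n=5):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

d1 = "the quick brown fox jumps over the lazy dog near the river bank"
d2 = d1 + " today"   # near-duplicate of d1
d3 = "completely different text about web corpora and how to clean them"
unique = deduplicate([d1, d2, d3])
```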
2.1 Filtering: Sentences
What is the text that we want?
Lists?
Links?
Catalogues?
…
For linguistics, NLP
in sentences
Use function words
24
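The idea of using function words to find running text can be sketched like this: text made of sentences contains a fairly high proportion of function words, whereas lists, link bars and catalogues contain few. The word list, the 0.25 threshold and the function names below are illustrative assumptions, not values from the talk.

```python
# A small illustrative list; a real filter would use a fuller,
# language-specific list.
FUNCTION_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that",
    "it", "for", "was", "on", "with", "as", "be",
}

def function_word_ratio(text):
    """Proportion of tokens that are function words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in FUNCTION_WORDS for w in words) / len(words)

def looks_like_running_text(text, min_ratio=0.25):
    """Heuristic: enough function words suggests text in sentences."""
    return function_word_ratio(text) >= min_ratio

sentence = "The web is a source of language data that is mostly in sentences."
menu = "Home | Products | Contact | Login | Sitemap"
```

Here `looks_like_running_text(sentence)` is true and `looks_like_running_text(menu)` is false, since the menu line contains no function words at all.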
2.2 Filtering: CLEANEVAL
“Text cleaning”
Lots to be done, not glamorous
Many kinds of dirt needing many kinds of filter
Open Competition/shared task
Who can produce the cleanest text?!
Input: arbitrary web pages
“gold standard”
paragraph-marked plain text
Prepared by people
Workshop Sept 2007. do join us!
http://cleaneval.sigwac.org.uk
25
3. Linguistic processing
Lemmatise, POS-tag, parse
Find leading NLP group for each language
Be nice to them
Use their tools
26
Database, interface
Solved problem (at least for 1.5 BW)
Sketch Engine
27
“Despite all the disadvantages, it’s still so much bigger”
28
How much bigger?
Method
Sample words: 30
Mid-to-high freq
Not common words in other major lgs
Min 5 chars
Compare freqs, Google vs ItWaC/DeWaC
29
Google results (Italian)
Arbitrariness
Repeat identical searches
9/30: > 10% difference
6/30: > 100% difference
API: typically 1/18th ‘manual’ figure
Language filter
mista bomba clima
mostly non-Italian pages
use MAX and MIN of 6 lg-filtered results
30
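A check of this kind can be sketched as below: repeat the same query, record the hit counts, and see how many words show a spread above a given threshold. The slide does not say how "difference" was measured, so the definition used here, (max - min) / min, is an assumption, as are the function names. The hit counts in the usage example are made up purely to exercise the code.

```python
def relative_spread(counts):
    """(max - min) / min for repeated hit counts of one query.

    Assumes all counts are positive.
    """
    lo, hi = min(counts), max(counts)
    return (hi - lo) / lo

def summarise(hit_counts, thresholds=(0.10, 1.00)):
    """For each threshold, count the words whose spread exceeds it."""
    spreads = {w: relative_spread(c) for w, c in hit_counts.items()}
    return {t: sum(s > t for s in spreads.values()) for t in thresholds}

# Invented counts for three placeholder words, two searches each.
toy_counts = {
    "alpha": [1000, 1050],   # 5% spread
    "beta": [1000, 1500],    # 50% spread
    "gamma": [1000, 2500],   # 150% spread
}
summary = summarise(toy_counts)
```

Keeping both the maximum and the minimum per word, rather than a single figure, carries the uncertainty forward into later steps.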
Clima=
Computational logic in multi-agent systems
Centre for Legumes in Mediterranean Agriculture
(5-char limit too short)
31
Ratios, Google:DeWaC
WORD        MAX    MIN    RAW     CLEAN
---------------------------------------
besuchte    10.5   3.8    81840   18228
stirn       3.38   0.62   32320   11137
gerufen     7.14   3.72   66720   27187
verringert  6.86   3.46   52160   15987
bislang     24.4   11.6   239000  90098
brach       4.36   2.26   44520   19824
---------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe
32
ItWaC:Google ratio, best estimate
For each of 30 words
Calculate ratio, max:raw
Calculate ratio, min:raw
Take mid-point and average: 1:33 or 3%
Calculate raw:vert
Average = 4.4
half (for conservativeness/uncertainty) = 2.2
3% x 2.2 = 6.6%
ItWaC:Google = 6.6%
33
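The estimation procedure can be sketched in code. The rows below are the six German example words from the Google:DeWaC table, used only as sample input; they are not the 30 Italian words behind the 3% figure, so the intermediate values will not reproduce the slide's numbers. The function names, and the reading of "mid-point" as the mean of the max:raw and min:raw ratios, are assumptions.

```python
# (word, Google MAX in millions, Google MIN in millions, RAW, CLEAN)
ROWS = [
    ("besuchte",   10.5,  3.8,   81840, 18228),
    ("stirn",      3.38,  0.62,  32320, 11137),
    ("gerufen",    7.14,  3.72,  66720, 27187),
    ("verringert", 6.86,  3.46,  52160, 15987),
    ("bislang",    24.4,  11.6, 239000, 90098),
    ("brach",      4.36,  2.26,  44520, 19824),
]

def google_to_raw_midpoint(rows):
    """Average over words of the mid-point between max:raw and min:raw."""
    mids = []
    for _, g_max, g_min, raw, _ in rows:
        mids.append((g_max * 1e6 / raw + g_min * 1e6 / raw) / 2)
    return sum(mids) / len(mids)

def raw_to_clean(rows):
    """Average ratio of raw to cleaned document frequency."""
    return sum(raw / clean for _, _, _, raw, clean in rows) / len(rows)

def corpus_share(raw_share, raw_clean_ratio):
    """Scale the raw share by half the raw:clean ratio, as on the slide."""
    return raw_share * raw_clean_ratio / 2

midpoint = google_to_raw_midpoint(ROWS)  # Google pages per raw corpus document
cleaning = raw_to_clean(ROWS)            # raw:clean for the six sample words
share = corpus_share(0.03, 4.4)          # the slide's Italian figures
```

With the slide's own inputs (3% and 4.4), `share` comes out at 0.066, matching the 6.6% stated above.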
Italian web size
ItWaC = 1.67b words
Google indexes 1.67/.066 = 25 bn words
sentential non-dupe Italian
34
German web size
Analysis as for Italian
DeWaC: 3% Google
DeWaC = 1.41b words
Google indexes 1.41/.03 = 44 bn words
sentential non-dupe German
35
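The size estimates are a single division: corpus size over the corpus's estimated share of what Google indexes. The sketch below uses the figures from the two slides; the function name is an assumption. Note that with the rounded 3% the German division gives roughly 47 rather than 44, so the slide's figure presumably rests on an unrounded ratio (an assumption on the editor's part, not something stated in the talk).

```python
def indexed_words_bn(corpus_words_bn, corpus_share_of_google):
    """Estimated words indexed by Google, in billions."""
    return corpus_words_bn / corpus_share_of_google

italian = indexed_words_bn(1.67, 0.066)  # ItWaC size and share from the slides
german = indexed_words_bn(1.41, 0.03)    # DeWaC size and rounded share
```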
Effort
ItWaC, DeWaC
Less than 6 person months
Developing the method
(EnWaC: in progress)
36
Plan
ACL adopts it (like ACL Anthology) (LDC?)
Say: 3 core staff, 3 years
Goals could be:
English: 2% G-scale (still biggest part)
6 other major languages: 30% G-scale
30 other languages: 10% G-scale
Online for
Searching as in SkE
Specifying, downloading subcorpora for intensive NLP
“corpora on demand”
Don’t quote me
37
Logjams
Cleaning
See CLEANEVAL
Text type
“what kind of page is it?”
Critical but under-researched
WebDoc proposal
(with Serge Sharoff, Tony Hartley)
(a different talk)
38
Moral
Google, CSEs are wonderful
Start today but
bad science
Not
Good science, reliable counts
We (the NLP community) have the skills
With collective effort, mid-sized project
Google-scale is achievable
39
Thank you
http://www.sketchengine.co.uk
40
Scale and speed, LSE
Commercial search engines
banks of computers
highly optimised code
but this is for performance
no downtime
instant responses to millions of queries
This proposal
crawling: once a year
downtime: acceptable
not so many users
41
…but it’s not representative
The web is not representative
but nor is anything else
Text type variation
under-researched, lacking in theory
Atkins, Clear, Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001
Text type is an issue across NLP
Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
42
Oxford English Corpus
Method as above
Whole domains chosen and harvested
control over text type
1 billion words
Public launch April 2006
Loaded into Sketch Engine
43
Examples
DeWaC, ItWaC
Baroni and Kilgarriff, EACL 2006
Serge Sharoff, Leeds Univ UK
English Chinese Russian English French
Spanish, all searchable online
Oxford English Corpus
46
Options for academics
Give up
Niche markets, obscure languages
Leave the mainstream to the big guys
Work out how to work on that scale
Web is free, data availability not a problem
47