Transcript Slide 1

Googleology is bad science
Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds
1
Web as language resource
• Replaceable or replacable?
• Check!
2
• Very very large
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access
3
How to use the web?
• Google
  (or other commercial search engines, CSEs)
• … not
4
Using CSEs
• No setup costs
• Start querying today
• Methods
  • hit counts
  • ‘snippets’
  • metasearch engines, WebCorp
  • find pages and download
5
Googleology
• CSE hit counts for language modelling
  • 36 queries to each of Google and AltaVista to estimate freq(fulfil, obligation) (Keller & Lapata 2003)
• Finding noun-noun relations
  • “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006; see the sketch below)
• Small community of researchers
  • Corpora mailing list
• Very interesting work
• Intense interest in query syntax
• Creativity and person-years
6
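A minimal sketch of the kind of wildcard query string this involves; the function name, the example compound “malaria mosquito” and the wildcard limit are illustrative assumptions, not Nakov and Hearst’s actual code.

def paraphrase_queries(noun1: str, noun2: str, max_wildcards: int = 3):
    # Yield exact-phrase queries of the form "noun2 that * ... * noun1".
    for n in range(1, max_wildcards + 1):
        stars = " ".join("*" for _ in range(n))
        yield f'"{noun2} that {stars} {noun1}"'

for q in paraphrase_queries("malaria", "mosquito"):
    print(q)
# "mosquito that * malaria"
# "mosquito that * * malaria"
# "mosquito that * * * malaria"

The snippets returned for such queries (e.g. “mosquito that carries malaria”) are then mined for the verb relating the two nouns.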
The Trouble with Google
• Not enough instances
  • max 1,000 per query
• Not enough queries
  • max 1,000 per day with the API
• Not enough context
  • 10-word snippet around the search term
• Ridiculous sort order
  • search term in titles and headings
• Untrustworthy hit counts
• Limited search syntax
  • no regular expressions
• Linguistically dumb
  • not lemmatised: aime/aimer/aimes/aimons/aimez/aiment …
  • not POS-tagged
  • not parsed
7
• Appeal
  • zero-cost entry, just start googling
• Reality
  • high-quality work needs a high-cost methodology
8
Also:
• No replicability
• Methods, stats not published
• At the mercy of a commercial corporation
9
Also:
• No replicability
• Methods, stats not published
• At the mercy of a commercial corporation
• Bad science
10
The 5-grams
• A present from Google
• All 1-, 2-, 3-, 4- and 5-grams
  • with frequency >= 40
  • in a terabyte of English
• A large dataset (a counting sketch follows below)
11
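A minimal sketch, as an assumption rather than Google’s actual pipeline, of producing n-gram counts with a frequency cut-off in this style.

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, max_n=5, min_freq=40):
    # Count 1..max_n-grams over tokenised sentences, keep only frequent ones.
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_freq}

At terabyte scale the real job is done with sharded counting over many machines, but the thresholding idea is the same.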
Prognosis
• Next 3 years
  • exciting new ideas
  • dazzlingly clever uses
  • drives progress in NLP
12
Prognosis
• Next 3 years
  • exciting new ideas
  • dazzlingly clever uses
• After 5+ years
  • a chain round our necks
  • cf. Penn Treebank (others? brickbats?)
  • resource-led vs. ideas-led research
13
How to use the web?
• Google
  (or other commercial search engines, CSEs)
• … not
14
Language and the web
• The web is mostly linguistic
• Text on the web << whole web (in GB)
• Not many TB of text
• Special hardware not needed
• We are the experts
15
Community-building
• ACL SIGWAC
• WAC Kool Ynitiative (WaCKY)
  • mailing list
  • open source
• WAC workshops
  • WAC1, Birmingham, 2005
  • WAC2, Trento (EACL), April 2006
  • WAC3, Louvain, Sept 15-16, 2007
16
Proof of concept: DeWaC, ItWaC
• 1.5 billion words each, German and Italian
• Marco Baroni, Bologna (+ AK)
17
What is out there?
• What text types?
  • some are new: chatrooms
• Proportions
  • is it overwhelmed by porn? How much?
  • a hard question
18
What is out there
• The web:
  • a social, cultural, political phenomenon
  • new, little understood
  • a legitimate object of science
  • mostly language
  • we are well placed
  • a lot of people will be interested
• Let’s:
  • study the web
  • use it as a source of language data
  • apply our tools for web use (dictionaries, MT)
  • use the web as infrastructure
19
How to do it: components
1. Web crawler
2. Filters and classifiers
   • de-duplication
3. Linguistic processing
   • lemmatise, POS-tag, parse
4. Database
   • indexing
   • user interface
(A minimal end-to-end sketch of these four components follows below.)
20
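A minimal, self-contained sketch of the four components above. Every function here is a stand-in with hypothetical names and toy logic, an assumption for illustration, not the actual DeWaC/ItWaC code.

def crawl(seed_urls):
    # 1. Web crawler (in practice an off-the-shelf crawler such as Heritrix);
    # here each URL just yields one page of placeholder text.
    return [f"Placeholder page text fetched from {url} ." for url in seed_urls]

def is_running_text(page):
    # 2. Filters and classifiers: keep pages that look like connected prose.
    # (The real filters use function-word density, porn filters, etc.)
    return len(page.split()) >= 5

def deduplicate(pages):
    # 2. De-duplication: drop exact duplicates, preserving order.
    # (Real systems also detect near-duplicates.)
    return list(dict.fromkeys(pages))

def lemmatise_tag_parse(page):
    # 3. Linguistic processing: stand-in for a lemmatiser / POS tagger / parser.
    return [(tok, tok.lower(), "TOKEN") for tok in page.split()]

def index(docs):
    # 4. Database: indexing plus a user interface (e.g. the Sketch Engine).
    return {doc_id: doc for doc_id, doc in enumerate(docs)}

pages = [p for p in crawl(["http://example.org"]) if is_running_text(p)]
corpus = index([lemmatise_tag_parse(p) for p in deduplicate(pages)])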
1. Crawling
• How big is your hard disk?
• When will your sysadmin ban you?
• DeWaC/ItWaC
  • open-source crawler: Heritrix
21
1.1 Seeding the crawl
• Mid-frequency words
• Spread of text types
  • formal and informal, not just newspaper
• DeWaC
  • words from a newspaper corpus
  • words from a list with “kitchen” vocab
• Use Google to get seeds for crawls (a seeding sketch follows below)
22
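A minimal sketch of the seeding idea: random tuples of mid-frequency words, mixing newspaper and everyday (“kitchen”) vocabulary, each tuple sent as a query whose top hits become seed URLs. The German word lists below are illustrative assumptions, not the lists actually used.

import random

newspaper_words = ["regierung", "wirtschaft", "bericht", "verhandlung"]
kitchen_words = ["kartoffel", "pfanne", "fenster", "schlafen"]

def seed_queries(n_queries=10, words_per_query=3, seed=0):
    # Each query is a random tuple of mid-frequency words.
    rng = random.Random(seed)
    vocab = newspaper_words + kitchen_words
    return [" ".join(rng.sample(vocab, words_per_query)) for _ in range(n_queries)]

for query in seed_queries(3):
    print(query)  # the top search-engine hits for each query seed the crawl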
2. Filtering
• Non-‘running-text’ stripping
• Function-word filtering
• Porn filtering
• De-duplication (see the shingling sketch below)
23
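A minimal near-duplicate detection sketch using word shingles and Jaccard overlap; this is a generic technique offered as an assumption, not necessarily the one used for DeWaC/ItWaC.

def shingles(text, n=5):
    # Contiguous n-word "shingles" of a document, lower-cased.
    toks = text.lower().split()
    if len(toks) < n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def are_near_duplicates(doc_a, doc_b, threshold=0.5):
    # Jaccard overlap of shingle sets; above the threshold, treat as duplicates.
    a, b = shingles(doc_a), shingles(doc_b)
    return len(a & b) / len(a | b) >= threshold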
2.1 Filtering: sentences
• What is the text that we want?
  • lists?
  • links?
  • catalogues?
  • …
• For linguistics and NLP: text in sentences
• Use function words (a filtering sketch follows below)
24
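A minimal sketch of a function-word filter: keep only text where common function words make up a reasonable share of the tokens, a crude proxy for connected sentences rather than lists, link menus or catalogues. The word list and thresholds are illustrative assumptions.

FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it", "for"}

def looks_like_sentences(text, min_tokens=20, min_ratio=0.15):
    # Reject very short fragments, then require a minimum function-word density.
    toks = text.lower().split()
    if len(toks) < min_tokens:
        return False
    return sum(t in FUNCTION_WORDS for t in toks) / len(toks) >= min_ratio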
2.2 Filtering: CLEANEVAL
• “Text cleaning”
  • lots to be done, not glamorous
  • many kinds of dirt needing many kinds of filter
• Open competition / shared task
  • Who can produce the cleanest text?!
  • input: arbitrary web pages
  • “gold standard”: paragraph-marked plain text, prepared by people
• Workshop Sept 2007: do join us!
• http://cleaneval.sigwac.org.uk
25
3. Linguistic processing
• Lemmatise, POS-tag, parse
• Find the leading NLP group for each language
• Be nice to them
• Use their tools
26
Database, interface
• Solved problem (at least for 1.5 billion words)
• Sketch Engine
27
“Despite all the disadvantages, it’s still so much bigger”
28
How much bigger?
• Method
  • sample 30 words
    • mid-to-high frequency
    • not common words in other major languages
    • min 5 chars
  • compare frequencies, Google vs ItWaC/DeWaC
29
Google results (Italian)
• Arbitrariness
  • repeat identical searches:
    • 9/30 words: > 10% difference
    • 6/30 words: > 100% difference
  • API: typically 1/18th of the ‘manual’ figure
• Language filter
  • mista, bomba, clima: mostly non-Italian pages
  • use MAX and MIN of 6 language-filtered results
30
• CLIMA =
  • Computational Logic in Multi-Agent Systems
  • Centre for Legumes in Mediterranean Agriculture
• (the 5-char minimum was too short)
31
Ratios, Google:DeWaC

WORD         MAX    MIN     RAW     CLEAN
------------------------------------------
besuchte     10.5   3.8     81840   18228
stirn         3.38  0.62    32320   11137
gerufen       7.14  3.72    66720   27187
verringert    6.86  3.46    52160   15987
bislang      24.4  11.6    239000   90098
brach         4.36  2.26    44520   19824
------------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW:     DeWaC document frequency before filters, dedupe
CLEAN:   DeWaC document frequency after filters, dedupe
32
ItWaC:Google ratio, best estimate
• For each of the 30 words:
  • calculate the ratio max:raw
  • calculate the ratio min:raw
  • take the mid-point and average over words: 1:33, i.e. 3%
• Calculate raw:vert (raw vs. cleaned, ‘vertical’ corpus counts)
  • average = 4.4
  • halve it (for conservativeness/uncertainty) = 2.2
• 3% x 2.2 = 6.6%
• ItWaC:Google = 6.6%
(A worked restatement of this arithmetic follows below.)
33
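A small worked restatement of the arithmetic on this slide and the next. The input figures (3%, 4.4, 1.67 billion words) come from the slides; the variable names are mine.

itwac_raw_share = 0.03               # mid-point of max:raw and min:raw, averaged: ~1:33, i.e. 3%
raw_to_vert = 4.4                    # average raw:vert ratio
conservative_factor = raw_to_vert / 2    # halved for conservativeness = 2.2

itwac_share_of_google = itwac_raw_share * conservative_factor
print(f"ItWaC:Google = {itwac_share_of_google:.1%}")            # 6.6%

itwac_words = 1.67e9                 # ItWaC size in words (next slide)
google_italian = itwac_words / itwac_share_of_google
print(f"Google indexes ~{google_italian / 1e9:.0f} bn words of Italian")   # ~25 bn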
Italian web size
• ItWaC = 1.67 billion words
• Google indexes 1.67 / 0.066 = 25 bn words of sentential, non-duplicate Italian
34
German web size
• Analysis as for Italian
• DeWaC: 3% of Google
• DeWaC = 1.41 billion words
• Google indexes 1.41 / 0.03 = 44 bn words of sentential, non-duplicate German
35
Effort
• ItWaC, DeWaC
  • less than 6 person-months
  • including developing the method
• (EnWaC: in progress)
36
Plan
• ACL adopts it (like the ACL Anthology) (LDC?)
• Say: 3 core staff, 3 years
• Goals could be:
  • English: 2% of G-scale (still the biggest part)
  • 6 other major languages: 30% of G-scale
  • 30 other languages: 10% of G-scale
• Online for:
  • searching, as in SkE
  • specifying and downloading subcorpora for intensive NLP
    • “corpora on demand”
• Don’t quote me ☺
37
Logjams
• Cleaning
  • see CLEANEVAL
• Text type
  • “what kind of page is it?”
  • critical but under-researched
  • WebDoc proposal (with Serge Sharoff, Tony Hartley)
  • (a different talk)
38
Moral
• Google, CSEs are wonderful
  • start today, but: bad science
  • not good science, not reliable counts
• We (the NLP community) have the skills
• With collective effort, a mid-sized project: Google-scale is achievable
39
Thank you
• http://www.sketchengine.co.uk
40
Scale and speed, LSE
• Commercial search engines
  • banks of computers
  • highly optimised code
  • but this is for performance:
    • no downtime
    • instant responses to millions of queries
• This proposal
  • crawling: once a year
  • downtime: acceptable
  • not so many users
41
…but it’s not representative
• The web is not representative
  • but nor is anything else
• Text type variation: under-researched, lacking in theory
  • Atkins, Clear & Ostler 1993 on the design brief for the BNC; Biber 1988; Baayen 2001; Kilgarriff 2001
• Text type is an issue across NLP
• Web: the issue is acute because, as against the BNC or WSJ, we simply don’t know what is there
42
Oxford English Corpus
• Method as above
• Whole domains chosen and harvested
  • control over text type
• 1 billion words
• Public launch April 2006
• Loaded into the Sketch Engine
43
Oxford English Corpus
44
Oxford English Corpus
45
Examples
• DeWaC, ItWaC
  • Baroni and Kilgarriff, EACL 2006
• Serge Sharoff, Leeds Univ, UK
  • English, Chinese, Russian, French, Spanish, all searchable online
• Oxford English Corpus
46
Options for academics
• Give up
  • niche markets, obscure languages
  • leave the mainstream to the big guys
• Work out how to work on that scale
  • the web is free, so data availability is not a problem
47