Concordancing the Web
with KWiCFinder
William H. Fletcher
United States Naval Academy
American Association for Applied Corpus Linguistics
Third North American Symposium on
Corpus Linguistics and Language Teaching, Boston, MA,
23-25 March 2001
How Big is the Web?
Now 2-4 billion webpages accessible via public links
(Cyveillance estimates & projection July 2000; Inktomi estimates are more modest.)
“Invisible web” / restricted sites several times larger
Estimated 80%-95% content in English, but…
Since mid 2000, non-Anglophones outnumber
English speakers online
Anglophones projected to be < 30% of 850 million users in 2005
Percentage of new users fluent in English decreasing
For many regions / languages, still no data available
Search Purposes
General users typically seek…
a specific site
any well-stocked site meeting their needs
Scholarly searchers must examine and
evaluate a range of sites to identify the
most relevant and reliable resources
Educators want to foster similar online
research behavior in their students
Typical Search Behaviors
Marked preference for directories with pre-selected
links organized by topic over full-text search engines
Simple queries – single word or phrase –
predominate (80%-90%)
10%-25% of attempted complex queries (Boolean
operators, bracketing) are ill-formed
Users tend to work in a single window, calling up
one document at a time, then returning to search
engine for another link
Typical Search Outcomes
Users follow up only first few links, then settle
on a page after browsing from these
Usual outcome is a match, not best match
Ways to Use the Web for
Instruction and Research
Micro level
Discover eloquent examples
Verify current / possible usage, with rough
indication of prevalence
Acquire vocabulary not (yet) in dictionaries
Timeliness is essential -- “off-the-shelf
corpora” often cannot help here!
Enable students to develop discovery skills
(Salzman/Mills “Grammar Safari”)
Ways to Use the Web for
Instruction and Research (2)
Macro level
Find authentic texts accessible to students
Locate relevant online resources for
research projects
Student reports
Scholarly research
Impediments to Finding
Relevant Resources Online
Reliance on commercial search engines (SEs)
essential due to Web’s size
SEs’ priorities match ours only by coincidence
Link rot
Pages move or disappear
Page content changes
Challenges to Responsible Research
Online there is too much ephemeral
content of unknown reliability
Preponderance of journalistic, commercial
and personal texts of unknown authorship
and authority
Details of sources and research
methodology haphazard
Even student papers (gasp) and machine
translated texts (groan choke)
Challenges to Responsible Research
(2)
Representativity of Web as Corpus
Much ill-formed or fragmentary language
Domain only a rough clue to provenance
Numbers vs. Statistics
Search engines report number of pages matching
a query, not actual citations
One page may contain alternate usages
Narrower filters may eliminate some pages
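The gap between page counts and citation counts can be shown with a small sketch. It assumes the page texts have already been retrieved; the function name and the sample pages are made up for illustration and are not part of KWiCFinder or any search engine.

```python
import re

def page_and_citation_counts(pages, variants):
    """For each variant, count matching pages and total occurrences (citations)."""
    counts = {}
    for variant in variants:
        # Whole-word, case-insensitive match of the literal variant.
        rx = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        per_page = [len(rx.findall(text)) for text in pages]
        counts[variant] = {
            "pages": sum(1 for n in per_page if n > 0),  # what a search engine reports
            "citations": sum(per_page),                  # what a concordance counts
        }
    return counts

# Invented sample pages; the second contains both alternate usages.
SAMPLE_PAGES = [
    "We shop online. Online banking is common.",
    "Buy on-line or go online today.",
    "No match here.",
]

if __name__ == "__main__":
    print(page_and_citation_counts(SAMPLE_PAGES, ["online", "on-line"]))
```

Here "online" is found on two pages but cited three times, and one page counts toward both variants.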
Webidence as Evidence
Our profession needs to develop
“Standards of Webidence” to guide
selection and documentation of online
language for serious research purposes.
The Web is not a corpus in the
classical sense…
…but it does offer an
inexhaustible body of linguistic
and cultural information for
research and use.
Why KWiCFinder?
Automate process of search and
retrieval
Expedite evaluation of webpages
Provide specific enhancements for
foreign language users and linguists
Encourage students and colleagues to
take full advantage of online resources
Why AltaVista?
All words are indexed, including "stopwords"
Distinguishes case and "special characters"
Supports Boolean operators, bracketing, and
wildcards
True world-wide coverage, with search by
language
No limits to length or complexity of the query
Literal text search, without "second-guessing"
KWiCFinder Enhances
AltaVista with…
Intuitive input for foreign characters,
bracketing, operators, dates
Inclusion / exclusion criteria that focus the
search without appearing in the KWiC report
Automatic search and retrieval in the
background returning KWiC abstracts
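A KWiC (keyword in context) abstract can be sketched in a few lines. This is a minimal illustration on plain text that has already been downloaded, not KWiCFinder's actual code; the function name, the context width and the bracket markup are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one concordance line per occurrence of keyword in text."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    rx = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in rx.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

if __name__ == "__main__":
    sample = "The Web is not a corpus in the classical sense, but the web is large."
    for line in kwic(sample, "web", width=20):
        print(line)
```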
KWiCFinder Enhances
AltaVista with… (2)
Restricted wildcards: ? (exactly 1 char) and
% (0-1 char) vs. AltaVista * (0-5 chars)
“Sic” option so “plain” or lower-case
char does not match “special” or uppercase variants:
By SE default, a matches any of
aáâäàãæåAÁÂÄÀÃÆÅ
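Both ideas on this slide can be approximated in a short sketch. The wildcard ranges come from the slide; treating a wildcard as a "word character" and all function names are assumptions. The folding relies on Unicode decomposition, so æ / Æ (which have no decomposition) are not matched by plain a here, unlike in the list above.

```python
import re
import unicodedata

# Ranges as stated on the slide; the \w character class is this sketch's assumption.
WILDCARDS = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}

def wildcard_to_regex(pattern):
    """Translate ?, % and * into a compiled regular expression."""
    body = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in pattern)
    return re.compile(body)

def fold(ch):
    """Strip diacritics and case from one character (e.g. 'Á' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()

def matches(query, word, sic=False):
    """Default: a plain lower-case query char matches its variants. Sic: exact only."""
    query = unicodedata.normalize("NFC", query)
    word = unicodedata.normalize("NFC", word)
    if len(query) != len(word):
        return False
    for q, w in zip(query, word):
        if sic or q != fold(q):
            # Sic mode, or a "special" / upper-case query char: exact match required.
            if q != w:
                return False
        elif fold(w) != q:
            return False
    return True

if __name__ == "__main__":
    print(bool(wildcard_to_regex("colo%r").fullmatch("colour")))
    print(matches("resume", "Résumé"), matches("resume", "Résumé", sic=True))
```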
KWiCFinder Enhances
AltaVista with… (3)
“Tamecards” -- User inputs pattern, KF
generates variants:
on-line matches on-line, on line, online
s[iau]ng matches sing, sang, sung
{me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an}
matches only reflexive forms: me despierto,
te despiertas, se despierta, nos
despertamos, os despertáis, se despiertan
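The expansion of tamecards can be sketched as follows. The rules are inferred only from the three examples on this slide and may differ from what KWiCFinder really does: a hyphen yields hyphen, space or nothing; [..] lists alternatives (single characters, or comma-separated items that may be empty); all {..} lists are stepped through in parallel, so the first pronoun pairs only with the first ending. The function name is hypothetical.

```python
import itertools
import re

# Tokens: {a,b}  [abc] or [a,]  a hyphen  or a run of literal characters.
TOKEN = re.compile(r"\{([^{}]*)\}|\[([^\[\]]*)\]|(-)|([^{}\[\]-]+)")

def expand_tamecard(pattern):
    """Expand a tamecard-like pattern into a list of literal search strings."""
    parts = []  # (kind, alternatives); kind is "paired" for {..}, else "free"
    for m in TOKEN.finditer(pattern):
        brace, bracket, hyphen, literal = m.groups()
        if brace is not None:
            parts.append(("paired", brace.split(",")))
        elif bracket is not None:
            alts = bracket.split(",") if "," in bracket else list(bracket)
            parts.append(("free", alts))
        elif hyphen is not None:
            parts.append(("free", ["-", " ", ""]))
        else:
            parts.append(("free", [literal]))
    lengths = {len(alts) for kind, alts in parts if kind == "paired"}
    if len(lengths) > 1:
        raise ValueError("all {...} lists must have the same number of items")
    rows = lengths.pop() if lengths else 1
    variants = []
    for row in range(rows):
        # Paired lists contribute only their row-th item; free parts vary fully.
        choices = [[alts[row]] if kind == "paired" else alts for kind, alts in parts]
        for combo in itertools.product(*choices):
            variant = "".join(combo)
            if variant not in variants:
                variants.append(variant)
    return variants

if __name__ == "__main__":
    print(expand_tamecard("on-line"))
    print(expand_tamecard("s[iau]ng"))
```

Under these assumed rules the Spanish pattern also generates strings such as "me desperto"; such variants simply find few or no pages, while mismatched pairs like "me despiertas" are never generated.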
How Does XML Enhance
KWiCFinder?
Search results become a dynamic database
for end user to manipulate:
categorize, annotate, delete, merge / split
searches, citations and documents
Free tools permit developer or end-user to
restyle and add interactivity to reports
Layouts
Languages
Data format
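What "results as a database" could look like is sketched below with Python's standard XML library. The element and attribute names (report, document, citation, note) and the example URLs are invented for illustration; the real KWiCFinder report format is not shown in these slides.

```python
import copy
import xml.etree.ElementTree as ET

def build_report(query, hits):
    """hits: list of (url, [citation text, ...]) pairs. Hypothetical schema."""
    report = ET.Element("report", {"query": query})
    for url, citations in hits:
        doc = ET.SubElement(report, "document", {"url": url})
        for text in citations:
            ET.SubElement(doc, "citation").text = text
    return report

def annotate(report, url, note):
    """Attach a user note to the document with the given URL."""
    for doc in report.findall("document"):
        if doc.get("url") == url:
            doc.set("note", note)

def delete_document(report, url):
    """Remove a document (and its citations) from the report."""
    for doc in list(report.findall("document")):
        if doc.get("url") == url:
            report.remove(doc)

def merge(first, second):
    """Combine the documents of two searches into one new report."""
    merged = ET.Element("report", {"query": first.get("query") + " | " + second.get("query")})
    for source in (first, second):
        merged.extend([copy.deepcopy(doc) for doc in source.findall("document")])
    return merged

if __name__ == "__main__":
    a = build_report("online", [("http://example.org/a", ["go online today"])])
    b = build_report("on-line", [("http://example.org/b", ["buy on-line"])])
    annotate(a, "http://example.org/a", "good example")
    print(ET.tostring(merge(a, b), encoding="unicode"))
```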
Why WebKWiC?
Original hope: cross-platform, cross-browser solution
Minimal entry threshold: small
download of HTML pages + JavaScript
Support for non-Western European
languages
Why Google?
Link popularity ranking puts relevant sites at
or near top of list
Straightforward approach to Advanced Search
(“implicit Booleans”) easy to learn, thus most
likely to be used by students independently
Largest number of pages analyzed
Matching pages always* available in cache
with KWiC markup
How Does WebKWiC
Complement Google?
Focuses and enhances interface for
language learners
Provides tools to navigate among
citations and documents
Simplifies management of multiple
windows
Future of Web Concordancing
Agents will create specialized corpora
on demand, by “search and crawl” or by
monitoring specific sites
Multiplicity of encoding formats (various
HTMLs, XML…) and languages will place
increasing demands on developers of
KWiCFinder and analogues
Pleas(e)
Visit http://miniappolis.com/
Download and try KWiCFinder and
WebKWiC
View bibliography as well as this and
related presentations
Use these tools with your students
Send feedback and suggestions to
[email protected]