Trends in Web Search and its relevance to Digital Libraries

Min-Yen Kan
Web IR NLP Group (WING)
National University of Singapore
Min-Yen Kan, WING@NUS
Tips on Web Searching
• Visualize results, then come up with multiple queries
• Use multiple search engines
• Advanced Search
– inurl:, site:
– “Phrasal search”
But that’s just general search…
• Federated resources / Niche search engines
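The advanced-search tips above can be illustrated with a small query builder. This is only a sketch: the function name build_query and the example site and URL fragment are hypothetical, while the site: and inurl: operators and quoted phrases are the ones named on the slide.

```python
def build_query(terms, phrase=None, site=None, inurl=None):
    """Compose a search string from plain terms plus optional operators."""
    parts = list(terms)
    if phrase:
        # Quoting forces an exact phrasal match.
        parts.append(f'"{phrase}"')
    if site:
        # Restrict results to one site or domain.
        parts.append(f"site:{site}")
    if inurl:
        # Require the given text to appear in the result URL.
        parts.append(f"inurl:{inurl}")
    return " ".join(parts)


# Hypothetical example values, chosen only for illustration.
query = build_query(
    ["citation", "analysis"],
    phrase="digital libraries",
    site="nus.edu.sg",
    inurl="publications",
)
print(query)
```

Producing several such variants for the same information need is one way to follow the "multiple queries" tip.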
26 Sep 2008
World Scientific Talk
Site- and Task-specific resources
• Site Prestige
Know what others think and do
– Google PageRank (Link structure), Alexa (Traffic)
– Google Trends / Insight (Queries)
• Social Searching (Web 2.0)
The voice of the reader / critic
– (Bookmarks / Tags) Del.icio.us, Citeulike.org, Bibsonomy.org
– (News) Digg / Slashdot
– (Blogs) Google Blog, Technorati
• People Search:
Finding public information on a person
– Spock (web), Zabasearch (US only)
– LinkedIn, Facebook
– Must validate your sources
http://labs.digg.com/arc/
Expert Search
Find people who will advocate on your behalf
• What do they want? → Impact
• Scholar:
– Active? → Check their recent articles
– Names common? → Define area of interest
– Compare against peers
– Download vs. citation counts
• Patent search:
– Referenced by: (citation count; different than scholar)
• Identifying webfaced advocates:
– Blog search, PageRank
http://flickr.com/photos/phauly/
How do machines do it?
• Expert search task as benchmark test
• Download web pages to analyze
• Needed to deal with spam pages
• Used PageRank to assess prestige
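As a rough illustration of how link structure can be turned into a prestige score, here is a minimal PageRank sketch using power iteration. The toy graph, the damping factor of 0.85, and the iteration count are assumptions for illustration; the talk does not describe the actual system used in the expert search task.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a PageRank score per page for a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: two pages point at "hub", which links back to "a".
toy_links = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
scores = pagerank(toy_links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Pages that attract links from other well-linked pages end up with higher scores, which is the sense in which link structure reflects what others think.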
Problem or opportunity?
• Revenue from print continually declining
• Students and researchers rely on internet
• Researchers want archiving rights – freedom of academic information
The game has fundamentally changed
Characteristics:
• Not zero-sum content
• Distribution is now largely the role of search engines
→ Necessitates new role of publisher and new revenue model
– Will classic models work? Advertising, Subscription, Transactional & Bundling
– Variants? Versioning (Varian), Moving window (JSTOR)
http://flickr.com/photos/danielbroche/
Forecasting
–
Content is becoming free
– MIT / Stanford opening up textbooks
– Open access archiving
→ Long term: content will not be the primary revenue source
eBook revenue hasn’t lived up to its promise yet…
– Device gap: iPhone and nextGen devices
→ Revenue may be further down the pipe
+
Academic publishers
– Connect to libraries and federations at institution level
– Individual customers are secondary
Trusted source
– Expertise in copyediting, typesetting, project management, distribution, social networking
– Many individual web publishers rediscovering same problems
→ Consultancy model
→ Win-win partnerships with individual authors
Web Trends
• Social Content
• Wisdom of masses: Crowdsourcing
• Rich Media
• Open Source / Access
Paradigmatic change
– Classifieds → Craigslist
– POTS → Skype
– CD store → iTunes
– Publishers → ??
http://www.informationarchitects.jp/slash/iA_WebTrends_2007_2_1024_768.gif
Where is research going?
Server centric
• Search API usage
• Browser as computer
• Web page structure, mining text data
User centric
• Modeling web users at tasks: Exploring / Fact-finding
• Personalization, recommending
• Social networks
• Understanding opinion
• Query and log analysis
http://flickr.com/photos/alisdair/
Webfaced pop quiz – which is which?
American Statistical Society
World Scientific
Springer
courtesy: http://pagerank.si/
Forecast: Know your strengths
Get advocates
• Make it easy for individuals to press their institution to buy your materials
• Know who is accessing (not necessarily buying) your content
Content revenue will continue to decline
• Find an economic model that works for you
• Work as partners in content creation
Be savvy on trends
• Be visible: do “white hat” Search Engine Optimization (SEO)
• Make your abstracts indexable by others
Trends in Digital Libraries
>> WING @ NUS
• Expanding types of information in search
• Automated tools for DLs
• Usability in E-books and online media
• User modeling
• Personalization, annotation and relation to other user tasks
http://flickr.com/photos/pathfinderlinden
Scholarly Digital Libraries
• ForeCite: our scholarly DL
• Data Cleaning
• Slide and Document Alignment
• Searching in the OPAC
• Math Information Retrieval
ForeCite: Beyond the document as an item
(Diagram: server and client components)
A user-centric DL framework
• Put author / reader functionality together
• Tagging, correction, annotation and viewing
• Automatic tools: keyphrases and sentence classification
• For use on and offline, organizes local PDF files for you
• Only need your web browser
Data Cleaning
• Addresses
– Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802
– LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343
• Products
– Honda Fit vs. Honda Jazz
– Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
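To make the address example concrete, here is a minimal sketch of one common cleaning step: normalize abbreviations, then compare records by token overlap (Jaccard similarity). The abbreviation table and the choice of Jaccard are assumptions for illustration and are not taken from the talk.

```python
import re

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {
    "e": "east",
    "ave": "avenue",
    "apt": "apartment",
    "univ": "university",
}


def normalize(record):
    """Lowercase a record, expand known abbreviations, and return its token set."""
    text = record.lower().replace("#", " apartment ")
    tokens = re.findall(r"[a-z0-9]+", text)
    return {ABBREVIATIONS.get(token, token) for token in tokens}


def jaccard(record_a, record_b):
    """Share of normalized tokens that two records have in common."""
    tokens_a, tokens_b = normalize(record_a), normalize(record_b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


first = "Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802"
second = "LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343"
similarity = jaccard(first, second)
print(f"similarity: {similarity:.2f}")
```

Even after normalization the two records overlap only partially (name order, city name, ZIP+4), which is why string matching alone is not enough and extra context helps.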
Search results (percentage = pages that also mention “aho”, as a share of the name's pages):
– “Jeffrey D. Ullman”: 384,000 pages
– “Jeffrey D. Ullman” + “aho”: 174,000 pages (45%)
– “J. Ullman”: 124,000 pages
– “J. Ullman” + “aho”: 41,000 pages (33%)
– “Shimon Ullman”: 27,300 pages
– “Shimon Ullman” + “aho”: 66 pages (0%)
• Idea: use web as additional context for disambiguation and clustering
• Placed 3rd in Web People Search Task (WEPS 2007)
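The percentages in the search results can be reproduced from the page counts: each is the number of pages matching the name together with “aho”, divided by the number of pages matching the name alone. The sketch below recomputes them; the 0.2 threshold for grouping name variants is a hypothetical value for illustration, not the method used in the WEPS system.

```python
# (pages for the name, pages for the name + "aho"), as listed on the slide.
HIT_COUNTS = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}


def context_share(name_hits, joint_hits):
    """Fraction of a name's pages that also mention the context term."""
    return joint_hits / name_hits if name_hits else 0.0


def likely_same_entity(counts, threshold=0.2):
    """Names whose share of context pages reaches the (assumed) threshold."""
    return [
        name
        for name, (name_hits, joint_hits) in counts.items()
        if context_share(name_hits, joint_hits) >= threshold
    ]


shares = {
    name: context_share(name_hits, joint_hits)
    for name, (name_hits, joint_hits) in HIT_COUNTS.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
print(likely_same_entity(HIT_COUNTS))
```

Name variants that share a strong co-author context ("J. Ullman" and "Jeffrey D. Ullman") cluster together, while "Shimon Ullman" stays apart.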
Slides and their relationship to documents
Document in focus
Slides in focus
Searching in Libraries
http://linc.comp.nus.edu.sg
Symbolic Information Search
How do users want to search math materials?
Not quite right…
Our answer: Text-to-Expression Linking
– Resolve text keywords to expressions
– e.g., “Pythagorean Theorem” → “a² + b² = c²” or “x² + y² = z²”
Reduce the need for expression input
Solves the notational variation problem
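A minimal sketch of the text-to-expression idea: a keyword lookup that returns every known notational variant of an expression, so the user never has to type the formula. The index contents and the function name link_text_to_expressions are hypothetical; the talk does not describe how the actual links are built.

```python
# Hypothetical index from normalized keywords to expression variants.
EXPRESSION_INDEX = {
    "pythagorean theorem": ["a^2 + b^2 = c^2", "x^2 + y^2 = z^2"],
}


def link_text_to_expressions(query, index=EXPRESSION_INDEX):
    """Resolve a text query to all expression variants recorded for it."""
    # Normalize case and whitespace so small typing differences still match.
    key = " ".join(query.lower().split())
    return list(index.get(key, []))


variants = link_text_to_expressions("Pythagorean  Theorem")
print(variants)
```

Because one keyword maps to several variants, documents using either notation can be retrieved from the same text query.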
Conclusions
• Consider us your research WING!
• Trade data and problems for solutions and interns
Meanwhile:
• Use better search strategies
• Practice white hat SEO
• Identify webfaced advocates
References
• Kahin and Varian (2000) Internet Publishing and Beyond
• Towle et al. (2007) Electronic Books in the 2003-2005 Period, Pub Res Q 23:95-104
Photo Credits
• Flickr Creative Commons Search
Thanks to all of you for listening
& my fellow WING group members
Abstract
I will present trends in current academic research on web search and digital libraries, and discuss their relevance to publishers and their economic model. With respect to the web, I will cover how search engines are starting to specialize and use click through and ad data to improve relevance ranking. With respect to digital library research, I discuss my group's research at NUS on advancing the state-of-the-art in scholarly digital libraries. I cover advances on how we deal with data cleaning issues, and slide and equation retrieval and alignment.