How To Read Research Papers


WIRED Week 4
•Syllabus Review
•Readings Overview
-Web IR Chapter
-Brin & Page - Google
-Kobayashi & Takeda – Overview
•Search Engine Optimization
•Assignment Overview & Scheduling
•Projects and/or Papers Discussion
-Idea Pitch
-Group Formation
Web makes IR an everyday activity
• Search Engines
• Search Interfaces
• The openness of the Web changes everything
- Access
- Technological progress
- Expectation
- Credibility
- Networks and Networking
How Much Information Out There?
• UC Berkeley project
• Center for the Digital Future
• Pew Internet & American Life Project
• What kind of information is it?
• What formats?
- Information = Web pages?
- Now
- Future
• Who creates it?
• Why do they publish it?
• Content and Context
Investigation of Web Documents
• Are Web documents different?
- Structure - HTML & other markup
• Common tags
- Content - “information” & commerce
• Readability
• Usability
- Context - “sociological insights”, & spam
• Links
- Interest - topics, titles, keywords, file types
- Interface - browsers (& crawlers)
• Older study, what’s new?
- More multimedia
- XHTML & XML
- AJAX, REST, SOAP, Web 2.0?
Statistical Profiles of Highly-rated Web sites
• A Quality Checker
- Good design makes better Web pages
• Look at popular pages & see what makes them popular
• We know good pages when we see (use) them
- Different types of Web page structures
• Elements
- Text, links & graphics (& their formatting)
- Accessibility, Size, errors, nav links (scent)
- Architecture of site
• What makes these pages good for searching?
Content - Organizing & Accessing
• Distributed Data(base)
• Dynamic Data
- Mobile
- Ephemeral
• Huge Volume
• Unstructured and Redundant
• Quality
• Heterogeneous
- Languages
- Code pages
Measuring the Web
• How would you measure?
- Size (crawling)
- Surveys
- Hits & Metering
- Bandwidth use
• What do numbers mean?
- Number of Hosts?
- Number of Sites?
- Number of Pages?
• Accurate +/- a lot
The Web is a Bowtie?
• Structure
- One can pass from any node of IN through SCC to any node of OUT
- Hanging off IN & OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC
- A TENDRIL hanging off IN can hook into a TENDRIL leading into OUT, forming a TUBE: a passage from a portion of IN to a portion of OUT without touching SCC
• Broder et al., 2000
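The bowtie regions can be sketched with two reachability passes from a node assumed to sit in the core. This is a minimal illustration on a hypothetical toy graph, not the procedure used in the Broder study; the names `reachable`, `bowtie`, and `TOY_LINKS` are made up for the example.

```python
from collections import deque

def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bowtie(links, core_node):
    """Split a link graph into SCC, IN, OUT and everything else.

    Assumes core_node is known to lie in the strongly connected core.
    """
    # Build the reversed graph so we can ask "who can reach the core?"
    reverse = {}
    for src, targets in links.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    forward = reachable(core_node, links)     # core plus OUT
    backward = reachable(core_node, reverse)  # core plus IN
    scc = forward & backward
    nodes = set(links) | set(reverse)
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        # Tendrils, tubes and disconnected pages all land here.
        "OTHER": nodes - forward - backward,
    }

# Hypothetical toy graph: t1 forms a TUBE from IN to OUT that skips SCC.
TOY_LINKS = {
    "in1": ["s1", "t1"],
    "s1": ["s2"],
    "s2": ["s1", "out1"],
    "t1": ["out1"],
}
regions = bowtie(TOY_LINKS, "s1")
```

In the toy graph, `t1` is reachable from IN and reaches OUT without touching the core, so it ends up in `OTHER` rather than in any of the three main regions.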
Web Search Engines
• Independent of IR model
• Distributed index and servers
- Crawler
- Query server
- Indexer
• Crawlers and Spiders
- Centralized control, Coordinated, Refresh, Filtering
- Not the main problem
• Queries
- Interface, processing, results
• Indexing
- Data normalization, load balancing, data sharing
Harvesting
• Not just Web data
- Caching, Duplication, Normalization
• Armies of crawlers
• Filtering collected data
• Gatherers
- Collects and extracts on various schedules
- Works with several brokers
• Brokers
- Indexes and interfaces to queries
- Works with other Brokers and Gatherers
• Topical Agents?
Web Crawling Issues
• Follow chains of URLs to gather more URLs
• Extract index (content) from each page
• Lather-Rinse-Repeat
• Update crawler to-do list
• Associate frequency of crawls
• Breadth or Depth first?
• Endless looping
• Duplicate pages/sites
• Changed page (or not really?)
• Dynamically generated pages
• Intranet pages
• Markup language getting in the way
• NOROBOTS
• What should a crawler get?
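Several of the crawling issues above (the to-do list, breadth vs. depth first, endless looping, duplicate pages) can be seen in a small sketch. It crawls a hypothetical in-memory "web" instead of the network, so `TOY_WEB`, `crawl`, and the `fetch` callback are illustrative assumptions; a real crawler would also need politeness delays, robots exclusion checks, and revisit scheduling.

```python
from collections import deque
import hashlib

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("home page", ["b", "c"]),
    "b": ("about us", ["a", "d"]),  # links back to "a": a potential loop
    "c": ("about us", ["d"]),       # same content as "b": a duplicate
    "d": ("contact", []),
}

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns URLs in the order indexed."""
    frontier = deque([seed])  # the crawler's to-do list
    seen_urls = {seed}        # guards against endless looping
    seen_hashes = set()       # guards against duplicate content
    indexed = []
    while frontier and len(indexed) < max_pages:
        # popleft() gives breadth first; pop() would give depth first.
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            indexed.append(url)
        # Links are followed even from duplicate pages.
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return indexed

order = crawl("a", TOY_WEB.get)
```

Here `order` comes out as `["a", "b", "d"]`: page `c` is fetched but not indexed because its content hash matches `b`.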
Indexing the Web
• Inverted File Index
- Sorted words with pointers to location(s) & page(s)
- Pointers are the focus (inversion)
• What about pages and sites?
- Massive redundancy on well-organized sites
• Navigation
• Topics
• Content
• “State of the art indexing techniques” = 30% of text (not page) size. p 383
• How can you tune an index for massively changing documents?
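The inverted file idea (sorted words with pointers to pages and locations) can be sketched as a nested dictionary. This is a minimal illustration with made-up documents; `build_inverted_index` and `lookup_and` are hypothetical helper names, and tokenizing by whitespace is a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]} for the given documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def lookup_and(index, *words):
    """Boolean AND: doc ids containing every one of the words."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()

# Hypothetical two-page collection.
docs = {
    1: "web search engines index the web",
    2: "crawlers feed the index",
}
index = build_inverted_index(docs)
vocabulary = sorted(index)            # the sorted word list
both = lookup_and(index, "the", "index")
```

A query is answered from the pointers alone, without rescanning page text, which is why the size of the index relative to the text matters.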
Ranking
• Boolean and Vector models mostly used
- Why?
- Works from the index, not the text
• Which ranking methods are best?
- Datasets
- Syntaxes
- Users & Testing
Ranking Methods
• TF-IDF
- Simple, smaller data sets
• Boolean Spread
- Degrees of match
- Within a document
- Set of documents
- Links between documents (meta docs?)
• Vector Spread
- Standard cosine between query and index (to document)
- Links with answer or pointing to answer
• Most Cited
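As a rough sketch of TF-IDF weighting combined with the standard cosine measure, the following ranks a tiny made-up collection against a query. The particular weighting (raw term frequency times log(N/df)) is one common variant chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the idf table."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = [
        {w: tf * idf[w] for w, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Document positions ordered from best to worst match."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical collection and query.
docs = ["web search engines", "web crawlers and spiders", "cooking recipes"]
ranking = rank("web crawlers", docs)
```

The second document ranks first because it shares the rarer term "crawlers" with the query; "web" appears in two of three documents and so carries less weight.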
Is Web ranking different?
• Links are the difference that makes the difference
- Internal links on a page
- Internal links on a site
- Relationships between sites
- Link freshness
• Kleinberg’s HITS method (1998)
- Hypertext Induced Topic Search
- Number of pages that point to (processed) query
- Authorities (relevant content by links)
- Hubs (links to varied authorities)
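The mutual reinforcement between hubs and authorities can be sketched as a short iteration. This is a simplified illustration on a hypothetical link graph: it skips the query-specific step of building a base set of pages, and `hits` and `TOY_LINKS` are names invented for the example.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links maps each page to the list of pages it points to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of the authority scores of pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical graph: three hub pages pointing at two content pages.
TOY_LINKS = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
}
hub, auth = hits(TOY_LINKS)
```

In this graph `a1` ends up with the highest authority score (three hubs point to it), and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.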
Problems with Hubs & Authorities
• Are more links always better?
• What about pages without many outgoing links?
• How do you count multiple links from within one page to another?
• Do automatically generated sites/pages have an advantage?
- CMS systems may have linking “fingerprints”
- Metadata
• How varied are the link weights?
- Simple counts
- Modified by other IR measures
Anatomy of a Large-Scale Web Search Engine
• Initial Google Design
• PageRank
- PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
- “A model of user behavior”
• Probability of a random surfer visiting a page is its PageRank
• Plus a damping factor (boredom)
- Pages point to a page
- Highly ranked pages point to a page
- Anchor text is mined (the label for the link)
- Proximity included
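The PageRank formula on this slide can be iterated directly. The sketch below applies PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) to a hypothetical three-page graph; the function name, the starting value of 1.0, the fixed iteration count, and d = 0.85 are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the pages it links to; C(T) is the number
    of distinct outgoing links on page T.
    """
    out = {src: set(targets) for src, targets in links.items()}
    pages = set(out) | {t for targets in out.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in out.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
TOY_LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(TOY_LINKS)
```

In this toy graph C collects links from both other pages and ranks highest, A inherits all of C's rank, and B gets only half of A's. With the formula as written and every page having outgoing links, the scores sum to the number of pages (3 here) rather than to 1.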
Anatomy 2
• Repository of page content
• Document index
- Forward (sorted)
- Inverted (sorter)
• Lexicon of words & pointers
• Hit Lists of word occurrence(s)
• Crawlers
• Ranking
• Feedback of selection (~)
Popularity?
• Do you always want the most popular
information source?
- Talk Radio
- New York Times Bestseller List
- “Lincoln’s Doctor’s Dog”
- “The C.S.I. Diet and Cookbook”
• Trend or Fad?
• Blogs, Editorials and Propaganda vs. “Facts”?
• Result Diversity
• Death of the Mid-List
Next Generation Web Search
• Search works well now (80%), but what’s next?
• We need to be user-focused, not data-focused
• How do we match search to the task?
- Is it all about speed?
- How could metadata support search tasks?
• Best search is browsing?
- Faceted Search?
- Suggesting = browsing for interfaces
• Cooking
• Related results
• Specialized interfaces
• Natural language queries (question answering)
• “Real world” metadata
• Context, personalization, query specifics
Metasearch Issues
• One place for everything?
• First or Last place to look?
• Better or different interface?
• Combined, sorted results would be best
- How to sort?
- Sorting for different types of queries
• Syntax Errors
• State Information (monitoring)
• Copyright issues (robots)
• User, content and interface mismatches/challenges
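One possible answer to "How to sort?" when combining results is rank-based fusion, which needs only each engine's result order and not its internal scores. The sketch below uses reciprocal rank fusion as an example technique; it is not taken from the slides, the engine result lists are made up, and k = 60 is a conventional constant assumed here.

```python
def fuse(rankings, k=60):
    """Merge ranked result lists with reciprocal rank fusion.

    Each result earns 1 / (k + rank) from every list it appears in;
    results are returned by total score, ties broken alphabetically.
    """
    scores = {}
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical result lists from three search engines.
engine_results = [
    ["x", "y", "z"],
    ["y", "x", "w"],
    ["y", "z"],
]
merged = fuse(engine_results)
```

Result `y` comes first because all three engines return it near the top; `w`, returned by only one engine at rank 3, comes last. Different query types might call for different weights per engine, which this sketch does not attempt.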
Web Searching Metaphors
• How do people visualize the Web?
• Is Browsing better?
• Do we need new metaphors for using the Web?
- Searching
- Browsing
- What else?
Assignments
• Read weekly Primary Readings & participate in class discussions 10%
- 1 page summaries
• Re-design Search Results interface 10%
• Web (log) analytics 20%
• Future of Search (“Google 2010”) (5 page paper) 10%
• Web Information Retrieval System Evaluation & Presentation 20%
• Main Project or Paper 30%
Re-design Search Results interface
• Choose a search engine (not Google) and re-design the query AND result page interfaces
- Snap, Live, Ask, Technorati, Clusty, & many others…
• Discuss the search features and their interfaces
- Highlight the good & the bad (or hard to understand or use)
- Use your own perspective as a novice user or habitual user of the search engine
• Sketch, Photoshop &/or re-build the HTML pages to show your improved interface designs
- Explain why you made the interface (& feature) changes
- Illustrate how people would use the new interface
• Compare to other search engines or search tools & interfaces to give context to your re-design
System Evaluation & Presentation
- 5 page written evaluation of a Web IR System
- technology overview (how it works)
- a brief history of the development of this type of system (why it works better)
- intended uses for the system (who, when, why)
- (your) examples or case studies of the system in use & its overall effectiveness
Future of Search paper
• How can (Web) IR be better?
- Better IR models
- Better User Interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications