Web Search Engines - School of Information

WIRED Week 6
•Syllabus Review
•Readings Overview
•Search Engine Optimization
•Assignment Overview & Scheduling
•Projects and/or Papers Discussion
Web Search Engines
• Independent of IR model
• Distributed index and servers
- Crawler
- Query server
- Indexer
• Crawlers and Spiders
- Centralized control, Coordinated, Refresh, Filtering
- Not the main problem
• Queries
- Interface, processing, results
• Indexing
- Data normalization, load balancing, data sharing
Harvesting
• Not just Web data
- Caching, Duplication, Normalization
• Armies of crawlers
• Filtering collected data
• Gatherers
- Collect and extract on various schedules
- Work with several Brokers
• Brokers
- Index and interface to queries
- Work with other Brokers and Gatherers
• Topical Agents?
Web Crawling Issues
• Follow chains of URLs to gather more URLs
• Extract index (content) from each page
• Lather-Rinse-Repeat
• Update crawler to-do list
• Associate frequency of crawls
• Breadth or Depth first?
• Endless looping
• Duplicate pages/sites
• Changed page (or not really?)
• Dynamically generated pages
• Intranet pages
• Markup language getting in the way
• NOROBOTS
• What should a crawler get?
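The crawl loop above (follow URLs, update the to-do list, avoid endless looping and duplicates) can be sketched roughly as follows. This is a minimal illustration, not a production crawler: `TOY_WEB`, `crawl`, and `fetch_links` are hypothetical names, and the "Web" is an in-memory dictionary so nothing is fetched over the network.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": ["http://example.org/a", "http://example.org/c"],
    "http://example.org/c": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    to_do = deque([seed])  # the crawler's to-do list
    seen = {seed}          # guards against endless looping and duplicate URLs
    visited = []
    while to_do and len(visited) < max_pages:
        url = to_do.popleft()  # to_do.pop() here would give depth-first order
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                to_do.append(link)
    return visited

order = crawl("http://example.org/", lambda url: TOY_WEB.get(url, []))
```

The `seen` set only catches exact duplicate URLs; duplicate content under different URLs and dynamically generated pages would need extra handling.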
Indexing the Web
• Inverted File Index
- Sorted words with pointers to location(s) & page(s)
- Pointers are the focus (inversion)
• What about pages and sites?
- Massive redundancy on well-organized sites
• Navigation
• Topics
• Content
• “State of the art indexing techniques” = 30%
of text (not page) size. p 383
• How can you tune an index for massively changing
documents?
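As a rough sketch of the inverted file idea (sorted words with pointers to pages and locations), assuming whitespace tokenization and a hypothetical `pages` dictionary of page IDs to text:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to a list of (page_id, word_position) pointers."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((page_id, position))
    # Return the words in sorted order, as in an inverted file.
    return dict(sorted(index.items()))

# Hypothetical two-page collection.
pages = {"p1": "web search engines", "p2": "search the web"}
index = build_inverted_index(pages)
```

A query for "web" then reads `index["web"]` directly instead of scanning the text of every page.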
Ranking
• Boolean and Vector models mostly used
- Why?
- Works from the index, not the text
• Which ranking methods are best?
- Datasets
- Syntaxes
- Users & Testing
Ranking Methods
• TF-IDF
- Simple, smaller data sets
• Boolean Spread
- Degrees of match
- Within a document
- Set of documents
- Links between documents (meta docs?)
• Vector Spread
- Standard cosine between query and index (to
document)
- Links with answer or pointing to answer
• Most Cited
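A minimal sketch of TF-IDF weighting with the standard cosine between query and document vectors. The function names and the smoothed IDF variant (`log(N/df) + 1`) are assumptions chosen for illustration; many other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return document positions, best match first (ties keep original order)."""
    vectors, idf = tf_idf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0)
             for t, tf in Counter(query.lower().split()).items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical three-document collection.
docs = ["web search engines rank pages",
        "crawlers follow links",
        "search engines index the web"]
ranking = rank("crawlers links", docs)
```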
Is Web ranking different?
• Links are the difference that makes the
difference
- Internal links on a page
- Internal links on a site
- Relationships between sites
- Link freshness
• Kleinberg’s HITS method (1998)
- Hypertext Induced Topic Search
- Number of pages that point to (processed) query
- Authorities (relevant content by links)
- Hubs (links to varied authorities)
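The hub and authority idea can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of pages pointing to it, and its hub score is the sum of the authority scores of pages it points to. This is a simplified illustration on a hypothetical link graph, not Kleinberg's full procedure (which first builds a query-specific subgraph).

```python
import math

def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that point to this page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]
        # Hub: sum of authority scores of the pages this page points to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
```

In this sketch a repeated link from one page to another is counted each time it appears in the list, which is one possible answer to the multiple-links question.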
Problems with Hubs & Authorities
• Are more links always better?
• What about pages without many outgoing
links?
• How do you count multiple links from within
one page to another?
• Do automatically generated sites/pages have
an advantage?
- CMS systems may have linking “fingerprints”
- Metadata
• How varied are the link weights?
- Simple counts
- Modified by other IR measures
Anatomy of a Large-Scale Web Search Engine
• Initial Google Design
• PageRank
- PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
- “A model of user behavior”
• probability of a random surfer visiting a page is
its PageRank +
• a damping factor (boredom)
- Pages point to a page
- Highly ranked pages point to a page
- Anchor text is mined (the label for the link)
- Proximity included
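The PageRank formula above can be iterated directly. In this sketch `d` is the damping factor, `C(T)` is taken as the number of outgoing links of page `T`, and the three-page graph is hypothetical; it assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

With this form of the formula the scores sum to the number of pages rather than to 1. Page C ends up highest because both A and B point to it.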
Anatomy 2
• Repository of page content
• Document index
- Forward (sorted)
- Inverted (sorter)
• Lexicon of words & pointers
• Hit Lists of word occurrence(s)
• Crawlers
• Ranking
• Feedback of selection (~)
Popularity?
• Do you always want the most popular
information source?
- Talk Radio
- New York Times Bestseller List
- “Lincoln’s Doctor’s Dog”
- “The C.S.I. Diet and Cookbook”
• Trend or Fad?
• Blogs, Editorials and Propaganda vs.
“Facts”?
• Result Diversity
• Death of the Mid-List
Metasearch Issues
• One place for everything?
• First or Last place to look?
• Better or different interface?
• Combined, sorted results would be best
- How to sort?
- Sorting for different types of queries
• Syntax Errors
• State Information (monitoring)
• Copyright issues (robots)
• User, content and interface
mismatches/challenges
Web Searching Metaphors
• How do people visualize the Web?
• Is Browsing better?
• Do we need new metaphors for using the
Web?
- Searching
- Browsing
- What else?
Search Engine Optimization
• Found by spiders and submissions
- More links to and from site
- Registration on major directories
- Links to and from major directories
• Real Contact information Helps prove validity
- META tag
- Header and footer of home page
- About Us or Contact Us pages
- Location/Map page
Good Design is SEO
• Basic interface
• Well-structured links
- Comprehensive Site Navigation
- Updated and accurate links
• Easy to find (via the Web or on the site itself)
• Clear labels
- TITLEs
- Headings
- Term consistency
- Link consistency
• Small sizes to download quickly
Web Search Tests
• Perform searches with targeted keywords
• Compare and contrast top results with your potential
site
- Similar terms
- Links (external and internal)
- Popularity (sites that link to the site)
• Use Data to
- Build a keyword list
- Build an introductory text
• Blurbs
• Description (2 sentences max)
• Any page found via a Web search engine should offer a
search of the site itself
• Regularly monitor Search with your terms
Internal Search
• Robots.txt
• Log and analyze search results
- Measure success and failure
- Tune for click-through productivity
- Keep list of terms
- Match terms to pages
• Add terms
• Script terms to certain pages
- Provide list (links) of most recent search terms
- Provide list (links) of most popular search terms
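One way to log and analyze internal searches might look like the sketch below. The log format (a chronological list of `(query, result_count)` pairs) and the function name are assumptions for illustration; a real site would read these from its own search logs.

```python
from collections import Counter

def summarize_search_log(log, top_n=3):
    """Return (most popular terms, most recent terms, terms with no results)."""
    normalized = [(query.strip().lower(), count) for query, count in log]
    # Most popular: highest frequency first.
    popular = [term for term, _ in
               Counter(query for query, _ in normalized).most_common(top_n)]
    # Most recent: walk the log backwards, skipping repeats.
    recent = []
    for query, _ in reversed(normalized):
        if query not in recent:
            recent.append(query)
        if len(recent) == top_n:
            break
    # Failures: searches that returned nothing (candidates for added terms).
    failures = sorted({query for query, count in normalized if count == 0})
    return popular, recent, failures

# Hypothetical log entries.
log = [("Syllabus", 4), ("parking", 0), ("syllabus ", 5),
       ("readings", 2), ("syllabus", 3), ("parking", 0)]
popular, recent, failures = summarize_search_log(log, top_n=2)
```

The `failures` list is a starting point for adding terms or scripting terms to specific pages.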
Page Design
• Use CSS
- <style type="text/css">
- Keep content in pages, not CSS templates
• Put JavaScript, etc. in external files
- <script language="JavaScript" src="scripts/myscript.js"
type="text/javascript">
</script>
- <noscript> tag too for alternate content
• Continually verify external links
• ALT tags & Accessibility Compliance
• Index link on Splash page (if needed)
• Exact consistency on internal links (ending “/”s)
• <noframes>
• Redirects <META HTTP-EQUIV="refresh" content="0;
URL=http://www.newsite.com/index.html">
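Checks like the ALT item above can be partly automated. The sketch below uses Python's standard `html.parser` module to list images that have no `alt` attribute at all; the class and function names are hypothetical, and an empty `alt=""` is deliberately not flagged since it is commonly used for decorative images.

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    """Collects the src of every <img> tag that lacks an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            attr_map = dict(attrs)
            if "alt" not in attr_map:
                self.missing_alt.append(attr_map.get("src", "(no src)"))

def images_missing_alt(html):
    checker = AltChecker()
    checker.feed(html)
    return checker.missing_alt

# Hypothetical page fragment.
sample = ('<img src="logo.png" alt="Logo">'
          '<img src="map.png">'
          '<IMG SRC="spacer.gif" ALT="">')
missing = images_missing_alt(sample)
```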
Search and MIME types
• Flash now supports internal text
• PDF files
- Add comments and authorship info
- Modify existing PDFs
• Check Document Properties > Fonts: listed fonts show that the
PDF can be indexed (not a group of graphics files)
- Provide text abstract or summary of PDF
• PPT, use text if possible
• Java interfaces prove difficult
• Dynamic pages should have key(word) static
elements
• FORMs not always completely indexed
Track your Tracking
• Keep list of sites submitted to
- When, Who, Email address, exact URL submitted
- Suggested categories, Current site description
- Terms and Conditions
• Keep list of “goal” keywords
• Keep list of sites where you check keywords
- Keywords
- Dates
- Successes/Failures
Assignment Overview & Scheduling
• Leading WIRED Topic Discussions
- # in class = # of weeks left?
• Web Information Retrieval System Evaluation
& Presentation
- 5 page written evaluation of a Web IR System
- technology overview (how it works)
- a brief history of the development of this type of
system (why it works better)
- intended uses for the system (who, when, why)
- (your) examples or case studies of the system in
use and its overall effectiveness
Projects and/or Papers Overview
• How can (Web) IR be better?
- Better IR models
- Better User Interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications