Web Searching
Basics
Dr. Dania Bilal
IS 530
Fall 2009
How the Web Came About?
• First, we had the Internet with text-based
files and indexes to find information in
these files
– Static, no graphics or multimedia
– No point and click using a mouse
– No GUI (Graphical User Interface)
– Menu-driven and subject categories for topics
were hierarchical in nature
How the Web Came About?
• Tim Berners-Lee
– Proposed the Web in 1989 and created HTTP
– HTTP: Hypertext Transfer Protocol
– Links various files and documents (text,
sound, images, videos, etc.) available on
various Internet host servers in a seamless
way
• Beginning of the World Wide Web (WWW)
• WWW is part of the Internet
How the Web Came About?
• Graphical Web browsers were developed
for navigating through Web content
• Mosaic
– First widely used graphical Web browser
– Appeared in 1993
– Revolutionized access to information
– Made the Web much easier to use
• Other browsers appeared
Searching the Web
• Search engines (general and subject-driven)
• Directories
• Meta-search engines
• Meta-directories
Search Engines
• Engines are computer programs designed
for searching the Web
• Components
– Crawlers or spiders
– Database
– Search engine software
– Search algorithms
Crawlers or Spiders
• Traverse the Web and visit web pages that
are not blocked
• Read the pages visited
• Follow links from pages to additional
pages
• Return frequently to the pages for updates
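The traversal described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the page names, the `TOY_WEB` dictionary, and the `BLOCKED` set are hypothetical stand-ins for real pages fetched over HTTP, and revisiting pages for updates is left out.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical toy "web": page name -> HTML. A real crawler fetches over HTTP.
TOY_WEB = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="a.html">A</a> <a href="d.html">D</a>',
    "c.html": "no links here",
    "d.html": '<a href="c.html">C</a>',
}
BLOCKED = {"d.html"}  # pages the crawler is not allowed to visit


def crawl(start, web, blocked):
    """Breadth-first traversal: read each page, then follow its links."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        if url in blocked or url not in web:
            continue  # skip blocked or missing pages
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(web[url])  # "read" the page
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("a.html", TOY_WEB, BLOCKED)` visits the pages in the order they are discovered and never reads the blocked page.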
Database Component
• Stores copies of the web pages the
crawlers or spiders visited
• Database is organized based on a preset
scheme
• Fields in each document or webpage are
identified (e.g., URL, page title, header or
section title, metadata described by the author
of a page), and the pages are then indexed
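One common way to index pages by field is an inverted index, which maps each word to the places where it occurs. The sketch below assumes that structure. The slides do not say which scheme any particular engine uses, and the URLs and field names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of (url, field) pairs where it occurs."""
    index = defaultdict(set)
    for url, fields in pages.items():
        for field, text in fields.items():
            for word in tokenize(text):
                index[word].add((url, field))
    return index


# Hypothetical stored copies of two pages, already split into fields.
PAGES = {
    "http://example.org/cats": {
        "title": "Cat care",
        "body": "Feeding cats and kittens",
    },
    "http://example.org/dogs": {
        "title": "Dog care",
        "body": "Feeding dogs and puppies",
    },
}

index = build_index(PAGES)
```

Looking up `index["care"]` returns both pages with the field `"title"`, so the engine knows where the word appeared as well as which pages contain it.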
Search Engine Software
• Program that sorts through the pages stored in
the database
• Takes a user query entered in a search engine
• Matches the words in the query, together with
the search criteria in the query, against the
web pages stored in the database
– Matches each word and accounts for the operators
appearing in the query (+, -, “ ”)
• The + sign is assumed when no operators are used
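A minimal sketch of this matching step follows, assuming a simplified query syntax: `+term` or a bare term is required, `-term` is excluded, and straight double quotes mark a phrase. The function names are illustrative and are not taken from any real engine.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def parse_query(query):
    """Split a query into required and excluded terms or phrases."""
    required, excluded = [], []
    pattern = r'([+-]?)(?:"([^"]*)"|(\S+))'
    for sign, phrase, word in re.findall(pattern, query):
        term = normalize(phrase or word)
        if not term:
            continue
        if sign == "-":
            excluded.append(term)
        else:
            required.append(term)  # "+" is assumed when no operator is given
    return required, excluded


def matches(page_text, query):
    """True if the page has every required term and no excluded term."""
    padded = f" {normalize(page_text)} "
    required, excluded = parse_query(query)

    def contains(term):
        # Padding with spaces forces whole-word (or whole-phrase) matches.
        return f" {term} " in padded

    return all(contains(t) for t in required) and not any(
        contains(t) for t in excluded
    )
```

For example, `matches("The jaguar is a big cat", 'jaguar "big cat" -car')` is true, while the same query rejects a page about jaguar car dealers.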
Search Engine Software
• Matching is performed by algorithms
(computational rules)
• Relevance of what was matched is
calculated using sophisticated algorithms
• Relevance ranking of pages returned to a
user is based on rules used by the
engine company
Search Engine Relevance Ranking
• Some criteria
– Word frequency
– Location of a word in the web page or
document
• Page title, page URL, page first heading, 2nd
heading, first sentence in a heading, etc.
– Number of links to a page by other pages
– No. of clicks on a page when it appears in the
result of a search
– Meta-tags (metadata)
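Some of these criteria can be combined into a single score. The sketch below is a toy illustration: the weights are assumptions chosen for the example, real engines keep their ranking rules private, and click counts and meta-tags are left out for brevity.

```python
import re

# Assumed weights: a word in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}
LINK_WEIGHT = 0.5  # assumed bonus per inbound link


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query_words):
    """Word frequency weighted by location, plus a bonus for inbound links."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(page.get(field, ""))
        total += weight * sum(words.count(q) for q in query_words)
    if total == 0:
        return 0.0  # no query word matched, so links alone do not count
    return total + LINK_WEIGHT * page.get("inbound_links", 0)


def rank(pages, query):
    """Return the names of matching pages, highest score first."""
    query_words = tokenize(query)
    scored = [(score(page, query_words), name) for name, page in pages.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for value, name in scored if value > 0]


# Hypothetical pages for illustration.
PAGES = {
    "p1": {"title": "Web search basics",
           "body": "search engines crawl the web", "inbound_links": 2},
    "p2": {"title": "Cooking",
           "body": "search for recipes", "inbound_links": 10},
    "p3": {"title": "Gardening", "body": "plants", "inbound_links": 50},
}
```

With these assumed weights, `rank(PAGES, "web search")` places `p1` ahead of `p2` and leaves out `p3`, which has many inbound links but no matching words.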
Basic Search Strategy
• Identify the information need
• Extract basic concepts from the information need (broad
ideas)
• Choose possible keywords or terms related to the
concepts
– Think of broader, narrower, or related terms
• Determine the search logic and techniques most suitable
for formulating a search using the keywords or terms
– Boolean? Proximity? Combination of both? Nesting?
• Select an appropriate engine, directory, meta-engine, or meta-directory based on the topic
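One way to turn concepts and keywords into search logic is to OR the alternative terms within each concept, nest each group in parentheses, and AND the groups together. The helper below is a hypothetical sketch of that pattern. The example topic is invented, and operator syntax varies from engine to engine.

```python
def build_query(concepts):
    """Build a nested Boolean query from lists of alternative terms.

    Each inner list holds the terms for one concept. Terms within a
    concept are joined with OR; concepts are joined with AND.
    """
    groups = []
    for terms in concepts:
        # Multi-word terms become quoted phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        group = " OR ".join(quoted)
        # Parentheses (nesting) are only needed when a concept has alternatives.
        groups.append(f"({group})" if len(quoted) > 1 else group)
    return " AND ".join(groups)


query = build_query([
    ["teenagers", "adolescents"],
    ["web searching", "information seeking"],
    ["behavior"],
])
```

Here `query` becomes `(teenagers OR adolescents) AND ("web searching" OR "information seeking") AND behavior`.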
Basic Search Strategy
• Explore the features of the engine or directory if you’re unfamiliar with them
– Visit the Advanced Search options, Help file, Search Tips, as applicable
• Conduct the search
• Examine the first page of returned results and visit the top five or more
– Search engines rank results based on the matching and ranking criteria, not on
the context of the topic searched
• System relevance
• Identify the pages or documents that are the most relevant to your topic
– User relevance judgment (also called pertinence)
• Use the most relevant document or page and explore the keywords,
headings, phrases, etc. that you can use to find additional relevant pages or
documents
– “Seed” document or “pearl growing”
– Follow the Cited by links, as applicable, to find additional documents relevant to the
topic
• Revise your search if needed
• Try your search in another engine, specialized engine, meta-engine,
directory, etc.
The Question of Quality
• Criteria for evaluating information quality
– Source domain (.com, .edu, .gov, etc.)
– Authority
– Purpose or motivation
– Quality of writing
– Balanced views
– Currency of information
– Sources cited
The Question of Quality
• Accuracy
• Factual information (check against two or
more authoritative sources)
• Use additional sources for evaluating the
quality of information on the Internet
– http://www.virtualchase.com/quality
– http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
The Invisible Web
• Search engines don’t index all web pages
• Reasons:
– Information stored in databases that require
subscription
– Pages or websites that are password-protected
– Pages that are not linked to other pages
– Pages that are blocked to spiders or crawlers
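The slides do not say how pages are blocked to crawlers. One widely used mechanism is a robots.txt file, and the sketch below assumes it. The rules, the bot name `ExampleBot`, and the URLs are hypothetical, and the check uses the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access


def may_crawl(url, agent="ExampleBot"):
    """True if the robots.txt rules allow this crawler to fetch the URL."""
    return parser.can_fetch(agent, url)
```

Under these rules, `may_crawl("http://example.org/private/report.html")` is false, so a well-behaved crawler skips the page and it stays out of the engine's index.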
Search Logic: Boolean Operators
Source: Google Images
Boolean and Search Engines
• AND: +
• OR
• NOT: -
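Boolean operators correspond to set operations on the pages that contain each word. The sketch below assumes a tiny hand-made index, with invented document names, to show what each operator returns.

```python
# Hypothetical index: word -> set of documents containing it.
INDEX = {
    "jaguar": {"doc1", "doc2", "doc3"},
    "car": {"doc2", "doc4"},
    "cat": {"doc1", "doc5"},
}


def docs(word):
    """Documents containing the word (empty set if the word is unknown)."""
    return INDEX.get(word, set())


both = docs("jaguar") & docs("cat")       # jaguar AND cat: intersection
either = docs("jaguar") | docs("car")     # jaguar OR car: union
without = docs("jaguar") - docs("car")    # jaguar NOT car: difference
nested = docs("jaguar") & (docs("cat") | docs("car"))  # jaguar AND (cat OR car)
```

AND narrows a search, OR broadens it, and NOT removes unwanted results. The parentheses in the last line show nesting.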
Phrase Searching
• Proximity searching
• Quotation marks (“ ”) are used in search engines
• Provides more precise results
• Limits the results to the words that are
close to each other
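The difference between an exact phrase and looser proximity can be sketched with word positions. Both helper functions are illustrative assumptions. Engines differ in whether they offer a proximity operator and in how they spell it.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_phrase(text, phrase):
    """True if the words of the phrase appear adjacent and in order."""
    words, terms = tokenize(text), tokenize(phrase)
    if not terms:
        return False
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


def near(text, first, second, max_gap):
    """True if the two words occur within max_gap positions of each other."""
    words = tokenize(text)
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)
```

In the sentence "Search engines index the web", the phrase "search engines" matches but "engines search" does not, while `near` accepts "search" and "web" only when `max_gap` is at least 4.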
Demos
• Google Features
– Basic
– Advanced
– I’m feeling lucky
– Google Directory
– About Google
– More (from the menu option)
– Show options/Hide options (from the results
page)
Google Advanced Searching
• Video on YouTube
http://www.youtube.com/watch?v=tk6vZiGiaiQ
Yahoo Demo
• Basic
• Advanced
• Directory
• Yahoo Answers
• Ask Earl
• Other features