Web Science: Searching the web


Basic Terms
• Search engine
  • Software that finds information on the Internet or World Wide Web
• Web crawler
  • An automated program that surfs the web, indexing and/or copying the websites it visits
  • Also known as bots, web spiders, or web robots
• Meta-tag
  • Extra information that tags an HTML document
  • <meta name="keywords" content="HTML,CSS,XML,JavaScript">
• Hyperlink (or link)
  • A reference/link to another web page
How do you evaluate a search engine?
• Time taken to return results
• Number of results
• Quality of results
How does a web crawler work?
1. Start at a webpage
2. Download the HTML content
3. Search for the HTML link tags: <a href="URL"></a>
4. Repeat steps 2-3 for each of the links
5. When a website has been completely indexed, load and crawl other websites
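The steps above can be sketched in Python. The `fetch` callable is an assumption of this sketch, not part of the original: it stands in for an HTTP download (e.g. `urllib.request.urlopen`), and injecting it keeps the crawl loop testable without network access. Links are pulled out with a deliberately simple regex rather than a full HTML parser.

```python
import re
from collections import deque

# Simplified link extraction: matches <a href="URL"> tags only.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: download a page, extract its <a href> links,
    then repeat for every link not yet visited."""
    visited = {}                          # url -> html ("indexed" pages)
    queue = deque([start_url])            # step 1: start at a webpage
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        html = fetch(url)                 # step 2: download the HTML content
        visited[url] = html
        for link in LINK_RE.findall(html):   # step 3: find the link tags
            if link not in visited:
                queue.append(link)           # step 4: repeat for each link
    return visited

# Tiny in-memory "web" standing in for real URLs:
pages = {
    "a": '<a href="b">B</a><a href="c">C</a>',
    "b": '<a href="a">A</a>',
    "c": 'no links here',
}
result = crawl("a", pages.__getitem__)
```

A real crawler would also respect `robots.txt`, resolve relative URLs, and handle fetch failures; none of that is shown here.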
Parallel Web Crawling
• Speed up your web crawling by running on multiple computers at the same time (i.e. parallel computing)
• How often should you crawl the entire Internet?
• How many copies of the Internet should you keep?
• What are the different ways to index a webpage?
  • Meta keywords
  • Content
  • PageRank (# of links to the page)
Basic Search Engine Algorithm
1. Crawl the Internet
2. Save meta keywords for every page
3. Save the content and popular words on the page
4. When somebody needs to find something, search for
matching keywords or content words
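The four steps above can be sketched as a single inverted index, assuming pages arrive as a `{url: html}` dict (the crawl from step 1 is taken as given). Meta keywords and body words are treated identically, which is precisely what makes the keyword-stuffing problem below possible.

```python
import re
from collections import defaultdict

META_RE = re.compile(r'<meta\s+name="keywords"\s+content="([^"]*)"',
                     re.IGNORECASE)
WORD_RE = re.compile(r'[a-z]+')

def build_index(pages):
    """Map each term (meta keyword or content word) to the set of
    pages containing it."""
    index = defaultdict(set)
    for url, html in pages.items():
        terms = []
        m = META_RE.search(html)
        if m:                                        # step 2: meta keywords
            terms += [k.strip() for k in m.group(1).lower().split(',')]
        text = re.sub(r'<[^>]+>', ' ', html)         # strip tags
        terms += WORD_RE.findall(text.lower())       # step 3: content words
        for t in terms:
            if t:
                index[t].add(url)
    return index

def search(index, query):
    """Step 4: return the pages matching every word in the query."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()
```

Note there is no ranking at all: every matching page is equally good, which is the gap PageRank fills.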
Problem:
• Nothing stops you from inserting your own keywords or
content that do not relate to the page’s *actual* content
PageRank Algorithm
1. Crawl the Internet
2. Save the content and index the content's popular words
3. Identify the links on the page
4. Each link to an already indexed page increases the
PageRank of that linked page
5. When somebody needs to find something, search for
matching keywords or content words, BUT rank the
search results according to PageRank
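A sketch of steps 1-5, using the simplified "count of inbound links" score from the earlier slide. (Google's actual PageRank is computed iteratively, weighting each link by the rank of the page it comes from; that refinement is omitted here.) Pages are again assumed to arrive as a `{url: html}` dict.

```python
import re
from collections import defaultdict

LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)
WORD_RE = re.compile(r'[a-z]+')

def rank_pages(pages):
    """Build a word index and a link-count rank in one pass."""
    index = defaultdict(set)
    rank = defaultdict(int)
    for url, html in pages.items():
        text = re.sub(r'<[^>]+>', ' ', html)
        for word in WORD_RE.findall(text.lower()):  # step 2: index words
            index[word].add(url)
        for link in LINK_RE.findall(html):          # steps 3-4: each link
            rank[link] += 1                         # raises the target's rank
    return index, rank

def ranked_search(index, rank, word):
    """Step 5: match on the word, then order results by rank
    (ties broken alphabetically for determinism)."""
    hits = index.get(word.lower(), set())
    return sorted(hits, key=lambda u: (-rank[u], u))
```

Because the score is just an inbound-link count, a cluster of pages all linking to one target inflates that target's rank, which is exactly the Google-bomb problem described below.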
Problem: Create a bunch of websites that link to a single
specific page (http://en.wikipedia.org/wiki/Google_bomb)
Shallow Web vs. Deep Web
• Shallow web
• Websites and content that are easily visible to “dumb search
engines”
• Content publicly links to other content
• Shallow web content tends to be static content (unchanging)
• Deep web
• Websites and content that tend to be dynamic and/or unlinked
• Private web sites
• Unlinked content
• Smarter search engines can crawl parts of the deep web
Search Engine Optimization (SEO)
• Meta keywords
  • Words that relate to your content
• Human-readable URLs
  • i.e. avoid complicated, dynamically created URLs
• Links to your page on other websites
• Page visits
• Others?
• White hat vs. black hat SEO
• White hats are the good guys. When would white-hat techniques be used?
• Black hats are the bad guys. When would black-hat techniques be used?
Search Engine Design
• Assumptions are key to design!
• Major problem in older search engines:
• People gamed the search results
• Results were not tailored to the user
• What assumptions does a typical search engine make
now? (i.e. what factors influence search today?)