Search Engine

Basic Web Applications 2
Search Engine

Why do we need search engines?
– because there are hundreds of millions of
pages available on the web
– most of them titled according to the
notion of their author
– almost all of them sitting on servers with
hidden names.
– We use search engines to get
information on those pages.
What is an Internet Search Engine?


Special sites on the Web that are designed to
help people find information stored on other
sites.
Various search engines work in different ways,
but they all perform three basic tasks:
– Search the Internet, or select pieces of it, based on
important words.
– Keep an index of the words they find, and where they
find them.
– Allow users to look for words or combinations of
words found in that index.
Search Engine
The process is called Web crawling.
1- Search engines use software called spiders, which comb the
internet looking for documents and their web addresses
2- The spiders follow the links they find, spreading out across the most widely used portions of the Web.
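The crawling process above can be sketched in a few lines. This is only an illustration, not how any real spider is built: the TOY_WEB dictionary, the example URLs, and the crawl and fetch_links names are all assumptions made for the sketch, and a plain dictionary stands in for the network so that the code runs without downloading anything.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: visit each page once and follow its links outward."""
    queue = deque(seed_urls)
    seen = set(seed_urls)      # URLs already queued, so no page is visited twice
    visited = []               # URLs in the order the spider reached them
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Start from one popular page and spread out from there.
order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Starting from a single seed page, the spider reaches every page that is linked, directly or indirectly, from that seed. A real spider would replace the lambda with code that downloads the page and extracts its links.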
Search Engine
The documents and web addresses are collected and
sent to the search engine's indexing software
Search Engine
The indexing software extracts information from the
documents and stores it in a database (for example, every
word, or only the titles).
When you perform a search by entering keywords, the
database is searched for documents that match.
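A minimal sketch of this indexing and matching step, assuming the simplest possible "database": a Python dictionary that maps each word to the set of addresses where it was found. The names tokenize, build_index, search, and DOCS are hypothetical, and the sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample documents, keyed by web address.
DOCS = {
    "http://a.example/": "Spiders crawl the Web",
    "http://b.example/": "The index stores words and addresses",
    "http://c.example/": "Spiders send words to the index",
}
INDEX = build_index(DOCS)
hits = search(INDEX, "spiders index")
print(hits)
```

Only the third document contains both keywords, so it is the only match. The search never rereads the documents themselves; it only consults the index.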
Search Engine




In Google- multiple spiders at one time.
Each spider --- > keep 300 connections to Web
pages open at a time.
The system crawl over 100 pages per second-
around 600 kilobytes of data each second.
to minimize delays use its own DNS.
Search Engine

Google's spider takes note of two things:
– The words within the page
– Where the words were found
Google then ranks pages using factors such as:
– The frequency and location of
keywords within the Web page
– How long the Web page has existed
– The number of other Web pages that
link to the page in question
Search Engine

Lycos:
– keeps track of the words in the title,
subheadings and links
– the 100 most frequently used words on
the page
– each word in the first 20 lines of text.
Each commercial search engine uses a different
formula for assigning weight to the words
in its index.
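To make the idea of "assigning weight" concrete, here is a small sketch in which a word counts for more when it appears in the title or a subheading than in the body text. The weights (5, 3, 1), the location names, and the word_weights function are assumptions chosen for illustration; commercial engines do not publish their formulas.

```python
import re
from collections import Counter

# Hypothetical weights per location; not taken from any real search engine.
WEIGHTS = {"title": 5.0, "heading": 3.0, "body": 1.0}


def word_weights(page, weights=WEIGHTS):
    """Score each word by summing the weight of every location it appears in.

    `page` maps a location name ("title", "heading", "body") to its text.
    """
    scores = Counter()
    for location, text in page.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            scores[word] += weights.get(location, 0.0)
    return scores


page = {
    "title": "Search Engine Basics",
    "heading": "How spiders crawl",
    "body": "A search engine sends spiders to crawl pages.",
}
scores = word_weights(page)
print(scores.most_common(3))
```

Changing the numbers in WEIGHTS changes which pages look most relevant for a word, which is one reason two engines can rank the same pages differently.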
Meta Tags



Meta tags - key words and
concepts- under which the page will
be indexed.
Meta tags can guide the search
engine.
There is of course careless page
owner might ( irrelevant meta tags).
Meta Tags

To protect against this:
– Spiders correlate meta tags with page content,
rejecting meta tags that do not match.
– The owner of a page may or may not want the page
to be included in the results of a search engine's
activities.
– The robot exclusion protocol was developed for this.
Implemented in the meta-tag section at the beginning
of a Web page, it tells a spider to leave the page
alone, for example:
 <meta name="googlebot" content="noindex">
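A sketch of how a spider might honor such a tag, using Python's standard html.parser module. The RobotsMetaParser class and the may_index function are hypothetical names for this example. The sketch also accepts name="robots", the generic form of the tag that addresses all spiders, alongside the googlebot form shown in the slides.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a page carries a noindex meta tag for this spider."""

    def __init__(self, bot_name="googlebot"):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attributes = {name: (value or "") for name, value in attrs}
        name = attributes.get("name", "").lower()
        directives = [d.strip().lower()
                      for d in attributes.get("content", "").split(",")]
        if name in ("robots", self.bot_name) and "noindex" in directives:
            self.noindex = True


def may_index(html, bot_name="googlebot"):
    """Return False if the page asks this spider not to index it."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return not parser.noindex


blocked = '<html><head><meta name="googlebot" content="noindex"></head></html>'
allowed = "<html><head><title>Welcome</title></head></html>"
print(may_index(blocked), may_index(allowed))
```

The check is purely cooperative: the tag is a request from the page owner, and it only works because well-behaved spiders choose to obey it.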
Building the Index

Once the spiders finish finding
information on Web pages, the
search engine must store the
information in a useful way:
– The information stored with the data
(in the simplest case, a word plus the URL where it was found)
– The method by which the information is
indexed
Building the Index

Different search engines
– will produce different lists
– pages presented in different orders.
Building the Index



The indexing process allows information to
be found as quickly as possible.
One way to build an index is to build a
hash table.
In hashing, a formula is applied to
attach a numerical value to each word.
Building the Index



In English, the "M" section of the dictionary is
much thicker than the "X" section, so finding a
word beginning with a very "popular" letter takes
more time.
Hashing evens out the difference, and reduces
the average time it takes to find an entry.
It also separates the index from the actual entry.
Building the Index

The hash table contains the hashed
number along with a pointer to the actual
data, which can be sorted in whichever
way stores it most efficiently.
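The hashing scheme described above can be sketched as follows. The hash formula here (sum of character codes modulo the number of buckets) is a deliberately simple assumption, not the formula any real engine uses, and HashIndex, hash_word, and the sample URLs are hypothetical names. The table holds only words and positions, while the actual entries live in a separate list.

```python
def hash_word(word, buckets=8):
    """Toy hash: sum of character codes, reduced modulo the bucket count."""
    return sum(ord(ch) for ch in word) % buckets


class HashIndex:
    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]  # the hash table
        self.records = []  # the actual data, kept separate from the table

    def add(self, word, url):
        """Store the record, then file a pointer to it under the word's hash."""
        self.records.append((word, url))
        position = len(self.records) - 1
        slot = hash_word(word, len(self.buckets))
        self.buckets[slot].append((word, position))

    def lookup(self, word):
        """Jump straight to one bucket instead of scanning every record."""
        bucket = self.buckets[hash_word(word, len(self.buckets))]
        return [self.records[pos][1] for w, pos in bucket if w == word]


index = HashIndex()
index.add("mars", "http://a.example/")
index.add("xenon", "http://b.example/")
index.add("rams", "http://c.example/")  # same letters as "mars": same hash value
print(index.lookup("mars"))
```

Because "mars" and "rams" hash to the same number, the lookup still compares the stored word before following the pointer. A better formula spreads words more evenly across buckets, which is what keeps the average lookup time low.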
Building a Search


Searching through an index involves a
user building a query and submitting it
through the search engine.
Boolean operators:
– AND - All the terms joined by "AND" must
appear in the pages or documents. Some search
engines substitute the operator "+" for the word AND.
– OR - At least one of the terms joined by
"OR" must appear in the pages or
documents.
Building a Search




– NOT - The term or terms following "NOT" must
not appear in the pages or documents. Some search
engines substitute the operator "-" for the word NOT.
– FOLLOWED BY - One of the terms must be
directly followed by the other.
– NEAR - One of the terms must be within a
specified number of words of the other.
– Quotation Marks - The words between
the quotation marks are treated as a
phrase, and that phrase must be found
within the document or file.
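These operators map naturally onto set operations over an index that also records word positions. The sketch below is one possible illustration, with invented documents and hypothetical function names (build_positions, docs_with, near, followed_by). It does not parse query strings; it only shows what each operator computes.

```python
import re
from collections import defaultdict


def build_positions(documents):
    """Map each word to {url: [positions of that word in the document]}."""
    index = defaultdict(dict)
    for url, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words):
            index[word].setdefault(url, []).append(pos)
    return index


def docs_with(index, word):
    """Set of URLs containing the word."""
    return set(index.get(word.lower(), {}))


def near(index, a, b, distance):
    """URLs where a and b occur within `distance` words of each other."""
    hits = set()
    for url in docs_with(index, a) & docs_with(index, b):
        pos_a, pos_b = index[a.lower()][url], index[b.lower()][url]
        if any(abs(i - j) <= distance for i in pos_a for j in pos_b):
            hits.add(url)
    return hits


def followed_by(index, a, b):
    """URLs where a is directly followed by b (also a two-word phrase match)."""
    hits = set()
    for url in docs_with(index, a) & docs_with(index, b):
        later = set(index[b.lower()][url])
        if any(i + 1 in later for i in index[a.lower()][url]):
            hits.add(url)
    return hits


# Invented sample documents.
DOCS = {
    "d1": "web search engine",
    "d2": "search the web with an engine",
    "d3": "engine repair manual",
}
IDX = build_positions(DOCS)

both = docs_with(IDX, "search") & docs_with(IDX, "engine")      # search AND engine
either = docs_with(IDX, "search") | docs_with(IDX, "repair")    # search OR repair
without = docs_with(IDX, "engine") - docs_with(IDX, "search")   # engine NOT search
phrase = followed_by(IDX, "search", "engine")                   # "search engine"
close = near(IDX, "search", "engine", 2)                        # search NEAR engine
print(both, either, without, phrase, close)
```

AND, OR, and NOT need only the list of documents for each word, while FOLLOWED BY, NEAR, and quoted phrases also need the positions, which is why the index stores where each word was found.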
Overall view