ITIS 1210 Introduction to Web

Download Report

Transcript ITIS 1210 Introduction to Web

ITIS 1210
Introduction to Web-Based
Information Systems
Internet Research Two
How Search Engines Rank Pages &
Constructing Complex Searches
How do Search Engines Crawl?
 Gathering data from the Web is like
browsing:
1.
2.
3.
4.
Visit a page.
Record all the words on the page
Choose a link you haven’t seen/recorded
Click on the link.
Repeat 8 billion times.
Crawling the Web
 One person with a Web browser, following
one link per second.
 How long does it take to browse the
surface Web (8 billion pages)?
8 billion seconds = 133 million minutes
= 2 million hours
= 93 thousand days
= 256 years
Crawling the Web
 How many people would it take to crawl
the surface Web in a week? If each person
follows one link per second (with no
sleep):
One week = six hundred thousand seconds
Six hundred thousand / eight billion =
thirteen thousand
Challenges:
 Remembering where you’ve been
 Remembering where you haven’t been
 Storing all the data
A (small) Server Farm
The Deep Web
 Not all pages get crawled:
 Private pages on Intranets (company
networks)
 Pages that people don’t want crawled
 Dynamic content pages (from databases)
 Dynamic content pages make the size of the
Internet infinite!
Dynamic Content Example
 zillow.com
 Won’t be
indexed
Identifying High Quality Web
Pages
 Google has ranked billions of Web pages
by "quality".
 You enter your search terms:
UNC Charlotte HCI
 Google finds the highest quality page
associated with these search terms.
Google Pagerank
Pretend you're surfing the Web randomly.
To move from page to page you could:
1) type in an address (www.sis.uncc.edu)
includes using a bookmark
OR
2) follow a link.
Pagerank measures how likely you are to reach a
particular page through random surfing (either 1 or 2).
The main idea is that links to your page from important
web pages indicate that your page is important.
Computing Pagerank
(what’s the probability of getting to this page?)
Q = Web page
A, B, C, ... = Pages pointing to Q
L(A), L(B), L(C),... = number of links on each page
Pagerank of Q:
R(Q) = (1-d) + d·(R(A)/L(A) + R(B)/L(B) + ...)
d represents the relative chance of following a link to page Q
and 1-d represents the relative chance of going directly to
page Q (via typing in the address or using a bookmark):
Usually these are:
d = 0.9
(1-d) = 0.1
Computing Pagerank
Pretend the Web has only four pages:
W
X
Y
Z
Links:
WX
YW
YZ
L(W)=1
L(X)=0
L(Y)=2
Which page has the highest “quality”?
ZW
L(Z)=1
Computing Pagerank
Links: W  X
YW
YZ
Z W
L(W)=1
L(X)=0
L(Y)=2
L(Z)=1
R(W) = (1-d) + d * (R(Y)/L(Y) + R(Z)/L(Z))
= 0.1 + 0.9 * (R(Y)/2 + R(Z)/1))
R(X) = 0.1 + 0.9 * R(W)
R(Y) = 0.1
R(Z) = 0.1 + 0.9 * (R(Y)/2)
Now, solve for:
R(W), R(X), R(Y),
R(Z)
Computing Values for R(W), R(X), R(Y) and R(Z)
We could use algebra to find the values, in the
same way we could solve for x and y in:
x = 1 + 2x + y
y = 2 + x + 3y
Algebraic Solution
w = R(W)
x = R(X)
y = R(Y)
z = R(Z)
w = 0.1 + 0.45y + 0.9z
w = 0.2775
x = 0.1 + 0.9w
x = 0.34795
y = 0.1
y = 0.1
z = 0.1 + 0.45y
z = 0.145
But solving for eight billion variables is hard.
Instead, we'll use fixed point iteration.
Solution by Fixed-Point Iteration
Start with initial estimates of PageRank for each page:
R(W) = 1.0 R(X) = 1.0 R(Y) = 1.0 R(Z) = 1.0
Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z))
= 0.1 + 0.9 * (1.0/2 + 1.0)
= 1.45
new R(X) = 0.1 + 0.9 *R(W) = 0.1 + 0.9 *1.0 = 1.0
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (1.0/2) = 0.55
Solution by Fixed-Point Iteration
Start with updated estimates:
R(W) = 1.45 R(X) = 1.0
R(Y) = 0.1
R(Z) = 0.55
Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z))
= 0.1 + 0.9 * (0.1/2 + 0.55)
= 0.64
new R(X) = 0.1 + 0.9 *R(W) = 0.1 + 0.9 *1.45 = 1.405
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (0.1/2) = 0.145
Solution by Iteration
iteration
0
1
2
3
4
5
R(W)
1.00000
1.45000
0.64000
0.27550
0.27550
0.27550
...
R(X)
1.00000
1.00000
1.40500
0.67600
0.34795
0.34795
...
R(Y)
1.00000
0.10000
0.10000
0.10000
0.10000
0.10000
...
R(Z)
1.00000
0.55000
0.14500
0.14500
0.14500
0.14500
...
Compute new estimates from the old until the estimates
stop changing. Note that this is the same answer as the
traditional algebraic approach, but this way scales better.
Final Pageranks
highest
page X
R(X) = 0.34795
.
.
.
page W
R(W) = 0.2755
page Z
R(Z) = 0.14500
lowest
page Y
R(Y) = 0.10000
Final Pageranks
0.10000
0.14500
Y
Z
W
X
0.27550
0.34795
2
1
1
0
How does Google Use
Pagerank?
 You enter search terms, such as “UNC
Charlotte HCI”
 Google finds all the pages that have all
those words on them
 Of all those pages, Google will list the
ones with the highest page rank first, but…
 …other ‘magic ingredients’ are used by
Google: trade secrets of their algorithms.
Introduction
 Basic queries are somewhat limited
 One or two keywords
 Simple relationships
 Limited syntax
 Complex queries provide more power
 Keywords & phrase can be connected to form
more complex relationships
 Search filters can be employed to limit results
Understanding Boolean Operators
 Syntax
 Rules for combining simple words to form
complex sentences
 Search engine syntax implemented by
applying Boolean logic
 George Boole
 1815-1864
Understanding Boolean Operators
 Boolean logic
 Keywords act as nouns
 Boolean operators act as conjunctions
 They define the connections between keywords
 Illustrated with Venn diagrams
 John Venn
 1834-1923
Understanding Boolean Operators
All web pages containing the word cats
cats
WWW
Understanding Boolean Operators
All web pages containing the word dogs
dogs
WWW
Understanding Boolean Operators
All web pages containing the words cats and dogs
Searches containing both words
cats
dogs
Intersection of
the two sets
WWW
Understanding Boolean Operators
All web pages containing the words cats or dogs
Searches containing either word
cats
dogs
Union of the
two sets
WWW
Understanding Boolean Operators
All web pages containing the words cats and not dogs
Searches containing one word but not the other
cats
dogs
Exclusion of
the dogs set
WWW
Understanding Boolean Operators
All web pages containing the words dogs and not cats
Searches containing one word but not the other
cats
dogs
Exclusion of
the cats set
WWW
Understanding Boolean Operators
 Boolean operators
 AND
 OR
 NOT
 Instruct the engine on how to combine
keywords to produce results
 Always use capital letters to avoid
confusion with and, or, not as keywords
Understanding Boolean Operators
 AND
 All these keywords must be on the Web page
 OR
 These keywords may or may not be on the
Web page
 At least one of them must be
 NOT
 None of these keywords can be on the Web
page
Understanding Boolean Operators
 Default operator
 Some engines have a default Boolean
operator
 Usually AND
 Might be OR
 Some engines may search for multiple
words as phrases
Understanding Boolean Operators
 Boolean operators may be
 Allowed on main page
 Confined to Advanced search pages
 Some engines use symbols instead
 + for AND
 - for NOT
 No space between sign and word:
 +solar +energy -windmill
Narrowing Searches with AND
 AND
 Limits results
 Forces inclusion of a stop word
 Indicates that all keywords must be found
on Web page
 Adding more ANDed keywords limits
search more
 Results should be more relevant because
the keyword list has expanded
Narrowing Searches with AND
 Example:
 “solar energy association” AND Portland
Solar energy
association
Portland
WWW
Narrowing Searches with AND
 Example:
 Henry +I same as “Henry I”
Henry
I
WWW
Expanding Searches with OR
 OR expands results
 Useful if you didn’t get enough returns from
your first search
 The more keywords you add, the more results
you should get
 Every page returned must have at least
one of the keywords on it
 Good to use when you have synonyms
Expanding Searches with OR
 Example:
 oregon OR northwest
oregon
northwest
WWW
Restricting Queries with AND NOT
 AND NOT excludes the keyword that
follows NOT
 Limits your search
 Produces fewer results
 Useful if first search returns irrelevant
results
 Use AND NOT to get rid of those results
Restricting Queries with AND NOT
 Equivalent forms:




cats AND NOT dogs
cats AND-NOT dogs
cats NOT dogs
cats –dogs
Restricting Queries with AND NOT
 Example:
 “solar energy association” AND portland
AND NOT maine
portland
Solar energy
association
maine
Multiple Boolean Operators
 Boolean operators allow you to focus a
search
 Any logical combination of operators is
allowed
 If it makes sense when spoken like a
sentence it’s probably OK to use
 Order of operations is usually left to right
 Use parentheses to organize terms
Multiple Boolean Operators
 Bad example:
 constitution +american OR “united states”
constitution
american
“united states”
Multiple Boolean Operators
 Good example:
 constitution +(american OR “united states”)
constitution
american
“united states”