Slide 1

Transcript Slide 1

og/le
optimal guesswork/luck-based engine
Circa 1300 BC:
Ten Commandments (1956)
Circa 1971:
Courtesy of “An Atlas of Cyberspaces”
(http://www.cybergeography.org/atlas/historical.html
)
Circa 1999:
So, why do we need
search engines?
o The web is too big.
o There is too much irrelevant information.
o Search engines bring order to this chaos
filled land.
What does a search
engine require?
o Know the Data
o Store the Data
o Retrieve the Results
o Order the Results
Anatomy of Streaker
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
http://www.carleton.edu
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
Playing nicely with
the network
o Is the server responding?
o Is the server overloaded?
o How much info are we requesting?
o How fast are we sending our requests?
Throttling Streaker
Formulas:
Pause Time: DELAY * 2
Ave. Delay: DELAY + (lastFetchTime – DELAY) * .5
1 second
Streaker
DELAY = 1
2 seconds
WEBSERVER
3 seconds
Streaker
DELAY = 2
4 seconds
4 seconds
Streaker
DELAY = 3
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
To Do
Page
Retriever
URL
Stream
List
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
Before
<html><head><title>og/le</title></head>
<body>
<table width="95%"><tr>
<td>    </td>
<td> <font size="+4"><b>og/le</b></font><br> <font size="+1">optimal
guesswork/luck-based engine</font></td>
<td align="right"> <font size="+3"> : carleton search</font></td>
</tr></table>
<br><br> <center> <a
href="http://dictionary.reference.com/search?q=ogle">About</a>
    <a href="instructions.html">Instructions for
Testers</a>     <a href="stats.php">Statistics</a>
<br><br> <form name="Ogle"> <input type="text" name="query" size="50"
/><input type='hidden' name='pagenum' value='1'> <br><input
type='submit' value='Ogle Carleton' /> </form> </center>
<br><center><font size="-1"><p>ogling 25,49 pages</font> </center>
</p><p><center><font size="-1"> <img src="streaker.gif"><br>Powered by
Streaker<br><br> © 2004 Josh Allen, Andrew Drummer, Brendan
Foote, Aaron Miller, Mike Ottum</font></center></p>
</body></html>
After
Page object
Page Text
Page Header
Page URL
Etc…
Word Object(s)
The Word
Word Position
Other info
Link Object(s)
Link URL
Link Position
Page URL
Link text
Brief HTML
Introduction
Which elements of a page
are important?
o Text
o Individual Words
o Position
o Tag Information
o Links
o Link target
o Link text
Parsing Challenges
o Identical pages with different URLs
o Especially common with dynamically-generated
pages
o Solution: Compute a checksum as we parse and
then compare it to previously seen pages
o CRC-32 Checksum Algorithm
o HTML is not a strict language
o The Parser must be flexible enough to allow
for many different types of coding, especially
in tags.
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
Pages Indexed: 54,752
Fetch Errors: 43,862
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
To Do
URL
Stream
List
Page
Retriever
Parser
Quarantine
Link Cache
Word Cache
Main
The
Database
Unique is Good
Word
philanderer
philanthropist
philanderer
Word ID
251
252
253
mySQL queries take a long time!
mySQL queries take a long time!
We have MANY queries to make.
Our current database contains
206,493 unique words
54,752 unique urls
Google stores the complete text of
6 Billion
web pages in
memory
THEN: 459 pages/hour
NOW: 3422 pages/hour
MySQL
Brief Databases
Introduction
o Why use databases?
o Data to store is too big for main memory
o Optimize disk accesses through intelligent
organization of data
o Relational Database Model
o Data is stored in tables according to
relationships
o Data is retrieved using Structured Query
Language (SQL)
Relational Example
o Relate Words to Pages
o Information that we care about:
o word (string)
o url (string)
o position (integer)
o HTML tag attributes (set)
The Non-Relational
Way
word
url
pos
tags
college
http://www.carleton.edu
1
<b>
a
http://www.mathcs.carleton.edu
3
<b,i>
college
http://www.carleton.edu
4
<>
Why is this method
bad?
o Wasted space
o The word “college” and the URL
“http://www.carleton.edu” appear twice in this
example
o In our actual crawl, the word “carleton”
appears 85,496 times
o String comparisons are slow
Our Database Tables
- Word
wid
word
1
carleton
2
college
3
is
4
a
5
great
6
place
7
fhqwhgads
Our Database Tables
- URL
urlid
url
1
http://www.carleton.edu
2
http://www.mathcs.carleton.edu
3
http://www.carleton.edu/student/
4
http://violet.mathcs.carleton.edu/ogle/search.php
WordToUrl Table Captures
a Relation
o Relates Word entries to URL entries
wid
urlid
pos
tags
2
1
1
<b>
4
2
3
<b,i>
2
1
4
<>
Executing a join
Operation
o Combine the information from multiple
tables to produce something meaningful
Word Table
URL Table
WordToURL Table
pos
tags
Desired Output
Word Table
wid
word
1
carleton
2
college
3
is
4
a
5
great
6
place
7
fhqwhgads
Word Table
wid
word
1
carleton
2
college
3
is
4
a
5
great
6
place
7
fhqwhgads
Word
Table
WordToURL
Table
pos
tags
Desired Output
URL
Table
Word Table
WordToURL Table
wid
word
1
carleton
wid
urlid
pos
tags
2
college
2
1
1
<b>
3
is
4
2
3
<b,i>
4
a
2
1
4
<>
5
great
6
place
7
fhqwhgads
Word Table
wid
word
1
carleton
2
college
wid
urlid
pos
tags
3
is
2
1
1
<b>
4
a
4
2
3
<b,i>
5
great
2
1
4
<>
6
place
7
fhqwhgads
WordToURL Table
Word
Table
WordToURL
Table
pos
tags
Desired Output
URL
Table
WordToURL Table
URL Table
wid
urlid
pos
tags
urlid
url
2
1
1
<b>
1
www.carleton.edu
4
2
3
<b,i>
2
www.mathcs…
2
1
4
<>
3
www.carleton…
4
violet.mathcs…
WordToURL Table
URL Table
wid
urlid
pos
tags
urlid
url
2
1
1
<b>
1
www.carleton.edu
4
2
3
<b,i>
2
www.mathcs…
2
1
4
<>
3
www.carleton…
4
violet.mathcs…
Word
Table
WordToURL
Table
pos
tags
Desired Output
URL
Table
Join Result
word
url
pos
tags
college
www.carleton.edu
1
<b>
college
www.carleton.edu
4
<>
Heuristics
o Tools by which we return search results
o Must be accurate
o Must be fast
Problems: In general, the more complex a
heuristic is, the slower it performs.
How heuristics work
o
o
o
o
o
Obtain search query from user
Use query to “pull out” relevant data
Use data to retrieve all relevant pages
Use specific heuristic to order pages
Output ordered pages to user
Basic Heuristics
o Word Occurrence
o Pages order by the number of times the words in the
query appear on the page
o Frequency
o Pages order by the number of times words in the
query appear over the total number of words on the
page
o Proximity
o Pages ordered by the number of times words in the
query appear in the same order on the page
Meta Heuristics
o Tags
o Words on a page are weighted depending on
their html tags
o Pages are ordered by the sum of the weighted
words that appear on the page
Ultimate Heuristic
A combination of data and context
o frequency
o proximity
o tag heuristics
o Rank of pages factored into heuristic
Problem: Using all these factors slows down searching
process
Vector Space Models
A table with relationships between terms
and documents:
doc1
doc2
doc3
Term1
1
17
20
Term2 0
0
5
Term3 7
0
2
Now consider the table to be a matrix.
Then
o The columns can be seen as document vectors
o The terms serve as a basis for the vector space
o We can compare documents using vector
functions
Comparing Vectors
Recall:
a b
cos 
a  b
If we set a threshold on cos , we find
the set of vectors that are within a cone
around a.
Normalizing the Data
Since the length of the document vectors and the values both affect
this calculation, we can do some pre-processing to help the heuristic.
Local Term Weighting Schemes:
Binary
Term frequency
Logarithmic
Augmented Normalized
Global Term Weighting Schemes:
Normal
Document Normalization Schemes:
Cosine
( fij )
fij
log( 1 + fij )
(( fij ) + (fij/maxk fkj)) / 2
1
1
2 2
( f ij )
1
2 2
( gilij )
Latent Semantic
Indexing
Matrix Decomposition
If the matrix A has rank k, we can represent the
matrix using k column vectors.
This has the effect of smooshing together like
documents, creating relationships between terms
that do not appear on the same page.
Example: if a user searches for “Samuel Clemens”,
the terms appear on the same page as “Mark Twain”
often enough that documents only containing “Mark
Twain” will match.
o Heuristics concerning text
o Heuristics concerning the context of the
text
o Heuristics concerning the context of the
pages
Page Rank
What makes Page Rank different?
o
o
o
o
Link-based
Independent of search terms
Fewer database queries during search
Copyrighted
Example Network
A
C
B
D
E
Ranking a page
What do you need to rank a page?
o Pages that link to your page
o The ranks of those pages
o The links on those pages
o
Rank = 0.15 + 0.85 * Σ(Ri/Li)
o Fifty iterations
Before Ranking
1.0
A
1.0
C
B
1.0
Total = 5.00
D
E
1.0
1.0
After First Iteration
1.85
A
0.97
0.54
C
B
0.77
Total = 5.41
D
E
1.28
After Second
Iteration
1.66
A
1.05
0.50
C
B
0.72
Total = 5.19
D
E
1.26
After Fifth Iteration
1.62
A
0.49
0.49
C
B
0.70
Total = 5.06
D
E
1.23
And in Conclusion . . .
How did we do?
og/le makes your laundry whiter than any
other leading brand!
Competitors:
Google (the big boys)
ht://Dig (Carleton’s current search engine)
Searching for Dave
Musicant
Google
Search
time:
.31
seconds
Searching for Dave
Musicant
ht://Dig
Search
time:
.5 seconds
Searching for Dave
Musicant
og/le
Search
time:
1.54
seconds
Searching for Aaron
Miller
Google
Search
time:
.40
seconds
Searching for Aaron
Miller
ht://Dig
Search
time:
.5 seconds
Searching for Aaron
Miller
og/le
Search
time:
.06
seconds
Future Goals
o Support Stemming
o Link Referrals
o Better Hardware
o T-shirts, coffee mugs
How to og/le
Visit us at:
http://violet.mathcs.carleton.edu/ogle/
Bibliography
Berry, Michael, and Murray Browne.
Understanding Search Engines. Philadephia:
SIAM, 1999.
Craven, Phil. "Google's PageRank Explained and how
to make the most of it ." Web Workshop.net.
<http://www.webworkshop.net/pagerank.html>

Slide 1

Transcript Slide 1

Directory