Transcript Slide 1
IELM 231: IT for Logistics and Manufacturing
Course Agenda
Introduction
IT applications design: Human-Computer Interface
Fundamental IT tools: sorting, searching
The Client-Server architecture, Interacting applications
IT in logistics, Case study 1: web-search
Search robots
Data processing, Data storage/retrieval (DB, indexes)
Data presentation: page ranking techniques
IT in logistics, Case study 2: web-based auctions
How auctions work
Web issues: session tracking
Web issues: secure communications
Web issues: cash transactions
The Internet and Web searching
Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications
Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)
Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking
History of search engines
1993: First web robot – World Wide Web Wanderer
Matthew Gray, Physics student from MIT
Objective: track all pages on web to monitor growth of the web
1994: First search engine – WebCrawler
Brian Pinkerton, CS student from U of Washington
Objective: download web pages, store their links in a keyword-searchable DB
1994: Jerry’s Guide to the Internet
Jerry Yang, David Filo, Stanford University
Objective: crawl for web pages, organize them by content into hierarchies
Yet Another Hierarchical Officious Oracle (Yahoo)
1994-97: Lycos, Infoseek, AltaVista, Excite, LookSmart (meta engine)
1998: Google (Sergey Brin, Larry Page, CS students, Stanford University)
Basics of web searching: two modes of searching
Web directories
[Boolean] keyword search
Basics of web directories
A hierarchy of main topics is formed (by search company)
Each topic may have several sub-topics, etc.
All pages ‘known’ to engine are linked to the hierarchy by humans (editors)
Users search for ‘main’ articles in any subtopic by ‘browsing’
Pages deep inside a hierarchy can be searched using ‘keyword search’
Basics of web searching: user’s view
Basic search modes
Any page with the search term(s) occurring in it is returned
What if you search for multiple words?
implicit AND: page must contain all the words in search term
Exact phrase search
Typically by quoting the search term: “now is the time”
What happens if you use: now is the time
Typically: ‘is’ and ‘the’ will be omitted, so the results will differ from the quoted search above!
Basics of web searching: user’s view..
Boolean (Set theoretic) operations on search results:
Intersection: AND (implicit) manufacturing die casting
Union: OR manufacturing die OR shell casting
Difference (NOT): ‘-’ manufacturing cold forging -hot
Some useful Google™ search modes:
Definition of a word or phrase define:search
Page contains numbers in a given range ming emperor 1400..1500
Simple calculator functions 150*(1+0.05)**5-150
Basics of web searching: user’s view…
Filtering
Anchor: the text in an HTML tag that carries the hyperlink to a page
Example:
<a href="http://www-ieem.ust.hk/dfaculty/ajay/courses/ielm330/330intro.html"> IELM 330 web page </a>
Anchors have special significance in Google (we shall see this later)
Anchor text often indicates the content of the page it links to
Search term must occur in the anchor text inanchor:manufacturing-wiki
Search term must occur in the title of page allintitle:joneja ajay
Focused search
Search only on pages that occur within a particular site IELM 330 site:ust.hk
The Internet and Web searching
Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications
Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)
Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking
Collecting information on the internet: Crawlers
How do search engines know which page has what data?
- By the use of robots (also called spiders or crawlers)
- By asking sites to tell them who they are (registering)
Crawling:
Start with a small list of web-pages (seeds): OPEN := seeds
For each URL in OPEN:
    Fetch the page;
    Process the page (index its contents; see later);
    Add all web-links (URLs) found on the page to OPEN
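A minimal sketch of this crawl loop in Python (illustrative only: the seed URL, the regex-based extract_links helper and the max_pages cut-off are assumptions; a real crawler would also respect robots.txt, resolve relative links and handle many more error cases):

```python
import re
import urllib.request
from collections import deque

def extract_links(html):
    """Very naive link extraction: absolute href="..." values in the HTML."""
    return re.findall(r'href="(http[^"]+)"', html)

def process_page(url, html):
    """Placeholder for the indexing step (see the indexing slides later)."""
    print(url, len(html), "bytes")

def crawl(seeds, max_pages=100):
    open_queue = deque(seeds)       # OPEN: URLs waiting to be fetched
    seen = set(seeds)               # avoid queueing the same URL twice
    fetched = 0
    while open_queue and fetched < max_pages:
        url = open_queue.popleft()  # breadth-first: oldest URL first
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue                # skip pages that cannot be fetched
        fetched += 1
        process_page(url, html)
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                open_queue.append(link)

if __name__ == "__main__":
    crawl(["https://example.com/"])
```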
Problems in crawling
Too many web-sites: > 100 million
Each website has many pages (estimated total: ~25 billion)
Some pages may not be linked from any seed
Dynamic page generation (e.g. DB based, or rapidly changing)
Suppose a web-site generates a page based on some inputs from user
How can we program a crawler to get this content ?
What potential problems can be faced by this technique?
Art gallery website:
- 3 methods to sort thumbnails; 3 choices of font size; 2 skins
→ 18 URLs (using HTTP GET with different options) for the same page!
Crawler behavior
Basic crawler is easy to program
Policies used by crawlers determine their efficiency:
Selection policy: which page to download (among all URL’s in OPEN)
Breadth first is better than depth first
why?
Crawler behavior..
Re-visit policy: how often to re-visit a page (to account for dynamic content)
How often to re-visit? Depends on how frequently the page is changed
Out-dated data → cost [but how to estimate this cost?]
Freshness of a page p in the repository at time t:
    F_p(t) = 1, if the local copy of p is identical to the live page at time t
             0, otherwise
Age of a page p in the repository at time t:
    A_p(t) = 0, if the local copy of p is identical to the live page at time t
             t − (time when p was last modified), otherwise
[source: Wikipedia]
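A tiny sketch of these two measures as code (assuming we store, per page, a last-modified time and whether our copy still matches the live page; the plain numeric timestamps are an illustrative assumption):

```python
def freshness(copy_is_current: bool) -> int:
    """F_p(t): 1 if our stored copy of p is identical to the live page at time t, else 0."""
    return 1 if copy_is_current else 0

def age(copy_is_current: bool, t: float, last_modified: float) -> float:
    """A_p(t): 0 if our copy is current, otherwise how long p has been out of date."""
    return 0.0 if copy_is_current else t - last_modified
```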
Crawler behavior...
Policies used by crawlers determine their efficiency:
Politeness policy: how not to overload resources
Crawlers can send download requests faster than the server, the network
or the indexer can handle
Typical policy: wait n seconds before sending next page request to same server
n = 10 sec … 180 sec
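A sketch of such a per-server delay (the 10-second value and the in-memory bookkeeping are illustrative assumptions, not any particular engine's policy):

```python
import time
from urllib.parse import urlparse

POLITENESS_DELAY = 10.0   # n seconds between requests to the same server (assumed value)
last_request_time = {}    # host -> time we last sent it a request

def wait_politely(url):
    """Block until at least POLITENESS_DELAY seconds have passed since the
    previous request to this URL's host, then record the new request time."""
    host = urlparse(url).netloc
    earliest_allowed = last_request_time.get(host, 0.0) + POLITENESS_DELAY
    now = time.time()
    if now < earliest_allowed:
        time.sleep(earliest_allowed - now)
    last_request_time[host] = time.time()
```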
Crawler behavior….
Policies used by crawlers determine their efficiency:
Parallelization policy: how to use multiple crawlers in a coordinated search
advantage: the search engine can use multiple computers to build indexes
problems:
- how to coordinate and merge the results from the multiple computers
- how to ensure that same page is not downloaded by more than one process
Search engines: Indexing
What to do with the pages downloaded by the crawler?
Processing the data:
- categorization (mime-types)
- text processing
- delete stop words [Why?] [How?]
- Identify and handle words
- Identify and handle phrases [How many ?]
- Handling similar words
- plurals, past/present, mis-spellings: stemming
Search engines: Indexing..
Categorization (mime-types)
The indexer must know what type of data it received, and handle it accordingly:
A/V mime-types: audio/mpeg, audio/x-wav, audio/x-ms-wma, video/mpeg, …
Currently: check meta data (anchor text, ALT-tags, HTTP header, …)
Future: digital signal processing → voice recognition → text index
Image mime-types: image/gif, image/jpeg, image/png, image/tiff, …
Currently: check meta data (anchor text, ALT-tags, HTTP Header, …)
Application mime-types: application/javascript, application/xhtml+xml,
application/octet-stream, application/x-shockwave-flash, …
Currently: if interpreted language → handle as text; otherwise → meta data
Other mime-types (e.g. 3D data, etc.) → meta data, filenames, etc.
Search engines: Indexing…
Text processing
- Identify and handle words
what is a word?
- English: dictionary
- Programming: contiguous alphanumeric with ‘_’
- hyphenated [how does Google handle hyphenated search-terms?]
- special characters
- delete stop words
Stop words: very common words in text, such as the, a, an, on, …
How to define? language experts (human) make the list
How to handle? string-match → do not index stop words
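A minimal sketch of the string-match approach (the stop-word set below is a tiny illustrative sample, not a real engine's list):

```python
import re

STOP_WORDS = {"the", "a", "an", "on", "is", "of", "and", "to", "in"}  # sample list only

def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, '_')."""
    return re.findall(r"\w+", text.lower())

def index_terms(text):
    """Tokens that would actually be indexed: stop words are dropped."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

print(index_terms("Now is the time"))   # -> ['now', 'time']
```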
Search engines: Indexing….
- Identify and handle phrases
what is a phrase?
- English: dictionary of phrases, human experts
- How to identify: string match
- How many?
- Too many → index is too large
- Too few → real-time Boolean operations needed during search (takes time!)
Google n-gram dataset: ~24 GB compressed text files
Number of tokens: 1,024,908,267,229
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Search engines: Indexing…..
- Handling similar words
- plurals, past/present, mis-spellings: stemming
- Search for ‘expert’ == ? == Search for ‘experts’
Stemming: To reduce a given word to a ‘canonical’ root word
Basic methods:
- Use a dictionary
- Analyze sentence, guess the meaning of the word used, find root from look-up table
- Use simple processing rules (English)
- if word ends in ‘ed’, ‘ing’, ‘ly’ → remove ‘ed’, ‘ing’, ‘ly’
- How to identify mis-spellings?
- Dictionary lookup and match → expensive
- Look-up table with list of commonly misspelled words
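A sketch of the ‘simple processing rules’ idea (much cruder than a real stemmer such as Porter's; the suffix list is just the slide's examples plus a plural ‘s’, and the minimum-stem-length check is an assumption to avoid over-stripping):

```python
SUFFIXES = ("ing", "ed", "ly", "s")   # longest first, so 'casting' loses 'ing', not just 's'

def crude_stem(word):
    """Strip one common English suffix if a reasonably long stem remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("experts"))   # -> 'expert'
print(crude_stem("casting"))   # -> 'cast'
```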
Search engines: Indexing……
Inverted indexes: How to store processed data
Most common storage structure: Inverted Index
ngram    | Documents
---------+-------------------------
storm    | URL1, URL6, …, URL_n
teacup   | URL1, …, URL_k
…        | …
How to store?
- Database file (lookup time may be long)
- Hash (faster lookup on average)
Challenge: Updating an index [Why?] [How?]
Challenge: Merging indexes [Why? Parallel processing] [How?]
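A minimal sketch of building and merging such an index with an in-memory hash (a Python dict); term extraction reuses the crude tokenizer idea from the earlier slides, and the URLs are placeholders:

```python
import re
from collections import defaultdict

def terms(text):
    """The set of (lowercased) word tokens in a document."""
    return set(re.findall(r"\w+", text.lower()))

def build_index(pages):
    """pages: {url: text}. Returns the inverted index {term: set of URLs}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in terms(text):
            index[term].add(url)
    return index

def merge(index_a, index_b):
    """Merge two partial indexes (e.g. built by two parallel indexer processes)."""
    merged = defaultdict(set)
    for partial in (index_a, index_b):
        for term, urls in partial.items():
            merged[term] |= urls
    return merged

idx = build_index({"URL1": "storm in a teacup", "URL6": "the storm surge"})
print(sorted(idx["storm"]))   # -> ['URL1', 'URL6']
```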
Search engines: presenting results to user
Typical query may match 100 million web pages (hits)
Typical user may look at perhaps 3-5 pages to find the information they want
Successful search engine must provide a link to ‘best’ page among top few
How to rank the relative likelihood that a page will provide ‘wanted’ info?
- Google uses PageRank™
Search engines: PageRank
How to rank the relative likelihood that a page will provide ‘wanted’ info?
- Google uses PageRank™
PageRank: Invented by PhD student, Larry Page
Similar older ideas, e.g. link analysis (graph theory), citation index
Main idea of PageRank:
1. Importance of page is related to how many other pages link to it, and
2. How important are those pages
Is PageRank effective? [Google’s market capitalization: US$ 200B!]
Page Ranking
[Figure: example web of 4 pages; page 1 links to 2, 3, 4; page 2 links to 3, 4; page 3 links to 1; page 4 links to 1, 3]
Consider a web with 4 pages
Arrow from A → B => Page A has a link to Page B
Let: importance score of page k = score of k = xk
Simplest scheme:
xk = number of back-links to page k
x1 = 2, x2 = 1, x3 = 3, x4 = 2
Problem ?
does it account for condition 2?
Main reference for PageRank notes: Bryan and Leise, The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google
Page Ranking..
[Figure: the same 4-page example web]
Let: importance score of page k = score of k = xk
Suppose page j is more important (e.g. yahoo.com)
→ those pages that are linked from page j should be important
→ additional importance of a page linked from page j ∝ xj
Let Lk be the set of page k’s backlinks; then:
    xk = Σ_{j ∈ Lk} xj / nj
where nj = number of outgoing links on page j
Note: we ignore any link from a page to itself
Page Ranking…
Let Lk be the set of page k’s backlinks; then xk = Σ_{j ∈ Lk} xj / nj

[Figure: the same 4-page example web]

For the example web:
    x1 = x3 + x4/2
    x2 = x1/3
    x3 = x1/3 + x2/2 + x4/2
    x4 = x1/3 + x2/2

In matrix form, x = A x, where x = [x1, x2, x3, x4]^T and

    A = | 0    0    1    1/2 |
        | 1/3  0    0    0   |
        | 1/3  1/2  0    1/2 |
        | 1/3  1/2  0    0   |
Recall from Linear Algebra that for an n×n square matrix A, a scalar λ is an
eigenvalue if there is an eigenvector x ≠ 0 such that Ax = λx
PageRank of pages is the eigenvector of A corresponding to eigenvalue = 1
Page Ranking….
[Figure: the same 4-page example web]

We need the eigenvector of A for eigenvalue 1, i.e. we solve x = A x:

    | x1 |   | 0    0    1    1/2 | | x1 |
    | x2 | = | 1/3  0    0    0   | | x2 |
    | x3 |   | 1/3  1/2  0    1/2 | | x3 |
    | x4 |   | 1/3  1/2  0    0   | | x4 |

To find the eigenvector: solve the system using Gaussian elimination
In our example, any scalar multiple of [12, 4, 9, 6]^T solves it
Normalize (divide by the sum, 31):
    [12/31, 4/31, 9/31, 6/31]^T ≈ [0.387, 0.129, 0.290, 0.194]^T
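The same numbers can be checked numerically; a short sketch with NumPy (A is the example link matrix from this slide):

```python
import numpy as np

A = np.array([[0,   0,   1, 1/2],
              [1/3, 0,   0, 0  ],
              [1/3, 1/2, 0, 1/2],
              [1/3, 1/2, 0, 0  ]])

values, vectors = np.linalg.eig(A)
k = np.argmin(np.abs(values - 1.0))   # the eigenvalue (numerically) equal to 1
x = np.real(vectors[:, k])
x = x / x.sum()                       # normalize so the scores sum to 1
print(np.round(x, 3))                 # -> [0.387 0.129 0.29  0.194]
```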
Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?
Page Ranking …..
Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?
Dangling node: A web page that has no outgoing links
Theorem:
The link matrix of a web with no dangling nodes always has 1 as an eigenvalue
Examine matrix A:
Aij = 1/nj, if page j links to page i
Aij = 0, otherwise
sum of all entries in column j = total (unweighted) contribution of page j = 1
Column stochastic matrix: A square matrix with no negative entries, and
sum of each column = 1
Page Ranking …...
Column stochastic matrix: A square matrix with no negative entries, and
sum of each column = 1
Our link matrix is a column stochastic matrix
Theorem: Every column stochastic matrix has 1 as an eigenvalue
Proof:
1. (linear algebra): a square matrix A and its transpose A^T have the same eigenvalues
2. Let e = an n×1 vector with all entries = 1
3. A is column stochastic ⇒ A^T e = e
   ⇒ A^T has eigenvector e (with eigenvalue = 1)
   ⇒ A has an eigenvalue = 1.
QED
Page Ranking issues
Problems with the Link Matrix approach, and How to handle them
Non-unique eigenvector for eigenvalue = 1
Dangling nodes
Computational effort and time
Other factors that also contribute to ‘importance of the page’
Page Ranking issues: non-unique eigenvector
Consider the web in the figure
[Figure: example web of 5 pages; pages 1 and 2 link to each other; pages 3 and 4 link to each other; page 5 links to pages 3 and 4]

    A = | 0  1  0  0  0   |
        | 1  0  0  0  0   |
        | 0  0  0  1  1/2 |
        | 0  0  1  0  1/2 |
        | 0  0  0  0  0   |
What are the eigenvectors for A, corresponding to eigenvalue = 1 ?
Any linear combination of:
    [1/2, 1/2, 0, 0, 0]^T and [0, 0, 1/2, 1/2, 0]^T
What is the relative importance of the pages??
e.g.: [0.375, 0.375, 0.125, 0.125, 0]^T
Page Ranking issues: non-unique eigenvector..
Consider the web in the figure
[Figure: the same 5-page example web]

    A = | 0  1  0  0  0   |
        | 1  0  0  0  0   |
        | 0  0  0  1  1/2 |
        | 0  0  1  0  1/2 |
        | 0  0  0  0  0   |
Intuitive reason for the problem: this web can be partitioned into two independent sub-webs
Mathematically: Non-unique eigenvector results when A has a block diagonal form
    A = | A1  0   …  0  |
        | 0   A2  …  0  |
        | …   …   …  …  |
        | 0   0   …  Ak |
Solution:
- Reorganize A into block diagonal form
- Find Page Rank for each block independently
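A sketch of the ‘reorganize into blocks’ step: find the connected components of the underlying undirected link graph with a plain BFS, then rank each component on its own (the adjacency-list input format is an assumption for illustration):

```python
from collections import deque

def components(links):
    """links: {page: set of pages it links to}.
    Returns one set of pages per connected component (ignoring link direction)."""
    neighbours = {p: set() for p in links}
    for p, outs in links.items():
        for q in outs:
            neighbours.setdefault(p, set()).add(q)
            neighbours.setdefault(q, set()).add(p)
    seen, blocks = set(), []
    for start in neighbours:
        if start in seen:
            continue
        block, queue = set(), deque([start])
        while queue:
            p = queue.popleft()
            if p in block:
                continue
            block.add(p)
            queue.extend(neighbours[p] - block)
        seen |= block
        blocks.append(block)
    return blocks

# the 5-page example web above: {1, 2} and {3, 4, 5} are independent sub-webs
web = {1: {2}, 2: {1}, 3: {4}, 4: {3}, 5: {3, 4}}
print(components(web))   # -> [{1, 2}, {3, 4, 5}]
```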
Page Ranking issues: Dangling nodes
Suppose a web has dangling nodes (any page with no outgoing links)
Corresponding column of A is all 0’s
The matrix A is not column stochastic (it is column sub-stochastic)
For column sub-stochastic matrices, there must exist at least one eigenvalue, 0 < λ < 1, whose eigenvector has all non-negative entries
[Note: this eigenvector is called the Perron eigenvector]
[Note 2: we will omit the proof of its existence in this course]
We could use the Perron eigenvector to get page ranks (but we won’t)
Dangling nodes: A probabilistic view of the Link matrix
Consider an imaginary, randomized web-surfer:
He starts at a random page in the web and spends time t on the page.
He then jumps randomly to one of the links in that page
The Page Rank of a page = the long-run fraction of time spent on that page by the randomized surfer!
However: if our “randomized surfer” ends up on a dangling node, then he is
stuck there for ever ?!
One solution: If we are stuck at a dangling node, jump to a random page in the web
How to model this view-point into our matrix A ?
The S matrix: for any 0-column (i.e. dangling node) in A, make all entries = 1/n
Question: Is the matrix S column stochastic ?
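A sketch of that fix-up in NumPy (assuming A is already held as an n×n array): replace every all-zero column by a uniform column of 1/n. The printout also answers the question above: every column of S sums to 1, so S is column stochastic.

```python
import numpy as np

def make_S(A):
    """Copy of A with every dangling-node (all-zero) column replaced by 1/n."""
    n = A.shape[0]
    S = A.copy()
    dangling = (S.sum(axis=0) == 0)   # columns that sum to 0
    S[:, dangling] = 1.0 / n
    return S

# tiny example: page 3 is a dangling node (its column is all zeros)
A = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
print(make_S(A).sum(axis=0))   # -> [1. 1. 1.]
```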
Page Ranking Issues: Other factors influencing rank
Main idea (Google’s damping factor):
Allow our randomized surfer to make a purely random jump “occasionally”
How to implement ?
What is the link matrix R of a purely random surfer?

    R = | 1/n  1/n  …  1/n |
        | 1/n  1/n  …  1/n |
        | …    …    …  …   |
        | 1/n  1/n  …  1/n |

Note: R is column stochastic
Consider the Google matrix, G = d S + (1 – d) R, where d ∈ [0, 1]
Note: G is column stochastic (Why?)
Effect of using G: With probability d, we follow the randomized surfer S, and with
probability (1 – d), we make a purely random jump.
Page Ranking Issues: Other factors influencing rank
Main idea (Google’s damping factor):
Allow our randomized surfer to make a purely random jump “occasionally”
Google matrix, G = d S + (1 – d) R
Google found that using matrix G with damping factor d = 0.85 “works very
well” (i.e. gives a ranking that tends to put “useful pages” higher)
Page Ranking Issues: Computing the eigenvector
The Rank of page n is the n-th element of the eigenvector, I, of the G matrix
for eigenvalue = 1
Size of G: approx. 25 billion × 25 billion !!!
Using Gaussian elimination will take months on a supercomputer!
Properties of G:
Large size (approx. 25 billion × 25 billion)
Sparse link structure (approx. 10 non-zero entries in a typical non-dangling-node column of S)
1 is the largest eigenvalue for column stochastic matrix G
We can use the Power Method to compute the I-vector
Page Ranking Issues: Computing the eigenvector..
The Power Method to compute the eigenvector of the dominant eigenvalue
Start with an arbitrary rank vector, I0
Iterate:
I1 = G I0, I2 = G I1, …, Ik+1 = G Ik
Ik converges to the eigenvector of the dominant eigenvalue
We only need to do matrix multiply for sparse matrix (~10 multiply per column)
[NOTE: by writing G = d S + (1 – d) R, the effect of R is merely to add a fixed
quantity to each entry of I → faster computation]
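A minimal sketch of this iteration in NumPy (dense arrays for readability; the example matrix is the 4-page web's S = A from the earlier slides, and 100 iterations is an assumed cut-off). It uses the note above: instead of forming G, it computes d·S·I and then adds the same (1 − d)/n to every entry.

```python
import numpy as np

def pagerank(S, d=0.85, iterations=100):
    """Power method on G = d*S + (1-d)*R without ever forming G.
    S must be column stochastic; the returned I-vector sums to 1."""
    n = S.shape[0]
    I = np.full(n, 1.0 / n)                # I_0: arbitrary starting rank vector
    for _ in range(iterations):
        I = d * (S @ I) + (1.0 - d) / n    # since I sums to 1, R @ I = (1/n) * ones
        I = I / I.sum()                    # guard against numerical drift
    return I

# the 4-page example web (no dangling nodes, so S = A)
S = np.array([[0,   0,   1, 1/2],
              [1/3, 0,   0, 0  ],
              [1/3, 1/2, 0, 1/2],
              [1/3, 1/2, 0, 0  ]])
print(np.round(pagerank(S), 3))
```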
Brin and Page:
Typically, Google needs 50-100 iterations to get a good I-vector
Typically, Google’s computers take a few days to calculate it
Typically, Google updates rankings about once per month (the “Google dance”)
Concluding notes
Why is web-search important?
- For researchers: the only practical way to find information on the web
- For E-commerce companies: higher rank → larger sales (why?)
This is the reason for the high stock value of Google, Baidu, …
Current trend:
- Page Ranking method dominates Directory-structured search engines
Technology note:
- Page Ranking: a practical application of abstract Linear Algebra concepts!
References and Further Reading
Books:
Sherman and Price, The Invisible Web: Uncovering Information Sources Search
Engines Can’t See, Cyber Age Books, 2001
[Note: some information in this book is already obsolete]
Web sources:
1. Googleguide: google search guide
2. David Austin’s article on how Google PageRank works
Next: Auctions, web transactions, eBay