Transcript Slide 1

IELM 231: IT for Logistics and Manufacturing
Course Agenda
Introduction
IT applications design: Human-Computer Interface
Fundamental IT tools: sorting, searching
The Client-Server architecture, Interacting applications
IT in logistics, Case study 1: web-search
Search robots
Data processing, Data storage/retrieval (DB, indexes)
Data presentation: page ranking techniques
IT in logistics, Case study 2: web-based auctions
How auctions work
Web issues: session tracking
Web issues: secure communications
Web issues: cash transactions
The Internet and Web searching
Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications
Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)
Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking
History of search engines
1993: First web robot – World Wide Web Wanderer
Matthew Gray, Physics student from MIT
Objective: track all pages on web to monitor growth of the web
1994: First search engine – WebCrawler
Brian Pinkerton, CS student from U of Washington
Objective: download web pages, store their content and links in a keyword-searchable DB
1994: Jerry’s Guide to the Internet
Jerry Yang, David Filo, Stanford University
Objective: crawl for web pages, organize them by content into hierarchies
→ Yet Another Hierarchical Officious Oracle (Yahoo)
1994-97: Lycos, Infoseek, AltaVista, Excite, LookSmart (meta engine)
1998: Google (Sergey Brin, Larry Page, CS students, Stanford University)
Basics of web searching: two modes of searching
Web directories
[Boolean] keyword search
Basics of web directories
A hierarchy of main topics is formed (by search company)
Each topic may have several sub-topics, etc.
All pages ‘known’ to engine are linked to the hierarchy by humans (editors)
Users search for ‘main’ articles in any subtopic by ‘browsing’
Pages deep inside a hierarchy can be searched using ‘keyword search’
Basics of web searching: user’s view
Basic search modes
Any page with the search term(s) occurring in it is returned
What if you search for multiple words?
implicit AND: page must contain all the words in search term
Exact phrase search
Typically by quoting the search term: “now is the time”
What happens if you use: now is the time
Typically: 'is' and 'the' will be omitted, so the results will differ from the quoted search above!
Basics of web searching: user’s view..
Boolean (Set theoretic) operations on search results:
Intersection: AND (implicit) → manufacturing die casting
Union: OR → manufacturing die OR shell casting
Difference (NOT): '-' → manufacturing cold forging -hot
Some useful Google™ search modes:
Definition of a word or phrase → define:search
Page contains numbers in a given range → ming emperor 1400..1500
Simple calculator functions → 150*(1+0.05)**5-150
Basics of web searching: user’s view…
Filtering
Anchor: the text in an HTML tag that carries the hyperlink to a page
Example:
<a href="http://www-ieem.ust.hk/dfaculty/ajay/courses/ielm330/330intro.html"> IELM 330 web page </a>
Anchors have special significance in Google (we shall see this later)
Anchor text often indicates the content of the page it links
Search term must occur in the anchor text → inanchor:manufacturing-wiki
Search term must occur in the title of page → allintitle:joneja ajay
Focused search
Search only on pages that occur within a particular site → IELM 330 site:ust.hk
The Internet and Web searching
Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications
Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)
Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking
Collecting information on the internet: Crawlers
How do search engines know which page has what data?
- By the use of robots (also called spiders or crawlers)
- By asking sites to tell them who they are (registering)
Crawling:
Start with a small list of web-pages (seeds) := OPEN
For each URL in OPEN:
    Fetch page;
    Process page (index its contents → see later);
    Add all web-links (URL's) found on the page to OPEN
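A minimal Python sketch of this crawling loop (assuming the third-party requests and beautifulsoup4 packages; the page limit and the process() placeholder are illustrative, not from the slides):

```python
# Minimal crawler sketch: fetch pages breadth-first from a seed list,
# extract links, and add unseen URLs back to the frontier (OPEN list).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    open_list = deque(seeds)          # the OPEN list of URLs still to visit
    seen = set(seeds)
    while open_list and max_pages > 0:
        url = open_list.popleft()     # breadth-first: take the oldest URL
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                  # skip unreachable pages
        max_pages -= 1
        process(url, page.text)       # index its contents (see later)
        for tag in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, tag["href"])
            if link not in seen:
                seen.add(link)
                open_list.append(link)

def process(url, html):
    print("fetched", url, len(html), "bytes")   # placeholder for the indexer
```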
Problems in crawling
Too many web-sites: > 100 million
Each website has many pages (estimated total: ~ 25 billion)
Some pages may not be linked from any seed
Dynamic page generation (e.g. DB based, or rapidly changing)
Suppose a web-site generates a page based on some inputs from user
How can we program a crawler to get this content ?
What potential problems can be faced by this technique?
Art gallery website:
- 3 methods to sort thumbnails; 3 choices of font size; 2 skins
→ 18 URLs (using HTTP GET with different options) for the same page!
Crawler behavior
Basic crawler is easy to program
Policies used by crawlers determine their efficiency:
Selection policy: which page to download (among all URL’s in OPEN)
Breadth first is better than depth first
why?
Crawler behavior..
Re-visit policy: how often to re-visit a page (to account for dynamic content)
How often to re-visit? Depends on how frequently the page is changed
out-dated data → cost [but how to estimate this cost?]
Freshness of a page p in the repository at time t:
    F_p(t) = 1, if the local copy of p is identical to the live page at time t
    F_p(t) = 0, otherwise
Age of a page p in the repository at time t:
    A_p(t) = 0, if the local copy of p is identical to the live page at time t
    A_p(t) = t − (time when p was last modified), otherwise
[source: wikipedia]
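A direct Python sketch of these two measures (function and argument names are illustrative):

```python
def freshness(local_copy, live_page):
    # F_p(t) = 1 if the stored copy matches the live page, else 0
    return 1 if local_copy == live_page else 0

def age(local_copy, live_page, t, last_modified):
    # A_p(t) = 0 if the stored copy is up to date,
    # otherwise the time elapsed since the live page was last modified
    return 0 if local_copy == live_page else t - last_modified
```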
Crawler behavior...
Policies used by crawlers determine their efficiency:
Politeness policy: how not to overload resources
Crawlers can send download requests faster than the server, the network
or the indexer can handle
Typical policy: wait n seconds before sending next page request to same server
n = 10 sec … 180 sec
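A minimal sketch of a per-server politeness delay (the dictionary-based bookkeeping is illustrative; the 10-second wait is the lower end of the range quoted above):

```python
import time
from urllib.parse import urlparse

POLITENESS_DELAY = 10.0        # seconds to wait between requests to the same server
last_request_time = {}         # host -> time of the most recent request to it

def polite_wait(url):
    # Before fetching, sleep until at least POLITENESS_DELAY seconds have
    # passed since the previous request to the same host.
    host = urlparse(url).netloc
    elapsed = time.time() - last_request_time.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_request_time[host] = time.time()
```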
Crawler behavior….
Policies used by crawlers determine their efficiency:
Parallelization policy: how to use multiple crawlers in a coordinated search
advantage: the search engine can use multiple computers to build indexes
problems:
- how to coordinate and merge the results of the multiple computers
- how to ensure that same page is not downloaded by more than one process
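One common approach (a sketch, not the only policy) is to partition the URL space among crawler processes, e.g. by hashing the host name, so that no two processes ever fetch the same page; the number of crawler processes below is arbitrary:

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 8   # illustrative: one crawler process per partition

def assigned_crawler(url):
    # Hash the host name so that all pages of a site go to the same crawler
    # process, and no URL is downloaded by more than one process.
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS
```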
Search engines: Indexing
What to do with the pages downloaded by the crawler?
Processing the data:
- categorization (mime-types)
- text processing
- delete stop words [Why?] [How?]
- Identify and handle words
- Identify and handle phrases [How many ?]
- Handling similar words
- plurals, past/present, mis-spellings: stemming
Search engines: Indexing..
Categorization (mime-types)
The indexer must know what type of data it received, and handle it accordingly:
A/V mime-types: audio/mpeg, audio/x-wav, audio/x-ms-wma, video/mpeg, …
Currently: check meta data (anchor text, ALT-tags, HTTP Header, …)
Future: Digital signal processing → voice recognition → text → index
Image mime-types: image/gif, image/jpeg, image/png, image/tiff, …
Currently: check meta data (anchor text, ALT-tags, HTTP Header, …)
Application mime-types: application/javascript, application/xhtml+xml,
application/octet-stream, application/x-shockwave-flash, …
Currently: If interpreted language → handle as text, otherwise meta data
Other mime-types: e.g. 3D data etc. → meta data, filenames, etc.
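A minimal sketch of dispatching downloaded objects by MIME type along the lines above (the handler functions are illustrative placeholders):

```python
def handle_document(mime_type, content, metadata):
    # Route a downloaded object to an indexing strategy based on its MIME type
    # (taken from the HTTP Content-Type header).
    if mime_type.startswith(("audio/", "video/", "image/")):
        return index_metadata(metadata)      # anchor text, ALT tags, headers, ...
    if mime_type in ("application/javascript", "application/xhtml+xml"):
        return index_text(content)           # interpreted language: treat as text
    if mime_type.startswith("text/"):
        return index_text(content)
    return index_metadata(metadata)          # everything else: metadata, filenames

def index_text(content):
    pass    # placeholder for the text-processing pipeline described below

def index_metadata(metadata):
    pass    # placeholder: index only the surrounding metadata
```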
Search engines: Indexing…
Text processing
- Identify and handle words
what is a word?
- English: dictionary
- Programming: contiguous alphanumeric with ‘_’
- hyphenated [how does Google handle hyphenated search-terms?]
- special characters
- delete stop words
Stop words: very common words in text, such as the, a, an, on, …
How to define? language experts (human) make the list
How to handle? string-match → do not index stop words
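A minimal sketch of stop-word removal by string matching (the stop-word set here is a tiny illustrative subset of a real list):

```python
STOP_WORDS = {"the", "a", "an", "on", "is", "of", "and"}   # illustrative subset

def remove_stop_words(tokens):
    # Keep only tokens that do not match the stop-word list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Example: "now is the time" -> ['now', 'time']
print(remove_stop_words("now is the time".split()))
```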
Search engines: Indexing….
- Identify and handle phrases
what is a phrase?
- English: dictionary of phrases, human experts
- How to identify: string match
- How many?
- Too many → Index is too large
- Too few → real-time Boolean operations during search (takes time!)
Google n-gram dataset: ~24 GB compressed text files
Number of tokens: 1,024,908,267,229
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Search engines: Indexing…..
- Handling similar words
- plurals, past/present, mis-spellings: stemming
- Search for ‘expert’ == ? == Search for ‘experts’
Stemming: To reduce a given word to a ‘canonical’ root word
Basic methods:
- Use a dictionary
- Analyze sentence, guess the meaning of the word used, find root from look-up table
- Use simple processing rules (English)
- if word ends in 'ed', 'ing', 'ly' → remove 'ed', 'ing', 'ly'
- How to identify mis-spellings?
- Dictionary lookup and match → expensive
- Look-up table with list of commonly misspelled words
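A minimal sketch of such rule-based stemming plus a misspelling look-up table (the suffix list and table entries are illustrative; real stemmers such as Porter's use many more rules):

```python
COMMON_MISSPELLINGS = {"recieve": "receive", "definately": "definitely"}  # illustrative

def simple_stem(word):
    # Normalize a word to a rough root form using simple suffix-stripping rules.
    word = word.lower()
    word = COMMON_MISSPELLINGS.get(word, word)    # look-up table for misspellings
    for suffix in ("ing", "ed", "ly", "s"):       # 's' strips simple plurals
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Example: 'experts' -> 'expert', 'casting' -> 'cast'
print(simple_stem("experts"), simple_stem("casting"))
```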
Search engines: Indexing……
Inverted indexes: How to store processed data
Most common storage structure: Inverted Index
ngram      Documents
storm      URL1, URL6, …, URL_n
teacup     URL1, …, URL_k
…          …
How to store?
- Database file (lookup time may be long)
- Hash (faster lookup on average)
Challenge: Updating an index [Why?] [How?]
Challenge: Merging indexes [Why? → Parallel processing] [How?]
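A minimal sketch of an inverted index kept as a hash map (a Python dict) from term to the set of documents containing it, reusing the storm/teacup example above:

```python
from collections import defaultdict

inverted_index = defaultdict(set)   # term (n-gram) -> set of document URLs

def index_document(url, tokens):
    # Add each token of the processed page to the inverted index.
    for token in tokens:
        inverted_index[token].add(url)

def search(*terms):
    # Implicit AND: intersect the posting sets of all query terms.
    postings = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

index_document("URL1", ["storm", "teacup"])
index_document("URL6", ["storm"])
print(search("storm", "teacup"))    # -> {'URL1'}
```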
Search engines: presenting results to user
A typical query may match 100 million web pages (hits)
A typical user may look at perhaps 3-5 pages to find the information they want
Successful search engine → must provide a link to the 'best' page among the top few
How to rank the relative likelihood that a page will provide ‘wanted’ info?
- Google uses PageRank™
Search engines: PageRank
How to rank the relative likelihood that a page will provide ‘wanted’ info?
- Google uses PageRank™
PageRank: Invented by PhD student, Larry Page
Similar older ideas, e.g. link analysis (graph theory), citation index
Main idea of PageRank:
1. Importance of page is related to how many other pages link to it, and
2. How important are those pages
Is PageRank effective? [Google’s market capitalization: US$ 200B!]
Page Ranking
Consider a web with 4 pages
[Figure: page 1 links to pages 2, 3, 4; page 2 links to pages 3, 4; page 3 links to page 1; page 4 links to pages 1, 3]
Arrow from A → B => Page A has a link to Page B
Let: importance score of page k = score of k = xk
Simplest scheme:
xk = number of back-links to page k
x1 = 2, x2 = 1, x3 = 3, x4 = 2
Problem ?
does it account for condition 2?
Main Reference for PageRank notes: Bryan and Leise, The 25B eigenvector: The Linear Algebra Behind Google
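A tiny sketch of this simplest scheme, counting back-links from a table of outgoing links (the link structure is the 4-page example web in the figure):

```python
# Outgoing links of the 4-page example web: page -> pages it links to.
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}

def backlink_counts(links):
    # Score of page k = number of pages that link to k.
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts

print(backlink_counts(links))   # -> {1: 2, 2: 1, 3: 3, 4: 2}
```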
Page Ranking..
[Figure: same 4-page example web as above]
Let: importance score of page k = score of k = xk
Suppose page j is more important (e.g. yahoo.com) →
those pages that are linked from page j should be important →
additional importance of a page linked from page j ∝ xj
Let Lk be the set of page k's backlinks, then:
    xk = Σ_{j∈Lk} xj / nj
where nj = number of outgoing links on page j
Note: we ignore any link from a page to itself
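A small sketch (assuming NumPy; the links dictionary encodes the 4-page example web from the figure) of building the matrix form of these score equations, with entry A[i, j] = 1/nj whenever page j links to page i:

```python
import numpy as np

links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}   # the example web
n = len(links)

A = np.zeros((n, n))
for j, targets in links.items():        # column j describes page j's outgoing links
    for i in targets:
        A[i - 1, j - 1] = 1.0 / len(targets)

print(A.sum(axis=0))   # every column sums to 1 -> A is column stochastic
```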
Page Ranking…
Let Lk be the set of page k's backlinks, then: xk = Σ_{j∈Lk} xj / nj
[Figure: same 4-page example web as above]
For the example web:
    x1 = x3 + x4/2
    x2 = x1/3
    x3 = x1/3 + x2/2 + x4/2
    x4 = x1/3 + x2/2
In matrix form, x = A x:

    [ x1 ]     [  0    0    1   1/2 ] [ x1 ]
    [ x2 ]  =  [ 1/3   0    0    0  ] [ x2 ]
    [ x3 ]     [ 1/3  1/2   0   1/2 ] [ x3 ]
    [ x4 ]     [ 1/3  1/2   0    0  ] [ x4 ]
                          A

Recall from Linear Algebra that for an n×n square matrix A, a scalar λ is an
eigenvalue if there is an eigenvector x ≠ 0 such that Ax = λx
PageRank of pages is the eigenvector of A corresponding to eigenvalue = 1
Page Ranking….
    [ x1 ]     [  0    0    1   1/2 ] [ x1 ]
    [ x2 ]  =  [ 1/3   0    0    0  ] [ x2 ]
    [ x3 ]     [ 1/3  1/2   0   1/2 ] [ x3 ]
    [ x4 ]     [ 1/3  1/2   0    0  ] [ x4 ]

[Figure: same 4-page example web as above]
To find the eigenvector: solve the system (A − I)x = 0, e.g. using Gaussian elimination
In our example, the solution is any scalar multiple of (12, 4, 9, 6)^T
Normalize (divide by 12 + 4 + 9 + 6 = 31):
    (12/31, 4/31, 9/31, 6/31)^T = (.387, .129, .290, .194)^T
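As a check, this eigenvector can be computed numerically, e.g. with NumPy's eigen-solver; a small verification sketch for the 4-page example (not how a search engine would compute ranks at web scale):

```python
import numpy as np

A = np.array([[0,   0,   1, 1/2],
              [1/3, 0,   0, 0  ],
              [1/3, 1/2, 0, 1/2],
              [1/3, 1/2, 0, 0  ]])

values, vectors = np.linalg.eig(A)
k = np.argmin(np.abs(values - 1))        # pick the eigenvalue closest to 1
x = np.real(vectors[:, k])
x = x / x.sum()                          # normalize so the entries sum to 1
print(np.round(x, 3))                    # -> [0.387 0.129 0.29  0.194]
```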
Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?
Page Ranking …..
Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?
Dangling node: A web page that has no outgoing links
Theorem:
The link matrix of a web with no dangling nodes always has 1 as an eigenvalue
Examine matrix A:
Aij = 1/nj, if page j links to page i
Aij = 0, otherwise
→ sum of all entries in column j = total (unweighted) contribution of page j = 1
Column stochastic matrix: A square matrix with no negative entries, and
sum of each column = 1
Page Ranking …...
Column stochastic matrix: A square matrix with no negative entries, and
sum of each column = 1
→ Our link matrix is a column stochastic matrix
Theorem: Every column stochastic matrix has 1 as an eigenvalue
Proof:
1. (linear algebra): a square matrix A and its transpose A^T have the same eigenvalues
2. Let e = an n×1 vector with all entries = 1
3. A is column stochastic → A^T e = e
→ A^T has an eigenvector e, with eigenvalue = 1
→ A has an eigenvalue = 1.
QED
Page Ranking issues
Problems with the Link Matrix approach, and How to handle them
Non-unique eigenvector for eigenvalue = 1
Dangling nodes
Computational effort and time
Other factors that also contribute to ‘importance of the page’
Page Ranking issues: non-unique eigenvector
Consider the web in the figure
[Figure: a web of 5 pages forming two disconnected sub-webs: pages 1 and 2 link to each other; pages 3 and 4 link to each other; page 5 links to pages 3 and 4; no page links to page 5]

    A =  [ 0   1   0   0   0  ]
         [ 1   0   0   0   0  ]
         [ 0   0   0   1  1/2 ]
         [ 0   0   1   0  1/2 ]
         [ 0   0   0   0   0  ]

What are the eigenvectors for A, corresponding to eigenvalue = 1 ?
Any linear combination of:
    (1/2, 1/2, 0, 0, 0)^T and (0, 0, 1/2, 1/2, 0)^T
What is the relative importance of the pages??
e.g.: (.375, .375, .125, .125, 0)^T
Page Ranking issues: non-unique eigenvector..
Consider the web in the figure (the same 5-page web and link matrix A as on the previous slide)
Intuitive reason for the problem: this web can be partitioned into two independent sub-webs
Mathematically: Non-unique eigenvector results when A has a block diagonal form
    A =  [ A1   0  …   0  ]
         [  0  A2  …   0  ]
         [  …   …  …   …  ]
         [  0   0  …  Ak  ]
Solution:
- Reorganize A into block diagonal form
- Find Page Rank for each block independently
Page Ranking issues: Dangling nodes
Suppose a web has dangling nodes (any page with no outgoing links)
→ Corresponding column of A is all 0's
→ The matrix A is not column stochastic (it is column sub-stochastic)
→ For column sub-stochastic matrices, there must exist at least one
eigenvalue, 0 < λ < 1, whose eigenvector has all non-negative entries
[Note: this eigenvector is called the Perron eigenvector]
[Note 2: we will omit the proof of its existence in this course]
→ We could use the Perron eigenvector to get page ranks (but we won't)
Dangling nodes: A probabilistic view of the Link matrix
Consider an imaginary, randomized web-surfer:
He starts at a random page in the web and spends time t on the page.
He then jumps randomly to one of the links in that page
The Page Rank of a page = the long-run fraction of time the randomized surfer spends on that page!
However: if our “randomized surfer” ends up on a dangling node, then he is
stuck there for ever ?!
One solution: If we are stuck at a dangling node, jump to a random page in the web
How to model this view-point into our matrix A ?
The S matrix: For any 0-column (= dangling node) in A, make all entries = 1/n
Question: Is the matrix S column stochastic ?
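A minimal sketch (assuming NumPy; the function name make_S is illustrative) of constructing S from A as described; it also makes the answer to the question visible, since every column of S ends up summing to 1:

```python
import numpy as np

def make_S(A):
    # Replace every all-zero column of A (a dangling node) with 1/n in each
    # entry, so that the resulting matrix S is column stochastic.
    n = A.shape[0]
    S = A.astype(float)
    dangling = (S.sum(axis=0) == 0)     # columns that sum to 0
    S[:, dangling] = 1.0 / n
    return S
```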
Page Ranking Issues: Other factors influencing rank
Main idea (Google’s damping factor):
Allow our randomized surfer to make a purely random jump “occasionally”
How to implement ?
What is the link matrix of a purely random surfer?

    R =  [ 1/n  1/n  …  1/n ]
         [ 1/n  1/n  …  1/n ]
         [  …    …   …   …  ]
         [ 1/n  1/n  …  1/n ]

Note: R is column stochastic
Consider the Google matrix, G = d S + (1 – d) R, where d ∈ [0, 1]
Note: G is column stochastic (Why?)
Effect of using G: With probability d, we follow the randomized surfer S, and with
probability (1 – d), we make a purely random jump.
Page Ranking Issues: Other factors influencing rank
Main idea (Google’s damping factor):
Allow our randomized surfer to make a purely random jump “occasionally”
Google matrix, G = d S + (1 – d) R
Google found that using matrix G and damping factor d = 0.85 "works very
well" (→ gives a ranking that tends to rank "useful pages" higher)
Page Ranking Issues: Computing the eigenvector
The Rank of page n is the n-th element of the eigenvector, I, of the G matrix
for eigenvalue = 1
Size of G: approx. 25 billion × 25 billion !!!
→ Using Gaussian elimination would take months on a supercomputer!
Properties of G:
Large size (approx. 25 billion × 25 billion)
Sparse structure: the underlying link matrix S has only ~10 non-zero entries in a typical (non-dangling-node) column
1 is the largest eigenvalue for column stochastic matrix G
We can use the Power Method to compute the I-vector
Page Ranking Issues: Computing the eigenvector..
The Power Method to compute the eigenvector of the dominant eigenvalue
Start with an arbitrary rank vector, I0
Iterate:
I1 = G I0, I2 = G I1, …, Ik+1 = G Ik
Ik converges to the eigenvector of the dominant eigenvalue
→ We only need matrix multiplies with the sparse matrix S (~10 multiplications per column)
[NOTE: by writing G = d S + (1 – d) R, the effect of R is merely to add a fixed
quantity to each entry of I → faster computation]
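A minimal sketch (assuming NumPy; the function name and iteration count are illustrative) of the power iteration that exploits the note above, multiplying by the sparse S instead of ever forming the dense G:

```python
import numpy as np

def pagerank(S, d=0.85, iterations=100):
    # Power method on G = d*S + (1-d)*R without forming G explicitly:
    # if I sums to 1, then R @ I is just the constant vector with entries 1/n.
    n = S.shape[0]
    I = np.full(n, 1.0 / n)             # arbitrary starting rank vector I0
    for _ in range(iterations):
        I = d * (S @ I) + (1 - d) / n   # I_{k+1} = G @ I_k
    return I
```

Combined with the make_S sketch above, pagerank(make_S(A)) would give damped ranks for the small example web; with d = 1 it reduces to the plain power iteration on S.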
Brin and Page:
Typically, Google needs 50-100 iterations to get a good I-vector
Typically, Google's computers take a few days to compute the ranking
Typically, Google updates rankings once per month (the "Google dance")
Concluding notes
Why is web-search important?
- For researchers, the only practical way to find information on the web
- For E-commerce companies: higher rank → larger sales (why?)
This is a major reason for the high stock value of Google, Baidu, …
Current trend:
- Page Ranking method dominates Directory-structured search engines
Technology note:
- Page Ranking: a practical application of abstract Linear Algebra concepts!
References and Further Reading
Books:
Sherman and Price, The Invisible Web: Uncovering Information Sources Search
Engines Can’t See, Cyber Age Books, 2001
[Note: some information in this book is already obsolete]
Web sources:
1. Googleguide: google search guide
2. David Austin’s article on how Google PageRank works
Next: Auctions, Web-transactions, Ebay