Efficient Crawling Through URL Ordering

By: Junghoo Cho, Hector Garcia-Molina, and Lawrence Page
Presenters:
Omkar S. Kasinadhuni
Simerjeet Kaur
Overview
1) Introduction
2) Importance metrics
3) Problem Definition
4) Ordering Metrics
5) Experiments
BackLink-based Crawler
Small-scale Crawler
Similarity-based Crawler
6) Conclusion
Introduction
Web Crawler:
An automated program that accesses a web site and traverses through the site by
following all the links present on the pages.
How does a web crawler work?
– It starts scanning with an initial page p0.
– It collects the URLs from page p0 and stores them in a queue.
– The scanned pages are given to the client (for indexing, summarizing, or for analysis).
– It repeats the same process with each URL in the queue (a minimal sketch of this loop follows below).
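As an illustration only (not the authors' implementation), here is a minimal Python sketch of that loop; the fetching and link-extraction helpers are simplified stand-ins built on the standard library.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100):
    """FIFO crawl: fetch a page, hand it to the client, queue its links, repeat."""
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                      # skip pages that cannot be fetched
        visited.add(url)
        # Here the page would be handed to the client (indexing, summarizing, ...).
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # collect discovered URLs into the queue
    return visited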
Introduction contd..
Designing a Web Crawler is Challenging – Why?
Introduction contd..
– Externally, the crawler should avoid overloading websites or network links.
– Internally, it has to deal with huge volumes of data.
– Storage capacity on the client is limited: the authors note that the entire Web is about 1.5 TB
( http://www.worldwidewebsize.com/ )
– It must revisit previously scanned pages to pick up the latest changes.
“The Crawler cannot visit the entire web”
Introduction contd..
The crawler can visit only a limited number of web pages.
- Therefore, we want the crawler to visit the “important” pages first.
- Given a list of URLs, the crawler must use an “efficient order” to visit the
“important” pages first.
How do we decide that a page is “important”?
Importance Metrics
The importance of a Web page p is represented by I(p).
It is defined by the following metrics:
– Similarity to a Driving Query Q: IS(p)
– BackLink Count: IB(p)
– PageRank: IR(p)
– Forward Link Count: IF(p)
– Location Metric: IL(p)
Similarity to a Driving Query Q: IS(p)
Importance is defined as the textual similarity between page p and query Q.
– Visualize a document (p or Q) as an n-dimensional vector (w1, w2, …, wn),
where wi corresponds to the ith word in the vocabulary.
– If the ith word does not appear in the document, wi is zero;
otherwise, wi is set to represent the significance of the word.
Two ways to compute significance:
– Significance = (number of times the word appears in the document) * IDF,
where the Inverse Document Frequency (IDF) = 1 / (number of times
the word appears in the entire collection).
– Significance based on where the word appears in the page (e.g., in the page title,
body, etc.).
Similarity Metric IS(p) contd.
Two ways to calculate the similarity between page p and Q:
– Calculate the inner product of the p and Q vectors.
– Use the cosine similarity measure, i.e., the inner product of the normalized vectors.
IDF is estimated using the already crawled pages, not the entire Web.
The similarity importance metric calculated with an estimated IDF
is referred to as the estimated importance of the page, IS'(p).
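A minimal Python sketch of these definitions, assuming simple whitespace tokenization and using the pages crawled so far as the collection for the IDF estimate (the example texts are made up):

import math
from collections import Counter

def tfidf_vector(text, collection):
    """Weight each word by (occurrences in the document) * IDF, where IDF is
    estimated from the pages crawled so far rather than the entire Web."""
    counts = Counter(text.lower().split())
    vector = {}
    for word, count in counts.items():
        appearances = sum(doc.lower().split().count(word) for doc in collection)
        idf = 1.0 / appearances if appearances else 0.0
        vector[word] = count * idf
    return vector

def cosine_similarity(v1, v2):
    """Inner product of the two normalized vectors."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Estimated importance IS'(p) of a page with respect to a driving query.
crawled_so_far = ["computer science department", "housing and dining", "computer systems lab"]
page_text = "the computer systems laboratory studies computer architecture"
query = "computer"
print(cosine_similarity(tfidf_vector(page_text, crawled_so_far),
                        tfidf_vector(query, crawled_so_far)))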
BackLink Count IB(p)
The backlink count of page p, IB(p), is the number of links to page p that
appear over the entire Web.
Instead of the exact backlink count, we use an estimated backlink
count IB'(p) – Why?
Note: links act as endorsements (discussed in the previous papers).
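A crawler cannot know the true IB(p) without seeing the whole Web, so it keeps a running count over the links discovered so far. A minimal sketch of such an estimate, with illustrative helper and variable names (not the paper's code):

from collections import defaultdict

def update_backlink_estimates(estimates, source_url, out_links):
    """Add the links found on a freshly downloaded page to the running
    estimate IB'(p). Only links seen so far are counted, which is why the
    estimate can differ from the true backlink count IB(p)."""
    for target in set(out_links):
        if target != source_url:       # ignore self-links
            estimates[target] += 1
    return estimates

estimates = defaultdict(int)
update_backlink_estimates(estimates, "http://a.example/", ["http://b.example/", "http://c.example/"])
update_backlink_estimates(estimates, "http://c.example/", ["http://b.example/"])
# estimates["http://b.example/"] == 2, the highest IB'(p) seen so far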
PageRank
IR(p) recursively defines the importance of a page as the weighted sum of
the importance of all the pages that have backlinks to p:
IR(p) = (1 - d) + d [ IR(t1)/c1 + … + IR(tn)/cn ]
- Page p is pointed to by pages t1, …, tn.
- ci is the number of links going out of page ti.
- d is the damping factor: with probability (1 - d) the random surfer jumps to a
completely random page that has no relation to the current page.
- IR(p) gives the probability that the random surfer is at page p at any given
time.
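A small Python sketch that evaluates this formula iteratively over a toy link graph; the graph, the iteration count, and d = 0.9 (the value used later in the experiments) are illustrative choices:

def pagerank(out_links, d=0.9, iterations=50):
    """Iteratively evaluate IR(p) = (1 - d) + d * [ IR(t1)/c1 + ... + IR(tn)/cn ]
    over all pages ti that link to p, where ci is ti's number of out-links."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for source, targets in out_links.items():
            if targets:
                share = d * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: every page links to "home", so "home" ends up with the highest IR(p).
graph = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
print(pagerank(graph))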
Forward Link Count: IF(p)
- IF(p) counts the number of links that emanate from page p (e.g., a Web
directory).
- IF'(p) = IF(p) – Why?
Location Metric
In this metric, the importance of a page is based on its location on the Web,
rather than on its contents.
Examples:
- URLs ending with .com may be more important.
- URLs containing the string ‘home’ may be more important than others.
- URLs with fewer slashes may be more important.
- All these metrics can be evaluated simply by looking at the URL (hence
location metrics); a small scoring sketch follows below.
Our importance metric could be a combination of any or all of these metrics.
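A tiny sketch of one possible location ordering that scores URLs with exactly these heuristics; the weights are arbitrary and purely illustrative:

from urllib.parse import urlparse

def location_score(url):
    """Score a URL using only its text: a .com host, the string 'home', and a
    shallow path (few slashes) are all treated as hints of importance."""
    parsed = urlparse(url)
    score = 0.0
    if parsed.netloc.endswith(".com"):
        score += 1.0
    if "home" in url.lower():
        score += 1.0
    score -= 0.1 * parsed.path.count("/")    # penalize deeply nested paths
    return score

urls = ["http://db.stanford.edu/a/b/c/d.html", "http://example.com/home.html"]
print(sorted(urls, key=location_score, reverse=True))   # the .com 'home' page comes first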
Problem Definition
- Our task is to find a scheme for selecting pages in decreasing order
of importance.
- Depending on how we want the crawler to operate, we have three
schemes:
  - Crawl & Stop
  - Crawl & Stop with Threshold
  - Limited Buffer Crawl
Crawl & Stop Model
- The crawler C starts at an initial page p0 and stops after visiting K pages
(r1, r2, …, rK).
- For a perfect crawler, I(r1) > I(r2) > … > I(rK).
- We call the pages r1, …, rK that a perfect crawler would visit the “hot” pages.
- A real crawler would actually visit only M pages with importance higher than or
equal to I(rK).
- The performance of the crawler C is Pcs(C) = M/K.
(Note: Pcs(C) = 1 for a perfect crawler.)
- The performance of a random crawler is K/T,
where T is the total number of pages in the entire Web.
- The expected number of hot pages visited when a random crawler stops is K²/T,
which gives the K/T performance.
Crawl & Stop with Threshold Model
– Similar to the Crawl & Stop model.
– We have an importance target G, i.e., a page is hot if I(p) ≥ G.
– H is the total number of hot pages.
If K < H, then the performance of an ideal crawler is K/H.
If K ≥ H, then the performance is 1.
– A purely random crawler is expected to visit (H/T)·K hot pages when it stops,
so its performance is K/T; it reaches 1 only when it visits all T pages.
(A small sketch of both performance computations follows below.)
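A short sketch of how these two performance numbers could be computed offline, once the true importance of every page is known; the function and variable names are ours, not the paper's:

def crawl_and_stop_performance(visited, importance, k):
    """Pcs(C) = M / K: M is how many of the first K visited pages have
    importance at least I(rK), the K-th highest importance overall."""
    ranked = sorted(importance.values(), reverse=True)
    hot_cutoff = ranked[k - 1]                      # I(rK)
    m = sum(1 for page in visited[:k] if importance[page] >= hot_cutoff)
    return m / k

def threshold_performance(visited, importance, k, g):
    """Crawl & Stop with Threshold: fraction of the H hot pages (I(p) >= G)
    that appear among the first K visited pages."""
    h = sum(1 for score in importance.values() if score >= g)
    hits = sum(1 for page in visited[:k] if importance[page] >= g)
    return hits / h if h else 1.0

importance = {"p0": 9, "p1": 7, "p2": 3, "p3": 1}
visited = ["p0", "p2", "p1"]                        # order actually crawled
print(crawl_and_stop_performance(visited, importance, k=2))   # 0.5: only p0 is hot among the first 2
print(threshold_performance(visited, importance, k=3, g=7))   # 1.0: both hot pages were visited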
Limited Buffer Crawl Model
- The crawler has limited storage.
- The crawler can store at most B pages.
- When the buffer fills up, the crawler must flush some of its pages.
- An ideal crawler will flush the pages with the least importance in the buffer.
- A real crawler must guess which pages in the buffer will eventually have low
I(p).
- The performance PBC(C) is the fraction of the B pages in the buffer that are hot.
- Hot pages are those with I(p) ≥ G or I(p) ≥ I(rB),
where rB is the page with the Bth highest importance.
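A minimal sketch of the flush policy described above, evicting the page with the lowest (estimated) importance whenever the buffer overflows; the names and capacity are illustrative:

def add_to_buffer(buffer, page, estimated_importance, capacity):
    """Keep at most `capacity` pages; when the buffer overflows, flush the page
    with the lowest estimated importance. A perfect crawler would flush based
    on the true I(p); a real crawler only has an estimate to guess with."""
    buffer[page] = estimated_importance
    if len(buffer) > capacity:
        coldest = min(buffer, key=buffer.get)
        del buffer[coldest]
    return buffer

buffer = {}
for page, score in [("a", 0.9), ("b", 0.2), ("c", 0.7), ("d", 0.4)]:
    add_to_buffer(buffer, page, score, capacity=3)
# buffer now holds the three pages with the highest estimated importance: a, c, d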
Ordering Metrics
“How should a crawler select URLs to scan from its queue of known URLs, and in which order?”
Ordering Metrics
– The crawler uses an ordering metric O, i.e., it selects the URL u for which
O(u) is the highest in the queue.
– The ordering metric should be based on some importance metric.
– E.g., if IB(p) is the importance metric, then O(u) = IB'(p),
where URL u points to page p.
– Similarly, we can use IR'(p) or IL(p).
Why is it difficult to use the forward-link metric IF(p) or the similarity metric?
– We can use O(u) = IS(A, Q), where A is the anchor text of the URL u and Q is
the driving query.
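One natural way to implement this selection is a priority queue keyed by the ordering metric; the sketch below assumes the caller supplies O(u) as a function and uses an estimated backlink count as an example:

import heapq

class Frontier:
    """Queue of discovered URLs; pop() returns the URL u whose ordering
    metric O(u) was highest when the URL was inserted."""
    def __init__(self, ordering_metric):
        self.ordering_metric = ordering_metric    # callable: url -> O(u)
        self.heap = []
        self.seen = set()

    def push(self, url):
        if url not in self.seen:
            self.seen.add(url)
            # heapq is a min-heap, so store the negated score
            heapq.heappush(self.heap, (-self.ordering_metric(url), url))

    def pop(self):
        return heapq.heappop(self.heap)[1]

# Example O(u): the estimated backlink count IB'(p) of the page u points to.
estimated_backlinks = {"http://x.example/": 5, "http://y.example/": 1}
frontier = Frontier(lambda u: estimated_backlinks.get(u, 0))
frontier.push("http://y.example/")
frontier.push("http://x.example/")
print(frontier.pop())   # http://x.example/ has the higher O(u)

Note that estimates such as IB'(p) change as the crawl proceeds, so a real implementation would need to re-score queued URLs periodically; this sketch evaluates O(u) only once, when the URL is inserted.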
Experiments
- With the above techniques, a web crawler was built and was
ready to be tested.
- In order to avoid congestion and heavy load on the servers, the authors
decided to run the experiments in two steps:
  1) Physically crawl Stanford Web pages and build a local repository of
     pages using Stanford WebBase (a system designed to create and
     maintain large Web repositories).
  2) After building the repository, use virtual crawlers to evaluate
     different crawling schemes.
Why was the repository used instead of physically re-crawling stanford.edu?
Experiments contd
- To form the repository, WebBase was started with an initial list of stanford.edu
URLs; about 89,119 URLs were found in the initial crawl.
- But the number was reduced to avoid:
  - automatically generated, effectively infinite pages, e.g., URLs containing “/cgi-bin/”;
  - undesirable URLs (eliminated by location metrics);
  - pages excluded by the Robot Exclusion Protocol, which further reduced the dataset.
- There were 784,592 known URLs to Stanford pages, but 352,944 of these
known URLs were on the server http://www.slac.stanford.edu.
- Among these, only 225,000 pages were valid HTML pages, totaling about
2.5 GB.
- Out of these 225,000 pages, 46,000 were unreachable from the starting
point, leaving 179,000.
BackLink-based Crawlers:
- The effectiveness of various ordering metrics is studied, where the importance
is measured through backlinks (IB(p) or IR(p)).
- The crawler starts from the Stanford homepage; for the PageRank metric,
a damping factor d of 0.9 is used.
Figure 1: Basic crawling algorithm
BackLink-based Crawlers contd.
- Under the different hot-page definitions, about H = 85,000 (45%), 17,500 (10%), and 1,400
(0.8%) pages out of the 179,000 total Web pages were considered to be hot.
What do you infer from the graph?
BackLink-based Crawlers contd.
– The IB'(p) crawler behaved more like a depth-first crawler, while the PageRank
crawler combined breadth-first and depth-first behavior.
– Since the backlink information is biased by the starting point, it is skewed;
using it, the crawler ends up finding locally hot pages rather than globally hot
pages.
BackLink-based Crawlers contd.
– This result is not limited to the Crawl & Stop with Threshold model.
BackLink-based Crawlers contd.
– When this experiment (G = 13) was repeated with the backlink importance metric,
the PageRank ordering metric again showed the best results.
Small-scale Crawler
– A small-scale crawler is a crawler whose client is interested only in a small
portion of the Web (e.g., when the client wants to create a mirror).
– An experiment was conducted to see the impact of scale on performance.
– This time, only the pages in the Stanford Database Group (on the server
http://www-db.stanford.edu) were considered.
– This subset of the Stanford pages contained only about 1,100 HTML pages.
Small-scale Crawler
– The results show that all the ordering metrics perform even worse than a random crawler.
Small-scale Crawler contd..
- BackLink is not a good importance metric for a small domain, because in
small domains most of the pages have very few backlinks.
- Also, the number of backlinks is very sensitive to a page creator's style.
- IR(p) is also not a good importance metric in this case.
Small-scale crawling rules out our initial assumption that links to a page
endorse that page.
Small-scale Crawler contd..
– The performance is still not as good as in the large Stanford domain, but it is
much better compared to the earlier case where backlink importance
was used.
Similarity-based Crawlers
- Here the importance metric is similarity-based.
- In this experiment, a page is considered hot if it contains the word ”Computer”
in its title, or if it has more than 10 occurrences of the word ”Computer” in its
body. (A small predicate sketch follows below.)
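This hot-page definition is easy to state as a predicate; a small sketch, assuming the title and body text have already been extracted from the page:

import re

def is_hot(title, body, keyword="computer", min_occurrences=10):
    """A page is hot if the keyword appears in its title, or appears more than
    `min_occurrences` times in its body (case-insensitive)."""
    if keyword.lower() in title.lower():
        return True
    occurrences = len(re.findall(re.escape(keyword), body, flags=re.IGNORECASE))
    return occurrences > min_occurrences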
Similarity-based Crawlers contd..
- The results show that the backlink crawler and the PageRank crawler behave
no better than a random crawler.
- Only the breadth-first crawler gave a reasonable result.
- All three crawlers differ only in their ordering scheme and are neutral to
page content.
- This unexpected performance comes from the breadth-first crawler's FIFO
nature: pages tend to link to pages with similar content, so the URLs found on a
hot page, visited soon afterwards in FIFO order, are likely to be hot as well.
Similarity-based Crawlers contd.
All the crawlers performed better than in the earlier experiment, and the gap between
the breadth-first crawler and the others decreased. The breadth-first crawler is
still superior to the other two.
Similarity-based crawlers contd..
Discussion
– What do you think is the best metric for assigning importance to a page?
– What combination of importance metrics should we use to get the best
results?
– We are using estimated metrics to assign importance to a page. How far can
we trust the estimated values?
– The Web contains huge amounts of replicated data. How does this
replicated data affect the crawler's performance?
Conclusion
- The PageRank ordering metric is an excellent ordering metric when
either pages with many backlinks or pages with high PageRank are sought.
- With a good ordering strategy, we can build crawlers that obtain a
significant portion of the hot pages relatively early.
- One limitation of this experiment is that it was run only over Stanford
University Web pages. Nevertheless, the Stanford University Web pages
are a reasonable sample.