Information Retrieval - SEAS
Download
Report
Transcript Information Retrieval - SEAS
Scalable Web Crawling
and Basic Transactions
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
March 26, 2016
Administrivia
Emailed list of project partners due Friday
… For those without 4-person groups, I will try to
assign / merge groups over the weekend
… This might result in breaking up some 3 person groups
2
Mercator: Scalable Web Crawler
Expands a “URL frontier”
Avoids re-crawling same URLs
Also considers whether a document has been seen
before
Every document has signature/checksum info computed as
it’s crawled
3
Mercator Web Crawler
1.
2.
3.
4.
Dequeue frontier URL
Fetch document
Record into RewindStream (RIS)
Check against fingerprints to
verify it’s new
5.
6.
7.
8.
Extract hyperlinks
Filter unwanted links
Check if URL repeated
(compare its hash)
Enqueue URL
4
Mercator’s Polite Frontier Queues
Tries to go beyond breadth-first approach – want to
have only one crawler thread per server
Distributed URL frontier queue:
One subqueue per worker thread
The worker thread is determined by hashing the
hostname of the URL
Thus, only one outstanding request per web server
5
Mercator’s HTTP Fetcher
First, needs to ensure robots.txt is followed
Caches the contents of robots.txt for various web sites as
it crawls them
Designed to be extensible to other protocols
Had to write own HTTP requestor in Java – their
Java version didn’t have timeouts
Today, can use setSoTimeout()
Can also use Java non-blocking I/O if you wish:
http://www.owlmountain.com/tutorials/NonBlockingIo.htm
But they use multiple threads and synchronous I/O
6
Other Caveats
Infinitely long URL names (good way to get a buffer
overflow!)
Aliased host names
Alternative paths to the same host
Can catch most of these with signatures of
document data (e.g., MD5)
Crawler traps (e.g., CGI scripts that link to
themselves using a different name)
May need to have a way for human to override certain
URL paths – see Section 5 of paper
7
Mercator Document Statistics
PAGE TYPE PERCENT
text/html
69.2%
image/gif
17.9%
image/jpeg
8.1%
text/plain
1.5%
pdf
0.9%
audio
0.4%
zip
0.4%
postscript
0.3%
other
1.4%
Histogram of document sizes
(60M pages)
Further Considerations
May want to prioritize certain pages as being most
worth crawling
Focused crawling tries to prioritize based on relevance
May need to refresh certain pages more often
9
Web Search Summarized
Two important factors:
Indexing and ranking scheme that allows most relevant
documents to be prioritized highest
Crawler that manages to be (1) well-mannered, (2) avoid
traps, (3) scale
We’ll be using Pastry to distribute the work of
crawling and to distribute the data (what Google
calls “barrels”)
10
We Need More Than Synchronization
What needs to happen when you…
Click on “purchase” on Amazon?
Suppose you purchased by credit card?
Use online bill-paying services from your bank?
Place a bid in an eBay-like auction system?
Order music from iTunes?
What if your connection drops in the middle of downloading?
Is this more than a case of making a simple Web
Service (-like) call?
11
Transactions Are a
Means of Handling Failures
There are many (especially, financial) applications
where we want to create atomic operations that
either commit or roll back
This is one of the most basic services provided by
database management systems, but we want to do it
in a broader sense
Part of “ACID” semantics…
12
ACID Semantics
Atomicity: operations are atomic, either committing
or aborting as a single entity
Consistency: the state of the data is internally
consistent
Isolation: all operations act as if they were run by
themselves
Durability: all writes stay persistent!
13
A Problem Confronted by eBay
eBay wants to sell an item to:
The highest bidder, once the auction is over, or
The person who’s first to click “Buy It Now!”
But:
What if the bidder doesn’t have the cash?
A solution:
Record the item as sold
Validate the PayPal or credit card info with a 3rd party
If not valid, discard this bidder and resume in prior state
14
“No Payment” Isn’t the Only
Source of Failure
Suppose we start to transfer the money, but a
server goes down…
CRASH!
Purchase:
sb = Seller.bal
bb = Buyer.bal
Write Buyer.bal= bb - $100
Write Item.sellTo = Buyer
Write Seller.bal= sb + $100
15
Providing Atomicity and Consistency
Database systems provide transactions with the
ability to abort a transaction upon some failure
condition
Based on transaction logging – record all operations and
undo them as necessary
Database systems also use the log to perform
recovery from crashes
Undo all of the steps in a partially-complete transaction
Then redo them in their entirety
This is part of a protocol called ARIES
16
The Need for Isolation
Suppose eBay seller S has a bank account that we’re
depositing money into, as people buy:
S = Accounts.Get(1234)
Write S.bal = S.bal + $50
What if two purchases occur simultaneously, from two
different servers on different continents?
17
Concurrent Deposits
This update code is represented as a sequence of
read and write operations on “data items” (which
for now should be thought of as individual accounts):
Deposit 1
Read(S.bal)
S.bal := S.bal + $50
Write(S.bal)
Deposit 2
Read(S.bal)
S.bal:= S.bal + €10
Write(S.bal)
where S is the data item representing the seller’s
account # 1234
18
A “Bad” Concurrent Execution
Only one action (e.g. a read or a write) can actually
happen at a time for a given database, and we can
interleave deposit operations in many ways:
BAD!
Deposit 1
Read(S.bal)
Deposit 2
Read(S.bal)
time
S.bal := S.bal + $50
S.bal:= S.bal + €10
Write(S.bal)
Write(S.bal)
19
A “Good” Execution
Previous execution would have been fine if the accounts
were different (i.e. one were S and one were T), i.e.,
transactions were independent
The following execution is a serial execution, and executes
one transaction after the other:
Deposit 1
Deposit 2
Read(S.bal)
GOOD! S.bal := S.bal + $50
write(S.bal)
time
Read(S.bal)
S.bal:= S.bal + $10
Write(S.bal)
20
Good Executions
An execution is “good” if it is serial (transactions are
executed atomically and consecutively) or serializable (i.e.
equivalent to some serial execution)
Deposit 1
Deposit 3
read(S.bal)
read(T.bal)
S.bal := S.bal + $50
T.bal:= T.bal + €10
write(S.bal)
write(T.bal)
Equivalent to executing Deposit 1 then 3, or vice versa
Why would we want to do this instead?
21
Concurrency Control
A means of ensuring that transactions are
serializable
There are many methods, of which we’ll see one
Lock-based concurrency control (2-phase locking)
Optimistic concurrency control (no locks – based on
timestamps)
Multiversion CC
…
22