Crawling Gnutella Network
By: Samer Al-Kiswany
Roadmap
• Introduction
• Gnutella network structure
• Gnutella protocol overview
• Gnutella crawling protocol
• Crawling topology information
• Crawling node content
EECE 411
Introduction
The Gnutella network is a decentralized peer-to-peer system for file sharing.
• Originally created by Justin Frankel of Nullsoft
• Large scale: up to 4M nodes, 1000 TB of data, and 100M files today
• Fast growth in its early stages: more than 50-fold during the first half of 2001 (and 50-fold again from 2001 to 2006)
• Self-organizing network
• Open, simple, and flexible protocol
Gnutella Network Structure
Gnutella protocol 0.6 uses a two-tier architecture of ultrapeers and leaves.
[Figure: two-tier overlay with the labels "Ultrapeers" and "Leaves"]
Basic Primitives for File Sharing
• Join: How do I begin participating?
• Publish: How do I advertise my file(s)?
• Search: How do I find a file?
• Fetch: How do I retrieve a file?
Gnutella Protocol Overview
• Join: on startup, the client contacts one or more ultrapeer nodes
• Publish: not needed
• Search:
  – Ask the ultrapeer node
  – The ultrapeer propagates the query to other ultrapeers and returns the answers
• Fetch: get the file directly from the peer (HTTP)
Crawling a Gnutella node
By crawling, we are interested in two main pieces of information:
• With whom is the node connected? (Topology information)
  The Gnutella protocol calls this “Crawling/Communicating Network Topology Information”.
• What files is the node sharing with others?
  The Gnutella protocol calls this “Browse Host”.
Crawling Topology Information
Gnutella protocol 0.6 supports crawling network topology information!
[Figure: the crawler sends a "Topo crawl" request into the Gnutella network and receives "Topo information" back]
Topology information:
- Ultrapeers
- Leaves
Topo crawl (crawler to node):
    GNUTELLA CONNECT/0.6
    User-Agent: LimeWire (crawl)
    X-Ultrapeer: False
    Query-Routing: 0.1
    Crawler: 0.1
Topo information (node to crawler):
    GNUTELLA/0.6 200 OK
    User-Agent: BearShare
    Leaves: 127.0.0.1:6346,127.0.0.2:6346
    Peers: 127.0.0.4:6346,127.0.0.5:6346
Handshake acknowledgement (crawler to node):
    GNUTELLA/0.6 200 OK
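The handshake above can be sketched in Java as follows. This is a minimal sketch, not the course's Crawler class: the names TopoCrawlHandshake, buildRequest, parseHeaders, and parseHostList are hypothetical, and it assumes that every header line ends in CRLF and that an empty line closes the header block.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: builds the crawl handshake and parses the reply headers. */
class TopoCrawlHandshake {

    /** Request headers copied from the slide; the CRLF line endings are an assumption. */
    static String buildRequest() {
        return "GNUTELLA CONNECT/0.6\r\n"
             + "User-Agent: LimeWire (crawl)\r\n"
             + "X-Ultrapeer: False\r\n"
             + "Query-Routing: 0.1\r\n"
             + "Crawler: 0.1\r\n"
             + "\r\n";
    }

    /** Splits "Name: value" lines into a map keyed by the lower-cased header name. */
    static Map<String, String> parseHeaders(String response) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : response.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) { // the status line has no colon and is skipped
                headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /** Turns a comma-separated "ip:port" header value into a list; empty if the header is absent. */
    static List<String> parseHostList(String value) {
        List<String> hosts = new ArrayList<>();
        if (value == null) {
            return hosts;
        }
        for (String host : value.split(",")) {
            String trimmed = host.trim();
            if (!trimmed.isEmpty()) {
                hosts.add(trimmed);
            }
        }
        return hosts;
    }
}
```

A crawler would write buildRequest() to the socket, read the reply up to the empty line, and then call parseHostList on the "leaves" and "peers" entries of the parsed map.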
Browsing Node Content
[Figure: the crawler sends a "Browse Host" request to a node in the Gnutella network and receives the list of files]
Browse Host (crawler to node):
    GET / HTTP/1.1
    Host: Crawler_IP:PORT
    User-Agent: UBCECE
    Accept: application/x-gnutella-packets
    Connection: close
List of files (node to crawler):
    HTTP/1.1 200 OK
    Server: LimeWire/x.y
    Content-Type: application/x-gnutella-packets
    Connection: close

    <List of files>
The list of files is returned as Query Hit messages.
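The Browse Host exchange above can be sketched in Java as follows. This is a sketch under stated assumptions, not the reference crawler: BrowseHostRequest and its methods are hypothetical names, the media type is spelled application/x-gnutella-packets (my understanding of the Browse Host extension, so adjust the constant if your peers expect another spelling), and the caller supplies the Host header value.

```java
import java.util.Locale;

/** Hypothetical helper for the Browse Host HTTP exchange. */
class BrowseHostRequest {

    /** Assumed media type for a body made of Gnutella messages. */
    static final String MEDIA_TYPE = "application/x-gnutella-packets";

    /** Builds the GET request; hostHeader is whatever belongs in the Host line. */
    static String buildRequest(String hostHeader) {
        return "GET / HTTP/1.1\r\n"
             + "Host: " + hostHeader + "\r\n"
             + "User-Agent: UBCECE\r\n"
             + "Accept: " + MEDIA_TYPE + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    /** Returns the index of the first body byte, or -1 if the blank line has not arrived yet. */
    static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == '\r' && response[i + 1] == '\n'
                    && response[i + 2] == '\r' && response[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** True if the header block declares Gnutella packets rather than, for example, HTML. */
    static boolean isGnutellaPackets(String headerBlock) {
        for (String line : headerBlock.split("\r\n")) {
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("content-type:")) {
                return lower.contains(MEDIA_TYPE);
            }
        }
        return false;
    }
}
```

Checking the Content-Type before parsing lets a crawler skip peers that answer with an HTML page instead of Query Hit messages.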
Query Hit Parsing
[Figure: a Query Hit message laid out as 1 | 2 | A B C D E F | 3; the HTTP response body holds several such messages back to back]
1 – Gnutella message header. Important field: message length.
2 – Query Hit header. Important field: number of files.
A-F – List of shared files. Each entry includes the file name and size.
3 – Other Gnutella protocol fields.
The HTTP response message may contain more than one Query Hit response.
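The parsing steps above can be sketched in Java as follows. The byte layout here is an assumption based on my reading of the Gnutella 0.4/0.6 message format and should be checked against the protocol reference: a 23-byte message header with the payload type at offset 16 (0x81 for Query Hit) and a little-endian payload length at offset 19; a Query Hit payload that starts with a 1-byte file count followed by 10 bytes of port, IP, and speed; and file entries made of a 4-byte index, a 4-byte size, a NUL-terminated name, and a NUL-terminated extension block. QueryHitParser and SharedFile are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for a Browse Host body holding back-to-back Query Hit messages. */
class QueryHitParser {
    static final int HEADER_LEN = 23;   // assumed Gnutella message header size
    static final int QUERY_HIT = 0x81;  // assumed payload type for Query Hit

    static final class SharedFile {
        final String name;
        final long size;
        SharedFile(String name, long size) { this.name = name; this.size = size; }
    }

    /** Walks every message in the body and collects the files of each Query Hit. */
    static List<SharedFile> parseAll(byte[] body) {
        List<SharedFile> files = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= HEADER_LEN) {
            int start = buf.position();
            int type = buf.get(start + 16) & 0xFF;
            int payloadLen = buf.getInt(start + 19);      // part 1: message length
            if (payloadLen < 0 || payloadLen > buf.limit() - start - HEADER_LEN) {
                break;                                    // truncated or corrupt message
            }
            int end = start + HEADER_LEN + payloadLen;
            if (type == QUERY_HIT) {
                buf.position(start + HEADER_LEN);
                parseHits(buf, end, files);
            }
            buf.position(end);                            // skip part 3 and move to the next message
        }
        return files;
    }

    private static void parseHits(ByteBuffer buf, int end, List<SharedFile> out) {
        if (end - buf.position() < 11) return;
        int count = buf.get() & 0xFF;                     // part 2: number of files
        buf.position(buf.position() + 10);                // skip port (2), IP (4), speed (4)
        for (int i = 0; i < count; i++) {
            if (end - buf.position() < 8) return;
            buf.getInt();                                 // file index, unused here
            long size = buf.getInt() & 0xFFFFFFFFL;       // unsigned 32-bit file size
            String name = readUntilNul(buf, end);
            if (name == null) return;
            if (readUntilNul(buf, end) == null) return;   // skip the extension block
            out.add(new SharedFile(name, size));
        }
    }

    /** Reads bytes up to the next NUL before 'end' and consumes the NUL; null if none is found. */
    private static String readUntilNul(ByteBuffer buf, int end) {
        int from = buf.position();
        for (int i = from; i < end; i++) {
            if (buf.get(i) == 0) {
                byte[] raw = new byte[i - from];
                buf.get(raw);                             // advances the position to the NUL
                buf.get();                                // consume the NUL itself
                return new String(raw, StandardCharsets.UTF_8);
            }
        }
        return null;
    }
}
```

The parser relies on the message length to jump from one message to the next, so the trailing protocol fields never need to be decoded to list the shared files.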
Limitations: Does this always work?
Topology crawling:
• Topology information crawling is not supported by some Gnutella protocol v0.4 implementations.
Host browsing:
• Some Gnutella node implementations (BearShare, for instance) return the list of files in HTML and do not respond with a Query Hit message.
Single Gnutella-Node Crawler
A proof-of-concept implementation of a single Gnutella-node crawler.
The main class implementing the crawling protocol is the Crawler class:
• crawlpeers(ip_address, port)
• parsePeers(byte[] )
• listFiles(ip_address, port)
• processQueryHit(byte[] )
Available at:
http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
Project Phase II
• Implement a single-node Gnutella network crawler
• Report:
  – The active leaf nodes
  – Information regarding the “agent” (i.e., the implementation: LimeWire, BearShare, etc.)
  – The domain name corresponding to the node IP address
Avoid cycles!
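One common way to avoid cycles is to remember every address already seen and never enqueue it twice. The sketch below shows that idea as a breadth-first traversal; it is an illustration, not the required design. CycleFreeCrawl, crawl, and maxNodes are hypothetical names, and neighborsOf stands in for one topology crawl that returns the "ip:port" peers and leaves of a node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical breadth-first crawl that visits each address at most once. */
class CycleFreeCrawl {

    /**
     * Crawls outward from the seed. neighborsOf is a stand-in for a real topology
     * crawl; maxNodes caps the crawl so it stops on a very large network.
     */
    static Set<String> crawl(String seed,
                             Function<String, List<String>> neighborsOf,
                             int maxNodes) {
        Set<String> seen = new LinkedHashSet<>();   // every address discovered so far
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String node = frontier.poll();
            for (String next : neighborsOf.apply(node)) {
                if (seen.size() >= maxNodes) {
                    return seen;
                }
                // add() returns false for an address already seen, which breaks cycles
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return seen;
    }
}
```

For the domain-name part of the report, java.net.InetAddress.getByName(ip).getCanonicalHostName() is one standard-library option; it performs a lookup on the network, so it is left out of the sketch.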
Project Phase III
• Implement a master/worker crawler with Java NIO sockets.
[Figure: the primary master keeps the "Crawled" and "To be Crawled" lists, sends each worker "Crawl the following list: …", and collects "Results: peers IPs, statistics" as the workers crawl the Gnutella network]
Problems? (Hint: failures)
• Adopt primary/backup replication for the manager (the master)
[Figure: a backup master stands beside the primary master, which is marked X (failed); both are shown with the "Crawled" and "To be Crawled" lists and the Gnutella network]
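The "Crawled" and "To be Crawled" lists, together with the failure hint, suggest that the master should also track which worker holds each address so that work can be requeued when a worker fails. The sketch below is one possible shape for that bookkeeping and not a prescribed design: CrawlWorkList and all its methods are hypothetical, it is single-threaded, and it does not show NIO networking or the replication of this state to a backup master.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical master-side bookkeeping for a master/worker crawl. */
class CrawlWorkList {
    private final Deque<String> toCrawl = new ArrayDeque<>();
    private final Map<String, String> inProgress = new HashMap<>(); // address -> worker id
    private final Set<String> crawled = new HashSet<>();

    /** Queues an address unless it is already queued, assigned, or crawled. */
    void addDiscovered(String address) {
        if (!crawled.contains(address) && !inProgress.containsKey(address)
                && !toCrawl.contains(address)) {
            toCrawl.add(address);
        }
    }

    /** Hands the next address to a worker, or returns null if nothing is waiting. */
    String assign(String workerId) {
        String address = toCrawl.poll();
        if (address != null) {
            inProgress.put(address, workerId);
        }
        return address;
    }

    /** Records a result and queues the newly discovered addresses. */
    void complete(String address, List<String> discovered) {
        inProgress.remove(address);
        toCrawl.remove(address);   // covers a late result for an address that was requeued
        crawled.add(address);
        for (String next : discovered) {
            addDiscovered(next);
        }
    }

    /** Requeues everything the failed worker was holding; returns how many addresses moved. */
    int workerFailed(String workerId) {
        int requeued = 0;
        Iterator<Map.Entry<String, String>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (entry.getValue().equals(workerId)) {
                toCrawl.add(entry.getKey());
                it.remove();
                requeued++;
            }
        }
        return requeued;
    }

    boolean isDone() { return toCrawl.isEmpty() && inProgress.isEmpty(); }

    int crawledCount() { return crawled.size(); }
}
```

Keeping all three collections in one object means the state a backup would need to mirror is in a single place.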
Previous Years' Ideas – Part I
Programming languages / frameworks / protocols
• Java (the vast majority)
• Scala
• Apache MINA framework
• Java RMI
• Jython
• XML-RPC
• SQL
• Python/Perl/Shell/cron jobs
Architecture
• Master/worker (the majority)
• Hierarchical
Previous Years' Ideas – Part II
Design choices
• NIO at both master and workers
• Careful load balancing
• Keep the workers always busy
• Bootstrapping new workers if old workers fail
Additional bells and whistles
• GUI manager
• Statistics in real-time through GUI and web page
• Graphviz
References
• Single Gnutella-Node Crawler: http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
• Gnutella crawling protocol: http://www.ece.ubc.ca/~samera/TA/project/Gnuttela-Protocol.html
Other references:
• http://gnutella-specs.rakjar.de/index.php/Main_Page
• www.limewire.com
Thank you
www.ece.ubc.ca/~samera