Transcript Slide 1
Crawling the Gnutella Network
By: Samer Al-Kiswany
1
Roadmap
• Introduction
• Gnutella network structure
• Gnutella protocol overview
• Gnutella crawling protocol
• Crawling topology information
• Crawling node content
EECE 411
2
Introduction
The Gnutella network is a decentralized peer-to-peer system for file sharing.
Originally created by Justin Frankel of Nullsoft
Large scale: today up to 4M nodes, 1000 TB of data, 100M files
Fast growth in its early stages: more than 50-fold during the first half of 2001 (and 50-fold again from 2001 to 2006)
Self-organizing network
Open, simple and flexible protocol
3
Gnutella Network Structure
Gnutella Protocol 0.6
Two-tier architecture of ultrapeers and leaves
[Diagram: ultrapeers form the top tier; leaves attach to ultrapeers]
5
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
7
Gnutella Protocol Overview
Join: on startup, the client contacts one or more ultrapeer nodes
Publish: not needed
Search:
• Ask the ultrapeer node
• The ultrapeer propagates the query to other ultrapeers and returns the answers
Fetch: get the file directly from the peer (HTTP)
8
Crawling a Gnutella node
When crawling a node, we are interested in two main pieces of information:
• With whom is the node connected? (topology information)
  The Gnutella protocol calls this “Crawling/Communicating Network Topology Information”
• What files is the node sharing with others?
  The Gnutella protocol calls this “Browsing Host”
10
Crawling Topology Information
Gnutella protocol 0.6 supports crawling network topology information!
[Diagram: the crawler sends a “Topo crawl” request to the Gnutella network and receives “Topo information” in return]
Topology information:
- Ultrapeers
- Leaves
11
Crawling Topology Information
Topo crawl (crawler to node):
GNUTELLA CONNECT/0.6
User-Agent: LimeWire (crawl)
X-Ultrapeer: False
Query-Routing: 0.1
Crawler: 0.1
Topo information (node to crawler):
GNUTELLA/0.6 200 OK
User-Agent: BearShare
Leaves: 127.0.0.1:6346,127.0.0.2:6346
Peers: 127.0.0.4:6346,127.0.0.5:6346
Final acknowledgement (crawler to node):
GNUTELLA/0.6 200 OK
12
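The handshake above can be sketched in Java as follows. This is a minimal sketch, not the course's reference crawler: the class name TopologyHandshake and its method names are hypothetical, the header values are copied from the slide, and it assumes each header line ends with CRLF and the header block ends with an empty line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds the crawl request and reads Leaves/Peers from the reply. */
final class TopologyHandshake {

    /** Crawler's connect request; header values are copied from the slide. */
    static String buildCrawlRequest() {
        return "GNUTELLA CONNECT/0.6\r\n"
             + "User-Agent: LimeWire (crawl)\r\n"
             + "X-Ultrapeer: False\r\n"
             + "Query-Routing: 0.1\r\n"
             + "Crawler: 0.1\r\n"
             + "\r\n"; // assumption: an empty line terminates the header block
    }

    /** True if the node answered the connect request with a 200 status line. */
    static boolean isAccepted(String response) {
        return response.startsWith("GNUTELLA/0.6 200");
    }

    /** Returns the comma-separated entries of one header, or an empty list if absent. */
    static List<String> headerValues(String response, String headerName) {
        List<String> values = new ArrayList<>();
        for (String line : response.split("\r?\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) continue; // status line or blank line
            if (!line.substring(0, colon).trim().equalsIgnoreCase(headerName)) continue;
            for (String entry : line.substring(colon + 1).split(",")) {
                if (!entry.trim().isEmpty()) values.add(entry.trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample reply taken from the slide.
        String response = "GNUTELLA/0.6 200 OK\r\n"
                + "User-Agent: BearShare\r\n"
                + "Leaves: 127.0.0.1:6346,127.0.0.2:6346\r\n"
                + "Peers: 127.0.0.4:6346,127.0.0.5:6346\r\n\r\n";
        System.out.println("accepted = " + isAccepted(response));
        System.out.println("agent    = " + headerValues(response, "User-Agent"));
        System.out.println("leaves   = " + headerValues(response, "Leaves"));
        System.out.println("peers    = " + headerValues(response, "Peers"));
    }
}
```

The same header lookup also yields the User-Agent value, which is the “agent” information a crawler may want to report.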
Browsing Node Content
[Diagram: the crawler sends a “Browse Host” request to a node in the Gnutella network and receives the list of files in return]
13
Browsing Node Content
Browse Host (crawler to node):
GET / HTTP/1.1
Host: Crawler_IP:PORT
User-Agent: UBCECE
Accept: application/x-gnutellapackets
Connection: close
List of files (node to crawler):
HTTP/1.1 200 OK
Server: LimeWire/x.y
Content-Type: application/x-gnutellapackets
Connection: close
<List of files>
The list of files is returned as Query Hit messages.
14
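The Browse Host exchange above can be sketched in Java as follows. This is a minimal sketch under stated assumptions: the class name BrowseHostHttp and its methods are hypothetical, the media type string is copied from the slide as written, and the response is assumed to be fully read into a byte array before it is inspected.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for the Browse Host request and the first look at the reply. */
final class BrowseHostHttp {
    // Media type exactly as written on the slide (assumption: nodes use this spelling).
    static final String GNUTELLA_PACKETS = "application/x-gnutellapackets";

    /** Builds the HTTP request shown on the slide. */
    static String buildRequest(String hostHeader, String userAgent) {
        return "GET / HTTP/1.1\r\n"
             + "Host: " + hostHeader + "\r\n"
             + "User-Agent: " + userAgent + "\r\n"
             + "Accept: " + GNUTELLA_PACKETS + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    /** Index of the first body byte (just after the blank line), or -1 if not found. */
    static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == '\r' && response[i + 1] == '\n'
                    && response[i + 2] == '\r' && response[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** True if the Content-Type header announces Gnutella packets rather than, say, HTML. */
    static boolean isGnutellaPackets(byte[] response) {
        int bodyStart = bodyOffset(response);
        if (bodyStart < 0) return false;
        String head = new String(response, 0, bodyStart, StandardCharsets.ISO_8859_1);
        for (String line : head.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) continue; // status line
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
            if (name.equalsIgnoreCase("Content-Type")) return value.startsWith(GNUTELLA_PACKETS);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("Crawler_IP:PORT", "UBCECE"));
    }
}
```

Checking the Content-Type before parsing lets a crawler skip nodes that answer in another format instead of feeding their reply to a binary parser.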
Query Hit Parsing
[Diagram: each Query Hit message is laid out as sections 1 | 2 | A B C D E F | 3; several Query Hit messages follow one another in the HTTP response]
1 – Gnutella message header
    important field: message length
2 – Query Hit header
    important field: number of files
A-F – list of shared files
    includes file name and size
3 – Other Gnutella protocol fields
The HTTP response message may contain more than one Query Hit response.
15
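The parsing steps above can be sketched in Java as follows. This is a minimal sketch, not the reference processQueryHit: the class QueryHitParser is hypothetical, and the byte layout is an assumption taken from the Gnutella 0.4 message format (23-byte header with the payload type at offset 16 and a little-endian payload length at offset 19; Query Hit type 0x81; an 11-byte Query Hit header whose first byte is the number of files; each file entry is a 4-byte index, a 4-byte size, a NUL-terminated name and a NUL-terminated extension block).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for an HTTP body holding one or more Query Hit messages. */
final class QueryHitParser {
    static final int HEADER_LEN = 23;  // section 1: Gnutella message header
    static final int QUERY_HIT = 0x81; // payload type of a Query Hit (assumed layout)

    static final class SharedFile {
        final String name;
        final long size;
        SharedFile(String name, long size) { this.name = name; this.size = size; }
    }

    /** Walks every message in the body and collects the files of each Query Hit. */
    static List<SharedFile> parse(byte[] body) {
        List<SharedFile> files = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (pos + HEADER_LEN <= body.length) {
            int type = body[pos + 16] & 0xFF;
            int payloadLen = buf.getInt(pos + 19); // the "message length" field
            int payloadStart = pos + HEADER_LEN;
            if (payloadLen < 0 || payloadLen > body.length - payloadStart) break; // truncated
            if (type == QUERY_HIT) {
                parseFiles(buf, body, payloadStart, payloadStart + payloadLen, files);
            }
            pos = payloadStart + payloadLen; // jump to the next message
        }
        return files;
    }

    private static void parseFiles(ByteBuffer buf, byte[] body, int start, int end,
                                   List<SharedFile> out) {
        if (end - start < 11) return;
        int count = body[start] & 0xFF; // section 2: the "number of files" field
        int p = start + 11;             // skip count(1) + port(2) + IP(4) + speed(4)
        for (int i = 0; i < count && p + 8 < end; i++) {
            long size = buf.getInt(p + 4) & 0xFFFFFFFFL; // unsigned 32-bit file size
            p += 8;                                       // skip file index + file size
            int nameEnd = indexOfZero(body, p, end);
            if (nameEnd < 0) return;
            String name = new String(body, p, nameEnd - p, StandardCharsets.UTF_8);
            int extEnd = indexOfZero(body, nameEnd + 1, end);
            if (extEnd < 0) return;
            out.add(new SharedFile(name, size));
            p = extEnd + 1;
        }
        // Section 3 (remaining protocol fields) is ignored by this sketch.
    }

    private static int indexOfZero(byte[] b, int from, int end) {
        for (int i = from; i < end; i++) {
            if (b[i] == 0) return i;
        }
        return -1;
    }
}
```

Calling QueryHitParser.parse on the bytes that follow the HTTP headers returns one SharedFile per listed file; using the message length to hop from one message to the next is what handles a response that contains more than one Query Hit.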
Limitations - Does this always work ?
Topology crawling:
• Topology information crawling is not supported by some Gnutella protocol v0.4 implementations
Host browsing:
• Some Gnutella node implementations (BearShare, for instance) return the list of files in HTML instead of responding with Query Hit messages
16
Single Gnutella-Node Crawler
A proof-of-concept implementation of a single Gnutella-node crawler.
The main class that implements the crawling protocol is the Crawler class:
• crawlpeers(ip_address, port)
• parsePeers(byte[] )
• listFiles(ip_address, port)
• processQueryHit(byte[] )
Available through the following link
http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
18
Project Phase II
• Implement a single-node Gnutella network crawler
• Report:
  - The active leaf nodes
  - Information regarding the “agent” (i.e., the implementation: LimeWire, BearShare, etc.)
  - The domain name corresponding to the node IP address
• Avoid cycles!
19
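One way to avoid cycles is to remember every address already queued, as in the sketch below. This is a minimal sketch under stated assumptions: the class CycleSafeCrawl is hypothetical, the neighboursOf function stands in for whatever call returns a node's peers and leaves, and the reverse lookup relies on InetAddress.getCanonicalHostName, which falls back to the textual IP address when no name can be resolved.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical breadth-first crawl that visits each address at most once. */
final class CycleSafeCrawl {

    /** Returns the addresses in the order they were crawled. */
    static List<String> crawl(String seed, Function<String, List<String>> neighboursOf) {
        Set<String> seen = new HashSet<>();       // every address ever queued
        Deque<String> frontier = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String node = frontier.poll();
            order.add(node);
            for (String next : neighboursOf.apply(node)) {
                if (seen.add(next)) frontier.add(next); // add() is false for a repeat
            }
        }
        return order;
    }

    /** Domain name for an IP address; returns the input if the address cannot be parsed. */
    static String domainNameOf(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return ip;
        }
    }

    public static void main(String[] args) {
        // Made-up topology with a cycle: A -> B -> C -> A.
        Map<String, List<String>> topology = new HashMap<>();
        topology.put("A", Arrays.asList("B"));
        topology.put("B", Arrays.asList("C"));
        topology.put("C", Arrays.asList("A"));
        Function<String, List<String>> lookup =
                n -> topology.getOrDefault(n, Collections.<String>emptyList());
        System.out.println(crawl("A", lookup));
    }
}
```

Marking an address as seen when it is queued, rather than when it is crawled, keeps duplicates out of the frontier as well.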
Project Phase III
• Implement a master/worker crawler with Java NIO sockets.
[Diagram: the master (primary) keeps a “Crawled” list and a “To be Crawled” list. It sends out “Crawl the following list : …” and receives “Results: peers IPs, statistics” from the crawl of the Gnutella network.]
Problems? (Hint: failures)
20
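The hint about failures suggests the master should remember which worker holds which nodes. The sketch below shows one possible bookkeeping scheme; it is an assumption for illustration, not the required design. The class CrawlMaster and its methods are hypothetical, workers are identified by plain strings, and all networking is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical master-side state: "to be crawled", "crawled", and work in progress. */
final class CrawlMaster {
    private final Deque<String> toCrawl = new ArrayDeque<>();
    private final Set<String> crawled = new HashSet<>();
    private final Set<String> known = new HashSet<>();                   // ever queued
    private final Map<String, Set<String>> inProgress = new HashMap<>(); // worker -> nodes

    CrawlMaster(String seed) { enqueue(seed); }

    private void enqueue(String node) {
        if (known.add(node)) toCrawl.add(node); // ignore addresses seen before
    }

    /** Hands up to batchSize nodes to a worker and records who holds them. */
    List<String> assign(String worker, int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !toCrawl.isEmpty()) batch.add(toCrawl.poll());
        inProgress.computeIfAbsent(worker, w -> new HashSet<>()).addAll(batch);
        return batch;
    }

    /** Records a worker's result for one node and queues newly discovered peers. */
    void report(String worker, String node, List<String> peers) {
        Set<String> held = inProgress.get(worker);
        if (held == null || !held.remove(node)) return; // stale or duplicate report
        crawled.add(node);
        for (String peer : peers) enqueue(peer);
    }

    /** Puts a failed worker's unfinished nodes back on the "to be crawled" list. */
    void workerFailed(String worker) {
        Set<String> lost = inProgress.remove(worker);
        if (lost != null) toCrawl.addAll(lost);
    }

    int pending() { return toCrawl.size(); }

    Set<String> crawled() { return crawled; }
}
```

How the master decides that a worker has failed (a closed socket, a missed deadline) is left open here; whatever the trigger, it ends in a call such as workerFailed so that no node is silently dropped.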
Project Phase III
• Implement a master/worker crawler with Java NIO sockets.
• Adopt primary/backup replication for the master
[Diagram: the primary master fails (X) and the backup master takes over the “Crawled” and “To be Crawled” lists for the crawl of the Gnutella network.]
21
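Primary/backup replication can be sketched as below: the primary forwards every state change to the backup before applying it, and the backup takes over when it stops hearing from the primary. This is a minimal in-memory sketch, not a prescribed design. The class names, the operation names DISCOVERED and CRAWLED, and the timeout rule are all assumptions, and the direct method call from primary to backup stands in for a network message.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical primary/backup pair sharing the crawl state. */
final class PrimaryBackupSketch {

    /** The state both masters must agree on. */
    static final class CrawlState {
        final Set<String> crawled = new HashSet<>();
        final Set<String> toCrawl = new HashSet<>();

        void apply(String op, String node) {
            if (op.equals("DISCOVERED")) {
                if (!crawled.contains(node)) toCrawl.add(node);
            } else if (op.equals("CRAWLED")) {
                toCrawl.remove(node);
                crawled.add(node);
            }
        }
    }

    static final class Backup {
        final CrawlState state = new CrawlState();
        private final long timeoutMillis;
        private long lastHeard;

        Backup(long nowMillis, long timeoutMillis) {
            this.lastHeard = nowMillis;
            this.timeoutMillis = timeoutMillis;
        }

        /** Applies a replicated update; an update also counts as a sign of life. */
        void onUpdate(String op, String node, long nowMillis) {
            state.apply(op, node);
            lastHeard = nowMillis;
        }

        void onHeartbeat(long nowMillis) { lastHeard = nowMillis; }

        /** True once the primary has been silent for longer than the timeout. */
        boolean shouldTakeOver(long nowMillis) {
            return nowMillis - lastHeard > timeoutMillis;
        }
    }

    static final class Primary {
        final CrawlState state = new CrawlState();
        private final Backup backup;

        Primary(Backup backup) { this.backup = backup; }

        /** Replicates first, then applies locally, so the backup never lags behind. */
        void record(String op, String node, long nowMillis) {
            backup.onUpdate(op, node, nowMillis);
            state.apply(op, node);
        }
    }
}
```

With this ordering, a backup that takes over holds at least everything the primary had applied, so it can resume from the same “Crawled” and “To be Crawled” sets.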
Previous Years' Ideas – Part I
Programming languages / frameworks / protocols
• Java (the vast majority)
• Scala
• Apache MINA framework.
• Java RMI
• Jython
• XML-RPC
• SQL
• Python/Perl/Shell/cron jobs
Architecture
• Master/worker (the majority)
• Hierarchical
22
Previous Years' Ideas – Part II
Design choices
• NIO at both master and workers
• Careful load balancing
• Keep the workers always busy
• Bootstrapping new workers if old workers fail
Additional bells and whistles
• GUI manager
• Statistics in real-time through GUI and web page
• Graphviz
23
References
• Single Gnutella-Node Crawler:
http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
• Gnutella crawling protocol:
http://www.ece.ubc.ca/~samera/TA/project/Gnuttela-Protocol.html
Other references:
• http://gnutella-specs.rakjar.de/index.php/Main_Page
• www.limewire.com
24
Thank you
www.ece.ubc.ca/~samera
25