NeuroGrid P2P Simulator
Application Technology Workshop : P2P and GRID: 28/01/2004
P2P Simulation and Reality
Sam Joseph
Strategic Software Division
Graduate School of Information
Science and Technology,
University of Tokyo
Laboratory for Interactive Learning
Technology (LILT), Department of
Information and Computer Sciences,
University of Hawai'i at Manoa
Personal Profile
Founder of NeuroGrid project:
http://www.neurogrid.net
Sub-editor on the P2PJournal:
http://www.p2pjournal.com
MetaData subgroup leader for P2P
research groups:
http://www.irtf.org/charters/p2prg.html
Talk Contents
What is a Simulation?
Why Simulate P2P?
Simulation Methodology
P2P Simulation Issues
The Dangers of Simulation
Real P2P Systems
Types of Simulator
An Extendable Simulator
What is a simulation?
A simulation is “an attempt to model a system in
order to study it scientifically” (Law & Kelton, 2000)
Real-world complexity often prevents direct
mathematical analysis of the model
Thus a numerical approach or simulation is
required
This will require an abstraction of the real system,
since otherwise we would just be building the real
system
The central question is which abstractions to make, as
one can accidentally abstract away essential details
For example, is peer heterogeneity required in the simulation?
Why Simulate?
Testing scalability to large numbers of peers
requires … large numbers of peers
Thus one motivation to simulate comes from the
expense of running the real system
Testing solutions to malicious peers requires
… malicious peers
And introducing malicious peers into a real system
is somewhat socially irresponsible
However, the crucial question is: are simulation studies
relevant to real P2P systems?
Simulation Methodology
All too often simulation "studies" involve
building a model and using the results of a
single run to obtain the "answer". (Law &
Kelton, 2000)
This pattern is replicated across P2P
simulation studies
Drawing valid and credible conclusions
requires:
Careful assessment of assumptions
Appropriate probability distributions of starting
parameters
Subjecting results to the appropriate statistical analysis
P2P Simulation Issues 1
Content Model
1. Representational complexity
document is represented by hash X
document is in category X and no other
document is related to “whales” and “dolphins”
document “defines” “whales” and “has illustrations of”
“dolphins”.
2. Vocabulary
whether users map fundamental concepts onto the
same terms, e.g. I say “whale” and you say “kujira”,
but we both mean marine mammal
3. Fundamental concepts
agreement about fundamental concepts; you say this
marine mammal is food and I say it is sentient
content-centric or user-centric?
4. Dishonesty
e.g. you say this is a “revolutionary product” and I say
this is “unsolicited junk”
P2P Simulation Issues 2
Content Model
Each content issue is subject to dynamic evolutionary
processes where users change opinions and strategies
over time
More on content modeling in P2P networks in Joseph
& Hoshiai (2003)
Network state serialization
Allows stopping and starting
Danger of biasing statistical analysis
Network Markup Language (NML)
Visualization, unit-testing
Visualization greatly aids debugging
Unit-testing particularly important in extendible
framework
P2P Simulation Issues 3
Parameter Distributions
Starting topology, content & query distributions and
churn rates
Determine from real system where available
Lv et al. (2002) showed different macro-behaviour
depending on whether topology was constructed using
a Zipfian model or using real-world data
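Where measured data is not available, content and query popularity is often drawn from a Zipfian model. A minimal sketch of such a sampler follows, assuming a catalogue of 1000 documents and an exponent of 1.0; both values, and the class name ZipfSampler, are illustrative assumptions and are not taken from any measured system or from the NeuroGrid code.

```java
import java.util.Random;

/** Hypothetical helper: draws ranks 1..n with P(rank = k) proportional to 1 / k^s. */
class ZipfSampler {
    private final double[] cdf;   // cumulative probabilities over ranks 1..n
    private final Random rng;

    ZipfSampler(int n, double s, long seed) {
        cdf = new double[n];
        double sum = 0.0;
        for (int k = 1; k <= n; k++) {
            sum += 1.0 / Math.pow(k, s);
            cdf[k - 1] = sum;
        }
        for (int k = 0; k < n; k++) cdf[k] /= sum;   // normalise to [0, 1]
        cdf[n - 1] = 1.0;                            // guard against rounding error
        rng = new Random(seed);
    }

    /** Returns a rank in [1, n]; rank 1 is the most popular item. */
    int next() {
        double u = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {                            // first index with cdf[i] >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] >= u) hi = mid; else lo = mid + 1;
        }
        return lo + 1;
    }

    public static void main(String[] args) {
        ZipfSampler sampler = new ZipfSampler(1000, 1.0, 42L);
        int[] counts = new int[1001];
        for (int i = 0; i < 100000; i++) counts[sampler.next()]++;
        System.out.println("rank 1: " + counts[1] + " draws, rank 100: " + counts[100] + " draws");
    }
}
```

Given the Lv et al. result, a model like this is best treated as a fallback to be checked against real traces rather than as a substitute for them.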
Results Analysis
run multiple simulations starting with different
selections from the same input probability distributions
present results indicating confidence intervals
Or repeat the assessment of confidence intervals, after
sets of additional simulations, until the specified
precision is achieved
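The results-analysis procedure above can be sketched as a loop that adds batches of independent runs until the confidence interval is narrow enough. This is only an illustration: runOnce is a placeholder that returns an invented metric, the class and method names are hypothetical, and the 1.96 factor assumes a normal approximation to a 95% interval (for a small number of runs a Student's t quantile would be more appropriate).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReplicationStudy {
    /** Placeholder for one full simulation run; returns an invented metric. */
    static double runOnce(long seed) {
        Random rng = new Random(seed);
        return 10.0 + rng.nextGaussian();
    }

    static double mean(List<Double> xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.size();
    }

    /** Half-width of an approximate 95% confidence interval for the mean. */
    static double halfWidth(List<Double> xs) {
        int n = xs.size();
        if (n < 2) return Double.POSITIVE_INFINITY;
        double m = mean(xs), ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        double stdDev = Math.sqrt(ss / (n - 1));     // sample standard deviation
        return 1.96 * stdDev / Math.sqrt(n);         // normal approximation (assumption)
    }

    /** Adds batches of runs until the half-width target or maxRuns is reached.
        Returns {mean, half-width, number of runs}. */
    static double[] runUntilPrecise(double target, int batch, int maxRuns) {
        Random seeds = new Random(2004L);            // one source of per-run seeds
        List<Double> results = new ArrayList<>();
        do {
            for (int i = 0; i < batch; i++) results.add(runOnce(seeds.nextLong()));
        } while (halfWidth(results) > target && results.size() < maxRuns);
        return new double[] { mean(results), halfWidth(results), results.size() };
    }

    public static void main(String[] args) {
        double[] r = runUntilPrecise(0.2, 10, 1000);
        System.out.printf("mean = %.3f +/- %.3f after %d runs%n", r[0], r[1], (int) r[2]);
    }
}
```

Each run gets its own seed from a single seed generator, so every replication is a different selection from the same input distributions.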
Dangers of Simulation
Case study: Query Message Combination Protocol
(QMCP)
QMCP is a Gnutella protocol modification that
combines multiple queries, which could lead to more
efficient use of bandwidth (based on a 2001 study)
However, network protocols are frequently changing
– do older results about the Gnet still apply?
Failing to consider lower network levels may leave
you suggesting redundant things
e.g. replicating Nagle's algorithm in the overlay when it
already exists in TCP/IP
Real P2P Systems
Saroiu et al (2001) Gnutella/Napster study:
Significant heterogeneity: bandwidth, latency,
availability vary by 3-5 orders of magnitude
Peers deliberately misreport information if there is an
incentive to do so
Clip2 showed Gnet follows a power law – Saroiu et al
show resistance to random failure, but fragments under
directed attack
Ripeanu et al. (2002) show Gnutella diverging from a
power law network
Ge et al. (2002): the unregulated and transitory
nature of P2P systems makes it difficult to
evaluate assumptions in the real system
Types of Simulator
Hierarchy of approaches
Numerical Model
• SimP2 (Kant & Iyer, 2003)
• Queuing Model (Ge et al., 2002?)
Flow-based simulation
• Narses (Baker & Giuli, 2002)
Event-based simulation
• NeuroGrid (Joseph, 2003)
• QueryCycle (Schlosser et al., 2002)
Packet-based simulation
• PLP2P (He et al., 2003)
• NS-2
Real system
NeuroGrid Simulator
Abstract Classes
Keyword
Document
Message
Node
Network
MessageHandler
Extending the above classes allows
us to create different P2P networks
Gnutella
Freenet
NeuroGrid
Pastry
Action Event framework
[Figure: a timeline of timesteps 0 to 9 with Actions queued at individual timesteps, shown at three successive stages]
Execution causes two actions to be inserted at timestep 3
Execution causes one action to be inserted at timestep 4, another at timestep 8
Execution causes two more actions to be inserted at timestep 8
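The action/event idea on this slide, where executing an action may insert further actions at later timesteps, can be sketched with a sorted map from timestep to pending actions. This is a minimal illustration and not the NeuroGrid implementation: ActionTimeline, Action and demo are hypothetical names, and which action spawns which in the demo is an assumption made to reproduce the three captions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ActionTimeline {
    interface Action { void execute(ActionTimeline timeline, int now); }

    private final TreeMap<Integer, List<Action>> pending = new TreeMap<>();
    private int current = 0;
    final List<String> log = new ArrayList<>();

    /** Queues an action; scheduling into the past is rejected. */
    void schedule(int timestep, Action action) {
        if (timestep < current) throw new IllegalArgumentException("timestep is in the past");
        pending.computeIfAbsent(timestep, t -> new ArrayList<>()).add(action);
    }

    /** Executes pending actions in timestep order, up to and including lastTimestep. */
    void run(int lastTimestep) {
        while (!pending.isEmpty() && pending.firstKey() <= lastTimestep) {
            current = pending.firstKey();
            List<Action> due = pending.remove(current);   // detach before executing
            for (Action a : due) a.execute(this, current);
        }
    }

    /** Assumed scenario: start at 0, two actions at 3, one at 4, three at 8. */
    static List<String> demo() {
        ActionTimeline tl = new ActionTimeline();
        Action leaf = (t, now) -> t.log.add(now + ":leaf");
        Action atFour = (t, now) -> {
            t.log.add(now + ":spawn");
            t.schedule(8, leaf);
            t.schedule(8, leaf);
        };
        Action atThree = (t, now) -> {
            t.log.add(now + ":spawn");
            t.schedule(4, atFour);
            t.schedule(8, leaf);
        };
        tl.schedule(0, (t, now) -> {
            t.log.add(now + ":spawn");
            t.schedule(3, atThree);
            t.schedule(3, leaf);
        });
        tl.run(9);
        return tl.log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Timesteps with no pending actions are skipped entirely, which is the main attraction of an event-based design over stepping every peer at every tick.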
Conclusion
P2P systems are characterized by many of
the annoying real life complexities that
prevent simple analysis and simulation
For example
high turnover of peers
download & connection failures
large numbers of stochastically behaving peers
Simplifications used for tractable simulations
can lead to unrealistic behaviour
Effective use of simulation studies requires a
lot of work, but not as much as full
implementation?
Questions?
[email protected]
Gnutella Search
Gnutella search uses broadcast
The spread of the messages is limited by TTL and GUID
TTL: Time To Live - the number of hops before a message is expired
GUID: Globally Unique Identifier - allows nodes to identify loops
[Figure: Query-G067 leaves N001 with TTL=3 and is broadcast across nodes N001 to N007, each holding a list of seen GUIDs; TTL values of 2, 1 and 0 are shown on successive hops; callouts read "Seen it", "LOOP", "Match" and "Stop"]
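The two limits on a Gnutella broadcast, TTL and GUID, can be shown in a few lines. The sketch below is a simplified stand-alone model and not the NeuroGrid simulator classes: FloodSketch, Peer, Hop and the demo topology are assumptions made for illustration, and messages are processed hop by hop from a queue rather than through per-node inboxes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FloodSketch {
    static class Peer {
        final String name;
        final List<Peer> neighbours = new ArrayList<>();
        final Set<String> seenGuids = new HashSet<>();
        final Set<String> documents = new HashSet<>();
        Peer(String name) { this.name = name; }
    }

    /** One message in flight: where it is going and its remaining TTL. */
    static class Hop {
        final Peer target;
        final int ttl;
        Hop(Peer target, int ttl) { this.target = target; this.ttl = ttl; }
    }

    static void connect(Peer a, Peer b) { a.neighbours.add(b); b.neighbours.add(a); }

    /** Broadcasts a query from origin; returns names of peers holding the document. */
    static List<String> flood(Peer origin, String guid, String document, int ttl) {
        List<String> hits = new ArrayList<>();
        ArrayDeque<Hop> inFlight = new ArrayDeque<>();
        inFlight.add(new Hop(origin, ttl));
        while (!inFlight.isEmpty()) {
            Hop hop = inFlight.poll();
            Peer peer = hop.target;
            if (!peer.seenGuids.add(guid)) continue;              // "Seen it": loop detected
            if (peer.documents.contains(document)) hits.add(peer.name);   // "Match"
            if (hop.ttl <= 0) continue;                           // "Stop": TTL used up
            for (Peer n : peer.neighbours) inFlight.add(new Hop(n, hop.ttl - 1));
        }
        return hits;
    }

    /** Assumed topology: a triangle N001-N002-N003 followed by a chain to N006. */
    static List<String> demo() {
        Peer[] p = new Peer[7];
        for (int i = 1; i <= 6; i++) p[i] = new Peer("N00" + i);
        connect(p[1], p[2]); connect(p[1], p[3]); connect(p[2], p[3]);
        connect(p[3], p[4]); connect(p[4], p[5]); connect(p[5], p[6]);
        p[4].documents.add("doc");
        p[6].documents.add("doc");   // four hops away, beyond a TTL of 3
        return flood(p[1], "G067", "doc", 3);
    }

    public static void main(String[] args) {
        System.out.println("hits: " + demo());
    }
}
```

In the demo the copy on N006 is never found, which is the usual price of bounding a flood by TTL.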
Abstract Class Extension
Extending the abstract classes implements p2p functions
ID Hashtable            Abstract class   Extension
Keyword ID Hashtable    Keyword          SimpleKeyword
Document ID Hashtable   Document         SimpleDocument
Message ID Hashtable    Message          SimpleMessage
Node ID Hashtable       Node             SimpleNode
E.g. the Message abstract class contains Document and Keyword array
variables
SimpleMessage implements a second constructor, which is used when
nodes forward messages
public SimpleMessage(Message p_message)
    throws Exception
{
    if(p_message == null) throw new Exception("Message is null");
    o_message_ID = p_message.getMessageID();   // GUID
    o_TTL = p_message.getTTL() - 1;            // TTL decrement
    o_keywords = p_message.getKeywords();
    o_document = p_message.getDocument();
    // etc …
}
processMessage()
The Node abstract class has a GUID Hashtable
protected Hashtable o_seenGUIDs = new Hashtable(10);
The Node abstract class has an abstract processMessage
method
public abstract void processMessage(Message p_message, boolean p_start)
throws Exception;
SimpleNode implements this method
public void processMessage(Message p_message, boolean p_start)
    throws Exception
{
    if(p_message == null) throw new Exception("p_message is null");
    // Message seen?
    String x_previous = (String)(o_seenGUIDs.get(p_message.getMessageID()));
    if(x_previous != null)
        return;
    // Message ID of incoming Message goes into Hashtable
    o_seenGUIDs.put(p_message.getMessageID(),p_message.getMessageID());
    // etc …
Forwarding Messages
When a message is forwarded, a new message object is
created through the SimpleMessage constructor which
ensures the GUID is maintained and the TTL decremented
Also in the SimpleNode processMessage implementation:
Enumeration x_enum = o_conn_list.elements();
while(x_enum.hasMoreElements())
{
    x_temp_node = (Node)(x_enum.nextElement());
    // Create new message
    x_new_message = new SimpleMessage(p_message);
    x_new_message.setPreviousLocation(this);
    o_sending_message.put(x_temp_node,x_temp_node);
    // Forward to the next node
    x_temp_node.addMessageToInbox(x_new_message);
}
// etc …
NeuroGrid Search
NeuroGrid nodes learn data location and forward accordingly
Human analogy: networking
[Figure: Query-G067 leaves N001 with TTL=3 and travels across nodes N001 to N007; each node holds a list of seen GUIDs and a knowledge base (KB) mapping keywords A, B and C to nodes, e.g. one KB holds A – N003, B – N002, C – N003 and another holds A – N004, B – N005, C – N005; callouts read "Match" and "Stop"]
NeuroGrid Nodes
NeuroGrid nodes have MultiHashtables that associate a
single key with a Vector of objects
// MultiHashtable used to store which documents are in this node (key = keyword)
protected MultiHashtable o_contents = new MultiHashtable();
// MultiHashtable used to store information about documents in other nodes (key = keyword)
protected MultiHashtable o_knowledge = new MultiHashtable();
A successful search and processMessage updates the
knowledge base of the node that generated the query
// The keywords in the incoming message
for(int i=0;i<x_keywords.length;i++)
{
    x_docs = (Vector)(o_contents.get(x_keywords[i]));
    if(x_docs != null)
    {
        // Document with that keyword present?
        if(x_docs.contains(p_message.getDocument()))
        {
            if(Network.o_learning == true)
            {
                // Update original Node KB
                x_start_node = p_message.getStart();
                x_start_node.addConnection(this);
                x_start_node.addKnowledge(this,x_keywords);
            }
            break; // stop checking once we find a node
        }
    }
}
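To make the "single key with a Vector of objects" idea and the knowledge-base update concrete, here is a small stand-alone sketch. It does not reproduce NeuroGrid's MultiHashtable or Node API: KnowledgeBaseSketch, MultiMap, Peer and candidates are hypothetical names, and unlike the fragment above, get returns an empty list rather than null for an unknown key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KnowledgeBaseSketch {
    /** Associates one key with a list of values (stand-in for a multi-valued hashtable). */
    static class MultiMap<K, V> {
        private final Map<K, List<V>> table = new HashMap<>();

        void put(K key, V value) {
            List<V> values = table.computeIfAbsent(key, k -> new ArrayList<>());
            if (!values.contains(value)) values.add(value);   // avoid duplicate entries
        }

        List<V> get(K key) {
            return table.getOrDefault(key, Collections.emptyList());
        }
    }

    static class Peer {
        final String name;
        final MultiMap<String, String> contents = new MultiMap<>();   // keyword -> local documents
        final MultiMap<String, Peer> knowledge = new MultiMap<>();    // keyword -> peers that answered
        Peer(String name) { this.name = name; }

        /** Called on the query's start node after a successful match at holder. */
        void addKnowledge(Peer holder, String[] keywords) {
            for (String keyword : keywords) knowledge.put(keyword, holder);
        }

        /** Peers worth trying first for a keyword; empty when nothing has been learned. */
        List<Peer> candidates(String keyword) {
            return knowledge.get(keyword);
        }
    }

    public static void main(String[] args) {
        Peer start = new Peer("N001");
        Peer holder = new Peer("N005");
        holder.contents.put("whales", "doc-1");
        System.out.println("before: " + start.candidates("whales").size() + " candidates");
        start.addKnowledge(holder, new String[] { "whales", "dolphins" });
        System.out.println("after: " + start.candidates("whales").get(0).name);
    }
}
```

A later query for the same keyword can then be forwarded to the learned candidates instead of to every connection.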
Freenet Search
Freenet aggressively caches data while performing a serial search
Routing uses document hashes
[Figure: Query-G067 leaves N001 with TTL=20 and visits nodes N001 to N007 one at a time, with TTL values of 19, 14, 13, 12, 11 and 10 shown along the path; each node holds a list of seen GUIDs and a knowledge base (KB) mapping document keys K002, K003 and K004 to nodes, e.g. K002 – N002, K003 – N003, K004 – N007; callouts read "Seen it" and "Match"]
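A serial search with caching can be sketched as a depth-first walk that tries one neighbour at a time, prefers the neighbour associated with the closest known key, backtracks on dead ends, and stores the result at every node on the return path. The code below is a heavily simplified illustration under stated assumptions, not Freenet's protocol or the simulator's Freenet classes: keys are plain integers standing in for document hashes, closeness is absolute difference, a shared visited set replaces per-node GUID lists, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SerialSearchSketch {
    static class Peer {
        final String name;
        final Map<Integer, String> store = new HashMap<>();   // key -> stored or cached data
        final Map<Integer, Peer> routes = new HashMap<>();    // key -> peer associated with it
        Peer(String name) { this.name = name; }

        /** hopsLeft is shared by the whole search, so backtracking also uses it up. */
        String request(int key, int[] hopsLeft, Set<Peer> visited) {
            visited.add(this);                                // stands in for "seen it"
            String local = store.get(key);
            if (local != null) return local;                  // match
            List<Integer> known = new ArrayList<>(routes.keySet());
            known.sort(Comparator.comparingInt(k -> Math.abs(k - key)));   // closest key first
            for (int k : known) {
                Peer next = routes.get(k);
                if (visited.contains(next)) continue;
                if (hopsLeft[0] <= 0) return null;            // hop budget exhausted
                hopsLeft[0]--;
                String data = next.request(key, hopsLeft, visited);
                if (data != null) {
                    store.put(key, data);                     // cache on the return path
                    routes.put(key, next);                    // remember who answered
                    return data;
                }
            }
            return null;                                      // dead end: caller tries its next route
        }
    }

    public static void main(String[] args) {
        Peer n1 = new Peer("N001"), n2 = new Peer("N002"), n3 = new Peer("N003"), n4 = new Peer("N004");
        n1.routes.put(10, n2);
        n1.routes.put(50, n3);
        n2.routes.put(12, n4);
        n3.store.put(20, "document-20");
        int[] hops = { 5 };
        System.out.println(n1.request(20, hops, new HashSet<>()) + ", hops left: " + hops[0]);
    }
}
```

After the first successful request the originating node holds its own copy, so a repeat request for the same key is answered locally.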