Trustworthy Information and Retrieval - iTrust

Download Report

Transcript Trustworthy Information and Retrieval - iTrust

Trustworthy
Information Distribution and Retrieval
Michael Melliar-Smith
University of California,
Santa Barbara
Research conducted in collaboration with
Louise E. Moser, Isai Michel Lombera and Yung-Ting Chuang
Supported in part by NSF Grant CNS 10-16103
Information Access over the Internet
 Modern society and commerce depend on access to
information over the Internet
 Information is accessed over the Internet using
centralized search engines and search indexes
– Internet search engines are centralized for efficiency
and scalability
 We cannot assume that centralized search engines
will always deliver the information we seek,
uncensored and unbiased
 iTrust is a system for publishing, searching for,
and retrieving information over the Internet that
provides trustworthy access to information
METIS'2011
iTrust
Michael Melliar-Smith
2
What Is Trust?
 "Trust is the mental background to delegation"
(Castelfranchi and Falcone [1])
 When you delegate an activity, you trust that:
– Good things will happen
– Bad things will not happen
 However, what is good and what is bad is
a
matter of your intent
 Thus, it is not possible to provide a universal
definition of trust
 Several useful surveys [2], [3] on trust, related to
reliability, security, and privacy
METIS'2011
iTrust
Michael Melliar-Smith
3
Trust in Algorithms
 Trust in a strong encryption algorithm is based on
estimates of the cost of breaking the keys
– Publication of the encryption algorithm
– Independent validation of the strength of the algorithm
– No need to trust the people who created the algorithm
 We aim to provide the same kind of trust for iTrust
– Published algorithms
– Statistical analysis
– No need to trust an administrator
Trust a social network with a large number of users
METIS'2011
iTrust
Michael Melliar-Smith
4
Related Work
 Useful surveys [4], [5], [6] on publish/subscribe and
distributed search, categorized as:
– Structured – Require nodes to be organized in an overlay network,
e.g., Distributed Hash Tables (DHTs), rings, trees
 More efficient than unstructured
 Involve administrative control and additional overhead for constructing
and maintaining the overlay network
 Still incur a trust risk through administrative control
– Unstructured – Gossip-based; typically use randomization
 Gnutella [7] - Great grandfather of unstructured systems; uses
flooding of requests to find information
 Freenet [8] - More sophisticated and efficient than Gnutella, because
it learns from previous requests
 Zhong and Shen [9] - Uses random walks; number of nodes visited by
a request is proportional to the square root of the content popularity
 Ferreira, et al. [10] - Uses random walks; replicates both the queries
and the data in a sparse network
METIS'2011
iTrust
Michael Melliar-Smith
5
Related Work
 Several recent information distribution and
retrieval systems are concerned with security,
privacy, and trust
– Quasar [11] – A probabilistic publish/subscribe system,
using a sparse structured overlay that is concerned with
the release of sensitive information by a central node
– OneSwarm [12] – A peer-to-peer data sharing system
that uses a combination of trusted and untrusted peers
 Part of an effort to provide an alternative to cloud computing
that does not depend on centralized trust
 Initial goal is to protect the privacy of the users
 Uses trusted intermediary nodes to preserve anonymity
METIS'2011
iTrust
Michael Melliar-Smith
6
Objectives of iTrust
 Main Objective: Provide users with information
that is important to them
 Publish information for other people to access
 Search for, and retrieve, published information
 Detect that the system is under attack
 When the system is under attack, adapt to increase
the probability of information distribution and
retrieval, with some increase in costs
METIS'2011
iTrust
Michael Melliar-Smith
7
Non-Objectives of iTrust
 No attempt to prevent the distribution of
misinformation
 No attempt to maintain secrecy and privacy
 No attempt to minimize communication,
processing and storage costs
– Costs are greater than for conventional, centralized
Internet search engines
– We assume that the additional costs are acceptable,
given the primary objective
– Information distribution and retrieval are probabilistic
METIS'2011
iTrust
Michael Melliar-Smith
8
Characteristics of iTrust
 Falls into the general category of
random walk publish/subscribe systems
–
–
–
–
–
–
–
All nodes are equal
No central search engine
Replication of metadata and requests
Random search or random walk
No supernodes
Distributed membership
No central control over membership
METIS'2011
iTrust
Michael Melliar-Smith
9
Basic Idea of iTrust
 The participating nodes constitute the membership of iTrust
 A source node is a participating node that
– Produces information, which it makes available to other
participating nodes
– Produces metadata (keywords) describing the information, and
distributes the metadata to a subset of participating nodes
chosen at random
 A requesting node is a participating node that
– Generates requests containing metadata and distributes the
requests to a subset of participating nodes chosen at random
 A participating node that receives a request
– Compares the metadata in the request with the metadata it holds
– If it finds a match, it returns the URL of the associated information
to the requesting node
 The requesting node then uses the URL to retrieve the
information from the source node
METIS'2011
iTrust
Michael Melliar-Smith
10
Distribution of Metadata
Source of
Information
METIS'2011
iTrust
Michael Melliar-Smith
11
Distribution of a Request
Source of
Information
Request
Encounters
Metadata
Requester of
Information
METIS'2011
iTrust
Michael Melliar-Smith
12
Retrieval of Information
Source of
Information
Request
Matched
Requester of
Information
METIS'2011
iTrust
Michael Melliar-Smith
13
Probabilistic Analysis
 Membership contains n participating nodes
 Metadata are distributed to m nodes
 Requests are distributed to r nodes
 Of the participating nodes, a proportion x
are operational
METIS'2011
iTrust
Michael Melliar-Smith
14
Probabilistic Analysis
 A request is distributed to r nodes (r trials)
 Probability of no match on 1st trial:
n-m
n
 Probability of no match on 2nd trial: n-m-1
n-1
 Probability of no match on rth trial: n-m-r+1
n-r+1
 Probability q of no match on r trials:
n-m n-m-1 … n-m-r+1 = (n-m)! (n-r)!
n
n-1
n-r+1
n! (n-m-r)!
METIS'2011
iTrust
Michael Melliar-Smith
15
Probabilistic Analysis
 Probability p of a match on r trials:
1 - n-m n-m-1 … n-m-r+1
n
n-1
n-r+1
= 1 - (n-m)! (n-r)! where n ≥ m+r
n! (n-m-r)!
 m=r =n
p > 1 – e-1  0.6321
 m=r =2n
p > 1 – e-2  0.8647
 m=r =2n
p > 1 – e-4  0.9817
 If m+r > n, then p = 1
METIS'2011
iTrust
Michael Melliar-Smith
16
Probabilistic Analysis
 Now assume that only a proportion x of the
participating nodes are operational
 Probability that 1st node has the metadata: m
n
 Probability that 1st node has the metadata
and is operational: mx
n
 Probability of no match on 1st trial: 1- mx = n-mx
n
n
 Probability of no match on 2nd trial: n-mx-1
n-1
 Probability of no match on rth trial: n-mx-r+1
n-r+1
METIS'2011
iTrust
Michael Melliar-Smith
17
Probabilistic Analysis
 Probability q of no match on r trials:
n-mx n-mx-1 … n-mx-r+1
n
n-1
n-r+1
= (n-mx)! (n-r)!
n! (n-mx-r)!
 Probability p of a match on r trials:
1 - n-mx n-mx-1 … n-mx-r+1
n
n-1
n-r+1
= 1 - (n-mx)! (n-r)! where n ≥ mx+r
n! (n-mx-r)!
 If mx+r > n, then p = 1
METIS'2011
iTrust
Michael Melliar-Smith
18
Probability of a Match
METIS'2011
iTrust
Michael Melliar-Smith
19
Probability of a Match
METIS'2011
iTrust
Michael Melliar-Smith
20
Time-to-Live
 Metadata can be provided with a time-to-live
– Receiver of the metadata deletes the metadata when
the time-to-live expires
 Similarly, a request can be provided with a
time-to-live
– Receiver of the request stores the request until the timeto-live and then deletes the request
– Receiver attempts to match newly arrived metadata
with the metadata in the request until the time-to-live
METIS'2011
iTrust
Michael Melliar-Smith
21
Small Information
 Many information items are small
 Distribute the information itself, rather than
the metadata about the information
METIS'2011
iTrust
Michael Melliar-Smith
22
Different Classes of Nodes
 Some nodes are less capable, or are only
intermittently connected to the network
 Distribute the metadata and the requests
only to the more capable nodes
 Less capable nodes might have more powerful
proxy nodes or home agents
METIS'2011
iTrust
Michael Melliar-Smith
23
Forwarding Metadata and Requests
 To exploit the parallelism of the Internet, the
originator of the metadata or request does not
necessarily send the metadata or request to all
m or r of the participating nodes
 When a node receives the metadata or request,
with some probability, it forwards the
metadata or request to another participating
node selected at random
 Doing so introduces some variability in the
number of nodes to which the metadata and
requests are distributed
METIS'2011
iTrust
Michael Melliar-Smith
24
Differential Distribution
 If there are many more requests than metadata,
it might be appropriate to distribute the metadata
to more nodes and the requests to fewer nodes
 Similarly, long-lived metadata and requests
might be distributed to more nodes than
short-lived metadata and requests
 Likewise, frequently requested metadata might be
distributed to more nodes than rarely requested
metadata
METIS'2011
iTrust
Michael Melliar-Smith
25
Network Load
 We are investigating the effects of metadata and
request distribution on the network load and also
on the load of participating nodes
 If the network load is too high, it might be necessary
to reduce the number of nodes to which the
metadata and requests are distributed
METIS'2011
iTrust
Michael Melliar-Smith
26
Membership
 The membership of participating nodes
need not be exact and up-to-date
 Small differences in the membership
are equivalent to small proportions of
non-operational nodes
 It is essential, to the iTrust strategy, that the
membership should not be centrally managed
 Thus, we employ a membership algorithm that
is based on iTrust itself
METIS'2011
iTrust
Michael Melliar-Smith
27
Membership Algorithm
 A node wishing to join the membership
contacts any current member to obtain the
current membership
– It does so using mechanisms that are outside the
iTrust strategy, perhaps Email, Twitter, etc.
 It then publishes its joining the membership,
through the iTrust distribution and retrieval
mechanisms
 All nodes periodically request and retrieve
information about new nodes that have joined
the membership
METIS'2011
iTrust
Michael Melliar-Smith
28
Joining the Membership
METIS'2011
iTrust
Michael Melliar-Smith
29
Discovering New Members
METIS'2011
iTrust
Michael Melliar-Smith
30
Leaving the Membership
METIS'2011
iTrust
Michael Melliar-Smith
31
Rapidly Changing Memberships
 At times of rapid membership change, it might
be appropriate to request and retrieve
membership information more frequently,
with increased computation and
communication costs
 At times of rapid membership change, it might
be appropriate to distribute the metadata and
requests to more nodes to compensate for
inaccurate membership information
METIS'2011
iTrust
Michael Melliar-Smith
32
Large Memberships
 Large memberships (perhaps millions of nodes)
might be expensive to retrieve and store
 The potentially high rate of notifications of
membership changes for a large membership
might impose a heavy load on the network
 We are investigating strategies for creating and
maintaining memberships in which each node is
aware of only a small subset of the membership
 We are also investigating the effects of small
subsets of the membership, on the effectiveness
of information distribution and retrieval
METIS'2011
iTrust
Michael Melliar-Smith
33
Encryption
 In iTrust, there is no intention to use encryption
to ensure secrecy or privacy at the node level
 Necessarily, metadata and requests must be
readable by large numbers of nodes and, thus,
they are public
 However, encryption can be used to make it
prohibitively expensive for routers to use deep
packet inspection to censor metadata or requests
 For this purpose, iTrust uses standard public key
encryption
METIS'2011
iTrust
Michael Melliar-Smith
34
Encrypted Metadata and Requests
 When a node sends metadata or requests,
it encrypts the message with
– Its private key
– Destination’s public key
 The sending node includes its public key in the message
– Some receiver nodes might not yet have
its information in their membership tables
 When a node finds that a request matches its metadata,
– It uses the public key in the request to encrypt the response
reporting the match to the requester
– The response supplies the URL that directs the requester
to the source of the information, and also the source’s public key
METIS'2011
iTrust
Michael Melliar-Smith
35
Potential Malicious Attacks
 A malicious attacker might seed the network with
covertly subverted nodes that behave normally,
except that they fail to report matches involving
information that the attacker wants to suppress
 A malicious attacker must ensure that a large
number of nodes that participate in matching
have been subverted
 In iTrust, it is important to detect a malicious
attack, and to prevent the malicious attack from
being effective
METIS'2011
iTrust
Michael Melliar-Smith
36
Detecting a Malicious Attack
 In iTrust, it is likely that a request will result in
several reports of matches
 The probability of multiple reports of matches
depends on:
–
–
–
–
Number n of participating nodes
Number m of nodes to which the metadata are distributed
Number r of nodes to which the requests are distributed
Probability x that a node is operational
 The effect of a malicious attack is to increase
the probability that a substantial number of
subverted nodes appear to be non-operational for
certain metadata or requests
METIS'2011
iTrust
Michael Melliar-Smith
37
Detecting a Malicious Attack
Probability of Number of Matches
0.6
1000 Node Network with
Distribution to 60 Nodes
0.5
Percentage of nodes
operational
100%
80%
0.4
60%
0.3
40%
20%
0.2
0.1
0
k=0
k=1
k=2
k=3
k=4
k=5
Number of Matches
METIS'2011
iTrust
Michael Melliar-Smith
38
Responding to a Malicious Attack
 If the iTrust network is under attack, it is
appropriate to increase the number of nodes
to which the metadata and requests are distributed
 We are investigating an adaptive algorithm that
increases the number of nodes to which the
metadata and requests are distributed, as the
probability of an attack increases, i.e., our estimate
of the number of subverted nodes increases
METIS'2011
iTrust
Michael Melliar-Smith
39
Prototype Implementation of iTrust
 Based on the Apache Web server, compiled
with several PHP standard modules and
library extensions
 Uses HTTP for distribution of metadata and
requests, and for retrieval of documents
 Multiple iTrust nodes can be installed on a
single Web server by creating multiple virtual
Web sites on that server
 Comprises three components:
– Web Server Foundation
– Application Infrastructure
– Public Interface
METIS'2011
iTrust
Michael Melliar-Smith
40
The Components of iTrust
apache
PHP
cURL
metadata functions
metadata xml engine
register metadata list
apply xml
publish xml list
helper functions
nodes wrapper
keywords wrapper
log
PECL http
tools
user settings
statistics
SQLite
session
public interface
resource wrapper
metadata inbox
leave membership
delete nodes
tag keyword resource
search functions
search
globals / navigation
query
inbox
tika / lucene / dictionary
(a)
METIS'2011
(b)
iTrust
(c)
Michael Melliar-Smith
41
Web Server Foundation
 cURL is used for inter-node communication and
resource-specific actions
 SQLite database tables are used to store node,
metadata, and resource information
– A node uses the SQLite LIKE function to match the
metadata in a request with the metadata that it holds
 The session module tracks and distinguishes users
 The log module is used for debugging and
for simulation
 The PHP Extension Community Library (PECL)
for HTTP is used for inter-node search and requests
METIS'2011
iTrust
Michael Melliar-Smith
42
Application Infrastructure
 The metadata XML engine scans the resources and
creates an XML list describing the relationship
between the metadata and the resource
 The node and resource-related helper functions
insert nodes into the membership, insert keywords
into the database, and upload or fetch resources
 The Apache Tika and Lucene packages are used to
generate metadata from resources, if the user opts
not to generate the metadata manually
 The WordNet dictionary is used to provide spell
checking and synonym suggestions
METIS'2011
iTrust
Michael Melliar-Smith
43
Public Interface
 Comprises two kinds of interfaces:
– Computer interfaces – Handle all inter-node
communication such as queries, resource distribution,
and metadata list distribution
 Request is sent to participating nodes using computer interfaces
in a simple inbox-type fashion
 A participating nodes reads its inbox for queries, and sends back
a response if it has a match
– Human interfaces – Consist of PHP HTML Web pages
 Administrator can add nodes or metadata keywords using
HTML form text boxes
 User generates requests using HTML form text boxes
 User settings and statistics Web pages provide feedback on
the membership size, resource count, etc.
METIS'2011
iTrust
Michael Melliar-Smith
44
Prototype Implementation of iTrust
METIS'2011
iTrust
Michael Melliar-Smith
45
Prototype Implementation of iTrust
METIS'2011
iTrust
Michael Melliar-Smith
46
Prototype Implementation of iTrust
METIS'2011
iTrust
Michael Melliar-Smith
47
Simulation Results Based on the Implementation
METIS'2011
iTrust
Michael Melliar-Smith
48
Conclusion and Future Work
 We have described iTrust [13], [14], a trustworthy
information distribution and retrieval network
 We plan to do experimental evaluations of the
prototype implementation using PlanetLab
 We are investigating other implementations of
iTrust based on:
– SMS
– Twitter
– What else? We need your advice.
 We plan to make the iTrust source code, tools,
documentation, etc. freely available for all to use
METIS'2011
iTrust
Michael Melliar-Smith
49
References
[1] C. Castelfranchi and R. Falcone, Principles of Trust for MAS: Cognitive Anatomy, Social Importance, and
Quantification, Proceedings of the International Conference on Multi-Agent Systems, Paris, France, 72-79.
[2] D. Artz and Y. Gil, A Survey of Trust in Computer Science and the Semantic Web,
Journal of Web Semantics, 5:58-71, Elsevier, 2007.
[3] T. Grandison and M. Sloman, A Survey of Trust in Internet Applications,
IEEE Communications Survey Tutorials, 4(4):2-16, 2000.
[4] P. T. Eugster, P. A. Felber, R. Guerraoui and A. M. Kermarrec, The Many Faces of Publish/Subscribe,
ACM Computing Surveys, 35(2):114-131, June 2005.
[5] J. Mischke and B. Stiller, A Methodology for the Design of Distributed Search in P2P Middleware,
IEEE Network 19(1):30-37, January 2004.
[6] J. Risson and T. Moors, Survey of Research Towards Robust Peer-to-Peer Networks: Search Methods,
Technical
Report UNSW-EE-P2P-1-1. University of New South Wales, September 2007, RFC 4982,
http://tools.ietf.org/html/rfc-4821.
[7] Gnutella, http://gnutella.wego.com/.
[8] I. Clarke, O. Sandberg, B. Wiley and T. Hong, Freenet: A Distributed Anonymous Information Storage and
Retrieval System, Proceedings of the Workshop on Design Issues in Anonymity and Unobservability,
Lecture Notes in Computer Science, Berkeley, CA, July 2000, 46-66.
[9] M. Zhong and K. Shen, Popularity-Based Random Walks for Peer-to-Peer Search under the Square-Root Principle,
Lecture Notes in Computer Science 4490, 2007, 877-880.
[10] R. A. Ferreira, M. K. Ramanathan, A. Awan, A. Grama and S. Jagannathan, Search with Probabilistic Guarantees
in Unstructured Peer-to-Peer Networks, Proceedings of the Fifth IEEE International Conference on Peer-to-Peer
Computing, Konstance, Germany, August 2005, 165-172.
[11] B. Wong and S. Guha, Quasar: A Probabilistic Publish-Subscribe System for Social Networks,
Proceedings of the 7th International Workshop on Peer-to-Peer Systems, Tampa Bay, FL, February 2008.
[12] T. Isdal, M. Piatek, A. Krishnamurthy and T. Anderson, Privacy Preserving P2P Data Sharing with OneSwarm,
Technical Report UW-CSE, Department of Computer Science, University of Washington, 2009.
[13] I. Michel Lombera, Y. T. Chuang, P. M. Melliar-Smith and L. E. Moser, Trustworthy Distribution and Retrieval
of Information over HTTP and the Internet, Proceedings of the Third International Conference on the Evolving
Internet, INTERNET 2011, Luxembourg, June 2011.
[14] Y. T. Chuang, I. Michel Lombera, L. E. Moser and P. M. Melliar-Smith, Trustworthy Distributed Search and
Retrieval over the Internet, Proceedings of the 2011 WORLDCOMP, International Conference on Internet
Computing, ICOMP, Las Vegas, NV, July 2011.
METIS'2011
iTrust
Michael Melliar-Smith
50
Thank You!
Questions?
Comments…
METIS'2011
iTrust
Michael Melliar-Smith
51