Second Phase: Analyzing

Technical Advisor: Dr. Lidror Troyansky
Academic Advisor: Dr. Yuval Elovic
Presents:
• As the world becomes computerized and connected,
organizations are increasingly exposed to data
leaks (both malicious and innocent).
• 70% of network traffic is P2P:
BitTorrent, eMule, FreeNet, Gnutella…
File sharers can deliberately or unintentionally distribute
sensitive information to the whole world in seconds!
• Examples:
– An Israeli Air Force lieutenant colonel shared his laptop via P2P,
revealed confidential Israeli Air Force documents, and was
suspended from his post.
– The chief of intelligence of the Israeli Police in Eilat also shared a
secret police plan with the whole world, putting many policemen's lives at risk…
Nothing!!!!
A Google search for the terms “P2P networks
Information leaks” returns just 148 pages!!!
After checking the first 50 for relevance, we got tired…
As a world leader in ILP (Information Leaks Prevention),
PortAuthority© Technologies addressed this problem.
The research will be done using the “P2P Inspector Gadget”
system.
[Figure: network diagram. Computer A (sharing non-confidential files),
Laptop B (containing a confidential file of the organization), PDA C
(searches for and downloads the organization's confidential file) and the
P2P Inspector Gadget connect through routers to the Gnutella network.
The Client Organization sits behind the Organization Firewall.]
• Develop a system which will:
– Connect to P2P networks, perform smart searches, and
download suspicious files while avoiding P2P anti-bot
algorithms.
– Analyze the files (PDFs, DOCs, TXTs, source code and
other types) using smart Machine Learning, the industry's
most advanced algorithms and a user feedback
mechanism, with very few false positives.
– Produce history and statistics, such as IP geolocation
and file information, stored in a database.
– Enable research into information leaks in P2P
networks.
[Figure: system architecture. Components, as labeled in the diagram:
– P2PIG User
– P2PInspectorGadget: GUI controller, written in Java SWT
– IGDBHandler, with the IGDB database
– IGtellaHandler: written in Java, Gnutella network connection (Gnutella Network)
– IGStatisticsHandler
– IGFileConverter, with the Local File System
– IGConfClassifier: written in Python, JEP]
[Figure: file state diagram.
States: Not in System, Downloaded, Converted, Analyzed, FeedBacked/Learned.
Transitions: Search and Download file, Convert to .txt, Analyze File,
FeedBack file, System ReInitialized.]
• The system is reinitialized and we have no information about the file.
• The file is found in the P2P network by the system's search engine
and is fully downloaded.
• The file is first saved to the system's database.
• The file is converted to text format as a preliminary action before it is
analyzed by the system.
• The system currently works with all text formats and binary file
types such as PDF, Word and PowerPoint.
• The file is analyzed by the system and its confidential probability is
determined.
• The user is able to view the file's content and give feedback to the
system.
• In this stage the system adds the file to its database and to its
probability hash tables.
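The file lifecycle described above (Not in System, Downloaded, Converted, Analyzed, FeedBacked/Learned) can be sketched as a small state machine. This is only an illustration: the order of the transitions is inferred from the state diagram labels, and the names `FileState`, `TRANSITIONS` and `advance` are hypothetical, not taken from the P2PIG code.

```python
from enum import Enum


class FileState(Enum):
    """File states named in the state diagram."""
    NOT_IN_SYSTEM = "Not in System"
    DOWNLOADED = "Downloaded"
    CONVERTED = "Converted"
    ANALYZED = "Analyzed"
    LEARNED = "FeedBacked/Learned"


# Assumed ordering of the diagram's transitions (hypothetical action names).
TRANSITIONS = {
    (FileState.NOT_IN_SYSTEM, "search_and_download"): FileState.DOWNLOADED,
    (FileState.DOWNLOADED, "convert_to_txt"): FileState.CONVERTED,
    (FileState.CONVERTED, "analyze"): FileState.ANALYZED,
    (FileState.ANALYZED, "feedback"): FileState.LEARNED,
}


def advance(state, action):
    """Return the next state, or raise ValueError for an illegal action."""
    if action == "reinitialize":
        # Reinitializing the system forgets the file, whatever its state.
        return FileState.NOT_IN_SYSTEM
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} a file in state {state.value!r}")
```

Keeping the transitions in one table makes it easy to reject out-of-order actions, such as analyzing a file that has not been converted to text yet.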
• The problem of analyzing files for confidential
information belongs to the Categorization problem
domain.
• In our case, there are two well-defined sets of documents
(Confidential and Non-Confidential).
• There are many kinds of algorithms for Categorization
problems. After researching the area, and on a warm
recommendation from our professional advisor, we chose
an algorithm based on Bayes' Theorem of
conditional probabilities (with some improvements).
• Bayes' Theorem is very commonly used in
SPAM filtering (a problem which resembles
ours).
• The Algorithm works in two Phases:
– First Phase: Learning – Building the Probabilities.
At first, the Algorithm is given two Training Sets: a Set of
Confidential files and a Set of Non-Confidential files.
Using Bayes' Conditional Probability formula, the Probability of each
of the terms in the files is saved in a dedicated Data-Structure.
– Second Phase: Analyzing – Combining.
In the second phase, each of the terms in the analyzed files gets its
probability (computed in the learning phase):
\[ \Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\,\Pr(A_i)}{\sum_j \Pr(B \mid A_j)\,\Pr(A_j)} \]
The Algorithm now tries to combine the probabilities of all of the
most frequent terms.
We are using the Robinson-Fisher Combiner, which greatly improves
the Algorithm's accuracy and significantly reduces the number
of false positives.
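A minimal sketch of the learning phase, assuming a very simple tokenizer and equal priors for the two classes. The function names `tokenize` and `learn` are hypothetical, and the dictionary stands in for the "dedicated Data-Structure"; the actual classifier may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; deliberately simple for this sketch."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def learn(confidential_docs, public_docs):
    """Return {term: Pr(confidential | term)} via Bayes' theorem.

    Both training sets must be non-empty. Equal priors are assumed,
    so the priors cancel out of the formula.
    """
    conf_counts = Counter(t for d in confidential_docs for t in tokenize(d))
    pub_counts = Counter(t for d in public_docs for t in tokenize(d))
    n_conf, n_pub = len(confidential_docs), len(public_docs)
    table = {}
    for term in set(conf_counts) | set(pub_counts):
        # Pr(term | class): fraction of the class's documents containing it.
        p_term_conf = conf_counts[term] / n_conf
        p_term_pub = pub_counts[term] / n_pub
        table[term] = p_term_conf / (p_term_conf + p_term_pub)
    return table
```

For example, with the training sets `["secret budget plan", "secret merger"]` and `["holiday photos", "budget recipes"]`, the term "secret" gets probability 1.0 and "budget" gets 0.5, since it appears in half of each set.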
\[ f(x; \nu, \sigma^2) = \frac{(\sigma^2 \nu / 2)^{\nu/2}}{\Gamma(\nu/2)} \, \frac{\exp\left[ -\frac{\nu \sigma^2}{2x} \right]}{x^{1 + \nu/2}} \]
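The slides do not spell out the Robinson-Fisher Combiner, so the following is a sketch of the form commonly used in Bayesian spam filters, which may differ from the variant in the system. It uses Robinson's smoothing of per-term probabilities and Fisher's chi-square combining. The function names and the defaults `s=1.0`, `x=0.5` are assumptions.

```python
import math


def chi2_sf_even(x, dof):
    """Chi-square survival function, closed form for even dof."""
    assert dof > 0 and dof % 2 == 0
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return min(total, 1.0)


def robinson_smooth(p, n, s=1.0, x=0.5):
    """Pull p towards the prior x when the term was seen only n times."""
    return (s * x + n * p) / (s + n)


def fisher_combine(probs):
    """Combine per-term probabilities; near 1.0 means 'confidential'."""
    if not probs:
        return 0.5
    # Clamp so that log() never sees exactly 0 or 1.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]
    dof = 2 * len(probs)
    # Close to 1 when the evidence points to "confidential".
    q_conf = chi2_sf_even(-2.0 * sum(math.log(p) for p in probs), dof)
    # Close to 1 when the evidence points to "non-confidential".
    q_pub = chi2_sf_even(-2.0 * sum(math.log(1.0 - p) for p in probs), dof)
    return (1.0 + q_conf - q_pub) / 2.0
```

The result stays near 0.5 when the terms disagree or carry little evidence, which is one reason this kind of combiner tends to produce few false positives: uncertain files can be left for user feedback instead of being flagged.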
Option to force
connection to a
specific Ultra-peer
Connecting to several
Ultra-peers simultaneously
Status bar shows the current status of the
system and displays help messages.
Insert the keywords.
When the button is pressed, the system
generates several search queries
based on the keywords and file types.
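One way the query generation could work is to combine small subsets of the keywords with each requested file type. This is only a sketch under that assumption; the function name `build_queries` and the `max_terms` parameter are hypothetical, and the real system may build its queries differently.

```python
from itertools import combinations


def build_queries(keywords, file_types, max_terms=2):
    """Return query strings: every keyword subset of up to max_terms
    words, paired with every file type."""
    queries = []
    for size in range(1, max_terms + 1):
        for combo in combinations(keywords, size):
            for ext in file_types:
                queries.append(" ".join(combo) + " " + ext)
    return queries
```

For the keywords `["budget", "secret"]` and file types `["pdf", "doc"]` this yields six queries, from "budget pdf" up to "budget secret doc".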
The user can choose which files to
download, or configure the system to
download all files.
When the file starts to download, the
system starts to save information about
this file (IP sources, number of users
currently holding the file and more).
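A minimal sketch of how per-file source information could be stored, assuming an SQLite table. The table name `file_sources`, its columns and the helper functions are hypothetical; the slides only say that this information is saved by the system.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) sources table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS file_sources ("
        " file_name TEXT NOT NULL,"
        " source_ip TEXT NOT NULL,"
        " PRIMARY KEY (file_name, source_ip))"
    )
    return db


def record_source(db, file_name, source_ip):
    """Remember that source_ip offers file_name; duplicates are ignored."""
    db.execute(
        "INSERT OR IGNORE INTO file_sources VALUES (?, ?)",
        (file_name, source_ip),
    )
    db.commit()


def holder_count(db, file_name):
    """Number of distinct sources seen holding the file."""
    row = db.execute(
        "SELECT COUNT(*) FROM file_sources WHERE file_name = ?",
        (file_name,),
    ).fetchone()
    return row[0]
```

The composite primary key keeps one row per file and source, so the number of users holding a file is a plain count.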
The user can view the downloading progress.
The user can cancel any download at any time.
The user can send a downloaded file to be analyzed.
• Remaining tasks:
– Statistics gathering.
– Improve smart search and filtering.
– Add more GUI functionality.
– Conduct an official algorithm test and document it.
– ARD, Test Document, and more.
• Start date: Oct 2005.
• Estimated end date: Aug 2006.
• Over 15,500 lines of code and still counting…
– More than 1,267 lines of Python.
• Over 800 hours of work per man.
• 18 pizza platters :)