ADD
Presentation
Academic Advisor: Dr. Yuval Elovici
Technical Advisor: Dr. Lidror Troyansky
[Figure: leak scenario over the Gnutella network]
Computer A: sharing non-confidential files.
Laptop B: containing an organization's confidential file.
PDA C: searches for and downloads the organization's confidential file.
Other elements shown: routers, the Gnutella network, the P2P Inspector Gadget, the organization firewall, the client organization.
• Develop a system which will:
– allow the search parameters to be configured.
– scan the P2P networks.
– download files suspected of being confidential.
– analyze the material using Machine Learning.
– generate reports.
– produce statistics.
• Scanning the P2P network (Gnutella) for suspicious target
information (e.g. information suspected of being confidential).
• Downloading the suspicious target information from the
P2P network (Gnutella).
• Analyzing the scanned results (determining the value of
the documents).
– The system will use a Machine Learning based filtering
algorithm to classify the documents.
• Statistics Gathering:
– The number of users who currently hold the target
information.
– The geographic location of the leaked information, found
using IP Geolocation.
– The history of files searched for, downloaded and analyzed.
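The first statistic above (how many users currently hold the target information) could be computed roughly as follows. This is only a sketch: the `SearchHit` and `HolderStatistics` classes and their fields are hypothetical names chosen for illustration, and a host IP is assumed to identify one user.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical record of one search hit: which host offered which file.
final class SearchHit {
    final String hostIp;
    final String fileName;

    SearchHit(String hostIp, String fileName) {
        this.hostIp = hostIp;
        this.fileName = fileName;
    }
}

final class HolderStatistics {
    // Counts the distinct hosts offering each file name.
    // Assumption: one IP address corresponds to one user.
    static Map<String, Integer> countHolders(List<SearchHit> hits) {
        Map<String, Set<String>> hostsByFile = new HashMap<>();
        for (SearchHit hit : hits) {
            hostsByFile.computeIfAbsent(hit.fileName, k -> new HashSet<>()).add(hit.hostIp);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : hostsByFile.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }
}
```

Repeated hits from the same host are counted once, so a host that answers several queries does not inflate the number of holders.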
• Performance constraints:
– The system should return search results for a
suspicious target within 15 minutes.
– The system should not impose a fixed limit on the download
time of a target. (Remark: the limit should be configurable, and
a default time-out should always be set.)
– The system should keep history results and statistics for
no more than one year.
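A configurable download time-out with a default value, as the remark above requires, might look like the sketch below. The class name, the method name and the 60-second default are assumptions made for illustration; only `java.util.concurrent` from the standard library is used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedDownload {
    // Default time-out; the value is an assumption, not taken from the requirements.
    static final long DEFAULT_TIMEOUT_MILLIS = 60_000;

    // Runs the task and returns its result, or null if the time-out expires first.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stalled task
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("download failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller that has no user-supplied value would pass `TimedDownload.DEFAULT_TIMEOUT_MILLIS`, so a time-out is always in effect.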
• Safety and Security:
– The system will not be used for any purpose other
than finding information leaks in P2P networks (e.g. it will
not be used to find MP3 shares).
– The system will not expose the confidential
documents it downloads or the documents that are
used in the Machine Learning algorithm.
• The system is constructed from several components
which are written in different languages and
communicate with each other in several ways.
• All software modules reside on the same computer.
– IGTellaHandler-
The primary responsibility of this component is downloading documents
from the Gnutella Network.
The IGTellaHandler is written in Java and communicates with the main
component (P2PInspectorGadget) via RMI technology (to increase the
decoupling between the components).
– IGConfClassifier-
The primary responsibility of this component is classifying documents
using different classification rules.
The output of this process will be saved in the database, and will be
available for further use.
– IGDBHandler-
The primary responsibility of this component is connecting to an external
database and serving as the interface between the system's modules and the database.
IGDBHandler will be written in Java and will communicate with the main
component via RMI.
– P2PInspectorGadget-
This component is the system's main component. It has two primary
responsibilities: the first is interacting with the user via the Graphical
User Interface, and the second is controlling the flow of the system.
P2PInspectorGadget will be written in Java and will connect to the
other components using the connections mentioned above, and will
not communicate with any other external system.
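The RMI link between P2PInspectorGadget and IGTellaHandler could be declared along these lines. This is a sketch only: the interface name `TellaHandlerRemote`, its `search` method, the stub class and the registry binding name are all hypothetical, since the slides do not list the actual remote methods. The `java.rmi` calls themselves are standard.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.List;

// Hypothetical remote interface; the method is illustrative only.
interface TellaHandlerRemote extends Remote {
    List<String> search(String keyword) throws RemoteException;
}

// Stub implementation returning canned results instead of querying Gnutella.
final class StubTellaHandler implements TellaHandlerRemote {
    @Override
    public List<String> search(String keyword) {
        List<String> results = new ArrayList<>();
        results.add(keyword + ".doc");
        results.add(keyword + ".pdf");
        return results;
    }
}

final class TellaHandlerServer {
    // Exports the handler and binds it in a local RMI registry.
    // The binding name "IGTellaHandler" is an assumption.
    static Registry publish(TellaHandlerRemote handler, int registryPort) throws RemoteException {
        Remote stub = UnicastRemoteObject.exportObject(handler, 0); // 0 = anonymous port
        Registry registry = LocateRegistry.createRegistry(registryPort);
        registry.rebind("IGTellaHandler", stub);
        return registry;
    }
}
```

The main component would then obtain the handler with `LocateRegistry.getRegistry(host, port).lookup("IGTellaHandler")` and cast the result to `TellaHandlerRemote`, so it depends only on the interface and not on the implementation class.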
P2PInspectorGadget
IGGUI
The GUI is in charge of communicating
with the user. Our architecture uses an
MVC design pattern.
IGTellaHandler
IGStatistics
IGInfrastructure
The infrastructure package will include the file format
converters and will communicate with the database
interface for the statistics and file analysis.
IGFileAnalyzer
The file analyzer will control both the IGTella and
IGConfClassifier components. It will also use the
IGInfrastructure services for communicating with
the database and converting the file formats.
Searching files – seq. diagram
Labels in the diagram: Package, NetworkUIImpl, Scanner, FileNameCreator, SearchSession, Handler, ResultParser, JTella
Messages:
1: getTableResult(String Keyword)
2: CreateFileNames(String Keyword)
3: Hashset of file names
4: getResult(Hashset)
5: Loop on the Hashset
6: GetSharedFiles(filename)
7: shared_files_parameters
8: Parse the result into Table
9: Table result
10: Table result
11: Table result
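The message flow in the sequence diagram can be approximated in a few lines of Java. This is a sketch under stated assumptions: the class name is hypothetical, the list of file extensions is borrowed from the formats named in the test plan, and the `getSharedFiles` function is a stand-in for the Gnutella layer rather than the JTella API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class SearchFlowSketch {
    // Messages 2-3: derive candidate file names from a keyword.
    // The extensions are an assumption; LinkedHashSet (a HashSet subclass)
    // keeps the iteration order predictable.
    static Set<String> createFileNames(String keyword) {
        Set<String> names = new LinkedHashSet<>();
        for (String ext : new String[] {"doc", "ppt", "pdf"}) {
            names.add(keyword + "." + ext);
        }
        return names;
    }

    // Messages 4-9: loop over the file names, query the network for each one,
    // and collect rows of {file name, shared-file parameters}.
    static List<String[]> getTableResult(String keyword,
                                         Function<String, List<String>> getSharedFiles) {
        List<String[]> table = new ArrayList<>();
        for (String fileName : createFileNames(keyword)) {
            for (String parameters : getSharedFiles.apply(fileName)) {
                table.add(new String[] {fileName, parameters});
            }
        }
        return table;
    }
}
```

Passing the network lookup in as a function keeps the flow testable without a live Gnutella connection.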
• Unit testing
All units will be tested for every use case.
For each use case, all of the possible paths will be tested.
Unit testing is part of the design of the project;
automated tests run continuously while we
develop the system.
Here are some of the tests in the test plan:
– [Start system] Start the system with a firewall blocking the
ports needed for P2P, and verify that the system doesn't crash and
outputs the right error message.
– [Scan network] Verify that this process concludes after a predefined time-out.
– [Analyze downloaded files] Verify that the system converts the
different text formats (DOC, PPT and PDF) correctly into "raw"
text.
– [Analyze downloaded files] Verify the accuracy of the algorithm
(achieving the false-positive and true-positive standards
defined in the project's targets).
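For the accuracy check in the last test, the true-positive and false-positive rates could be computed from a labeled test set as sketched here. The class and method names are hypothetical, and the target values themselves are not shown because the slides leave them to the project's targets.

```java
final class ClassifierAccuracy {
    // Returns {truePositiveRate, falsePositiveRate}.
    // In both arrays, true means "confidential".
    static double[] rates(boolean[] actual, boolean[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int truePositives = 0;
        int falsePositives = 0;
        int positives = 0;
        int negatives = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i]) {
                positives++;
                if (predicted[i]) truePositives++;
            } else {
                negatives++;
                if (predicted[i]) falsePositives++;
            }
        }
        // Report 0.0 when a class is absent, to avoid dividing by zero.
        double tpr = positives == 0 ? 0.0 : (double) truePositives / positives;
        double fpr = negatives == 0 ? 0.0 : (double) falsePositives / negatives;
        return new double[] {tpr, fpr};
    }
}
```

A unit test would then compare the two rates against whatever thresholds the project defines.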
• Acceptance Tests:
As part of the acceptance tests, all of the use cases
will be fully checked from beginning to end.
In addition, all of the non-functional requirements will be
tested to make sure they meet their targets:
– System's History:
In order to verify that the system saves all the information for the
period that the user has defined (default is 1 year), we shall
manually change the system's clock and check that
the data that needs to be saved is saved and the data that
should have been deleted is deleted.
– System legitimacy (non-pirate uses):
The system will be blocked from uploading data. This will be
checked by planting a unique media file (maybe MP3 or
MPEG) that we composed, with a unique name, and trying to
download the media file with a different client.
– Content Safety:
To ensure content safety (classified documents used for
the learning part of the algorithm will not be exposed to the P2P
network), the two sub-applications run as separate
processes with different memory spaces.
The test will attempt, from another client, to download the
classified documents or the list of the documents from the
process that connects to the P2P network.
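The history-retention check above relies on changing the system clock by hand. As a possible complement, the retention rule could take a `java.time.Clock` so that an automated test can supply a fixed time. The sketch below is an assumption about how that might be structured: the class name is hypothetical, and one year is approximated as 365 days.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class HistoryRetention {
    // Default retention period; "one year" is approximated as 365 days (assumption).
    static final Duration DEFAULT_RETENTION = Duration.ofDays(365);

    // Returns the record timestamps that are still inside the retention window.
    static List<Instant> retain(List<Instant> records, Duration retention, Clock clock) {
        Instant cutoff = clock.instant().minus(retention);
        List<Instant> kept = new ArrayList<>();
        for (Instant recordedAt : records) {
            if (!recordedAt.isBefore(cutoff)) {
                kept.add(recordedAt);
            }
        }
        return kept;
    }
}
```

In production code the caller would pass `Clock.systemUTC()`; a test would pass `Clock.fixed(...)` set to any date it wants to simulate.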
• Create and Integrate the GUI.
• Find a list of Gnutella1 working servers.
• Classification algorithm inspecting and learning.
• Integrate Python written algorithm to Java.
• Finish PDF 2 DOC converter.
• Finish Gnutella driver (able to perform search
and download of documents).