Web Content Filter for secure WWW access

Web Content Filter:
technology for socially safe
browsing
Ilya Tikhomirov
Institute for Systems Analysis
of the Russian Academy of
Sciences
E-mail: [email protected]
WWW: http://www.isa.ru
State of the art (1)
• HTTP traffic accounts for about 50% of the
information transferred over the Web.
• Part of the Web is inappropriate (extremist and
pornographic sites, social networks and music
archives) for some categories of users (children,
students, employees).
• The number of inappropriate sites grows constantly.
Blocking access to inappropriate Web sites is the
main goal of content filters.
State of the art (2)
• Modern content filters use predefined ban lists
of URLs or IP addresses of inappropriate Web
sites.
• Ban lists are compiled manually by content filter
developers or by network administrators.
• HTTP proxy servers have become more popular
for Web surfing, which makes predefined ban
lists ineffective.
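To illustrate why a predefined ban list stops working once a page is fetched through a proxy, here is a minimal sketch. The host names, the `BAN_LIST` set and the `is_banned` helper are hypothetical and are not part of the system described in these slides.

```python
from urllib.parse import urlsplit

# Hypothetical ban list of host names; a real list would be far larger.
BAN_LIST = {"bad.example", "worse.example"}

def is_banned(url: str) -> bool:
    """Return True if the host of the URL is on the predefined ban list."""
    host = urlsplit(url).hostname or ""
    return host.lower() in BAN_LIST

direct = "http://bad.example/page.html"
# The same page requested through a hypothetical Web proxy: the filter only
# sees the proxy's host name, so the ban list no longer matches.
via_proxy = "http://proxy.example/fetch?u=http%3A%2F%2Fbad.example%2Fpage.html"

print(is_banned(direct))     # True
print(is_banned(via_proxy))  # False
```

The list would have to contain every proxy as well as every inappropriate site to stay effective.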
Disadvantages of current solutions
• Content filters apply pattern matching to the
HTTP response.
Content is blocked if any restricted signature is
found in the HTTP response.
• Text is treated as a stream of bytes.
• No full-text analysis is performed.
• Recall and precision of filtering are low.
The exponential growth of the Web and its
dynamically changing content lead to
incomplete coverage of inappropriate Web
resources.
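The weakness of byte-level signature matching can be shown with a small sketch. The `SIGNATURES` list and the `signature_block` function are hypothetical and only mimic the general approach; they are not taken from any particular product.

```python
# Hypothetical restricted signatures, matched as raw byte patterns.
SIGNATURES = [b"sex", b"drugs"]

def signature_block(response_body: bytes) -> bool:
    """Block the response if any signature occurs anywhere in the bytes."""
    lowered = response_body.lower()
    return any(sig in lowered for sig in SIGNATURES)

harmless = b"<p>Train timetable for Essex and Sussex</p>"
evasive = b"<p>s-e-x and d.r.u.g.s for sale</p>"

print(signature_block(harmless))  # True: a false positive, which lowers precision
print(signature_block(evasive))   # False: a missed page, which lowers recall
```

Because the text is never analyzed as language, a harmless place name is blocked while a trivially obfuscated page passes.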
Solution
The solution is automatic classification based on
full-text content analysis of Web pages “on
the fly”:
• The system assigns documents to categories and
denies access to inappropriate Web pages.
• The automatic classification method analyses
Web pages according to the importance of terms
in natural-language text.
• Morphological analysis is used for text
preprocessing.
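The slides do not specify the classification method, so the following is only a minimal sketch of classification by term importance. It assumes a TF-IDF-like weighting and uses crude suffix stripping as a stand-in for real morphological analysis; the function names `normalize`, `terms`, `train` and `classify` and the example categories are all hypothetical.

```python
import math
import re
from collections import Counter

def normalize(word: str) -> str:
    """Crude suffix stripping; a stand-in for real morphological analysis."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text: str) -> list[str]:
    """Lowercase the text, split it into words and normalize each word."""
    return [normalize(w) for w in re.findall(r"[a-z]+", text.lower())]

def train(examples: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Weight each term by its relative frequency in a category, scaled by
    an IDF-like factor that favors terms found in few categories."""
    counts = {cat: Counter(t for doc in docs for t in terms(doc))
              for cat, docs in examples.items()}
    n_cats = len(counts)
    weights: dict[str, dict[str, float]] = {}
    for cat, counter in counts.items():
        total = sum(counter.values())
        weights[cat] = {}
        for term, freq in counter.items():
            cats_with_term = sum(1 for c in counts.values() if term in c)
            idf = math.log((1 + n_cats) / cats_with_term)
            weights[cat][term] = (freq / total) * idf
    return weights

def classify(text: str, weights: dict[str, dict[str, float]]) -> str:
    """Return the category whose weighted terms best cover the text."""
    doc_terms = terms(text)
    scores = {cat: sum(w.get(t, 0.0) for t in doc_terms)
              for cat, w in weights.items()}
    return max(scores, key=scores.get)

# Hypothetical learning examples for two categories.
examples = {
    "gambling": ["online casino betting bonuses",
                 "poker betting and casino jackpots"],
    "education": ["university lectures and courses",
                  "students studying for school exams"],
}
weights = train(examples)
print(classify("Best casinos for betting", weights))     # gambling
print(classify("School courses for students", weights))  # education
```

Normalization lets "casinos" match the "casino" seen in the learning examples, which is the role morphological preprocessing plays here.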
Architecture of the Web Content Filter
Diagram components: User, LAN, PROXY server, Redirector, URL cache,
Classification subsystem, Linguistic analyzer, Web server application,
WEB server, Web, requested Web page.
Dynamic content filtering algorithm
1. The user requests a Web document.
2. Is the document URL present in the URL cache?
YES: go to step 6. NO: go to step 3.
3. Analyze the document text.
4. Classify the document.
5. Add the document’s URL to the URL cache.
6. Is the document inappropriate?
NO: go to step 7-a. YES: go to step 7-b.
7-a. Allow access.
7-b. Deny access.
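The flow of the filtering algorithm can be sketched as follows. This is an illustration only: the `ContentFilter` class, the stub `fetch_text` and `classify` functions and the example URLs are hypothetical, and the sketch assumes that the URL cache stores the assigned category together with the URL so that cached pages need not be re-analyzed.

```python
from typing import Callable

class ContentFilter:
    def __init__(self, fetch_text: Callable[[str], str],
                 classify: Callable[[str], str],
                 banned_categories: set[str]) -> None:
        self.fetch_text = fetch_text
        self.classify = classify
        self.banned = set(banned_categories)
        # Assumption: the cache maps each URL to its assigned category.
        self.url_cache: dict[str, str] = {}

    def allow(self, url: str) -> bool:
        """Return True to allow access (7-a) or False to deny it (7-b)."""
        category = self.url_cache.get(url)      # step 2: cache lookup
        if category is None:
            text = self.fetch_text(url)         # step 3: analyze the text
            category = self.classify(text)      # step 4: classify
            self.url_cache[url] = category      # step 5: update the cache
        return category not in self.banned      # step 6: inappropriate?

# Stub page store and classifier, for demonstration only.
PAGES = {
    "http://news.example/": "weather and sports news",
    "http://casino.example/": "casino poker jackpot",
}
calls: list[str] = []

def fetch_text(url: str) -> str:
    calls.append(url)
    return PAGES[url]

def classify(text: str) -> str:
    return "gambling" if "casino" in text else "general"

web_filter = ContentFilter(fetch_text, classify, {"gambling"})
print(web_filter.allow("http://news.example/"))    # True
print(web_filter.allow("http://casino.example/"))  # False
print(web_filter.allow("http://casino.example/"))  # False, answered from cache
print(len(calls))                                  # 2
```

The repeated request is answered without fetching or classifying the page again, so only the first visit to a URL pays the analysis cost.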
Deployment of
the Web Content Filter
Main steps:
1. Defining the categories of inappropriate content.
2. Preparing learning examples.
3. Running the automatic classifier in learning
mode.
4. System setup, testing and parameter
customization.
5. Running the automatic classifier in filtering
mode.
Advantages
• Higher precision and recall of filtering than
traditional methods.
• Transparency for users: no special configuration
of Web browsers or other Web applications is
required.
• Adaptability to users’ behavior: only requested
pages are examined.
• Scalability: distributed modules can handle the
load of large computer networks.
• Strictness level customization.
Usage
Content filtering for socially safe Web browsing at:
• Educational institutions.
• State institutions.
• Companies.
• Home networks.
Contacts
Institute for Systems Analysis
of the Russian Academy of Sciences
117312, Moscow,
pr. 60-let Octiabrya, 9
phone/fax: +7 (499) 135-04-63
e-mail: [email protected]