
Chapter 7
Web Content Mining
DSCI 4520/5240
Dr. Nick Evangelopoulos
Introduction
- Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page contents. The content may be:
  - text
  - audio
  - video
  - still images
  - metadata
  - hyperlinks
Introduction
- Problems with Web data:
  - Distributed data
  - Large volume
  - Unstructured data
  - Redundant data
  - Inconsistent data quality
  - High percentage of volatile data
  - Varied data
Introduction
- Two approaches to Web content mining:
  - Agent-based: software agents perform the content mining
  - Database-oriented: the Web data is viewed as belonging to a database
Web Crawler
- A computer program that navigates the hypertext structure of the Web
- Crawlers ease the construction of the indexes used by search engines
- The pages that the crawler begins with are called the seed URLs
- One type of crawler builds an index by visiting a number of pages and then replaces the current index
  - It is known as a periodic crawler because it is activated periodically
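The periodic crawler described above can be sketched in Python as follows. This is a minimal sketch, not the course's implementation: the in-memory TOY_WEB dictionary, the toy_fetch function, and the PeriodicCrawler class are hypothetical names that stand in for real page fetching, and the index is a simple word-to-URL map. It illustrates two ideas from the slide: crawling starts from the seed URLs, and each run builds a new index that replaces the current one.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("web content mining", ["http://a.example/b"]),
    "http://a.example/b": ("mining web crawlers", []),
}


def toy_fetch(url):
    """Stand-in for an HTTP fetch; unknown URLs yield an empty page."""
    return TOY_WEB.get(url, ("", []))


def build_index(seed_urls, fetch, max_pages=100):
    """Crawl outward from the seed URLs and return a fresh inverted index."""
    index = {}                      # word -> set of URLs containing it
    frontier = deque(seed_urls)     # URLs waiting to be visited
    seen = set(seed_urls)           # URLs already queued
    pages_visited = 0
    while frontier and pages_visited < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        pages_visited += 1
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


class PeriodicCrawler:
    """Rebuilds the whole index on every run and swaps it in."""

    def __init__(self, seed_urls, fetch):
        self.seed_urls = list(seed_urls)
        self.fetch = fetch
        self.current_index = {}

    def run_once(self):
        new_index = build_index(self.seed_urls, self.fetch)
        self.current_index = new_index   # replace the old index, do not update it
```

Calling run_once() on a schedule gives the periodic behavior; between runs, searches keep using the previous index.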
Web Crawler
- Another type is the focused crawler
  - Generally recommended because of the large size of the Web
  - Visits pages related to topics of interest
  - If a page is not pertinent, the entire set of possible pages below it is pruned
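The pruning rule can be illustrated with a small Python sketch. The toy site, the function names, and above all the keyword test in is_pertinent are assumptions made for illustration; a real focused crawler would use a trained classifier rather than keyword matching.

```python
from collections import deque

# Hypothetical toy site: page name -> (page text, outgoing links).
TOY_WEB = {
    "root": ("data mining overview", ["mining", "sports"]),
    "mining": ("web mining methods", ["crawler"]),
    "crawler": ("mining with a crawler", []),
    "sports": ("football scores", ["scores"]),
    "scores": ("data mining of football scores", []),
}


def is_pertinent(text, topic_words):
    """Stand-in relevance test: the page mentions at least one topic word."""
    return any(word in topic_words for word in text.split())


def focused_crawl(seed, topic_words, web):
    """Return (pages visited, pages judged relevant), pruning off-topic branches."""
    visited, relevant = [], []
    frontier = deque([seed])
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        text, links = web[url]
        visited.append(url)
        if not is_pertinent(text, topic_words):
            continue   # prune: the links below this page are never queued
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited, relevant
```

With the topic word "mining", the "sports" page is visited but judged off-topic, so "scores" is never fetched even though it would have matched. This shows the trade-off of pruning: fewer fetches, at the risk of missing relevant pages that sit below irrelevant ones.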
Web Crawler
- Crawling process
  - Begin with a group of URLs
    - Submitted by users
    - Common URLs
  - Proceed breadth-first or depth-first
  - Extract more URLs
  - Numerous crawlers
    - Problem of redundancy
    - Partition the Web and assign one robot per partition
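The steps above can be sketched in Python. The function names crawl_order and robot_for are hypothetical, and the link graph is passed in as a plain dictionary instead of being fetched. Partitioning by a hash of the host name is one possible partitioning rule, assumed here for illustration; the slide does not say how the Web is partitioned.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


def crawl_order(seed_urls, links, strategy="breadth"):
    """Return the order in which URLs are visited.

    links maps each URL to the URLs extracted from it.
    strategy is "breadth" (queue) or "depth" (stack-based variant).
    """
    frontier = deque(seed_urls)
    seen = set(seed_urls)            # avoids visiting the same URL twice
    order = []
    while frontier:
        # A queue gives breadth-first order; a stack gives depth-first order.
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def robot_for(url, n_robots):
    """Assign a URL to one of n_robots partitions by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_robots
```

Because every URL on the same host maps to the same robot, two robots never fetch the same page, which addresses the redundancy problem without any coordination between them.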
Focused Crawler
- The focused crawler structure consists of two major parts:
  - The distiller
  - The hypertext classifier
Focused Crawler
- The crawler selects the pages to visit from a priority-based structure, ordered by the priorities that the classifier and the distiller associate with pages
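A priority-based structure of this kind is commonly implemented as a heap. The sketch below is an illustration under stated assumptions: the PriorityFrontier class is a hypothetical name, and adding the classifier's relevance score to the distiller's hub score is an assumed combination rule, since the slide does not say how the two priorities are combined.

```python
import heapq
import itertools


class PriorityFrontier:
    """Frontier that always returns the highest-priority URL next."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: earlier pushes win

    def push(self, url, relevance, hub_score):
        # Assumed rule: priority is the sum of the two scores.
        priority = relevance + hub_score
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

A crawler loop would pop a URL, fetch and score the page, and push its outgoing links with the new scores.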
Focused Crawler
- Sample documents are identified and classified based on a hierarchical classification tree
- These documents are used as the seed documents to begin the focused crawling
Context Graph
- Research on focused crawling has proposed the use of context graphs, which led to the context focused crawler (CFC)
- The CFC performs crawling in two steps:
  - Context graphs and classifiers are constructed using a set of seed documents as a training set
  - Crawling is performed using the classifiers to guide it
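The first step can be illustrated with a small Python sketch that builds the layers of a context graph for one seed document. It assumes the usual formulation in which layer 0 holds the seed and layer i holds the pages that link to a page in layer i-1; the slide does not spell this out. The build_context_graph name and the backlinks dictionary are hypothetical, and the classifier training that completes step one is omitted.

```python
def build_context_graph(seed, backlinks, max_layers=2):
    """Return a list of layers: layers[0] is {seed}, layers[i] holds the pages
    whose shortest link path to the seed has length i.

    backlinks maps a page to the pages that link to it (assumed to be known,
    for example from a search engine's backlink query).
    """
    layers = [{seed}]
    assigned = {seed}
    for _ in range(max_layers):
        next_layer = set()
        for page in layers[-1]:
            for parent in backlinks.get(page, []):
                if parent not in assigned:   # keep each page in its closest layer
                    next_layer.add(parent)
        if not next_layer:
            break
        assigned |= next_layer
        layers.append(next_layer)
    return layers
```

In the second step, a classifier trained on each layer would estimate how many links away from a target a newly fetched page is likely to be, and the crawler would follow links from the pages estimated to be closest first.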
Context Graph
Implementation of a Web Crawler
- Wget is a free GNU utility for retrieving Web documents
- Wget supports the Internet protocols:
  - HTTP (Hypertext Transfer Protocol)
  - FTP (File Transfer Protocol)
- It can recursively traverse the structure of HTML documents and FTP directory trees
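As an illustration of recursive retrieval, the following Python sketch assembles a Wget command line and optionally runs it. The helper names build_wget_command and run_wget, the default values, and the example URL are assumptions; the flags themselves (--recursive, --level, --no-parent, --wait, --directory-prefix) are standard Wget options. Running the command requires Wget to be installed.

```python
import subprocess


def build_wget_command(url, depth=2, out_dir="mirror", wait_seconds=1):
    """Build the argument list for a polite, depth-limited recursive download."""
    return [
        "wget",
        "--recursive",                     # follow links found in HTML pages
        f"--level={depth}",                # maximum recursion depth
        "--no-parent",                     # never ascend above the start directory
        f"--wait={wait_seconds}",          # pause between requests
        f"--directory-prefix={out_dir}",   # where downloaded files are saved
        url,
    ]


def run_wget(url, **options):
    """Run Wget with the options above; raises CalledProcessError on failure."""
    return subprocess.run(build_wget_command(url, **options), check=True)
```

For example, run_wget("http://example.com/docs/", depth=1) would fetch that page and the pages it links to under /docs/, saving them in the mirror directory.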
Commonly Used Options for Wget
Methods for Crawl Class
Crawl class
Figure 7.7 Code from the main of Crawl class (Suitable for Java programmers)
The readContent Method of Crawl Class

Figure 7.8 Code from the readContent method of Crawl class (Suitable for Java programmers)
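The Java code of Figure 7.8 is not reproduced in this transcript. As a rough equivalent, the Python sketch below shows what a method of this kind plausibly does: open a URL and return the page content as text. The read_content name, the timeout, and the fallback to UTF-8 are assumptions, not details taken from the Crawl class.

```python
from urllib.request import urlopen


def read_content(url, timeout=10):
    """Fetch a URL and return its body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, or fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A production crawler would also handle HTTP errors, redirects, and robots.txt rules, which this sketch leaves out.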
Code for Extracting Links from Crawl Class
Figure 7.9
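The Java code of Figure 7.9 is likewise not reproduced in this transcript. The following Python sketch shows one common way to extract links from a page using the standard library's HTML parser. The LinkExtractor class and extract_links function are hypothetical names, and only the href attributes of anchor tags are collected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The extracted URLs are what a crawler adds to its frontier after visiting a page.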
Thank you for your attention