Chapter 7
Web Content Mining
DSCI 4520/5240
Dr. Nick Evangelopoulos
Introduction
Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page contents. Such content includes:
- text
- audio
- video
- still images
- metadata
- hyperlinks
Introduction
Problems with Web data:
Distributed data
Large volume
Unstructured data
Redundant data
Variable data quality
High percentage of volatile data
Varied data types
Introduction
Two approaches to Web content mining:
Agent-based: software agents perform the content mining
Database-oriented: the Web data are viewed as belonging to a database
Web Crawler
A computer program that navigates the hypertext structure of the Web
Crawlers are used to ease the creation of the indexes used by search engines
The page(s) that the crawler begins with are called the seed URLs
A crawler that builds an index by visiting a number of pages and then replaces the current index is known as a periodic crawler, because it is activated periodically
Web Crawler
Another type is the focused crawler:
Generally recommended because of the large size of the Web
Visits only pages related to topics of interest
If a page is not pertinent, the entire set of pages reachable below it is pruned
Web Crawler
Crawling process:
Begin with a group of seed URLs, either submitted by users or taken from commonly used URLs
Traverse links breadth-first or depth-first
Extract more URLs from the pages visited
With numerous crawlers running at once, redundancy becomes a problem; one solution is to partition the Web and assign one robot per partition, as sketched below
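As an illustration of the partitioning idea, here is a small Java sketch that assigns each URL to one of several crawlers by hashing its host name; the class name, partition count, and hashing scheme are all illustrative assumptions, not a prescribed method.

    import java.net.URL;

    // Sketch: partition the Web among crawlers by hashing each URL's host,
    // so that every host is handled by exactly one robot.
    public class Partitioner {

        static int partitionFor(String url, int numCrawlers) throws Exception {
            String host = new URL(url).getHost();
            // floorMod keeps the result non-negative even if hashCode() is negative.
            return Math.floorMod(host.hashCode(), numCrawlers);
        }

        public static void main(String[] args) throws Exception {
            String[] urls = {
                "http://example.com/a.html",
                "http://example.com/b.html",
                "http://example.org/index.html"
            };
            for (String u : urls) {
                // Pages on the same host map to the same crawler, avoiding redundant fetches.
                System.out.println(u + " -> crawler " + partitionFor(u, 4));
            }
        }
    }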
Focused Crawler
The focused crawler structure consists of two major parts:
The distiller, which identifies hub pages that point to many topic-relevant pages
The hypertext classifier, which assigns a relevance score to each visited page
Focused Crawler
The pages that the crawler visits are selected using a priority-based structure, ordered by the priorities that the classifier and the distiller associate with pages, as sketched below.
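A minimal Java sketch of such a priority-based frontier follows; the scores here are hard-coded placeholders standing in for the values a classifier and distiller would assign.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Sketch of a priority-based crawl frontier. In a real focused crawler
    // the scores come from the hypertext classifier and the distiller.
    public class Frontier {

        static class ScoredUrl {
            final String url;
            final double score; // relevance priority assigned to the page
            ScoredUrl(String url, double score) { this.url = url; this.score = score; }
        }

        public static void main(String[] args) {
            // The highest-priority page is dequeued (and would be crawled) first.
            PriorityQueue<ScoredUrl> frontier = new PriorityQueue<>(
                Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());

            frontier.add(new ScoredUrl("http://example.com/on-topic.html", 0.9));
            frontier.add(new ScoredUrl("http://example.com/hub.html", 0.7));
            frontier.add(new ScoredUrl("http://example.com/off-topic.html", 0.1));

            while (!frontier.isEmpty()) {
                ScoredUrl next = frontier.poll();
                System.out.println("Visit " + next.url + " (priority " + next.score + ")");
            }
        }
    }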
Focused Crawler
Sample documents are identified and classified based on a hierarchical classification tree; these documents are then used as the seed documents to begin the focused crawling.
Context Graph
Research on focused crawling has proposed the use of context graphs, which in turn led to the context focused crawler (CFC).
The CFC performs crawling in two steps:
Context graphs and classifiers are constructed using a set of seed documents as a training set
Crawling is performed using the classifiers to guide it
Context Graph
Implementation of a Web Crawler
Wget is a free GNU utility that makes it possible to retrieve Web documents.
Wget supports the Internet protocols:
HTTP (HyperText Transfer Protocol)
FTP (File Transfer Protocol)
It can recursively browse the structure of HTML documents and FTP directory trees.
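For example, a recursive retrieval limited to two levels of links might look like this (the URL and depth are illustrative):

    wget -r -l 2 http://example.com/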
Commonly Used Options for Wget
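Some commonly used options, as documented in the standard wget manual (listed here in place of the original table):

    -r        retrieve recursively
    -l depth  set the maximum recursion depth
    -np       do not ascend to the parent directory
    -k        convert links in downloaded documents for local viewing
    -p        download the images and style sheets a page requires
    -c        continue (resume) a partially downloaded file
    -q        quiet operation (no output)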
Methods for Crawl Class
Crawl Class
Figure 7.7 Code from the main method of the Crawl class (suitable for Java programmers)
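A minimal sketch of what the main method of such a Crawl class might look like (illustrative names and seed URL, not the textbook's actual figure code); together with the readContent and extractLinks sketches under Figures 7.8 and 7.9 below, it forms one compilable class.

    import java.util.*;

    // Sketch of a Crawl class (illustrative, not the figure's actual code).
    // readContent (Figure 7.8) and extractLinks (Figure 7.9) are sketched
    // below; the three pieces together form one compilable class.
    public class Crawl {

        public static void main(String[] args) {
            Queue<String> frontier = new LinkedList<>(); // URLs waiting to be visited (FIFO = breadth-first)
            Set<String> visited = new HashSet<>();       // URLs already visited, to avoid revisits
            frontier.add("http://example.com/");         // seed URL (assumed)

            int limit = 10;                              // small page budget for the sketch
            while (!frontier.isEmpty() && visited.size() < limit) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;         // skip pages seen before
                String page = readContent(url);          // fetch the page (Figure 7.8)
                if (page == null) continue;              // fetch failed; move on
                System.out.println("Visited: " + url);
                frontier.addAll(extractLinks(page));     // enqueue newly found URLs (Figure 7.9)
            }
        }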
The readContent Method of the Crawl Class
Figure 7.8 Code from the readContent method of the Crawl class (suitable for Java programmers)
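Again as a stand-in for the figure, a minimal readContent sketch continuing the Crawl class above; the null-on-failure behavior is an assumption of this sketch.

        // Continuation of the Crawl class sketch: fetch the content of a URL
        // over HTTP and return it as a string, or null if the fetch fails.
        static String readContent(String url) {
            try (java.io.BufferedReader in = new java.io.BufferedReader(
                    new java.io.InputStreamReader(new java.net.URL(url).openStream()))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                return sb.toString();
            } catch (Exception e) {
                return null; // malformed URL, unreachable host, or I/O error
            }
        }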
Code for Extracting Links from the Crawl Class
Figure 7.9 Code for extracting links from the Crawl class (suitable for Java programmers)
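As a stand-in for this figure as well, a minimal link-extraction sketch that completes the Crawl class above; it uses a simple regular expression, whereas a production crawler would use a real HTML parser and also resolve relative links.

        // Continuation of the Crawl class sketch: pull href="http..." targets
        // out of the page text with a regular expression.
        static java.util.List<String> extractLinks(String page) {
            java.util.List<String> links = new java.util.ArrayList<>();
            java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("href=\"(http[^\"]+)\"", java.util.regex.Pattern.CASE_INSENSITIVE)
                .matcher(page);
            while (m.find()) {
                links.add(m.group(1)); // the captured URL inside the quotes
            }
            return links;
        }
    } // end of the Crawl class sketch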
Thank you for your attention