Web Mining
Introduction Business Intelligence
ODIT – 2
The Institute of Finance Management: Computing and
IT Dept.
http://www.ifm.ac.tz/staff/bajuna
Introduction
• With the huge amount of information available
online, the World Wide Web is a fertile area
for data mining.
• We are drowning in information and facing
information overload from the Web.
• Web Mining techniques could be used to solve
the information overload problems.
Problems With Search Engines
• Search tools have the following problems:
– Low precision, due to the irrelevance of many of
the search results. This makes it difficult to find
relevant information.
– Low recall, due to the inability to index all
the information available on the Web.
• It is therefore hard to find relevant information
that has not been indexed.
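The two measures named above can be made concrete with a small sketch. Precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant documents that were returned. The page identifiers and the relevance judgements below are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four pages returned, five pages truly relevant.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p7", "p8", "p9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four returned pages are relevant (precision 0.5), and two of the five relevant pages were found (recall 0.4). Pages p7 to p9 stand in for relevant pages the engine never indexed.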
Definition
• Web mining is the use of data mining
techniques to automatically discover and
extract information from Web documents and
services.
Web Mining Tasks
• According to Etzioni, Web mining can be
decomposed into the following subtasks:
– Resource finding: the task of retrieving the
intended Web documents.
– Information selection and pre-processing:
automatically selecting and pre-processing specific
information from the retrieved Web resources.
– Generalisation: automatically discovering general
patterns at individual Web sites as well as across
multiple sites.
– Analysis: validation and/or interpretation of the
mined patterns.
Web Mining Tasks
• By resource finding we mean the process of
retrieving data, either online or offline, from
the text sources available on the Web, such as
electronic newsletters, electronic newswires
and newsgroups.
• The information selection and pre-processing
step is any kind of transformation of the
original data retrieved in the IR process.
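One possible pre-processing transformation is sketched below: a retrieved HTML page is reduced to a list of lower-case word tokens. This is only an illustration using the Python standard library. The class name `TextExtractor`, the tiny stop-word list and the sample page are assumptions made for the example, not part of the original material.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


# Deliberately tiny, illustrative stop-word list.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for"}


def preprocess(html):
    """Strip markup, lower-case, tokenise and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Web Mining</h1>"
    "<p>The Web is a fertile area for data mining.</p></body></html>"
)
tokens = preprocess(page)
print(tokens)
```

The resulting token list is the kind of input that the later generalisation step can work on.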
Web Mining Tasks
• In step 3 (generalisation), data mining
techniques are used to discover the patterns.
• Humans play an important role in the
information or knowledge discovery process
on the Web, since the Web is an interactive
medium. This is important for validation and/or
interpretation in step 4 (analysis).
Generally
• Web mining refers to the overall process of
discovering potentially useful and
previously unknown information or
knowledge from Web data.
• It implicitly covers the standard process of
Knowledge Discovery in Databases (KDD).
• Web mining can be simply viewed as an
extension of KDD that is applied to Web
data.
Web Mining and IR
• Information retrieval (IR) has the primary goals
of indexing text and searching for useful
documents in a collection. It also covers
document classification and categorisation.
• The IR task that can be considered Web mining is
Web document classification or
categorisation, which could be used for
indexing.
• Therefore Web mining is part of the IR process,
although not all indexing tasks use data mining
techniques.
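As a rough illustration of Web document classification, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing on tokenised documents. Naive Bayes is just one data mining technique that could be used here; the slides do not prescribe it. The training documents and the labels "sport" and "finance" are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: list of (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for tokens, label in labelled_docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-made training set.
training = [
    ("football match goal league".split(), "sport"),
    ("team wins cup match".split(), "sport"),
    ("bank interest rates loan".split(), "finance"),
    ("stock market bank shares".split(), "finance"),
]
model = train(training)
label = classify("bank loan rates".split(), model)
print(label)
```

The predicted category could then be stored alongside the page as an index term, which is the link to indexing mentioned above.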
Web Mining and Information
Extraction
• IE aims at extracting relevant facts from the
documents.
• Due to the nature of the Web, some IE systems
focus on extracting from specific Web sites.
• Others use data mining techniques to learn
the extraction patterns or rules for Web
documents.
• Web mining in this case is part of the (Web) IE
process.
Web Mining and Information
Extraction
• The result of the IE process could take the form
of a structured database, or could be a
compression or summary of the original text
or documents.
• Therefore IE can be viewed as a pre-processing
stage in the Web mining process: the step
after the IR process and before the data
mining techniques are applied.
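To make the idea of IE as a pre-processing stage more concrete, the sketch below applies one hand-written extraction rule to a site-specific page and returns database-like records. The page layout, the field names and the example.org addresses are all hypothetical. A learning-based IE system would induce such a rule from examples instead of having it written by hand.

```python
import re

# Hand-written extraction rule for a hypothetical staff listing in which
# each <li> holds "name, department, email".
ROW = re.compile(
    r"<li>\s*(?P<name>[^,<]+),\s*(?P<dept>[^,<]+),\s*"
    r"(?P<email>[\w.+-]+@[\w.-]+)\s*</li>"
)


def extract_records(html):
    """Return one dict per matching list item."""
    return [
        {field: value.strip() for field, value in m.groupdict().items()}
        for m in ROW.finditer(html)
    ]


page = (
    "<ul>"
    "<li>A. Example, Computing and IT, a.example@example.org</li>"
    "<li>B. Sample, Accounting, b.sample@example.org</li>"
    "</ul>"
)
records = extract_records(page)
for record in records:
    print(record)
```

Each record has the same fields, so the output can be loaded into a table and passed on to the data mining step.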
Web Mining Categories
• Web mining can be categorised into three areas
of interest, based on which part of the Web is mined.
• These are:
– Web Content Mining
– Web Structure Mining
– Web Usage Mining
Web Content Mining
• Describes the discovery of useful information
from the Web contents/data/documents.
• However, what constitutes Web content
can encompass a broad range of data.
• Web content consists of several types of
data, such as text, images, audio, video,
metadata and hyperlinks.
Web Content Mining
• The Web content data consist of:
– Unstructured data such as free texts.
– Semi-structured data such as HTML documents.
– More structured data such as data in tables or
database-generated HTML pages.
• However, much of the Web content data is
unstructured text data.
Web Content Mining
• The research around applying data mining
techniques to unstructured text is termed
Knowledge Discovery in Texts (KDT), text
data mining or, more commonly, text
mining.
• Therefore text mining is an instance of Web
content mining.
Web Structure Mining
• This tries to discover the model underlying the
link structure of the Web.
• The model is based upon the type of the
hyperlinks with or without the description of
the links.
• This model can be used to categorise Web
pages and is useful to generate information
such as the similarity and relationship
between different Web sites.
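A very small sketch of mining link structure follows. It represents hyperlinks as a mapping from each page to the set of pages it links to, counts in-links per page, and scores the similarity of two pages by the Jaccard overlap of their out-links. Jaccard similarity is chosen here only as a simple illustration and is not named in the slides. All host names are invented.

```python
from collections import defaultdict

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "a.example/home": {"hub.example", "news.example", "docs.example"},
    "b.example/home": {"hub.example", "news.example"},
    "c.example/home": {"shop.example"},
}


def in_degree(links):
    """Count how many pages link to each target page."""
    counts = defaultdict(int)
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


def jaccard(a, b):
    """Shared out-links divided by all out-links of the two pages."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sim_ab = jaccard(links["a.example/home"], links["b.example/home"])
sim_ac = jaccard(links["a.example/home"], links["c.example/home"])
print(in_degree(links))
print(sim_ab, sim_ac)
```

Pages a and b share two of their three distinct link targets, so they score as more closely related than a and c, which share none.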
Web Usage Mining
• This tries to make sense of the data generated
by Web surfers’ sessions or behaviours.
• While Web content and structure mining
utilise the real or primary data on the Web,
Web usage mining mines the secondary data
derived from the interactions of users
with the Web.
Web Usage Mining
• Web usage data includes data from
Web server access logs, user profiles,
registration data, user sessions or
transactions, cookies, user queries, bookmark
data, mouse clicks and scrolls, and any other
data that results from interactions.
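As an example of working with one of these sources, the sketch below parses Web server access log lines and groups each host's requests into sessions. It assumes the log is in Common Log Format and uses a 30-minute inactivity timeout to split sessions. Both the format and the timeout are assumptions for the example, and the log lines themselves are invented.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse(line):
    """Return (host, timestamp, path), or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    return m["host"], ts, m["path"]


def sessionise(lines, timeout=timedelta(minutes=30)):
    """Split each host's requests into sessions at gaps longer than timeout."""
    by_host = defaultdict(list)
    for line in lines:
        record = parse(line)
        if record:
            host, ts, path = record
            by_host[host].append((ts, path))
    sessions = []
    for host, hits in by_host.items():
        hits.sort()  # chronological order
        current = [hits[0][1]]
        for (prev_ts, _), (ts, path) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((host, current))
                current = []
            current.append(path)
        sessions.append((host, current))
    return sessions


log = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:58:10 +0000] "GET /courses.html HTTP/1.0" 200 1200',
    '192.0.2.1 - - [10/Oct/2000:15:02:00 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2000:14:00:00 +0000] "GET /staff.html HTTP/1.0" 200 900',
]
sessions = sessionise(log)
for host, paths in sessions:
    print(host, paths)
```

Identifying a user by host address alone is a simplification. In practice cookies or registration data, also listed above, give a more reliable user identity.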
Generally
• In practice, the three Web mining categories
can be used in isolation or combined in an
application, especially Web content and
structure mining.