dawak-mining


Research Issues in Web Data Mining
Sanjay Kumar Madria
Department of Computer Science
Purdue University, West Lafayette, IN
[email protected]
Sourav Bhowmick, Ng Wee Keong
and Lim Ee Peng
Nanyang Technological University, Singapore
Dawak'99
copy-right@sanjaymadria
WHOWEDA!
www.cais.ntu.edu.sg:8000/~whoweda
• A WareHouse Of WEb DAta
• Web Information Coupling Model (WICM)
– Web Objects
– Web Schema
• Web Information Coupling Algebra
• Web Information Maintenance
• Web Mining and Knowledge discovery
Web Objects
• Node - url, title, format, size, date, text
• Link - source-url, target-url, label, link-type
• Web tuple
• Web table
• Web schema
• Web database
[Figure: WHOWEDA architecture. Labels: User, WWW, Web Information Coupling System, Web Information Mining System, Web Querying & Analysis Component, Web Information Maintenance System, Web Warehouse, Web Mart, Warehouse Concept Mart.]
[Figure: Web Query & Display. Labels: User, WWW, Warehouse Concept Mart, Web Warehouse, Pre-processing, Global Web Coupling, Global Web Manipulation, Global Ranking, Local Web Coupling, Local Web Manipulation, Local Ranking, Web Select, Web Project, Web Join, Web Union, Web Intersection, Schema Search, Schema Match, Schema Tightness, Data Visualization.]
Web Schema
• Structural ‘summary’ of web table
• Information Coupling using a Query graph
• Query graph -> Web schema
• directed graph as ordered 4-tuple:
– Set of node variables
– Set of link variables
– Connectivities
– Predicates
Brief Organization of Information Space's Web Site
[Figure: site organization. Nodes: Information Square's homepage; Headline article 1 ... Headline article n; News@TCS; News specials (list of video files); Airport info; list of links to local news (Local news 1 ... Local news k); list of links to world news (World news 1 ... World news t).]
[Figure: query graph over node variables x, y, z, w and link variables e, f, g, h. Predicates shown: e.target_url CONTAINS "article"; y.url CONTAINS "headlines"; f.target_url CONTAINS "newshub/specials"; g.label CONTAINS "Local News"; z.url CONTAINS "local"; h.label CONTAINS "World News"; w.url CONTAINS "world".]
[Figure: example web tuple. Nodes: Information Square's homepage; Headline article 1; News specials; List of links to local news; Local news 1; List of links to world news; World news 1.]
Schema - example
• Node variables: Xn = { x, y, z, w }
• Link variables: Xl = { e, f, g, h }
• Connectivities: C = { x<e>y and x<fg>z and x<fh>w }
• Predicates:
P = { x.url = "http://www.mediacity.com.sg/i-square",
y.url CONTAINS "headlines",
e.target_url CONTAINS "article",
f.target_url CONTAINS "newshub/specials",
g.label CONTAINS "Local News",
z.url CONTAINS "local",
h.label CONTAINS "World News",
w.url CONTAINS "world" }
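The schema example can be sketched in code as an ordered 4-tuple plus a predicate check. This is only an illustration, not the WHOWEDA implementation: the `WebSchema` class, the `satisfies` function and every URL in `binding` other than the i-square homepage are assumptions, and the link variable h is included because the predicates and connectivities use it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebSchema:
    """Ordered 4-tuple: node variables, link variables, connectivities, predicates."""
    node_vars: frozenset
    link_vars: frozenset
    connectivities: tuple  # (source node, path of link variables, target node)
    predicates: tuple      # (variable, attribute, operator, value)


def satisfies(schema, binding):
    """Check every predicate against a binding of variables to attribute dicts."""
    for var, attr, op, value in schema.predicates:
        actual = binding.get(var, {}).get(attr, "")
        if op == "=" and actual != value:
            return False
        if op == "CONTAINS" and value not in actual:
            return False
    return True


schema = WebSchema(
    node_vars=frozenset("xyzw"),
    link_vars=frozenset("efgh"),
    connectivities=(
        ("x", ("e",), "y"),
        ("x", ("f", "g"), "z"),
        ("x", ("f", "h"), "w"),
    ),
    predicates=(
        ("x", "url", "=", "http://www.mediacity.com.sg/i-square"),
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
        ("f", "target_url", "CONTAINS", "newshub/specials"),
        ("g", "label", "CONTAINS", "Local News"),
        ("z", "url", "CONTAINS", "local"),
        ("h", "label", "CONTAINS", "World News"),
        ("w", "url", "CONTAINS", "world"),
    ),
)

# Hypothetical web tuple; only the homepage URL comes from the slides.
base = "http://www.mediacity.com.sg/i-square"
binding = {
    "x": {"url": base},
    "y": {"url": base + "/headlines/article1.html"},
    "e": {"target_url": base + "/headlines/article1.html"},
    "f": {"target_url": base + "/newshub/specials/index.html"},
    "g": {"label": "Local News"},
    "z": {"url": base + "/newshub/specials/local/1.html"},
    "h": {"label": "World News"},
    "w": {"url": base + "/newshub/specials/world/1.html"},
}

print(satisfies(schema, binding))
```

The sketch checks predicates only; verifying that the bound links actually follow the connectivities would need the link endpoints as well.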
Query Graph - Example
• Query graph - same as schema
• Informally, it is a directed connected graph consisting of nodes and links, with keywords imposed on them.
• Produce a list of diseases with their
symptoms, evaluation procedures and
treatment starting from the web site at
http://www.panacea.org/
• Web table Diseases
[Figure: query graph for web table Diseases. Start node x (http://www.panacea.org/); variables y, z, w, e, f, g, p, q; keywords: Issues, List of Diseases, Symptoms, Symptoms list, Evaluation, Treatment, Treatment list.]
[Figure: example web tuple. Start node x0 (http://www.panacea.org/); instances y1, z1, w1, e1, f1, g1, p2, q1; values shown include AIDS and Elisa Test.]
Example 2
• Produce a list of drugs, and their uses and
side effects starting from the web site at
http://www.panacea.org/
• Web table Drugs
[Figure: query graph for web table Drugs. Start node a (http://www.panacea.org/); variables b, c, d, k, r, s; keywords: Issues, List of Diseases, Drug list, Side effects, Use, Uses.]
[Figure: example web tuple. Start node a0 (http://www.panacea.org/); instances b1, c1, d1, k1, r1, s1; values shown include AIDS, Indavir, Side effects of Indavir, Uses of Indavir.]
WWW Data Mining
• Web structure mining: mining the structure of web documents and the links between them.
• Web content mining: the automatic search of information resources available on-line.
• Web usage mining: mining data from server access logs, user registrations or profiles, user sessions or transactions, etc.
Web Structure Mining : Issues
• Measuring the frequency of local links (links within the same server) in the web tuples of a web table.
– Such web tuples carry more information about interrelated documents that exist on the same server.
– Measures the completeness of a web site, in the sense that most of the closely related information is available at the same site (server).
– For example, an airline’s home page will have more local links connecting the “routing information with air-fares and schedules”.
• Measuring the frequency of web tuples in a web table that contain interior links, i.e. links within the same file.
– Measures a web document’s ability to cross-reference other related web pages within the same document.
– Measures the flow of the web documents.
• Measuring the frequency of web tuples in a web table that contain global links, i.e. links that span different web sites.
– Measures the visibility of the web documents and their ability to relate similar or related documents across different sites.
– For example, research documents related to “semi-structured data” will be available at many sites, and such sites should be visible to other related sites by providing cross-references under popular phrases such as “more related links”.
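The three link categories (interior, local, global) can be told apart by comparing source and target URLs. The sketch below is a rough illustration, assuming a web tuple is given as a list of (source URL, target URL) pairs; the function names and the example.* URLs are hypothetical, and "same server" is approximated by an identical host name.

```python
from collections import Counter
from urllib.parse import urlsplit


def link_type(source_url, target_url):
    """Classify a link as interior (same file), local (same server) or global."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if src.netloc != dst.netloc:
        return "global"
    if src.path == dst.path and src.query == dst.query:
        return "interior"  # only the fragment differs, so the link stays in the file
    return "local"


def tuple_frequencies(web_table):
    """Fraction of web tuples that contain at least one link of each type."""
    counts = Counter()
    for web_tuple in web_table:
        counts.update({link_type(s, t) for s, t in web_tuple})
    total = len(web_table)
    kinds = ("interior", "local", "global")
    return {k: (counts[k] / total if total else 0.0) for k in kinds}


# Hypothetical web table with two web tuples.
web_table = [
    [
        ("http://a.example/index.html", "http://a.example/fares.html"),
        ("http://a.example/fares.html", "http://a.example/fares.html#intl"),
    ],
    [
        ("http://a.example/index.html", "http://b.example/links.html"),
    ],
]

print(tuple_frequencies(web_table))
```

A real warehouse would probably need a looser notion of "same server" (mirrors, aliases, ports), which this sketch deliberately ignores.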
• Measuring the frequency of identical web tuples that appear in a web table or across web tables.
– Measures the replication of web documents and may help in identifying mirrored sites.
• What are the in-degree and out-degree of each node (web document)? What do high and low in- and out-degrees mean?
• Locating links to popular web sites in the web tuples of a table.
• Comparing the number of web tuples returned in response to a query on a popular phrase such as “Bio-science” with the number returned for queries containing keywords like “earth-science”.
• Discovering the nature of the hyperlinks in the web sites of a particular domain.
– What information do they provide, and how are they related conceptually?
– Is it possible to extract a conceptual hierarchy for designing web sites of a particular domain?
• Generalizing the flow of information in web sites representing a particular domain.
Web Bags and Web Structure Mining
• Most search engines fail to handle the following knowledge discovery goals:
– Locate the most visible web sites or documents for reference, i.e. sites or documents that can be reached by many paths (high fan-in).
– Locate the most luminous web sites or documents for reference, i.e. sites or documents with the largest number of outgoing links.
– Find the most traversed path for a particular query result, i.e. identify the set of most popular interlinked web documents that have been traversed frequently to obtain the query result.
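The three goals can be approximated over a web bag, i.e. a web table in which duplicate web tuples are kept. The sketch below is a simplification rather than the formal definitions: visibility is taken as the number of distinct pages linking to a node, luminosity as the number of distinct pages it links to, and the most traversed path as the most frequent web tuple. The node names and function names are hypothetical.

```python
from collections import Counter, defaultdict


def visibility_and_luminosity(web_bag):
    """Count distinct incoming (visibility) and outgoing (luminosity) neighbours."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for web_tuple in web_bag:
        for source, target in web_tuple:
            outgoing[source].add(target)
            incoming[target].add(source)
    nodes = set(incoming) | set(outgoing)
    visibility = {n: len(incoming.get(n, ())) for n in nodes}
    luminosity = {n: len(outgoing.get(n, ())) for n in nodes}
    return visibility, luminosity


def most_traversed_path(web_bag):
    """Return the web tuple that occurs most often in the bag, with its count."""
    paths = Counter(tuple(web_tuple) for web_tuple in web_bag)
    return paths.most_common(1)[0]


# Hypothetical web bag: each web tuple is a list of (source, target) links.
web_bag = [
    [("x1", "y1"), ("y1", "z1")],
    [("x1", "y2"), ("y2", "z1")],
    [("x1", "y1"), ("y1", "z1")],
]

visibility, luminosity = visibility_and_luminosity(web_bag)
print(visibility["z1"], luminosity["x1"], most_traversed_path(web_bag))
```

A threshold on these counts, as used in the applications that follow, is then a simple filter over the two dictionaries.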
Applications of Visibility
• Association rules
• e-commerce
Consider a query graph involving keywords such as "types of restaurants" and "items", and the results returned.
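[Figure: query graph starting at www.test.com with node variables x and z, link variable a, and keywords "items" and "restaurants".]
[Figure: three web tuples returned, each starting at X1 (www.test.com). Labels per tuple: (Pizza, Italian restaurants, Z1, Milano-R); (Pizza, Z1, European Restaurants, Paris-R); (Pasta, Italian Restaurants, Z2, Milano-R).]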
• From the results returned, find the most visible pages. Assume Z1 is the most visible page for the given threshold.
• This gives an estimate of the different restaurants selling pizza.
• A lower threshold gives the set (Z1, Z2) as visible pages, which sell both pizza and pasta.
• Generalize to rules such as: of the 66% of restaurants that offer pizza to their customers, 33% also offer pasta.
E-commerce application
• My web site’s visibility is going down!!!!
Application - Luminosity
• Association rules such as: of the X% of all companies that make a product “A”, Y% also make the set of products “B and C”.
• Example - certain companies (33%) that make product A also make products B and C.
• Company C makes only product A.
• That is, of the 66% of companies that make product “A”, 33% also make products B and C.
Consider the following web tuples in a web table.
1. X1 (www.eleccompany.com) -> company A -> product A -> Z1 (www.elecproduct.org/productA, Product A)
2. X1 (www.eleccompany.com) -> company A -> product B -> Z2 (www.elecproduct.org/productB, Product B)
3. X1 (www.eleccompany.com) -> company A -> product C -> Z3 (www.elecproduct.org/productC, Product C)
4. X1 (www.eleccompany.com) -> company B -> product B -> Z2 (www.elecproduct.org/productB, Product B)
5. X1 (www.eleccompany.com) -> company C -> product A -> Z1 (www.elecproduct.org/productA, Product A)
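The percentages in the luminosity example can be recomputed from the five web tuples. The sketch below reads each tuple as a (company, product) pair and treats a company's outgoing product links as the set of products it makes; the function names are hypothetical. On this reading the 66% and 33% figures are both fractions of all companies, while the conventional rule confidence (companies making A, B and C divided by companies making A) would be 50%.

```python
from collections import defaultdict

# (company page, product page) pairs read off the five web tuples.
web_tuples = [
    ("company A", "Product A"),
    ("company A", "Product B"),
    ("company A", "Product C"),
    ("company B", "Product B"),
    ("company C", "Product A"),
]


def products_by_company(pairs):
    """Group the outgoing product links of each company page."""
    made = defaultdict(set)
    for company, product in pairs:
        made[company].add(product)
    return dict(made)


def share_of_companies(made, products):
    """Fraction of all companies whose outgoing links cover every given product."""
    wanted = set(products)
    matching = [c for c, linked in made.items() if wanted <= linked]
    return len(matching) / len(made)


made = products_by_company(web_tuples)
makes_a = share_of_companies(made, {"Product A"})  # 2 of 3 companies
makes_abc = share_of_companies(made, {"Product A", "Product B", "Product C"})  # 1 of 3
confidence = makes_abc / makes_a  # share of A-makers that also make B and C

print(makes_a, makes_abc, confidence)
```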
Web Content Mining
• What does it mean to mine content from the web?
• Is extracting information from a very small subset of all HTML web pages also an instance of web data mining?
• Mining a subset of web pages stored in one or more web tables is a more feasible option.
• Similarities and differences between web content mining in a web warehouse context and conventional data mining.
• Selection of the type of WWW data on which to do web content mining.
• Cleaning of the selected data to mine effectively.
• Types of knowledge that can be discovered in a web warehouse context.
• Discovery of types of information hidden in a web warehouse that are useful for decision making.
• Specifying, measuring and justifying the interestingness of the discovered knowledge.
• The kinds of knowledge to be discovered are: generalized relation, characteristic rule, discriminant rule, classification rule, association rule, and deviation rule.
• Are data mining techniques applicable to web mining and, if so, how? For example, we are interested in generating rules of the following type: 40% of web tuples (i.e., web pages) returned in response to a “travel information query from Hong Kong to Macau” suggest that the popular means of traveling is by ferry.
• Deriving additional knowledge in a web warehouse for web content mining.
• Mining previously unknown knowledge in a web warehouse.
• Presentation of discovered knowledge to users to expedite complex decision making.
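A rule of the "40% of web tuples suggest ferry" kind reduces, in its simplest form, to counting the web tuples whose page text mentions a term. The sketch below assumes each web tuple is a list of page texts and uses plain substring matching; the page texts and the function name are invented for illustration, and real content mining would need far more than keyword matching.

```python
def share_mentioning(web_tuples, term):
    """Fraction of web tuples in which at least one page text mentions the term."""
    if not web_tuples:
        return 0.0
    needle = term.lower()
    hits = sum(
        1 for pages in web_tuples
        if any(needle in text.lower() for text in pages)
    )
    return hits / len(web_tuples)


# Hypothetical page texts returned for a Hong Kong to Macau travel query.
travel_tuples = [
    ["Hong Kong to Macau", "Ferry timetable and fares"],
    ["Travel guide", "Helicopter shuttle service"],
    ["Getting to Macau", "Take the ferry from the Sheung Wan terminal"],
    ["Macau hotels", "Casino packages"],
    ["Hong Kong transport", "Airport express"],
]

print(share_mentioning(travel_tuples, "ferry"))
```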
Web Usage Mining
• Discovery of user access patterns from web servers (user profiles, access patterns for pages, etc.), used for efficient and effective web site management and for understanding user behavior.
– In WHOWEDA, the user initiates a coupling framework to collect related information.
– For example, coupling a query graph “to find the hotel information” with the query graph “to find the places of interest”.
– From such couplings, we can generate user access patterns of the coupling framework, such as “50% of users who query ‘hotel’ also couple their query with ‘places of interest’”.
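A pattern such as "50% of users who query hotel also couple their query with places of interest" is the confidence of a rule over logged couplings. The sketch below assumes each logged coupling is recorded as the set of concepts whose query graphs one user combined; the log contents and the function name are hypothetical.

```python
def coupling_confidence(couplings, antecedent, consequent):
    """Among couplings that include `antecedent`, the share that also include `consequent`."""
    with_antecedent = [c for c in couplings if antecedent in c]
    if not with_antecedent:
        return 0.0
    both = sum(1 for c in with_antecedent if consequent in c)
    return both / len(with_antecedent)


# Hypothetical log: one set of coupled query concepts per user session.
coupling_log = [
    {"hotel", "places of interest"},
    {"hotel"},
    {"hotel", "places of interest", "restaurants"},
    {"hotel", "car rental"},
    {"flights"},
]

print(coupling_confidence(coupling_log, "hotel", "places of interest"))
```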
• Find coupled concepts from the coupling framework.
• This helps in organizing web sites.
• For example, web documents that provide information on “hotels” should also have hyperlinks to web pages providing information on “places of interest”.
Warehouse Concept Mart
• Knowledge discovery in web data becomes more and more complex due to the large amount of data on the WWW.
– Build concept hierarchies involving web data and use them in knowledge discovery.
– The collection of concept hierarchies is called a Warehouse Concept Mart (WCM).
– A concept mart is built by extracting and generalizing terms from web documents to represent the classification knowledge of a given class hierarchy.
• Unclassified words can be clustered based on their common properties. Once the clusters are decided, the keywords can be labeled with their corresponding clusters, and the common features of the terms are summarized to form the concept description.
• Associate a weight with each level of the concept mart to evaluate the importance of a term with respect to its concept level in the concept hierarchy.
Web Concept Mart Applications
• Intelligent answering of web queries
– Supply a threshold for a given keyword in the warehouse concept mart; words that score above the given threshold are taken into consideration when answering the query.
– Use different levels of concepts in the warehouse concept mart, or provide approximate answers.
– Provide the user with some knowledge for framing the global coupling query graph.
• Example - DBMS and Oracle
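Threshold-based answering can be pictured as query expansion over a weighted concept mart. The sketch below is only a guess at the idea: the slides name DBMS and Oracle as an example but give no weights, so the dictionary layout, the other terms, all weights and the `expand_query` function are assumptions.

```python
# Hypothetical concept mart: keyword -> related terms with assumed weights.
concept_mart = {
    "DBMS": {"Oracle": 0.9, "Sybase": 0.7, "query optimizer": 0.4},
}


def expand_query(keyword, threshold, mart):
    """Return the keyword plus related terms whose weight exceeds the threshold."""
    related = mart.get(keyword, {})
    return [keyword] + sorted(t for t, w in related.items() if w > threshold)


print(expand_query("DBMS", 0.5, concept_mart))
```

Lowering the threshold widens the answer set, which is one way to read the "approximate answers" mentioned above.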
Web mining and Concept Mart
• Mine the associations between words appearing in the concept mart at various levels and in the web tuples returned as the result of a query.
– Mining knowledge at multiple levels may help WWW users find interesting rules that are otherwise difficult to discover.
– A knowledge discovery process may climb up and step down to different concept levels in the warehouse concept mart, guided by the user’s interactions and instructions, including different threshold values.
– Capture the flow of web sites of a particular domain; this is helpful in locating information.
Conclusions
• Presented web mining issues in the context of the web warehousing project called WHOWEDA (Warehouse of Web Data).
• Discussed web mining issues with respect to web structure, web content and web usage.
• Our focus is to design tools and techniques for web mining to generate useful knowledge from WWW data.
• We are working on formal algorithms to generate association rules and classification rules.