whoweda-talk

Transcript whoweda-talk

WHOWEDA : Warehouse of Web Data
Sanjay Kumar Madria
Department of Computer Science
Purdue University, West Lafayette, IN 47907
[email protected]
1
2
WWW
• collection of multimedia
documents in the form of web
pages connected via hyperlinks.
3
Characteristics of WWW
• WWW is a set of directed graphs
• data in the WWW has a heterogeneous
nature
• unstructured versus structured information
• no central authority to manage information
• Dynamic verses static information
• Web information discoveries - search
engines
4
As WWW grows, more chaotic it
becomes
• Web is fast growing, distributed, nonadministered global information resource
• WWW allows access to text, image, video,
sound and graphic data
• more business organizations creating web
servers
• more chaotic environment to locate
information of interest
• lost in hyperspace syndrome
5
Does it affect the corporate
world?
• Lack of credibility of data
– Different sites with different data
– Same site different data
• Historical information is not available
– Previous versions of web data
– How does web data change with time
– Summarization over time
• Data to information
• Reduction in productivity
– Analysis is manual
6
How users find web sites
•
•
•
•
•
•
•
•
Indexes and search engines
UseNet newsgroups
Cool lists
New lists
Listservers
Print ads
Word-of-mouth and e-mail
Linked web advertisement
75
44
27
24
23
21
17
4
7
Limitations of Search Engines
• Do not exploit hyperlinks
• search is limited to string matching
• Queries are evaluated on archived data
rather than up-to-date data; no indexing on
current data
• low accuracy
• replicated results
• no further manipulation possible
8
Limitations of Search Engines
•
•
•
•
ERROR 404!
No efficient document management
Query results cannot be further manipulated
No efficient means for knowledge discovery
9
Current Research Projects
• Web Query System
– W3QS, WebSQL, AKIRA, NetQL, RAW,
WebLog
• Semistructured Data
– LOREL, UnQL, WebOQL
• Website Management System
– STRUDEL
• Web Warehouse
- WHOWEDA
10
WHOWEDA -Key Objectives
• Design a suitable data model to represent
web information
• development of web algebra and query
language
• Maintenance of Web data
• Development of knowledge discovery and
web mining tools
• Web warehouse
11
WHOWEDA - What?
• WareHouse Of Web Data
–
–
–
–
–
–
–
Subject - oriented
Integrated
Temporal
Granularity - Lower, higher
Some summary
Not updatable
Alternative information sources
12
What is a Web Warehouse?
• Subject-oriented, integrated, time-variant,
non-volatile repository of web data for
direct querying and analysis for some sort
of decision making
• A process whereby organizations or
individuals extract value from their Web
informational assets through the use of
special stores called web warehouses
13
WHOWEDA!
www.cais.ntu.edu.sg:8000/~whoweda
• A WareHouse Of WEb DAta
• Web Information Coupling Model (WICM)
– Web Objects
– Web Schema
• Web Information Coupling Algebra
• Web Information Maintenance
• Web Mining and Knowledge discovery
14
WWW
User
Warehouse
Concept
Mart
Web Information
Mining System
Web Querying
& Analysis Component
Web
Information
Coupling
System
Web
Mart
Web
Mart
Web Information
Maintenance System
Web
Mart
Web
Warehouse
Web
Mart
User
WWW
Web Query & Display
Warehouse
Concept
Mart
Global Ranking
Data Visualization
Global Web
Manipulation
Global Web
Coupling
Pre processing
Schema Tightness
Web
Warehouse
Web Select
Web Project
Local Web Coupling
Local Ranking
Web Join
Local Web
Manipulation
Data Visualization
Web Union
Web Intersection
Schema Tightness
Schema Search
Schema Match
Web Objects
•
•
•
•
•
•
Node - url, title, format, size, date, text
Link - source-url, target-url, label, link-type
Web tuple
Web table
Web schema
Web database
17
Web Schema
•
•
•
•
•
Metadata in the warehouse
Structural ‘summary’ of web table
Information Coupling using a Query graph
Query graph ->Web schema
directed graph represented by Ordered 4tuple:
–
–
–
–
Set of node variables
Set of link variables
Connectivities
Predicates
18
19
Brief Organization of Information Space's Web Site
Information Headline
Square's
article 1
homepage
Headline
article n
News@
TCS
News
specials
Airport
info
(List of
video files)
List of
links to
local news
List of
links to
world
news
Local news
1
Local news
k
World news
1
World news
t
20
e
x
target_url
CONTAINS
"article”
f
y
url contains
“headlines”
g
z
target_URL
CONTAINS
"newshub/specia
ls"
label
CONTAINS
"Local News"
h
url
CONTAINS
"local"
w
label
CONTAINS
"World News"
url
CONTAINS
"world" 21
Information
Square's
homepage
Headline article
List of links to
1
local news
Local news 1
News
specials
World news 1
List of links to
world news
22
Schema- example
• Node variables:
Xn =
• Link variable:
Xl =
• Connectivities:
C= {
x<fg->z and x<fh->w }
{ x, y, z, w }
{ e, f, g }
x<e>y and
– The symbol represents an anonymous node
variable, a node variable not restricted by any
predicate.
23
• Predicates
• P={x.url=”http://www.mediacity.com.sg/isquare”,
• y.url CONTAINS “headlines”
• e.target_url CONTAINS "article",
• f.target.url CONTAINS "newshub/specials",
• g.label CONTAINS "Local News",
• z.url CONTAINS "local",
• h.label CONTAINS "World News",
• w.url CONTAINS "world" }
24
Query Graph - Example 1
• Query graph - same as schema except that it
has one more parameter to control the
results returned.
• Informally, it is directed connected graph
consists of nodes, links and keywords
imposed on them.
• Produce a list of diseases with their
symptoms, evaluation procedures and
treatment starting from the web site at
http://www.panacea.org/
25
• Web table Diseases
Treatment list
q
g Treatment
http://www.panacea.org/
Issues
y
x
Symptoms list
f
Symptoms
z
List of Diseases
e
Evaluation
Evaluation
w
p
q1
g1
Treatment list
Treatment
http://www.panacea.org/
Issues
x0
y1
AIDS
List of Diseases
e1
f1
Symptoms
z1
Symptoms
list
Evaluation
Evaluation
w1
Elisa Test
p2
Example 2
• Produce a list of drugs, and their uses and
side effects starting from the web site at
http://www.panacea.org/
• Web table Drugs
28
http://www.panacea.org/
a
Drug
list
Side
effects
Issues
c
b
r
Side effects
List of Diseases
Use
s
k
Uses
d
http://www.panacea.org/
a0
List of
Diseases
AIDS
Drug
list
b1
Side effects
of Indavir
Issues
Indavir
c1
r1
Side effects
Use
s1
k1
Uses of
Indavir
d1
Query Language
• Starting from the CS deptt home page at
NTU, find all documents that are linked
through paths of length less than two
containing only local links, and have in
their text “database”.
31
• COUPLE WEBTABLE W FROM WWW
SUCH THAT NODE I, j IN WWW and LINK
e,f,g IN WWW AND I<e|f,g>j WHERE
I.url EQUALS “http://www.ntu.edu.sg”
AND j.text CONTAINS “database” AND
f.link-type EQUALS local AND g.link-type
EQUALS local;
32
Web Algebra
• Formal foundation of data representation
and manipulation in a web warehouse
• Web operators:
–
–
–
–
Information access operator
Information manipulation operators
Web schema operators
Data visualization operators
33
Information access operator
• Global Web Coupling
34
Information Manipulation
- Web select
–
–
–
–
–
–
–
Web project
Local web coupling
Web join
Web cartesian product
Web union
Web intersect
Local Web coupling
35
Web Select
• Extracts web tuples from web tables
satisfying certain conditions on node and
link variables and on connectivities
• Input is select Schema
• Output is a web table satisfying the select
schema
36
• select W1 tuples that contain world news
about Indonesia since May 1 1998.
•
sMsW1
where
Ms =
< Xsn, Xsl, Cs, Ps >,
Xsn = { x, w }, Xsl = { },
Cs = { },
Ps = { x.date > "1May1998", w.text
CONTAINS “Indonesia”}
37
• Xn’ = { x, y, z, w },Xl’ =
{ e, f, g }
• C’ = { x<e>y and x<fg->z and x<fh->w }
• P’={x.url=”http://www.mediacity.com.sg/isquare”, x.date > "1May1998",
• e.target_url CONTAINS "article",
f.target.url
CONTAINS
"newshub/specials",
• g.label CONTAINS "Local News",
• z.url CONTAINS "local",
• h.label CONTAINS "World News",
• w.url CONTAINS "world",
• w.text CONTAINS “Indonesia” }
38
Web Information Coupling
System
• A database system to couple related web
information
• Global web Coupling and Local Web
Coupling
39
Global Coupling - Information
Access
• To integrate data from the Web
• To create historical data
• To couple related information from the
WWW satisfying a query graph
• Operator to create web tables
• From web with no schema to web table with
web schema
40
Why local web coupling?
• Directly querying the WWW to gather these
information is an expensive and repetitive
affair
• Web documents containing similar
information can reside in different web
tables in a web warehouse
• A mechanism to gather these similar
information by additional manipulation of
the materialized web tables
41
Local Web Couple operator
• Two web tuples wi and w j can be
coupled if there exist atleast one pair of
nodes from wi and w j which contains
similar information.
42
Local Web Couple operator
• The web couple operator is basically a
web cartesian product followed by web
select:
W   Wi  W j
W  s W 
• We denote web couple by the symbol:
W  Wi  Wj
43
Web Coupling
44
•
•
•
•
M2 = < Xn”, Xl”, C”,P” > for W2
Xn” = { s, t, u}, Xl” = { k, l, m, n },
C” =
{ s<kl>t and s<mn>u },
P”{s.url=
“http://www.asia1.com.sg/straitstimes/”,
•
k.label = “REGION”,
• l.target_url=
“http://www.asia1.com.sg/straitstimes/page
s/sea*.html”, m.label = “WORLD”,
• n.target_url=“http://www.asia1.com.sg/stra
itstimes/pages/wrld*.html”}
45
• W1 qq W2 where
• q=
(x.date=s.date)
&
CONTAINS
“Indonesia”)
&
CONTAINS “Indonesia”)
(w.text
(t.text
46
• Xn* = { x, y, z, w, s, t, u }, Xl* = { e, f, g, k, l, m,
n }, C*= { x<e>y and x<fg->z and x<fh->w and
s<kl>t and s<mn>u }
• P* = { x.url=”http://www.mediacity.com.sg/isquare”, e.target_url CONTAINS "article",
• f.target.url CONTAINS "newshub/specials",
• g.label CONTAINS "Local News",
• z.url CONTAINS "local",
• h.label CONTAINS "World News",
• w.url CONTAINS "world",
• s.url = “http://www.asia1.com.sg/straitstimes/”,
47
• k.label = “REGION”, l.target_url =
“http://www.asia1.com.sg/straitstimes/page
s/sea*.html”,
• m.label = “WORLD”,
• n.target_url =
“http://www.asia1.com.sg/straitstimes/page
s/wrld*.html”,
• x.date = s.date,
• w.text CONTAINS “Indonesia”,
• t.text CONTAINS “Indonesia"}
48
Local Web Coupling
• Initiated explicitly by the user
• User provides the pair of node variables and
the keyword set based on which coupling is
to be performed
• Coupling nodes in each pair of web tuples
in the input web tables must satisfy one of
the coupling conditions
49
Construction of coupled table
• First perform a web cartesian product on the
two web tables
• For each web tuple in the resultant web
table
– the specified instances of node variables are
inspected to determine whether the web
tuple satisfy coupling compatibility
condition(s)
50
Construction of coupled table
– If a pair of nodes satisfy none of the
conditions, the corresponding web tuple is
rejected
– Otherwise, the web tuple is stored in a
separate web table
51
Types of web coupling
• System driven web coupling: In this
case the system to decide which are the
node variables to be coupled (coupling
nodes). If atleast a pair of coupling
nodes cannot be identified then the web
tables cannot be coupled.
52
Types of web coupling
• User driven web coupling: In this case
the user decides which are the node
variables to be coupled (coupling
nodes).
• Coupling is performed only on those
user specified node variable(s).
53
Types of web coupling
• Attribute driven web coupling: In this
case the user specifies the coupling
attributes.
• Coupling is performed only on those
user specified coupling attribute(s).
54
Attribute driven web
coupling
COUPLE TABLE3
FROM TABLE1 AND TABLE 2
ON ATTRIBUTE “TEXT”
AT SCHEMA/TUPLE(optional)
55
Types of web coupling
• Value driven web coupling: In this
case the user specifies the values of the
attributes of the nodes on which
coupling should be performed.
• Coupling is performed only on those
user specified attribute values.
56
Value driven web coupling
COUPLE TABLE3
FROM TABLE1 AND TABLE 2
ON VALUE “Software Agents”
AT SCHEMA/TUPLE(optional)
57
Schema level web coupling
• We inspect the schemas to decide
whether the two web tables can be
coupled.
• If coupling conditions cannot be
identified then the two web tables
cannot be coupled.
• We do not inspect the web tuples in the
web table.
• Number of web tuples coupled will be
58
n*m.
Tuple level web coupling
• We inspect the web tuples of the two
input web tables to identify nodes with
similar information.
• The number of web tuples in the
coupled web table <=n*m
59
Why two levels?
• A schema does not capture all the
information of the web documents in a
web table; not always possible to
identify coupling condition by inspecting
the schemas.
• possible to find existence of coupling
nodes which are not defined in the
schemas.
60
Why two levels?
• Tuple level coupling gives us a mean to
correlate web documents containing
similar information from the web tables
(that cannot be identified from their
schemas) at the expense of additional
processing.
61
Join Processing in Web
Databases
62
Web Join
• Concatenate tuples based on identical nodes
or documents
• Input are two web tables and their schemas
• Output is a joined table
• Types
– Pi-web join, theta-web join, outer joins, web
composition, semi web join
63
Web Join
• Used for combining related data from
various web tables
• Mechanism to detect changes
• Mechanism to find alternative web
document in case of “Document Not
Found” error
64
Web Join Operator
• Information manipulation operator
• Manipulate information residing in a web
database to derive additional information
• Harness useful, composite information from
two web tables
• Capitalize on the reuse of retrieved data
from the WWW in order to reduce
execution time of queries
65
Joinable Nodes
• Node variables participating in the web join
process
• Expressed as a pair
• Each node in the pair should have identical
URLs
66
Web Join
• Combine two web tables by concatenating a
web tuple of one web table with a web tuple
of other web table whenever there exist
joinable nodes
• Joinable nodes are identified from the
schemas of the two web tables
• URLs of the joinable nodes are identical
67
Treatment list
q
g Treatment
http://www.panacea.org/
List of
Diseases
f
Symptoms
y
x
Symptoms list
Issues
e
z
Evaluation
Evaluation
Drug
list
w
p
Issues
b
c
r
Side effects
s
Use
k
Uses
d
Side
effects
AIDS treatment
q1
g1
Symptoms
of AIDS
http://www.panacea.org/
y1
x0
f1
z1
AIDS
e1
AIDS
w1
b1
c1
Elisa Test
r1
Indavir
s1
k1
Uses of
Indavir
Evaluation
p2
d1
Side effects
of Indavir
Join Existence
• Given two web tables, we determine if these
two web tables are joinable
• Inspect the schemas of the web tables
• Satisfy joinability conditions based on:
–
–
–
–
node predicates
link predicates
node and link predicates
locus of a node relative to a joinable node
70
Join Construction
• To construct a joined schema, we construct:
–
–
–
–
node set
link set
connectivity set
predicate set
• Construction of joined table
– Concatenating the web tuples of the two input
tables over the joinable nodes
71
Web Bags
•
•
•
•
Existence of identical web tuples.
Created due to web project operation.
Structure based mining
Used for discovering
– Visible nodes
– Luminous nodes
– Luminous paths
72
Definitions
• Visibility of a web document or node D in a
web table W measures the number of
different web documents in W that have
links to D
• Luminosity - Reverse of visibility, the
number of other distinct documents that are
linked from D
• Luminous paths - a set of inter-linked
nodes which occurs number of times in a
web table
73
Steps to find visible nodes
• Input: Web table W, node variable x,
visibility threshold v
• Output: Set of visible nodes
• Create a web table from W where each web
tuple contains distinct instances of node x
and the preceeding node which is linked to
x
• Eliminate the nodes linked to x in each
tuple of the web table using web project
74
Steps to find visible nodes
• Input: Web table W, node variable x,
visibility threshold v
• Output: Set of visible nodes
• Create a web table from W where each web
tuple contains distinct instances of node x
and the preceeding node which is linked to
x
• Eliminate the nodes linked to x in each
tuple of the web table using web project
75
Steps to find visible nodes
• Check if the collection of web tuples of
node x thus created is a web bag by
comparing their URLs
• Create multiplets for each collection of
identical nodes
• For each multiplet calculate the node
visibility
• Determine the multiplets with node
visibility greater than the threshold
• Create the visible node set
76
Steps to find luminous nodes
• Input: Web table W, node variable x,
luminosity threshold l
• Output: Set of luminous nodes
• Steps are similar to that of visible node
discovery
• We consider the nodes linked from x in
place of nodes linked to x
77
Steps to find luminous nodes
• Input: Web table W, node variable x,
luminosity threshold l
• Output: Set of luminous nodes
• Steps are similar to that of visible node
discovery
• We consider the nodes linked from x in
place of nodes linked to x
78
Steps to find luminous paths
• Create the collection of multiplets
• Compute path luminosity for each
multiplet
• If the path luminosity value of a multiplet
is greater than or equal to threshold then a
path in the multiplet is a luminous path
• Otherwise, we create a collection of linear
web tuples from the above collection of web
79
tuples
Steps to find luminous paths
• This is to identify if there exist a subset of
inter-linked nodes between x and y that are
luminous paths
• We repeat the procedure to compute path
luminosity for these set of inter-linked
nodes
80
Web Schema
Cancer
http://www.panacea.org/
e
x
Diseases
y
f
Cancer
z
http://www.panacea.org/
Diseases
x0
y0
e0
x0
x0
Diseases
e0
Diseases
e0
y0
y0
Cancer
f0
Cancer
f0
Cancer
f0
z1
http://www.cancer.org/desc.html
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z2
Cancer
Cancer
x0
x0
Diseases
e0
Diseases
e0
y0
y0
f0
Cancer
f0
Cancer
Web Table
z1
http://www.cancer.org/desc.html
Cancer
z4
Projected schema
Cancer
z
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z2
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z4
Web Table after eliminating x and y
Projected schema
Cancer
http://www.panacea.org/
e
x
Diseases
y
z
http://www.panacea.org/
x0
Diseases
Cancer
y0
z1
http://www.panacea.org/
x0
Diseases
Cancer
y0
z1
http://www.panacea.org/
x0
Diseases
http://www.cancer.org/desc.html
http://www.cancer.org/desc.html
Cancer
y0
z2
http://www.disease.com/cancer/skin.htm
http://www.panacea.org/
x0
Diseases
Cancer
y0
Diseases
http://www.cancer.org/desc.html
http://www.jhu.edu/medical/research/cancer.htm
http://www.panacea.org/
x0
z1
y0
Web Bag
z4
Cancer
After removal of identical tuples
http://www.panacea.org/
x0
Diseases
Cancer
y0
z2
http://www.disease.com/cancer/skin.htm
http://www.panacea.org/
x0
Diseases
Cancer
y0
http://www.panacea.org/
x0
Diseases
z1
Cancer
y0
z4
http://www.cancer.org/desc.html
http://www.jhu.edu/medical/research/cancer.htm
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z2
http://www.disease.com/cancer/skin.htm
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z4
http://www.jhu.edu/medical/research/cancer.htm
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z2
http://www.disease.com/cancer/skin.htm
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z4
http://www.jhu.edu/medical/research/cancer.htm
Visible Nodes
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z2
http://www.disease.com/cancer/skin.htm
Cancer
z1
http://www.cancer.org/desc.html
Cancer
z4
http://www.jhu.edu/medical/research/cancer.htm
Luminous Paths
More Operators . . .
• Web schema operators:
– Schema tightness operator, Schema match
operator, Schema search operator
• Data visualization operators:
– Ranking operators (Global & Local), Web Nest,
Web Un-nest, Web Coalesce, Web Expand, Web
Pack, Web Unpack, Web Sort
92
Partitioning of web tables
• Partitioning web tables
–
–
–
–
restructured easily
indexed easily
monitored easily
reorganized easily
• By
– time
• schema tree structure
• keywords
93
Warehouse Concept Mart
(WCMart)
•
•
•
•
Subject oriented
Concept generation.
Manually -> Autonomous.
Used for:
– Ranking tuples
– Global web coupling
– Content based mining
94
Mining in Web Warehouse
• Web Structure Mining
• Web Content Mining
• Web usage Mining
95
Web Data Refinement
• Improve web schema - schema tightness
operator
• Partition web tables based on content and
structure
96
Partitioning of web tables
• Partitioning web tables
–
–
–
–
restructured easily
indexed easily
monitored easily
reorganized easily
• By
– time
• schema tree structure
• keywords
97
WWW
Warehouse
Concept
Mart
Lower-level
Granularity
Higher level
Granularity
Web Information
Manipulation
Operators
User
Warehouse
Concept
Mart
Web Information
Mining System
WWW
Web Querying
& Analysis Component
Web
Information
Coupling
System
Web
Warehouse
What type of information can be
summarized?
• Structural
• Content-based
–
–
–
–
time-variant analysis
snapshot analysis
compare one period with another
trend analysis
101
Structural Summarization
• Most volatile documents
– Sites which change frequently
– Rate of change over time
– a pointer to directly access documents which
change rapidly
• Most visible nodes, luminous nodes,
luminous paths
– Change with time
– Decrease or increase - Analyze the reason
102
Content Summarization
• What can be aggregrated in a web page?
– Number of links with identical labels
– Number of keywords
• Changes in content with time
– Comparing the changes
• Open question
• XML will improve the ability of analysis of
web data
103
Summary
• Current status:
– Mechanism for accessing and manipulating
web information in WHOWEDA
– Implementing various web operators and query
language
• Future research
– What types of information can be summarized?
– What types of knowledge can be mined?
– Refine web warehouse architecture
• www.cais.ntu.edu.sg:8000/~whoweda
104

whoweda-talk

Transcript whoweda-talk

Directory