Mining Indirect Associations in Web Data

Download Report

Transcript Mining Indirect Associations in Web Data

Discovery of Indirect Association from Web Usage Data
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science & Engineering
University of Minnesota
[email protected]
Homepage: http://www.cs.umn.edu/~kumar
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Overview
Application/
Web Server
Web
User
Internet
Knowledge
© Vipin Kumar, PAKDD (May 2002)
Web Usage
Mining
University of Minnesota / Army High Performance Research Center
Web
usage
data
‹#›
Overview…

Web Usage Mining:
– the automatic extraction of non-trivial patterns from
Web usage data

Main techniques:
– Clustering:
 to find groups of users who share similar browsing behavior
– Classification:
 to categorize Web users according to their past access
history
– Association:
 to determine what are the set of page views often accessed
together in the same server session
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Application of Web Usage Mining
Web Usage
Mining
Personalization
Site H elper
Letizia
Web Watcher
M obasher
Analog
Krishnapuram
System
Improvement
R exford
Schecter
Ag g arwal
Site
M odification
Adaptive Sites
WebSIFT
WU M
SpeedTracer
WebLog M iner
Shahabi
Business
Intellig ence
SurfAid
Buchner
Tuzhilin
U sag e
C haracterization
Pitkow
Arlitt
M anley
Almeida
Source: J. Srivastava, R. Cooley, M. Deshpande, PN Tan, Web Usage Mining: Discovery
and Applications of Usage Patterns from Web Data, SIGKDD Explorations (2000)
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Taxonomy of Web Usage Mining
Pro ject
Ap p licatio n
Data
Fo cu s
Serv er
W eb SIFT (C TS9 9 )
Gen eral
x
Sp eed Tracer (W YB 9 8 ,C PY9 6 ) Gen eral
x
W UM (SF9 8 )
Gen eral
x
Sh ah ab i (SZAS9 7 ,ZASS9 7 )
Gen eral
Site Help er (NW 9 7 )
Perso n alizatio n
x
Letizia (Lie9 5 )
Perso n alizatio n
W eb W atch er (JFM9 7 )
Perso n alizatio n
Krish n ap u ram(NKJ9 9 )
Perso n alizatio n
x
An alo g (YJGD9 6 )
Perso n alizatio n
x
Mo b ash er (MC S9 9 )
Perso n alizatio n
x
Tu zh ilin (PT9 8 )
B u sin ess
x
Su rfAid
B u sin ess
x
B u ch n er(B M9 8 )
B u sin ess
x
W eb Tren d s,Hitlist,Accru e,etc. B u sin ess
x
W eb Lo g Min er (ZXH9 8 )
B u sin ess
x
Pag eGath er,SC ML (PE9 8 ,PE9 9 )Site Mo d ificatio n
x
Man ley (Man 9 7 )
C h aracterizatio n
x
Arlitt(AW 9 6 )
C h aracterizatio n
x
Pitk o w(PIT9 7 ,PIT9 8 )
C h aracterizatio n
x
Almeid a(AB C 9 6 )
C h aracterizatio n
x
R ex fo rd (C KR 9 8 )
Sy stem Imp ro v e.
x
Sch ech ter(SKS9 8 )
Sy stem Imp ro v e.
x
Ag g arwal(AY9 7 )
Sy stem Imp ro v e.
So u rce
Pro x y C lien t
x
x
x
x
x
x
Data
Ty p e
User
Site
Stru ctu re C o n ten t Usag e Pro file Sin g le Mu lti Sin g le Mu lti
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
Source: J. Srivastava, R. Cooley, M. Deshpande, PN Tan, Web Usage Mining: Discovery
and Applications of Usage Patterns from Web Data, SIGKDD Explorations (2000)
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Mining Association Patterns in Web Data
Input
Output
/home  /product  /product/electronics
Frequent
Itemsets
/home  /shipping  /shipping/status
Association
Rules
/home  /product  /product/offer
 /product/electronics/tv
/home  /product  /product/electronics
 /product/electronics/tv
Click-streams
© Vipin Kumar, PAKDD (May 2002)
Association
Mining
University of Minnesota / Army High Performance Research Center
Sequential
Patterns
Association
Patterns
‹#›
Mining Association Patterns in Web Data…

Provide the following information:
– What are the set of pages frequently accessed
together by Web users? (frequent itemsets)
– What page will be fetched next? (association rules)
– What are the paths frequently traversed by Web
users? (sequential patterns)

Web association patterns are useful:
–
–
–
–
To improve Web site design
To develop prefetching and Web caching policies.
To recommend related pages
To collect business intelligence about behavior of
Web users
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Association Mining is Support-based

Frequent itemset:
– combination of pages that have support greater than
a user-specified minimum threshold

Association Rule:
s,c
Support, s 
A B


Number of sessions that contain A and B
Total number of sessions
Confidence , c 
Number of sessions that contain A and B
Number of sessions that contain A
Higher support implies greater statistical significance
Support-based pruning constraints the exponential
complexity and makes the association rule computation
tractable
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Can Infrequent Patterns be Interesting?

If support between A and B is too small, then there
may be a negative correlation between A and B
– P(A,B) < P(A) P(B)
– If (A  B) has a high support, then support(A,B) will tend
to be low because:
 P(A B) = P(B) – P(A,B)
– Example: Coffee  Tea
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Approach 1: Using Negative Items


Session Id
A
A
B
B
C
C
D
D
1
1
0
0
1
1
0
0
1
2
1
0
0
1
0
1
0
1
3
1
0
0
1
1
0
0
1
4
1
0
1
0
0
1
0
1
5
1
0
0
1
0
1
1
0
Computationally expensive
Tends to produce too many negative associations
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Approach 2: Negative Itemsets

Savasere et al [1998]:
– A negative itemset is a set of items whose actual
support is significantly lower than its expected support
– Expected support can be computed using item
taxonomy
Suppose C and G are frequent:
A
F
C
B
D
G
E
© Vipin Kumar, PAKDD (May 2002)
J
H
K
sup( CG)  sup( E )  sup( J )
sup( C )  sup( G )
sup( CG)  sup( J )
Exp(sup( CJ )) 
sup( G )
sup( CG)  sup( H )
Exp(sup( CH )) 
sup( G )
Exp(sup( EJ )) 
University of Minnesota / Army High Performance Research Center
‹#›
Indirect Association
a
IF:
THEN:
M
b
a and M are frequent and highly dependent on each other
b and M are frequent and highly dependent on each other
a and b are expected to be frequent
If a and b are infrequent, it has an interesting negative association
 a and b are indirectly associated via mediator M
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Non-Sequential Indirect Association: Formulation
M
a

b
A pair of Web pages, a and b, are indirectly
associated via mediator set M, if :
– Sup({a,b}) < ts
– Sup({a}  M)  tf, Sup({b}  M)  tf
– Dep(a,M)  td, Dep(b,M)  td

where Dep(x,M) can be any “reasonable”
objective measures
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Finding Interesting Negative Associations
For all pairs of items:
With
Mediator
Frequent
FM
No
Mediator
FN
Minimum itempair support
Infrequent
IM
IN
If
IM/FM = IN/FN
mediator thresholds
© Vipin Kumar, PAKDD (May 2002)
then Indirect Association
is not surprising
University of Minnesota / Army High Performance Research Center
‹#›
Finding Interesting Negative Associations
With
Mediator
No
Mediator
FM
FN
Frequent
• IM/FM is small
• IM/IN is small
 Indirect Association
Infrequent
IN
is interesting
IM
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Finding Interesting Negative Association
1
10
Indirect Association is
interesting when
minimum itempair
support threshold is
small.
IM/FM
IM/IN
0
10
But, if threshold is too
low, very few indirect
associations are
obtained.
-1
10
-2
10
1
3
5
7
9
11 13 15 17 19 21
Minimum itempair support (ts )
© Vipin Kumar, PAKDD (May 2002)
23
25
27
29
University of Minnesota / Army High Performance Research Center
‹#›
Application: Market-Basket Analysis
Substitute items: up-sell
– Pavilion PC  17” Monitor  Pavilion multimedia PC
Competing items: competitive analysis
– Coke  Ruffles  Pepsi
Complementary items: cross-sell
– Tekken 3  Playstation Memory Card  Tomb Raider 2
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Other Applications: Information Retrieval
Identify synonyms and antonyms
Identify the different contexts of a queried word
– Useful to group together query results
Data
Mining
Web
Trade
Worker
© Vipin Kumar, PAKDD (May 2002)
Union
Gold
Soviet
University of Minnesota / Army High Performance Research Center
‹#›
Other Applications: Stock Market
Dow Jones (DJI)
% Change
5
0
Let v(t): value of time series on day t
% change (t) = (v(t) - v(t-1))/v(t-1)*100
-5
-10
11/8/99
Microsoft Corp (MSFT)
3/28/02
Event definition:
Up: % change > 2%
Down: % change < -2%
% Change
20
0
-20
11/8/99
RedHat Inc (RHAT)
3/28/02
% Change
20
Support counts:
{DJI-Down,MSFT-Down} : 26
{RHAT-Up,MSFT-Down} : 24
{DJI-Down,RHAT-Up}
:4
0
-20
11/8/99
3/28/02
Indirect Association can partition
events that are associated with the
movement of a stock price.
MSFTDown
DJI-Down
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
RHAT-Up
‹#›
Sequential Indirect Association: Formulation

A pair of Web pages, a and b, forms a
sequential indirect association via a mediator
sequence, w, if the following conditions are
satisfied:
– Support({a,b}) < ts
– Support(s1)  tf, Support(s2)  tf
– Dependence(a,w)  td,
Dependence(b,w)  td
– w does not contain a and b

(s1=aw or wa, s1=bw or wb)
discover groups of users who have different
interests but share a common traversal path
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Types of Sequential Indirect Association
Different entry points
a
Different entry and user interests
w
b
w
a
a
b
w
Different user interests
Type C (convergence)
© Vipin Kumar, PAKDD (May 2002)
Type D (divergence)
b
Type T (transitive)
University of Minnesota / Army High Performance Research Center
‹#›
Clustering vs Indirect Association

Clustering is another way to discover different groups of
users
–
A. Banerjee and J.Ghosh, Clickstream Clustering using Weighted Longest Common
Subsequences, Workshop on Web Mining (2001)
–
TW Yan, M Jacobsen, H Garcia-Molina and U Dayal, From User Access Patterns to Dynamic
Hypertext Linking, Proc of the 5th International World Wide Web Conference (1996).
–
YJ Fu, K Sandhu, MY Shih, A Generalization-Based Approach to Clustering of Web Usage
Sessions, Web Usage Analysis and User Profiling, LNAI (2001)


Clustering cannot find distinct groups of users who share
a similar traversal path since the support of the mediator is
large (mediator often contains navigation pages)
It is more likely that several indirect associations are
contained within a single cluster, rather than each indirect
association connects between two separate clusters
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Indirect Association for Web Data
Site Structure
A
C
Large 3-itemset
(minsup = 40%)
Pattern
{A,B,D}
{A,B,E}
{B,C,D}
B
Support
2
2
2
{A,B}
D
E
D
E
Web sessions
Session Id
1
2
3
4
5
Non-sequential
Indirect Association
Sequence
<A,B,C,B,D>
<A,B,E>
<B,C>
<A,B,E,B,D>
<B,D,B,C>
© Vipin Kumar, PAKDD (May 2002)
Sequential Indirect
Association
Large 3-sequence
(minsup = 40%)
Pattern
A->B->D
A->B->E
A
Support
2
2
B
D
University of Minnesota / Army High Performance Research Center
E
‹#›
Impact of Site Structure on Associations

Navigation pages (e.g. home pages and hub pages)
tend to have higher support than content pages
5
5
10
10
4
4
10
10
3
3
10
Support
Support
10
2
10
1
1
10
10
0
10
2
10
0
1
2
3
4
5
6
7
8
Number of hops away from the home page
9
10
10
1
2
3
4
5
6
7
Number of outgoing links
8
9
10
Source: U of Minnesota Computer Science department Web logs (Jan 1-31, 2001)


If threshold is too high, most of the patterns contain
only navigation pages
If threshold is too low, too many patterns!
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
How Indirect Association helps

Indirect association groups together patterns that
have similar substructures
– The common substructures (mediators) often contain
the navigation pages
Mediator
A
Navigation pages
B
D
© Vipin Kumar, PAKDD (May 2002)
E
University of Minnesota / Army High Performance Research Center
‹#›
Grouping Indirect Associations

Indirect associations can also be grouped
together into more compact structures if they
share a common mediator
A,B
A,B
D
E
D
F
E
A,B
Check degree of association
D
F
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Viewer for Non-Sequential Indirect Association
Indirect
Pairs
(dashed
line)
List of
Mediators
Frequent
Pairs (solid
line)
Currently
Selected
Mediator
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Experimental Setup

Data sources:
– UMN: University of Minnesota Web log

contains 91,443 page views and 34,526 sessions
– EC: E-commerce Web log


Contains 6664 page views and 143,604 sessions
Steps:
1. Preprocessing
– identify sessions, convert sessions into “transactions”
2. Extract frequent itemsets or sequences:
– apply Apriori algorithm to find frequent itemsets
– Apply GSP algorithm to find frequent sequences
3. Apply indirect association algorithm
4. Merge indirect associations that have common
mediator
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Experimental Results (UMN)
Prospective
graduate
students
Contact
information
Students taking CS
as minor subject
CS Graduate
student association
Students planning to
take PhD exam
Experimental Results (UMN)
Regular Seminar/
Colloquium
attendees
Contact information
Students taking CS as minor subject
Experimental Results (EC)
Experimental Results (EC)
Experimental Results (EC)
Experimental Results (EC)
Experimental Results (EC)
Experimental Results (EC)
#
a
Mediator, w
b
1
Stereo
HomeElectronicsAuto
Radar Detector
2
Hunting
HomeSporting Goods
Aerobics
3
Women’s gown set
HomeApparel
Men’s boots
4
Camcorders
HomeElectronicsVideo
13” TV and VCR combo
5
Scanner
HomeElectronicsComputer
Multimedia Computer
6
Shower curtains
DomesticsBath Shop
Towel set
7
Bedroom furniture frame
Home & Accessories  Furniture
Oak nightstand
8
Speakers
ElectronicsStereo
CD boombox
9
Men’s
HomeJewelry
Ladies
10
Cordless phone
HomeTelephones
Answering device
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
Conclusion

Indirect Association provides an alternative
approach to capture interesting infrequent
patterns
– For Web data, indirect association represents the
distinct interests of Web users who share similar
traversal path
– Such patterns cannot be easily found using standard
association and clustering techniques

Indirect Association can be used to group
together patterns into more compact structures
– Navigation pages form the mediators
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›
References
1.
PN Tan, V Kumar, J Srivastava, Indirect Association : Mining
Higher Order Dependencies in Data, In Proc of the 4th European
Conference on Principles and Practice of Knowledge Discovery in
Databases, Lyon, France, Sept 13-16, 632-637 (2000)
2.
PN Tan, V Kumar, H Kuno, Using SAS for Mining Indirect
Associations in Data, In Proc of the Western Users of SAS
Software Conference (2001)
3.
PN Tan, V Kumar, Mining Indirect Associations in Web Data, In
Proc of WebKDD 2001: Mining Log Data Across All Customer
TouchPoints, August (2001)
4.
KDD Web Mining workshops: WebKDD 1999, WebKDD 2000,
WebKDD 2001
5.
SIAM Workshop on Web Mining 2001 and 2002
© Vipin Kumar, PAKDD (May 2002)
University of Minnesota / Army High Performance Research Center
‹#›