presentation

Download Report

Transcript presentation

Mining Click-stream Data With
Statistical and Rule-based
Methods
Martin Labský, Vladimír Laš, Petr Berka
University of Economics, Prague
The Clickstream Data

3 617 171 page requests containing:
unix time; IP address; session ID; page request; referer

522 410 sessions, out of them only 203 887
with length > 1



100 000 training set
60 000 testing set
43 887 heldout set
Discovery Challenge 2005
2
Clickstream Data Preprocessing
unix time ;IP address
; session ID
; page request; referee
1074589200;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=124
1074589201;194.213.35.234;3995b2c0599f1782e2b40582823b1c94;/dp/?id=182
1074589202;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/
1074589233;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=148
1074589245;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/
1074589248;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/contacts/
1074589290;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/

;www.google.cz;
;
;www.seznam.cz;
;/dp/?id=124;
;/dp/?id=148;
; /;
;/sb/;
Sequences of page visits in each session (same
sessionID) were constructed from the www log
data


Sequences of page types [start, dp, dp, sb, sb, end]
Sequences of products
[start, 124, 148]
Discovery Challenge 2005
3
Predicting New Page in a
Sequence

Problem


Observing a sequence of pages A1A2…An-1
what will be the next page An?
Methods


Markov n-gram models
Decision rules
Discovery Challenge 2005
4
Markov N-gram Predictor (1/5)

Probability of a sequence A1A2….An
n
P( A1 A2 ... An )   P( Ai | Ai  k 1... Ai 1 )
i 1
each term (interpolated k-gram
distribution) computed as
P( Ai | Ai  k 1... Ai 1 )  0 P0 ( Ai )  1 P1 ( Ai ) 
 2 P2 ( Ai | Ai 1 )    k Pk ( Ai | Ai  k 1  Ai 1 )
Discovery Challenge 2005
5
Markov N-gram Predictor (2/5)
where
Pk ( Ai  c | Ai k 1  a Ai 1  b) 
n ( a... bc)
n ( a... b )
n(xy) is the occurrence of sequence xy in data
and
k

i 0
i
1
Discovery Challenge 2005
6
Markov N-gram Predictor (3/5)

Building model using EM algorithm
1. compute Pi(i=1,…,k) from counts of sequences
observed in the training set DTR
2. assign non-zero initial values to weights i
3. repeat
3.1 compute the probability of the holdout set using
the interpolated distribution
3.2 modify the weights
Discovery Challenge 2005
7
Markov N-gram Predictor (4/5)

Results for page types
Discovery Challenge 2005
8
Markov N-gram Predictor (5/5)

Results for product types
Discovery Challenge 2005
9
Rule Induction Algorithms (1/5)

“Classical “ Decision rules in the form
Ant => Class (p)
where Ant is a conjunction of values of input attributes,
p = n(Ant  Class)/n(Ant)

Decision rules for clickstreams are in the form
Ant => page (p)
where Ant is a sequence of pages,
p = n(Ant//page)/n(Ant)
Discovery Challenge 2005
10
Rule induction algorithms (2/5)

Set-covering algorithm
1. find a rule that covers
some positive examples
and no negative
example of a given class
(concept)
2. remove covered examples
from the training set
DTR
3. if DTR contains some
positive examples (not
covered so far) goto 1,
else end

Compositional algorithm
1. add empty rule to the rule set KB
2. repeat
2.1 find by rule specialization a rule Ant
=> Class that fulfils the user given
criteria on lengths and validity
2.2 if this rule significantly improves the
set of rules KB build so far (we test
using the chi2 test the difference
between the rule validity and the
result of classification of an example
covered by Ant) then add the rule
to KB
Discovery Challenge 2005
11
Rule induction algorithms (3/5)

Set-covering algorithm
rule specialization extends the
antecedent sequence by any
sequence member from left
the decision if the antecedent
sequence should be specialized is
made by a chi2 test
adding a rule CDX to the rule
DX changes the rule DX into a
rule meaning (D but not CD)X

Compositional algorithm
rule specialization extends the
antecedent sequence by any
sequence member from left
the decision if new rule should
be added is made by a chi2
test
Discovery Challenge 2005
12
Rule induction algorithms (4/5)
Rule-based classification

Set-covering algorithm


apply single rule
Compositional algorithm

combine conritbutions of all applicable rules using
pseudo-bayesian formula
w1  w2
w1  w2 
w1  w2  (1  w1 )  (1  w2 )
Discovery Challenge 2005
13
Rule induction algorithms (5/5)
Examples of rules

For the page types sequences




dp, sb -> sb (Ant: 5174; AntSuc: 4801; P: 93%)
ct -> end (Ant: 5502; AntSuc: 1759; P: 32%)
faq -> help (Ant: 594; AntSuc: 127; P: 21%)
For the products sequences



loud-speakers -> video (Ant: 14840, AntSuc: 3785, P: 26%)
data cables -> telephones (Ant: 2560, AntSuc: 565, P: 22%)
PC peripheries -> telephones (Ant: 8671, AntSuc: 1823, P: 21%)
Discovery Challenge 2005
14
Results of testing
Discovery Challenge 2005
15
Conclusions

Comparison of methods




Comparison of results


N-gram models as exhaustive sets of compositional rules
Pi(c|a…b) ~ a…b  c
Set covering algorithm for exhaustive non comp. rules
Compositional algorithm for non exhaustive comp. rules
N-gram comparable with set covering (slightly better),
worst results for compositional algorithm
All algorithms can be applied by web servers to
recommend relevant pages to their users, and to
identify interesting patterns in their log files.
Discovery Challenge 2005
16
Thank you for your attention.