A Proposed Methodology for e-Business Intelligence

PCI 2014
A Proposed Methodology for E-Business Intelligence Measurement Using Data Mining Techniques
Stavros Valsamidis,
Ioannis Kazanidis,
Sotirios Kontogiannis,
Alexandros Karakos
Outline

Introduction

Method

Results

Discussion

Limitations

Conclusions
Introduction (1/7)
• E-business
• Business Intelligence
• Knowledge Data Discovery
• Data Mining

Introduction (2/7)
• E-business
E-business refers to any business that uses the Internet and related technologies. E-business is the conducting of business on the Internet: not only buying and selling but also servicing customers and collaborating with business partners.
• Intelligence
Luhn defined intelligence as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal".
Introduction (3/7)
• Business Intelligence
Business Intelligence (BI) is the emerging discipline that aims at combining corporate data with textual user-generated content (UGC) to let decision-makers analyze their business based on the trends perceived from the environment.
Introduction (4/7)
• Knowledge Data Discovery
The term Knowledge Data Discovery (KDD) was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the "high-level" application of particular data mining (DM) methods.
Introduction (5/7)
• Data Mining
The main goal of data mining is the search for relationships and distinct patterns that exist in datasets but are "hidden" among the vast amount of data.
Introduction (6/7)
• Various authors have proposed indexes and metrics for the usage of web applications.
• There are no metrics specifically for measuring e-business usage in terms of BI.
• This study contributes to the area of web usage analysis for e-business intelligence by 'marrying' e-business with data mining.
• Four metrics are applied, for the first time, in the field of e-business.

Introduction (7/7)
This paper
• proposes an iterative method for designing and maintaining BI applications that reorganizes the activities and tasks normally carried out by practitioners
• is completed by a case study in the consumer goods area, aimed at proving that the adoption of a structured methodology has a positive impact on project success
Method
Logging data
Data pre-processing
Measures calculation
Data mining techniques
E-business usage assessment
Purchase records
Method – Steps (1/5)
• Logging data: logging of specific data from e-business systems
• Specifically, eleven (11) fields (request_time_event, remote_host, request_uri, remote_logname, remote_user, request_method, request_time, request_protocol, status, bytes_sent, referer, agent) and user requests for different products
• Pre-processing: the data contain noise such as URLs, emoticons and symbols (asterisks, hashes, etc.)
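The logging step can be illustrated with a minimal sketch that splits one Apache log line into named fields. It assumes the server writes the standard "combined" log format; the slides do not state the exact LogFormat directive, and the request_time_event field listed above is not part of that format, so it is not covered here. The sample line (host, URI, product id) is invented for illustration.

```python
import re

# Assumed layout: Apache "combined" format
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
LOG_PATTERN = re.compile(
    r'(?P<remote_host>\S+) (?P<remote_logname>\S+) (?P<remote_user>\S+) '
    r'\[(?P<request_time>[^\]]+)\] '
    r'"(?P<request_method>\S+) (?P<request_uri>\S+) (?P<request_protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no response body was sent
    fields["bytes_sent"] = 0 if fields["bytes_sent"] == "-" else int(fields["bytes_sent"])
    return fields


# Hypothetical log line, invented for illustration only
sample = ('192.0.2.10 - - [12/Mar/2014:10:15:32 +0200] '
          '"GET /shop/product.php?pid=105 HTTP/1.1" 200 5120 '
          '"http://example.com/shop/" "Mozilla/5.0"')
record = parse_log_line(sample)
print(record["remote_host"], record["request_uri"], record["status"])
```

Lines that do not match return None, which gives the pre-processing step a simple way to drop malformed entries.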
Method – Steps (2/5)
• Indexes, metrics and rates:

Attribute name                                   Description of the attribute
Sessions                                         The number of sessions per product viewed by users
Pages                                            The number of pages per product viewed by users
Unique pages                                     The number of unique pages per product viewed by users
Unique Pages per ProductID per Session (UPPS)    The number of unique pages per product viewed by users per session
Homogeneity                                      The homogeneity of products
Enrichment                                       The enrichment of products
Disappointment                                   The disappointment of users when they view pages of the products
Interest                                         The complement of Disappointment (1 - Disappointment)
Mean rate                                        The mean rate of the usage, combining Enrichment, Homogeneity and Interest
Score                                            The score of the product usage
Method – Steps (3/5)
• Indexes, metrics and rates:

Enrichment = 1 - (Unique pages / Total pages)

Disappointment = Sessions / Total pages

Interest = 1 - Disappointment

Homogeneity = Unique pages / Total sessions

Mean rate = (Enrichment + Homogeneity + Interest) / 3

Score = Mean rate * UPPS
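As a worked check of the formulas above, the following sketch computes the metrics for product PID105 from the results table (94 sessions, 299 pages, 12 unique pages, UPPS of 218). The function and key names are chosen here for illustration and do not come from the slides.

```python
def usage_metrics(sessions, pages, unique_pages, upps):
    """Compute the usage metrics for one product from its raw counts."""
    enrichment = 1 - unique_pages / pages
    disappointment = sessions / pages
    interest = 1 - disappointment
    homogeneity = unique_pages / sessions
    mean_rate = (enrichment + homogeneity + interest) / 3
    score = mean_rate * upps
    return {
        "enrichment": enrichment,
        "disappointment": disappointment,
        "interest": interest,
        "homogeneity": homogeneity,
        "mean_rate": mean_rate,
        "score": score,
    }


# Raw counts for PID105, taken from the results table
metrics = usage_metrics(sessions=94, pages=299, unique_pages=12, upps=218)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values agree with the PID105 row reported in the results (Enrichment 0.960, Homogeneity 0.128, Disappointment 0.314, Interest 0.686, Mean rate 0.591, Score 128.849).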
Method – Steps (4/5)
• Data mining techniques:
• Data mining techniques are applied so that relevant data can be analyzed. Classification, clustering and association rule mining are used, based on the metrics of the third step.
• In the classification step, the 1R algorithm may be applied.
• The clustering step includes product clustering, based on the Purchases attribute.
• Clustering of user visits is performed with the k-means algorithm.
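To make the classification step concrete, here is a minimal sketch of the 1R idea: for each attribute, predict the majority class for every attribute value, then keep the attribute whose one-level rule makes the fewest errors. This is an illustration only, not Weka's implementation, and the rows below are hypothetical discretized values invented for the example rather than data from the study.

```python
from collections import Counter, defaultdict


def one_r(rows, class_name):
    """Return (best_attribute, rule, errors); rule maps attribute value -> predicted class."""
    best = None
    attributes = [name for name in rows[0] if name != class_name]
    for attribute in attributes:
        # Count class labels for each value of this attribute
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attribute]][row[class_name]] += 1
        # Predict the majority class for each attribute value
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attribute]] != row[class_name])
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    return best


# Hypothetical discretized rows, invented for illustration only
rows = [
    {"Score": "high", "Sessions": "high", "Purchases": "high"},
    {"Score": "high", "Sessions": "low", "Purchases": "high"},
    {"Score": "low", "Sessions": "high", "Purchases": "low"},
    {"Score": "low", "Sessions": "low", "Purchases": "low"},
    {"Score": "low", "Sessions": "high", "Purchases": "low"},
]
attribute, rule, errors = one_r(rows, "Purchases")
print(attribute, rule, errors)
```

On this toy data the rule built on Score classifies every row correctly, while the rule built on Sessions misclassifies two rows, so Score is selected.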
Method – Steps (5/5)
• Data mining techniques:
• Association rule mining enables relationships to be found amongst attributes in databases, revealing if-then statements regarding attribute values.
• An association rule X → Y shows a close correlation among items in a database: in transactions in which X occurs, there is also a high probability of Y occurring. X and Y are named the antecedent and the consequent of the rule, respectively.
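The "high probability of Y given X" in a rule X → Y is usually quantified by two measures: support (the fraction of transactions containing the items) and confidence (the support of X and Y together divided by the support of X). The sketch below computes both; the transactions are hypothetical attribute=value sets invented for illustration, not data from the study.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(X and Y) / support(X)."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never fires, so there is no evidence for it
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support


# Hypothetical transactions, invented for illustration only
transactions = [
    {"Score=high", "Purchases=high"},
    {"Score=high", "Purchases=high"},
    {"Score=low", "Purchases=low"},
    {"Score=low", "Purchases=low"},
    {"Score=high", "Purchases=low"},
]
rule_support = support(transactions, {"Score=high", "Purchases=high"})
rule_confidence = confidence(transactions, {"Score=high"}, {"Purchases=high"})
print(rule_support, rule_confidence)
```

Here the rule Score=high → Purchases=high holds in 2 of 5 transactions (support 0.4) and in 2 of the 3 transactions where the antecedent occurs (confidence about 0.67).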
Results (1/6)
Study population and context
The data of 40 products are ranked in descending order
according to the column Score
Product ID  Sessions  Pages  Unique pages  UPPS  Enrichment  Homogeneity  Disappointment  Interest  Mean rate  Score    Purchases
PID105      94        299    12            218   0.960       0.128        0.314           0.686     0.591      128.849  58
PID35       89        339    9             182   0.973       0.101        0.263           0.737     0.604      109.930  54
PID132      158       235    8             198   0.966       0.051        0.672           0.328     0.448      88.720   53
PID36       76        219    8             134   0.963       0.105        0.347           0.653     0.574      76.903   61
PID129      78        211    7             132   0.967       0.090        0.370           0.630     0.562      74.224   48
PID125      96        166    9             136   0.946       0.094        0.578           0.422     0.487      66.242   49
PID41       101       188    9             132   0.952       0.089        0.537           0.463     0.501      66.176   57
PID66       59        148    9             109   0.939       0.153        0.399           0.601     0.564      61.515   34
PID17       55        221    12            92    0.946       0.218        0.249           0.751     0.638      58.727   38
PID111      35        144    9             81    0.938       0.257        0.243           0.757     0.651      52.693   24
Results (2/6)

Data pre-processing and calculation of the metrics and rates
The data are in ASCII form and are obtained from the Apache server log file.

Application of data mining techniques

The attributes of the table were imported into Weka in .csv format.

The attributes Product ID and Disappointment were removed: Product ID is different for each instance, and Disappointment is the complement of the Interest attribute.

All the remaining attributes were discretized.
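The slides do not say which discretization scheme or how many bins were used. As a rough illustration only, the sketch below assumes equal-width binning with a hypothetical bin count of 3, applied to the Purchases values from the results table.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    labels = []
    for value in values:
        if width == 0:
            index = 0  # all values identical: a single bin
        else:
            # The maximum value would fall just past the last bin, so clamp it
            index = min(int((value - low) / width), n_bins - 1)
        labels.append(index)
    return labels


# Purchases column from the results table, in table order
purchases = [58, 54, 53, 61, 48, 49, 57, 34, 38, 24]
bins = equal_width_bins(purchases, n_bins=3)  # n_bins=3 is an assumption
print(bins)
```

With these assumptions, bin 0 covers roughly 24 to 36 purchases, bin 1 roughly 36 to 49, and bin 2 the rest; a different scheme or bin count would give different categories.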
Results (3/6)

Classification
• In the classification step, the 1R algorithm is applied.
• The attribute Purchases is used as the class.
• The best attribute for describing the classification is Score.
Results (4/6)

Clustering
The clustering step consists of product clustering, based on the Purchases attribute, with the use of the SimpleKMeans algorithm.
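The k-means idea behind this step can be sketched in a few lines. This is not Weka's SimpleKMeans: it is a one-dimensional illustration that assumes, for simplicity, that only the Purchases values from the results table are clustered, with a hypothetical k of 2 (the slides do not state the number of clusters) and a deterministic initialization chosen here for reproducibility.

```python
def k_means_1d(values, k, max_iterations=100):
    """Cluster numbers into k groups; returns (labels, centroids)."""
    # Deterministic start: spread the initial centroids over the sorted values
    ordered = sorted(values)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [float(ordered[round(i * step)]) for i in range(k)]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Purchases column from the results table, in table order
purchases = [58, 54, 53, 61, 48, 49, 57, 34, 38, 24]
labels, centroids = k_means_1d(purchases, k=2)  # k=2 is an assumption
print(labels, centroids)
```

Under these assumptions the three products with the fewest purchases (34, 38 and 24) form one cluster and the remaining seven form the other.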
Results (5/6)

Association rule mining
• The Apriori algorithm was used to find association rules over the discretized data.
• Because of the obvious dependencies of the attributes Sessions, Pages and Unique Pages with the attributes Enrichment, Interest and Homogeneity, the latter group of attributes was removed from the data table.
• Weka shows a list of 6 rules, with the support of the antecedent and the consequent (over the total number of items) at a minimum of 0.1 and the confidence of the rule at 0.9.
Results (6/6)

Association rule mining
• Some rules are uninteresting, such as rule 1.
• Some rules are similar: they have the same elements in the antecedent and the consequent, but interchanged, such as the rule pairs (3, 4) and (5, 6).
• It is shown that purchases of the products are dependent on the scores.
Discussion (1/2)

The indication that many pages within useful paths
contribute to increased usage is fairly obvious.

The more and better the content on a site, the more a user might visit it, so administrators should add useful and helpful pages to the site.

If a site is essentially blank but customers are required to visit it every day and contribute a comment, usage will necessarily be high. On the other hand, if a very elaborate web site has rich content that is not required reading, limited usage of the site would be expected.
Discussion (2/2)

Rule 2 offers administrators considerable actionability, since they can pay more attention to the products with low values of Score and Sessions.

An increase in sessions results in more users (customers) using the e-business system.

Of course, it cannot be denied that a certain number of customers only attempt to read the product information just before making their purchases.
Limitations

The fact that only 40 products in one e-business system were investigated is a limitation of the study, especially since data mining techniques demand large datasets.

However, this was unavoidable, since the e-business system of the case study had this number of active online products.
Conclusions (1/3)

The proposed iterative method uses existing
tools and techniques in a novel way to perform
e-business systems usage analysis.

The metrics enrichment, homogeneity, disappointment and interest are used.

It incorporates clustering, classification and
association rule mining.
Conclusions (2/3)
Advantages
I. It is independent of a specific e-business system, since it is based on the Apache log files and not the e-business system itself. Thus, it can be easily implemented for every e-business system.
II. It uses indexes and metrics in order to facilitate the evaluation of each product.
III. It offers a company useful information for determining which parts of its web site to improve.
Conclusions (3/3)
I. This approach may be applied after a long period of data tracking.
II. The proposed approach may also be applied to other web applications such as e-government, e-learning, e-banking, blogs, social networks, etc.
Thank You!
Stavros Valsamidis,
Ioannis Kazanidis,
Sotirios Kontogiannis,
Alexandros Karakos
TEI of Kavala
Kavala, Greece