WEB Mining Presentation
Ernestina Menasalvas
[email protected]
Facultad de Informatica
Universidad Politecnica de Madrid
May 2004
Introduction and motivation
• Internet as a communication channel.
• Technology is needed to develop new services, security, infrastructure, and analysis.
• Web Mining analyzes usage patterns so that the services respond to user needs.
• Most of the web mining projects developed so far have not taken into account the context in which they were developed:
– Competitive society
– Success criteria depend on both:
• User satisfaction
• Increase in sponsors' benefit
• The gap between technology development on the web and business factors is increasing and generates, as a side effect, a separation between what technologists develop and what companies need.
• Knowing that the problem exists is just the beginning…
• Technological projects have to be integrated into the global strategy of the company.
The problem
• Innovative ideas in e-commerce are vaguely defined, so they
lose focus and precision
• New technologies are being applied, consuming resources but
without appropriate financial or economic benefits
• The growth of web activity and its participation in every daily activity
(commercial, educational, news, ...) is not being matched by a
corresponding number of services
• Services are considered insufficient.
• Thus, site sponsors have to improve the services offered to satisfy
the increasing demand.
• On the other hand, the growth in offers will bring a growth in
demand, which will lead consumers to ask for better
services.
• Web Mining projects have to be planned as one more project in
the global strategy of the company
Web Site personalization
• Optimization and personalization of the user's web experience is crucial for
attracting and retaining web-based commerce customers.
• Try to maintain the one-to-one relationship.
• Identifying future behaviour is crucial for the site to act proactively.
• Information about user experience is captured in clickstream logs:
pages viewed, timing, and sequence.
• Solutions given:
– Clustering of users
– Clustering of pages
– Most visited path
– Recommender systems
– …
• The question:
– How to deploy?
– How has the method been evaluated?
– How does it help the company?
– How does it evolve over time?
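As an illustration of how clickstream records (pages viewed, timing, sequence) can be turned into sessions, here is a minimal Python sketch. The record layout `(visitor, timestamp, page)` and the 30-minute inactivity timeout are assumptions made for the example, not something stated in the slides.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumed inactivity threshold, in seconds


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group (visitor, timestamp, page) records into per-visitor sessions.

    A new session starts when the gap to the previous click of the same
    visitor exceeds `timeout`. Returns a list of sessions, each a list of
    (timestamp, page) tuples in time order.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, page in clicks:
        by_visitor[visitor].append((ts, page))

    sessions = []
    for visitor in sorted(by_visitor):
        current = []
        for ts, page in sorted(by_visitor[visitor]):
            # Close the running session if the visitor was idle too long.
            if current and ts - current[-1][0] > timeout:
                sessions.append(current)
                current = []
            current.append((ts, page))
        sessions.append(current)
    return sessions


# Hypothetical log records: (visitor id, seconds since start, page).
clicks = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 40, "/board"),
    ("10.0.0.2", 10, "/home"),
    ("10.0.0.1", 5000, "/exams"),
]
sessions = sessionize(clicks)
```

Identifying visitors by IP address, as in the sample data, is only one possible choice; cookies or logins would slot into the same structure.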
Web Mining project evaluation
• The criteria used to evaluate the success of a site do not take external
(commercial) aspects into account.
• Aspects such as increased sales volume, fraud reduction,
customer retention, and competitive prices are not explicitly tackled.
• Success of web sites is measured in terms of efficiency and quality:
– Efficiency: number of pages accessed in one session, length of the
session, and actions performed
– Quality: response time of the site to user requests, page accessibility,
visitors per page, …
• Company success is evaluated in terms of:
– Income, outcomes, expenses
– ROI, market presence
• The differences between the criteria used to evaluate the success of any project
in the enterprise and those used for a web project are at
the root of web mining's incomplete success.
• Site sponsors do not evaluate commercial and financial aspects and
rely only on vague commercial notions.
• Success in terms of use, structure, and content has to be linked to
the achievement of company business goals.
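The efficiency measures above (pages per session, session length) can be reported next to a business-side indicator. The following Python sketch assumes a session is a list of `(timestamp_seconds, page)` tuples and that the company supplies a set of `goal_pages`; both names are hypothetical.

```python
def session_report(session, goal_pages):
    """Combine usage metrics with a simple business-goal indicator.

    session: list of (timestamp_seconds, page) tuples in time order.
    goal_pages: pages the company ties to a business goal (assumption).
    """
    pages = [page for _, page in session]
    return {
        # Efficiency: number of pages accessed in the session.
        "pages_accessed": len(pages),
        # Efficiency: session length, from first to last click.
        "length_seconds": session[-1][0] - session[0][0] if session else 0,
        # Business link: clicks that landed on a goal page.
        "goal_hits": sum(1 for page in pages if page in goal_pages),
    }


report = session_report(
    [(0, "/home"), (40, "/catalog"), (95, "/checkout")],
    goal_pages={"/checkout"},
)
```

Counting hits on goal pages is only one possible way to tie usage to business goals; revenue or retention figures would need data from outside the web log.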
Web Mining project management
• An enterprise is a system designed to fulfil certain goals by
integrating different resources.
• Subsystems are at the same time interrelated and
interdependent
• When the company uses the Web as a channel, all the services,
infrastructure, …, have to be seen as one of the subsystems.
• The success of a solution in the web subsystem has to be related to
the behaviour of the rest of the subsystems
• Web Mining projects are concerned with the Web subsystem
• So a web mining project is not only an IT problem
• Apply a project management methodology to control the
process: a project manager is needed -> a different role from the
data miner
• Identify Data Mining problems.
• For each of them apply CRISP-DM
Web Mining Project management (cont)
• To properly deal with a data mining project we need explicit information about the
company:
– Structure of the company (departments, sections, channels, …)
– Goals of the company and success criteria (both at the higher level and at the
department level)
• Company environment, identify:
– Resources, constraints, and any factor that can determine the goal analysis and the
development of a web project
– Web project goals and their relationship with the goals of the company
• To evaluate whether the web mining project results contribute to the fulfilment of the company
goals:
– The web site is not usually the end but the means.
– It is one of the channels that the company uses to achieve its goals.
– So for a site to be considered successful, the activities
carried out through the site must generate value for the company
• Traditional approaches only analyze the site from the user perspective, but the
actions of the users have to generate value for the company
• It is a CRM project
• Web Project plan generation
CRM project – the three legs
[Figure: CRM architecture diagram with three legs.
– Operational CRM: Back Office (ERP/ERM, Supply Chain Mgmt., Legacy Systems, Order Manag., Order Prom.), Front Office (Service Automation, Marketing Automation, Sales Automation), Mobile Office (Mobile Sales, Field Service)
– Analytical CRM: Data Warehouse, Customer Activity, Customers, Products, Vertical Apps., Marketing Automation, Category Mgmt., Campaign Mgmt.
– Collaborative CRM: Customer Interaction through Voice (IVR, ACD), Conferencing, E-mail, Web Conferencing, Response Management, Fax, Letter, Direct Interaction
– Closed-Loop Processing (EAI Toolkits, Embedded/Mobile Agents)]
Data Mining
[Figure: pyramid of layers with increasing potential to support business decisions, from top to bottom:
– Making Decisions
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Analysis, Querying and Reporting
– Data Warehouses / Data Marts: OLAP, MDA
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Side labels (relationship with): End User, Business Analyst, Data Analyst, DBA]
Fact Gap
“Fact Gap”: difference between the available
information and the ability to make decisions based on
this information. (Gartner Group)
Data Mining gives the intelligence
• Databases give the data.
• But intelligence is needed to explore the data
and find the patterns, rules, and ideas that explain
what is going on and predict what will happen
• Techniques and tools are needed to add this
intelligence to data in order to extract the
maximum benefit from data.
• But tools alone (nowadays) do not provide the
intelligence; it has to be provided by
EXPERTS and translated into the data for
better understanding
Data warehouses and databases are the support
Data Mining Standard process model: CRISP-DM
[Figure: CRISP-DM cycle with the phases:
– Problem Understanding
– Data Understanding
– Data Preparation
– Modeling
– Evaluation
– Deployment]
Building the bridge
• In order to provide users with the most
appropriate solution, the data to be analyzed
have to be enriched with business information
• Business problems have to be translated into
data mining problems
• Results have to be understandable not only by
data mining experts but also by end users
• The semantics underlying the data mining solution
have to be settled
Deeper analysis of Personalization
• What is personalization?
• Observe user-web page interactions to identify patterns that:
– indicate high-level user activity,
– anticipate future user activity,
– make it possible to act proactively
• What is going to be personalized?
– The site: this means pages adapted to the user's behaviour or
pattern
• Why is personalization needed?
– To improve the site performance
– The web is just another channel
– Site performance has to do with advancing the goals of the
company
• Who is the user?
– Navigator
– Customer
Web Data to be analyzed
• In any web mining problem we have data related to:
– Pages
– Navigators and navigation
– Customers and their transactions
• Web logs are just the beginning
• Not only the data have to be taken into account but also all
the circumstances under which the data were
collected:
• Environment
– General
– Organization-related
– Customer-related
Environment
• Affects, both directly and indirectly, the way
activities occur. Among the factors to take
into account:
– Legal conditions
– Technological conditions
– Demography
– Ecological conditions (weather, transport,
communications)
– Cultural and social conditions
– Geographical situation
• Take into account the location of the site, of
the navigator, …
Information to be added
• Departments:
– The same concept can have different meanings depending on the department
– A product for marketing is not the same as for production
• Products, services:
– Data per se of the object: size, color, …
– Data relevant for the company: profit margin, top ten, …
– How it is presented on the web
• People, consumers in general:
– Static data: gender, demographic information (varies over time, but at a particular
moment it is static)
– Roles: …
– Behavior with the company being analyzed: number and kind of transactions he/she
performs
– Behavioural data related to the environment (economy, legal constraints, climate, …)
• Navigators:
– Web Log: location (IP), time, browser, …
– Behaviour: comparison with the “normal”, if any, to discover mood, different location,
…
• Dates:
– A date by itself has no meaning
– Legal and fiscal periods, holidays, weekends
– Opening, closure, …
Data enrichment
• There is no method, no model to follow. It is more of an art
• It comes only with experience
• Projects for the same domain share the enrichment:
– A model could be established
– Evaluate whether the data are appropriate to mine
– Evaluate the kind of patterns that can be obtained
– Evaluate whether a certain pattern cannot be obtained
• Metadata is needed about the data
– Meaning for the business of each value, attribute, page, action, …
• Metadata for each attribute has to include semantics:
– Meaning: group according to it: demographical, behavioural, environmental,
social, cultural
– Business value
– Circumstances
– Constraints
– Relationship with other concepts
• Ontology of concepts ???
• Integrate metadata so the mining activity deals with them.
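One possible way to attach the semantics listed above to each attribute is a small metadata record. The Python sketch below is only an illustration: the class name `AttributeMetadata`, its field names, and the sample catalog entries are hypothetical and not part of the presentation.

```python
from dataclasses import dataclass, field

# Meaning groups named on the slide.
MEANING_GROUPS = {"demographical", "behavioural", "environmental", "social", "cultural"}


@dataclass
class AttributeMetadata:
    """Semantic description of one attribute used in mining (hypothetical layout)."""
    name: str
    meaning: str                 # one of MEANING_GROUPS
    business_value: str          # why the attribute matters to the company
    circumstances: str = ""      # conditions under which the data were collected
    constraints: list = field(default_factory=list)
    related_concepts: list = field(default_factory=list)

    def __post_init__(self):
        if self.meaning not in MEANING_GROUPS:
            raise ValueError(f"unknown meaning group: {self.meaning}")


def attributes_by_meaning(catalog, meaning):
    """Return the names of the attributes that belong to one meaning group."""
    return sorted(m.name for m in catalog if m.meaning == meaning)


# Illustrative entries only; a real catalog would come from the business side.
catalog = [
    AttributeMetadata("gender", "demographical", "segmentation"),
    AttributeMetadata("pages_per_session", "behavioural", "engagement",
                      related_concepts=["session"]),
    AttributeMetadata("holiday", "environmental", "seasonality",
                      constraints=["depends on country"]),
]
behavioural = attributes_by_meaning(catalog, "behavioural")
```

Keeping the metadata in a structure like this lets the mining step filter or group attributes by meaning before any pattern is extracted.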
Data Modelling and deployment
• Once the data are enriched, the extracted patterns can
be interpreted according to:
– User profiles
– Session value (according to certain goals)
– Period of the day
• The solution has to be deployed and integrated into
the site structure.
• Patterns evolve over time as new data
arrive
• Models have to be refined
• Establish the basis for the model to be refined
without a decrease in performance
Web Mining infrastructure
[Figure: architecture diagram. The user's HTTP client exchanges HTTP requests and responses with an interface agent placed in front of the original website. Three layers are shown:
– DECISION LAYER: user agent, action plan
– CRM SERVICES PROVIDER LAYER: planning agents (Agent VWi), user model, services, information, operational plans
– SEMANTIC LAYER: agents, models, web logs]
Case-study: act according to the value of the current session
Patterns to help:
– Predict user behavior based on current behavior, not identity.
– Abstract user behavior with varying degrees of granularity =>
subsessions.
– Estimate the value of the session in order to act accordingly.
Subsessions capture/approximate user state
information.
Key concept: frequent behavior paths.
Markov model to predict the next set of pages and
behaviour
Webhouse to store information about users
Modify APACHE: pop-ups and precaching
Case-study
1. Find behavior rules
Partial tree: define break points as decision
points in the path. Use them
to create rules.
Knowing PIND allows us to
predict a set of pages to
follow...
Behaviour rules:
– Main page, Notice board => Exams
– Main page, Notice board => Lab assignments, Support material for Assignment 1
– Main page, Notice board => Lab assignments, Support material for Assignment 2
[Figure: partial navigation tree with break points marked on PIND and PDEP pages. Nodes: Main page, Notice board, Exams, Lab assignments, Support material for Assignment 1, Support material for Assignment 2. Annotations: Decision page, Target page. Numeric values on the tree: -3, 2, 3, 4, 5.]
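The idea of turning break points into rules can be sketched as follows. This is a minimal Python illustration, assuming each session is a list of page names and that the set of break point pages is already known; the function name and the English page names are illustrative, and the slides do not specify how the rules are actually mined.

```python
from collections import Counter, defaultdict


def behaviour_rules(sessions, break_points):
    """Build 'path up to a break point => pages that follow' counts.

    sessions: list of page-name lists.
    break_points: set of pages treated as decision points (assumption).
    Returns {antecedent tuple: Counter of consequent tuples}.
    """
    rules = defaultdict(Counter)
    for pages in sessions:
        for i, page in enumerate(pages):
            # A rule is created only if something follows the break point.
            if page in break_points and i + 1 < len(pages):
                antecedent = tuple(pages[: i + 1])
                consequent = tuple(pages[i + 1:])
                rules[antecedent][consequent] += 1
    return rules


sessions = [
    ["Main page", "Notice board", "Exams"],
    ["Main page", "Notice board", "Lab assignments", "Support material 1"],
    ["Main page", "Notice board", "Exams"],
]
rules = behaviour_rules(sessions, break_points={"Notice board"})
most_common = rules[("Main page", "Notice board")].most_common(1)[0]
```

Here `most_common` holds the most frequent continuation after the break point together with its count, which is the kind of information a rule would expose to the site.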
2. Find Subsessions
Sessions may be described in terms of subsessions.
E.g., browse catalog, browse shipping information, browse privacy
notices, perform purchase.
Subsessions may be defined in a number of ways, according to the
desired semantics. E.g., use breakpoints.
[Figure: click path with pages labelled PIND and PDEP.]
Click-path Subsession Figure
Real-time user web page access path, with identified frequent pa
Web page access path expressed as a sequence of subsession
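Using break points is one of the ways mentioned above to define subsessions. A minimal Python sketch, assuming a session is a list of page names and that a subsession ends at each break point page (the page names and the cut-after-break-point convention are assumptions for the example):

```python
def split_into_subsessions(pages, break_points):
    """Cut a click path into subsessions, closing one after each break point."""
    subsessions, current = [], []
    for page in pages:
        current.append(page)
        if page in break_points:
            subsessions.append(current)
            current = []
    if current:
        # Trailing pages after the last break point form a final subsession.
        subsessions.append(current)
    return subsessions


path = ["catalog", "item", "shipping", "privacy", "cart", "purchase"]
subs = split_into_subsessions(path, break_points={"item", "privacy"})
```

With a different semantics (for example, cutting before the break point, or cutting on time gaps) only the condition inside the loop would change.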
3. Markov models to predict behavior and paths
[Figure: sessions (session1 to session6) related to behaviors (Behavior X, Behavior Y), break points (BK N, BK M, BK P), and departments (Dep1, Dep2, Dep3).]
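A first-order Markov model over pages is one simple way to predict what comes next. The Python sketch below estimates transition probabilities from observed sessions; the slides do not state the order of the model or its states, so first-order page-to-page transitions are an assumption, as are the function and page names.

```python
from collections import Counter, defaultdict


def fit_markov(sessions):
    """Estimate first-order transition probabilities P(next page | current page)."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for page, c in counts.items()
    }


def predict_next(model, page, k=1):
    """Return the k most probable next pages (ties broken alphabetically)."""
    probs = model.get(page, {})
    return sorted(probs, key=lambda p: (-probs[p], p))[:k]


sessions = [
    ["home", "board", "exams"],
    ["home", "board", "labs"],
    ["home", "board", "exams"],
]
model = fit_markov(sessions)
prediction = predict_next(model, "board")
```

The same structure could use subsessions or behaviors as states instead of single pages; the predicted set could then drive precaching or pop-ups.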
4. Per user analysis: average time spent in page
[Figure: chart of time (secs, scale 0 to 60) per URL (1 to 23).]
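The average time per page can be approximated from click timestamps. In this Python sketch the time on a page is taken as the gap to the user's next click, so the last page of a session has no measurable duration and is skipped; that convention and the data layout are assumptions.

```python
from collections import defaultdict


def average_time_per_page(session):
    """Average seconds spent on each URL within one user's session.

    session: list of (timestamp_seconds, url) tuples in time order.
    """
    totals, visits = defaultdict(float), defaultdict(int)
    for (ts, url), (next_ts, _) in zip(session, session[1:]):
        totals[url] += next_ts - ts
        visits[url] += 1
    return {url: totals[url] / visits[url] for url in totals}


avg = average_time_per_page([(0, "/a"), (20, "/b"), (50, "/a"), (90, "/c")])
```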
5. Online Value evolution
[Figure: line chart of session value (scale -5 to 35) against the number of traversed links (1 to 15) for three sessions: session 1, session 2, session 3.]
Benefits of the algorithm
• Makes it possible to know at any point whether the ongoing
navigation would be beneficial for the site, so that the
site can be dynamically adjusted accordingly.
• Quantifies the value of a user session while he or she
is navigating
• Brings the user-site relationship closer to real-life
relationships
• The algorithm integrates the site/department goals:
– Sends pop ups to students according to the exercises they
have already done
– Professors can establish preferences and the rules are
changed accordingly
– …
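To illustrate how the value of a session could be tracked while the user navigates, the Python sketch below keeps a running total after each traversed link. The additive scoring, the per-page values, and the threshold are all hypothetical; the slides do not describe how the value is actually computed.

```python
def value_evolution(pages, page_values, default=0):
    """Running value of a session after each traversed link (additive assumption)."""
    total, evolution = 0, []
    for page in pages:
        total += page_values.get(page, default)
        evolution.append(total)
    return evolution


def should_act(evolution, threshold):
    """Decide whether the site should react, e.g. with a pop-up (hypothetical rule)."""
    return bool(evolution) and evolution[-1] >= threshold


# Hypothetical values assigned by the site or department to each page.
PAGE_VALUES = {"exams": 5, "labs": 3, "logout": -3}
evolution = value_evolution(["home", "labs", "exams", "logout"], PAGE_VALUES)
act_now = should_act(evolution, threshold=5)
```

Because the values are supplied by the site or department, changing the preferences only means changing `PAGE_VALUES`, not the procedure.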
Conclusion
• Without proper project management:
– It is difficult to obtain significant patterns
– The results are difficult to interpret
– The potential of the process is minimized
• Site goals have to be integrated
• Algorithms alone are of no use: the best
algorithm does not always mean the best result
• The patterns have to be deployed in a proper
architecture
THANKS!
QUESTIONS???