Data Mining:
Knowledge Discovery in Databases
Peter van der Putten
ALP Group, LIACS
Pre-University College
LAPP-Top Computer Science
February 2005
Overview
• Lecture: data mining applications and internals
– Collaborative Filtering & Recommender Systems
– Decision Management Demo (optional)
– Research @LIACS
• Lab session:
– Review lab work
– Data mining projects
– Data mining project presentations
To Repeat:
What did we do in the first lecture?
• Definitions of data mining
• Data mining tasks
– Predictive data mining
– Descriptive data mining
• Algorithms for classification
• Algorithm for association rules
To Repeat:
Some working definitions….
• ‘Data Mining’ and ‘Knowledge Discovery in Databases’
(KDD) are used interchangeably
• Data mining =
– The process of discovery of interesting, meaningful and
actionable patterns hidden in large amounts of data
• Multidisciplinary field originating from artificial
intelligence, pattern recognition, statistics, machine
learning, bioinformatics, econometrics, ….
To Repeat:
Some working definitions….
• Concepts: kinds of things that can be learned
– Aim: intelligible and operational concept description
– Example: the relation between patient characteristics
and the probability to be diabetic
• Instances: the individual, independent examples of a
concept
– Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
– Example: age, weight, lab tests, microarray data etc.
• Pattern or attribute space
To Repeat:
Data mining tasks
• Predictive data mining
– Classification: classify an instance into a category
– Regression: estimate some continuous value
• Descriptive data mining
– Matching & search: finding instances similar to x
– Clustering: discovering groups of similar instances
– Association rule extraction: if a & b then c
– Summarization: summarizing group descriptions
– Link detection: finding relationships
– …
Case Data Mining in Practice:
Recommender Systems
What is a recommender system?
• Books, music etc.
– Amazon.com, BOL (nl.bol.com), Proxis.nl, CDR.nl, gnod.net,
Romanadvies.bibliotheek.nl
• Digital Video Recorders
– TIVO.com
• Movies
– IMDB.com, Movielens (http://movielens.umn.edu), reel.com,
gnod.net
• ….
• Down to recommending cafés in Utrecht
What is a recommender system?
• Recommender systems provide personalised
recommendations to users about products,
services or content based on their preferences
• Preferences are generated from feedback
– Explicit feedback: ratings, ‘I own this
book’, etc.
– Implicit feedback: browsing, buying etc.
– The attributes of the recommended object itself
are generally not used to make the recommendation
Data Mining Tasks Revisited: Search
Finding best matching instances
Every instance is a point in
pattern space. Attributes are the
dimensions of an instance, e.g.
age, weight, gender etc.
Pattern spaces may be high
dimensional (10 to thousands of
dimensions)
[Figure: instances plotted in pattern space, with axes e.g. age and weight]
Paradox
• How can we recommend objects if we don’t
know the attributes?
– What should be the dimensions?
– How can we recommend books if we don’t know or
can’t use genre, number of pages, etc.?
• Collaborative Filtering:
– Recommending objects without knowing intrinsic
attributes
– Recommend objects that are bought (viewed etc.)
together
Simple Collaborative Filtering
• Given person with a profile of items
• Find those nearest neighbor persons that have
bought similar items (matching/search)
• Recommend the products that are bought by
these nearest neighbors
• Blackboard example
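The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the blackboard example from the lecture: the profiles are invented, and the choice of Jaccard overlap as the similarity measure and of k = 2 neighbours are assumptions.

```python
def jaccard(a, b):
    """Overlap between two item sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, profiles, k=2):
    """Recommend items bought by the k users most similar to `target`."""
    mine = profiles[target]
    # Rank all other users by similarity to the target profile.
    others = [(jaccard(mine, items), user)
              for user, items in profiles.items() if user != target]
    neighbours = sorted(others, reverse=True)[:k]
    # Score each unseen item by the summed similarity of the
    # neighbours who bought it.
    scores = {}
    for sim, user in neighbours:
        for item in profiles[user] - mine:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical purchase profiles, invented for illustration.
profiles = {
    "ann":  {"book_a", "book_b", "book_c"},
    "bob":  {"book_a", "book_b", "book_d"},
    "cara": {"book_a", "book_c", "book_d", "book_e"},
    "dan":  {"book_f"},
}
print(recommend("ann", profiles))  # ['book_d', 'book_e']
```

Note that no attribute of the books themselves (genre, length) is used: the only input is who bought what.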
Challenges
• Large numbers of products and users (millions)
• Recommendations have to be made in real time
• A lot of users have rated only few items
• Some products are very popular, others are very
rare
• User profiles are changing very dynamically
Solutions
• Quick fixes
– Remove the most popular and most rare products
– Remove users with few ratings
– Weight products by their popularity
• Abstract from the user profiles: model based
collaborative filtering
– Clustering
– Item to Item recommendations rather than User to
Item recommendation
Data Mining Tasks Revisited: Clustering
Clustering is the discovery of
groups in a set of instances
Groups are different, instances
in a group are similar
In 2 to 3 dimensional pattern
space you could just visualise
the data and leave the
recognition to a human end
user
In >3 dimensions this is not
possible
[Figure: instances plotted in pattern space, with axes e.g. age and weight]
Data Mining Tasks Revisited: Clustering
K-Means, a simple clustering algorithm
1. Randomly distribute k ‘prototype vectors’ in
pattern space
2. Allocate all instances to nearest prototype
vector
3. Move prototype vector in direction of the mean
of all allocated instances
4. Repeat process until convergence
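The four steps can be sketched in plain Python as follows. Two simplifications are assumptions of this sketch rather than part of the slide: the initial prototypes are k randomly chosen instances, and in step 3 each prototype jumps all the way to the mean of its instances (the usual batch variant) instead of moving only part of the way. The (age, weight) data points are invented.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: start from k prototypes (here: k randomly chosen instances).
    prototypes = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: allocate every instance to its nearest prototype.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, prototypes[i]))
            clusters[nearest].append(p)
        # Step 3: move each prototype to the mean of its instances
        # (an empty cluster keeps its old prototype).
        moved = [mean(c) if c else prototypes[i]
                 for i, c in enumerate(clusters)]
        # Step 4: stop once no prototype moves any more.
        if moved == prototypes:
            break
        prototypes = moved
    return prototypes, clusters


# Hypothetical (age, weight) instances forming two obvious groups.
points = [(20, 60), (22, 65), (21, 62), (60, 85), (62, 90), (61, 88)]
prototypes, clusters = kmeans(points, k=2)
for proto, members in zip(prototypes, clusters):
    print(proto, members)
```

On data this well separated the result does not depend on the random start; on real data it often does, which is why k-means is commonly run several times with different seeds.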
Clustering for recommender systems
• Perform a clustering on the pattern space of user
profiles down to a smaller number of profile
prototypes
• When making recommendations, search for the
nearest prototype / cluster and generate
recommendations from the cluster
• Problem: how many clusters to use?
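A minimal sketch of the lookup step, assuming the clustering has already been done. The item names and prototype values below are invented; each prototype entry is read here as the fraction of cluster members who bought that item, which is one possible interpretation rather than the only one.

```python
def nearest_prototype(profile, prototypes):
    """Index of the prototype closest to a 0/1 user profile."""
    def dist2(proto):
        return sum((x - w) ** 2 for x, w in zip(profile, proto))
    return min(range(len(prototypes)), key=lambda i: dist2(prototypes[i]))


def recommend_from_cluster(profile, prototypes, items, n=2):
    """Recommend the n most popular items of the nearest cluster."""
    proto = prototypes[nearest_prototype(profile, prototypes)]
    # Candidates: items the user does not own yet and that occur in the cluster.
    candidates = [(w, item)
                  for x, w, item in zip(profile, proto, items)
                  if x == 0 and w > 0]
    # Most popular within the cluster first, ties broken by name.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [item for _, item in candidates[:n]]


# Hypothetical catalogue and profile prototypes.
items = ["thriller_1", "thriller_2", "thriller_3", "cookbook_1", "cookbook_2"]
prototypes = [
    (0.9, 0.8, 0.7, 0.1, 0.0),  # cluster of thriller readers
    (0.1, 0.0, 0.1, 0.9, 0.8),  # cluster of cookbook buyers
]
user = (1, 0, 0, 0, 0)  # owns thriller_1 only
print(recommend_from_cluster(user, prototypes, items))
# ['thriller_2', 'thriller_3']
```

The user is compared against a handful of prototypes instead of millions of other users, which is what makes this approach fast enough for real-time use.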
Alternative approach: item to item filtering
• Record pairs of items bought by the same
person
– This computation is done offline for all items.
• Use this information to recommend similar or
popular books bought by others.
– Rather than finding similar persons, find similar items
for each item in the profile
– This computation is fast and done online.
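The split between the offline and the online step might look like this. The baskets are invented, and scoring by raw co-purchase counts is an assumption of this sketch; production systems typically normalise for item popularity.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_item_table(baskets):
    """Offline step: count how often each pair of items shares a buyer."""
    table = defaultdict(Counter)
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            table[a][b] += 1
            table[b][a] += 1
    return table


def recommend(profile, table, n=3):
    """Online step: look up co-bought items for each item in the profile."""
    scores = Counter()
    for item in profile:
        for other, count in table.get(item, {}).items():
            if other not in profile:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda s: (-s[1], s[0]))
    return [item for item, _ in ranked[:n]]


# Hypothetical purchase histories.
baskets = {
    "ann":  {"book_a", "book_b"},
    "bob":  {"book_a", "book_b", "book_c"},
    "cara": {"book_b", "book_c"},
    "dan":  {"book_a", "book_d"},
}
table = build_item_table(baskets)  # computed once, offline
print(recommend({"book_a"}, table))  # ['book_b', 'book_c', 'book_d']
```

The online step only does table lookups for the items in one profile, so its cost does not grow with the number of users.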
Questions about recommender
systems?
Research @LIACS
• Studies
– Computer Science, Bio Informatics, Mediatechnology, ICT in
Business
• Research groups
– Algorithms and Programmethodology
– Digital Life Technologies
– Imaging & BioInformatics
– High Performance Computing
– Leiden Embedded Research Center
– Software Engineering & Information Systems
– Theoretical Computer Science
Some examples of my research areas
(Jointly with students)
• Mix between applications and new algorithms
– Video mining: recognize settings, porn filtering
– Artificial Immune Systems: copying learning ability of immune
systems
– Predicting Survival Rate for Throat Cancer Patients
– Crime Data Mining
– Fusing Data from Multiple Sources
– Decisioning: offering the right product to the right customer
using predictions
– Bias variance evaluation: distinguish between different
sources of error for a classifier
What have we learned so far?
• Day 1
– Data mining fundamentals
– Basic hands-on experience using WEKA
• Day 2
– Delving deeper into selected applications &
algorithms
– Zoom in on a data mining case using WEKA