Data Mining
Overview
Business Intelligence
Data Mining Defined
Knowledge discovery in databases
Extracting implicit, previously
unknown information from large
volumes of raw data
Instances and Features
Typically, the database will be a
collection of instances
Each instance will have values for a
given set of features
From database theory:
instances are rows,
features are columns
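A minimal sketch of the rows-and-columns view, assuming a small hypothetical loan-application table. The feature names (income, years_employed, approved), the values, and the helper column() are made up for illustration and are not from the slides:

```python
# Hypothetical data: each dict is one instance (a row),
# each key is one feature (a column). Values are illustrative only.
instances = [
    {"income": 52000, "years_employed": 4, "approved": "yes"},
    {"income": 18000, "years_employed": 1, "approved": "no"},
    {"income": 75000, "years_employed": 10, "approved": "yes"},
]

# The feature set is shared by every instance.
feature_names = list(instances[0].keys())


def column(rows, feature):
    """Return the values of one feature (column) across all instances (rows)."""
    return [row[feature] for row in rows]


incomes = column(instances, "income")
print(feature_names)
print(incomes)
```

Reading across one dict gives one instance; reading the same key down all dicts gives one feature.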
Classification
Supervised learning
Suppose instances have been
categorized into classes and the
database includes this categorization
Goal: using the “knowledge” in the
database, classify a given instance
Classifiers
[Figure: feature values X1, X2, X3, …, Xn enter a Classifier, which outputs a category Y; the Classifier draws on a DB, a collection of instances with known categories]
Classifier intelligence
A classifier’s intelligence will be based
on a dataset consisting of instances
with known categories
Typical goal of a classifier: predict,
for a new instance, the category that is
rationally consistent with the dataset
BI Examples
A loans officer in a bank uses a system that
automatically approves or disapproves a
loan application based on previous loan
applications and decisions
An admissions officer in a university uses a
system that automatically makes an
admission decision (accept, reject, wait-list),
based on previous applicants’ data and
decisions made on them
Data mining method example:
k-nearest neighbors
For a given instance T, get the top k
database instances that are “nearest”
to T
Select a reasonable distance measure
Inspect the categories of these k
instances and choose the category C that
represents the most instances
Conclude that T belongs to category C
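The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: Euclidean distance is only one reasonable choice of distance measure, and the function names, the sample database, and the category labels are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """One possible distance measure between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(t, database, k=3, distance=euclidean):
    """Classify instance t by majority vote among its k nearest neighbors.

    database is a list of (feature_vector, category) pairs.
    """
    # Get the top k database instances that are nearest to t.
    nearest = sorted(database, key=lambda item: distance(t, item[0]))[:k]
    # Count the categories of those k instances.
    votes = Counter(category for _, category in nearest)
    # Choose the category with the most votes. On a tie, Counter keeps
    # the category it encountered first among the neighbors.
    return votes.most_common(1)[0][0]


# Hypothetical instances with known categories.
database = [
    ((1.0, 1.0), "approve"),
    ((1.2, 0.8), "approve"),
    ((0.9, 1.1), "approve"),
    ((5.0, 5.0), "reject"),
    ((5.2, 4.9), "reject"),
]

print(knn_classify((1.1, 1.0), database, k=3))  # prints "approve"
```

In practice the features usually need to be put on comparable scales first, otherwise the feature with the largest numeric range dominates the distance.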
Clustering
Unsupervised learning
Classes/categories are not known, but
unexpected groupings (clusters) are
discovered
Clustering provides insight into the
population segments
Clustering
[Figure: instances plotted by Feature 1 and Feature 2, grouped into clusters]
Goal of Clustering
Input: the database of instances, and
possibly some predetermined number
of clusters
Output: the same database of
instances partitioned into clusters
BI Examples
After clustering the current university
student population, it was discovered that
there is a large group of female marketing
majors coming from a particular exclusive
school who tend to get high grades
Business response: focus recruitment on
that school; push the university’s marketing
program
Customer segment characteristics and
spending patterns can direct business
strategies
Data mining method
example: k-means
Guess the number of clusters (k)
Guess cluster centers from the
samples (these will be called
centroids)
Determine cluster membership based
on the distance from the centroids
Repeatedly refine the centroids by
taking the average (mean) of the
members of each cluster, then reassign
membership, until the clusters stop changing
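The k-means steps above can be sketched as follows. This is a minimal illustration under stated assumptions: instances are tuples of numbers, distance is Euclidean, initial centroids are drawn at random from the samples, and the loop stops when the centroids no longer change or after max_iters rounds. The function names, the seed and max_iters parameters, and the sample points are hypothetical.

```python
import math
import random


def mean_point(points):
    """Average (mean) of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
        clusters[idx].append(p)
    return clusters


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, clusters) after iterative refinement."""
    rng = random.Random(seed)
    # Guess the initial cluster centers from the samples.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Determine membership by distance from the centroids.
        clusters = assign(points, centroids)
        # Refine each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical two-feature instances.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
for center, members in zip(centroids, clusters):
    print(center, members)
```

The result depends on the initial guess, so a common practice is to run the procedure several times with different starting centroids and keep the best partition.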
Summary
Two sub-areas of data mining have
been discussed: supervised
(classification) and unsupervised
(clustering) learning methods
For both types of methods, intelligent
systems can be created to support
business decision making