Brandon_Leonardo_Data_mining

Data Mining
Brandon Leonardo
CS157B (Spring 2006)
What Is Data Mining?
• A way to discover knowledge
• “Semiautomatically analyzing large
databases to find useful patterns”
• Notable Characteristics
• Large amounts of data
• Data Stored on Disk
What Are We Looking For?
• Rules
• Use sets of rules to predict/classify
objects
• Ex. “Students with an annual income of less
than $20,000 are most likely to get
a student loan”
• Patterns
• Different kinds of patterns
• Multiple patterns in one data set
What Can Data Mining Do?
• Applications
• Prediction
• Which class an item will belong to, or what
value it will take, based on its attributes
• What kind of animal will this be, considering that
it has stripes, 4 legs, and talks?
• Which customers are likely to switch to a
competitor?
What Can Data Mining Do?
• Applications
• Association
• Items that tend to occur together
• Amazon – books that are bought together
• Causality
• Whether riding a motorcycle increases your
chances of dying in an accident
• Descriptive patterns
• Clusters
Classification
• Take a new item and, given past instances
(the training instances), figure out
which class the new item belongs to
• How?
• Rules
• Decision Trees
• Bayesian Classifiers
Rule Classifiers
• Assign data to classes by applying
a set of rules
• Ex.
• If a new customer signs up for a credit
card, and makes less than $30,000 a
year, then place them in a high risk
category
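The credit card rule above can be written as a tiny function. This is only a sketch: the function name, the parameter name, and the "unclassified" fallback are assumptions, since the slide gives a single rule and says nothing about customers who earn more.

```python
def classify_customer(annual_income, high_risk_threshold=30_000):
    """Apply the one rule from the slide: income below the threshold => high risk."""
    if annual_income < high_risk_threshold:
        return "high risk"
    # The slide states no rule for other customers; this label is a placeholder.
    return "unclassified"


label = classify_customer(25_000)
```

A real rule classifier would hold many such rules and check them in some order; one rule is enough to show the shape.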
Decision Tree Classifiers
• Traverse the tree based on attributes,
making a decision at each node until
a leaf is reached
• Ex. Being Hired At Google
Degree
    Bachelors: School
        Stanford: Hired
        Not Stanford: Not Hired
    PhD: School
        Stanford: Hired
        Not Stanford: Not Hired
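One way to sketch the traversal is to store the example tree as nested dictionaries and walk it until a leaf is reached. The dictionary layout and the names `HIRING_TREE` and `classify` are assumptions for illustration; the tree shape (test the degree first, then the school) is read from the slide.

```python
# Internal nodes are dicts naming the attribute to test; leaves are class labels.
SCHOOL_TEST = {
    "attribute": "school",
    "branches": {"Stanford": "Hired", "Not Stanford": "Not Hired"},
}
HIRING_TREE = {
    "attribute": "degree",
    "branches": {"Bachelors": SCHOOL_TEST, "PhD": SCHOOL_TEST},
}


def classify(tree, item):
    """Follow the branch matching the item's attribute value at each node."""
    node = tree
    while isinstance(node, dict):  # keep going until we hit a leaf label
        value = item[node["attribute"]]
        node = node["branches"][value]
    return node


result = classify(HIRING_TREE, {"degree": "PhD", "school": "Stanford"})
```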
Bayesian Classifiers
• Bayesian
• Predict the probability of an item being
in a class for every class
• The class with the largest probability
“wins”
• P(cj | d) = P(d | cj) P(cj) / P(d)
• P(cj | d) – probability that instance d belongs to class cj
• P(d | cj) – probability of generating instance d given class cj
• P(cj) – probability of getting class cj
• P(d) – probability of d occurring
• If an attribute value is missing for an instance, it is left out of the probability computation
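Bayes' rule above can be applied directly once P(cj) and P(d | cj) are known for every class. The sketch below uses made-up numbers and hypothetical class names, and it assumes the classes are mutually exclusive and cover all cases, so that P(d) is the sum of P(d | cj) P(cj) over the classes.

```python
def posteriors(priors, likelihoods):
    """Return P(c | d) for every class c, given P(c) and P(d | c)."""
    # P(d) by the law of total probability (assumes the classes partition the data).
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}


# Illustrative numbers only, not taken from the slides.
priors = {"high risk": 0.2, "low risk": 0.8}       # P(cj)
likelihoods = {"high risk": 0.6, "low risk": 0.1}  # P(d | cj)

post = posteriors(priors, likelihoods)
winner = max(post, key=post.get)  # the class with the largest probability "wins"
```

Since P(d) is the same for every class, comparing P(d | cj) P(cj) alone would pick the same winner.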
Regression
• Linear regression/Curve fitting
• Y = a0 + a1*X1 + a2*X2 + … + an*Xn
• You choose the coefficients a0, a1, a2,
…, an
• Find the best fit
• The fit is not always exact
• noise in data
• relationship isn’t polynomial
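For a single variable, Y = a0 + a1*X, the best fit in the least-squares sense has a closed form. The sketch below assumes least squares is the fitting criterion (the slide only says "best fit") and uses made-up noisy data; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a0 + a1*x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Illustrative data: roughly y = 1 + 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a0, a1 = fit_line(xs, ys)
```

The fitted line does not pass through every point, which is the "not always exact" case caused by noise.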
Association Rules
• Rules are written as antecedent => consequent
• Support
• What fraction of the population satisfies both the
antecedent and consequent of the rule
• Confidence
• How often the consequent is true when the
antecedent is true
• Ex. Owning car => Buying Gas
• Support – 99.9%
• Confidence – 99.9%
• Probably True
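Support and confidence as defined above can be computed by counting. In this sketch each person is a set of items; the data and the function names are made up for illustration.

```python
def support(population, antecedent, consequent):
    """Fraction of the population that has both antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for p in population if both <= p) / len(population)


def confidence(population, antecedent, consequent):
    """Fraction of those with the antecedent that also have the consequent."""
    with_antecedent = [p for p in population if antecedent <= p]
    if not with_antecedent:
        return 0.0  # rule never applies; 0.0 is an arbitrary convention here
    both = antecedent | consequent
    return sum(1 for p in with_antecedent if both <= p) / len(with_antecedent)


# Illustrative population of four people.
population = [{"car", "gas"}, {"car", "gas"}, {"car"}, {"bike"}]
s = support(population, {"car"}, {"gas"})     # 2 of 4 people
c = confidence(population, {"car"}, {"gas"})  # 2 of 3 car owners
```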
Association Rules
• Shortcomings
• Sometimes items occur together without
one actually causing the other
• Ex. Haircuts and Grocery Shopping
• 99% of population gets haircuts
• 100% of population goes grocery shopping
• Everybody who gets a haircut goes grocery
shopping, but does that mean one is really linked
to the other?
• Deviation from existing patterns
• Correlation (positive and negative)
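The haircut example can be checked with numbers. The sketch below builds a made-up population of 100 people matching the percentages on the slide and compares the rule's confidence with how common grocery shopping is overall. Comparing against that baseline is an assumption added here as a sanity check, not something the slide prescribes.

```python
# 99 people get haircuts and buy groceries; 1 person only buys groceries.
people = [{"haircut", "groceries"}] * 99 + [{"groceries"}]

with_haircut = [p for p in people if "haircut" in p]

# Confidence of "haircut => groceries".
confidence = sum("groceries" in p for p in with_haircut) / len(with_haircut)

# How often people buy groceries regardless of haircuts.
baseline = sum("groceries" in p for p in people) / len(people)

# If knowing about the haircut does not raise the odds, the rule adds nothing.
rule_is_informative = confidence > baseline
```

Both values come out to 100%, so the rule has high support and confidence yet says nothing about haircuts.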
Clustering
• Clusters of points in a data set
• Break the set down into subsets
• Types
• Hierarchical clustering
• Based on different levels, break things
down as you go deeper
• Agglomerative clustering
• Start small, then create higher levels
• Divisive clustering
• Start big, then create lower levels
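Agglomerative clustering can be sketched in a few lines: start with every point as its own cluster, then repeatedly merge the two closest clusters. The version below works on one-dimensional points and measures cluster distance by the closest pair of points (single linkage). Both choices are assumptions made to keep the example short.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest clusters until only num_clusters remain."""
    clusters = [[p] for p in points]  # start small: one point per cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the nearest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # create the higher-level cluster
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3)
```

Recording each merge instead of stopping at a fixed count would give the full hierarchy; divisive clustering runs the other way, splitting one big cluster.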
Other Types of Mining
• Text mining
• Mining text documents
• Data visualization
• Maps, charts, other graphical things
• Don’t analyze the data, just present it
for users (humans are good at seeing
patterns)
References
• Silberschatz, Korth, Sudarshan: Database System Concepts