Data Mining - University of Kentucky
CS 685
Special Topics in Data mining
Instructor: Jinze Liu
Spring 2014
Welcome!
Instructor: Jinze Liu
Homepage: http://www.cs.uky.edu/~liuj
Office: 235 Hardymon Building
Email: [email protected]
Overview
Time: TR 2pm-3:15pm
Office hours: by appointment
Credit: 3
Preferred Prerequisites:
Data structures, algorithms, databases, AI, machine learning, statistics.
Overview
Textbook:
Mining of Massive Datasets. Can be accessed for free at
http://infolab.stanford.edu/~ullman/mmds/book.pdf
A collection of papers in recent conferences and journals
Other References
Data Mining --- Concepts and techniques, by Han and Kamber, Morgan
Kaufmann. (ISBN:1-55860-901-6)
Introduction to Data Mining, by Tan, Steinbach, and Kumar, Addison
Wesley. (ISBN:0-321-32136-7)
Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press.
(ISBN:0-262-08290-X)
The Elements of Statistical Learning --- Data Mining, Inference, and Prediction,
by Hastie, Tibshirani, and Friedman. (ISBN:0-387-95284-5)
Overview
Grading scheme
4-6 Homeworks: 30%
1 Exam: 20%
1 Presentation: 20%
1 Project: 30%
Overview
Project
Individual project or team project (no more than 2 students)
Options
Development of original algorithms
Application of existing algorithms to solve a real-world problem.
Examples
MIT big data challenge
http://bigdatachallenge.csail.mit.edu/datasets
Data visualization
SAP Lumira
http://scn.sap.com/community/lumira/blog/2014/01
Dataset examples: http://scn.sap.com/docs/DOC31433
Digging into reviews?
http://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset
Overview
Paper presentation
One per student
A talk on one of the latest research developments in data mining. It should also
be related to your project.
Research paper(s)
Your own pick (upon approval)
Three parts
Motivation for the research
Review of data mining methods
Discussion
Questions and comments from audience
Class participation: One question/comment per student
Order of presentation: arranged by topic.
Introduction to Data Mining
Why Mine Data?
Lots of data is being collected and warehoused
Public web data
Social Networks
Reviews
Purchases at department/grocery stores
Bank/Credit Card transactions
……
Storage and computing have become cheaper and more powerful
The Growth of Data
The Big Data Economy
The basis for competitive advantage
Customer profiling and targeting as well as predictive analytics
Optimize service and maintenance (e.g., GE)
New products, features and value-adding services (e.g., LinkedIn, Facebook, and others)
http://www.ijento.com/blog/what-is-big-data-a-guide-for-cmos/
Data Scientist
Data scientist – the sexiest job of the 21st century (Harvard Business Review)
To find insight in data
Skills needed?
writing code
being curious about discovery
solid foundation in math, statistics, probability and computer
science
communication
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
What is Data Mining?
Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
Cultures
Databases: concentrate on large-scale (non-main-memory) data.
AI (machine-learning): concentrate on complex methods, small data.
Statistics: concentrate on models.
Models vs. Analytic Processing
To a database person, data-mining is an extreme
form of analytic processing – queries that
examine large amounts of data.
Result is the query answer.
To a statistician, data-mining is the inference of
models.
Result is the parameters of the model.
(Way too Simple) Example
Given a billion numbers, a DB person would
compute their average and standard deviation.
A statistician might fit the billion points to the
best Gaussian distribution and report the mean
and standard deviation of that distribution.
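The contrast above can be sketched in a few lines of Python. This is only an illustration under two assumptions not stated on the slide: "best" Gaussian is read as the maximum-likelihood fit, and a small random sample stands in for the billion numbers. The function names `summarize` and `fit_gaussian` are hypothetical.

```python
import math
import random

def summarize(values):
    """DB-style aggregate query: average and population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def fit_gaussian(values):
    """Statistician's view: return the parameters (mu, sigma) of a Gaussian model.

    Assumption: the fit is by maximum likelihood, for which the estimates
    are exactly the sample mean and the population standard deviation.
    """
    return summarize(values)

random.seed(0)
# Stand-in data; the slide's "billion numbers" would not fit in a demo.
data = [random.gauss(5.0, 2.0) for _ in range(100_000)]

avg, std = summarize(data)      # the "query answer"
mu, sigma = fit_gaussian(data)  # the "model parameters"
print(f"query answer:  avg={avg:.3f}, std={std:.3f}")
print(f"fitted model:  mu={mu:.3f}, sigma={sigma:.3f}")
```

Under these assumptions the two camps compute the same numbers; the difference is in what the result is taken to mean (an answer about this data versus a model of where the data came from).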
Examples
Discuss whether or not each of the following
activities is a data mining task.
(a) Dividing the customers of a company according to
their gender.
(b) Dividing the customers of a company according to
their profitability.
(c) Predicting the future stock price of a company using
historical records.
Examples
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a
threshold. However, predicting the profitability of a new customer would be
data mining.
(c) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous
value of the stock price. This is an example of the area of data mining known
as predictive modelling. We could use regression for this modelling, although
researchers in many fields have developed a wide variety of techniques for
predicting time series.
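As a rough illustration of answer (c), the sketch below fits a straight line to past prices by ordinary least squares and extrapolates one step ahead. The price history is made up, the helper names `fit_line` and `predict_next` are hypothetical, and a single linear trend is far simpler than the time-series techniques the answer alludes to.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_next(ys):
    """Extrapolate the fitted line to the next time step."""
    slope, intercept = fit_line(ys)
    return slope * len(ys) + intercept

# Hypothetical closing prices, oldest first (not real market data).
history = [10.0, 10.5, 11.2, 11.8, 12.1]
forecast = predict_next(history)
print(f"forecast for next period: {forecast:.2f}")
```

The point is only that the output is a continuous value learned from historical records, which is what makes the task predictive modelling rather than a lookup.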
Meaningfulness of Answers
A big data-mining risk is that you will
“discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle:
(roughly) if you look in more places for
interesting patterns than your amount of data
will support, you are bound to find crap.
Examples of Bonferroni’s
Principle
A big objection to TIA (Total Information Awareness) was that it was looking
for so many vague connections that it was sure
to find things that were bogus and thus violate
innocents’ privacy.
The Rhine Paradox: a great example of how not
to conduct scientific research.
The “TIA” Story
Suppose we believe that certain groups of evil-doers
are meeting occasionally in hotels to plot
doing evil.
We want to find (unrelated) people who at least
twice have stayed at the same hotel on the same
day.
The Details
$10^9$ people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (10
days out of 1000).
Hotels hold 100 people (so $10^5$ hotels).
If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything
suspicious?
Calculations – (1)
Probability that given persons p and q will be at
the same hotel on given day d:
$\frac{1}{100} \times \frac{1}{100} \times 10^{-5} = 10^{-9}$
(p at some hotel, q at some hotel, and both at the same hotel).
Probability that p and q will be at the same hotel
on given days d1 and d2:
$10^{-9} \times 10^{-9} = 10^{-18}$.
Pairs of days:
$5 \times 10^5$.
Calculations – (2)
Probability that p and q will be at the same hotel
on some two days:
$5 \times 10^5 \times 10^{-18} = 5 \times 10^{-13}$.
Pairs of people:
$5 \times 10^{17}$.
Expected number of “suspicious” pairs of
people:
$5 \times 10^{17} \times 5 \times 10^{-13} = 250,000$.
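The arithmetic can be checked with a short Python sketch. It assumes the figures given under "The Details" (10^9 people, 1000 days, a 1% chance of being in a hotel on any day, 10^5 hotels) and, unlike the slides, counts pairs exactly with `math.comb` instead of rounding to 5 × 10^5 and 5 × 10^17. The variable names are illustrative only.

```python
from math import comb

people = 10**9
days = 1000
p_in_hotel = 0.01   # each person is in some hotel on 1% of days
hotels = 10**5

# p in a hotel, q in a hotel, and both picked the same one.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels   # about 1e-9
# The same coincidence on two specific days d1 and d2.
p_same_hotel_two_days = p_same_hotel_one_day ** 2          # about 1e-18

pairs_of_days = comb(days, 2)      # 499,500, roughly 5e5
pairs_of_people = comb(people, 2)  # roughly 5e17

expected_suspicious = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(f"expected suspicious pairs: {expected_suspicious:,.0f}")
```

The exact pair counts give a figure just under the slides' rounded 250,000, so the conclusion is unchanged: purely random behavior already produces about a quarter of a million "suspicious" pairs.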
Conclusion
Suppose there are (say) 10 pairs of evil-doers
who definitely stayed at the same hotel twice.
Analysts have to sift through 250,010 candidates
to find the 10 real cases.
Not gonna happen.
But how can we improve the scheme?
Moral
When looking for a property (e.g., “two people
stayed at the same hotel twice”), make sure that
the property does not allow so many possibilities
that random data will surely produce facts “of
interest.”
Rhine Paradox – (1)
Joseph Rhine was a parapsychologist in the 1950’s
who hypothesized that some people had Extra-Sensory Perception.
He devised (something like) an experiment where
subjects were asked to guess 10 hidden cards – red
or blue.
He discovered that almost 1 in 1000 had ESP –
they were able to get all 10 right!
Rhine Paradox – (2)
He told these people they had ESP and called
them in for another test of the same type.
Alas, he discovered that almost all of them had
lost their ESP.
What did he conclude?
Answer on next slide.
Rhine Paradox – (3)
He concluded that you shouldn’t tell people
they have ESP; it causes them to lose it.
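A less exotic explanation is chance alone: guessing 10 red/blue cards correctly has probability 2^-10 = 1/1024, which is "almost 1 in 1000". The sketch below, with hypothetical function names and an assumed pool of 100,000 subjects (the slides give no subject count), computes the expected number of perfect scorers and simulates the experiment with fair coin flips.

```python
import random

def expected_perfect_guessers(subjects, cards=10):
    """Expected number of subjects who guess every card right by pure chance."""
    return subjects * 0.5 ** cards

def simulate(subjects, cards=10, seed=0):
    """Count subjects who get all cards right when every guess is a coin flip."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(subjects):
        if all(rng.random() < 0.5 for _ in range(cards)):
            perfect += 1
    return perfect

subjects = 100_000  # assumed pool size, for illustration only
print("expected by chance:", expected_perfect_guessers(subjects))
print("simulated:", simulate(subjects))

# On a retest, each earlier "winner" again succeeds with probability 1/1024,
# so almost all of them are expected to "lose" their ESP.
p_repeat = 0.5 ** 10
print("chance a winner repeats:", p_repeat)
```

With enough subjects, some perfect scores are close to guaranteed, and the retest simply draws from the same 1-in-1024 odds.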
Moral
Understanding Bonferroni’s Principle will help
you look a little less stupid than a
parapsychologist.
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future values of
other variables.
Description Methods
Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Regression [Predictive]
Semi-supervised Learning
Semi-supervised Clustering
Semi-supervised Classification
Data Mining Tasks Covered in This Course
Classification [Predictive]
Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Deviation Detection [Predictive]
Semi-supervised Learning
Semi-supervised Clustering
Semi-supervised Classification
Survey
Why are you taking this course?
What would you like to gain from this course?
What topics are you most interested in learning about
from this course?
Any other suggestions?
KDD References
Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems (SIGMOD: CD ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT,
DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ICML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
KDD References
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Bioinformatics
Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
Journals: J. of Computational Biology, Bioinformatics, etc.
Visualization
Conference proceedings: InfoVis, CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.