Transcript 1 - MMLab

Data mining in Wikipedia
2011-09-26
Sinchoo Kim
Terms
• Data
– Unorganized and unprocessed facts
• Information
– Data that are processed to be useful
– Provides answers to "who", "what", "where",
and "when" questions
• Knowledge
– Application of data and information
– Answers "how" questions
KDD
• KDD (Knowledge Discovery in Databases)
– Describes the process of automatically
searching large volumes of data for patterns
that can be considered knowledge about the
data
• Selection
• Preprocessing
• Transformation
• Data Mining
• Interpretation/Evaluation
Data mining
• Definition
– The analysis step of the Knowledge Discovery
in Databases process
– Discovering previously unknown patterns
– Example
• Home equity loan
Case : Home equity loan
• Select the subset of customer records for
customers who have received a home equity loan offer
Income     Number of children   Average Checking Account Balance   Response
$40,000    2                    $1500                              Yes
$75,000    0                    $5000                              No
$50,000    1                    $3000                              No
Case : Home equity loan
• Find rules to predict whether or not a customer
would respond to a home equity loan offer
IF (Salary < 40k) and
(numChildren > 0) and
(ageChild1 > 18 and ageChild1 < 22)
THEN YES
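
The rule above can be sketched as a small Python function. This is only an illustration: the function and parameter names are made up here, and treating "rule does not fire" as a "no" prediction is an assumption, since the slide only states the YES branch.

```python
def responds_to_offer(salary, num_children, age_child1=None):
    """Apply the slide's IF-THEN rule to one customer record.

    age_child1 is None when the customer has no children (assumed encoding).
    Returns True for a predicted "YES", False otherwise (assumed default).
    """
    if age_child1 is None:
        return False
    return salary < 40_000 and num_children > 0 and 18 < age_child1 < 22


# Hypothetical customers, invented for illustration.
print(responds_to_offer(35_000, 2, 19))  # all three conditions hold
print(responds_to_offer(75_000, 0))      # salary too high, no children
```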
Case : Home equity loan
• Group customers into clusters and
investigate clusters
[Figure: customers grouped into four clusters, Group 1 to Group 4]
Case : Home equity loan
• Evaluate results
– Many “uninteresting” clusters
– One interesting cluster! Customers with both
business and personal accounts; unusually
high percentage of likely respondents
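
One way to evaluate clusters like this is to compare each cluster's response rate with the overall rate. The sketch below assumes the cluster labels are already assigned; the records, the `factor` threshold, and the function names are all invented for illustration and are not from the original case.

```python
from collections import defaultdict

# Hypothetical (cluster label, responded) pairs, invented for illustration.
records = [
    ("Group 1", False), ("Group 1", False), ("Group 1", True), ("Group 1", False),
    ("Group 2", False), ("Group 2", False),
    ("Group 3", True), ("Group 3", True), ("Group 3", False),
    ("Group 4", False), ("Group 4", False), ("Group 4", False),
]


def response_rates(records):
    """Return the fraction of responders in each cluster."""
    counts = defaultdict(lambda: [0, 0])  # label -> [responders, total]
    for label, responded in records:
        counts[label][0] += int(responded)
        counts[label][1] += 1
    return {label: yes / total for label, (yes, total) in counts.items()}


def interesting_clusters(rates, overall, factor=2.0):
    """Flag clusters whose response rate is at least `factor` times the overall rate."""
    return sorted(label for label, rate in rates.items() if rate >= factor * overall)


rates = response_rates(records)
overall = sum(responded for _, responded in records) / len(records)
print(interesting_clusters(rates, overall))
```

With this made-up data only one cluster stands out, which mirrors the "many uninteresting clusters, one interesting cluster" outcome on the slide.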
Common classes of tasks
• Association rule learning
– Searches for relationships between variables
• Clustering
– Discovers groups and structures in the data
that are in some way similar
• Anomaly detection
– Identification of unusual data records
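
As an illustration of association rule learning, the sketch below computes support and confidence, two standard measures for a rule such as "bread implies milk". The slides do not name these measures, and the transactions and function names are invented for the example.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"bread", "milk"}, transactions))       # 2 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread transactions
```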
Common classes of tasks
• Classification
– Generalizing known structure to apply to new
data
• Regression
– Find a function which models the data with
the least error
• Summarization
– Provide a more compact representation of the
data set
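
For the regression task, a minimal sketch is ordinary least squares with a single input variable, which picks the line with the smallest sum of squared errors. This is one common choice of "least error", not the only one; the data and the name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```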
Notable uses
• Business
- Customer management
• Marketing
• Identify purchase patterns
- In human resources departments
• Identifying the characteristics of their most
successful employees
- In decision-making support
• Integrated-circuit production line
Notable uses
• Science and engineering
- Human genetics
• Relation between genetics and diseases
- Electrical power engineering
• Detect abnormal conditions
• Estimate the nature of the abnormalities
Notable uses
• Visual Data Mining
– Large data sets have been generated,
collected, and stored
– Finds trends and information hidden in
these data sets
Issues
• Reliable data set
– Overfitting
• Finding patterns in the training set which are
not present in the general data set
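
Overfitting can be illustrated with a deliberately bad model that memorizes its training set. The sketch below is an assumption-laden toy: the data, the `False` default for unseen incomes, and the function names are all invented for illustration.

```python
# Hypothetical (income, responded) pairs, invented for illustration.
train = [(40_000, True), (75_000, False), (50_000, False), (42_000, True)]
held_out = [(41_000, True), (39_000, True), (80_000, False), (45_000, False)]


def fit_memorizer(rows):
    """Overfit on purpose: memorize each exact income and its label."""
    table = dict(rows)
    # Incomes never seen in training fall back to False (assumed default).
    return lambda income: table.get(income, False)


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


model = fit_memorizer(train)
print(accuracy(model, train))     # perfect on the data it memorized
print(accuracy(model, held_out))  # noticeably worse on unseen data
```

The gap between the two accuracies is the symptom: the "pattern" the model found exists only in the training set.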
Issues
• Privacy concerns and ethics
– The term data mining has no ethical
implications
– Compiled data may enable anyone who has access
to the newly compiled data set to identify
specific individuals, especially when the data
were originally anonymous