CSE591 Data Mining - College of Engineering and Computer Science
CS499/699-10 Data Mining
Fall 2003
Professor Guozhu Dong
Computer Science & Engineering
WSU
9/03
Data Mining – Introduction
G Dong (WSU)
1
Introduction
Introduction to this Course
Introduction to Data Mining
Introduction to the Course
First, about you - why take this course?
Your background and strengths
AI, DBMS, Statistics, Biology, Business, …
Your interests and requests
What is this course about?
Problem solving
Handling data
transform data to workable data
Mining data
turn data to knowledge
validation and presentation of knowledge
This course
What can you expect from this course?
Knowledge and experience about DM
Problem solving skills
How is this course conducted?
Homework, projects, exams, classes
Course Format
Individual Projects: 30%
Exams and/or quizzes: 60%
Homework: 10%
Course Web Site
cs.wright.edu/~gdong/mining03/WSUCS499DataMining.htm
My office and office hours
RC 430
4:30-5:30, T Th
My email: [email protected]
Slides and relevant information will be made available
at the course web site
Any questions and suggestions?
Your feedback is most welcome!
I need it to adapt the course to your needs.
Please feel free to provide yours anytime.
Share your questions and concerns with the class; very likely others have the same ones.
No pain, no gain: there is no magic in data mining.
The more you put in, the more you get out.
Your grades are proportional to your efforts.
Introduction to Data Mining
Definitions
Motivations of DM
Interdisciplinary Links of DM
What is DM?
Or more precisely KDD (knowledge discovery
from databases)?
Many definitions
An iterative process, not plug-and-play
raw data → transformed data → preprocessed data → data mining → post-processing → knowledge
One definition is:
A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data
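The stages of the process can be sketched as a small pipeline. This is only an illustration, not a prescribed method: the toy purchase records, the function names (`transform`, `preprocess`, `mine`, `postprocess`, `kdd`), and the `min_support` threshold are all assumptions made for this example. The mining step here simply counts how often pairs of items occur together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data: one comma-separated purchase record per string.
RAW = [
    "Milk, Bread,eggs",
    "bread,  milk",
    "",                      # empty record, dropped during preprocessing
    "milk,bread,butter",
    "eggs,butter",
]

def transform(raw_records):
    """Raw data -> transformed data: split each record into normalized items."""
    return [[item.strip().lower() for item in rec.split(",")] for rec in raw_records]

def preprocess(records):
    """Transformed data -> preprocessed data: drop blank and duplicate items."""
    cleaned = [sorted({item for item in rec if item}) for rec in records]
    return [rec for rec in cleaned if rec]

def mine(records):
    """Data mining: count how often each pair of items co-occurs."""
    counts = Counter()
    for rec in records:
        counts.update(combinations(rec, 2))
    return counts

def postprocess(counts, min_support):
    """Post-processing: keep only pairs seen at least min_support times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

def kdd(raw_records, min_support=2):
    """Chain the stages; in practice each stage is revisited iteratively."""
    return postprocess(mine(preprocess(transform(raw_records))), min_support)

knowledge = kdd(RAW)
print(knowledge)
```

In a real project the stages are rarely run once in sequence: a weak result at the mining step usually sends you back to the transformation or preprocessing steps, which is what "iterative, not plug-and-play" refers to.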
Need for Data Mining
Data accumulate and double every 9 months
There is a big gap from stored data to knowledge;
and the transition won’t occur automatically.
Manual data analysis is not new but a bottleneck
Fast developing Computer Science and Engineering
generates new demands
Seeking knowledge from massive data
Any personal experience?
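The doubling claim can be turned into simple arithmetic. The sketch below assumes steady exponential growth with a fixed nine-month doubling time (the slide's figure, taken at face value); the function name `growth_factor` is made up for this example.

```python
def growth_factor(months, doubling_months=9.0):
    """Factor by which the data volume grows over `months`,
    assuming it doubles every `doubling_months` (an idealized model)."""
    return 2.0 ** (months / doubling_months)

for months in (9, 12, 36):
    print(f"after {months:2d} months: x{growth_factor(months):.2f}")
```

Under this assumption stored data grows about 2.5-fold per year and 16-fold in three years, which suggests why manual analysis becomes the bottleneck.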
When is DM useful?
Data rich world
Large data (dimensionality and size)
Image data (size)
Gene chip data (dimensionality)
Little knowledge about data (exploratory data analysis)
What if we have some knowledge?
DM perspectives
KDD “goals”: Prediction, description, explanation,
optimization, and exploration
Knowledge forms: patterns vs. models
Understandability and representation of knowledge
Some applications
Business intelligence (CRM)
Security (Info, Comp Systems, Networks, Data,
Privacy)
Scientific discovery (bioinformatics, medicine)
Challenges
Increasing data dimensionality and data size
Various data forms
New data types
Streaming data, multimedia data
Efficient search and access to
data/knowledge
Intelligent update and integration
Interdisciplinary Links of DM
Statistics
Databases
AI
Machine Learning
Visualization
High Performance Computing
supercomputers, distributed/parallel/cluster
computing
Statistics
Discovery of structures or patterns in data sets
hypothesis testing, parameter estimation
Optimal strategies for collecting data
efficient search of large databases
Static data
constantly evolving data
Models play a central role
algorithms are a major concern
patterns are sought
Relational Databases
A relational database can contain several tables
Tables and schemas
The goal in data organization is to maintain data and quickly locate the requested data
Queries and index structures
Query execution and optimization
Query optimization is to find the “best” possible evaluation method for a given query
Providing fast, reliable access to data for data mining
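To illustrate how an index changes the evaluation method chosen for a query, here is a minimal sketch using Python's built-in `sqlite3` module. The table `customers`, its columns, the sample rows, and the index name `idx_customers_city` are invented for this example. SQLite's plan text varies between versions, so the code only inspects whether the plan mentions the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers (city, spend) VALUES (?, ?)",
    [("Dayton", 120.0), ("Columbus", 75.5), ("Dayton", 33.0), ("Akron", 210.0)],
)

QUERY = "SELECT id, spend FROM customers WHERE city = ?"

def plan(connection, sql, params):
    """Return SQLite's description of how it would evaluate the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the plan text

before = plan(conn, QUERY, ("Dayton",))   # no index on city yet
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
after = plan(conn, QUERY, ("Dayton",))    # the optimizer may now use the index

print("before:", before)
print("after: ", after)
print(conn.execute(QUERY, ("Dayton",)).fetchall())
```

The query text and its result are the same in both cases; only the evaluation method differs. That separation between what is asked and how it is executed is what lets a database serve data mining workloads efficiently.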
AI
Intelligent agents
Perception-Action-Goal-Environment
Search
Uniform cost and informed search algorithms
Knowledge representation
FOL, production rules, frames with semantic networks
Knowledge acquisition
Knowledge maintenance and application
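As a concrete instance of the search topic, here is a minimal sketch of uniform-cost search, which always expands the frontier node with the lowest path cost so far. The graph `GRAPH` and its edge costs are hypothetical, chosen only to make the example small. An informed search such as A* would additionally add a heuristic estimate to the priority.

```python
import heapq

# Hypothetical weighted graph: node -> list of (neighbor, step cost).
GRAPH = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of a cheapest path, or None if goal is unreachable."""
    frontier = [(0, start, [start])]   # priority queue ordered by path cost
    expanded = set()                   # nodes whose cheapest cost is settled
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in expanded:           # a cheaper route to this node was expanded
            continue
        expanded.add(node)
        for neighbor, step in graph[node]:
            if neighbor not in expanded:
                heapq.heappush(frontier, (cost + step, neighbor, path + [neighbor]))
    return None

print(uniform_cost_search(GRAPH, "A", "D"))
```

The direct edge from A to C costs 4, but the search finds the cheaper route through B first, because the frontier is ordered by accumulated cost rather than by number of steps.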
Machine Learning
Focusing on complex representations, data-intensive
problems, and search-based methods
Flexibility with prior knowledge and collected data
Generalization from data and empirical validation
statistical soundness and computational efficiency
constrained by finite computing & data resources
Challenges from KDD
scaling up, cost info, auto data preprocessing,
more knowledge types
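Generalization and empirical validation can be illustrated with a holdout test: fit on part of the data, then measure accuracy on examples the learner never saw. The sketch below is a toy under stated assumptions. The data set `DATA`, the deterministic every-second-example split, and the choice of a 1-nearest-neighbor classifier are all made up for this example.

```python
# Hypothetical labeled data: (feature value, class label).
DATA = [(1.0, "low"), (1.2, "low"), (0.8, "low"), (1.1, "low"),
        (3.0, "high"), (3.3, "high"), (2.9, "high"), (3.1, "high")]

def split(data):
    """Hold out every second example for validation (deterministic for clarity)."""
    return data[0::2], data[1::2]

def predict(train, x):
    """1-nearest-neighbor: return the label of the closest training example."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]

def accuracy(train, held_out):
    """Fraction of held-out examples whose label is predicted correctly."""
    hits = sum(predict(train, x) == label for x, label in held_out)
    return hits / len(held_out)

train, held_out = split(DATA)
print(accuracy(train, held_out))
```

The two classes here are well separated, so a high score says little by itself. The point is the procedure: accuracy is measured on held-out data rather than on the examples used for learning, and with finite data the estimate is only as reliable as the held-out set is large.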
Visualization
Producing a visual display with insights into the
structure of the data with interactive means
zoom in/out, rotating, displaying detailed info
Various types of visualization methods
show summary properties and explore relationships
between variables
investigate large DBs and convey lots of information
analyze data with geographic/spatial location
A pre- and post-processing tool for KDD
Bibliography
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
W. Klösgen and J. M. Zytkow (eds.). Handbook of Data Mining and Knowledge Discovery. 2001.