CSE591 Data Mining - College of Engineering and Computer Science


CS499/699-10 Data Mining
Fall 2003
Professor Guozhu Dong
Computer Science & Engineering
WSU
9/03
Data Mining – Introduction
G Dong (WSU)
1
Introduction


Introduction to this Course
Introduction to Data Mining
Introduction to the Course

First, about you: why take this course?
  - Your background and strengths
    - AI, DBMS, Statistics, Biology, Business, …
  - Your interests and requests
What is this course about?
  - Problem solving
  - Handling data
    - transform data into workable data
  - Mining data
    - turn data into knowledge
    - validation and presentation of knowledge
This course

What can you expect from this course?
  - Knowledge and experience about DM
  - Problem-solving skills
How is this course conducted?
  - Homework, projects, exams, classes
Course Format
  - Individual projects: 30%
  - Exams and/or quizzes: 60%
  - Homework: 10%
Course Web Site




cs.wright.edu/~gdong/mining03/WSUCS499DataMining.htm
My office and office hours
  - RC 430
  - 4:30-5:30, T Th
My email: [email protected]
Slides and relevant information will be made available at the course web site
Any questions and suggestions?

Your feedback is most welcome!
  - I need it to adapt the course to your needs.
  - Please feel free to provide yours anytime.
Share your questions and concerns with the class; very likely others have the same ones.
No pain, no gain: there is no magic in data mining.
  - The more you put in, the more you get out.
  - Your grades are proportional to your efforts.
Introduction to Data Mining
Definitions
Motivations of DM
Interdisciplinary Links of DM
What is DM?

Or more precisely KDD (knowledge discovery from databases)?
  - Many definitions
  - An iterative process, not plug-and-play
  - raw data -> transformed data -> preprocessed data -> data mining -> post-processing -> knowledge
One definition is
  - A non-trivial process of identifying valid, novel, useful, and ultimately understandable patterns in data
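The chain from raw data to knowledge can be sketched as a tiny pipeline. This is only an illustration under assumed toy data: the basket records, the function names (`transform`, `preprocess`, `mine`, `postprocess`, `kdd`), and the use of item-pair counting as the mining step are all hypothetical and not taken from the course material.

```python
from collections import Counter

# Hypothetical raw data: one comma-separated shopping basket per string.
RAW = ["milk, bread", "Bread,butter", "", "milk,bread,butter", "MILK , bread"]

def transform(raw):
    """raw data -> transformed data: split each line into a list of fields."""
    return [line.split(",") for line in raw]

def preprocess(rows):
    """transformed -> preprocessed: normalize case and whitespace, drop empties."""
    cleaned = [{item.strip().lower() for item in row if item.strip()} for row in rows]
    return [basket for basket in cleaned if basket]

def mine(baskets):
    """data mining (toy version): count the baskets containing each item pair."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                counts[(a, b)] += 1
    return counts

def postprocess(counts, n_baskets, min_support):
    """post-processing: keep only pairs whose support reaches the threshold."""
    return {pair: c / n_baskets
            for pair, c in counts.items()
            if c / n_baskets >= min_support}

def kdd(raw, min_support=0.5):
    """Run the whole chain once; returns {item pair: support}."""
    baskets = preprocess(transform(raw))
    return postprocess(mine(baskets), len(baskets), min_support)

knowledge = kdd(RAW)
print(knowledge)
```

The process is iterative in practice: if the result is empty or overwhelming, one goes back and changes an earlier step, for example by calling `kdd(RAW, min_support=0.7)` or by cleaning the data differently.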
Need for Data Mining





Data accumulate and double every 9 months
There is a big gap between stored data and knowledge, and the transition won't occur automatically.
Manual data analysis is not new, but it is a bottleneck
Fast-developing Computer Science and Engineering generates new demands
Seeking knowledge from massive data
  - Any personal experience?
When is DM useful


Data-rich world
Large data (dimensionality and size)
  - Image data (size)
  - Gene chip data (dimensionality)
Little knowledge about data (exploratory data analysis)
  - What if we have some knowledge?
DM perspectives




KDD “goals”: prediction, description, explanation, optimization, and exploration
Knowledge forms: patterns vs. models
Understandability and representation of knowledge
Some applications
  - Business intelligence (CRM)
  - Security (Info, Comp Systems, Networks, Data, Privacy)
  - Scientific discovery (bioinformatics, medicine)
Challenges



Increasing data dimensionality and data size
Various data forms
New data types
  - Streaming data, multimedia data
Efficient search and access to data/knowledge
Intelligent update and integration
Interdisciplinary Links of DM






Statistics
Databases
AI
Machine Learning
Visualization
High Performance Computing
  - supercomputers, distributed/parallel/cluster computing
Statistics
- Discovery of structures or patterns in data sets
- hypothesis testing, parameter estimation
- Optimal strategies for collecting data
- efficient search of large databases
- Static data
- constantly evolving data
- Models play a central role
- algorithms are a major concern
- patterns are sought
Relational Databases

A relational database can contain several tables
  - Tables and schemas
The goal in data organization is to maintain data and quickly locate the requested data
  - Queries and index structures
Query execution and optimization
  - Query optimization is to find the “best” possible evaluation method for a given query
Providing fast, reliable access to data for data mining
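How an index changes the evaluation method chosen for a query can be seen with a small sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, the index name `idx_sales_customer`, and the helper `plan` are hypothetical names chosen for illustration. The exact wording of the plan text differs between SQLite versions, so no output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (customer, amount) VALUES (?, ?)",
    [(f"c{i % 100}", float(i)) for i in range(1000)],
)

QUERY = "SELECT SUM(amount) FROM sales WHERE customer = 'c7'"

def plan(sql):
    """Return SQLite's textual description of how it would evaluate sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

plan_before = plan(QUERY)  # no usable index yet: the table is scanned
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer)")
plan_after = plan(QUERY)   # the optimizer can now search via the index

total = conn.execute(QUERY).fetchone()[0]
print(plan_before)
print(plan_after)
print(total)
```

The query and its answer are the same before and after the index is created; only the evaluation method changes, which is the point of query optimization.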
AI

Intelligent agents
  - Perception-Action-Goal-Environment
Search
  - Uniform cost and informed search algorithms
Knowledge representation
  - FOL, production rules, frames with semantic networks
Knowledge acquisition
Knowledge maintenance and application
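As a reminder of what uniform cost search does, here is a minimal sketch. The graph `GRAPH`, its edge costs, and the function name `uniform_cost_search` are made up for illustration. The search always expands the frontier node with the lowest path cost so far.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) for a cheapest path from start to goal, or None."""
    frontier = [(0, start, [start])]  # priority queue ordered by path cost
    expanded = set()                  # nodes whose cheapest cost is settled
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in expanded:
            continue                  # a cheaper route to node was already expanded
        expanded.add(node)
        for neighbor, step_cost in graph.get(node, {}).items():
            if neighbor not in expanded:
                heapq.heappush(frontier, (cost + step_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical directed graph: node -> {neighbor: edge cost}
GRAPH = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 6},
    "C": {"D": 2},
    "D": {},
}

result = uniform_cost_search(GRAPH, "A", "D")
print(result)
```

An informed search such as A* uses the same loop but orders the frontier by path cost plus a heuristic estimate of the remaining cost to the goal.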
Machine Learning




Focusing on complex representations, data-intensive problems, and search-based methods
Flexibility with prior knowledge and collected data
Generalization from data and empirical validation
  - statistical soundness and computational efficiency
  - constrained by finite computing & data resources
Challenges from KDD
  - scaling up, cost info, auto data preprocessing, more knowledge types
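"Generalization from data and empirical validation" can be illustrated with a holdout test: learn from one part of the data, then measure accuracy on a part the learner never saw. The sketch below is a toy under stated assumptions. The data set is hypothetical and noise-free, the learner is a one-parameter threshold rule, and the names `fit_threshold`, `predict`, and `accuracy` are invented here.

```python
def fit_threshold(train):
    """Learn one parameter: the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    """Predict class 1 when x lies above the learned threshold."""
    return int(x > threshold)

def accuracy(threshold, data):
    """Fraction of (x, label) pairs that the threshold rule classifies correctly."""
    return sum(predict(threshold, x) == y for x, y in data) / len(data)

# Hypothetical noise-free data: feature x in 0..99, label 1 when x >= 50.
DATA = [(x, int(x >= 50)) for x in range(100)]
TRAIN = DATA[0::2]  # even x values are used for learning
TEST = DATA[1::2]   # odd x values are held out for validation

threshold = fit_threshold(TRAIN)
test_accuracy = accuracy(threshold, TEST)
print(threshold, test_accuracy)
```

The even/odd split keeps the example deterministic; in practice the held-out set is usually chosen at random, and real data are noisy, so held-out accuracy is typically below the training accuracy.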
Visualization


Producing a visual display with insights into the structure of the data with interactive means
  - zoom in/out, rotating, displaying detailed info
Various types of visualization methods
  - show summary properties and explore relationships between variables
  - investigate large DBs and convey lots of information
  - analyze data with geographic/spatial location
A pre- and post-processing tool for KDD
Bibliography



J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
W. Klösgen and J. M. Zytkow, editors. Handbook of Data Mining and Knowledge Discovery. 2001.