CSE591 Data Mining
CSE591 (575) Data Mining
1/21/2003 - 5/6/2003
Computer Science & Engineering
ASU
1
Introduction
Introduction to this Course
Introduction to Data Mining
2
Introduction to the Course
First, about you - why take this course?
Your background and strength
AI, DBMS, Statistics, Biology, …
Your interests and requests
What is this course about?
Problem solving
Handling data
transform data into workable data
Mining data
turn data into knowledge
validation and presentation of knowledge
3
This course
What can you expect from this course?
How is this course conducted?
Knowledge and experience about DM
Problem solving and solution presentation
Presentations
Individual projects
Course Format
Individual Projects 40%
Exams and/or quizzes 40%
Class participation 20%
off-campus students?
4
Projects - Start NOW!
How to start?
Projects should be sufficiently challenging but
reasonable, suitable for one semester
How to choose your individual project
Real-world problems
Problems that might make differences
Two types of projects
Available projects
Self-proposed projects (approval needed)
5
Some project ideas
Dealing with high dimensional data
Image mining
Feature extraction, clustering of images
Active sampling
Data of supervised, unsupervised learning
Various data structures (kd-trees, R-trees, multidimensional scaling)
Meta data (RDF, namespace) for mining
Ensemble learning
Sequence mining (HMM learning)
Bioinformatics and applications (feature selection)
Intelligent driving data analysis
Data integration, data reduction (random projection)
6
How is a project evaluated?
It depends on
What you want to achieve
Its impact
Your effort
The sooner you start, the better
The beginning is not easy
7
Course Web Site
http://www.public.asu.edu/~huanliu/cse591.html
My office and office hours
GWC 342
T 10:30 - 11:30am and Th 4:00-5:00pm
My email: [email protected]
Slides and relevant information will be made
available at the course web site
8
Any questions and suggestions?
Your feedback is most welcome!
I need it to adapt the course to your
needs.
Please feel free to provide yours anytime.
Share your questions and concerns with the
class – very likely others may have the same.
No pain no gain – no magic for data mining.
The more you put in, the more you get
Your grades are proportional to your efforts.
9
Introduction to Data Mining
Definitions
Motivations of DM
Interdisciplinary Links of DM
10
What is DM?
Or more precisely KDD (knowledge discovery
from databases)?
Many definitions
A process, not plug-and-play
raw data → transformed data → preprocessed data →
data mining → post-processing → knowledge
One definition is
A non-trivial process of identifying valid, novel,
useful and ultimately understandable patterns in
data
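The staged process above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the function names, the toy shopping baskets, and the use of frequent-pair counting as the mining step are all made up for the example and are not taken from the course material.

```python
from collections import Counter
from itertools import combinations

# Toy raw data (hypothetical): one comma-separated shopping basket per string.
RAW = [
    "Bread, Milk",
    "bread, diapers, beer, eggs",
    "milk, diapers, beer, cola",
    "bread, milk, diapers, beer",
    "bread, milk, diapers, cola",
    "",
]

def transform(raw):
    """Raw strings -> lists of item tokens."""
    return [line.split(",") for line in raw]

def preprocess(baskets):
    """Normalize tokens, drop empty tokens and empty baskets."""
    cleaned = []
    for basket in baskets:
        items = sorted({tok.strip().lower() for tok in basket if tok.strip()})
        if items:
            cleaned.append(items)
    return cleaned

def mine(baskets):
    """Data mining step: count how many baskets contain each pair of items."""
    counts = Counter()
    for items in baskets:
        counts.update(combinations(items, 2))
    return counts

def postprocess(counts, n_baskets, min_support=0.6):
    """Keep only the pairs whose support reaches the (assumed) threshold."""
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }

baskets = preprocess(transform(RAW))
knowledge = postprocess(mine(baskets), len(baskets))
print(knowledge)
```

Each stage can be inspected or replaced on its own, which is the sense in which the process is not plug-and-play: the patterns that survive depend on choices made at every step, such as the support threshold.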
11
Need for Data Mining
Data accumulate and double every 9 months
There is a big gap from stored data to
knowledge; and the transition won’t occur
automatically.
Manual data analysis is not new but a
bottleneck
Fast developing Computer Science and
Engineering generates new demands
Seeking knowledge from massive data
Any personal experience?
12
When is DM useful?
Data rich
Large data (dimensionality and size)
Two invited talks so far have convincingly
demonstrated it
Image data (size)
Gene data (dimensionality)
Little knowledge about data (exploratory data
analysis)
What if we have some knowledge?
13
DM perspectives
Prediction, description, explanation,
optimization, and exploration
Completion of knowledge (patterns vs. models)
Understandability and representation of
knowledge
Some applications
Business intelligence (CRM)
Security (Info, Comp Systems, Networks, Data,
Privacy)
Scientific discovery (bioinformatics)
14
Challenges
Increasing data dimensionality and data size
Various data forms
New data types
Streaming data, multimedia data
Efficient search and data access
Intelligent update and integration
15
Interdisciplinary Links of DM
Statistics
Databases
AI
Machine Learning
Visualization
High Performance Computing
supercomputers, distributed/parallel/cluster
computing
16
Statistics
Discovery of structures or patterns in data
sets
Optimal strategies for collecting data
efficient search of large databases
Static data
hypothesis testing, parameter estimation
constantly evolving data
Models play a central role
algorithms are of a major concern
patterns are sought
17
Relational Databases
A relational database can contain several tables
The goal in data organization is to maintain data
and quickly locate the requested data
Queries and index structures
Query execution and optimization
Tables and schemas
Query optimization is to find the best possible
evaluation method for a given query
Providing fast, reliable access to data for data
mining
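How an index changes the evaluation method chosen for a query can be seen with a small sketch using Python's built-in sqlite3 module. The table, column, and index names below are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"cust{i % 100}", float(i)) for i in range(1000)],
)

QUERY = "SELECT amount FROM orders WHERE customer = ?"

def plan(connection, sql, params):
    """Return the optimizer's plan as one string (last column of each row is the description)."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

# Without an index the only available method is a full table scan.
plan_before = plan(conn, QUERY, ("cust7",))

# With an index on the filtered column the optimizer can search instead of scan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = plan(conn, QUERY, ("cust7",))

total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("cust7",)
).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("total: ", total)
```

The query text and its result are the same in both cases; only the access path differs. That separation is what lets a mining task rely on the database for fast access without hand-coding the search.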
18
AI
Intelligent agents
Search
uniform cost and informed search algorithms
Knowledge representation
Perception-Action-Goal-Environment
FOL, production rules, frames with semantic
networks
Knowledge acquisition
Knowledge maintenance and application
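As an illustration of the search item above, here is a minimal sketch of uniform-cost search over a small weighted graph. The graph, node names, and function name are assumptions made for the example.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) for a cheapest path, or None if the goal is unreachable.

    graph maps each node to a list of (neighbor, non-negative step cost) pairs.
    """
    frontier = [(0, start, [start])]  # priority queue ordered by path cost so far
    best = {start: 0}                 # cheapest known cost to reach each node
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was found later
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical graph: the direct edges A->C and B->D are more expensive than the detour.
GRAPH = {
    "A": [("B", 1), ("C", 5)],
    "B": [("C", 1), ("D", 7)],
    "C": [("D", 2)],
    "D": [],
}

result = uniform_cost_search(GRAPH, "A", "D")
print(result)
```

Uniform-cost search expands nodes in order of accumulated path cost only. An informed search such as A* would add a heuristic estimate of the remaining cost to the queue priority.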
19
Machine Learning
Focusing on complex representations, data-intensive
problems, and search-based methods
Flexibility with prior knowledge and collected data
Generalization from data and empirical validation
statistical soundness and computational efficiency
constrained by finite computing & data resources
Challenges from KDD
scaling up, cost info, auto data preprocessing
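Generalization from data and empirical validation can be illustrated with a hold-out test: fit on one part of the data and measure accuracy on the part that was set aside. The sketch below uses a 1-nearest-neighbour classifier purely as a stand-in learner; the data points, the split rule, and all names are invented for the example.

```python
import math

# Toy labelled points (hypothetical): two well-separated groups.
DATA = [
    ((0.0, 0.2), "a"), ((0.3, 0.1), "a"), ((0.1, 0.4), "a"),
    ((0.5, 0.5), "a"), ((0.2, 0.0), "a"), ((0.4, 0.3), "a"),
    ((5.0, 5.1), "b"), ((5.3, 4.8), "b"), ((4.9, 5.2), "b"),
    ((5.1, 4.9), "b"), ((4.7, 5.0), "b"), ((5.2, 5.3), "b"),
]

def holdout_split(data, every=3):
    """Deterministic split: every `every`-th example is held out for testing."""
    train = [d for i, d in enumerate(data) if i % every != 0]
    test = [d for i, d in enumerate(data) if i % every == 0]
    return train, test

def predict_1nn(train, point):
    """Predict the label of the closest training point (Euclidean distance)."""
    nearest = min(train, key=lambda item: math.dist(item[0], point))
    return nearest[1]

def accuracy(train, test):
    """Fraction of held-out examples whose label is predicted correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

train, test = holdout_split(DATA)
held_out_accuracy = accuracy(train, test)
print(len(train), len(test), held_out_accuracy)
```

The held-out examples play no part in fitting, so the measured accuracy is an empirical estimate of how well the learner generalizes. With data this small and this cleanly separated, the estimate says little beyond showing the procedure.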
20
Visualization
Producing a visual display with insights into the
structure of the data with interactive means
zoom in/out, rotating, displaying detailed info
Various branches of visualization methods
show summary properties and explore relationships
between variables
investigate large databases and convey lots of
information
analyze data with geographic/spatial location
A pre- and post-processing tool for KDD
21
Bibliography
W. Klösgen and J. M. Zytkow (eds.), 2001. Handbook of
Data Mining and Knowledge Discovery.
22