CSE591 Data Mining

CSE591 (575) Data Mining
1/21/2003 - 5/6/2003
Computer Science & Engineering
ASU
1
Introduction
Introduction to this Course
Introduction to Data Mining
2
Introduction to the Course

First, about you - why take this course?

Your background and strength



AI, DBMS, Statistics, Biology, …
Your interests and requests
What is this course about?


Problem solving
Handling data


transform data to workable data
Mining data


turn data to knowledge
validation and presentation of knowledge
3
This course

What can you expect from this course?



How is this course conducted?



Knowledge and experience about DM
Problem solving and solution presentation
Presentations
Individual projects
Course Format



Individual Projects 40%
Exams and/or quizzes 40%
Class participation 20%

off-campus students?
4
Projects - Start NOW!



How to start?
Projects should be sufficiently challenging but
reasonable, suitable for one semester
How to choose your individual project



Real-world problems
Problems that might make a difference
Two types of projects


Available projects
Self-proposed projects (approval needed)
5
Some project ideas

Dealing with high dimensional data


Image mining






Feature extraction, clustering of images
Active sampling


Data of supervised, unsupervised learning
Various data structures (kd-trees, R-trees, multidimensional scaling)
Meta data (RDF, namespace) for mining
Ensemble learning
Sequence mining (HMM learning)
Bioinformatics and applications (feature selection)
Intelligent driving data analysis

Data integration, data reduction (random projection)
6
How is a project evaluated?

It depends on




What you want to achieve
Its impact
Your effort
The sooner you start, the better

The beginning is not easy
7
Course Web Site




http://www.public.asu.edu/~huanliu/cse591.html
My office and office hours
  GWC 342
  T 10:30-11:30am and Th 4:00-5:00pm
My email: [email protected]
Slides and relevant information will be made
available at the course web site
8
Any questions and suggestions?

Your feedback is most welcome!
I need it to adapt the course to your
needs.
Please feel free to provide yours anytime.
Share your questions and concerns with the
class – very likely others may have the same.
No pain no gain – no magic for data mining.






The more you put in, the more you get
Your grades are proportional to your efforts.
9
Introduction to Data Mining
Definitions
Motivations of DM
Interdisciplinary Links of DM
10
What is DM?

Or more precisely KDD (knowledge discovery
from databases)?


Many definitions
A process, not plug-and-play
raw data → transformed data → preprocessed data →
data mining → post-processing → knowledge

One definition is

A non-trivial process of identifying valid, novel,
useful and ultimately understandable patterns in
data
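
The process chain on this slide (raw data, transformed data, preprocessed data, data mining, post-processing, knowledge) can be sketched in a few lines of Python. This is only an illustrative toy, not part of the course material: the function names, the sample transactions, and the 0.5 support threshold are all assumptions made up for the example, and counting co-occurring item pairs merely stands in for a real mining algorithm.

```python
from collections import Counter
from itertools import combinations


def transform(raw_records):
    """Raw text lines -> lists of lower-cased items (transformed data)."""
    return [line.strip().lower().split(",") for line in raw_records]


def preprocess(transactions):
    """Trim items, drop blanks and duplicates, discard empty transactions."""
    cleaned = [sorted({item.strip() for item in t if item.strip()})
               for t in transactions]
    return [t for t in cleaned if t]


def mine(transactions):
    """Count how often each pair of items occurs in the same transaction."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(t, 2))
    return counts


def postprocess(pair_counts, n_transactions, min_support=0.5):
    """Keep only pairs whose support (fraction of transactions) is high enough."""
    return {pair: count / n_transactions
            for pair, count in pair_counts.items()
            if count / n_transactions >= min_support}


# Hypothetical raw data: messy comma-separated shopping baskets.
raw = ["Bread,Milk", "bread, milk ,eggs", "milk,eggs", "", "Bread,Milk,,"]

data = preprocess(transform(raw))
knowledge = postprocess(mine(data), len(data))
print(knowledge)
```

Each stage feeds the next, which is the sense in which KDD is "a process, not plug-and-play": the patterns that survive post-processing depend on the choices made in every earlier step.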
11
Need for Data Mining





Data accumulate and double every 9 months
There is a big gap from stored data to
knowledge; and the transition won’t occur
automatically.
Manual data analysis is not new but a
bottleneck
Fast developing Computer Science and
Engineering generates new demands
Seeking knowledge from massive data

Any personal experience?
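
To put the nine-month doubling figure from this slide in perspective, a short calculation helps. The sketch below simply assumes steady exponential growth with the slide's doubling period; the function name `growth_factor` is hypothetical and chosen for this example.

```python
def growth_factor(months, doubling_months=9.0):
    """Factor by which data volume grows after `months`,
    assuming it doubles every `doubling_months` months."""
    return 2.0 ** (months / doubling_months)


one_year = growth_factor(12)      # roughly 2.5 times as much data
three_years = growth_factor(36)   # four doublings

print(round(one_year, 2), three_years)
```

Under this assumption the stored volume grows about 2.5-fold in a year and 16-fold in three years, which illustrates why manual analysis becomes a bottleneck.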
12
When is DM useful

Data rich


Large data (dimensionality and size)



Two invited talks so far have convincingly
demonstrated it
Image data (size)
Gene data (dimensionality)
Little knowledge about data (exploratory data
analysis)

What if we have some knowledge?
13
DM perspectives




Prediction, description, explanation,
optimization, and exploration
Completion of knowledge (patterns vs. models)
Understandability and representation of
knowledge
Some applications



Business intelligence (CRM)
Security (Info, Comp Systems, Networks, Data,
Privacy)
Scientific discovery (bioinformatics)
14
Challenges



Increasing data dimensionality and data size
Various data forms
New data types



Streaming data, multimedia data
Efficient search and data access
Intelligent update and integration
15
Interdisciplinary Links of DM






Statistics
Databases
AI
Machine Learning
Visualization
High Performance Computing

supercomputers, distributed/parallel/cluster
computing
16
Statistics

Discovery of structures or patterns in data
sets


Optimal strategies for collecting data


efficient search of large databases
Static data


hypothesis testing, parameter estimation
constantly evolving data
Models play a central role


algorithms are a major concern
patterns are sought
17
Relational Databases

A relational database can contain several tables


The goal in data organization is to maintain data
and quickly locate the requested data


Queries and index structures
Query execution and optimization


Tables and schemas
Query optimization is to find the best possible
evaluation method for a given query
Providing fast, reliable access to data for data
mining
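
The role of index structures and query optimization mentioned on this slide can be observed with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the `customers` table, its columns, and the index name are invented for illustration, and SQLite is used only because it ships with Python, not because the course prescribes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, "city%d" % (i % 100), float(i)) for i in range(10000)],
)


def plan(sql):
    """Return the optimizer's chosen evaluation plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the last column of each row.
    return " | ".join(row[-1] for row in rows)


query = "SELECT COUNT(*) FROM customers WHERE city = 'city42'"

plan_before = plan(query)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
plan_after = plan(query)    # the optimizer can now search the index

count = conn.execute(query).fetchone()[0]
print(plan_before)
print(plan_after)
print(count)
```

The query text is identical before and after; only the evaluation method chosen by the optimizer changes. The exact wording of the plan varies between SQLite versions.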
18
AI

Intelligent agents


Search



uniform cost and informed search algorithms
Knowledge representation


Perception-Action-Goal-Environment
FOL, production rules, frames with semantic
networks
Knowledge acquisition
Knowledge maintenance and application
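
As a concrete instance of the uniform cost search named on this slide, here is a minimal sketch using a priority queue. The example graph, its edge costs, and the function name are assumptions made for illustration only.

```python
import heapq


def uniform_cost_search(graph, start, goal):
    """Expand the cheapest frontier node first.
    Returns (total_cost, path), or None if the goal is unreachable."""
    frontier = [(0, start, [start])]   # (cost so far, node, path taken)
    best = {start: 0}                  # cheapest known cost to each node
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue                   # stale entry: a cheaper route was found
        for neighbor, step in graph.get(node, {}).items():
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None


# Hypothetical weighted graph: node -> {neighbor: edge cost}
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}

result = uniform_cost_search(graph, "A", "D")
print(result)  # (4, ['A', 'B', 'C', 'D'])
```

An informed search such as A* follows the same pattern but orders the frontier by cost so far plus a heuristic estimate of the remaining cost.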
19
Machine Learning




Focusing on complex representations, data-intensive
problems, and search-based methods
Flexibility with prior knowledge and collected data
Generalization from data and empirical validation
  statistical soundness and computational efficiency
  constrained by finite computing & data resources
Challenges from KDD
  scaling up, cost info, auto data preprocessing
20
Visualization


Producing a visual display with insights into the
structure of the data with interactive means
  zoom in/out, rotating, displaying detailed info
Various branches of visualization methods




show summary properties and explore relationships
between variables
investigate large databases and convey lots of
information
analyze data with geographic/spatial location
A pre- and post-processing tool for KDD
21
Bibliography

W. Klosgen & J.M. Zytkow (eds.), 2001, Handbook of
Data Mining and Knowledge Discovery.
22
