Data Mining
Lecture 1
TIES445 Data mining
Nov-Dec 2007
Sami Äyrämö
These slides are additional material for TIES445
Data Mining
12-14 lectures (on weeks 44-50)
Mondays 12:15-14:00
Tuesdays 10:15-12:00
NOTE: No lectures on week 47
3 x 2h demonstrations (on weeks 48-50 in a computer classroom)
Final exam in January 2008
3 cr without seminar work
5 cr with seminar work (the seminar will be held in January 2008)
About lectures
The lectures are based on:
Han and Kamber (based on Data Mining: Concepts and Techniques)
Tan, Steinbach and Kumar (based on Introduction to Data Mining)
http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html
http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4
Some slides by the lecturer
Literature
P-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2005.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.
D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001.
D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
M. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Wiley, 2004.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
M.H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
J.P. Bigus, Data Mining with Neural Networks, McGraw-Hill, 1996.
J-M. Adamo, Data Mining for Association Rules and Sequential Patterns: Sequential and Parallel Algorithms, Springer-Verlag, 2001.
H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer, 1998.
Theses, publications etc.
M. Pechenizkiy, Feature Extraction for Supervised Learning in Knowledge Discovery Systems, PhD thesis, University of Jyväskylä, 2005.
S. Äyrämö, Knowledge Mining using Robust Clustering, PhD thesis, University of Jyväskylä, 2006.
J. Mäkinen, Roskapostin älykäs suodattaminen [Intelligent filtering of spam], Master's thesis, University of Jyväskylä, 2003.
M. Nurminen, Tiedonlouhinta rakenteisista dokumenteista [Data mining from structured documents], Master's thesis, University of Jyväskylä, 2005.
K. Arkko, Assosiaatioiden ja sekvenssien louhinta suurista tietomassoista [Mining associations and sequences from large data masses], Master's thesis, University of Jyväskylä, 2006.
J. Hänninen, Batch- ja online-hermoverkko-opetusalgoritmien ominaisuudet ja eroavaisuudet [Properties and differences of batch and online neural network training algorithms], Master's thesis, University of Jyväskylä, 2006.
T. Kärkkäinen, MLP-network in a layer-wise form with applications to weight decay, Neural Computation, 14 (6), 1451-1480, 2002.
T. Kärkkäinen and E. Heikkola, Robust Formulations for Training Multilayer Perceptrons, Neural Computation, 16 (4), 837-862, 2004.
T. Kärkkäinen and S. Äyrämö, Robust Clustering Methods for Incomplete and Erroneous Data, in Data Mining V: Data Mining, Text Mining and their Business Applications, 2004.
S. Äyrämö, T. Kärkkäinen, and K. Majava, Robust refinement of initial prototypes for partitioning-based clustering algorithms, in C. Skiadas (Ed.), Recent Advances in Stochastic Modeling and Data Analysis, pp. 473-482, World Scientific, 2007.
...many more!
Journals, conferences,…
Journals
– Data Mining and Knowledge Discovery, Springer
– ACM Transactions on Knowledge Discovery from Data (TKDD), ACM
– IEEE Transactions on Knowledge and Data Engineering, IEEE
– SIGKDD Explorations
– Statistical Analysis and Data Mining, Wiley
– Data & Knowledge Engineering, Elsevier
– Computational Statistics & Data Analysis, Elsevier
Conferences, seminars, workshops
– ACM SIGKDD, PKDD, PAKDD, (IEEE) ICDM, SIAM Data Mining (SDM), DMIN,...
– ICTAI, IJCAI, VLDB, ICDE, ICML, CVPR, MSR,...
Sample application
[Figure: sample application. Diagram labels: Operator, Laboratory technician, Process data, Quality, Customer, Control data, Manager, Feedback. The data are arranged as an n × m matrix D = (d_tj), with rows t = 1, ..., n and columns j = 1, ..., m.]
Real-world data set
Mining Large Data Sets - Motivation
R. Grossman (2001): “During the next decade, the amount of data will continue to explode, while the number of scientists and engineers available to analyze it will remain essentially constant.”
P.S. Bradley (2003): “The ability of organizations to effectively utilize this information for decision support typically lags behind their ability to collect and store it. But, organizations that can leverage their data for decision support are more likely to have a competitive edge in their sector of the market.”
Knowledge Mining (KM) process
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
[Figure: Data Mining shown at the intersection of Statistics/Numerical optimization, Machine Learning/Pattern Recognition/Artificial Intelligence, Visualization, and Database systems.]
Major Issues and Challenges in DM/KDD
Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Algorithmic requirements and performance: efficiency, scalability, robustness, reliability
– High dimensionality, complex and heterogeneous data
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Data quality: handling noise and incomplete data (robustness, reliability)
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
– Data ownership and distribution
User interaction
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
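The data-quality item in the list above (handling noise and incomplete data) can be illustrated with a minimal sketch. It is not taken from the lecture material: the readings are hypothetical, and both the choice of the median as a robust estimate and the strategy of simply skipping missing values are assumptions made for illustration.

```python
import math
from statistics import mean, median

# Hypothetical measurements (not from the slides): one gross outlier (250.0)
# and one missing value, encoded here as NaN.
readings = [10.1, 9.8, 10.3, float("nan"), 10.0, 250.0, 9.9]


def available(values):
    """Return only the observed (non-NaN) values.

    Assumption: missing values are skipped rather than imputed.
    """
    return [v for v in values if not math.isnan(v)]


observed = available(readings)

# The mean is pulled strongly towards the single outlier ...
location_mean = mean(observed)
# ... whereas the median stays close to the bulk of the data.
location_median = median(observed)

print(f"observed values: {len(observed)} of {len(readings)}")
print(f"mean   = {location_mean:.2f}")
print(f"median = {location_median:.2f}")
```

With a single erroneous reading the mean moves far away from the typical values while the median barely changes, which is one simple sense in which an estimate can be called robust.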