G54DMT – Data Mining Techniques and
Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
[email protected]
Lecture 0: Introduction
Outline of the lecture
• What is Data Mining?
• Administrative bits
• Module structure
• Resources
We are buried in data…
And in business as well…
• Generating better movie
recommendation methods
from customer ratings
• Training set of 100M ratings
from over 480K customers
on 18K movies
• Data collected between October
1998 and December 2005
• $1M prize for a recommender
system 10% better than the
Netflix proprietary method
• It took 3 years to solve the challenge
What is Data Mining?
• “The extraction of knowledge from large
amounts of data” (Han and Kamber, 2006)
• “Data mining is defined as the process of
discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful in that
they lead to some advantage, usually an
economic advantage. The data is invariably
present in substantial quantities” (Witten and
Frank, 2005)
So what is the data?
• At its origin, data can be heterogeneous: it can
come from multiple sources and contain uncertainty
(e.g. distorted or missing entries)
• In most cases we will assume that data is
structured as a table where the rows are
instances and the columns are attributes
• In certain cases the records will also have one
or more labels associated with them (a class)
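The tabular view described above can be sketched in a few lines of Python. This is only an illustration: the attribute names, values and class labels below are invented for the example and do not come from any dataset used in the module.

```python
# Hypothetical toy dataset: each row is an instance, each column an attribute.
attributes = ["age", "income", "owns_car"]
instances = [
    [25, 30000, "no"],
    [47, 52000, "yes"],
    [33, None, "yes"],  # None marks a missing entry
]
# One class label per instance (only present in labelled datasets).
labels = ["low_risk", "high_risk", "low_risk"]

n_instances = len(instances)
n_attributes = len(attributes)

# Column view: the values of one attribute across all instances.
income_index = attributes.index("income")
income_column = [row[income_index] for row in instances]
missing_income = sum(1 for value in income_column if value is None)

print(n_instances, "instances x", n_attributes, "attributes")
print("missing income values:", missing_income)
```

Reading the table by rows gives the instances; reading it by columns gives the attributes, which is where issues such as missing entries are usually counted.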
Data can be… Piles of Records
• Datasets with a high number of records
– This is probably the most visible dimension of large scale
data mining
– GenBank (the genetic sequence database from the NIH)
contained, as of February 2008, more than 82 million
gene sequences and more than 85 billion nucleotides
Data can be… High Dimensionality
• High dimensionality domains
– Sometimes each record is characterized by hundreds, thousands (or
even more) of features
– Microarray technology (like many other
post-genomic data generation
techniques) can routinely generate
records with tens of thousands of
variables
– Creating each record is usually very
costly, so datasets tend to have a very
small number of records. This imbalance
between the number of records and the
number of variables is yet another challenge
(Reinke, 2006, Image licensed under Creative Commons)
Data can be… Rare
• Class imbalance
– It is a challenge to generate accurate classification
models when not all classes are equally represented
– Contact Map prediction
datasets (briefly explained
later in the tutorial) routinely
contain millions of instances,
of which fewer than 2% are
positive examples
– Tissue type identification is
highly unbalanced (see figure)
(Llora, Priya, Bhargava, 2009)
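A small worked example shows why class imbalance makes accuracy misleading. The numbers below are hypothetical: 1000 instances with exactly 2% positives, chosen only to match the order of magnitude quoted above. The "classifier" is a trivial one that always predicts the majority class.

```python
# Hypothetical labels: 1000 instances, 2% positive (assumed for illustration).
n_total = 1000
n_positive = 20
y_true = [1] * n_positive + [0] * (n_total - n_positive)

# Trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * n_total

# Accuracy: fraction of instances predicted correctly.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / n_total

# Recall on the positive class: fraction of positives that were found.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / n_positive

print(f"accuracy = {accuracy:.2f}")
print(f"recall on positive class = {recall:.2f}")
```

The trivial classifier is right on 980 of 1000 instances yet never finds a single positive example, which is why evaluation on unbalanced data usually needs per-class measures rather than plain accuracy.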
Data can be… Lots of Classes
• Yet another dimension of difficulty
• Reuters-21578 dataset is a text categorization task
with 672 categories
• Closely related to the class imbalance problem
• Machine learning methods need to make an
extra effort to make sure that
underrepresented data is taken into account
properly
And what do we do with the data?
• The whole process of integrating, cleaning, selecting,
mining and visualising the data is generally known as
Knowledge Discovery in Databases (KDD)
(Han and Kamber, 2006)
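The KDD stages named above can be read as a chain of functions, each consuming the output of the previous one. The sketch below is a toy illustration only: the two data sources, the record layout and the choice of frequency counting as the "mining" step are all assumptions made for the example, not part of the KDD definition.

```python
from collections import Counter

# Two hypothetical sources with the same record layout: (customer, product).
source_a = [("ann", "milk"), ("bob", "bread"), ("ann", None)]
source_b = [("cat", "milk"), ("bob", "milk"), ("dan", "eggs")]


def integrate(*sources):
    """Merge the records of several sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def clean(records):
    """Drop records that contain a missing entry."""
    return [record for record in records if None not in record]


def select(records, field):
    """Keep only the column that is relevant to the task."""
    return [record[field] for record in records]


def mine(values):
    """Stand-in for pattern extraction: count how often each value occurs."""
    return Counter(values)


def visualise(counts):
    """Render the counts as a text bar chart, most frequent value first."""
    return [f"{value:<6}{'#' * n}" for value, n in counts.most_common()]


counts = mine(select(clean(integrate(source_a, source_b)), field=1))
for line in visualise(counts):
    print(line)
```

In a real project each stage is far more involved, but the structure is the same: the mining step only sees data that has already been integrated, cleaned and selected.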
Fields related to Data Mining
• Machine Learning
– “How to construct programs that learn from experience”
(Mitchell, 1997)
– ML generally concentrates on the central part of the KDD
process, the pattern extraction.
– Also, ML is generally seen to focus on the algorithms, while DM
focuses on the process
• Pattern recognition
– Mathematical view of the pattern extraction process, as
opposed to the computational view of ML
• Text mining
– Focused on analyzing human texts. A very specialised form of
DM
Educational aims
• To provide the students with a strong
knowledge of data mining, and its application
to real-world scenarios
• To understand the need for data mining to
analyse large-scale real-world data
• To provide the students with a sneak peek of
the challenges and opportunities of data
mining
• The objective of this module is to study the methods and
application of data mining techniques.
• The focus of the module will be on the technology, but by
illustrating its usage with challenging problems we aim to
provide a clear understanding of how these methods can be
applied in the real world
• The successful completion of the module will endow a
student with:
– Strong understanding of core data mining problems (e.g. classification,
regression, clustering, feature and prototype selection, dimensionality
reduction) and the state-of-the-art methods for solving these
– Strong understanding of the application of data mining to important
real-world problems
– Familiarity with the operation and principles behind publicly available
data mining packages (e.g. Weka)
Lectures and labs
• Lectures: Thursdays, 15:00 – 17:00, JC-AMENB11+
• Labs: Mondays, 11:00 - 13:00, JC-COMPSCIB52 (labs start on the 11/2)
– The laboratory sessions will be used to develop
the coursework. I will be present to answer
questions
– Sometimes there will be directed sessions, but
these will be few, and advertised in advance
Coursework
• Coursework 1 (50% of the mark)
– Detailed study of one aspect of data preprocessing
– How to follow a proper ML evaluation protocol
– Deadline: 8/3/2013
• Project 1 (50% of the mark)
– I will give you a challenging large-scale dataset and you are
free to mine it using a combination of any of the techniques
described in the module
– Deadline: 10/5/2013
How to contact me?
• At lectures and lab sessions
• My office is B81 in the Computer Science
building. However, if you just drop by
unannounced, chances are that I will not be
able to see you
• Thus, the preferred contact method is email:
[email protected]
Module structure
• Four topics (described in the next slides)
• Some topics will take several lectures to cover
• All lectures will be posted at
http://www.cs.nott.ac.uk/~jqb/G54DMT
• Take notes
– Not everything is in the slides
– I will use the whiteboard often
• After each lecture I will provide a list of resources
to complement the material
• Also, whenever necessary, I will introduce
background material
• If you feel that you are missing some background
material, tell me straight away!
Module structure
• Topic 1: Preliminaries
– This topic deals with several concepts that will be
used across the module
• Data infrastructure: simple and advanced file formats
• Experimental validation procedures
• Statistical tests
• Most popular data mining packages
Module structure
• Topic 2: Data Preparation
– Which steps do we follow to transform the data in
order to facilitate the pattern extraction process?
– Many methods fall in this category
• Feature selection
• Instance selection
• Dimensionality reduction
• Missing values handling
• Discretisation
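As a taste of two of the preparation steps listed above, the sketch below fills in missing values and then discretises a numeric attribute. It uses mean imputation and equal-width binning, which are simply two of the most basic options and not necessarily the methods covered later in the module. The values and the helper name `equal_width_bins` are invented for the example.

```python
# Hypothetical numeric attribute with two missing entries (None).
values = [2.0, None, 4.0, 10.0, None, 8.0]

# Missing values handling: replace each missing entry by the observed mean.
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
imputed = [mean if v is None else v for v in values]


def equal_width_bins(data, n_bins):
    """Map each value to the index of an equal-width interval over [min, max]."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins
    if width == 0:
        # Constant attribute: every value falls in the same bin.
        return [0] * len(data)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in data]


# Discretisation: turn the numeric attribute into two ordinal bins.
bins = equal_width_bins(imputed, n_bins=2)

print("imputed:", imputed)
print("bins:   ", bins)
```

Note that the order matters here: the bins are computed after imputation, so the filled-in values are discretised like any other.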
Module structure
• Topic 3: Data Mining
– This topic deals with the central part of the KDD
pipeline, the extraction of patterns from data
– This process can be done in many different ways.
The most common ones are:
• Classification
• Regression
• Clustering
• Association Rules Mining
Module structure
• Topic 4: Applications
– We will see a few examples of how the methods
studied throughout the module are applied to challenging
real-world problems
Resources
• Books
– J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2006
– I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and
Techniques, Elsevier, 2005
– Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997
– Chris Bishop, Pattern Recognition and Machine Learning, Springer, 2006
– Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of
Statistical Learning, 2nd ed., Springer, 2009
• Online resources
– KDnuggets, a newsletter and website about data mining
• Software packages
– WEKA
– RapidMiner
– KEEL
Questions?