Introduction to Data Mining
by
Md. Altaf-Ul-Amin
Computational Systems Biology Lab
NAIST, JAPAN
Topics we will try to cover in this course
•Multivariate Data and Concepts of Variance, Metrics, Similarities and Distances
•Basic Matrix and Vector Algebra
•Concept of Supervised and Unsupervised Learning
•Principal Component Analysis
•Hierarchical Clustering
•K-Means Clustering
•Classification Trees
•Expectation Maximization Algorithm
•Naive Bayes Classifier
•Partial Least Squares Regression
•Partial Least Squares Discriminant Analysis
•Support Vector Machines
•Self-Organizing Maps
•Introduction to Neural Networks
•Introduction to Random Forest
•Receiver Operating Characteristic (ROC) Curves
•Statistical Tests and p-values
Classes: On Thursdays (10/6, 10/13, 10/20, 10/27, 11/10, 11/17, 11/24) (13:30-15:00)
What is data mining?
Discovery of models and patterns from big observational/experimental data sets,
mainly by computation (preferably using modern computers)
In simple terms, two primary goals:
Understanding
Prediction
Collected from internet (Slides by Padhraic Smyth, University of California, Irvine)
Usually, such data are called multivariate data
Time-course type data are also multivariate data
Image Data
Can be transformed into multivariate data
Relational data—networks
Many systems in nature can be represented as networks
Multivariate data can be transformed into networks
Part of the protein-protein interaction network of E. coli
Simple Example of pattern discovery
Of course, more complicated patterns can be extracted from different data
by applying different algorithms.
Mean, Median and Standard Deviation
The mean is the average:
(98+96+96+84+81+81+73)/7 = 609/7 = 87
Median of a set of numbers:
10, 13, 4, 25, 8, 12, 9, 19, 18
Arrange them in descending order:
25, 19, 18, 13, 12, 10, 9, 8, 4
The middle value is 12, so the median = 12
Another case:
1, 2, 3, 4, 5, 6
Both 3 and 4 are in the middle. In this case, we must take the average of the two
middle numbers. Since (3+4)/2 = 3.5, the median = 3.5.
Formula for standard deviation (sample form, which reproduces the value below):
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Standard deviation of a set of numbers:
15, 21, 21, 21, 25, 30, 50, 29
sd = 10.66369
The square of the standard deviation is called the variance.
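The worked values on this slide can be checked with Python's standard `statistics` module. This is only a sketch for verification; the variable names are illustrative. Note that `statistics.stdev` uses the sample (n - 1) form, which is the form that gives sd = 10.66369 for the set above.

```python
import statistics

scores = [98, 96, 96, 84, 81, 81, 73]          # data used for the mean
odd_set = [10, 13, 4, 25, 8, 12, 9, 19, 18]    # nine values: one middle value
even_set = [1, 2, 3, 4, 5, 6]                  # six values: two middle values
sd_set = [15, 21, 21, 21, 25, 30, 50, 29]      # data used for the standard deviation

mean_scores = statistics.mean(scores)          # 609 / 7
median_odd = statistics.median(odd_set)        # middle value after sorting
median_even = statistics.median(even_set)      # average of the two middle values
sd = statistics.stdev(sd_set)                  # sample standard deviation (n - 1)
variance = statistics.variance(sd_set)         # square of the standard deviation

print(mean_scores, median_odd, median_even, round(sd, 5), round(variance, 3))
```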
1D histogram
Consider the following two sets of 50 integers between 0 and 200
Set a
61 148 64 115 113 110 174 33 44 60 144 190 97 52 45 175 3 29 10 104 134 78 63 191 130
172 116 102 28 85 101 100 2 57 117 162 131 119 18 24 4 51 111 39 187 182 25 142 8 55
Set b
60 53 43 19 86 182 183 89 139 158 35 200 155 26 106 150 116 132 101 143 157 148 112 152
190 99 135 33 115 156 104 76 58 163 8 153 10 48 125 91 81 97 11 185 133 170 27 159 59 69
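A 1D histogram of these two sets can be sketched in a few lines of Python. The bin width of 50 and the text-based plot are assumptions made for illustration; the original slide's binning is not recoverable from the text.

```python
from collections import Counter

set_a = [61, 148, 64, 115, 113, 110, 174, 33, 44, 60, 144, 190, 97, 52, 45,
         175, 3, 29, 10, 104, 134, 78, 63, 191, 130, 172, 116, 102, 28, 85,
         101, 100, 2, 57, 117, 162, 131, 119, 18, 24, 4, 51, 111, 39, 187,
         182, 25, 142, 8, 55]
set_b = [60, 53, 43, 19, 86, 182, 183, 89, 139, 158, 35, 200, 155, 26, 106,
         150, 116, 132, 101, 143, 157, 148, 112, 152, 190, 99, 135, 33, 115,
         156, 104, 76, 58, 163, 8, 153, 10, 48, 125, 91, 81, 97, 11, 185,
         133, 170, 27, 159, 59, 69]


def histogram(values, bin_width=50, upper=200):
    """Count values per bin; the last bin is closed so that `upper` is included."""
    n_bins = upper // bin_width
    counts = Counter(min(v // bin_width, n_bins - 1) for v in values)
    return [counts.get(i, 0) for i in range(n_bins)]


for name, data in (("a", set_a), ("b", set_b)):
    for i, count in enumerate(histogram(data)):
        low = i * 50
        print(f"set {name} {low:3d}-{low + 50:3d} {'#' * count}")
```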
2D scatter plot with regression line (best-fit line or least-squares line)
Two other sets of 10 integers, c and d
c = {10, 13, 24, 56, 78, 34, 88, 65, 91, 7}
d = {7, 17, 23, 51, 73, 38, 79, 69, 97, 5}
Correlation between a and b = 0.2432899
Correlation between c and d = 0.9885065
Formula of correlation:
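The formula image is not reproduced in this transcript. In the usual Pearson form, r = Σ(x_i - x̄)(y_i - ȳ) / sqrt(Σ(x_i - x̄)² · Σ(y_i - ȳ)²). The sketch below applies it to the sets c and d and also computes the least-squares line through the same points; the helper name `centered_sums` is hypothetical.

```python
import math

c = [10, 13, 24, 56, 78, 34, 88, 65, 91, 7]
d = [7, 17, 23, 51, 73, 38, 79, 69, 97, 5]


def centered_sums(x, y):
    """Return the means and the sums of squares/products about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


mx, my, sxx, syy, sxy = centered_sums(c, d)

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation between c and d
slope = sxy / sxx                # least-squares slope of d on c
intercept = my - slope * mx      # the fitted line passes through the means

print(f"r = {r:.7f}, d ≈ {slope:.4f} * c + {intercept:.4f}")
```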
Variance and covariance
X = (4, 6, 8, 9), Y = (10, 8, 17, 20)
Variance(X) = 4.92, Variance(Y) = 32.25, Covariance(X, Y) = 10.92
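These values follow from the sample (n - 1) definitions of variance and covariance. A minimal sketch for checking them, with an illustrative helper name:

```python
def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


X = [4, 6, 8, 9]
Y = [10, 8, 17, 20]

var_x = sample_covariance(X, X)    # variance is the covariance of X with itself
var_y = sample_covariance(Y, Y)
cov_xy = sample_covariance(X, Y)

print(round(var_x, 2), round(var_y, 2), round(cov_xy, 2))
```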
Collected from internet (slides by Aly A. Farag)
Relation between Correlation and covariance
You can verify by using the formulas presented in previous slides
Correlation and covariance thus reveal similar information
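The standard relation is r = Cov(X, Y) / (s_X · s_Y), where s_X and s_Y are the standard deviations. As a sketch, the code below checks it on X = (4, 6, 8, 9) and Y = (10, 8, 17, 20) by computing the correlation both ways; the data are taken from the variance and covariance example, and the variable names are illustrative.

```python
import math

X = [4, 6, 8, 9]
Y = [10, 8, 17, 20]
n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

# Route 1: covariance divided by the product of the standard deviations.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in X) / (n - 1))
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))
r_from_cov = cov_xy / (sd_x * sd_y)

# Route 2: the correlation formula applied directly (the n - 1 factors cancel).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)
r_direct = sxy / math.sqrt(sxx * syy)

print(round(r_from_cov, 4), round(r_direct, 4))
```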
More about Covariance and correlation
•Covariance and correlation are measures of linear association, i.e.
association along a line.
•Their values are less informative for non-linear association.
•These quantities are very sensitive to “outliers”.
•Despite these limitations, covariance and correlation are routinely
calculated and analyzed.
•These quantities are good when data do not have obvious non-linear
association and outliers.
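The two limitations listed above can be illustrated with small made-up data sets. Everything in this sketch is hypothetical example data: a perfectly linear set, the same set with a single outlier appended, and a parabola whose association is strong but not linear.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Perfectly linear data: y = x.
x = list(range(1, 11))
y = list(range(1, 11))
r_linear = pearson(x, y)

# The same data with one extreme point appended.
r_outlier = pearson(x + [11], y + [-40])

# A symmetric parabola: y is fully determined by x, but not along a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [v * v for v in xs]
r_parabola = pearson(xs, ys)

print(r_linear, round(r_outlier, 3), r_parabola)
```

In this toy case a single outlier is enough to change the sign of the correlation, and the parabola gives a correlation of zero even though y depends entirely on x.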
Inner product of two vectors and related things
Dot product and its meaning
A = (2, 5), B = (-4, -3), C = (5, -5), D = (4, 4)
A.B = (2 x -4) + (5 x -3) = -8 - 15 = -23
B.C = (-4 x 5) + (-3 x -5) = -20 + 15 = -5
C.A = (5 x 2) + (-5 x 5) = 10 - 25 = -15
C.D = (5 x 4) + (-5 x 4) = 20 - 20 = 0
As cos 90° = 0, when two vectors are perpendicular their dot product is zero.
As cos 0° = 1, when two vectors are in the same direction their dot product is
the product of their magnitudes.
Notice that the inputs to the dot product are two vectors, but its output is a scalar.
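The dot products above can be reproduced with a short sketch. The coordinates of A, B and C are read off from the arithmetic on this slide (an inference, since the figure is not reproduced), and the relation A.B = |A| |B| cos θ is used to recover the angle between two vectors.

```python
import math

A, B, C, D = (2, 5), (-4, -3), (5, -5), (4, 4)


def dot(u, v):
    """Dot product: sum of products of matching components (a scalar)."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Magnitude (length) of a vector."""
    return math.sqrt(dot(u, u))


def angle_deg(u, v):
    """Angle between u and v in degrees, from u.v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


print(dot(A, B), dot(B, C), dot(C, A), dot(C, D))
print(angle_deg(C, D))        # perpendicular vectors
twice_d = (8, 8)              # same direction as D
print(dot(D, twice_d), norm(D) * norm(twice_d))
```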
Handling Multivariate data: Concept and types of metrics
Multivariate data format
Multivariate data example
Distances, metrics, dissimilarities and similarities are related concepts
A metric is a function that satisfies the following properties:
A function that satisfies only conditions (i)-(iii) is referred to as a distance.
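The list of properties is not reproduced in this transcript. As a hedged illustration, the sketch below numerically checks the usual textbook axioms (non-negativity, symmetry, zero self-distance and the triangle inequality) on a few hypothetical points. It is an assumption that these correspond to the slide's conditions, and the numbering (i)-(iii) is not reconstructed here.

```python
import itertools
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def check_axioms(d, points, tol=1e-12):
    """Check the standard metric axioms on every pair/triple of sample points."""
    pairs = list(itertools.product(points, repeat=2))
    triples = itertools.product(points, repeat=3)
    return {
        "non_negative": all(d(a, b) >= 0 for a, b in pairs),
        "symmetric": all(abs(d(a, b) - d(b, a)) <= tol for a, b in pairs),
        "zero_self_distance": all(d(a, a) == 0 for a in points),
        "triangle": all(d(a, c) <= d(a, b) + d(b, c) + tol for a, b, c in triples),
    }


sample = [(0,), (1,), (2,)]  # hypothetical points on a line
print(check_axioms(euclidean, sample))
print(check_axioms(squared_euclidean, sample))
```

On these points the squared Euclidean distance passes the first three checks but fails the triangle inequality (4 > 1 + 1), so it is an example of a dissimilarity that is not a metric in the usual sense.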
Source: Bioinformatics and Computational Biology Solutions Using R and
Bioconductor (Statistics for Biology and Health)
Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit
(Editors)
These measures consider the expression measurements as points in some
metric space.
Example:
Let X = (4, 6, 8) and Y = (5, 3, 9)
Cosine similarity (X, Y) = 0.952
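The cosine similarity value can be checked with the sketch below, which also computes the Euclidean and Manhattan distances for the same two points. Including those two distances is an assumption about which measures the original slide listed; the function names are illustrative.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: a.b / (|a| |b|)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


print(round(euclidean(X, Y), 3), manhattan(X, Y), round(cosine_similarity(X, Y), 3))
```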
A widely used function for finding similarity is correlation.
Correlation gives a measure of linear association between variables and
ranges from -1 to +1.
Statistical distance between points
The statistical distance (Mahalanobis distance) between two vectors can be calculated if the
variance-covariance matrix is known or estimated.
The Euclidean distance between points Q and P is larger than that between Q and the
origin O, yet P and Q appear to belong to the same cluster while Q and O do not.
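The situation described above can be reproduced with a small two-dimensional sketch. The covariance matrix and the coordinates of P and Q are hypothetical values chosen for illustration (the figure is not reproduced), and the Mahalanobis distance is taken in its usual form sqrt((u - v)ᵀ Σ⁻¹ (u - v)).

```python
import math


def euclidean_2d(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])


def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance for 2-D points, inverting the 2x2 covariance by hand."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = u[0] - v[0], u[1] - v[1]
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)


# Hypothetical covariance: the two variables are strongly positively correlated,
# so the cluster is stretched along the diagonal.
cov = ((1.0, 0.9), (0.9, 1.0))
O = (0.0, 0.0)
Q = (1.0, -1.0)
P = (-2.0, -4.0)   # displaced from Q along the direction of the cluster

print("Euclidean   Q-P, Q-O:", euclidean_2d(Q, P), euclidean_2d(Q, O))
print("Mahalanobis Q-P, Q-O:", mahalanobis_2d(Q, P, cov), mahalanobis_2d(Q, O, cov))
```

With these values Q is farther from P than from O in Euclidean terms, but closer to P in Mahalanobis terms, because the step from Q to P runs along the direction in which the data vary most.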
Distances between distributions
Unlike the previous approach (i.e. considering expression measurements as
points in some metric space), the data for each feature can be treated as an independent
sample from a population.
The data therefore reflect the underlying population, and we need to measure
similarities between two densities/distributions.
Kullback-Leibler Information
KLI measures how much the shape of one distribution resembles the other.
Mutual information
MI is large when the joint distribution is quite different from the product of the
marginals.
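For discrete distributions, both quantities reduce to short sums. The sketch below uses the standard definitions, KL(p || q) = Σ p_i log(p_i / q_i) and MI = Σ p(x, y) log(p(x, y) / (p(x) p(y))), with natural logarithms. The example distributions are hypothetical, and the original slide's formulas (not reproduced here) may use a different notation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information of a joint probability table (rows: x, columns: y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # x fully determines y

print(kl_divergence(p, p), kl_divergence(p, q))
print(mutual_information(independent), mutual_information(dependent))
```

KL is zero for identical distributions and grows as their shapes differ; MI is zero for the independent table and equals log 2 for the fully dependent one.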
Knowledge Discovery from Data/Databases (KDD)
Collected from internet (Slides by Ciro Donalek)
Data Mining
The most relevant/important DM tasks are:
– Exploratory data analysis (already discussed)
– Visualization (there are many tools)
– clustering
– classification
– regression
– Assimilation (beyond the scope of this course)
Examples of unsupervised methods: Principal component analysis (PCA), Hierarchical Clustering,
Network clustering (DPClus)
Examples of supervised methods: Neural networks, Support Vector Machines, Decision trees