Transcript Lecture 3

Data Basics
Data Matrix
• Many datasets can be represented as a data matrix.
• Rows correspond to entities (records, points).
• Columns correspond to attributes (features, variables).
• n: size of the data (number of rows)
• d: dimensionality of the data (number of columns)
• Univariate analysis: the analysis of a single attribute.
• Bivariate analysis: simultaneous analysis of two attributes.
• Multivariate analysis: simultaneous analysis of multiple attributes.
Example for Data Matrix
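A minimal sketch in Python/numpy of a small data matrix with n = 4 entities and d = 3 attributes (the values are hypothetical, not the original slide's example):

import numpy as np

# Hypothetical data matrix: n = 4 entities (rows), d = 3 numeric attributes (columns)
D = np.array([
    [5.9, 3.0, 4.2],
    [6.9, 3.1, 4.9],
    [6.6, 2.9, 4.6],
    [4.6, 3.2, 1.4],
])

n, d = D.shape          # n = size of the data, d = dimensionality
print(n, d)             # 4 3
print(D[0])             # first row = one entity (a d-dimensional point)
print(D[:, 1])          # second column = one attribute across all entities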
Attributes
• Categorical Attributes
• composed of a set of symbols
• has a set-valued domain
• E.g., Sex with domain(Sex) = {M, F}, Education with
domain(Education) = { High School, BS, MS, PhD}.
• Two types of categorical attributes
– Nominal
• values in the domain are unordered
• Only equality comparisons are allowed
• E.g. Sex
– Ordinal
• Values are ordered
• Both equality and inequality comparisons are allowed
• E.g. Education
Attributes Cont.
• Numeric Attributes
– Has real-valued or integer-valued domain
– E.g. Age with domain (Age) = N, where N denotes the set of natural numbers
(non-negative integers).
• Two types of numeric attributes
– Discrete: values take on finite or countably infinite set.
– Continuous: values take on any real value
• Another Classification
– Interval-scaled
• Only differences are meaningful (there is no absolute zero)
• E.g. temperature in Celsius
– Ratio-scaled
• Both differences and ratios are meaningful (there is an absolute zero)
• E.g. Age
Algebraic View of Data
• If the d attributes in the data matrix D are all numeric
• each row can be considered as a d-dimensional point
• or equivalently, each row may be treated as a d-dimensional
column vector (the transpose of the row)
• Each point can then be written as a linear combination of the standard basis
vectors e1, · · · , ed : xi = xi1 e1 + xi2 e2 + · · · + xid ed
Example of Algebraic View of Data
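As a small numeric sketch (hypothetical values, not necessarily the slide's original example), a row x = (5.9, 3.0)ᵀ is the linear combination 5.9·e1 + 3.0·e2 of the standard basis vectors:

import numpy as np

x = np.array([5.9, 3.0])           # one row of D viewed as a 2-dimensional point
e1 = np.array([1.0, 0.0])          # standard basis vectors
e2 = np.array([0.0, 1.0])

# The point is the linear combination of the basis vectors,
# weighted by its attribute values.
reconstructed = x[0] * e1 + x[1] * e2
print(np.allclose(x, reconstructed))   # True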
Geometric View of Data
Distance and Angle
Example of Distance and Angle
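A minimal sketch of the Euclidean distance and the angle between two points a and b (hypothetical values), using cos θ = aᵀb / (‖a‖‖b‖):

import numpy as np

a = np.array([5.9, 3.0])
b = np.array([6.9, 3.1])

# Euclidean distance between the two points
dist = np.linalg.norm(a - b)

# Cosine of the angle between the two vectors, then the angle itself
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))

print(dist, cos_theta, theta)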
Mean and Total Variance
Centered Data Matrix
• The centered data matrix is obtained by
subtracting the mean from all the points
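A minimal sketch (hypothetical data) of the sample mean vector, the total variance taken here as the average squared distance of the points from the mean, and the centered data matrix:

import numpy as np

D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [6.6, 2.9],
              [4.6, 3.2]])
n = D.shape[0]

mean = D.mean(axis=0)                       # sample mean vector (one entry per attribute)
total_var = np.sum((D - mean) ** 2) / n     # total variance: avg. squared distance from the mean
Z = D - mean                                # centered data matrix: subtract the mean from every row

print(mean, total_var)
print(Z.mean(axis=0))                       # ~[0, 0]: centered data has zero mean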
Orthogonality
• Two vectors a and b are said to be orthogonal if and only if
aᵀb = 0, i.e., their dot product is zero.
• This implies that the angle between them is 90◦
or π/2 radians.
Orthogonal Projection
p: orthogonal projection of b onto the vector a, given by p = ((aᵀb)/(aᵀa)) a
r: error vector between b and p, given by r = b − p, which is orthogonal to a
Example of Projection
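A small numeric sketch (hypothetical vectors) of the projection p = ((aᵀb)/(aᵀa)) a and the error vector r = b − p:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

p = (a @ b) / (a @ a) * a      # orthogonal projection of b onto a
r = b - p                      # error vector between b and p

print(p, r)                    # [1. 2.] [ 2. -1.]
print(np.isclose(r @ a, 0.0))  # True: the error vector is orthogonal to a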
Linear Independence and Dimensionality
• span(v1, · · · , vk): the set of all possible linear combinations of
the vectors, i.e., {c1v1 + · · · + ckvk | c1, · · · , ck ∈ R}.
• If span(v1, · · · , vk) = Rm, then we say that v1, · · · , vk is a
spanning set for Rm.
Row and Column Space
• The column space of D, denoted col(D) is the set of all linear
combinations of the d column vectors or attributes
• The row space of D, denoted row(D), is the set of all linear
combinations of the n row vectors or points
• Note also that the row space of D is the column space of Dᵀ, the transpose of D.
Linear Independence
• Vectors v1, · · · , vk are linearly independent if none of them can be written as a
linear combination of the others; equivalently, c1v1 + · · · + ckvk = 0 only when c1 = · · · = ck = 0.
Dimension and Rank
• Let S be a subspace of Rm.
• A basis for S: a set of linearly independent vectors v1, · · · , vk such that
span(v1, · · · , vk) = S.
• Orthogonal basis for S: the vectors in the basis are pairwise
orthogonal.
• If in addition they are also normalized to be unit vectors, then they
make up an orthonormal basis for S.
• For instance, the standard basis for Rm is an orthonormal basis
consisting of the vectors e1 = (1, 0, · · · , 0)ᵀ, e2 = (0, 1, · · · , 0)ᵀ, · · · , em = (0, 0, · · · , 1)ᵀ.
• Any two bases for S must have the same number of vectors.
• Dimension: the number of vectors in a basis for S, denoted
dim(S).
• For any matrix, the dimensions of its row space and column space are
the same, and this common dimension is also called the rank of the
matrix.
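A quick numeric check on a hypothetical matrix: the third column below is the sum of the first two, so the column space has dimension 2 and the rank is 2:

import numpy as np

D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0],
              [1.0, 1.0, 2.0]])

print(np.linalg.matrix_rank(D))   # 2: row space and column space have the same dimension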
Data: Probabilistic View
• Assumes that each numeric attribute Xj is a random variable,
defined as a function that assigns a real number to each
outcome of an experiment.
• Given as Xj : O → R, where O, the domain of Xj , is called the
sample space.
• R, the range of Xj , is the set of real numbers.
• If the outcomes are numeric, and represent the observed
values of the random variable, then Xj : O → O is simply the
identity function: Xj (v) = v for all v ∈ O.
Data: Probabilistic View
• A random variable X is called a discrete
random variable if it takes on only a finite or
countably infinite number of values in its
range.
• X is called a continuous random variable if it
can take on any value in its range.
Example
• By default, consider the attribute X1 to be a continuous random variable,
given as the identity function X1(v) = v, since the outcomes are all
numeric.
• On the other hand, if we want to distinguish between iris flowers with
short and long sepal lengths, we define a discrete random variable A as
follows: A(v) = 1 if v ≥ 7 (long sepal length), and A(v) = 0 otherwise.
• In this case the domain of A is [4.3, 7.9]. The range of A is {0, 1}, and thus
A assumes non-zero probability only at the discrete values 0 and 1.
Example: Bernoulli and Binomial Distribution
• Only 13 of the 150 irises have a sepal length of at least 7 cm, so
P(A = 1) = 13/150 ≈ 0.087 and P(A = 0) ≈ 0.913.
• In this case we say that A has a Bernoulli distribution with parameter p ∈
[0, 1]: p denotes the probability of a success, whereas 1 − p represents the
probability of a failure.
Example: Bernoulli and Binomial Distribution
• Let us consider another discrete random variable B, denoting
the number of irises with long sepal lengths in m independent
Bernoulli trials with probability of success p.
• B takes on the discrete values {0, 1, · · · , m}, and its probability mass
function is given by the Binomial distribution:
P(B = k) = (m choose k) pᵏ (1 − p)^(m−k).
• For example, taking p = 0.087 from above, the probability of
observing exactly k = 2 long sepal length irises in m = 10 trials
is given as P(B = 2) = (10 choose 2)(0.087)²(0.913)⁸ ≈ 0.164.
(Figure: the full probability mass function of B for different values of k.)
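A minimal sketch of these two computations, assuming the standard 150-flower iris dataset so that p = 13/150 ≈ 0.087 matches the value quoted above:

from math import comb

p = 13 / 150                        # assumed 13 successes out of 150 irises, so p ≈ 0.087
m, k = 10, 2                        # m independent Bernoulli trials, k successes

# Binomial probability mass function: P(B = k) = C(m, k) * p^k * (1 - p)^(m - k)
prob = comb(m, k) * p**k * (1 - p)**(m - k)
print(round(p, 3), round(prob, 3))  # approximately 0.087 and 0.164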
Probability Density Function
• If X is continuous, its range is the entire set of real numbers R.
• Probability density function f(x): specifies the probability that the
variable X takes on values in any interval [a, b] ⊂ R, namely
P(X ∈ [a, b]) = ∫ₐᵇ f(x) dx.
Cumulative Distribution Function
• For any random variable X, whether discrete or continuous,
we can define the cumulative distribution function (CDF) F : R
→ [0, 1], which gives the probability of observing a value at
most some given value x: F(x) = P(X ≤ x) for all x ∈ R.
The following examples are from Andrew Moore
Probability Density Function f(x)
• What is P(X = x) when X takes values on a real (continuous) domain?
For a continuous random variable, P(X = x) = 0 for any single value x; only intervals carry probability.
• A valid density must satisfy f(x) ≥ 0 and ∫ f(x) dx = 1 over the whole real line.
Normal Distribution
• Let us assume that these values follow a Gaussian or normal
density function, given as f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),
with mean μ and variance σ².
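A minimal sketch of evaluating this density at a point, with hypothetical parameter values:

import math

def normal_pdf(x, mu, sigma):
    """Gaussian (normal) density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical parameters, e.g. a sepal-length-like attribute
print(normal_pdf(5.8, mu=5.8, sigma=0.8))   # density at the mean
print(normal_pdf(7.0, mu=5.8, sigma=0.8))   # density further from the mean (smaller value)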
Bivariate Random Variables
• A pair of attributes, X1 and X2, can be considered jointly as
a bivariate random variable X = (X1, X2)ᵀ.
In 2-Dimensions
Multivariate Random Variable
• More generally, d attributes X1, · · · , Xd can be treated together as a
d-dimensional vector random variable X = (X1, · · · , Xd )ᵀ.
Numeric Attribute Analysis
• Sample and Statistics
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Normal Distribution
Normal Distribution
Random Sample and Statistics
• Population: the set or universe of all entities under study.
• However, looking at the entire population may not be
feasible, or may be too expensive.
• Instead, we draw a random sample from the population and
compute appropriate statistics from the sample, which give
estimates of the corresponding population parameters of
interest.
Univariate Sample
• Let X be a random variable, and let xi (1 ≤ i ≤ n) denote the
observed values of attribute X in the given data, where n is
the data size.
• Given a random variable X, a random sample of size n from X
is defined as a set of n independent and identically distributed
(IID) random variables S1, S2, · · · , Sn.
• Since the variables Si are all independent and identically distributed as X, their joint
probability (density) function is the product of the marginals:
f(x1, x2, · · · , xn) = f(x1) f(x2) · · · f(xn).
Multivariate Sample
• xi : the observed value of a d-dimensional vector random variable Si =
(X1,X2, · · · ,Xd )ᵀ.
• The Si are independent and identically distributed, and thus their
joint distribution is the product f(x1, x2, · · · , xn) = ∏ᵢ f(xi).
• If we further assume the d attributes X1,X2, · · · ,Xd are independent, the above
can be rewritten as f(xi) = ∏ⱼ f(xij), so that f(x1, · · · , xn) = ∏ᵢ ∏ⱼ f(xij).
Statistic
• Let Si denote the random variable corresponding to
data point xi ; then a statistic θ̂ is a function θ̂ : (S1,
S2, · · · , Sn) → R.
• If we use the value of a statistic to estimate a
population parameter, this value is called a point
estimate of the parameter, and the statistic is called
an estimator of the parameter.
Numeric Attribute Analysis
• Sample and Statistics
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Normal Distribution
Univariate Analysis
Univariate analysis focuses on a single attribute at
a time, thus the data matrix D can be thought of
as an n × 1 matrix, or simply a column vector.
Univariate Analysis
X is assumed to be a random variable, and each point xi (1
≤ i ≤ n) is assumed to be the value of a random variable Si ,
where the variables Si are all independent and identically
distributed as X, i.e., they constitute a random sample
drawn from X.
In the vector view, we treat the sample as an n-dimensional vector, and write X ∈ Rn.
What can sample analysis do?
• Estimate the unknown distribution of X: its density/mass function f(x) and CDF F(x)
• Estimate its parameters (e.g., μ, σ)
Empirical Cumulative Distribution Function
F̂(x) = (1/n) Σᵢ I(xi ≤ x), summing over i = 1, · · · , n,
where I(xi ≤ x) is the indicator function: 1 if xi ≤ x and 0 otherwise.
Inverse Cumulative Distribution Function
F̂⁻¹(q) = min{x | F̂(x) ≥ q}, for q ∈ [0, 1] (also called the quantile function).
Empirical Probability Mass Function
f̂(x) = (1/n) Σᵢ I(xi = x),
where I(xi = x) is 1 if xi = x and 0 otherwise.
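A minimal sketch of the empirical CDF, its inverse (using the min-value convention above), and the empirical PMF on a small hypothetical sample:

import numpy as np

x = np.array([4.6, 5.9, 6.6, 6.6, 6.9])    # hypothetical univariate sample, n = 5
n = len(x)

def ecdf(v):
    """Empirical CDF: fraction of sample values that are <= v."""
    return np.sum(x <= v) / n

def ecdf_inv(q):
    """Empirical inverse CDF: smallest sample value v with ecdf(v) >= q."""
    xs = np.sort(x)
    idx = np.searchsorted(np.arange(1, n + 1) / n, q)
    return xs[min(idx, n - 1)]

def epmf(v):
    """Empirical PMF: fraction of sample values equal to v."""
    return np.sum(x == v) / n

print(ecdf(6.6))       # 0.8
print(ecdf_inv(0.5))   # 6.6 (the sample median under this convention)
print(epmf(6.6))       # 0.4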
Measures of Central Tendency (Mean)
Population Mean: μ = E[X], i.e., Σ x f(x) for discrete X or ∫ x f(x) dx for continuous X.
Sample Mean (unbiased, but not robust to outliers): μ̂ = (1/n) Σᵢ xi.
Measures of Central Tendency (Median)
Population Median: the value m such that P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2,
or equivalently F(m) = 0.5.
Sample Median: the middle value of the sorted sample, i.e., F̂⁻¹(0.5)
(the average of the two middle values when n is even).
Measures of Central Tendency (Mode)
Sample Mode: the value with the highest frequency in the sample, i.e., the x that maximizes f̂(x).
The mode may not be very useful (several values may tie, or no value may repeat),
but it is not affected much by outliers.
Example
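A minimal worked sketch of the three measures on a small hypothetical sample (not the original slide's example); note how the outlier pulls the mean but not the median or mode:

import numpy as np
from collections import Counter

x = np.array([4.6, 5.9, 6.6, 6.6, 6.9, 30.0])    # hypothetical sample with one outlier (30.0)

mean = x.mean()                                  # sample mean: pulled up by the outlier
median = np.median(x)                            # sample median: robust to the outlier
mode = Counter(x.tolist()).most_common(1)[0][0]  # sample mode: most frequent value

print(round(mean, 2), median, mode)              # 10.1, 6.6, 6.6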
Measures of Dispersion (Range)
Range: r = max(X) − min(X), the difference between the largest and smallest values.
Sample Range: r̂ = maxᵢ{xi} − minᵢ{xi}.
Not robust: sensitive to extreme values.
Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR): IQR = F⁻¹(0.75) − F⁻¹(0.25), the spread of the middle 50% of the values.
Sample IQR: F̂⁻¹(0.75) − F̂⁻¹(0.25).
More robust to extreme values than the range.
Measures of Dispersion (Variance and Standard Deviation)
Variance: σ² = E[(X − μ)²], i.e., Σ (x − μ)² f(x) for discrete X or ∫ (x − μ)² f(x) dx for continuous X.
Standard Deviation: σ = √(σ²).
Sample Variance & Standard Deviation: σ̂² = (1/n) Σᵢ (xi − μ̂)², and σ̂ = √(σ̂²).
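A minimal sketch of the dispersion measures above on the same kind of hypothetical sample; note how the range reacts to the outlier while the IQR barely moves:

import numpy as np

x = np.array([4.6, 5.9, 6.6, 6.6, 6.9, 30.0])    # hypothetical sample with one outlier

sample_range = x.max() - x.min()                 # range: max - min, very sensitive to extremes
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                    # inter-quartile range: more robust

var = x.var()                                    # sample variance: mean squared deviation (1/n form)
std = x.std()                                    # sample standard deviation

print(sample_range, iqr, round(var, 2), round(std, 2))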
Normalization
Linear (range) Normalization: x′i = (xi − minᵢ{xi}) / (maxᵢ{xi} − minᵢ{xi}), which maps the values into [0, 1].
Z-Score: x′i = (xi − μ̂) / σ̂, which gives the values zero mean and unit standard deviation.
Normalization Example
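A minimal sketch of both normalizations on a hypothetical attribute (the original slide's example values are not reproduced here):

import numpy as np

x = np.array([4.6, 5.9, 6.6, 6.6, 6.9])          # hypothetical attribute values

# Linear (range / min-max) normalization: maps values into [0, 1]
x_range = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
x_z = (x - x.mean()) / x.std()

print(np.round(x_range, 3))
print(np.round(x_z, 3), round(x_z.mean(), 3), round(x_z.std(), 3))   # mean ~0, std ~1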
Topics
• Sample and Statistics
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Normal Distribution
Bivariate Analysis
Bivariate analysis focuses on two attributes at a
time, thus the data matrix D can be thought of as
an n × 2 matrix, or as two column vectors.
Empirical Joint Probability Mass Function
f̂(x) = (1/n) Σᵢ I(xi = x),
or, written per attribute, f̂(x1, x2) = (1/n) Σᵢ I(x1i = x1, x2i = x2),
where I(·) is the indicator function: 1 when its condition holds and 0 otherwise.
Measures of Central Tendency (Mean)
Population Mean: μ = E[X] = (E[X1], E[X2])ᵀ = (μ1, μ2)ᵀ.
Sample Mean: μ̂ = (1/n) Σᵢ xi = (μ̂1, μ̂2)ᵀ.
Measures of Association (Covariance)
Covariance: σ12 = E[(X1 − μ1)(X2 − μ2)] = E[X1X2] − μ1μ2.
Sample Covariance: σ̂12 = (1/n) Σᵢ (x1i − μ̂1)(x2i − μ̂2).
Measures of Association (Correlation)
Correlation: ρ12 = σ12 / (σ1 σ2), which always lies in [−1, 1].
Sample Correlation: ρ̂12 = σ̂12 / (σ̂1 σ̂2).
Measures of Association (Correlation)
Correlation Example
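A minimal sketch of the sample covariance and correlation for a hypothetical pair of attributes, using the (1/n) formulas above:

import numpy as np

x1 = np.array([4.6, 5.9, 6.6, 6.6, 6.9])     # hypothetical attribute X1
x2 = np.array([3.2, 3.0, 2.9, 3.0, 3.1])     # hypothetical attribute X2

m1, m2 = x1.mean(), x2.mean()
cov12 = np.mean((x1 - m1) * (x2 - m2))       # sample covariance (1/n convention)
corr12 = cov12 / (x1.std() * x2.std())       # sample correlation, always in [-1, 1]

print(round(cov12, 3), round(corr12, 3))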
Topics
• Sample and Statistics
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Normal Distribution
Multivariate Analysis
Multivariate analysis focuses on multiple attributes
at a time, thus the data matrix D can be thought of
as an n × d matrix, or as d column vectors.
Measures of Central Tendency (Mean)
Population Mean: μ = E[X] = (μ1, μ2, · · · , μd )ᵀ.
Sample Mean: μ̂ = (1/n) Σᵢ xi = (μ̂1, μ̂2, · · · , μ̂d )ᵀ.
Measures of Association (Covariance Matrix)
Covariance Matrix: Σ = E[(X − μ)(X − μ)ᵀ], the d × d matrix with variances σi² on the
diagonal and covariances σij off the diagonal.
Sample Covariance Matrix: Σ̂ = (1/n) Σᵢ (xi − μ̂)(xi − μ̂)ᵀ = (1/n) ZᵀZ,
where Z is the centered data matrix.
Measures of Association (Correlation)
Correlation: ρij = σij / (σi σj ); the d × d correlation matrix collects all pairwise values.
Sample Correlation: ρ̂ij = σ̂ij / (σ̂i σ̂j ).
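A minimal sketch computing the sample mean vector, the covariance matrix via the centered matrix Z, and the correlation matrix for a small hypothetical data matrix:

import numpy as np

D = np.array([[4.6, 3.2, 1.4],
              [5.9, 3.0, 4.2],
              [6.6, 2.9, 4.6],
              [6.9, 3.1, 4.9]])
n = D.shape[0]

mu_hat = D.mean(axis=0)                 # sample mean vector (d entries)
Z = D - mu_hat                          # centered data matrix
cov = (Z.T @ Z) / n                     # sample covariance matrix, d x d ((1/n) Z^T Z)
std = np.sqrt(np.diag(cov))             # per-attribute standard deviations
corr = cov / np.outer(std, std)         # sample correlation matrix

print(mu_hat)
print(np.round(cov, 3))
print(np.round(corr, 3))                # ones on the diagonal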
Topics
• Sample and Statistics
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Normal Distribution
Univariate Normal Distribution
f(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)).
Multivariate Normal Distribution
f(x | μ, Σ) = (1/((2π)^(d/2) |Σ|^(1/2))) exp(−(x − μ)ᵀ Σ⁻¹ (x − μ) / 2),
where μ is the d-dimensional mean vector and Σ is the d × d covariance matrix.
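A minimal sketch of evaluating both densities at a point, with hypothetical parameters; the multivariate form uses the determinant and inverse of the covariance matrix Σ:

import numpy as np

def univariate_normal_pdf(x, mu, sigma):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def multivariate_normal_pdf(x, mu, cov):
    """f(x | mu, Sigma) for a d-dimensional point x with covariance matrix Sigma."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Hypothetical parameters
print(univariate_normal_pdf(6.0, mu=5.8, sigma=0.8))

mu = np.array([5.8, 3.0])
cov = np.array([[0.68, -0.04],
                [-0.04, 0.19]])
print(multivariate_normal_pdf(np.array([6.0, 3.1]), mu, cov))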
• Thank You!