#### Transcript estimation

*Samples and Statistical Inference* (CS6762)

##### Motivation

- Samples are used mainly for modeling the data set.
- Why sample?
  - Some data mining tasks involve only part of the population.
  - Even if a task involves the entire population, only a portion of it can be dealt with at once.
- Why statistical inference?
  - To make statements about the structure of the entire population (size estimation, probability distribution, our own confidence, etc.).
  - The statements are made on the basis of samples.
  - We must have a model (hypothesis) in mind, for example $Z = f(X, Y)$.

##### Statistical Inferences

- Samples must be drawn from the population randomly:
  - each member of the population has a known probability of appearing in the sample;
  - purposive (non-random) sampling may lead us away from the true structure. For instance, if the distribution is normal with mean 0 and standard deviation 1, what could you conclude from a sample deliberately packed with values of 20 or more? Such values are essentially impossible under that distribution, so the sample would badly misrepresent the population.
- Let $M$ be the model and $\theta$ the parameter (neither is known to us), and let $D = \{x(1), \ldots, x(n)\}$ be a random sample (note: $x(i)$ is the $i$-th row).
  - $p(D \mid \theta, M)$ is the probability that $D$ will be selected.
  - If the $x(i)$ are independent, then
    $$p(D \mid \theta, M) = \prod_{i=1}^{n} p(x(i) \mid \theta, M).$$
- $p(D \mid \theta, M)$ is called the likelihood function when viewed as a function of $\theta$.
- Reject a model (parameter) if the probability of the observed data arising under that model (parameter) is less than some threshold.
- Example:
  - We have determined that the data set follows a normal distribution with unit variance, but we do not know its mean.
  - Calculate the probability of the data for each candidate value of the mean.
  - Reject those values for which the probability is less than the threshold.
  - Or, more frequently, select the value that maximizes the probability, which is a special case of estimation. In this example, we would choose the value of the mean that maximizes the probability.

##### Estimation

- Purpose: making statements about the entire data set.
- Approach: estimating the parameter(s) using samples.
- Notation:
  - Let $\theta$ be the parameter to be estimated.
  - An estimator of $\theta$, denoted $\hat{\theta}$, is a function of the observational data in the sample.
- Example 1: estimate the average battery life of a given type.
  - Let $\mu$ denote the (true) average.
  - We randomly take $n$ batteries of that type and observe their lives $x_1, \ldots, x_n$.
  - A reasonable estimator: $\hat{\mu} = \bar{x} = (x_1 + \cdots + x_n)/n$.
  - Let $n = 3$ with $x_1 = 5.2$, $x_2 = 5.6$, $x_3 = 6.0$; then $\hat{\mu} = 5.6$.
- Example 2: assume the breakdown voltage of a device has a normal distribution with mean $\mu$. Let $x_1, \ldots, x_n$ be the observed breakdown voltages in the sample. Possible estimators:
  - $\bar{x}$: the average of all the values;
  - $\tilde{x}$: the sample median (for even $n$, the average of the two middle values);
  - the midrange $[\min(x_i) + \max(x_i)]/2$;
  - the 10% trimmed mean (discard the smallest and largest 10% of the values, then average the rest).

  Let $n = 10$ and the sample be 24.46, 25.61, 26.25, 26.42, 26.66, 27.15, 27.31, 27.54, 27.74, 27.94. We have:
  - $\bar{x} = (\sum x_i)/n = 267.1/10 = 26.71$;
  - $\tilde{x} = (26.66 + 27.15)/2 = 26.91$;
  - $[\min(x_i) + \max(x_i)]/2 = (24.46 + 27.94)/2 = 26.2$;
  - $\bar{x}_{tr(10)} = (267.1 - 24.46 - 27.94)/8 = 26.84$.

  Question: which estimator is better?
- Example 3: assume $X$ is the reaction time to some stimulus and has a uniform distribution on $[0, \theta]$. Estimate $\theta$ using a random sample $x_1, \ldots, x_n$.
  - Estimator 1: $\hat{\theta}_1 = \max(x_1, \ldots, x_n)$.
  - Estimator 2: $\hat{\theta}_2 = \frac{n+1}{n} \max(x_1, \ldots, x_n)$.
  - Let $n = 5$ and the sample be 4.2, 1.7, 2.4, 3.9, 1.3; then $\hat{\theta}_1 = 4.2$ and $\hat{\theta}_2 = \frac{6}{5} \cdot 4.2 = 5.04$.
  - Which one is more reasonable? (The two sketches below check these computations numerically.)
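To make Example 2 concrete, here is a minimal NumPy sketch (my own, not from the slides) that recomputes the four location estimates on the breakdown-voltage sample:

```python
import numpy as np

# Breakdown-voltage sample from Example 2 (n = 10, already sorted)
x = np.array([24.46, 25.61, 26.25, 26.42, 26.66,
              27.15, 27.31, 27.54, 27.74, 27.94])

x_bar     = x.mean()                    # sample mean            -> ~26.71
x_med     = np.median(x)                # average of middle two  -> ~26.91
midrange  = (x.min() + x.max()) / 2     # midrange               -> 26.20
trimmed10 = np.sort(x)[1:-1].mean()     # 10% trimmed mean       -> ~26.84

print(x_bar, x_med, midrange, trimmed10)
```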
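And the same kind of check for Example 3's two estimators of the uniform upper bound $\theta$ (again a sketch; plain Python suffices):

```python
# Reaction-time sample from Example 3 (uniform on [0, theta], n = 5)
x = [4.2, 1.7, 2.4, 3.9, 1.3]
n = len(x)

theta1 = max(x)                  # estimator 1: sample maximum    -> 4.2
theta2 = (n + 1) / n * max(x)    # estimator 2: scale up the max  -> 5.04

print(theta1, theta2)
```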
##### Estimation (cont.)

- Example 4: let the random sample $x_1, \ldots, x_n$ be drawn from a population with a certain probability distribution. Estimate the variance $\sigma^2$.
  - Estimator 1: $\hat{\sigma}_1^2 = \frac{\sum_i (X_i - \bar{X})^2}{n}$.
  - Estimator 2: $\hat{\sigma}_2^2 = \frac{\sum_i (X_i - \bar{X})^2}{n-1}$.
  - Let $n = 8$ and the sample be 44.2, 43.9, 44.7, 44.2, 44.0, 43.8, 44.6, 43.1; then $\hat{\sigma}_1^2 = 0.220$ and $\hat{\sigma}_2^2 = 0.251$. (See the NumPy check at the end of this subsection.)
- Example 5: let the sample $(x_1^t, \ldots, x_n^t)$ be drawn from a population of 2D objects with a certain (unknown) distribution. Estimate the mean and the covariance matrix.
  - Note: $x_i^t = \langle x_{i1}, x_{i2} \rangle$ for all $i$.
  - Estimator for the mean: $\hat{\mu}^t = \langle \bar{y}_1, \bar{y}_2 \rangle$, where $\bar{y}_1 = \frac{\sum_{i=1}^{n} x_{i1}}{n}$ and $\bar{y}_2 = \frac{\sum_{i=1}^{n} x_{i2}}{n}$.
  - Estimator for the covariance matrix:
    $$S = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}, \qquad \sigma_{ii} = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{y}_i)^2}{n} \text{ for } i = 1, 2, \qquad \sigma_{12} = \sigma_{21} = \frac{\sum_{i=1}^{n} (x_{i1} - \bar{y}_1)(x_{i2} - \bar{y}_2)}{n}.$$
- Estimators are random variables:
  - Let $X$ be any random sample containing $n$ observations; then $X$ is a (multivariate) random variable. To emphasize this, we should write $X = (X_1, \ldots, X_n)$.
  - Any estimator $f(X)$ is a random variable of $n$ arguments. Thus when we define an estimator, we should use the notation $f(X) = f(X_1, \ldots, X_n)$.
  - For instance, in Example 1 we write $\hat{\mu} = \bar{X} = \frac{\sum X_i}{n}$; the same holds for the other examples.
- Let $\hat{\theta}$ be an estimator for the parameter $\theta$. We say that $E(\hat{\theta}) - \theta$ is the bias of $\hat{\theta}$.
  - If an estimator $\hat{\theta}$ has bias 0 for every possible value of $\theta$, it is said to be unbiased.
  - Observation: an unbiased estimator shows no systematic departure from the parameter it estimates. But for any particular sample, the estimate $\hat{\theta}$ may still be far away from the true value of $\theta$.
- Example 6 (cf. Example 1): $\hat{\mu} = \bar{X} = (X_1 + \cdots + X_n)/n$. We have
  $$E(\hat{\mu}) = E\big((X_1 + \cdots + X_n)/n\big) = \big(E(X_1) + \cdots + E(X_n)\big)/n = n\mu/n = \mu.$$
  Thus $E(\hat{\mu}) - \mu = 0$ for any $\mu$, so $\hat{\mu}$ is unbiased.
- Example 7 (cf. Example 3): let the density of $\hat{\theta}_1$ be $f(x)$ for $x \in [0, \theta]$. We can show that $f(x) = \frac{n x^{n-1}}{\theta^n}$, and thus
  $$E(\hat{\theta}_1) = \int_0^\theta x \cdot \frac{n x^{n-1}}{\theta^n}\, dx = \frac{n}{n+1}\,\theta.$$
  Since $E(\hat{\theta}_1) - \theta = -\theta/(n+1) \neq 0$, $\hat{\theta}_1$ is not unbiased. But $\hat{\theta}_2$ is unbiased (intuitively clear?).
- Principle 1: among different estimators, select one that is unbiased.
- Recall $\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - E(\hat{\theta}))^2]$.
  - It measures how much our estimates will vary across different observed samples.
  - Note: the variance is taken about the estimator's own expectation $E(\hat{\theta})$, not about the true value of the parameter.
- Principle 2: among all the unbiased estimators, choose one with minimum variance.
- Example 8 (cf. Example 3 again):
  - Define the estimator $\hat{\theta}_3 = 2\bar{X} = 2\,\frac{\sum X_i}{n}$.
  - We can show that $\hat{\theta}_3$ is also unbiased.
  - But $\mathrm{Var}(\hat{\theta}_2) = \theta^2/[n(n+2)]$ and $\mathrm{Var}(\hat{\theta}_3) = \theta^2/(3n)$; when $n > 1$, $\theta^2/[n(n+2)] < \theta^2/(3n)$.
- In reality, bias and variance often compete: an estimator with small bias may have higher variance. How to balance the two is a central issue in data mining tasks. (The simulation sketch below illustrates the bias and variance of $\hat{\theta}_1$, $\hat{\theta}_2$, and $\hat{\theta}_3$.)
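Returning to Example 4, the two variance estimators correspond to NumPy's `ddof` argument; a quick check (my own sketch) on the sample given there:

```python
import numpy as np

x = np.array([44.2, 43.9, 44.7, 44.2, 44.0, 43.8, 44.6, 43.1])

var1 = x.var(ddof=0)   # divide by n:     sum((x - mean)^2) / 8  -> ~0.220
var2 = x.var(ddof=1)   # divide by n - 1: sum((x - mean)^2) / 7  -> ~0.251

print(var1, var2)
```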
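To illustrate Examples 7 and 8 empirically, here is a Monte Carlo sketch (my own construction; the true $\theta$, sample size, and trial count are arbitrary choices) that estimates the bias and variance of the three uniform-distribution estimators by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 5.0, 5, 100_000

# Each row is one random sample of size n from uniform [0, theta]
samples = rng.uniform(0, theta, size=(trials, n))

t1 = samples.max(axis=1)         # theta_hat_1 = max(x_1, ..., x_n)
t2 = (n + 1) / n * t1            # theta_hat_2 = (n+1)/n * max
t3 = 2 * samples.mean(axis=1)    # theta_hat_3 = 2 * sample mean

for name, t in [("theta_hat_1", t1), ("theta_hat_2", t2), ("theta_hat_3", t3)]:
    print(name, "bias:", round(t.mean() - theta, 3), "var:", round(t.var(), 3))

# Expected: bias(theta_hat_1) ~ -theta/(n+1) = -0.833; the other two ~ 0.
# Var(theta_hat_2) = theta^2/(n(n+2)) ~ 0.714 < Var(theta_hat_3) = theta^2/(3n) ~ 1.667.
```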
##### Maximum Likelihood Estimation

- Consider the probability $p(D \mid \theta, M)$. Once the model structure is determined, we can write it as $p(D \mid \theta)$; when viewed as a function of $\theta$, we write it as $L(\theta \mid D)$.
  - Interpretation: for any given sample $D$, each value of $\theta$ gives a probability that $D$ will arise.
- Maximum likelihood estimator (MLE): for each $D$, assign to $\theta$ the value that maximizes the likelihood function.
  - Intuition: the MLE is the value of $\theta$ under which $D$ has the highest probability of being selected.
- Example 9: estimate the proportion of the customers that purchase milk in a supermarket, based on a sample $x_1, \ldots, x_n$. Assume the number of customers purchasing milk follows a binomial distribution.
  - The proportion of customers that purchase milk is the probability that a random customer purchases milk; call it $\theta$.
  - Assume that among the $n$ customers, $r$ purchase milk. Thus
    $$L(\theta \mid n, r) = \binom{n}{r} \theta^r (1 - \theta)^{n-r},$$
    and all we need is to maximize $\theta^r (1 - \theta)^{n-r}$.
  - Differentiating with respect to $\theta$ and setting the result to zero, we get
    $$r\theta^{r-1}(1-\theta)^{n-r} - (n-r)\theta^r(1-\theta)^{n-r-1} = 0, \quad \text{i.e.,} \quad r(1-\theta) - (n-r)\theta = 0,$$
    and thus $\theta = r/n$.
  - So the proportion purchasing milk in the sample is the maximum likelihood estimate of the proportion purchasing milk in the entire customer population! (Note: we can say this only under the binomial model.)
  - Consider the following three scenarios: $n = 10, r = 7$; $n = 100, r = 70$; $n = 1000, r = 700$.
    (Figure: the likelihood curves $L(\theta)$ for the three scenarios over $\theta \in [0, 1]$; each peaks at $\theta = 0.7$, and the peak becomes sharper as $n$ grows. A grid-search check appears at the end of this subsection.)
- Example 10: assume $x_1, \ldots, x_n$ have arisen independently from a normal distribution with unit variance. Use MLE to estimate the mean $\theta$.
  - The density of each $x_i$ is $\frac{1}{\sqrt{2\pi}} e^{-(x_i - \theta)^2/2}$. Thus
    $$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-(x_i - \theta)^2/2} = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2}.$$
  - Taking the derivative with respect to $\theta$, setting it to zero, and simplifying, we get $\sum_i (x_i - \theta) = 0$, and thus $\theta = \sum_i x_i / n$ (refer to Example 1).
- Example 11: let $X$ and $Y$ be random variables related by $Y = a + bX + e$, where $a$ and $b$ are constants and $e$ is a random variable with a normal distribution with mean 0 and variance $\sigma^2$. Use MLE to estimate $a$ and $b$.
  - Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the sample. We have, for all $i$, $y_i = a + b x_i + e_i$, or $e_i = y_i - (a + b x_i)$. Thus
    $$L(a, b \mid D) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i - (a + b x_i))^2}{2\sigma^2}} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (a + b x_i))^2}.$$
  - Note: maximizing this likelihood is equivalent to minimizing $\sum_{i=1}^{n} (y_i - (a + b x_i))^2$. This is called the least squares method. (A small fitting sketch follows below.)

##### Bayesian Estimation

- Until now, our estimation has been based on the following view:
  - the parameters of the population are fixed;
  - the sample is randomly drawn from that population, and therefore has a probability distribution;
  - we search for values of the parameters that accommodate the distribution of the samples.
- Bayesian estimation takes a different view:
  - the sample is fixed (after all, it has been observed);
  - the parameters are random variables that have a probability distribution;
  - we use the sample values to gain information about the distributions of the parameters.
- Basic approach:
  - Before the sample data is analyzed, there is initial knowledge about the distribution $p(\theta)$ of the parameter $\theta$, called the prior distribution.
  - Analysis of the sample $D$ leads to a modification of the prior distribution, called the posterior distribution $p(\theta \mid D)$.
  - The modification is based on Bayes' theorem:
    $$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}.$$
  - (Examples appear with Bayes classification in later chapters.)
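As a numerical check of Example 9 (my own sketch, not from the slides), the log-likelihood can be evaluated on a grid of $\theta$ values; the argmax lands at $r/n = 0.7$ in all three scenarios, matching the figure:

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over (0, 1)

for n, r in [(10, 7), (100, 70), (1000, 700)]:
    # log of theta^r (1-theta)^(n-r); the binomial coefficient is a constant
    loglik = r * np.log(theta) + (n - r) * np.log(1 - theta)
    print(n, r, "argmax:", theta[np.argmax(loglik)])   # ~0.7 each time
```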
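For Example 11, setting the partial derivatives of the sum of squares to zero yields closed-form solutions for $a$ and $b$. A minimal sketch on synthetic data (the true values 2.0 and 0.5 are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, 50)
y = a_true + b_true * x + rng.normal(0, sigma, 50)   # y = a + b*x + e

# Closed-form least-squares solution
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

print(a_hat, b_hat)   # should be close to 2.0 and 0.5
```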
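To make the Bayesian view concrete, a sketch reusing Example 9's milk-purchase setting (my own illustration; the slides defer examples to later chapters): with a Beta prior on $\theta$, Bayes' theorem gives a Beta posterior in closed form, because the Beta family is conjugate to the binomial likelihood.

```python
# Prior: theta ~ Beta(alpha, beta); Beta(1, 1) is the uniform prior on [0, 1]
alpha, beta = 1.0, 1.0

# Observed sample D: r of n customers purchased milk (the n = 10, r = 7 scenario)
n, r = 10, 7

# Posterior: p(theta | D) = Beta(alpha + r, beta + n - r)
alpha_post, beta_post = alpha + r, beta + n - r

post_mean = alpha_post / (alpha_post + beta_post)
print(post_mean)   # 8/12 ~ 0.667, between the MLE 0.7 and the prior mean 0.5
```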
##### Mixtures of Models

- A dataset may be better described by more than one distribution.
- A mixture density for attribute $x$ is defined as
  $$f(x) = \sum_{k=1}^{K} \pi_k f_k(x \mid \theta_k),$$
  where:
  - the data set is subdivided into $K$ classes (or components);
  - the $k$-th class has a density of $f_k(x \mid \theta_k)$;
  - $\pi_k$ is the probability that a randomly selected object belongs to the $k$-th class (note: $\sum_k \pi_k = 1$).
- Interpretation for a mixture of normal distributions:
  - a single normal distribution can be thought of as being located at its mean;
  - its variance (covariance matrix) determines its shape;
  - the components of the dataset thus have different locations and shapes.
- Example 12: a 1D (or 2D) data set consists of three classes, each having a normal distribution with a certain weight. (Figure: the data set, the three normal densities with each contour $3\sigma$ away from the corresponding mean, and the contours of the overall probability distribution. A sampling sketch follows at the end of this subsection.)
- Example 13:
  - The entire population of customers of a supermarket can be grouped into different types, such as Asian and English.
  - Different types of customers have different purchasing characteristics:
    - people in the Asian group are more likely than the English group to buy rice, tea, and vegetables (such as bok choy and napa cabbage);
    - people in the English group are more likely than the Asian group to buy potatoes, coffee, and cheese.
  - The probability distribution of the buying habits can therefore be better represented by a mixture model of two densities, one for each group.
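Finally, a sketch of Example 12's one-dimensional setting (the weights, means, and standard deviations are my own arbitrary choices): the mixture density is evaluated directly from the formula above, and sampling proceeds by first picking a component with probability $\pi_k$ and then drawing from that component.

```python
import numpy as np

weights = np.array([0.5, 0.3, 0.2])   # pi_k, summing to 1
means   = np.array([0.0, 4.0, 9.0])   # component means
sds     = np.array([1.0, 1.5, 0.5])   # component standard deviations

def normal_pdf(x, m, s):
    # density of a normal distribution with mean m and std dev s (cf. Example 10)
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

def mixture_pdf(x):
    # f(x) = sum_k pi_k * f_k(x | theta_k)
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Draw 1000 points: choose a component, then sample from that normal
rng = np.random.default_rng(2)
k = rng.choice(3, size=1000, p=weights)
data = rng.normal(means[k], sds[k])

print(mixture_pdf(0.0), data.mean())
```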