#### Transcript Numerical Descriptive Measures

Chapter 4 Numerical Descriptive Techniques

Introduction
Recall Chapter 2, where we used graphical techniques to describe data. While this histogram provides some new insight, other interesting questions (e.g. what is the class average? what is the mark spread?) go unanswered.

2007 Accounting Information Systems, Statistics (I) lecture slides

Numerical Descriptive Techniques
• Measures of Central Location（中央位置）: Mean, Median, Mode
• Measures of Variability（離散程度）: Range, Standard Deviation, Variance, Coefficient of Variation
• Measures of Relative Standing（相對位置）: Percentiles, Quartiles
• Measures of Linear Relationship（線性關係）: Covariance, Correlation, Least Squares Line

4.1 Measures of Central Location
Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g. average)
• Variability or spread

The measure of central location reflects the locations of all the actual data points. How?
• With one data point, the central location is clearly at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• But if the third data point appears on the left hand-side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean（算術平均數）
This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$

Notation
• When referring to the number of observations in a population, we use the uppercase letter N.
• When referring to the number of observations in a sample, we use the lowercase letter n.
• The arithmetic mean for a population is denoted with the Greek letter “mu”: μ（母體平均數）.
• The arithmetic mean for a sample is denoted with an “x-bar”: x̄（樣本平均數）.

Statistics is a pattern language

| | Population | Sample |
|---|---|---|
| Size | N | n |
| Mean | μ | x̄ |

The Arithmetic Mean
Sample mean (n is the sample size):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean (N is the population size):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

• Example 4.1 The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$$

• Example 4.2 Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is

$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$$

The Arithmetic Mean
• Additional Example When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees was recorded as follows:

| Number of children | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Number of employees | 3 | 4 | 7 | 2 |

The frequencies add up to 3 + 4 + 7 + 2 = 16 employees.

$$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$$

The Arithmetic Mean
• …is appropriate for describing measurement data, e.g. heights of people, marks of student papers, etc.
• …is seriously affected by extreme values called “outliers”. E.g. as soon as a billionaire moves into a neighborhood, the average household income increases beyond what it was previously!
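The arithmetic in Example 4.1 and in the frequency-table example can be checked with a few lines of Python. This is only a minimal sketch: the function names `arithmetic_mean` and `mean_from_frequencies` are illustrative, and the extra value 500 used to show the effect of an outlier is hypothetical, not part of the slides.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def mean_from_frequencies(freq):
    """Mean of data summarized as a {value: count} frequency table."""
    n = sum(freq.values())
    return sum(value * count for value, count in freq.items()) / n


# Example 4.1: hours on the Internet reported by 10 adults
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
mean_hours = arithmetic_mean(hours)

# Additional example: number of children -> number of employees
children = {0: 3, 1: 4, 2: 7, 3: 2}
mean_children = mean_from_frequencies(children)

# Hypothetical outlier (500 hours) to show how one extreme value moves the mean
mean_with_outlier = arithmetic_mean(hours + [500])

print(mean_hours, mean_children, mean_with_outlier)
```

The frequency-table version avoids expanding the table into 16 separate observations, which is the same shortcut the slide uses.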
The Median（中位數）
The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.

Example 4.3 Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the midpoint of the two middle values, 8 and 9, that is 8.5.

Comment Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value, 8.

The Mode（眾數）
The Mode of a set of observations is the value that occurs most frequently. A set of data may have one mode (or modal class), or two or more modes. For large data sets the modal class is much more relevant than a single-value mode.
[Figure: histogram with the modal class marked]

The Mode
Example 4.5 Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution All observations except “0” occur once. There are two “0”s. Thus, the mode is zero. Is this a good measure of central location? The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).

The Mode
Additional example The manager of a men’s store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40. The mode of this data set is 34 in. This information seems to be valuable (for example, for the design of a new display in the store), much more than “the median is 33.5 in.”

Measures of Central Location
• The mode is useful for all data types, though it is mainly used for nominal data.
• ※ Sample and population modes are computed the same way.
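The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a sketch rather than part of the original slides; `multimode` (available from Python 3.8) is used because it returns every mode, which matters for multi-modal data.

```python
from statistics import median, multimode

# Example 4.1 / 4.3 / 4.5: hours on the Internet
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

median_even = median(hours)               # 10 values: midpoint of 8 and 9
median_odd = median(sorted(hours)[:-1])   # drop the longest time (33): 9 values
modes_hours = multimode(hours)            # list of all most frequent values

# Additional example: waist sizes of trousers sold
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]
modes_waists = multimode(waists)
median_waists = median(waists)

print(median_even, median_odd, modes_hours, modes_waists, median_waists)
```

For the waist sizes the mode (34) is the more actionable figure, as the slide notes, even though the median (33.5) sits nearer the middle of the data.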
=MODE(range) in Excel
Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one. You will have to use other techniques (e.g. a histogram) to determine if your data is bimodal, trimodal, etc.

The Mean, Median and Mode
Additional example A professor of statistics wants to report the results of a midterm exam, taken by 100 students.
• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84
Describe the information each one provides.
• The mean provides information about the over-all performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.
• The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his mark relative to other students in the class.
• The mode must be used when data are nominal. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution (“skewed to the right”) with the mode, median and mean marked]
[Figure: a positively skewed distribution (“skewed to the right”) and a negatively skewed distribution (“skewed to the left”), each with the mode, median and mean marked]

Mean, Median, Mode
• If data are symmetric, the mean, median, and mode will be approximately the same.
• If data are multimodal, report the mean, median and/or mode for each subgroup.
• If data are skewed, report the median.

Mean, Median, & Modes for Ordinal & Nominal Data
• For ordinal and nominal data the calculation of the mean is not valid.
• Median is appropriate for ordinal data.
• For nominal data, a mode calculation is useful for determining highest frequency but not “central location”.

The Geometric Mean（幾何平均數）
This is a measure of the average growth rate. Let Ri denote the rate of return in period i (i = 1, 2, …, n). The geometric mean of the returns R1, R2, …, Rn is the constant Rg that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.

The Geometric Mean
For the given series of rates of return, the nth period return is calculated by:

$$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$$

If the rate of return was Rg in every period, the nth period return would be calculated by:

$$(1 + R_g)^n$$

Rg is selected such that

$$(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$$

$$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$$

Finance Example
Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in the first year, but loses 50% from $2,000 back to the original $1,000 in the second year. What is your average return?
Using the arithmetic mean, we have

$$\frac{100\% + (-50\%)}{2} = 25\%$$

This would indicate we should have $1,250 at the end of our investment, not $1,000.
Solving for the geometric mean yields a rate of 0%, which is correct:

$$R_g = \sqrt[n]{\prod_{i=1}^{n} (1 + R_i)} - 1 = \sqrt{(1 + 1)(1 - 0.5)} - 1 = 0$$

The upper case Greek letter “Pi” represents a product of terms.

The Geometric Mean
Additional Example A firm’s sales were $1,000,000 three years ago. Sales have grown annually by 20%, 10%, -5%.
Find the geometric mean rate of growth in sales.
Solution Since Rg is the geometric mean,

$$(1 + R_g)^3 = (1 + .2)(1 + .1)(1 - .05) = 1.2540$$

Thus,

$$R_g = \sqrt[3]{(1 + .2)(1 + .1)(1 - .05)} - 1 = .0784, \text{ or } 7.84\%$$

Measures of Central Location: Summary
• Compute the Mean to describe the central location of a single set of interval data.
• Compute the Median to describe the central location of a single set of interval or ordinal data.
• Compute the Mode to describe a single set of nominal data.
• Compute the Geometric Mean to describe a single set of interval data based on growth rates.

4.2 Measures of variability
Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered: How much are the observations spread out around the mean value?

Observe two hypothetical data sets:
• Small variability: the average value provides a good representation of the observations in the data set.
• Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.

Measures of Variability
For example, two sets of class grades are shown. The mean (=50) is the same in each case… But, the red class has greater variability than the blue class.
[Figure: histograms of the two sets of class grades]

The range（全距）
The range of a set of observations is the difference between the largest and smallest observations.
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
[Figure: the range spans from the smallest observation to the largest observation. But how do all the observations spread out? The range cannot assist in answering this question.]

Range
The range is the simplest measure of variability, calculated as:
Range = Largest observation – Smallest observation
E.g.
Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
The range is the same in both cases, but the data sets have very different distributions…

Hence we need a measure of variability that incorporates all the data and not just two observations.

Variance（變異數）
Variance and its related measure, standard deviation, are arguably the most important statistics. Used to measure variability, they also play a vital role in almost all statistical inference procedures.
• Population variance is denoted by σ²（母體變異數）(lower case Greek letter “sigma” squared).
• Sample variance is denoted by s²（樣本變異數）(lower case “S” squared).

Statistics is a pattern language

| | Population | Sample |
|---|---|---|
| Size | N | n |
| Mean | μ | x̄ |
| Variance | σ² | s² |

The Variance
This measure reflects the dispersion of all the observations.
The variance of a population of size N, x1, x2, …, xN, whose mean is μ is defined as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

The variance of a sample of n observations x1, x2, …, xn whose mean is x̄ is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The Variance
Example 4.7 The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13.
Find its mean and variance.
Solution

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17 - 14)^2 + (15 - 14)^2 + \dots + (13 - 14)^2\right] = 33.2 \text{ jobs}^2$$

The Variance: Shortcut method

$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{6 - 1}\left[\left(17^2 + 15^2 + \dots + 13^2\right) - \frac{(17 + 15 + \dots + 13)^2}{6}\right] = 33.2 \text{ jobs}^2$$

Why not use the sum of deviations?
Consider two small populations:
• A: 8, 9, 10, 11, 12. Deviations from the mean: 8-10 = -2, 9-10 = -1, 11-10 = +1, 12-10 = +2. Sum = 0
• B: 4, 7, 10, 13, 16. Deviations from the mean: 4-10 = -6, 7-10 = -3, 13-10 = +3, 16-10 = +6. Sum = 0
The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion? The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.

The Variance
Let us calculate the variance of the two populations:

$$\sigma_A^2 = \frac{(8 - 10)^2 + (9 - 10)^2 + (10 - 10)^2 + (11 - 10)^2 + (12 - 10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4 - 10)^2 + (7 - 10)^2 + (10 - 10)^2 + (13 - 10)^2 + (16 - 10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!!

The Variance
Which data set has a larger dispersion? Data set B is more dispersed around the mean.
[Figure: data set A has five observations at 1 and five at 3 (mean 2); data set B has one observation at 1 and one at 5 (mean 3)]
Let us calculate the sum of squared deviations for both data sets.

SumA = (1-2)² + … + (1-2)² + (3-2)² + … + (3-2)² = 10
SumB = (1-3)² + (5-3)² = 8
SumA > SumB. This is inconsistent with the observation that set B is more dispersed.

However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked.
σA² = SumA/N = 10/10 = 1
σB² = SumB/N = 8/2 = 4

Standard Deviation（標準差）
The standard deviation of a set of observations is the square root of the variance.

$$\text{Sample standard deviation: } s = \sqrt{s^2}$$

$$\text{Population standard deviation: } \sigma = \sqrt{\sigma^2}$$

Statistics is a pattern language

| | Population | Sample |
|---|---|---|
| Size | N | n |
| Mean | μ | x̄ |
| Variance | σ² | s² |
| Standard Deviation | σ | s |

Standard Deviation
Example 4.8 To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club. The distances were recorded. Which 7-iron is more consistent?

Example 4.8: solution
Excel printout, from the “Descriptive Statistics” sub-menu:

| | Current | Innovation |
|---|---|---|
| Mean | 150.5467 | 150.1467 |
| Standard Error | 0.668815 | 0.357011 |
| Median | 151 | 150 |
| Mode | 150 | 149 |
| Standard Deviation | 5.792104 | 3.091808 |
| Sample Variance | 33.54847 | 9.559279 |
| Kurtosis | 0.12674 | -0.88542 |
| Skewness | -0.42989 | 0.177338 |
| Range | 28 | 12 |
| Minimum | 134 | 144 |
| Maximum | 162 | 156 |
| Sum | 11291 | 11261 |
| Count | 75 | 75 |

The innovation club is more consistent, and because the means are close, it is considered a better club.

Standard Deviation
Additional Example
• Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4

Solution
Let us use the Excel printout that is run from the “Descriptive statistics” submenu.
Fund A should be considered riskier because its standard deviation is larger.

| | Fund A | Fund B |
|---|---|---|
| Mean | 16 | 12 |
| Standard Error | 5.295 | 3.152 |
| Median | 14.6 | 11.75 |
| Mode | #N/A | #N/A |
| Standard Deviation | 16.74 | 9.969 |
| Sample Variance | 280.3 | 99.37 |
| Kurtosis | -1.34 | -0.46 |
| Skewness | 0.217 | 0.107 |
| Range | 49.1 | 30.6 |
| Minimum | -6.2 | -2.8 |
| Maximum | 42.9 | 27.8 |
| Sum | 160 | 120 |
| Count | 10 | 10 |

Interpreting Standard Deviation
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
The empirical rule（經驗法則）: If a sample of observations has a bell-shaped distribution, the interval
• (x̄ - s, x̄ + s) contains approximately 68% of the measurements
• (x̄ - 2s, x̄ + 2s) contains approximately 95% of the measurements
• (x̄ - 3s, x̄ + 3s) contains approximately 99.7% of the measurements

The Empirical Rule
• Approximately 68% of all observations fall within one standard deviation of the mean.
• Approximately 95% of all observations fall within two standard deviations of the mean.
• Approximately 99.7% of all observations fall within three standard deviations of the mean.

Interpreting Standard Deviation
Example 4.9 A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.
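Looking back at the variance and standard deviation calculations in Example 4.7, the two small populations A and B, and the mutual fund comparison, the following Python sketch reproduces them with the standard `statistics` module. The variable names are illustrative. The Fund A figures may differ slightly from the Excel printout if the transcribed data contain a typo, so only the comparison between the funds is relied on here.

```python
from statistics import mean, pvariance, stdev, variance

# Example 4.7: number of jobs six students applied for
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)
x_bar = mean(jobs)

# Sample variance from the definition (divide by n - 1)
s2_definition = sum((x - x_bar) ** 2 for x in jobs) / (n - 1)
# Sample variance from the shortcut formula
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)
# Library equivalent
s2_library = variance(jobs)

# Two small populations: population variance divides by N
pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]
var_a = pvariance(pop_a)
var_b = pvariance(pop_b)

# Mutual fund example: sample standard deviation as a measure of risk
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]
sd_a = stdev(fund_a)
sd_b = stdev(fund_b)

print(s2_definition, s2_shortcut, s2_library, var_a, var_b, sd_a, sd_b)
```

The definition and the shortcut give the same sample variance; the shortcut only rearranges the sum so the mean does not have to be computed first.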
Interpreting Standard Deviation
Example 4.9: solution
The empirical rule can be applied (bell shaped histogram). Describing the return distribution:
• Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

Chebysheff’s Theorem（柴比氏定理）
A more general interpretation of the standard deviation is derived from Chebysheff’s Theorem, which applies to all shapes of histograms (not just bell shaped).
The proportion of observations in any sample that lie within k standard deviations of the mean is at least:

$$1 - \frac{1}{k^2} \quad \text{for } k > 1$$

For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to the Empirical Rule’s approximation (95%).
This theorem is valid for any set of measurements (sample, population) of any shape!!

| k | Interval | Chebysheff | Empirical Rule |
|---|---|---|---|
| 1 | (x̄ - s, x̄ + s) | at least 0% (1 - 1/1²) | approximately 68% |
| 2 | (x̄ - 2s, x̄ + 2s) | at least 75% (1 - 1/2²) | approximately 95% |
| 3 | (x̄ - 3s, x̄ + 3s) | at least 89% (1 - 1/3²) | approximately 99.7% |

Chebysheff’s Theorem
Example 4.10 The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
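The intervals of Example 4.9 and the Chebysheff column of the table above can be sketched in Python as follows. The function names are illustrative, and returning 0 for k ≤ 1 is an assumption made here to match the "at least 0%" row of the table; the theorem itself is stated for k > 1.

```python
def k_sigma_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd)."""
    return (mean - k * sd, mean + k * sd)


def chebysheff_lower_bound(k):
    """Minimum proportion within k standard deviations: 1 - 1/k^2.

    For 0 < k <= 1 the bound is trivial (assumed to be 0 here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


# Approximate proportions from the empirical rule (bell-shaped data only)
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

# Example 4.9: mean return 10%, standard deviation 8%
intervals = {k: k_sigma_interval(10, 8, k) for k in (1, 2, 3)}
bounds = {k: chebysheff_lower_bound(k) for k in (1, 2, 3)}

for k in (1, 2, 3):
    print(k, intervals[k], f"Chebysheff >= {bounds[k]:.3f}",
          f"empirical ~ {EMPIRICAL_RULE[k]}")
```

The Chebysheff bound is always weaker than the empirical rule's figure because it makes no assumption about the shape of the histogram.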
Solution
• At least 75% of the salaries lie between $22,000 and $34,000 [28000 – 2(3000), 28000 + 2(3000)]
• At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 – 3(3000), 28000 + 3(3000)]

Coefficient of Variation（變異係數）
The coefficient of variation of a set of observations is the standard deviation of the observations divided by their mean, that is:

$$\text{Population coefficient of variation: } CV = \frac{\sigma}{\mu}$$

$$\text{Sample coefficient of variation: } cv = \frac{s}{\bar{x}}$$

Statistics is a pattern language

| | Population | Sample |
|---|---|---|
| Size | N | n |
| Mean | μ | x̄ |
| Variance | σ² | s² |
| Standard Deviation | σ | s |
| Coefficient of Variation | CV | cv |

Coefficient of Variation
This coefficient provides a proportionate measure of variation, e.g. a standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.

Measures of Variability
• If data are symmetric, with no serious outliers, use range and standard deviation.
• If comparing variation across two data sets, use coefficient of variation.
• The measures of variability introduced in this section can be used only for interval data.

4.3 Measures of Relative Standing and Box Plots
Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.
Percentile（百分位數）
The pth percentile of a set of measurements is the value for which
• p percent of the observations are less than that value
• (100-p) percent of all the observations are greater than that value.
Example
• Suppose your score is the 60th percentile of a SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

Quartiles（四分位數）
We have special names for the 25th, 50th, and 75th percentiles, namely quartiles.
• The first or lower quartile is labeled Q1 = 25th percentile.
• The second quartile, Q2 = 50th percentile (which is also the median).
• The third or upper quartile, Q3 = 75th percentile.
We can also convert percentiles into quintiles (fifths) and deciles (tenths).

Commonly Used Percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile
Note: If your exam mark places you in the 80th percentile, that doesn’t mean you scored 80% on the exam. It means that 80% of your peers scored lower than you on the exam; it’s about your position relative to others.

Quartiles
Example Find the quartiles of the following set of measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Quartiles: Solution
Sort the observations (15 observations): 2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30
The first quartile:
• At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side.
• At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right hand side.
• The one observation left unchecked, the fourth (5), is the first quartile.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

Location of Percentiles
Find the location of any percentile using the formula

$$L_P = (n + 1)\frac{P}{100}$$

where L_P is the location of the Pth percentile.
Example 4.11 Calculate the 25th, 50th, and 75th percentile of the data in Example 4.1.

Location of Percentiles
Example 4.11: solution
After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
$$L_{25} = (10 + 1)\frac{25}{100} = 2.75$$

The value at location 2 is 0 and the value at location 3 is 5. The 2.75th location translates to the value 0 + (.75)(5 – 0) = 3.75.

Example 4.11: solution continued

$$L_{50} = (10 + 1)\frac{50}{100} = 5.5$$

The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.

$$L_{75} = (10 + 1)\frac{75}{100} = 8.25$$

The 75th percentile is one quarter of the distance between the eighth and ninth observations, that is 14 + .25(22 – 14) = 16.

Location of Percentiles
Please remember…
[Figure: sorted data 0 0 | 5 7 8 9 12 14 | 22 33, with position 2.75 marking the value 3.75 and position 8.25 marking the value 16]
L_P determines the position in the data set where the percentile value lies, not the value of the percentile itself.

Quartiles and Variability
Quartiles can provide an idea about the shape of a histogram.
[Figure: positions of Q1, Q2 and Q3 in a positively skewed histogram and in a negatively skewed histogram]

Interquartile Range（四分位距）
The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:
Interquartile range = Q3 – Q1
• This is a measure of the spread of the middle 50% of the observations.
• A large value indicates a large spread of the observations.

Box Plot（箱形圖、盒鬚圖）
This is a pictorial display that provides the main descriptive measures of the data set:
• L - the largest observation
• Q3 - the upper quartile
• Q2 - the median
• Q1 - the lower quartile
• S - the smallest observation
[Figure: a box from Q1 to Q3 with Q2 marked inside; whiskers extend toward S and L, each at most 1.5(Q3 – Q1) beyond the box]

Box Plot
Example 4.14 (Xm02-01)
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, …
Smallest = 0
Q1 = 9.275
Median = 26.905
Q3 = 84.9425
Largest = 119.63
IQR = 75.6675
Outliers = ()
Left hand boundary = 9.275 – 1.5(IQR) = -104.226
Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438
[Figure: box plot with boundaries at -104.226 and 198.4438, whiskers from 0 to 119.63, and the box from 9.275 to 84.9425 with the median at 26.905]
No outliers are found.

Box Plot
Additional Example: GMAT scores
Create a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS).
GMAT: 512, 531, 461, 515, …
Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)
Left hand boundary = 512 – 1.5(IQR) = 417.5
Right hand boundary = 575 + 1.5(IQR) = 669.5
[Figure: box plot with the box from 512 to 575, the median at 537, and whiskers from 449 to 669.5; 25% of the scores lie below Q1, 50% between Q1 and Q3, and 25% above Q3]

Box Plot
GMAT: continued
Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lies below 512 and a quarter above 575.
• The histogram is positively skewed.

Box Plot
Example 4.15 (Xm04-15) A study was organized to compare the quality of service in 5 drive through restaurants. Interpret the results.
Example 4.15: solution
[Figure: Minitab box plots of service times (C6, axis from 100 to 300) by restaurant (C7): 1 Popeyes, 2 Wendy’s, 3 McDonalds, 4 Hardee’s, 5 Jack in the Box. One annotation reads “Times are symmetric”.]
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the shortest and most consistent.
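The percentile-location formula from Example 4.11 and the 1.5 × IQR boundaries used in the box plot examples can be sketched in Python as below. The function names are illustrative, and clamping locations that fall outside the data to the smallest or largest observation is an assumption added here, since the slides do not cover that case.

```python
def percentile_location(n, p):
    """Location L_P = (n + 1) * P / 100 of the P-th percentile."""
    return (n + 1) * p / 100


def percentile(values, p):
    """Value at location L_P, interpolating between neighbouring observations."""
    xs = sorted(values)
    n = len(xs)
    loc = percentile_location(n, p)
    lower = int(loc)           # whole part of the location (1-based)
    frac = loc - lower         # fractional part
    if lower < 1:              # assumption: clamp to the smallest observation
        return xs[0]
    if lower >= n:             # assumption: clamp to the largest observation
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


def outlier_boundaries(q1, q3):
    """Left and right boundaries Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)


# Example 4.11: data of Example 4.1
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
p25, p50, p75 = (percentile(hours, p) for p in (25, 50, 75))

# Boundaries for the GMAT and telephone bill box plots
gmat_bounds = outlier_boundaries(512, 575)
bills_bounds = outlier_boundaries(9.275, 84.9425)

print(p25, p50, p75, gmat_bounds, bills_bounds)
```

Other software may use a different location rule (for example one based on n - 1), so quartiles from a spreadsheet or statistics package can differ slightly from these values.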
[Box plot figure, continued: a second annotation reads “Times are positively skewed”.]

4.4 Measures of Linear Relationship
We now present two numerical measures of linear relationship that provide information as to the strength & direction of a linear relationship between two variables (if one exists). They are the covariance and the coefficient of correlation.
• Covariance（共變數）: is there any pattern to the way two variables move together?
• Coefficient of correlation（相關係數）: how strong is the linear relationship between two variables?

Covariance（共變數）

$$\text{Population covariance: } COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$

μx (μy) is the population mean of the variable X (Y). N is the population size.

$$\text{Sample covariance: } cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

x̄ (ȳ) is the sample mean of the variable X (Y). n is the sample size.

Covariance
In much the same way there was a “shortcut” for calculating sample variance without having to calculate the sample mean, there is also a shortcut for calculating sample covariance without having to first calculate the means:

$$cov(x, y) = \frac{1}{n - 1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$$

Statistics is a pattern language

| | Population | Sample |
|---|---|---|
| Size | N | n |
| Mean | μ | x̄ |
| Variance | σ² | S² |
| Standard Deviation | σ | S |
| Coefficient of Variation | CV | cv |
| Covariance | σxy | Sxy |

Covariance
Compare the following three sets (x̄ = 5 and ȳ = 20 in each).

Set 1: Cov(x,y) = 17.5

| xi | yi | (x – x̄) | (y – ȳ) | (x – x̄)(y – ȳ) |
|---|---|---|---|---|
| 2 | 13 | -3 | -7 | 21 |
| 6 | 20 | 1 | 0 | 0 |
| 7 | 27 | 2 | 7 | 14 |

Set 2: Cov(x,y) = -17.5

| xi | yi | (x – x̄) | (y – ȳ) | (x – x̄)(y – ȳ) |
|---|---|---|---|---|
| 2 | 27 | -3 | 7 | -21 |
| 6 | 20 | 1 | 0 | 0 |
| 7 | 13 | 2 | -7 | -14 |

Set 3: Cov(x,y) = -3.5

| xi | yi |
|---|---|
| 2 | 20 |
| 6 | 27 |
| 7 | 13 |

Covariance Illustrated
Consider the following three sets of data (textbook §4.5). In each set, the values of X are the same, and the values of Y are the same; the only thing that’s changed is the order of the Y’s.
• In set #1, as X increases so does Y; Sxy is large & positive.
• In set #2, as X increases, Y decreases; Sxy is large & negative.
• In set #3, as X increases, Y doesn’t move in any particular way; Sxy is “small”.

Covariance (Generally speaking)
• When two variables move in the same direction (both increase or both decrease), the covariance will be a large positive number.
• When two variables move in opposite directions, the covariance is a large negative number.
• When there is no particular pattern, the covariance is a small number（close to zero）.

Covariance
[Figure: the X-Y plane divided into four quadrants at the point (x̄, ȳ)]
• Quadrant Ⅰ: (x – x̄) > 0 and (y – ȳ) > 0, so points here push COV(X, Y) > 0
• Quadrant Ⅱ: (x – x̄) < 0 and (y – ȳ) > 0, so points here push COV(X, Y) < 0
• Quadrant Ⅲ: (x – x̄) < 0 and (y – ȳ) < 0, so points here push COV(X, Y) > 0
• Quadrant Ⅳ: (x – x̄) > 0 and (y – ȳ) < 0, so points here push COV(X, Y) < 0
[Figure: three scatter plots illustrating COV(X, Y) > 0, COV(X, Y) < 0 and COV(X, Y) ≈ 0]

The coefficient of correlation（相關係數）

$$\text{Population coefficient of correlation: } \rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$$

(ρ is the Greek letter “rho”.)

$$\text{Sample coefficient of correlation: } r = \frac{cov(X, Y)}{s_x s_y}$$

This coefficient answers the question: How strong is the association between X and Y?

Statistics is a pattern language

| | Population | Sample |
|---|---|---|
| Size | N | n |
| Mean | μ | x̄ |
| Variance | σ² | S² |
| Standard Deviation | σ | S |
| Coefficient of Variation | CV | cv |
| Covariance | σxy | Sxy |
| Coefficient of Correlation | ρ | r |

Coefficient of Correlation
The advantage of the coefficient of correlation over covariance is that it has a fixed range from -1 to +1, thus:
• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.
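The three small data sets from the covariance comparison make a convenient check of the sample covariance and correlation formulas above. The Python sketch below uses illustrative function names and computes everything from the definitions rather than a library routine.

```python
import math


def sample_covariance(xs, ys):
    """Sum of (x - x_bar)(y - y_bar) divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r = cov(x, y) / (s_x * s_y); the variance is the covariance of a variable with itself."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


x = [2, 6, 7]
y_sets = {
    1: [13, 20, 27],   # Y rises with X
    2: [27, 20, 13],   # Y falls as X rises
    3: [20, 27, 13],   # no particular pattern
}

covariances = {k: sample_covariance(x, y) for k, y in y_sets.items()}
correlations = {k: sample_correlation(x, y) for k, y in y_sets.items()}

for k in y_sets:
    print(k, covariances[k], round(correlations[k], 4))
```

Unlike the covariance, the correlation does not change with the units of X and Y, which is why it is easier to judge as "strong" or "weak".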
Coefficient of Correlation
[Figure: scale for ρ or r]
• +1: strong positive linear relationship, COV(X,Y) > 0
• 0: no linear relationship, COV(X,Y) = 0
• -1: strong negative linear relationship, COV(X,Y) < 0

The coefficient of correlation and the covariance
Example 4.16 Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
Solution
We believe GMAT affects GPA. Thus
• GMAT is labeled X
• GPA is labeled Y

| Student | x | y | x² | y² | xy |
|---|---|---|---|---|---|
| 1 | 599 | 9.6 | 358801 | 92.16 | 5750.4 |
| 2 | 689 | 8.8 | 474721 | 77.44 | 6063.2 |
| 3 | 584 | 7.4 | 341056 | 54.76 | 4321.6 |
| 4 | 631 | 10 | 398161 | 100 | 6310 |
| … | … | … | … | … | … |
| 11 | 593 | 8.8 | 351649 | 77.44 | 5218.4 |
| 12 | 683 | 8 | 466489 | 64 | 5464 |
| Total | 7,587 | 106.4 | 4,817,755 | 957.2 | 67,559.2 |

Shortcut formulas:

$$cov(x, y) = \frac{1}{n - 1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]$$

$$s^2 = \frac{1}{n - 1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]$$

Calculations:

$$cov(x, y) = \frac{1}{12 - 1}\left[67{,}559.2 - \frac{(7587)(106.4)}{12}\right] = 26.16$$

$$s_x = \left\{\frac{1}{12 - 1}\left[4{,}817{,}755 - \frac{(7587)^2}{12}\right]\right\}^{.5} = 43.56$$

$$s_y = 1.12 \quad \text{(computed in the same way as } s_x\text{)}$$

$$r = \frac{cov(x, y)}{s_x s_y} = \frac{26.16}{(43.56)(1.12)} = .5362$$

The coefficient of correlation and the covariance
Example 4.16: Excel
Use the Covariance option in Data Analysis. If your version of Excel returns the population covariance and variances, multiply each one by n/(n-1) to obtain the corresponding sample values. Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix

| | Population values | Sample values (× 12/(12-1)) |
|---|---|---|
| GPA, GPA | 1.15 | 1.25 |
| GMAT, GPA | 23.98 | 26.16 |
| GMAT, GMAT | 1739.52 | 1897.66 |

Example 4.16: Excel Interpretation
The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related. The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA. (The hand calculation gives .5362 because it uses rounded values of the covariance and standard deviations.)
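Because only the column totals of Example 4.16 are fully available here, the shortcut formulas can be applied directly to those totals. The following Python sketch does that; the variable names are illustrative, and the results are compared against the rounded figures quoted in the example.

```python
import math

# Column totals from the Example 4.16 table (n = 12 students)
n = 12
sum_x = 7587          # GMAT
sum_y = 106.4         # GPA
sum_x2 = 4_817_755
sum_y2 = 957.2
sum_xy = 67_559.2

# Shortcut formulas for the sample covariance and sample variances
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

s_x = math.sqrt(var_x)
s_y = math.sqrt(var_y)
r = cov_xy / (s_x * s_y)

# Converting between sample and population values: multiply or divide by n/(n-1)
pop_cov_xy = cov_xy * (n - 1) / n
pop_var_x = var_x * (n - 1) / n

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
print(round(pop_cov_xy, 2), round(pop_var_x, 2))
```

Carrying full precision through the calculation gives r close to the Excel value; rounding the intermediate results first shifts the last digits slightly.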
Least Squares Method（最小平方法）
Recall, the slope-intercept equation for a line is expressed in these terms:
y = mx + b
Where:
• m is the slope of the line
• b is the y-intercept.
If we’ve determined there is a linear relationship between two variables with covariance and the coefficient of correlation, can we determine a linear function of the relationship?

The Least Squares Method
…produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimized. This line is represented by the equation:

$$\hat{y} = b_0 + b_1 x$$

b0 (“b” naught) is the y-intercept, b1 is the slope, and ŷ (“y” hat) is the value of y determined by the line.

The Least Squares Method
[Figure: scatter of points with candidate lines and the vertical errors between the points and each line]
Different lines generate different errors, thus different sums of squares of errors. There is a line that minimizes the sum of squared errors.

The Least Squares Method
We are seeking a line that best fits the data when two variables are (presumably) related to one another. We define the “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.

$$\text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here y_i is the actual y value of point i, and ŷ_i is the y value of point i calculated from the equation

$$\hat{y}_i = b_0 + b_1 x_i$$

The Least Squares Method
The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.

$$b_1 = \frac{cov(x, y)}{s_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

$$\text{where } \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \text{ and } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The Least Squares Method
Example 4.17 Find the least squares line for Example 4.16 (Xm04-16.xls).

$$b_1 = \frac{cov(x, y)}{s_x^2} = \frac{26.16}{1897.66} = .0138$$

$$\bar{x} = \frac{\sum x_i}{n} = \frac{7{,}587}{12} = 632.25$$

$$\bar{y} = \frac{\sum y_i}{n} = \frac{106.4}{12} = 8.87$$

$$b_0 = \bar{y} - b_1 \bar{x} = 8.87 - (.0138)(632.25) = .145$$

[Scatter diagram: GPA (about 6 to 12) against GMAT (500 to 800) with the fitted line y = 0.1496 + 0.0138x. The intercept 0.1496 is computed from unrounded values, which is why it differs slightly from .145.]
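The coefficients of Example 4.17 can be recomputed from the column totals of Example 4.16, and the same formulas can be wrapped in a small function for raw data. The Python sketch below is illustrative only: `least_squares_line` is a hypothetical helper name, and the three-point data set used to exercise it is borrowed from the covariance comparison earlier in the chapter.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors of y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx               # same as cov(x, y) / s_x^2
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Example 4.17 from the totals of Example 4.16 (n = 12)
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_xy = 4_817_755, 67_559.2

cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
b1 = cov_xy / var_x
b0 = sum_y / n - b1 * sum_x / n

# Raw-data version on a tiny data set (set 1 of the covariance comparison)
b0_small, b1_small = least_squares_line([2, 6, 7], [13, 20, 27])

print(round(b0, 4), round(b1, 4), b0_small, b1_small)
```

Keeping full precision for b1 before computing b0 reproduces the intercept shown on the scatter diagram rather than the rounded hand calculation.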