#### Transcript Lecture 2 - notes - for Dr. Jason P. Turner

Descriptive Measures, MARE 250, Dr. Jason Turner

##### Descriptive Measures

Descriptive measures are numbers that are used to describe datasets. They are part of descriptive statistics and are used to summarize raw data.

- Measures of Center
- Measures of Variation – how data are distributed around the center
- 5-number summary – used to construct a visual representation, the boxplot

##### Measures of Center

Measures of central tendency indicate where the center, or most typical value, of a data set lies: Mean, Median, Mode.

Mean – the sum of the observations divided by the number of observations; the arithmetic average

10 + 20 + 30 + 40 + 50 + 60 + 70 + 80 + 90 + 100 = 550; 550 / 10 = 55

Median – the number that divides the bottom 50% of the data from the top 50%

1. Arrange the data in increasing order
2. If the number of observations is ODD, the median is the observation exactly in the middle
3. If the number of observations is EVEN, the median is the mean of the middle two observations

Position of the median in the ordered data = (n+1)/2

- 10,20,30,40,50,60,70,80,90,100,110 (ODD, n = 11): position 6; Median = 60
- 10,20,30,40,50,60,70,80,90,100 (EVEN, n = 10): position 5.5; Median = (50+60)/2 = 55

Mode – the value that occurs most frequently in the data set

- If no value occurs more than once: No Mode; 10,20,30,40,50,60,70,80,90,100
- Otherwise, any value with the greatest frequency is a mode; 10,20,30,40,50,50,60,70,80,90,100: Mode is 50

The mode is useful if the distribution is skewed or bimodal (having two very pronounced values around which data are concentrated).

[Figure: histogram; y-axis "Number of Individuals" (0 to 30). Slide text: "You are so totally skewed!"]
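The three measures of center above can be checked with a short Python sketch. It uses the standard-library `statistics` module for the mean and median; the `modes` helper is a hypothetical name introduced here so that the lecture's "No Mode" rule (no value occurs more than once) is applied explicitly. The data are the lecture's own examples.

```python
from collections import Counter
from statistics import mean, median

# Example data sets from the lecture
even_data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # n = 10 (EVEN)
odd_data = even_data + [110]                            # n = 11 (ODD)
repeat_data = sorted(even_data + [50])                  # 50 occurs twice


def modes(values):
    """Return every value with the greatest frequency.

    Follows the lecture's rule: if no value occurs more than once,
    there is no mode, so an empty list is returned.
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


print("mean:", mean(even_data))            # 550 / 10 = 55
print("median (odd):", median(odd_data))   # middle observation, 60
print("median (even):", median(even_data)) # (50 + 60) / 2 = 55
print("modes:", modes(even_data), modes(repeat_data))
```

Returning a list from `modes` is a design choice for this sketch: it covers the no-mode case and also bimodal data, where two values tie for the greatest frequency.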
The mean is sensitive to extreme (very large or very small) observations and the median is not. Therefore, you can judge how skewed your data are by looking at the relationship between the median and the mean:

- Mean is greater than the median: typically right-skewed
- Mean and median are equal: roughly symmetric
- Mean is less than the median: typically left-skewed

##### Resistant Measures

A resistant measure is not sensitive to the influence of a few extreme observations.

- Median – resistant measure of center; outliers DO NOT affect the median
- Mean – not resistant; outliers DO affect the mean

The resistance of the mean can be improved by using trimmed means: a specified percentage of the smallest and largest observations is removed before computing the mean. We will do something like this later when exploring the data and evaluating outliers (and their effects upon the mean).

##### Measures of Variation

Measures of variation (spread) describe the amount of variability in the data set: Range, Standard Deviation, Variance.

Range = Maximum Observation - Minimum Observation

10,20,30,40,50,60,70,80,90,100; Range = 100 - 10 = 90

Standard Deviation (±SD) – measures the variation by indicating how far, on average, the observations are from the mean

- Large SD: observations lie far from the mean
- Small SD: observations lie close to the mean

Variance – the square of the standard deviation (the measure used by statistical formulas). “Equal variance” is one of the assumptions of parametric means testing (we will learn this later).

Three Standard Deviations Rule – almost all observations in any data set lie within three standard deviations to either side of the mean; “almost all” is defined in two ways by stats nerds:

- Chebychev’s Rule – at least 89% of the observations lie within 3 standard deviations of the mean, for any data set
- Empirical Rule – about 99.7% of the observations lie within 3 standard deviations of the mean, if the data are approximately bell-shaped

##### 5-Number Summary

Percentiles – the data set is divided into hundredths (100 equal parts). Why? Percentiles are not sensitive to the influence of a few extreme observations (outliers).

Quartiles – the data set is divided into quarters (4 equal parts); most typically used. A data set has 3 quartiles: Q1, Q2, Q3

- Q1 – the number that divides the bottom 25% from the top 75%
- Q2 – the median; divides the bottom 50% from the top 50%
- Q3 – the number that divides the bottom 75% from the top 25%

Interquartile Range (IQR) – the difference between the first and third quartiles: IQR = Q3 - Q1. The IQR gives you the range of the middle 50% of the data.

##### Outlier, Outlier

Outliers – observations that fall well outside the overall pattern of the data. They require special attention and may be the result of:

- Measurement or recording error
- An observation from a different population
- An unusual extreme observation

##### Pants on Fire!

Outliers must be dealt with (yes, really!):
If an outlier is the result of an error, it can be deleted; otherwise, it is a judgment call. Quartiles and the IQR can be used to identify potential outliers.

##### The Outer Limits

Lower and upper limits:

- Lower limit – the number that lies 1.5 IQRs below the first quartile: Lower Limit = Q1 - 1.5 * IQR
- Upper limit – the number that lies 1.5 IQRs above the third quartile: Upper Limit = Q3 + 1.5 * IQR

If a value is outside the “Outer Limits” of a dataset, it is an OUTLIER!

##### 5-Number Summary

5-Number Summary: Min, Q1, Q2, Q3, Max

- Written in increasing order
- Provides information on center and variation
- Used to construct boxplots

##### Boxplot

Boxplot (Box-and-Whisker Design) – based on the 5-number summary; provides a graphic display of the center and variation

[Figure: boxplot labeled Min, Q1, Q2, Q3, Max on an axis from 0 to 70]

Modified Boxplot – includes outliers

[Figure: modified boxplot with a potential outlier marked *, on an axis from 0 to 70]

Note that Min and Max are determined after outliers are removed!

Boxplots summarize information about the shape, dispersion, and center of your data. They can also help you spot outliers.

- The left edge of the box represents the first quartile (Q1), while the right edge represents the third quartile (Q3)
- The box portion of the plot represents the interquartile range (IQR), the middle 50% of the data
- The line drawn through the box represents the median of the data
- The lines extending from the box are called whiskers
- The whiskers extend outward to the lowest and highest observations that lie within the lower and upper limits (that is, excluding outliers)
- Extreme values, or outliers, are represented by dots
- A value is considered an outlier if it is outside of the box (greater than Q3 or less than Q1) by more than 1.5 times the IQR

[Figure: boxplot labeled Lower Limit, Q1, Q2, Q3, Upper Limit, with a potential outlier marked *, on an axis from 0 to 70]

Use the boxplot to assess the symmetry of the data:

- If the data are fairly symmetric, the median line will be roughly in the middle of the IQR box and the whiskers will be similar in length
- If the data are skewed, the median may not fall in the middle of the IQR box, and one whisker will likely be noticeably longer than the other
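The measures of variation, the 5-number summary, and the outer limits described in these notes can be tried out with a minimal Python sketch. Several things here are assumptions rather than part of the lecture: the function names are hypothetical, the value 250 is a made-up extreme observation added to the lecture data, the standard deviation is the sample version (divisor n - 1), and the quartiles follow the default convention of `statistics.quantiles`, which may differ slightly from the convention used in class or by a given statistics package. The Chebychev bound uses the general form 1 - 1/k², which gives about 89% for k = 3.

```python
from statistics import mean, quantiles, stdev, variance

lecture_data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Hypothetical data set: the lecture values plus one made-up extreme value.
with_extreme = lecture_data + [250]


def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k SDs of the mean (k > 1)."""
    return 1 - 1 / k**2


def fraction_within(values, k):
    """Observed fraction of values within k sample SDs of the mean."""
    center, spread = mean(values), stdev(values)
    return sum(abs(v - center) <= k * spread for v in values) / len(values)


def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max).

    Quartiles use the default method of statistics.quantiles (an assumption;
    other quartile conventions can give slightly different values).
    """
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


def outer_limits(values):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values):
    """Separate values inside the outer limits from potential outliers."""
    lower, upper = outer_limits(values)
    inside = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inside, outliers


# Measures of variation for the lecture data
data_range = max(lecture_data) - min(lecture_data)
sample_variance = variance(lecture_data)  # square of the sample SD
sample_sd = stdev(lecture_data)

# 5-number summary, outer limits, and potential outliers
summary = five_number_summary(with_extreme)
limits = outer_limits(with_extreme)
inside, outliers = split_outliers(with_extreme)
# Whisker ends of a modified boxplot: Min and Max after outliers are removed
whiskers = (min(inside), max(inside))

print("range:", data_range)
print("variance:", sample_variance, "SD:", sample_sd)
print("Chebychev bound for k = 3:", chebyshev_lower_bound(3))
print("observed fraction within 3 SD:", fraction_within(lecture_data, 3))
print("5-number summary:", summary)
print("outer limits:", limits)
print("potential outliers:", outliers, "whisker ends:", whiskers)
```

With the hypothetical value 250 included, the upper limit falls below 250, so that value is flagged as a potential outlier and the upper whisker stops at 100 rather than at the raw maximum.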