Lecture 2 - West Virginia University

Lecture 2
STAT 211 – 019
Dan Piett
West Virginia University
Last Lecture
 Population/Sample
 Variable Types
 Discrete/Continuous Numeric & Ranked/Unranked
Categorical
 Displaying Small Sets of Numbers
 Dot Plots, Stem and Leaf, Pie Charts
 Histograms
 Frequency/Density and Symmetric vs Right/Left Skewed
 Measures of Center
 Mean/Median
Overview
 2.3 Measures of Dispersion
 2.5 Boxplots
 3.1 Scatterplots
 3.2 Correlation
 3.3 Regression
Section 2.3
Measures of Dispersion
Descriptive Statistics
 Describing the Data
 How do we describe data?
 Graphs (Last Class)
 Measures
 Center (Last Class)
 Mean/Median
 Dispersion/Spread (This Class)
 Variance, Standard Deviation, IQR
Spread of Data
 Example: Spread
 Data 1: 8, 8, 9, 9, 10, 11, 11, 12, 12
 Data 2: -30, -20, -10, 0, 10, 20, 30, 40, 50
 Data 1 – Mean = Median = 10
 Data 2 – Mean = Median = 10
 Both have the same measure of center but how do they differ?
 Data 2 is much more spread out.
Sample Standard Deviation
 Sample Standard Deviation (S) is a measure of how spread
out the data is
 S can be any number >= 0
 Larger S indicates a larger spread
 S has the same unit as the variable
 Example: Mean of 110 lb, Standard Deviation of 10 lb
 The square of the sample standard deviation is called the
sample variance
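For reference, the usual definition of the sample variance and sample standard deviation is sketched below. The notation (x_i for the data values, \bar{x} for the sample mean, n for the number of values) is assumed here rather than taken from the slides; dividing by n - 1 is consistent with the values of S reported for Data 1 and Data 2 in this lecture.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}
```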
Standard Deviation Example
 Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
 S = 1.58
 Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
 S = 27.39
 As you can see, the standard deviation of Data 2 is much
larger than that of Data 1.
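The values above can be checked with a short Python sketch. It assumes the standard library's `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1); the variable names are chosen here for illustration.

```python
from statistics import mean, median, stdev

data1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

for name, data in (("Data 1", data1), ("Data 2", data2)):
    # stdev() is the sample standard deviation (divides by n - 1)
    print(name, "mean =", mean(data), "median =", median(data),
          "S =", round(stdev(data), 2))
```

Both data sets share the same center, but S is far larger for Data 2.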
Population Variance/Standard
Deviation
 Much like the sample mean (xbar) estimates the population
mean (mu), the sample variance (s squared) and the sample
standard deviation (s) can be used to estimate the true
population variance (sigma squared) and the true population
standard deviation (sigma)
Linear Transformations and Changes of
Scale
 When a constant is added to or subtracted from every value in
a data set
 The mean is increased/decreased by the same amount
 The median is increased/decreased by the same amount
 The standard deviation is unchanged
 When every value is multiplied by a constant
 The mean is multiplied by the same amount
 The median is multiplied by the same amount
 The standard deviation is multiplied by the absolute value of
that constant (a standard deviation is never negative)
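A minimal sketch of these rules, using Data 1 from earlier in the lecture. The shift of 5 and the multiplier of -2 are arbitrary values picked for illustration; the negative multiplier shows that the standard deviation is scaled by the multiplier's absolute value.

```python
from statistics import mean, median, stdev

data = [8, 8, 9, 9, 10, 11, 11, 12, 12]   # Data 1 from earlier in the lecture
shifted = [x + 5 for x in data]           # add a constant (5 is arbitrary)
scaled = [x * -2 for x in data]           # multiply by a constant (-2 is arbitrary)

def summary(values):
    """Return (mean, median, sample standard deviation rounded to 2 places)."""
    return mean(values), median(values), round(stdev(values), 2)

print("original:", summary(data))
print("shifted: ", summary(shifted))
print("scaled:  ", summary(scaled))
```

The shifted data keeps the same S, while the scaled data has a mean and median of -20 and an S twice as large as the original.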
Section 2.5
Boxplots
Quartiles
 Quartiles are numbers that partition the data into 4
subgroups (like the 4 quarters in a dollar)
 Q1
 The value separating the lowest 25% of the data values from the rest
 Q2, a.k.a. the Median
 The value separating the lowest 50% of the data values from the rest
 Q3
 The value separating the lowest 75% of the data values from the rest
 Q4, a.k.a. the Maximum
 The largest data value
Quartiles Example
 You can think of Q1 as the median of the bottom half of the
data and Q3 as the median of the top half of the data (when
the number of values is odd, leave the overall median out of
both halves)
Interquartile Range (IQR)
 The IQR is another measure of spread, much like S.
 A larger IQR indicates more spread-out data
 IQR is calculated as Q3 - Q1
 Example
 Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
 IQR = 11.5 - 8.5 = 3
 Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
 IQR = 35 - (-15) = 50
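The IQR values above can be reproduced with the "median of each half" description of Q1 and Q3. The sketch below is one way to do that, assuming the overall median is left out of both halves when the number of values is odd (the convention that matches the numbers in this example). The function name `quartiles` is chosen here for illustration; other software may use slightly different quartile rules.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the overall median
    is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    if half == 0:
        raise ValueError("need at least two values")
    lower = ordered[:half]    # bottom half
    upper = ordered[-half:]   # top half
    return median(lower), median(upper)

data1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]

for name, data in (("Data 1", data1), ("Data 2", data2)):
    q1, q3 = quartiles(data)
    print(name, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```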
Boxplots
 Boxplots are a graphical representation of the quartiles.
Using IQR to Find Potential Outliers
 One method to find potential outliers is as follows:
1. Find the IQR
2. Add 1.5*IQR to Q3
 Anything larger than this value can be flagged as a potential outlier
3. Likewise, subtract 1.5*IQR from Q1
 Anything smaller than this value can be flagged as a potential outlier
 Example
 Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
 Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
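The three steps can be sketched in Python as follows. The quartile rule (median of each half, leaving out the overall median when the count is odd) is the same assumption used for the IQR example, and the function names and the extra value 30 in the last data set are hypothetical, added only to show a flagged point.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

def potential_outliers(values):
    """Return (lower cutoff, upper cutoff, flagged values) using 1.5*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1                     # step 1: find the IQR
    upper_cutoff = q3 + 1.5 * iqr     # step 2: add 1.5*IQR to Q3
    lower_cutoff = q1 - 1.5 * iqr     # step 3: subtract 1.5*IQR from Q1
    flagged = [x for x in values if x < lower_cutoff or x > upper_cutoff]
    return lower_cutoff, upper_cutoff, flagged

data1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]
data1_plus = data1 + [30]             # hypothetical extra value

for name, data in (("Data 1", data1), ("Data 2", data2),
                   ("Data 1 plus 30", data1_plus)):
    print(name, potential_outliers(data))
```

Under these assumptions the cutoffs are 4 and 16 for Data 1 and -90 and 110 for Data 2, so neither data set has a potential outlier; only the hypothetical value 30 is flagged.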
Section 3.1
Scatterplots
Bivariate Data
 Bivariate data is data consisting of two variables measured
on the same individual
 Examples
 Height and Weight
 Classes skipped and GPA
 Graphed using a scatterplot
Scatterplot Example
Section 3.2
Correlation
Pearson Correlation Coefficient
 We have discussed ways to describe data of one variable. This
section will discuss how to describe two variables measured
on the same individual together.
 The correlation coefficient, r, is a measure of the strength of
a linear (straight line) relationship between bivariate data.
(You will not need to know the formula for r)
 To say two variables are correlated is to say that an
increase/decrease in one tends to correspond to an
increase/decrease in the other.
More on r
 r can take on values between -1 and 1 (inclusive)
 The strength of the correlation depends on how close r is
to the extreme values of -1 or 1
 r = -0.78 is a stronger correlation than r = 0.50, because
-0.78 is closer to -1 than 0.50 is to 1
 There are three types of correlation
 Positive
 Negative
 No Correlation
Positive Correlation
 Positive Correlation exists when r is between 0 and 1.
 The closer r is to 1, the stronger the relationship
 This means that as one of the variables increases, the
other one tends to increase as well.
 Examples:
 Height and Weight, Temperature and Ice Cream Sales
Negative Correlation
 Negative Correlation exists when r is between -1 and 0.
 The closer r is to -1, the stronger the relationship
 This means that as one of the variables increases, the
other one tends to decrease.
 Example:
 Temperature and Hot Chocolate Sales
No Correlation
 No Correlation exists when r is approximately 0
 This means that as one of the variables increases, the other
one shows no consistent linear tendency to increase or decrease
 Example:
 Temperature and Cookie Sales
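Although the formula for r is not required here, computing it on small data sets can make the three cases concrete. The sketch below implements the standard Pearson formula by hand; the function name `pearson_r` and all temperature and sales figures are hypothetical values made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical daily temperatures (degrees F) and sales counts
temperature = [50, 60, 70, 80, 90]
ice_cream = [20, 35, 45, 60, 80]       # rises with temperature
hot_chocolate = [70, 55, 45, 30, 10]   # falls with temperature
cookies = [40, 42, 38, 42, 40]         # no clear linear pattern

print("ice cream:    ", round(pearson_r(temperature, ice_cream), 2))
print("hot chocolate:", round(pearson_r(temperature, hot_chocolate), 2))
print("cookies:      ", round(pearson_r(temperature, cookies), 2))
```

With these made-up numbers, r is close to 1 for ice cream, close to -1 for hot chocolate, and approximately 0 for cookies.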
Interpretation of r
 Although we may find that two variables are correlated, this does
not mean that there is necessarily a causal relationship.
 Example:
 High School Teachers who are paid less tend to have students who
do better on the SATs than Teachers who are paid more. It has
been found that there is a negative correlation between teacher
salary and students' SAT scores. It would be tempting to conclude
that we should pay our teachers less so students score higher.
 Clearly this is not a causal relationship. There is likely a third
variable that explains this. One possibility is the age of
the teacher.
Section 3.3
Regression
Regression Intro
 Once we have decided that two variables are correlated, we can
use the value of one of the variables, “x”, to predict the
value of the other variable, “y”.
 Example:
 Use height (x) to predict weight (y)
 Use temperature (x) to predict ice cream sales (y)
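As an illustration of using x to predict y, the sketch below fits a straight line to a handful of points and then predicts a new value. It assumes the line is fit by ordinary least squares, which is the usual choice but is an assumption here; the function name `fit_line` and the heights and weights are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line
    y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical heights (inches) and weights (lb)
heights = [62, 65, 68, 70, 73]
weights = [120, 140, 155, 165, 190]

intercept, slope = fit_line(heights, weights)
new_height = 67                                # hypothetical new individual
predicted_weight = intercept + slope * new_height
print("slope =", round(slope, 2), "intercept =", round(intercept, 2))
print("predicted weight at 67 in =", round(predicted_weight, 1))
```

A prediction of this kind is most trustworthy for x values inside the range of the data used to fit the line.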
Regression Equation

Calculating a Regression Equation
Given the slope and intercept

Plotting a Regression Line
Notes on Regression Lines

Residuals
 A residual is the signed vertical distance between a point
(observed y-value) and the regression line (predicted y-value)
 Formula: Residual = Observed Value – Predicted Value
 Using the Cholesterol Example:
 For TV Hours = 3, our predicted value was 212.2
 The actual value on the graph is 220.
 The residual for this particular point is 220 - 212.2 = 7.8
 A residual may be positive or negative
 The interpretation is that the observed y-value is 7.8 units
larger than the predicted y value for TV Hours = 3
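The residual arithmetic from the cholesterol example can be written as a one-line function. The function name `residual` is chosen here for illustration, and the second call uses a hypothetical observed value of 200 to show a negative residual.

```python
def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted

# Values from the example: TV Hours = 3, observed 220, predicted 212.2
print(round(residual(220, 212.2), 1))

# Hypothetical observed value below the line gives a negative residual
print(round(residual(200, 212.2), 1))
```

A positive residual means the point lies above the regression line; a negative residual means it lies below.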