Presenter 15 - Basics

Download Report

Transcript Presenter 15 - Basics

生醫統計學
指導老師: 蔡章仁 老師
學生: 15. 許傳智(101581001)
時間: 2013.12.31(09:00)
Real Statistics Using Excel
Descriptive Statistics
•Measures of Central Tendency
•Measures of Variability
•Symmetry, Skewness and Kurtosis
•Ranking Functions in Excel
•Descriptive Statistics Tools
•Frequency Tables
•Histograms
•Creating Box Plots
•Outliers and Robustness
•Dealing with Missing Data
•Assumptions for Statistical Tests
•Data Transformations
Ranking Functions in Excel
Figure 1 summarizes the various ranking functions in Excel for a data set R.
We describe each of these functions in more detail in the rest of the section.
In the examples which follow we describe the values of these functions for a
range R with data values {4, 0, -1, 7, 5}.
MIN and MAX
Description: MIN(R) = the smallest value in R and MAX(R) = the largest value in R
Examples:
MIN(R) = -1
MAX(R) = 7
SMALL and LARGE
Description: SMALL(n, R) = the nth smallest value in R and LARGE(n, R) = the
nth largest value in R. Here n can take on any value from 1 to the number of
elements in R, i.e. COUNT(R).
Examples:
LARGE(1, R) = 7, LARGE(2, R) = 5, LARGE(5, R) = -1
SMALL(1, R) = -1, SMALL(2, R) = 0, LARGE(5, R) = 7
RANK
Description: RANK(c, R, d) = the rank of data element c in R. If d = 0
(or is omitted) then the ranking is in increasing order, i.e. a rank of 1
represents the largest data element in R. If d ≠ 0 then the ranking is
in decreasing order and so a rank of 1 represents the smallest
element in R.
Examples:
RANK(7, R) = RANK(7, R, 0) = 1
RANK(7, R, 1) = 5
RANK(0, R) = RANK(0, R, 0) = 4
RANK(0, R, 1) = 2
Observations:
If SMALL(n, R) = c then RANK(c, R) = RANK(c, R, 0) = n
If LARGE(n ,R) = c then RANK(c, R, 1) = n
For any value c and d, 1 ≤ RANK(c, R, d) ≤ COUNT(R)
PERCENTILE
Description: For any percentage p (i.e. 0 ≤ p ≤ 1 or equivalently 0% ≤ p ≤ 100%),
PERCENTILE(R, p) = the element at the pth percentile This means that if
PERCENTILE(R, p) = c then p% of the data elements in R are less than c.
If p = k/(n–1) for some integer value k = 0, 1, 2, … n–1 where n = COUNT(R), then
PERCENTILE(R, p) = SMALL(R, k+1) = the k+1th element in R. If p is not a multiple
of 1/(n–1), then the PERCENTILE function makes a linear interpolation as described in
the examples below.
Example: The 5 data elements in R divide the range into 4 intervals of size 25%, i.e. 1/(51) = .25. Thus,
PERCENTILE(R, 0) = -1 (the smallest element in R)
PERCENTILE(R, .25) = 0 (the second smallest element in R)
PERCENTILE(R, .5) = 4 (the third smallest element in R)
PERCENTILE(R, .75) = 5 (the fourth smallest element in R)
PERCENTILE(R, 1) = 7 (the fifth smallest element in R)
For other values of p we need to interpolate. For example,
PERCENTILE(R, .8) = 5 + (7 – 5) * (0.8 – 0.75) / 0.25 = 5.4
PERCENTILE(R, .303) = 0 + (4 – 0) * (0.303 – 0.25) / 0.25 = .85
Of course, Excel’s PERCENTILE function calculates all these values automatically
without you having to figure things out.
PERCENTRANK
Description: PERCENTRANK(R, c) = the percentage of data elements below c. If
PERCENTRANK(R, c) = p then PERCENTILE(R, p) = c.
Example:
PERCENTRANK(R, 5) = .75
PERCENTRANK(R, 54) = .8
You can also add a 3rd argument which represents the number of significant figures
in the answer. Thus PERCENTRANK(R, .85, 5) = .30312
QUARTILE
Description: For any integer n = 0, 1, 2, 3 or 4, QUARTILE(R, n) = PERCENTILE(R,
n/4). If c is not an integer, but 0 ≤ c ≤ 4, then QUARTILE(R, c) = QUARTILE(R, INT(c)).
Observation:
QUARTILE(R, 0) = PERCENTILE(R, 0) = MIN(R)
QUARTILE(R, 1) = PERCENTILE(R, .25)
QUARTILE(R, 2) = PERCENTILE(R, .5) = MEDIAN(R)
QUARTILE(R, 3) = PERCENTILE(R, .75)
QUARTILE(R, 4) = PERCENTILE(R, 1) = MAX(R)
Example:
QUARTILE(R, 0) = PERCENTILE(R, 0) = -1
QUARTILE(R, 1) = PERCENTILE(R, .25) = 0
QUARTILE(R, 2) = PERCENTILE(R, .5) = 4
QUARTILE(R, 3) = PERCENTILE(R, .75) = 5
QUARTILE(R, 4) = PERCENTILE(R, 1) = 7
Descriptive Statistics Tools
Excel provides a data analysis tool called Descriptive Statistics which produces a summary
of the key statistics for a data set.
Example 1 – Provide a table of the most common descriptive statistics for the scores in
column A of Figure 1.
Figure 1 – Output from Descriptive Statistics data analysis tool
The output from the tool is shown in the right side of Figure 1. To use the
tool, select Data > Analysis|Data Analysis and choose the Descriptive
Statistics option. A dialog box appears as in Figure 2.
Now click on Input Range and highlight the
scores in column A (i.e. cells A3:A14). If you
include the heading, as is done here, check
Labels in first row. Since we want the output
to start in cell C3, click the Output Range
radio button and insert C3 (or click on cell C3).
Finally click the Summary statistics checkbox
and press OK.
Figure 2 – Descriptive Statistics dialog box
Note that if we had also checked the Kth Largest checkbox, the output would also contain the
value for LARGE(A4:A14, k) where k is the number we insert in the box to the right of the
label Kth Largest. Similarly, checking the Kth Smallest checkbox outputs SMALL(A4:A14, k).
The option Confidence Interval for Mean generates a confidence interval using the t
distribution as explained in One Sample t Test.
Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides a
supplemental Descriptive Statistics and Normality data analysis tool which outputs the above
statistics plus GEOMEAN, HARMEAN, MAD, AAD and IQR. But instead of just generating
the numerical value of each statistic, as is in Excel’s Descriptive Statistics data analysis tool,
the Real Statistics tool outputs the appropriate Excel formula for computing each statistic (see
Figure 4 below). Thus whenever the input data values change, the output values will change
automatically as well.
Both Excel’s Descriptive Statistics and the Real Statistics Descriptive Statistics and Normality
data analysis tools allow you to report on multiple sets of data at the same time, as shown in
the following example.
Example 2 – Use Excel’s
Descriptive Statistics data
analysis tool as well as the
Real Statistics Descriptive
Statistics and Normality data
analysis tool to show the
descriptive statistics for the
two samples on the left side of
Figure 3.
Figure 3 – Output from Excel’s
Descriptive Statistics data analysis tool
The output from the Excel Descriptive Statistics data analysis tool is given on the right
side of Figure 3. To use the Real Statistics data analysis tool, enter Ctrl-m and select the
Descriptive Statistics and Normality option. A dialog box will now appear. Select the
Descriptive Statistics option and the following output will be displayed:
Figure 4 – Real Statistics Descriptive Statistics data analysis tool
As described above, the tool actually generates formulas instead of the numerical values.
Frequency Tables
Often data is presented in the form of a frequency . For example,
This means that the data element 2 occurs 4 times, the
element 4 occurs 2 times and the element 3 and 5 occur
1 time. This is equivalent to a data set with elements 2, 2,
Figure 1 – Frequency Table
2, 3, 4, 4, 5. When data is provided in the form of a
frequency table, the calculation of the mean and
standard deviation cannot be performed directly using the usual AVERAGE and STDEV
Excel functions. In fact for sample data {x1, …, xm} with corresponding frequency counts
of f1, …, fm respectively and n = f1 + f2 + … + fm, then the sample mean is:
This can be calculated in Excel as
=SUMPRODUCT(R1, R2) / SUM(R2)
where R1 is a range containing the data elements {x1, …, xm} and R2 is a range
containing {f1, …, fm}.
Based on Property 1 of Measures of Variability, the sample variance can be calculated as
This can be calculated in Excel as
=SUMPRODUCT((R1-R3)^2, R2)/(SUM(R2)-1)
where R1 and R2 are as above and R3 contains the sample mean (as described above).
Using these formulas we can calculate the mean and variance of sample data expressed
in the form of a frequency table. We demonstrate this in the following example.
Example 1: Calculate the mean and variance of the sample data from the
frequency table in Figure 1( see page 11).
The required calculation is displayed in Figure 2. Here cell
F11 contains the formula =D11/E11, which calculates the
mean, and cell G14 contains the formula =(D14E14*F14)/(E14-1), which calculates the variance. The
results are the same as calculating the mean and variance by
applying Excel’s AVERAGE and VAR functions to the data
set {2, 2, 2, 2, 3, 4, 4, 5}.
Figure 2 – Calculation of
mean and standard
deviation from frequency
table
Note too that a frequency table is closely linked to a
frequency function, as defined in Definition 1 of Discrete
Distributions. E.g., since there are 8 elements in the data set
in Figure 2, we see that the frequency function for random
variable is as in Figure 3 where each frequency value is
divided by 8:
Often frequency tables are given for a range of
data values, i.e. intervals for the x values. In this
case the midpoint of each interval is assigned the
value xi.
Figure 3 – Frequency function
corresponding to frequency table
Example 2: Calculate the mean and variance for the data in the frequency
table in Figure 4.
The first interval in Figure 4 is 0 < x ≤ 4, the
second 4 < x ≤ 10, etc. The calculation of the mean
and variance is as in Figure 2, except that now the
midpoints are used as the x values.
Observation: Sometimes the first and/or last
interval is unbounded: e.g. if the last interval in
Figure 4 is replaced by “over 20”. In this case it
isn’t possible to establish a midpoint, and so all
you can do is make your best estimate of a suitable
representative value for that interval.
Figure 4 – Calculations for a
frequency table with intervals
Observation: Sometimes the first and/or last interval is unbounded: e.g. if the last
interval in Figure 4 is replaced by “over 20”. In this case it isn’t possible to
establish a midpoint, and so all you can do is make your best estimate of a suitable
representative value for that interval.
Excel Function: When you have a lot of data, it is convenient to put the data in bins,
usually of equal size, and then graph the number of data elements in each bin. Excel
provides the FREQUENCY(R1, R2) array function for doing this, where R1 = the input
array and R2 = the bin array.
To use the FREQUENCY array function, enter the data into the worksheet and then enter
a bin array. The bin array defines the intervals that make up the bins. E.g., if the bin array
= 10, 20, 30, then there are 4 bins, namely data with value x ≤ 10, data with value x where
10 < x ≤ 20, data with value x where 20 < x ≤ 30, and finally data with value x > 30. The
FREQUENCY function simply returns an array consisting of the number of data elements
in each of the bins.
Example 3: Create a frequency table for the 22
data elements in the range A4:B14 of Figure 5
based on the bin array D4:D7 (the text “over
20” in cell D8 is not part of the bin array).
To produce the output, highlight the range E4:E8
(i.e. a column range with one more cell than the
number of bins) and enter the formula
=FREQUENCY(A4:B11,D4:D7)
Figure 5 – Example of the
FREQUENCY function
Since this is an array formula, you must press Ctrl-Shft-Enter. Excel now inserts
frequency values in the highlighted range E4:E8. Here E4 contains the number of
data elements in the input range with value in the first bin (i.e. data elements
whose value is ≤ 20). Similarly, E5 contains the number of data elements in the
input range with value in the second bin (i.e. data elements whose value is > 20
and ≤ 40). The final output cell (E8) contains the number of data elements in the
input range with value > the value of the final bin (i.e. > 80 for this example).
Observation: As described in Discrete Probability Distributions, the Real Statistics
Resource Pack provides the FREQTABLE function. This function can also be used
to create a frequency table with bins where the bins are equally spaced.
Real Statistics Function: The Real Statistics Resource Pack supplies the
following supplemental array function to create a frequency table.
FREQTABLE(R1, bsize) = an array which contains the frequency table for the
data in range R1, assuming equally sized bins of size bsize.
To use the function you must highlight an array with 3 columns and at least k rows
where k = (MAX(R1) – MIN(R1) / bsize + 1. You can highlight more rows than you
need; any extra rows will take value #N/A.
Example 4: Create a frequency table for the 22 data elements in the range A4:B14 of
Figure 5 based on bins of size 15.
The desired frequency table can be produced using the array formula
=FREQTABLE(A4:B14,15)
as shown in range M4:O11 of Figure 6.
Figure 6 – FREQTABLE function with
bin size 15
The headings are not outputted by the
function but have been added manually. Note
that two extra rows have been highlighted and
so are filled with #N/A.
Observation: You can also use the Frequency
Table data analysis tool for creating
frequency tables. See Histograms for an
example of how to use these tools. Also see
Frequency Table Conversion for how to
calculate the descriptive statistics (see
Descriptive Statistics) for the data described
by a frequency table.
Histograms
A histogram is a graphical representation of the output of the FREQUENCY function (as
described in Frequency Tables).
Example 1: Create a histogram for the data and bin selection for Example 1 from Frequency
Tables.
We start by replicating the data and bin section for Example 1 in Figure 1.
Figure 1 – Data for Example 1
You can use Excel’s chart tool to graph the data in
Figure 1, or alternatively you can use the Histogram
data analysis tool to accomplish this directly, as
described next.
Excel Data Analysis Tool: To use Excel’s Histogram
data analysis tool, you must first establish a bin array
(as for the FREQUENCY function described in
Frequency Tables) and then select the Histogram data
analysis tool. In the dialog box that is displayed you
next specify the input data (Input Range) and bin array
(Bin Range). You can optionally include the labels for
these ranges (in which case you check the Labels
check box).
For Example 1, the Input Range is A4:B14 and the Bin Range is D4:D7 (with the
Labels check box unchecked). The output is displayed in Figure 2.
Figure 2 – Histogram data analysis tool
Observation: Caution must be exercised when creating histograms to present the data in a
clear and accurate way. For most purposes it is important that the intervals be equal in size
(except for an unbounded first and/or last interval). Otherwise a distorted picture of the
data may be presented.
To avoid this problem equally-spaced intervals can be used. This is the approach
illustrated in Example 4 of Frequency Tables using the FREQTABLE supplemental
function. The Frequency Table supplemental data analysis tool can be used as well.
Real Statistics Data Analysis Tool: The Frequency Table data analysis tool provided in
the Real Statistics Resource Pack can be used to create a Frequency Table and histogram
as illustrated in the following example.
Example 2: Create a frequency table and histogram for the 22 data elements in the range
A4:B14 of Figure 1 based on bins of size 15.
Enter Ctrl-m and select the Frequency Table option. The dialog box shown in Figure 3
will appear.
Figure 4 – Frequency Table and Histogram
Figure 3 – Dialog box for
Frequency Table data
analysis tool
Insert A4:B14 in the Input Range field, select Raw data as
the Input format and insert 15 as the bin size. The output is
shown in Figure 4.
Observation: You can also produce a frequency table (and histogram) of the type described
in Example 3 of Discrete Probability Distributions (i.e. without specifying any bins) via the
Frequency Table data analysis tool. In this case you would select Raw data as the Input
format for the dialog box shown in Figure 3 and leave the Bin size field blank.
Creating Box Plots in Excel
Another way to characterize a distribution or sample is via a box plot. Specifically, a
box plot provides a pictorial representation of the following statistics: maximum,
75%-percentile, median (50%-percentile), 25%-percentile and minimum.
Box plots are especially useful when comparing samples and testing whether data is
symmetric.
Real Statistics Data Analysis Tool: To generate a box plot, you can use the Box
Plot option of the Descriptive Statistics and Normality supplemental data analysis
tool found in the Real Statistics Resource Pack, as described in the following
example. See also Special Charting Capabilities for how to create the box plot
manually using Excel’s charting capabilities.
Example 1: A market research company asks 30
people to evaluate three brands of tablet computers
using a questionnaire. The 30 people are divided at
random into 3 groups of 10 people each, where the
first group evaluates Brand A, the second evaluates
Brand B and the third evaluates Brand C. The
questionnaire scores from these groups are
summarized in Figure 1.
Figure 1 – Sample data
To generate the box plots for these three groups, enter Ctrl-m and select the Descriptive
Statistics and Normality supplemental data analysis tool. A dialog box will appear. Select
the Box Plot option and insert A3:C13 in the Input Range and check Headings included with
the data. The resulting plot is shown in Figure 2.
Note too that the data analysis tool
also generates a table, which may in
fact be located behind the chart. For
those who are interested, this table
contains the information in Figure 3, as
explained in Special Charting
Capabilities.
Figure 2 – Box Plot
Figure 3 – Box Plot elements
For each sample, the box plot consists
of a rectangular box with one line
extending upward and another
extending downward (usually called
whiskers). The box itself is divided into
two parts. In particular, the meaning of
each element in the box plot is
described in Figure 3.
From the box plot (see Figure 2) we can see that the scores for Brand C tend to be higher
than for the other brands and those for Brand B tend to be lower. We also see that the
distribution of Brand A is pretty symmetric at least in the range between the 1st and 3rd
quartiles, although there is some asymmetry for higher values (or potentially there is an
outlier). Brands B and C look less symmetric. Because of the long upper whisker (especially
with respect to the box), Brand B may have an outlier (see Outliers and Robustness for a
discussion of outliers).
We can also convert the box plot to a horizontal representation of the data (as in Figure 4)
by clicking on the chart and selecting Insert > Charts|Bar > Stacked Bar.
Figure 4 – Horizontal Box Plot
See Special Charting Capabilities for more information about the Box Plot data
analysis tool, especially regarding issues that arise when some of the data is negative.
Outliers and Robustness
One problem that we face in analyzing data is the presence of outliers, i.e. data that are a
lot different from the other data collected, especially data that is much bigger or much
smaller.
For example, the mean of the sample {2, 3, 4, 5, 6} is 4, while the mean of {2, 3, 4, 5, 60}
is 14.4. The appearance of the 60 completely distorts the mean in the second sample.
Some statistics, such as the median, are more resistant to such outliers. In fact, the
median for both samples is 4.
For this example it is obvious that 60 is a potential outlier. In Identifying Outliers and
Missing Data we show how to identify potential outliers using a data analysis tool
provided in the Real Statistics Resource Pack.
Excel Function: One approach for dealing with outliers is to throw away data that is
either too big or too small. Excel provides the TRIMMEAN function for dealing with
this issue.
TRIMMEAN(R, p) – calculates the mean of the data in the range R after first
throwing away p% of the data, half from the top and half from the bottom. If R
contains n data elements and k = the largest whole number ≤ np/2, then the k
largest items and the k smallest items are removed before calculating the mean.
For example, suppose R = {5, 4, 3, 20, 1, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. Then
TRIMMEAN(R, 0.2) works as follows. Since R has 15 elements, k = INT(15 * .2 / 2)
= 1. Thus the largest element (20) and the smallest element (1) is removed from R
to get R′ = {5, 4, 3, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. TRIMMEAN now returns the mean of
this range, namely 4.385 instead of the mean of R which is 5.2.
A related approach is to use Winsorized samples, in which the trimmed values are
replaced by the remaining highest and lowest values. Consider the following
sample:4, 6, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 50, 80
A 10% trimmed sample would simply remove the two lowest and two highest
elements (i.e. 4, 6, 50, 80). A 10% Winsorized sample replaces the two lowest
elements by the third lowest and the two highest by the 3rd highest, resulting in
the following data set:
10, 10, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 48, 48
Since 4 data elements have been replaced, the degrees of freedom of any test
need to be reduced by 4.
Dealing with Missing Data
Another problem faced when collecting data is that some data may be missing. For example, in
conducting a survey with ten questions, perhaps some of the people who take the survey don’t
answer all ten questions. In Identifying Outliers and Missing Data we show how to identify
missing data using a supplemental data analysis tool provided in the Real Statistics Resource
Pack.
A simple approach for dealing with missing data is to throw out all the data for any sample
missing one or more data elements. One problem with this approach is that the sample size
will be reduced. This is particularly relevant when the reduced sample size is too small to
obtain significant results in the analysis. In this case additional sample data elements may need
to be collected. This problem is a bigger than might first be evident. E.g. if a questionnaire
with 5 questions is randomly missing 10% of the data, then on average almost 60% of the
sample will have at least one question missing.
Also it is often the case that the missing data is not randomly distributed. E.g., people filling
out a long questionnaire may give up at some point and not answer any further questions, or
they may be offended or embarrassed by a particular question and choose not to answer it.
These are characteristics that might be quite relevant to the analysis.
In general there are the following types of remedies for missing data:
•Delete the samples with any missing data elements
•Impute the value of the missing data
•Remove a variable (e.g. a particular question in the case of a questionnaire or survey)
which has a high incidence of missing data, especially if there are other variables (i.e.
questions) which measure similar aspects of the characteristics being studied.
Deleting Missing Data
Of particular importance is the randomness of the missing data. E.g. suppose question 5 has a lot
of missing data and question 7 has no missing data. If the frequency of the responses to question
7 changes significantly when samples which are missing responses to question 5 are dropped,
then the missing data is not random, and so dropping samples can bias the results of the analysis.
In this case either another remedy should be employed or the analysis should be run twice: once
with samples with missing data retained (e.g. by adding a “no response” for missing data) and
once with these samples dropped.
Missing data can be removed by using the following supplemental Excel functions.
Supplemental Excel Functions:
DELBLANK(R1, s) – fills the highlighted range with the data in range R1 (by columns)
omitting any empty cells
DELNonNum(R1, s) – fills the highlighted range with the data in range R1 (by columns)
omitting any non-numeric cells
DELROWBLANK(R1, b) – fills the highlighted range with the data in range R1 omitting
any row which has one or more empty cells; if b is True then the first row of R1
(presumably containing column headings) is always is always copied (even if it contains an
empty cell); the second argument is optional defaults to b = False.
DELROWNonNum(R1, b) – fills the highlighted range with the data in range R1 omitting
any row which has one or more non-numeric cells; if b is True then the first row of R1
(presumably containing column headings) is always is always copied (even if it contains a
non-numeric cell); the second argument is optional defaults to b = False.
.
The string s is used as a filler in case the output range has more cells than R1. This second
argument is optional and defaults to the error value #N/A. See Data Conversion and
Reformatting for an example of the use of these functions.
Also see Data Conversion and Reformatting for how to use the supplemental Reformat Data
Range data analysis tool found in the Real Statistics Resource Pack to accomplish the same
objectives.
Imputing the values for missing data
Some techniques for imputing values for missing data include:
•Substituting the missing data with another observation which is considered similar,
either taken from another sample or from a previous study.
•Using the mean of all the non-missing data elements for that variable. This might be
acceptable in cases with a small number of missing data elements, but otherwise it can
distort the distribution of the data (e.g. by reducing the variance) or by lowering the
observed correlations (see Basic Concepts of Correlation).
•Using regression techniques. In this approach regression (as described in
Regression and Multiple Regression) is used to predict the value of the missing data
element based on the relationship between that variable and other variables. This
approach reinforces existing relationships and so makes it more likely that the analysis
will characterize the sample and not the general population.
Assumptions for Statistical Tests
As we will see shortly, most of the statistical tests we perform are based on a set of
assumptions. When these assumptions are violated the results of the analysis can be
misleading or completely erroneous.
Typical assumptions are:
•Normality: Data have a normal distribution (or at least is symmetric)
•Homogeneity of variances: Data from multiple groups have the same
•Linearity: Data have a linear relationship
•Independence: Data are independent
We explore in detail what it means for data to be normally distributed in Normal
Distribution, but in general it means that the graph of the data has the shape of a bell
curve. Such data is symmetric around its mean and has kurtosis equal to zero. In
Testing for Normality and Symmetry we provide tests to determine whether data meet
this assumption.
Some tests (e.g. ANOVA) require that the groups of data being studied have the same
variance. In Homogeneity of Variances we provide some tests to determine whether
groups of data have the same variance.
Some tests (e.g. Regression) require that there be a linear correlation between the
dependent and independent variables. Generally linearity can be tested graphically
using scatter diagrams or via other techniques explored in Correlation, Regression and
Multiple Regression.
We touch on the notion of independence in Definition 3 of Basic Probability Concepts.
In general, data are independent when there is no correlation between them (see
Correlation). Many tests require that data be randomly sampled with each data
element selected independently of data previously selected. E.g. if we measure the
monthly weight of 10 people over the course of 5 months, these 50 observations are not
independent since repeated measurements from the same people are not independent.
Also the IQ of 20 married couples doesn’t constitute 40 independent observations.
Almost all of the most commonly used statistical tests rely of the adherence to some
distribution function (such as the normal distribution). Such tests are called parametric
tests. Sometimes when one of the key assumptions of such a test is violated, a nonparametric test can be used instead. Such tests don’t rely on a specific probability
distribution function (see Non-parametric Tests).
Another approach for addressing problems with assumptions is by transforming the
data (see Transformations).
Data Transformations
It can sometimes be useful to transform data to overcome the violation of an assumption
required for the statistical analysis we want to make. Typical transformations take a
random variable and transform it into log x or 1/x or x2 or , etc.
There is some controversy regarding the desirability of performing such
transformations since often they cause more problems than they solve. Sometimes a
transformation can be considered simply as another way of looking at the data. For
example, sound volume is often given in decibels, which is essentially a log
transformation; time to complete a task is often expressed as speed, which is essentially
a reciprocal transformation; area of a circular plot of land can be expressed as the
radius, which is essentially a square root transformation.
In any case, we will see some examples in the rest of the book where transformations
are desirable. One thing that is very important is that transformations be applied
uniformly. E.g. when comparing three groups of data, it would not be appropriate to
apply a log transformation to one group but not to the other two.
Also transformations should only be used to achieve the assumptions of a test. You
shouldn’t try lots of transformation in order to find one that achieves a specific test
result.
--The end--