Data transformation
Data transformation can “correct” deviation from normality
and uneven variance (heteroscedasticity).
The right transformation can allow you to use parametric
statistics.
The need for transformations
We should always check the assumptions
that data follow a normal distribution with
uniform variance
*If the data meet the assumptions we can
analyze the raw data as described.
*If they are not met, we have two possible
strategies:
1-We can use a method which does not
require these assumptions, such as a rank-based method.
2-We can transform the data mathematically
to make them fit the assumptions more
closely before analysis.
There are three commonly used transformations for
quantitative data:
the logarithm, the square root, and the reciprocal. We call
these transformations variance-stabilizing because
their purpose is to make variances the same.
If we have several groups of subjects and calculate the
mean and variance for each group, we can plot variability
against mean. We might have one of these situations:
-Variability and mean are unrelated. We do not usually
have a problem and can treat the variances as uniform.
We do not need a transformation.
-Variance is proportional to mean. A square root
transformation should remove the relationship between
variability and mean.
-Standard deviation is proportional to mean. A logarithmic
transformation should remove the relationship between
variability and mean.
-Standard deviation is proportional to the square of the
mean. A reciprocal transformation should remove the
relationship between variability and mean.
For most data encountered in healthcare research, the first
or third situation applies.
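One way to decide which situation applies is to regress log(SD) on log(mean) across the groups: a slope near 0 matches the first situation, near 0.5 the square root case (variance proportional to mean), near 1 the logarithm and near 2 the reciprocal. The sketch below is only an illustration under that reading. The data are simulated, and the function names `loglog_slope` and `suggest_transformation` are hypothetical helpers, not part of any library.

```python
import math
import random
import statistics

def loglog_slope(groups):
    """Least-squares slope of log(SD) against log(mean) across groups."""
    xs = [math.log(statistics.mean(g)) for g in groups]
    ys = [math.log(statistics.stdev(g)) for g in groups]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def suggest_transformation(slope):
    """Pick the situation from the text whose slope is nearest (a rough guide only)."""
    options = {0.0: "none", 0.5: "square root", 1.0: "logarithm", 2.0: "reciprocal"}
    nearest = min(options, key=lambda k: abs(k - slope))
    return options[nearest]

# Simulated (not real) data: five groups whose SD is proportional to the mean.
rng = random.Random(1)
groups = [[m * rng.lognormvariate(0.0, 0.3) for _ in range(500)]
          for m in (2, 5, 10, 20, 50)]

slope = loglog_slope(groups)
suggestion = suggest_transformation(slope)
print(round(slope, 2), suggestion)
```

With real data the slope will rarely land exactly on one of these values, so a plot of SD against mean remains the more informative check.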
Variance-stabilizing transformations also tend to make
distributions Normal.
There is a mathematical reason for this, as for so much in
statistics.
It can be shown that if we take several samples from the same
population, the means and variances of these samples will
be independent if and only if the distribution is Normal.
This means that uniform variance tends to go with a Normal
distribution. A transformation which makes variance uniform
will often also make data follow a Normal distribution, and
vice versa.
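A small simulation can make this plausible. The sketch below draws many samples from a Normal and from a skew (exponential) distribution and correlates the sample means with the sample variances. The distributions, sample sizes and helper names are assumptions chosen for illustration. Correlation only measures linear association, so a value near zero is consistent with independence but does not prove it.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def mean_variance_correlation(draw, n_samples=2000, sample_size=10):
    """Correlate sample means with sample variances over repeated samples."""
    means, variances = [], []
    for _ in range(n_samples):
        sample = [draw() for _ in range(sample_size)]
        means.append(statistics.mean(sample))
        variances.append(statistics.variance(sample))
    return pearson(means, variances)

rng = random.Random(2)
r_normal = mean_variance_correlation(lambda: rng.gauss(10.0, 2.0))
r_skew = mean_variance_correlation(lambda: rng.expovariate(1.0))
print(round(r_normal, 2), round(r_skew, 2))
```

In the skew case, samples with larger means also tend to have larger variances, which is the pattern a variance-stabilizing transformation is meant to remove.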
Logarithmic transformation
*The most frequently used transformation is the logarithm.
*This is particularly useful for concentrations of
substances in blood.
*The reason for this is that blood is very dynamic,
with reactions happening continuously. Many of
the substances we measure are part of a
metabolic chain, both being synthesized and
metabolized to something else.
*The rates at which these reactions happen
depend on the amounts of other substances in
the blood, and the consequence is that the various
factors which determine the concentration of the
substance are multiplied together.
*Multiplying and dividing tends to produce
skew distributions.
*If we take the logarithm of several
numbers multiplied together we get the
sum of their logarithms.
**So log transformation produces
something where the various influences
are added together and addition tends to
produce a Normal distribution.
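The argument above can be illustrated with a simulation. The sketch assumes a purely hypothetical model in which each measurement is the product of 10 independent positive factors, then compares the skewness before and after taking logs. The `skewness` helper is a plain moment-based estimate written for this example.

```python
import math
import random
import statistics

def skewness(xs):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

rng = random.Random(3)
# Hypothetical model: each value is the product of 10 independent positive factors.
values = [math.prod(rng.uniform(0.5, 1.5) for _ in range(10))
          for _ in range(20000)]
# The log of a product is the sum of the logs of the factors.
logs = [math.log(v) for v in values]

skew_raw = skewness(values)
skew_log = skewness(logs)
print(round(skew_raw, 2), round(skew_log, 2))
```

The raw products are strongly positively skew, while their logs, being sums, are much closer to symmetric.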
For example, the following figure shows
serum cholesterol in stroke patients
As we have seen, for the serum cholesterol in stroke
patients data, the log transformation gives a good fit to
the Normal. What happens if we analyze the logarithm
of serum cholesterol then try to transform back to the
natural scale?
For the raw data, serum cholesterol: mean = 6.34, SD =
1.40.
For log (base e) serum cholesterol: mean = 1.82, SD =
0.22.
If we take the mean on the transformed scale and back-transform
by taking the antilog, we get exp(1.82) =
6.17. This is less than the mean for the raw data. The
antilog of the mean log is not the same as the
untransformed arithmetic mean.
The antilog of the mean log is called the geometric mean.
The geometric mean is found by multiplying all the n
observations together and then taking the nth root. For
example, the geometric mean of 4 and 9 is 6, found by
multiplying 4 by 9 to give 36 and taking the square (or
second) root.
The geometric mean is never larger than the arithmetic
mean, and is smaller unless all the observations are equal.
For 4 and 9 the arithmetic mean is (4 + 9)/2 = 6.5.
The log of a product is the sum of the logs, so the log of the
nth root of the product is the sum of the logs divided by n.
Thus the mean of the logs is the log of the geometric
mean.
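The numbers above can be checked directly. This is a minimal sketch using only the values quoted in the text (4 and 9, and the mean log cholesterol of 1.82). The function name `geometric_mean_via_logs` is an illustrative helper.

```python
import math
import statistics

def geometric_mean_via_logs(xs):
    """Antilog of the mean of the natural logs (requires all xs > 0)."""
    return math.exp(statistics.fmean([math.log(x) for x in xs]))

values = [4, 9]
geo_by_root = math.prod(values) ** (1 / len(values))  # nth root of the product
geo_by_logs = geometric_mean_via_logs(values)          # same quantity via logs
arith = statistics.fmean(values)                       # arithmetic mean

# Back-transforming the mean log serum cholesterol quoted in the text.
back_transformed = math.exp(1.82)

print(geo_by_root, round(geo_by_logs, 6), arith, round(back_transformed, 2))
```

Both routes give the same geometric mean, and it sits below the arithmetic mean, just as 6.17 sits below 6.34 for the cholesterol data.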
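What about the units for the
geometric mean?
If cholesterol is measured in mmol/L, the
log of a single observation is the log of a
measurement in mmol/L.
The sum of n logs is the log of the product of n
measurements, that is, the log of a quantity in mmol/L
to the power n. Dividing by n gives the log of its nth root,
which is again the log of a quantity in mmol/L,
and the antilog is back in the original units,
mmol/L.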
Even if a transformation does not produce a
really good fit to the Normal distribution, it
may still make the data much more
amenable to analysis.
The following figure shows a histogram and Normal plot
for the area of venous ulcer at recruitment.
The raw data have a very skew distribution
and the small number of very large ulcers
might lead to problems in analysis.
Although the log transformed data are still
skew, the skewness is much less and the
data are much easier to analyze.
Making a distribution more like the Normal is not the only
reason for using a transformation.
The following figure shows prostate specific antigen (PSA)
for three groups of prostate patients: with benign
conditions, with prostatitis, and with prostate cancer.
A log transformation of the PSA gives a much clearer
picture. The variability is now much more similar in the
three groups.
The square root
The square root is best for fairly weak
relationships between variability and magnitude,
i.e. variance proportional to mean or standard
deviation proportional to the square root of the
mean.
The square root can be used for variables which
are greater than or equal to zero. The log and
the reciprocal can only be used for variables
which are strictly greater than zero, because
neither the logarithm nor the reciprocal of zero
is defined.
Arm lymphatic flow in rheumatoid arthritis
with oedema
The distribution is positively skew and the
variability is clearly greater in the groups with
greater lymphatic activity.
A square root transformation has the effect of
making the data less skew and making the
variation more uniform.
In these data, a log transformation proved to have
too great an effect, making the distribution
negatively skew, and so the square root of the
data was used in the analysis.
Reciprocal transformation
The reciprocal is best for very strong
relationships between variability and mean, where the standard
deviation is proportional to the square of
the mean. In that case it removes the relationship.
The reciprocal can only be used for variables
which are strictly greater than zero.
The square root removes the least amount
of skewness and the reciprocal removes the
most.
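The relative strength of the three transformations can be seen on simulated data. The sketch below uses a positively skew gamma sample, which is an assumption chosen for illustration, and compares the skewness after each transformation. It uses the negative reciprocal, -1/x, rather than 1/x so that the order of the observations is preserved. That is a choice made for this example and is not stated in the text.

```python
import math
import random
import statistics

def skewness(xs):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

rng = random.Random(5)
# Simulated, strictly positive, positively skew data (not real measurements).
data = [rng.gammavariate(4.0, 2.0) for _ in range(5000)]

skews = {
    "raw": skewness(data),
    "square root": skewness([math.sqrt(x) for x in data]),
    "logarithm": skewness([math.log(x) for x in data]),
    "negative reciprocal": skewness([-1.0 / x for x in data]),
}
for name, value in skews.items():
    print(f"{name:20s} {value:6.2f}")
```

Each step down the list pulls the upper tail in further. For mildly skew data the stronger transformations can overshoot and leave the distribution negatively skew, as happened with the log in the lymphatic flow example.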
Can all data be transformed?
Not all data can be transformed
successfully.
1-Sometimes we have very long tails at both
ends of the distribution, which makes
transformation by log, square root or
reciprocal ineffective.
For example, the distribution of blood sodium in
ITU patients.
This is fairly symmetrical, but has longer tails than a
Normal distribution. The shape of the Normal plot is first
convex then concave.
2-Sometimes we have a bimodal distribution, which makes
transformation by log, square root or reciprocal ineffective.
For example, systolic blood pressure in a sample of ITU
patients.
3-Sometimes we have a large number of identical
observations, which will all transform to the
same value whatever transformation we use.
These are often at one extreme of the
distribution, usually at zero.
For example, the distribution of coronary artery calcium in a
large group of patients.
More than half of these observations were equal to zero. Any
transformation would leave more than half the observations with the same
value, at the extreme of the distribution. It is impossible to
transform these data to a Normal distribution.
4-Sometimes the p-value varies depending on the transformation used.
So, what can we do if we cannot transform the data?
It is usually safer to use methods that do not require such
assumptions.
These include the non-parametric methods.
Hypothesis Testing Procedures
Parametric: Z Test, t Test, One-Way ANOVA
Nonparametric: Wilcoxon Rank Sum Test, Kruskal-Wallis H-Test
Many More Tests Exist!
Types of data and analysis (diagram): nominal, ordinal, discrete and continuous data; parametric and non-parametric analysis
Types of Data
Nominal - no numerical value
Ordinal - order or rank
Discrete - counts
Continuous - interval, ratio
Parametric Test Procedures
1-Involve Population Parameters
Example: Population Mean
2-Require Interval Scale or Ratio Scale
Whole Numbers or Fractions
Example: Height in Inches (72, 60.5, 54.7)
3-Have Stringent Assumptions
Example: Normal Distribution
Nonparametric Test Procedures
A nonparametric test is a hypothesis
test that does not require any specific
conditions about the shape of the
populations or the value of any
population parameters.
These tests are often called “distribution free”
tests.
Why non-parametric statistics?
-Need to analyse ‘Crude’ data (nominal, ordinal)
-Data derived from small samples
-Data that do not follow a normal
distribution
-Data of unknown distribution
Advantages of
Nonparametric Tests
1-Used With All Scales
2-Easier to Compute
3-Make Fewer Assumptions
4-Suitable for Small Sample Sizes
5-Can Be Used When the Analysis Involves Outlier Values
6-No Need for Population
Parameters
7-Results May Be as Exact as
Parametric Procedures
© 1984-1994 T/Maker Co.
Disadvantages of
Nonparametric Tests
1-May Waste Information
If Data Permit Using Parametric
Procedures
Example: Converting Data From Ratio to
Ordinal Scale
2-Difficult to Compute by
Hand for Large Samples
3-Tables Not Widely Available
What is a parameter and why should I care?
Most statistical tests, like the t test, assume some kind of underlying
distribution, like the normal distribution.
If you know the mean and the standard deviation of a normal
distribution, then you know how to calculate probabilities.
Means and standard deviations are called parameters; all theoretical
distributions have parameters.
Statistical tests that assume a distribution and use parameters are
called parametric tests.
Statistical tests that don't assume a distribution or use parameters are
called nonparametric tests.
Ranks
Many nonparametric procedures are based on ranks.
Data are ranked by ordering them from lowest to highest and
assigning them, in order, the integer values from 1 to the
sample size.
Ties are resolved by assigning tied values the mean of the ranks
they would have received if there were no ties.
Example: 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5
– If the two 119s were not tied, they would have been assigned
the ranks 2 and 3. The mean of 2 and 3 is 2.5.
Procedure: replace the original data with the ranks across subjects
and then perform the parametric test.
For large samples, many nonparametric techniques can be
viewed as the usual normal-theory-based procedures
applied to ranks
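The ranking rule with averaged ties can be written out in a few lines. This is a minimal sketch. The function name `average_ranks` is an illustrative helper, and the data are the five values from the example above.

```python
def average_ranks(values):
    """Rank values from lowest to highest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value is tied with the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        shared = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

ranks = average_ranks([117, 119, 119, 125, 128])
print(ranks)  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

The ranks are returned in the original order of the data, so they can replace the observations subject by subject.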
Normal theory based test | Corresponding nonparametric test | Purpose of test
t test for independent samples | Mann-Whitney U test; Wilcoxon rank-sum test | Compares two independent samples
Paired t test | Wilcoxon matched pairs signed-rank test | Examines a set of differences
Pearson correlation coefficient | Spearman rank correlation coefficient | Assesses the linear association between two variables
One way analysis of variance (F test) | Kruskal-Wallis analysis of variance by ranks | Compares three or more groups
Two way analysis of variance | Friedman two way analysis of variance | Compares groups classified by two different factors
Wilcoxon signed rank test
To test the difference between
paired data
STEP 1
-Exclude any differences which are zero
-Ignore the signs of the remaining differences
-Put them in ascending order
-Assign them ranks
-If any differences are equal, average their
ranks
STEP 2
-Add up the ranks of the positive differences as T+
-Add up the ranks of the negative differences as T-
STEP 3
If there is no difference between drug
and placebo, then T+ and T- would be
similar.
If there were a difference,
one sum would be much smaller
and
the other much larger than expected.
The smaller sum is denoted as T:
T = smaller of T+ and T-
STEP 4
Compare the value obtained with the
critical values (5%, 2% and 1%) in the
table.
N is the number of differences that
were ranked (not the total number of
differences),
so the zero differences are excluded.
Wilcoxon Signed Rank Test
- assume the distribution is continuous and symmetric
Look at the differences between the
observations and the null value, M0
(for paired data, look at differences within pairs)
Discard any observation(s) that equal M0 and adjust n
Rank the absolute values of the differences,
from low to high
Ties receive the average rank
T+ = sum of the ranks of the positive differences
T- = sum of the ranks of the negative differences
p-values for one-sided tests are in the table
- only if results are in the correct “direction”
Double the table value to get the p-value for
a two-sided test
Example for a single sample.
Test, at α = .05, if the median age of students finishing
a Masters degree in biostatistics is greater than 25.
H0: M = 25
H1: M > 25
Age   Age-25   Rank
26    1        1
30    5        5
37    12       7
23    -2       2
42    17       8
25    0        (discarded)
28    3        3.5
33    8        6
28    3        3.5
T+ = 34
T- = 2
T = 2
p-value = .0118
*Because the calculated T has a p-value less
than 0.05, from the tables, the difference
is significant.
*We can reject H0.
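The rank sums in this example can be reproduced, and the one-sided p-value can be computed exactly by listing every possible pattern of signs, each equally likely under H0. This is a sketch rather than the method behind the printed table. The helper names are illustrative. The enumeration gives 3/256, about 0.0117, which is close to the tabulated .0118.

```python
from itertools import product

def average_ranks(values):
    """Rank from lowest to highest; tied values share the mean of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def signed_rank_sums(observations, null_value):
    """Return (T+, T-, ranks) after discarding observations equal to the null value."""
    diffs = [x - null_value for x in observations if x != null_value]
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return t_plus, t_minus, ranks

def exact_lower_tail(ranks, observed):
    """P(sum of negatively signed ranks <= observed) over all 2^n sign patterns."""
    hits = 0
    for signs in product((False, True), repeat=len(ranks)):
        if sum(r for r, negative in zip(ranks, signs) if negative) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)

ages = [26, 30, 37, 23, 42, 25, 28, 33, 28]
t_plus, t_minus, ranks = signed_rank_sums(ages, 25)
p_one_sided = exact_lower_tail(ranks, t_minus)
print(t_plus, t_minus, round(p_one_sided, 4))
```

Full enumeration is only practical for small n. For larger samples a table or a Normal approximation is normally used instead.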
Signed Rank Test Computation Table for
paired data.
X1i   X2i   Di = X1i - X2i   |Di|   Ri   Sign   Signed Ri
X11   X21   D1 = X11 - X21   |D1|   R1   ±      ± R1
X12   X22   D2 = X12 - X22   |D2|   R2   ±      ± R2
X13   X23   D3 = X13 - X23   |D3|   R3   ±      ± R3
:     :     :                :      :    :      :
X1n   X2n   Dn = X1n - X2n   |Dn|   Rn   ±      ± Rn
Total                                           T+ & T-
Hours of sleep
Patient   Drug   Placebo   Difference   Rank (ignoring sign)
1         6.1    5.2        0.9         3.5*
2         7.0    7.9       -0.9         3.5*
3         8.2    3.9        4.3         10
4         7.6    4.7        2.9         7
5         6.5    5.3        1.2         5
6         8.4    5.4        3.0         8
7         6.9    4.2        2.7         6
8         6.7    6.1        0.6         2
9         7.4    3.8        3.6         9
10        5.8    6.3       -0.5         1
* The 3rd and 4th ranks are tied and hence averaged.
T = smaller of T+ (50.5) and T- (4.5)
Here T = 4.5 is significant at the 2% level, indicating that the drug (hypnotic) is
more effective than placebo.
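The same calculation can be sketched for the paired sleep data. One practical point arises with decimal data: differences such as 6.1 - 5.2 and 7.9 - 7.0 are not exactly equal in binary floating point, so the sketch rounds each difference to one decimal place before ranking, which is an assumption that suits data recorded to one decimal. The two-sided p-value is obtained by doubling the exact one-sided value. It is computed here by enumeration and not read from the printed table.

```python
from itertools import product

drug = [6.1, 7.0, 8.2, 7.6, 6.5, 8.4, 6.9, 6.7, 7.4, 5.8]
placebo = [5.2, 7.9, 3.9, 4.7, 5.3, 5.4, 4.2, 6.1, 3.8, 6.3]

# Round so that tied differences compare as equal, then drop zero differences.
diffs = [round(d - p, 1) for d, p in zip(drug, placebo)]
diffs = [d for d in diffs if d != 0]

def midranks(values):
    """Rank from lowest to highest; tied values share the mean of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

ranks = midranks([abs(d) for d in diffs])
t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
t = min(t_plus, t_minus)

# Exact one-sided p: share of sign patterns whose negative rank sum is <= T.
hits = sum(
    1
    for signs in product((False, True), repeat=len(ranks))
    if sum(r for r, negative in zip(ranks, signs) if negative) <= t
)
p_one_sided = hits / 2 ** len(ranks)
p_two_sided = 2 * p_one_sided
print(t_plus, t_minus, p_two_sided)
```

The two-sided value falls below 0.02, in line with the statement that T = 4.5 is significant at the 2% level.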
Signed Rank Test Computation
Table
X1i    X2i    Di      |Di|   Ri                      Sign   Signed Ri
9.98   9.88   +0.10   0.10   4                       +      +4
9.88   9.86   +0.02   0.02   1                       +      +1
9.90   9.83   +0.07   0.07   2.5 (tied ranks 2, 3)   +      +2.5
9.99   9.80   +0.19   0.19   5                       +      +5
9.94   9.87   +0.07   0.07   2.5 (tied ranks 2, 3)   +      +2.5
9.84   9.84   0.00    0.00   ...                     ...    Discard
Total: T+ = 15, T- = 0
Wilcoxon Signed Rank
Table (Portion)
One-Tailed   α = .05   α = .025   α = .01   α = .005
Two-Tailed   α = .10   α = .05    α = .02   α = .01
n = 5        1
n = 6        2         1
n = 7        4         2          0
..           ..        ..         ..        ..
n = 11       :         :          :         :
n = 12       :         :          :         :
n = 13       :         :          :         :
There are two ways of using the tables for the Wilcoxon signed
rank test
1-Looking at critical values:
The calculated T value
(the smaller one) is compared with the
tabulated value at the specific N and p.
The difference is significant (the null
hypothesis is rejected) if
calculated T ≤ tabulated T.
2-Comparing p-values:
Find the p-value at the given N that matches
the calculated T.
If this p-value > the specified one (0.05 for
example), H0 cannot be rejected, i.e. the result is
not significant. It is only significant if
p ≤ the specified p.
** For a 2-tailed test, p = 2 x p for a 1-tailed test.