ERP Boot Camp Lecture #9


The ERP Boot Camp
Plotting, Measurement, & Statistics
All slides © S. J. Luck, except as indicated in the notes sections of individual slides
Slides may be used for nonprofit educational purposes if this copyright notice is included, except as noted
Permission must be obtained from the copyright holder(s) for any other use
Plotting: The Right Way
[Annotated waveform figure showing what a proper ERP plot includes:]
- To-be-compared waveforms overlaid
- Legend in figure
- Time zero marked
- Time ticks on baseline for every waveform
- Electrode site labeled
- Voltage calibration aligned with waveform
- Calibration size and polarity shown
- Baseline shows 0 µV
Plotting: Basic Principles
• You must show the waveforms (SPR rule)
  - You need to show enough sites so that experts can figure out the underlying component structure
  - I often show just one site for a cognitive audience when the component can be isolated (N2pc or LRP)
  - In most cases, don't show more than 6-8 sites (use a topo map instead)
• A prestimulus baseline must be shown
  - Usually 200 ms (minimum of 100 ms for most experiments)
  - If you don't see a baseline, the study is probably C.R.A.P. (Carelessly Reviewed Awful Publication)
• Overlay the key waveforms
• In most cases, show both original waveforms and difference waves
Measuring ERP Amplitudes
• Basic options
  - Peak amplitude
    • Or average around peak
    • Or local peak amplitude
  - Mean/area amplitude
Why Mean is Better than Peak
• Rule #1: "Peaks and components are not the same thing. There is nothing special about the point at which the voltage reaches a local maximum."
  - Mean amplitude better characterizes a component as being extended over time
  - Peak amplitude encourages a misleading view of components
• Peak may find the rising edge of an adjacent component
  - Can be solved by a local peak measure
• Peak is sensitive to high-frequency noise
  - Can be mitigated by a low-pass filter or "mean around peak"
• Time of peak depends on overlapping components
  - The peak may be nowhere near the center of the experimental effect
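
A minimal sketch of these amplitude measures in Python (numpy assumed), applied to a single averaged waveform; the function name, window limits, and half-width below are hypothetical, not from the lecture:

import numpy as np

def amplitude_measures(wave, times, tmin=300, tmax=500, halfwidth=25):
    """Peak, local peak, mean-around-peak, and mean amplitude in a window."""
    win = (times >= tmin) & (times <= tmax)
    w, t = wave[win], times[win]

    i_peak = np.argmax(w)                       # simple peak (largest point)
    peak_amp, peak_lat = w[i_peak], t[i_peak]

    # Local peak: largest point that exceeds both of its neighbors, so the
    # rising edge of an adjacent component is not mistaken for a peak
    local = [i for i in range(1, len(w) - 1) if w[i] > w[i - 1] and w[i] > w[i + 1]]
    local_peak_amp = max(w[i] for i in local) if local else peak_amp

    # Mean around peak: average within ±halfwidth ms of the peak latency
    near = (times >= peak_lat - halfwidth) & (times <= peak_lat + halfwidth)
    mean_around_peak = wave[near].mean()

    mean_amp = w.mean()                         # mean amplitude in the window
    return peak_amp, local_peak_amp, mean_around_peak, mean_amp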
Why Mean is Better than Peak
• Peak amplitude is biased by the noise level
  - More noise means greater peak amplitude
  - Mean amplitude is unbiased by noise level
• Example (see the simulation sketch below)
  - Do 1000 simulation runs at two noise levels
  - Take mean amplitude and peak amplitude on each run
  - Average of 1000 mean amplitudes will be approximately the same for high-noise and low-noise data
  - Average of 1000 peak amplitudes will be greater for high-noise data than for low-noise data
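
This simulation can be run directly; the waveform shape, noise levels, and measurement window are made-up values chosen only to illustrate the bias:

import numpy as np

rng = np.random.default_rng(1)
times = np.arange(-200, 800, 2)                        # ms at 500-Hz sampling
signal = 5 * np.exp(-0.5 * ((times - 400) / 80) ** 2)  # Gaussian "component"
win = (times >= 300) & (times <= 500)                  # measurement window

for noise_sd in (0.5, 2.0):                            # low vs. high noise
    peaks, means = [], []
    for _ in range(1000):
        wave = signal + rng.normal(0, noise_sd, times.size)
        peaks.append(wave[win].max())
        means.append(wave[win].mean())
    print(f"noise sd {noise_sd}: avg peak = {np.mean(peaks):.2f} µV, "
          f"avg mean = {np.mean(means):.2f} µV")

# The average of the 1000 mean amplitudes is essentially identical at both
# noise levels, whereas the average of the 1000 peak amplitudes is inflated
# in the high-noise condition.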
Peak Amplitude and Noise
[Figure: a clean waveform vs. the same waveform + 60-Hz noise]
Why Mean is Better than Peak
• Peak can occur at different time points at different electrodes
  - A real effect cannot do this
• A narrower measurement window can be used for mean amplitude
• Mean amplitude is linear; peak amplitude is not (see the check below)
  - Mean of peak amplitudes ≠ peak amplitude of grand average
  - Mean of mean amplitudes = mean amplitude of grand average
  - Same applies to single-trial data vs. averaged waveform
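
A quick numeric check of the linearity point, using made-up random data in place of real subject waveforms:

import numpy as np

rng = np.random.default_rng(2)
waves = rng.normal(size=(12, 300))       # 12 hypothetical subject waveforms
grand = waves.mean(axis=0)               # grand average

# Mean amplitude commutes with averaging across subjects...
print(np.isclose(waves.mean(axis=1).mean(), grand.mean()))   # True
# ...but peak amplitude does not
print(np.isclose(waves.max(axis=1).mean(), grand.max()))     # False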
Shortcomings of Mean Amplitude
• You will still pick up overlapping components
  - A narrower window reduces this, but increases the noise level
• Different measurement windows might be appropriate for different subjects
  - This could be a source of measurement noise
  - Patients and controls might have different latencies, leading to a systematic distortion of the results
  - This is a case where peak might be better
• How do you pick the measurement window?
  - Using the time course of an effect biases you to find a significant effect
  - Reality: People often look at the data first
  - Alternative 1: Select window based on prior results
  - Alternative 2: "Functional localizer" condition to find "ROI"
  - Alternative 3: Resampling/randomization approaches
The Baseline (reminder)
• Baseline correction is equivalent to subtracting the baseline voltage from your amplitude measures (see the sketch below)
  - Any noise in the baseline contributes to the amplitude measure
  - Short baselines are noisy
  - Usual recommendation: 200 ms
  - Need to look at 200+ ms to evaluate overlap and preparatory activity
• Baseline can be a significant confound
  - Baselines may differ across conditions due to overlap or preparatory activity, and this activity may fade over time
  - A poststimulus amplitude measure may therefore vary across conditions due to differential baselines
• Fading prestimulus differences can also distort scalp distributions
  - The distribution of the prestimulus period contributes to the poststimulus distribution
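
A minimal sketch of baseline correction, assuming a single waveform `wave` with time axis `times` in ms; the function name and window are hypothetical:

import numpy as np

def baseline_correct(wave, times, base_start=-200, base_end=0):
    """Subtract the mean prestimulus voltage from the whole waveform."""
    base = (times >= base_start) & (times < base_end)
    return wave - wave[base].mean()

# Because every later amplitude measure is now "voltage minus mean baseline
# voltage," any noise in the baseline propagates directly into those measures.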
Measuring Midpoint Latency
• Basic options
  - Peak latency
    • Or local peak latency
  - 50% area latency
Better Example of 50% Area
[Figure: 50% area latency illustrated on a rare-minus-frequent difference wave]
Shortcomings of Peak Latency
• Peak may find the rising edge of an adjacent component
  - Can be solved by a local peak measure
• Peak is sensitive to high-frequency noise
  - Can be mitigated by a low-pass filter
• Time of peak depends on overlapping components
• Terrible for broad components with no real peak
• Biased by the noise level
  - More noise => nearer to the center of the measurement window
• Not linear
• Difficult to relate to reaction time
50% Area Latency
• Uses the entire waveform in determining latency (a computational sketch follows this list)
• Robust to noise
• Not biased by the noise level
• Works fine for broad waveforms with no real peak
• Linear
• Easier to relate to RT
  - Almost the same as median
• Shortcomings
  - Measurement window must include entire component
  - Strongly influenced by overlapping components
  - Requires monophasic waveforms
  - Works best on big components and/or difference waves
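
A minimal sketch of fractional area latency, assuming a monophasic waveform (e.g., a difference wave) and a measurement window wide enough to contain the whole component; the name and parameters are hypothetical:

import numpy as np

def fractional_area_latency(wave, times, tmin, tmax, fraction=0.5):
    """Time at which the cumulative area reaches the given fraction."""
    win = (times >= tmin) & (times <= tmax)
    w, t = wave[win], times[win]
    area = np.cumsum(w)                          # running area under the curve
    i = np.argmax(area >= fraction * area[-1])   # first point at/past 50%
    return t[i]

# fraction=0.5 gives 50% area latency; fraction=0.2 gives the 20% area
# latency sometimes used as an onset measure.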
Relating Midpoint Latency to RT
[Figure: probability distribution of RT over -200 to 1000 ms, with 7% of RTs at 300 ms, 17% at 350 ms, and 25% at 400 ms]
Relating Midpoint Latency to RT
Peak latency is related to the mode of the RT distribution, not the mean or median
[Figure: ERP amplitude overlaid on the RT probability distribution]
Relating Midpoint Latency to RT
Typical RT probability distributions across different conditions
P3 peak latency usually differs less across conditions than mean RT
50% Area Latency Example
[Figures: 50% area latency measured from data of Luck & Hillyard (1990)]
Measuring Onset Latency
• Basic options for onset of component (a sketch of the statistical-threshold approach follows)
  - 20% area latency
  - 50% peak latency
  - Statistical threshold
    • First of N consecutive p<.05 points
[Figure: a waveform annotated with the peak amplitude, 50% of peak amplitude, and the latency @ 50% of peak amplitude]
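
A minimal sketch of the statistical-threshold definition of onset, assuming cond_a and cond_b are (subjects × timepoints) arrays; the function name and the N = 10 default are hypothetical:

import numpy as np
from scipy import stats

def onset_by_threshold(cond_a, cond_b, times, n_consecutive=10, alpha=0.05):
    """Onset = first of n_consecutive timepoints with paired-t p < alpha."""
    p = stats.ttest_rel(cond_a, cond_b, axis=0).pvalue   # one p per timepoint
    run = 0
    for i, sig in enumerate(p < alpha):
        run = run + 1 if sig else 0
        if run == n_consecutive:
            return times[i - n_consecutive + 1]          # start of the run
    return None                                          # no onset found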
Jackknife Approach
• Miller, Patterson, & Ulrich (1998)
  - Hard to measure onset latency (and other nonlinear parameters) from noisy single-subject waveforms
  - Much easier to measure from grand average
• Measure from grand average of N-1 subjects N times (once excluding each subject)
• Variance will be artificially low but can be corrected
  - Fcorrected = Funcorrected ÷ (N-1)² [N per condition]
  - Works for between- and within-subjects main effects and interactions
  - Jackknife can also be used with Pearson r
• So precise that you may need to use interpolation to measure latencies between sample points (see the sketch below)
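
A minimal sketch of the jackknife procedure, assuming waves is a (subjects × timepoints) array for one condition; the 50% fractional peak onset measure matches the figure on the next slide, and the linear interpolation handles the sub-sample precision mentioned above:

import numpy as np

def onset_50pct_peak(wave, times):
    """Latency at which the wave first reaches 50% of its peak amplitude."""
    thresh = 0.5 * wave.max()
    i = np.argmax(wave >= thresh)            # first sample at/above threshold
    if i == 0:
        return times[0]
    # Linear interpolation between samples: jackknifed waveforms are so
    # stable that latencies often fall between sample points
    frac = (thresh - wave[i - 1]) / (wave[i] - wave[i - 1])
    return times[i - 1] + frac * (times[i] - times[i - 1])

def jackknife_onsets(waves, times):
    """Measure onset from each of the N leave-one-out grand averages."""
    n = waves.shape[0]
    return np.array([
        onset_50pct_peak(np.delete(waves, i, axis=0).mean(axis=0), times)
        for i in range(n)
    ])

# An ANOVA computed on these values has artificially small error terms,
# so its F must be corrected: F_corrected = F_uncorrected / (N - 1)**2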
Jackknife Approach
[Figure: 50% fractional peak latency measured from leave-one-out grand averages: Subject 1 vs. the grand average without Subject 1, Subject 2 vs. the grand average without Subject 2, and so on]
Jackknife Approach
• Conventional ANOVA on LRP onset latency
  - F(1, 20) = 1.315, p = 0.258
• Jackknife ANOVA on LRP onset latency
  - F(1, 20) = 5221.625, Fc = 13.05, p = .0017
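As a check on the correction formula (assuming these df reflect a within-subjects contrast with N = 21 subjects): Fc = 5221.625 ÷ (21 − 1)² = 5221.625 ÷ 400 ≈ 13.05.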
• Limitations
  - Doesn't help with linear measures
  - Easier to have equal Ns for between-subjects ANOVAs
  - Is sometimes worse than conventional approach
  - Tests a slightly different null hypothesis
Jackknife Approach
• Conventional null hypothesis
  - If you measure from every individual in the population, the average of these measures does not differ across conditions
• Jackknife null hypothesis
  - If you make grand averages across every individual in the population, and measure from these grand averages, these measures do not differ across conditions
• Making a grand average leads to the same problems as averaging across trials
  - Greater latency variability across subjects in one group will lead to lower peak amplitude in this group's grand average
  - The onset time in the grand average will reflect the onset times of the subjects with the earliest onset times
• Think about it, and make sure you get the same general pattern with conventional statistics
Jackknife Approach
[Figure: Conditions A and B, each showing four single-subject onset times (Sub1-Sub4) relative to stimulus onset, along with the mean of the single-subject values]
Jackknife Approach
[Figure: the same data, now also marking the value measured from each condition's grand average alongside the mean of the single-subject values]
A difference in timing variability is misconstrued as a difference in mean onset time
Statistical Analysis
• Replication is the best statistic
  - The .05 threshold is arbitrary
    • What would happen if we decided the threshold should be .06?
  - We regularly violate the assumptions of statistical tests, so the computed p-values are not correct estimates of the probability of a Type I error
  - The real question is whether the effects are real or noise
  - If they are real (and large enough), they will be replicable
• General advice
  - Collect clean data with big effects
  - Run follow-up experiments that contain replications
  - Use a vanilla statistical approach (with jackknife approach for nonlinear measures, when appropriate), or
  - Find a really good statistician who can do the most appropriate statistical tests
Standard Approach
• First, collapse across irrelevant factors
  - If target and standard are counterbalanced, collapse to avoid physical stimulus differences
  - This reduces the number of ANOVA factors
    • Fewer p-values
    • Fewer spurious interactions
    • Smaller experimentwise error
• Do a separate ANOVA for each component
  - Don't use component as a repeated-measures factor
  - Separate ANOVAs for amplitude and latency
  - You could do a gigantic MANOVA, but it would have a zillion p-values
Standard Approach
• Use electrodes at which the component is present
  - Otherwise your effect may get swamped by noise at other electrodes
  - Interaction with electrode site has low power
• Electrode site is usually two factors
  - Anterior-posterior
  - Left-middle-right
  - Or clusters (averages across nearby electrodes)
• Usually bad to do a separate ANOVA for each site
  - More p-values means greater chance of Type I error
  - Less power means greater chance of Type II error
• Overall advice: Use stats in a way that most directly tests your main hypotheses
Choosing Electrode Sites
• Imagine you are comparing Condition A and Condition B at 128 electrode sites, and the conditions do not actually differ (zero difference with infinite power)
  - If the noise is independent at each site, you would expect p < .05 for 6-7 sites (.05 x 128 = 6.4)
  - If noise is correlated among nearby sites, you would expect p < .05 for at least one cluster of sites
• Therefore, if you choose which sites to measure by seeing which sites (or clusters) show a difference, you will have many false positives (actual p >> .05); a simulation of this problem follows this list
• Solution 1: All sites in an omnibus ANOVA (low power)
• Solution 2: Bonferroni correction (even lower power)
• Solution 3: Use false discovery rate correction (not quite as bad)
• Solution 4: Use a priori region of interest
• Solution 5: Use "functional localizer" condition
• Solution 6: Use resampling/randomization approaches
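
A minimal simulation of the problem, with a Benjamini-Hochberg false discovery rate correction implemented inline; the subject count and all data are made up, and there is no true effect at any site:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_subjects, n_sites = 20, 128
diff = rng.normal(size=(n_subjects, n_sites))       # A-minus-B, true mean 0

p = stats.ttest_1samp(diff, 0, axis=0).pvalue       # one t-test per site
print(f"uncorrected p < .05 at {(p < .05).sum()} sites")   # ~6-7 expected

# Benjamini-Hochberg FDR: find the largest rank k with p_(k) <= (k/m)q
q, m = 0.05, n_sites
below = np.sort(p) <= q * np.arange(1, m + 1) / m
n_sig = np.nonzero(below)[0][-1] + 1 if below.any() else 0
print(f"FDR-significant sites: {n_sig}")            # typically 0 here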
Example: Fishing for N2ac
2 simultaneous stimuli on each trial, selected from:
A) Pure sine wave
B) FM sweep
C) White noise burst
D) Click train
Duration = 750 ms, SOA = 1500 ± 150 ms
One stimulus defined as target for each trial block (e.g., FM sweep)
Task: Press one button for target-present, another for target-absent
Each stimulus equally likely to be combined with each other stimulus
Locations are randomized from trial to trial
Target is present on 25% of trials
Look at contra vs. ipsi with respect to target
Example: Fishing for N2ac
[Waveform figures]
Example: Fishing for N2ac
Separate ANOVAs for anterior and posterior electrode clusters
Factors: Contra/Ipsi, Hemisphere, Within-Hemisphere Site, Time
Example: Fishing for N2ac
Key Effects:
- Contra/Ipsi: Significant
- Contra/Ipsi x Time: Significant
- Contra/Ipsi x Electrode: ns
- Contra/Ipsi x Hemisphere: ns

Key Effects:
- Contra/Ipsi: ns
- Contra/Ipsi x Time: Significant
- Contra/Ipsi x Electrode: ns
- Contra/Ipsi x Hemisphere: Significant
Example: Fishing for N2ac
Contra/Ipsi @ Each Time Interval:
- 200-300 ms: Significant
- 300-400 ms: Significant
- 400-500 ms: Significant
- 500-600 ms: ns

Contra/Ipsi @ Each Time Interval:
- 200-300 ms: ns
- 300-400 ms: ns
- 400-500 ms: Significant
- 500-600 ms: Significant
Example: Fishing for N2ac
Follow-Up Experiment:
- Same basic paradigm to demonstrate replicability
- Slightly different stimuli to demonstrate generality
- Additional anterior electrode sites to better map scalp distribution
- Also included unilateral stimuli to determine whether the N2ac requires competition between simultaneous stimuli
Replicated basic anterior and posterior patterns
These effects were not present for unilateral stimuli
Electrode Interactions
• Amplitudes are multiplicative across electrodes
  - Fz amplitude might go from 1.0 µV to 1.5 µV, and Pz amplitude might go from 2 µV to 3 µV
[Figure: bar graph of Fz, Cz, and Pz amplitudes in Conditions A, B, and C, contrasting multiplicative and additive changes]
• This produces a condition x electrode site interaction
  - Even without a change in neural generators
Electrode Interactions
• McCarthy & Wood (1985): Normalize the data (a sketch follows below)
  - Divide by vector length
[Figure: Fz, Cz, and Pz amplitudes for Conditions A, B, and C before and after vector normalization]
• Now the conditions have the same overall amplitude
  - Main effects are eliminated; they are assessed prior to normalization
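
A minimal sketch of vector-length scaling, using the made-up Fz/Pz values from the earlier slide plus a hypothetical Cz value:

import numpy as np

amps = np.array([[1.0, 1.5, 2.0],      # Condition A at Fz, Cz, Pz (µV)
                 [1.5, 2.25, 3.0]])    # Condition B = A scaled by 1.5

# Divide each condition's topography by its vector (Euclidean) length
norm = amps / np.linalg.norm(amps, axis=1, keepdims=True)
print(norm)   # the two rows are now identical: a purely multiplicative
              # difference leaves no condition x electrode interaction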
Electrode Interactions
• Technical problem: Urbach & Kutas (2002) demonstrated that this does not actually work under many realistic conditions
  - Many of these problems disappear if you measure from difference waves
• Conceptual problem: The conclusions that can be drawn from an electrode site interaction are extremely weak
  - Could be the same generators, but a change in relative amplitudes
  - Could be the same generators, but a change in relative latencies
• General advice: Don't worry about electrode interactions
  - You can't draw very strong conclusions from them, so just report them
Heterogeneity of Covariance
• Within-subjects ANOVA assumes homogeneity of variance and covariance (sphericity)
  - Modest heterogeneity of variance is not a big problem
  - Heterogeneity of covariance inflates the Type I error rate
• What is homogeneity of covariance?
  - Applies with 3 or more levels of a within-subjects factor
  - Each level must be equally correlated with the other levels
Heterogeneity of Covariance
Within-subjects ANOVA assumes: Covariance(A, B) = Covariance(B, C) = Covariance(A, C)
[Figure: table of Subjects 1-3 under Conditions A, B, and C, illustrating the covariance assumption]
Heterogeneity of Covariance
• Why is this a special problem for ERPs?
  - Covariance is lower for more distant electrode pairs than for nearby electrode pairs
  - Whenever 3 or more electrodes are used, heterogeneity of covariance is likely
• SPR mandates that papers deal with this problem
• Greenhouse-Geisser epsilon adjustment (a computational sketch follows below)
  - Degree of nonsphericity is computed
  - An adjustment factor, epsilon, is computed
  - New df computed by multiplying epsilon by original df
  - New df used for computing p-values
• Greenhouse-Geisser epsilon is overly conservative
  - Can use Huynh-Feldt epsilon instead
• Everyone should use epsilon adjustment for all studies, not just ERP studies
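
A minimal sketch of the Greenhouse-Geisser adjustment for a one-way within-subjects design, assuming data is a (subjects × conditions) array; the data are made up, and a real analysis should use a vetted stats package:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(size=(20, 3)) + np.array([0.0, 0.3, 0.6])  # hypothetical

n, k = data.shape

# Greenhouse-Geisser epsilon from the double-centered covariance matrix
S = np.cov(data, rowvar=False)
row = S.mean(axis=0, keepdims=True)
S_c = S - row - row.T + S.mean()
eps = np.trace(S_c) ** 2 / ((k - 1) * np.sum(S_c ** 2))

# One-way repeated-measures ANOVA
grand = data.mean()
ss_cond = n * np.sum((data.mean(axis=0) - grand) ** 2)
ss_subj = k * np.sum((data.mean(axis=1) - grand) ** 2)
ss_err = np.sum((data - grand) ** 2) - ss_cond - ss_subj
df1, df2 = k - 1, (k - 1) * (n - 1)
F = (ss_cond / df1) / (ss_err / df2)

# Multiply both df by epsilon before computing the p-value
p_raw = stats.f.sf(F, df1, df2)
p_gg = stats.f.sf(F, eps * df1, eps * df2)
print(f"F({df1}, {df2}) = {F:.2f}, epsilon = {eps:.3f}")
print(f"p uncorrected = {p_raw:.4f}, p Greenhouse-Geisser = {p_gg:.4f}")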