Transcript Document
Can five be enough?
Sample sizes in usability tests
Paul Cairns and Caroline Jarrett
Problem: usability studies have small samples
Good experiments: 30+ participants (Ps)
Typical usability studies: ~5 Ps
Moving to 3 Ps!
How?! What?!
– Typically, I have conniptions
– CJ asked me to solve it!
8th March, 2012
UX people like small samples
Common practice (CJ)
Krug (2010), 3 (“a morning a month”)
Tullis & Albert, 6 to 8 (formative)
Virzi (1992)
Nielsen (1993)
– 7 experts ≈ 5 experts
Use probabilities to suggest sample sizes
Total number of problems, K
Probability of problem discovery, p
Find n so that 1 − (1−p)^n ≥ x%, i.e. x% of the K problems are found
– Binomial distribution
n is our sample size
p = 0.16, 0.22, 0.41, 0.6, n ≈ 5
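Under this binomial discovery model, the smallest n for a target coverage can be computed directly. A sketch (the 85% target is chosen only for illustration; the p values are those quoted above):

```python
import math

def smallest_n(p, target=0.85):
    """Smallest n with 1 - (1-p)**n >= target coverage,
    assuming every problem is found with probability p per participant."""
    # (1-p)**n <= 1 - target  =>  n >= log(1 - target) / log(1 - p)
    return math.ceil(math.log(1 - target) / math.log(1 - p))

for p in (0.16, 0.22, 0.41, 0.6):
    print(p, smallest_n(p))
```

At this target, n ranges from 3 (p = 0.6) up to 11 (p = 0.16), so "n ≈ 5" is sensitive to the assumed p.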
The models can be refined
What is p for your system?
– p can be small (Spool & Schroeder, 2001)
– Bootstrap
Is p constant for all problems?
– More complex models
Are all participants equally good?
Tend to increase n
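The "Is p constant?" question can be sketched numerically (the values below are assumptions, not from the talk): when p varies across problems while the mean stays the same, expected coverage at a fixed n drops, so the required n rises.

```python
def coverage(ps, n):
    """Expected proportion of problems found by n participants,
    given per-problem discovery probabilities ps."""
    return sum(1 - (1 - p) ** n for p in ps) / len(ps)

uniform = [0.31] * 10             # every problem equally findable
mixed = [0.05] * 5 + [0.57] * 5   # same mean p, varied difficulty

print(coverage(uniform, 5))  # ~0.84
print(coverage(mixed, 5))    # lower, despite the identical mean p
```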
The models have conceptual flaws
Is p meaningful?
– Independence of discovery
– Discovery is probabilistic
– What’s the probability space?
Problem classification
A usability test can be an experiment
Conduct like an experiment
Need an alternative hypothesis
Measure (quantitatively) one thing
Carefully defined tasks
Manipulate the interface
Use statistics to identify true variation
Example questions
Is task quicker on new design?
Does design increase click-throughs?
Are errors below a threshold rate?
Is performance comparable in new design?
Can you prove this design is worth it?
Why use an experiment?
Good for…
– When reasoning is not enough
– Good beliefs for improvements
– Finessing
Not for…
– Show-stoppers
– Large effects
– Anything but the alternative hypothesis
Usability tests are more about better designs
Move to new technology
Design well
Reach a point of plausibility
Competing considerations
Test!
There are different argument styles
Deduction: X causes Y; X hence Y
Induction: From instances of X and Y, when I see X, I infer Y.
Abduction: X causes Y; Y hence?
– Explanation seeking
Peirce: “matted felt of pure hypothesis”
Sherlock Holmes does abduction!
Solutions arise from abduction
Users act in response to system
Features cause good/bad outcomes
Abduce explanations
– More experience, better explanations
So what should be the sample size?
H = “X is good”
Null: p(H) = 0.5
Five people are enough:
– H does not hold for 5 people
– (0.5)^5 = 1/32 < 1/20, hence significant
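This is a one-sided sign test against a 50:50 null, and the arithmetic itself checks out (it is the framing, not the sum, that the talk goes on to dispute):

```python
# Sign-test arithmetic behind "five people are enough":
# under the null, H holds for each participant with probability 0.5,
# so the chance that H fails for all five is 0.5**5.
p_all_fail = 0.5 ** 5
print(p_all_fail)          # 0.03125, i.e. 1/32
print(p_all_fail < 0.05)   # True: below the conventional 5% level
```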
This is false!
Usability tests sit in a cloud of hypotheses
Usability as a privative
Every feature is contingently usable
– Any falsification forces revision (Popper)
– Kuhnian resistance
Neo-Popperian (Deutsch)
– Falsification + narratives (explanations)
Sample size depends on explanation
Plausible sample sizes
– Show-stopper: 1
– Unexpected but plausible: 3–5
– No explanation: many
Different behaviour, same explanation
ROI
Why usability tests might look like experiments
Control in experiments
– Causal attribution
Coverage
Observation
Piggy-backing
Questions?