
Can five be enough?
Sample sizes in usability tests
Paul Cairns and Caroline Jarrett
Problem: usability studies have small samples
• Good experiments: 30+ Ps (participants)
• Typical usability studies: ~5 Ps
• Moving to 3 Ps!
• How?! What?!
– Typically, I have conniptions
– CJ asked me to solve it!
8th March, 2012
UX people like small samples
• Common practice (CJ)
• Krug (2010): 3 (“a morning a month”)
• Tullis & Albert: 6 to 8 (formative)
• Virzi (1992)
• Nielsen (1993)
– 7 experts ≈ 5 experts
Use probabilities to suggest sample sizes
• Total number of problems: K
• Probability of problem discovery: p
• Find n so that K(1 − (1−p)^n) is x% of K
– Binomial distribution: 1 − (1−p)^n is the chance that at least one of n participants finds a given problem
• n is our sample size
• For p = 0.16, 0.22, 0.41, 0.6: n ≈ 5
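The binomial model above is easy to check with a short Python sketch. The 85% discovery target used here is an assumption for illustration (a commonly quoted figure), not something fixed by the slides:

```python
import math

def proportion_found(p: float, n: int) -> float:
    """Expected proportion of the K problems found by n participants,
    assuming each participant finds each problem independently with
    probability p (the binomial model on the slide)."""
    return 1 - (1 - p) ** n

def sample_size(p: float, target: float = 0.85) -> int:
    """Smallest n with 1 - (1-p)^n >= target.
    The 85% default target is an illustrative assumption."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# The slide's range of published p estimates:
for p in (0.16, 0.22, 0.41, 0.6):
    print(p, sample_size(p))
```

With that target the four quoted p values give n anywhere between 3 and 11, so n ≈ 5 sits in the middle of a wide range rather than being a universal answer.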
The models can be refined
• What is p for your system?
– p can be small (Spool & Schroeder, 2001)
– Bootstrap it from your own data
• Is p constant for all problems?
– More complex models
• Are all participants equally good?
• Refinements tend to increase n
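As a sketch of the bootstrap refinement: resample participants with replacement from a participant × problem discovery matrix and recompute the discovery rate each time, giving an interval for p rather than a single guess. The matrix and all names here are invented for illustration:

```python
import random

# Hypothetical discovery matrix: rows = participants, columns = problems,
# 1 = that participant discovered that problem. Invented data.
matrix = [
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
]

def discovery_rate(rows):
    """Overall fraction of (participant, problem) cells marked discovered."""
    cells = [cell for row in rows for cell in row]
    return sum(cells) / len(cells)

def bootstrap_p(rows, n_boot=2000, seed=1):
    """Resample participants (rows) with replacement; return the sorted
    bootstrap estimates of p, from which an interval can be read off."""
    rng = random.Random(seed)
    estimates = [
        discovery_rate(rng.choices(rows, k=len(rows)))
        for _ in range(n_boot)
    ]
    return sorted(estimates)

est = bootstrap_p(matrix)
low, high = est[int(0.025 * len(est))], est[int(0.975 * len(est))]
```

The width of the resulting interval is the point: with five participants, a single p estimate hides a lot of uncertainty.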
The models have conceptual flaws
• Is p meaningful?
– Independence of discovery
– Discovery is probabilistic
– What’s the probability space?
• Problem classification
A usability test can be an experiment
• Conduct it like an experiment
• Need an alternative hypothesis
• Measure (quantitatively) one thing
• Carefully defined tasks
• Manipulate the interface
• Use statistics to identify true variation
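As a sketch of “use statistics to identify true variation”: a two-sample permutation test on one quantitative measure (task time) across two interface variants. The data and the specific test are illustrative assumptions, not a method prescribed by the talk:

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test for a difference in mean task time.
    Shuffles the pooled times and counts how often a difference at
    least as large as the observed one arises by chance."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter

# Invented task times (seconds) on the old vs. new design:
old = [52, 61, 58, 70, 66, 59]
new = [41, 44, 39, 50, 46, 43]
p_value = permutation_test(old, new)
```

A permutation test suits the small samples typical of usability work, since it makes no normality assumption; but note it only ever answers the one alternative hypothesis you set up.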
Example questions
• Is the task quicker on the new design?
• Does the design increase click-throughs?
• Are errors below a threshold rate?
• Is performance comparable in the new design?
• Can you prove this design is worth it?
Why use an experiment?
Good for…
• When reasoning is not enough
• Good beliefs for improvements
• Finessing
Not for…
• Show-stoppers
• Large effects
• Anything but the alternative hypothesis
Usability tests are more about better designs
• Move to new technology
• Design well
• Reach a point of plausibility
• Competing considerations
• Test!
There are different argument styles
• Deduction: X causes Y; X, hence Y
• Induction: from instances of X and Y, when I see X, I infer Y
• Abduction: X causes Y; Y, hence…?
– Explanation seeking
• Peirce: “matted felt of pure hypothesis”
• Sherlock Holmes does abduction!
Solutions arise from abduction
• Users act in response to the system
• Features cause good/bad outcomes
• Abduce explanations
– More experience, better explanations
So what should be the sample size?
• H = “X is good”
• Null: p(H) = 0.5
• Five people are enough:
– H does not hold for any of the 5 people
– (0.5)^5 = 1/32 < 1/20, hence significant
 This is false!
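The arithmetic behind that (deliberately flawed) significance claim is just a one-sided sign test under the null p(H) = 0.5, and the numbers themselves do check out; the talk's objection is to the inference, not the sum:

```python
# Under the null "the design is good for each participant with
# probability 0.5", the chance that H fails for all five participants
# is (0.5)^5.
p_all_fail = 0.5 ** 5
print(p_all_fail)  # 1/32 = 0.03125, below the 1/20 = 0.05 threshold
# The slides' point: the sum is fine, but a usability test is not
# testing this single, isolated hypothesis, so "hence significant"
# does not follow.
```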
Usability tests sit in a cloud of hypotheses
• Usability as a privative
• Every feature is contingently usable
– Any falsification forces revision (Popper)
– Kuhnian resistance
• Neo-Popperian (Deutsch)
– Falsification + narratives (explanations)
Sample size depends on the explanation
• Plausible sample sizes
– Show-stopper: 1
– Unexpected but plausible: 3–5
– No explanation: many
• Different behaviour, same explanation
• ROI
Why usability tests might look like experiments
• Control in experiments
– Causal attribution
• Coverage
• Observation
• Piggy-backing
Questions?