Empirical Evaluation - Georgia Institute of Technology


Evaluating Visualizations
cs5764: Information Visualization
Chris North
Evaluating Visualizations
• Usability Test
  • Observation, problem identification
• Controlled Experiment
  • Formal controlled scientific experiment
  • Comparisons, statistical analysis
• Expert Review
  • Examination by visualization expert
• Heuristic Evaluation
  • Principles, Guidelines
• Algorithmic
Projects
• Implementation projects:
  • Small usability test of implementation
  • Short usability report
• Experiment projects:
  • Main controlled experiment
  • Experiment materials and raw data
  • Then data analysis
Usability test vs. Controlled Expm.
• Usability test:
  • Formative: helps guide design
  • Single UI, early in design process
  • Few users
  • Usability problems, incidents
  • Qualitative feedback from users
• Controlled experiment:
  • Summative: measure final result
  • Compare multiple UIs
  • Many users, strict protocol
  • Independent & dependent variables
  • Quantitative results, statistical significance
Controlled Experiments
What is Science?
• Measurement
• Modeling
Scientific Method
1. Form hypothesis
2. Collect data
3. Analyze
4. Accept/reject hypothesis
• How to “prove” a hypothesis in science?
  • Easier to disprove things, by counterexample
  • Null hypothesis = opposite of hypothesis
  • Disprove the null hypothesis
  • Hence, the hypothesis is proved
Empirical Experiment
• Typical question:
  • Which visualization is better in which situations?
  • E.g. Spotfire vs. TableLens
Cause and Effect
• Goal: determine “cause and effect”
  • Cause = visualization tool (Spotfire vs. TableLens)
  • Effect = user performance time on task T
• Procedure:
  • Vary cause
  • Measure effect
• Problem: random variation
  • [Diagram: real world → collected data (random variation) → uncertain conclusions]
  • Cause = vis tool OR random variation?
Stats to the Rescue
• Goal:
  • Measured effect unlikely to result from random variation
• Hypothesis:
  • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
• Null hypothesis:
  • Visualization tool has no effect (e.g. Spotfire = TableLens)
  • Hence: cause = random variation
• Stats:
  • If the null hypothesis were true, the measured effect would occur with probability < 5%
  • But the measured effect did occur! (e.g. measured effect >> random variation)
• Hence:
  • Null hypothesis unlikely to be true
  • Hence, hypothesis likely to be true (see the sketch below)
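To make this null-hypothesis logic concrete, here is a minimal Python sketch (not part of the lecture; the timing numbers are invented) that estimates, by shuffling the group labels, how often a difference as large as the measured one would arise from random variation alone.

```python
import random

spotfire_times = [37.2, 41.0, 35.5, 44.8, 39.1]    # hypothetical seconds
tablelens_times = [29.8, 31.2, 27.9, 34.0, 30.5]   # hypothetical seconds

# Measured effect: difference between the two group means
observed_diff = abs(sum(spotfire_times) / len(spotfire_times)
                    - sum(tablelens_times) / len(tablelens_times))

# Under the null hypothesis the tool labels are meaningless, so shuffle them
pooled = spotfire_times + tablelens_times
n = len(spotfire_times)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:n], pooled[n:]
    if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed_diff:
        extreme += 1

p = extreme / trials   # estimated probability of the measured effect under the null
print(f"observed difference = {observed_diff:.1f} s, estimated p = {p:.3f}")
# Small p (< 0.05) -> the null hypothesis ("tool has no effect") is unlikely
```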
Variables
• Independent Variables (what you vary), and treatments (the variable values):
  • Visualization tool
    » Spotfire, TableLens, Excel
  • Task type
    » Find, count, pattern, compare
  • Data size (# of items)
    » 100, 1000, 1000000
• Dependent Variables (what you measure):
  • User performance time
  • Errors
  • Subjective satisfaction (survey)
  • HCI metrics
Example: 2 x 3 design
• Rows (Ind Var 1: Vis. Tool): Spotfire, TableLens
• Columns (Ind Var 2: Task Type): Task1, Task2, Task3
• n users per cell
• Each cell holds the measured user performance times (dep var); a layout sketch follows
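As a minimal sketch (the tool names, task names, and n come from the slide; the code itself is illustrative, not part of the lecture), the 2 x 3 design can be held as one list of measured times per cell:

```python
from itertools import product

tools = ["Spotfire", "TableLens"]      # Ind Var 1: Vis. Tool
tasks = ["Task1", "Task2", "Task3"]    # Ind Var 2: Task Type
n_per_cell = 20                        # users measured in each cell

# One list of user performance times (dep var) per (tool, task) cell
cells = {(tool, task): [] for tool, task in product(tools, tasks)}

# Recording one hypothetical measurement:
cells[("TableLens", "Task2")].append(53.2)

for (tool, task), times in cells.items():
    print(f"{tool:9s} {task}: {len(times)}/{n_per_cell} measurements")
```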
Groups
• “Between subjects” variable
  • 1 group of users for each variable treatment
  • Group 1: 20 users, Spotfire
  • Group 2: 20 users, TableLens
  • Total: 40 users, 20 per cell
• “Within subjects” (repeated) variable
  • All users perform all treatments
  • Counter-balance the order effect (see the sketch after this list)
  • Group 1: 20 users, Spotfire then TableLens
  • Group 2: 20 users, TableLens then Spotfire
  • Total: 40 users, 40 per cell
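A minimal sketch (participant IDs are invented) of the two grouping schemes: between-subjects assigns each user one tool, while within-subjects gives every user both tools with the order counterbalanced across two groups.

```python
# 40 hypothetical participants
users = [f"user{i:02d}" for i in range(1, 41)]

# Between subjects: each user gets exactly one tool (20 per group)
between = {u: ("Spotfire" if i < 20 else "TableLens") for i, u in enumerate(users)}

# Within subjects: every user gets both tools; alternate the order across users
# to counterbalance the order effect
within = {
    u: (["Spotfire", "TableLens"] if i % 2 == 0 else ["TableLens", "Spotfire"])
    for i, u in enumerate(users)
}

print(between["user01"])   # -> Spotfire
print(within["user02"])    # -> ['TableLens', 'Spotfire']
```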
Issues
• Eliminate or measure extraneous factors
• Randomized
• Fairness
• Identical procedures, …
• Bias
• User privacy, data security
• IRB (Institutional Review Board)
Procedure
• For each user:
• Sign legal forms
• Pre-Survey: demographics
• Instructions
» Do not reveal true purpose of experiment
• Training runs
• Actual runs
» Give task, measure performance
• Post-Survey: subjective measures
• Repeat for each of the n users
Data
• Measured dependent variables
• Spreadsheet:
  User | Spotfire: task 1, task 2, task 3 | TableLens: task 1, task 2, task 3
  (one row of measured values per user)
Step 1: Visualize it
• Dig out interesting facts
• Qualitative conclusions
• Guide stats
• Guide future experiments
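For example, a quick plot of every user's time per tool often reveals more than the two averages do. A minimal sketch with invented data, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

spotfire_times = [37.2, 41.0, 35.5, 44.8, 39.1, 52.3, 33.0, 47.6]    # hypothetical
tablelens_times = [29.8, 31.2, 27.9, 34.0, 30.5, 45.1, 26.7, 38.2]   # hypothetical

fig, ax = plt.subplots()
ax.boxplot([spotfire_times, tablelens_times])   # one box per tool
ax.set_xticks([1, 2])
ax.set_xticklabels(["Spotfire", "TableLens"])
ax.set_ylabel("Task completion time (secs)")
ax.set_title("All the data, not just the two averages")
plt.show()
```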
Step 2: Stats
• Average user performance times (dep var), Vis. Tool (Ind Var 1) × Task Type (Ind Var 2):

              Task1   Task2   Task3
  Spotfire     37.2    54.5   103.7
  TableLens    29.8    53.2   145.4
TableLens better than Spotfire?
[Bar chart: average perf time (secs), Spotfire vs. TableLens]
• Problem with Averages: lossy
• Compares only 2 numbers
• What about the 40 data values? (Show me the data!)
The real picture
[Chart: all individual perf times (secs), Spotfire vs. TableLens]
• Need stats that compare all data
Statistics
• t-test (see the sketch after this list)
  • Compares 1 dep var on 2 treatments of 1 ind var
• ANOVA: Analysis of Variance
  • Compares 1 dep var on n treatments of m ind vars
• Result:
  • p = probability that the difference between treatments is random (null hypothesis)
  • “Statistical significance” level
  • Typical cut-off: p < 0.05
  • Hypothesis confidence = 1 - p
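Here is a minimal sketch of both tests using SciPy (the timing data are invented for illustration): ttest_ind for two treatments of one independent variable, and f_oneway for a one-way ANOVA across three treatments.

```python
from scipy import stats

spotfire  = [37.2, 41.0, 35.5, 44.8, 39.1, 33.0]   # hypothetical times (secs)
tablelens = [29.8, 31.2, 27.9, 34.0, 30.5, 26.7]
excel     = [55.3, 60.1, 49.8, 62.4, 58.0, 51.2]

# t-test: 1 dep var (time), 2 treatments (Spotfire vs. TableLens) of 1 ind var
t_res = stats.ttest_ind(spotfire, tablelens)
print(f"t-test: t = {t_res.statistic:.2f}, p = {t_res.pvalue:.4f}")

# One-way ANOVA: 1 dep var, 3 treatments of the "tool" ind var
a_res = stats.f_oneway(spotfire, tablelens, excel)
print(f"ANOVA:  F = {a_res.statistic:.2f}, p = {a_res.pvalue:.4f}")

# Typical cut-off from the slide: report a significant difference if p < 0.05
```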
Excel
p < 0.05
• Woohoo!
• Found a “statistically significant” difference
• Averages determine which is ‘better’
• Conclusion:
  • Cause = visualization tool (e.g. Spotfire ≠ TableLens)
  • Vis Tool has an effect on user performance for task T …
  • “95% confident that TableLens better than Spotfire …”
  • NOT “TableLens beats Spotfire 95% of time”
  • 5% chance of being wrong!
  • Be careful about generalizing
p > 0.05
• Hence, no difference?
  • Vis Tool has no effect on user performance for task T…?
  • Spotfire = TableLens?
• NOT!
  • Did not detect a difference, but could still be different
  • Potential real effect did not overcome random variation
  • Provides evidence for Spotfire = TableLens, but not proof
  • Boring, basically found nothing
• How?
  • Not enough users
  • Need better tasks, data, …
Data Mountain
• Robertson, “Data Mountain” (Microsoft)
Comparison of Info Vis Systems
• Kobsa
Cleveland’s Rules for Secondary Tasks
• Chewar et al.
Usability Testing
Usability test vs. Controlled Expm.
• Usability test:
  • Formative: helps guide design
  • Single UI, early in design process
  • Few users
  • Usability problems, incidents
  • Qualitative feedback from users
• Controlled experiment:
  • Summative: measure final result
  • Compare multiple UIs
  • Many users, strict protocol
  • Independent & dependent variables
  • Quantitative results, statistical significance
Usability Specification Table
  Scenario task                         | Worst case | Planned target | Best case (expert) | Observed
  Find most expensive house for sale?   | 1 min.     | 10 sec.        | 3 sec.             | ??? sec
  …
Usability Test Setup
• Set of benchmark tasks
  • Easy to hard, specific to open-ended
  • Coverage of different UI features
  • E.g. “find the 5 most expensive houses for sale”
• Consent forms
  • Not needed unless video-taping user’s face (new rule)
• Experimenters:
  • Facilitator: instructs user
  • Observers: take notes, collect data, video tape screen
  • Executor: run the prototype if faked
• Users
  • 3-5 users, quality not quantity
Usability Test Procedure
• Goal: mimic real life
  • Do not cheat by showing them how to use the UI!
• Initial instructions
  • “We are evaluating the system, not you.”
• Repeat:
  • Give user a task
  • Ask user to “think aloud”
  • Observe, note mistakes and problems
  • Avoid interfering, hint only if completely stuck
• Interview
  • Verbal feedback
  • Questionnaire
• ~1 hour / user
Usability Lab
• E.g. McBryde 102
Data
• Note taking
• E.g. “&%$#@ user keeps clicking on the wrong button…”
• Verbal protocol: think aloud
• E.g. user expects that button to do something else…
• Rough quantitative measures
• HCI metrics: e.g. task completion time, ..
• Interview feedback and surveys
• Video-tape screen & mouse
• Eye tracking, biometrics?
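A minimal sketch (the file name and task wording are placeholders) of how a rough HCI metric such as task completion time might be logged during a session:

```python
import csv
import time

def time_task(user_id: str, task: str) -> float:
    """Start the clock when the task is given; stop when the observer presses Enter."""
    start = time.perf_counter()
    input(f"{user_id}: '{task}' -- press Enter when the user finishes ")
    return time.perf_counter() - start

# Append one row per (user, task) to a simple log file
with open("usability_times.csv", "a", newline="") as f:
    writer = csv.writer(f)
    elapsed = time_task("P01", "find the 5 most expensive houses for sale")
    writer.writerow(["P01", "find the 5 most expensive houses for sale", f"{elapsed:.1f}"])
```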
Analyze
• Initial reaction:
• “stupid user!”, “that’s developer X’s fault!”, “this sucks”
• Mature reaction:
• “how can we redesign UI to solve that usability problem?”
• the user is always right
• Identify usability problems
• Learning issues: e.g. can’t figure out or didn’t notice feature
• Performance issues: e.g. arduous, tiring to solve tasks
• Subjective issues: e.g. annoying, ugly
• Problem severity: critical vs. minor
Cost-Importance Analysis
  Problem | Importance | Solutions | Cost | Ratio I/C

• Importance 1-5 (task effect, frequency):
  • 5 = critical, major impact on user, frequent occurrence
  • 3 = user can complete task, but with difficulty
  • 1 = minor problem, small speed bump, infrequent
• Ratio = importance / cost (computed in the sketch below)
  • Sort by this
  • 3 categories: must fix, next version, ignored
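A minimal sketch of the cost-importance ranking (the problems, scores, costs, and category cut-offs are invented for illustration): compute importance / cost, sort descending, and bucket into the three categories.

```python
# (problem description, importance 1-5, estimated cost to fix, e.g. person-days)
problems = [
    ("User can't find the zoom control",         5, 1),
    ("Legend labels truncated on small screens", 3, 2),
    ("Colors hard to distinguish when printed",  1, 4),
]

# Ratio = importance / cost; sort descending so the best fixes-per-effort come first
ranked = sorted(
    ((desc, imp, cost, imp / cost) for desc, imp, cost in problems),
    key=lambda row: row[3],
    reverse=True,
)

for desc, imp, cost, ratio in ranked:
    # Category cut-offs here are arbitrary, for illustration only
    if ratio >= 2:
        category = "must fix"
    elif ratio >= 1:
        category = "next version"
    else:
        category = "ignored"
    print(f"{ratio:4.1f}  {category:12s}  {desc}")
```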
Refine UI
• Simple solutions vs. major redesigns
• Solve problems in order of: importance / cost
• Example:
  • Problem: user didn’t know he could zoom in to see more…
  • Potential solutions:
    – Better zoom button icon, tooltip
    – Add a zoom bar slider (like moosburg)
    – Icons for different zoom levels: boundaries, roads, buildings
    – NOT: more “help” documentation!!! You can do better.
• Iterate
  • Test, refine, test, refine, test, refine, …
  • Until? Meets the usability specification
Project revisited
• For implementation projects:
  • Informal test
  • A few users
    – Not (tainted) info vis students
  • 102 lab not required
  • Simple data collection
    – Biometrics optional!
  • 1 iteration
  • Exploit this opportunity to improve your design
• For experiment projects:
  • See controlled experiments