Observation & Empirical Studies


Observation &
Experiments
Watch, listen, and learn…
Observing Users
 Not as easy as you think
 One of the best ways to gather feedback about your
interface
 Watch, listen and learn as a person interacts with
your system
 Qualitative & quantitative, end users, experimental or
naturalistic
Observation
 Direct
 In same room
 Can be intrusive
 Users aware of your
presence
 Only see it one time
 May use 1-way mirror
to reduce intrusiveness
Indirect
Video or app recording
Reduces intrusiveness,
but doesn’t eliminate it
Cameras focused on
screen, face & keyboard
Gives archival record,
but can spend a lot of
time reviewing it
Location
 Observations may be
 In lab - maybe a specially built usability lab
 Easier to control
 Can have user complete set of tasks
 In field
 Watch their everyday actions
 More realistic
 Harder to control other factors
Observation
Room
 State-of-the-art observation room
equipped with three monitors to view
participant, participant's monitor, and
composite picture in picture.
 One-way mirror plus angled glass
captures light and isolates sound
between rooms.
 Comfortable and spacious for three
people, but room enough for six
seated observers.
 Digital mixer for unlimited mixing of
input images and recording to VHS,
SVHS, or MiniDV recorders.
Task Selection
 What tasks are people performing?
 Representative and realistic?
 Tasks dealing with specific parts of the interface you want to test?
 Problematic tasks?
 Don’t forget to pilot your entire evaluation!
Engaging Users in Evaluation
 What’s going on in the user’s head?
 Use verbal protocol where users describe their
thoughts
 Qualitative techniques
 Think-aloud - can be very helpful
 Post-hoc verbal protocol - review video
 Critical incident logging - positive & negative
 Structured interviews - good questions
 “What did you like best/least?”
 “How would you change..?”
Think Aloud
 User describes verbally what s/he is thinking and doing
 What they believe is happening
 Why they take an action
 What they are trying to do
 Widely used, popular protocol
 Potential problems:
 Can be awkward for participant
 Thinking aloud can modify the way the user performs the task
Cooperative approach
 Another technique: Co-discovery learning (constructive interaction)
 Join pairs of participants to work together
 Use think aloud
 Perhaps have one person be semi-expert (coach) and one be novice
 More natural (like conversation), so removes some awkwardness of individual think aloud
 Variant: let coach be from design team (cooperative evaluation)
Alternative
 What if thinking aloud during session will be too disruptive?
 Can use post-event protocol
 User performs session, then watches video afterwards and describes what s/he was thinking
 Sometimes difficult to recall
 Opens up door of interpretation
What if a user gets stuck?
 Decide ahead of time what you will do.
 Offer assistance or not? What kind of assistance?
 You can ask (in cooperative evaluation)
 “What are you trying to do..?”
 “What made you think..?”
 “How would you like to perform..?”
 “What would make this easier to accomplish..?”
 Maybe offer hints
 This is why cooperative approaches are used
Inputs
 Need operational prototype
 could use Wizard of Oz simulation
 Need tasks and descriptions
 Reflect real tasks
 Avoid choosing only tasks that your design
best supports
 Minimize necessary background knowledge
 Pay attention to time and training required.
Capturing a Session
 1. Paper & pencil
 Can be slow
 May miss things
 Is definitely cheap and easy
[Figure: sample paper log sheet with a Time column (10:00, 10:03, 10:08, 10:22) and rows for Task 1, Task 2, Task 3, …, each marked S and e]
Capturing a Session
 2. Recording (screen, audio and/or video)
 Good for think-aloud
 Multiple cameras may be needed
 Good, rich record of session
 Can be intrusive
 Can be painful to transcribe and analyze
Recording
 2b. With usability software (such as Morae)
 Combines audio/video, screen, mouse and keyboard clicks, window opening/closing, etc.
 Synchronizes everything
 Allows for annotations
 Can create clips or presentations
Capturing a Session
 3. Software logging
 Modify software to log user actions
 Can give time-stamped key press or mouse event
 Two problems:
 Too low-level, want higher level events
 Massive amount of data, need analysis tools
Example logs
2303761098721869683|hrichter|1098722080134|MV|START|566
2303761098721869683|hrichter|1098722122205|MV|QUESTION|false|false|false|false|false|false|
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098724986904|MV|SEEK|PRESENTATION-A|566|604189|0
2303761098721869683|hrichter|1098724996257|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098724998791|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098725002506|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725003848|MV|SEEK|AGENDA|566|149613|604189
2303761098721869683|hrichter|1098725005981|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725007133|MV|SLIDECHANGE|3
2303761098721869683|hrichter|1098725009326|MV|SEEK|PRESENTATION|566|315796|149613
2303761098721869683|hrichter|1098725011569|MV|PLAY|566|315796
2303761098721869683|hrichter|1098725039850|MV|TAB|AV
2303761098721869683|hrichter|1098725054241|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725056053|MV|SLIDECHANGE|2
2303761098721869683|hrichter|1098725057365|MV|SEEK|PRESENTATION|566|271191|315796
2303761098721869683|hrichter|1098725064986|MV|TAB|AV
2303761098721869683|hrichter|1098725083373|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725084534|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725085255|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725088690|MV|TAB|AV
2303761098721869683|hrichter|1098725130500|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725139643|MV|TAB|AV
2303761098721869683|hrichter|1098726430039|MV|STOP|566|271191
2303761098721869683|hrichter|1098726432482|MV|END
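A minimal sketch of how raw log lines like these could be rolled up into higher-level measures (event counts and time spent on each tab). The field meanings assumed here (session id, user, millisecond timestamp, source, event name, arguments) are guesses from the sample lines, not a documented format, and `parse_line` and `summarize` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# A few lines copied from the example log above.
SAMPLE_LOG = """\
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098725002506|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725005981|MV|TAB|PRESENTATION
"""


def parse_line(line):
    """Split one pipe-delimited line into named fields (field names are assumed)."""
    session, user, timestamp_ms, source, event, *args = line.strip().split("|")
    return {
        "session": session,
        "user": user,
        "t_ms": int(timestamp_ms),
        "source": source,
        "event": event,
        "args": args,
    }


def summarize(log_text):
    """Return (count of each event type, milliseconds spent on each tab)."""
    events = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    counts = Counter(e["event"] for e in events)

    # Time on a tab = gap between one TAB event and the next TAB event.
    # The last TAB event has no closing event here, so it is not counted.
    tab_ms = defaultdict(int)
    tabs = [e for e in events if e["event"] == "TAB"]
    for current, following in zip(tabs, tabs[1:]):
        tab_ms[current["args"][0]] += following["t_ms"] - current["t_ms"]
    return counts, dict(tab_ms)


counts, tab_ms = summarize(SAMPLE_LOG)
print(counts)
print(tab_ms)
```

Summaries like these address both problems named above: they raise the level of abstraction from single events to behavior, and they cut the volume of data to something a person can read.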
Analysis
 Many approaches
 Task based
 How do users approach the problem
 What problems do users have
 Need not be exhaustive, look for interesting cases
 Performance based
 Frequency and timing of actions, errors, task
completion, etc.
 Can be very time consuming!!
Experiments
Testing hypotheses…
Experiments
 Test hypotheses in your design
 More controlled examination than just
observation
 Generally quantitative, experimental,
with end users.
Types of Variables
 Independent
 What you’re studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique, design)
 Dependent
 Performance measures you record or examine (e.g., time, number of errors)
 Controlled
 What you do not want to affect your study
“Controlling” Variables
 Prevent a variable from affecting the results in any systematic way
 Methods of controlling for a variable:
 Don’t allow it to vary
 e.g., all males
 Allow it to vary randomly
 e.g., randomly assign participants to different groups
 Counterbalance - systematically vary it
 e.g., equal number of males, females in each group
 The appropriate option depends on circumstances
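A minimal sketch combining two of the control methods above: gender is counterbalanced (equal numbers in each group), and which individual lands in which group is left to chance. The participant list, the fixed seed, and the helper name `assign_groups` are illustrative assumptions.

```python
import random


def assign_groups(participants, n_groups, seed=0):
    """Randomly assign (name, gender) pairs to groups, balancing gender across groups."""
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    groups = [[] for _ in range(n_groups)]

    # Split participants by gender so each gender can be dealt out evenly.
    by_gender = {}
    for name, gender in participants:
        by_gender.setdefault(gender, []).append(name)

    for gender, names in by_gender.items():
        rng.shuffle(names)  # random order within each gender
        for i, name in enumerate(names):
            groups[i % n_groups].append((name, gender))  # deal round-robin
    return groups


# Hypothetical pool: 12 participants, 6 female and 6 male.
participants = [(f"P{i}", "F" if i % 2 else "M") for i in range(1, 13)]
groups = assign_groups(participants, n_groups=2)
for number, group in enumerate(groups, start=1):
    print(f"Group {number}: {group}")
```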
Hypotheses
 What you predict will happen
 More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)
 “Null” hypothesis (H0)
 Stating that there will be no effect
 e.g., “There will be no difference in performance between the two groups”
 Data used to try to reject this null hypothesis
Defining Performance
 Based on the task
 Specific, objective measures/metrics
 Examples:
 Speed (reaction time, time to complete)
 Accuracy (errors, hits/misses)
 Production (number of files processed)
 Score (number of points earned)
 …others…?
 Preference, satisfaction, etc. (e.g., questionnaire responses) are also valid measurements
Example
 Do people complete operations faster with a black-and-white display or a color one?
 Independent - display type (color or b/w)
 Dependent - time to complete task (minutes)
 Controlled variables - same number of males and females in each group
 Hypothesis: Time to complete the task will be shorter for users with color display
 H0: Time_color = Time_b/w
What about subjects?
 How many?
 Book advice: at least 10
 Other advice: 6 subjects per experimental condition
 Real advice: depends on statistics
 Relating subjects and experimental conditions
 Within/between subjects design
Experimental Designs
 Within Subjects Design
 Every participant provides a score for all levels or conditions

Participant   Color      B/W
P1            12 secs.   17 secs.
P2            19 secs.   15 secs.
P3            13 secs.   21 secs.
...
Experimental Designs
 Between Subjects
 Each participant provides results for only one condition

Color          B/W
P1 12 secs.    P2 17 secs.
P3 19 secs.    P5 15 secs.
P4 13 secs.    P6 21 secs.
...
Within Subjects Designs
 More efficient:
 Each subject gives you more data - they complete
more “blocks” or “sessions”
 More statistical “power”:
 Each person is their own control
 Therefore, can require fewer participants
 May mean more complicated design to avoid “order
effects”

e.g. seeing color then b/w may be different from
seeing b/w then color
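To make "each person is their own control" concrete, here is a rough sketch that treats the six illustrative times from the Experimental Designs slides in two ways. As a within-subjects design, the analysis works on each participant's color minus b/w difference (a paired t statistic). As a between-subjects design, it compares two independent group means (a Welch t statistic, chosen here as an assumption). Only the Python standard library is used, so the sketch stops at the t statistics and does not look up p-values.

```python
import math
import statistics

# Illustrative times in seconds from the slides above.
color = [12, 19, 13]
bw = [17, 15, 21]


def paired_t(a, b):
    """t statistic for a within-subjects design: a[i] and b[i] come from the same person."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def welch_t(a, b):
    """t statistic for a between-subjects design: a and b come from different people."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


t_within = paired_t(color, bw)
t_between = welch_t(color, bw)
print(f"paired t = {t_within:.3f}, Welch t = {t_between:.3f}")
```

The paired version gains power only when a person's two scores move together (fast people are fast in both conditions). With three made-up data points that pattern does not have to appear, so the two statistics should not be read as a demonstration of which design is better.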
Between Subjects Designs
 Fewer order effects
 Participant may learn from first condition
 Fatigue may make second performance worse
 Simpler design & analysis
 Easier to recruit participants (only one
session, less time)
 Less efficient
Now What…?
 Performed initial data inspection

Removed outliers, have general idea what
occurred
 Descriptive Statistics

Totals, Averages, Ranges, etc.
 Subgroup Statistics
 Inferential Analysis to test hypotheses
Descriptive Statistics
 For all variables and subgroups, get a feel for results:
 Total scores, times, ratings, etc.
 Minimum, maximum
 Mean, median, ranges, etc.
e.g. “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range 18-37 years).”
e.g. “The median time to complete the task in the mouse-input group was 34.5 s (min=19.2, max=305 s). The median time to complete the task in the keyboard-input group was 32.1 s (min=17.6, max=286 s).”
What is the difference between mean & median? Why use one or the other?
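A small sketch of why the choice between mean and median matters for task times. The seven times below are hypothetical; they only echo the shape of the quoted example (most people finish in about half a minute, one person takes about five minutes).

```python
import statistics

# Hypothetical task completion times in seconds, with one very slow participant.
times_s = [19.2, 28.0, 31.5, 34.5, 36.0, 41.3, 305.0]

mean_s = statistics.mean(times_s)
median_s = statistics.median(times_s)

print(f"min = {min(times_s)} s, max = {max(times_s)} s")
print(f"mean = {mean_s:.1f} s, median = {median_s:.1f} s")
```

The single slow participant pulls the mean to roughly twice the median, while the median still describes the typical participant. That is one reason skewed measures such as completion time are often reported with the median and range.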
Inferential Stats and the Data
Are these really
different? What
would that mean?
Goal of analysis
 Get >95% confidence in significance of result
 That is, the null hypothesis is rejected
 H0: Time_color = Time_b/w
 OR, there is an influence
 OR, there is at most a 1 in 20 chance of seeing a difference this large if the null hypothesis were true
Hypothesis Testing
 Tests to determine differences
 t-test to compare two means
 ANOVA (Analysis of Variance) to compare
several means
 Need to determine “statistical significance”
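 “Significance level” (p):
 The probability of getting a result at least as extreme as the one you saw, simply by chance, if the null hypothesis is true
 The threshold for p (the “alpha” level) is often set at 0.05: if the null hypothesis were true, you would get a result like the one you saw no more than 5% of the time, just by chance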
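One way to see what p means is an exact permutation test, swapped in here for the t-test because it needs only the standard library. If display type made no difference (the null hypothesis), the color and b/w labels would be interchangeable, so every relabeling of the six times is tried and the sketch counts how often the difference in means is at least as large as the observed one. The data are the illustrative times from the Experimental Designs slides, treated as a between-subjects study; `permutation_p_value` is a hypothetical helper name.

```python
from itertools import combinations
from statistics import mean

# Illustrative times in seconds, treated as two independent groups.
color = [12, 19, 13]
bw = [17, 15, 21]


def permutation_p_value(a, b):
    """Two-sided exact permutation p-value for the difference in group means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    extreme = 0
    total = 0
    # Try every way of choosing which pooled values are labeled as group a.
    for chosen in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


ALPHA = 0.05
p = permutation_p_value(color, bw)
reject_null = p < ALPHA
print(f"p = {p:.2f}, reject null hypothesis: {reject_null}")
```

With only three participants per group, 8 of the 20 possible relabelings give a difference at least as large as the observed 3 seconds, so p is far above 0.05 and the null hypothesis is not rejected.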
Feeding Back Into Design
 What were the conclusions you reached?
 How can you improve on the design?
 What are quantitative benefits of the redesign?
 e.g. 2 minutes saved per transaction, which means
24% increase in production, or $45,000,000 per year in
increased profit
 What are qualitative, less tangible benefit(s)?
 e.g. workers will be less bored, less tired, and
therefore more interested --> better cust. service
Example: Resolution Effects
 Task: Given a filename, locate the icon with that filename on a desktop scattered with other icons (distractors)
 Compare task time under 3 resolutions (800 x 600, 1024 x 768, 1280 x 1024)
 Control for:
 Number of distractors (same)
 Icon shape/size (all text document icons)
 Position of target (randomized)
 Etc.
Resolution Experiment
 Task: Cursor starts in middle of screen. Filename flashes at top of screen, user has to find the file and click on it once with cursor
 Hypothesis: time to select target icon increases with higher resolution (because icons and font are smaller)
 Independent variables:
 3 resolution settings
 Dependent variables:
 Time to click on target
 Controlling variables:
 Experience, # distractors, size/shape, target position, distance from start, direction to target
Resolution Experiment
 Within-Participants design
 1 hour, 3 blocks of 15 minutes each, 100 trials
in each condition
 24 participants, split into 6 groups of 4 people
 Why 6 groups? 3 conditions, and 3! = 6
 User preference questionnaire, ratings
Group   Block 1   Block 2   Block 3
A       Res 1     Res 2     Res 3
B       Res 1     Res 3     Res 2
C       Res 2     Res 3     Res 1
D       Res 2     Res 1     Res 3
E       Res 3     Res 2     Res 1
F       Res 3     Res 1     Res 2

Res 1: 800 x 600
Res 2: 1024 x 768
Res 3: 1280 x 1024
3! = 3 * 2 * 1 = 6 --> so there are six possible orderings of the blocks
Divide the 24 participants evenly among these 6 groups.
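The counterbalancing scheme above can be sketched with `itertools.permutations`, which enumerates all 3! = 6 block orderings. The participant labels and the fixed shuffle seed are illustrative assumptions.

```python
import random
from itertools import permutations

resolutions = ["800 x 600", "1024 x 768", "1280 x 1024"]

# All possible orders in which the three resolution blocks can be presented.
orderings = list(permutations(resolutions))

# Hypothetical participant labels, shuffled so group membership is random.
participants = [f"P{i}" for i in range(1, 25)]
random.Random(0).shuffle(participants)  # fixed seed only for repeatability

# Deal participants round-robin so every ordering gets the same number of people.
assignment = {order: [] for order in orderings}
for i, participant in enumerate(participants):
    assignment[orderings[i % len(orderings)]].append(participant)

for order, members in assignment.items():
    print(" -> ".join(order), ":", ", ".join(members))
```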
Analysis Process
(Within-Participants)
 Look at descriptive statistics, average trial time, user preferences on questionnaire, etc.
 Remove outliers (any trial with measure greater than 2 standard deviations from the mean).
 Check null hypothesis for your control variables to ensure no effects exist.
 Check for learning effects. Make sure the counterbalancing worked! Do an ANOVA of trial time vs. Group, should show no effect.
 Do ANOVA of trial time vs. resolution to test significance of findings. Hopefully there is an effect here!
 Do post-hoc pairwise tests
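A minimal sketch of the outlier rule in the list above (drop any trial more than 2 standard deviations from the mean). The trial times are hypothetical, and `remove_outliers` is an illustrative helper name.

```python
import statistics


def remove_outliers(values, n_sd=2.0):
    """Keep only values within n_sd sample standard deviations of the mean."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return [v for v in values if abs(v - center) <= n_sd * spread]


# Hypothetical trial times in seconds; the last trial is a clear outlier
# (for example, the participant was interrupted).
trial_times = [7.1, 7.4, 7.6, 7.9, 8.0, 8.2, 7.5, 7.8, 7.7, 25.0]
kept = remove_outliers(trial_times)
print(f"kept {len(kept)} of {len(trial_times)} trials")
```

Note that the outlier itself inflates both the mean and the standard deviation, so this rule is conservative; whichever rule is used should be decided before looking at the results.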
Results (trial time in seconds)
                  800 x 600   1024 x 768   1280 x 1024
Group A           7.5         9.4          11.1
Group B           8.3         8.8          12.8
…
Mean trial time   7.6         9.5          11.0

Hypothesis: Appears to be supported…
NOTE: These are MADE UP numbers. I have not run this study!
ANOVAs
ANOVA for mean trial time, group:

Source   Degrees of Freedom   F-Ratio   Probability
Group    5                    0.457     0.635

Low F-ratio (less than 1) means not much effect of group on trial time.
High probability indicates that whatever relationship is found, it had a 63% chance of being found in random data…

ANOVA for mean trial time, resolution:

Source       Degrees of Freedom   F-Ratio   Probability
Resolution   2                    4.337     <= 0.0001

Higher F-ratio here indicates that resolution does have an effect on trial time (doesn’t tell you what the effect is, exactly).
Low probability indicates the effect is very unlikely to be found in random data.

Conclusion: counterbalancing worked, and resolution has a statistically significant effect on selecting a target in this experiment.
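For intuition about where an F-ratio comes from, here is a sketch of a one-way ANOVA computed by hand: the variance between condition means divided by the variance within conditions. It assumes independent groups, which is a simplification; a within-participants study like the one above would normally use a repeated-measures ANOVA. The data and the helper name `one_way_anova_f` are made up for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F-ratio, df_between, df_within) for a list of independent groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical trial times (seconds) for three conditions.
groups = [[7, 8, 9], [9, 10, 11], [11, 12, 13]]
f_ratio, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_ratio:.1f}")
```

An F-ratio near or below 1 means the group means differ no more than the scores within a group do; a large F-ratio means the conditions differ by more than that noise.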
Post-Hoc Pairwise Test
Once you’ve found an effect using an ANOVA, you need to figure out whether each condition is significantly different from the others. You do this with post-hoc tests, such as Tukey HSD:

Conditions                Difference (seconds)   Probability
800x600 vs. 1024x768      1.9                    0.037         Significant
1280x1024 vs. 800x600     3.4                    0.002         Significant
1280x1024 vs. 1024x768    1.5                    0.054         Not Significant
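The pairwise table can be reproduced from the (made-up) numbers on the slides. This sketch assumes the mean trial times are 7.6 s, 9.5 s, and 11.0 s for the three resolutions, a reading of the results table that matches the listed differences. The probabilities are copied from the table rather than computed, since a real Tukey HSD needs the per-trial data and a statistics package.

```python
from itertools import combinations

# Assumed mean trial times in seconds (made-up numbers from the slides).
mean_time_s = {"800x600": 7.6, "1024x768": 9.5, "1280x1024": 11.0}

# Probabilities copied from the post-hoc table, not computed here.
p_values = {
    ("800x600", "1024x768"): 0.037,
    ("800x600", "1280x1024"): 0.002,
    ("1024x768", "1280x1024"): 0.054,
}

ALPHA = 0.05
results = {}
for low, high in combinations(mean_time_s, 2):
    difference = round(mean_time_s[high] - mean_time_s[low], 1)
    significant = p_values[(low, high)] < ALPHA
    results[(low, high)] = (difference, significant)
    label = "Significant" if significant else "Not Significant"
    print(f"{high} vs. {low}: {difference} s, p = {p_values[(low, high)]} ({label})")
```

The 1280x1024 vs. 1024x768 comparison shows why the threshold must be fixed in advance: p = 0.054 is close to 0.05 but still counts as not significant.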
What could be concluded?
(if this data was real…)
In a laboratory setting, target search and selection
tasks are faster on 800x600 screens than 1024x768
or 1280x1024, all other things being equal. There is
no statistically significant difference between the two
higher resolutions.
What could not be concluded:
Lower resolution screens are better.
Why not?
For other tasks, in other conditions (like different lighting, off-line distractions, etc.), the other resolutions might perform better.
Another Example: Heather’s
simple experiment
 Designing interface for categorizing keywords
in a transcript
 Wanted baseline for comparison
 Experiment comparing:



Pen and paper, not real time
Pen and paper, real time
Simulated interface, real time
Experiment
 Hypothesis: fewer keywords in real time, fewer with
simulated
 Independent variables:

Time, accuracy of transcript, platform
 Dependent variables:
 Number of keywords of each category
 Controlling variables:
 Gender, experience, etc.
 Between subjects design
 1 hour, mentally intensive task
Results
                          Non-Real Time Rate   Real Time Rate   Simulated Rate
Domain-specific tags      7.5                  9.4              5.1
Domain-independent tags   12                   9.8              5.8
Conversation tags         1.8                  3                2.5

For Domain-specific tags, Simulated less than Real Time, p < 0.01
For Domain-independent tags, Simulated less than Real Time, p < 0.01
Hypotheses
fewer in Real Time: not supported
fewer with Simulated: supported for two categories
Example: add video to IM voice chat?
 Compare voice chat with and without video
 Plan an experiment:
 Compare message time or difficulty in communicating
or frequency…
 Consider:
 Tasks
 What data you want to gather
 How you would gather
 What analysis you would do after
Questions:
 What are independent variables?
 What are dependent variables?
 What could be hypothesis?
 Between or within subjects?
 What was controlled?
 What are the tasks?