Transcript PPTweek7

IEEM 552 - Human-Computer Systems
Dr. Vincent Duffy - IEEM
Week 7 - Hazards in HCI
March 16, 1999
http://wwwieem.ust.hk/dfaculty/duffy/552
email: [email protected]
For today
 1. Further discussion/review of the summary of results you submitted
– based on the week 3 in-class exercise, ‘an example’
 2. Hazards to conducting and interpreting HCI experiments
 3. Brief discussion - Pictogram, Miller & Stanney
 4. Brief discussion about exam 1
A test of 2 interfaces - Which interface is better?
 Self-rating of expertise for library online searches
 Human subjects consent form
 Data sheet for data collector
 Data sheet for subject
 Introduction
– subjects will leave the room
Is it enough to ask which is better?
 What do I expect from the results/data?
 What are the hypotheses?
– What do I think is true about the system before I start?
– What questions am I trying to answer with the data/analyses?
– H.1. www search is faster.
– H.2. More errors using www search.
– H.3. More data can be found by www.
Self-rating and consent form
 Group # _____ Name of member _______
 Please rate your experience using the library search databases at UST or other universities.

  Least experience                         Most experience
       1         2         3         4         5
Data sheet for subject
 Group no./name _________________
 The test administrator will show you which interface to use first. Please do all 3 of the following tasks. Do not stop until all 3 tasks are completed.
Interface 1
 1. Please write the 'call number' of the book titled 'User Interface Design' by Eberts.
    Call no. ________________
 2. Please write the call number for a video about the Wright Brothers. We do not know the title. However, it is less than 30 minutes long (duration).
    Call no. ______________
 3. Please locate/find as many Visual C++ books as you can in less than 5 minutes.
    Number of Visual C++ books found _____________
Data/instructions for data collector
Before the experiment - subjects out of the room
 Group #____ Data sheet for data collector
 1. Record the name of the data collector __________
 2. The data collector will need to record time (by watch, clock, computer, etc.)
 3. Be sure the subject signs the human subjects consent form.
 4. Give the instruction sheet; allow 1 minute for reading and one minute for questions.
 5. Show the subject how to start the library online systems. Count time beginning when the subject double clicks the correct icon (the two interfaces to be tested are the www and telnet/dos interfaces).
 Odd numbered groups (e.g. 1, 3, 5) should begin with www; even numbered groups (e.g. 2, 4, 6) should begin with telnet/dos.
During the experiment
 Collect 14 pieces of data - 7 pieces for each of the two interfaces
 1. Subject name/group no. _______________
 2. www interface (1) or telnet/dos interface (2) _______________
 3. Time to find item 1 (call number of 'User Interface Design' by Eberts) _______________
 4. Time to find item 2 (a video about the Wright Brothers - less than 30 minutes long) ______________
    (begin counting time immediately after finding item 1)
 5. Number of errors in finding item 2 (count as an error any back, prev. record, start over, etc.) ________
 6. Time for item 3 (how many books can you find on Visual C++ in less than 5 minutes?) _____________
 7. Quantity of Visual C++ books found ________________
After collecting the data
 Use sample experimental data (previously collected)
 Upload to www, download so it is accessible to you, run analyses
– using SAS - Statistical Analysis Software
 How to compare?
– Simple test of difference in means - we used the T-test (comparing only 2 groups)
– discuss hypotheses
– for hw: asked you to interpret the output
 For assumptions of analyses/output
– hint: see Chapter 6, Cody and Smith (p.138-149)
 How do I determine if Hypotheses 1-3 are supported?
– H.1. www search is faster.
– H.2. More errors using www search.
– H.3. More data can be found by www.
Sample SAS program
H.1. www search is faster.
 For our hypothesis we want to check the difference in means.
 First check if the variances are equal, to help decide which p-values to use (find p>F’ - the probability that we reject Ho incorrectly; if p<.05, reject Ho).
 Either way, look at p>|T|, the probability that we reject Ho incorrectly.
 If p<.05, reject Ho for the T-test, for which Ho says ‘the means are the same’.
– if you reject, then conclude the means are statistically different
 For time for task 1, the means are not statistically different.
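The slide's decision procedure - compare the variances first, then compute the matching t statistic - can be sketched in plain Python. This is an illustrative re-implementation of what SAS reports, not the course's actual SAS program, and the task-time data below are invented:

```python
import math
from statistics import mean, variance  # sample variance (n-1 denominator)

def f_ratio(a, b):
    """F statistic for the equality-of-variances check: larger s^2 over smaller."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)

def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t statistic, as in the standard T-test."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical task-1 times (minutes) for two groups of subjects.
www = [1.0, 2.0, 3.0]
telnet = [4.0, 5.0, 6.0]

print(f_ratio(www, telnet))   # 1.0 -> no evidence the variances differ
print(pooled_t(www, telnet))  # about -3.674
```

The t statistic would then be compared against the t distribution (which SAS does for you, reporting p>|T|); only the variance check and test statistic are shown here.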
How do these results influence conclusions about H1?
H.2. More errors using www search.
 Suppose your results looked like this…
 Check the difference in means.
 First check if the variances are equal, to help decide which p-values to use (find p>F’ - the probability that we reject Ho incorrectly; if p<.05, reject Ho).
 Either way, look at p>|T|, the probability that we reject Ho incorrectly.
 If p<.05, reject Ho for the T-test, for which Ho says ‘the means are the same’.
– if you reject, then conclude the means are statistically different
 For number of errors, the means are not statistically different.
However, this was your data...
 What do you conclude about H2?
H.3. More data can be found by www
 Check the difference in means.
 First check if the variances are equal: p=.004, so reject Ho (that the variances are equal).
– use this information to decide which p-value to observe for the T-test
 In this case, look at p>|T| for ‘unequal variances’, the probability that we reject Ho incorrectly, to help decide whether to reject Ho for the T-test, which says ‘the means are equal’.
 p=.356, so do not reject Ho for the T-test - ‘the means are the same’.
– if you reject, then conclude the means are statistically different
 For quantity of books found, the means are not statistically different.
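The ‘unequal variances’ row of the SAS output corresponds to Welch's t-test, which replaces the pooled variance with per-group terms and adjusts the degrees of freedom. A sketch of that computation, with made-up book counts chosen to have very different spreads (this is an illustration of the statistic, not the course's SAS code):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's unequal-variance t statistic and Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical 'books found' counts; telnet has sample variance 1, www has 100.
telnet = [1, 2, 3]
www = [10, 20, 30]

t, df = welch_t(telnet, www)
print(t, df)  # about -3.102 with about 2.04 degrees of freedom
```

Note how the adjusted df (about 2) is far below the pooled df of n1 + n2 - 2 = 4; the large-variance group dominates, which is why the unequal-variance p-value can differ from the pooled one.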
None of our 3 hypotheses were fully supported
 Does this mean we were incorrect from the start?
– WWW is no better than the dos/telnet based system?
 Does it mean www is
– not faster,
– not less error prone,
– not more likely to allow you to find more information?
 What might have gone wrong?
Hazards to conducting & interpreting HCI experiments
 To be avoided
– when conducting experiments
 To be noticed
– when reading experiments of other people
– to see if the methodology or interpretation of results invalidates some conclusions
– Sheil (1981) found a large % of studies had some methodology problem which made the results suspect
What is wrong with the following? (please submit a separate sheet with your answers)
 Q1. Hypothesis states/asks: ‘Is this new interface effective?’
 Q2a. The new interface is compared to the old interface. The subjects tested using the new design have also used the old design.
 Q2b. Subjects for the new improved design are treated more enthusiastically (or given a quieter room).
 Q3. A software manufacturer tests financial planning software on its employees (mostly programmers).
 Q4. Two experiments show the same mean difference between interface measures, but the difference is statistically significant in one experiment and not in the other.
 Q5. One person administers a test to 10 subjects for one interface test condition (treatment). A different person administers the test to 11 subjects for the other.
 Q6. Suppose a correlation (R=.55) shows a significant relationship (p<.05) between ‘percent correct’ and ‘frequency of use’ of help menus. Also suppose a correlation is found between a likert scale (1-7 scale) variable and ‘frequency of use’ of help menus. Which are you more likely to use for drawing conclusions?
 Q7. An experiment finds that there is no statistical difference between measured variables of the old and new designs. The experimenter concludes that the two are the same. Marketing of the new design is halted.
 Q8. A vendor is trying to sell your software company a computer programming tool that was found to reduce programming time by 50%. You are told you should expect a 50% reduction in software development time. The product was previously tested on novices.
Q.1. What is wrong with this?
 Q. Is this new interface effective?
– what is meant by effective?
– should this mean faster or fewer errors?
– should this mean people prefer this one?
– for whom? expert or novice?
– effective compared to what? 2 different designs? some standard?
– evaluations must be made for two or more treatment conditions
 A better question/hypothesis:
– can the new interface be used with less assistance?
Hazard 1 - Question phrased improperly
 What’s the big deal?
– the experimenter may discover, too late, certain measures for which data should have been collected
 How to avoid it
– planning, behind-the-scenes work
– conduct a pilot test on a small number of subjects
– understand the underlying theories related to the independent variables or the dependent (performance) measures
Q.2. What is wrong with this?
 Q.2.a. The new interface is compared to the old interface. The subjects tested using the new design have also used the old design.
– an important variable is not controlled
– subjects have prior experience (training)
 Q.2.b. Subjects for the new improved design are treated more enthusiastically (or given a quieter room).
– treatment of subjects varies with the level of the independent variable
Hazard 2 - Important variables not controlled
 What is the big deal?
– an uncontrolled (confounding) variable can simulate a treatment effect or counteract it (eliminate its detection)
 How to eliminate or minimize this?
– list all the variables that might influence the results
– control each variable: randomize it, hold it constant or eliminate its variability, or manipulate it
Consider ‘an example’: Hyp. 2 & ‘which came first’?
 For task 2, ‘find Wright brothers video’,
– time to complete and errors were significantly higher for the first interface (is it because of improvement - the 2nd time doing the task - or is www …)
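This order effect is exactly what the counterbalancing rule on the data-collector sheet (odd groups start with www, even groups start with telnet/dos) is meant to average out across groups. A minimal sketch of that assignment rule; the function name is my own:

```python
def interface_order(group_number):
    """Counterbalance presentation order by group parity, as on the data sheet:
    odd-numbered groups start with www, even-numbered with telnet/dos."""
    if group_number % 2 == 1:
        return ("www", "telnet/dos")
    return ("telnet/dos", "www")

print(interface_order(1))  # ('www', 'telnet/dos')
print(interface_order(2))  # ('telnet/dos', 'www')
```

With equal numbers of odd and even groups, any practice effect from doing the tasks a second time is spread evenly over both interfaces instead of inflating one of them.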
Q.3. What is wrong with this?
 A software manufacturer tests financial planning software on its employees (mostly programmers)
– an inappropriate sample was used
– tested mostly experts when the software was designed for computer novices
– a mixed group of novice and expert employees
Hazard 3 - Inappropriate sample used
 What’s the big deal?
– results can be misleading if they are generalized to the wrong group of users
 How to avoid?
– try to demonstrate that subjects have stabilized at some performance level
– or report honestly that subjects may not have been allowed sufficient time to become proficient at the task (so as to stabilize their level)
Q.4. What is wrong with this?
 Two experiments show the same mean difference between interface measures, but the difference is statistically significant in one experiment and not in the other
– not enough subjects are used
– What is meant by statistically significant?
– usually set at p<.05 (the probability you reject the null incorrectly; ex. null: no difference)
Hazard 4 - Not enough subjects used
 Why does this commonly happen?
– finding an appropriate sample that is large enough is sometimes difficult
– in practice, the number of subjects is often determined by the number available or based on reports of previous studies
 How to avoid?
– choose larger samples (more expensive)
– consider the trade-offs: sample size, variability, size of the potential effect (to be measured)
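The trade-off between sample size, variability, and effect size can be made concrete with the standard two-sample approximation n ≈ 2(z_α/2 + z_β)² / d² per group, where d is the mean difference in standard-deviation units. This is a textbook back-of-the-envelope formula, not something from the lecture:

```python
import math

def subjects_per_group(effect_size, z_alpha=1.96, z_beta=0.84):
    """Approximate n per group for a two-sample t-test.
    z_alpha = 1.96 (two-sided alpha = .05), z_beta = 0.84 (power = .80).
    effect_size d = (mean difference) / (common standard deviation)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

print(subjects_per_group(0.5))  # medium effect: 63 per group
print(subjects_per_group(1.0))  # large effect: 16 per group
```

Halving the effect size (or doubling the variability) quadruples the required sample - which is why small in-class experiments like ours can easily fail to detect a real difference.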
Q.5. What is wrong with this?
 One person administers a test to 10 subjects for one interface test condition (treatment). A different person administers the test to 11 subjects for the other.
– what if the two experimenters (the people administering the test) conduct the experiment differently?
– the test is administered improperly
– two different people should not administer the test in this manner
Hazard 5 - Test administered improperly - experimental studies
 What’s the big deal?
– different distractions or test conditions can influence the results, or increase the variability, making actual differences difficult to detect (or apparent differences may be due to sloppy test conditions)
 How to avoid?
– test the interface to eliminate bugs; stabilize the experimental room and general test conditions
 How is this different for field studies?
Q.6. What is wrong with this?
 A correlation (R=.55) shows a significant relationship (p<.05) between ‘percent correct’ and ‘frequency of use’ of help menus.
 A correlation is found between a likert scale (1-7 scale) variable and ‘frequency of use’ of help menus.
– the first example violates the assumptions of the method of analysis used (percent correct is not usually normally distributed - likert scales are)
– parametric statistics assume normality and homogeneity of variances
Suppose our T-test shows a significant difference in the means of time to complete task 2 between interfaces. Can we safely conclude that our hypothesis is supported?
 What kind of distribution is shown by this data? Normal? Uniform?
 What are the assumptions of the statistics we used (e.g. the T-test)? Normality.
[Figure: histogram of Timtsk2 (1st interface tested) - counts of subjects (0-7) falling in the bins <100 s, 100-200 s, and >200 s]
 If the data is not normally distributed, you cannot use the statistics that require normality as a basic assumption (correlation, t-test, anova, etc.).
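One quick numeric screen for the histogram question above is sample skewness: near zero for symmetric, roughly normal data, and large for data with a long right tail, as task times often have. This sketch is an illustration (SAS provides formal normality tests that the course would actually use):

```python
from statistics import mean

def skewness(data):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for perfectly symmetric data)."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))  # 0.0 - symmetric
print(skewness([1, 1, 1, 10]))    # about 1.15 - strongly right-skewed
```

A strongly skewed measure like ‘percent correct’ or raw task time is a signal to transform the data or switch to a non-parametric test before trusting a t-test result.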
Hazard 6 - Improper analysis used
 What’s the big deal?
– it can invalidate the results of the experiment
 How to avoid?
– test the data - distributions of variables should be normal, and there should be equality of variances for multi-variate stats like regressions
– if necessary, use a different method of analysis (non-parametric - not as robust) or transform the data
– be sure that the data meets the assumptions before running the analysis; otherwise you waste your time
Q.7. What is wrong with this?
 An experiment finds that there is no statistical difference between measured variables of the old and new designs. The experimenter concludes that the two are the same. Marketing of the new design is halted.
– if you cannot reject the null hypothesis (no difference), that does not prove it
– it only shows that you could not prove a difference
– a difference may still exist. How?
Hazard 7 - Null effects interpreted incorrectly
 Examples of some things that may make a difference more difficult to detect:
– Hazard 2: a confound may have occurred
– Hazard 4: not enough subjects to detect a difference
– Hazard 5: treatments administered poorly, causing high variability in the conditions
– Hazard 6: was the wrong statistical test conducted?
– your measure may not be sensitive enough to detect a difference
Q.8. What is wrong with this?
 A vendor is trying to sell your software company a computer programming tool that was found to reduce programming time by 50%. You are told you should expect a 50% reduction in software development time. The product was previously tested on novices.
– software developers are likely not novices, so it is difficult to know what to expect
Hazard 8 - Results generalized beyond conditions tested
 What’s the big deal?
– can mislead readers
– we can be misled if we only read the abstract and conclusion
 How to avoid?
– be careful not to generalize the results beyond the sample & conditions tested
– your results may lend evidence, but further testing may be needed to confirm
For week 8 - Exam details
 Old exam - on web page
 Closed book format, in class
– 100 points
– 65% lecture notes, 3 videos, 2 cases & demo
– 35% integrating concepts with the research papers & the class trip
– weeks 1-7, lectures 1-5 & demos
– Background reading: Chapters 1, 3 Eberts; Cody & Smith, Ch. 6 (p.138-146); 3 journal papers: ‘Thinking Aloud’, ‘Task complexity’, and ‘Pictogram’; 2 cases