Towards Automated Web
Design Advisors
Melody Y. Ivory
Marti A. Hearst
School of Information Management & Systems
UC Berkeley
IBM Make IT Easy Conference
June 4, 2002
The Problem:
Poor Website Design by Non-Professionals
2
A Solution
 Automatic recommendations and
context-specific guidelines.
 “Grammar checkers” for web design
– Create good templates to incorporate into
web design tools
– Compare current design to high-quality
designs and show differences
4
The WebTango Goal
High Quality Designs
User’s Design
Profiles
Quality
Checker
•Predictions
•Similarities
•Differences
•Suggestions
•Modification
5
The Approach
Develop Statistical Profiles
Idea: Reverse engineer design
patterns from high-quality sites
and use to assess the quality of
other sites
1. Create a large set of measures to assess various design attributes
2. Obtain a large set of evaluated sites
3. Create models of good vs. avg. vs. poor sites
– Take into account the context and type of site
4. Use models to evaluate other sites
5. Use models to suggest improvements
6
Step 1: Measuring Web Design
Aspects
 Identified key aspects from the
literature
– Extensive survey of Web design literature: texts
from recognized experts; user studies
• amount of text on a page, text alignment, fonts, colors,
consistency of page layout in the site, use of frames, …
– Example guidelines
• Use 2–4 words in text links [Nielsen00].
• Use links with 7–12 useful words [Sawyer & Schroeder00].
• Consistent layout of graphical interfaces results in a 10–25%
speedup in performance [Mahajan & Shneiderman96].
– There are no theories about what to measure
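The two link-text guidelines above give different ranges, which is one reason to measure rather than assume. A minimal sketch, using only Python's standard `html.parser`, of how link-text length could be measured and checked against each range. The class `LinkTextParser` and the helpers `link_word_counts` and `within` are hypothetical names introduced for illustration; this is not the WebTango tool's code.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the visible text of each <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished link texts
        self._current = None  # text fragments of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") is not None:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Normalize whitespace before storing the link text.
            self.links.append(" ".join("".join(self._current).split()))
            self._current = None


def link_word_counts(html):
    """Return (link text, word count) pairs for every text link."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return [(text, len(text.split())) for text in parser.links if text]


def within(count, low, high):
    """True if count falls inside the inclusive range [low, high]."""
    return low <= count <= high


# Made-up sample markup, for illustration only.
sample = ('<p><a href="/a">Home</a> '
          '<a href="/b">Find a health training program near you</a></p>')
for text, n in link_word_counts(sample):
    print(text, n, "2-4 words:", within(n, 2, 4), "7-12 words:", within(n, 7, 12))
```

A link can satisfy one guideline and violate the other, so a tool built this way would still need empirical data to decide which range to recommend.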
7
157 Web Design Measures
(Metrics Computation Tool)
 Text Elements (31)
– # words, type of words
 Link Elements (6)
– # graphic links, type of links
 Graphic Elements (6)
– # images, type of images
 Text Formatting (24)
– # font styles, colors, alignment, clustering
 Link Formatting (3)
– # colors used for links, standard colors
 Graphics Formatting (7)
– max width of images, page area
 Page Formatting (27)
– quality of color combos, scrolling
 Page Performance (37)
– download time, accessibility
 Site Architecture (16)
– consistency, breadth, depth
[Figure: the nine measure categories (TE, LE, GE, TF, LF, GF, PF, PP, SA) grouped under information, navigation, & graphic design and experience design]
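To make a few of these measures concrete, here is a minimal sketch that counts words, links, graphic links, and images in a static HTML page using Python's standard `html.parser`. The class `PageMeasures` and the function `measure` are hypothetical, and the counting rules (for example, treating every image inside a link as one graphic link) are simplifying assumptions rather than the definitions used by the Metrics Computation Tool.

```python
from html.parser import HTMLParser


class PageMeasures(HTMLParser):
    """Counts a handful of page-level measures from static HTML."""

    def __init__(self):
        super().__init__()
        self.word_count = 0
        self.link_count = 0
        self.graphic_link_count = 0
        self.image_count = 0
        self._in_link = False
        self._skip = 0  # depth inside <script>/<style>, whose text is not page content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
            self._in_link = True
        elif tag == "img":
            self.image_count += 1
            if self._in_link:
                # Assumption: each image inside a link counts as a graphic link.
                self.graphic_link_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if not self._skip:
            self.word_count += len(data.split())


def measure(html):
    """Return a dict of the measures computed for one page."""
    parser = PageMeasures()
    parser.feed(html)
    parser.close()
    return {
        "word_count": parser.word_count,
        "link_count": parser.link_count,
        "graphic_link_count": parser.graphic_link_count,
        "image_count": parser.image_count,
    }


# Made-up page, for illustration only.
page = """<html><head><style>p {color: red}</style></head><body>
<h1>Health Training</h1>
<p>Programs on numerous health issues.</p>
<a href="/home"><img src="home.gif" alt="Home"></a>
<a href="/contact">Contact us</a>
<img src="logo.gif">
</body></html>"""
print(measure(page))
```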
8
Word Count: 157
9
Content Word Count: 81
10
Body Word Count: 94
11
Step 2: Obtaining a Sample of
Evaluated Sites
 Webby Awards 2000
– Only large corpus of rated Web sites
 3000 sites initially
– 27 topical categories
• Studied sites from informational categories
– Finance, education, community, living, health, services
 100 judges
– International Academy of Digital Arts & Sciences
• Internet professionals, familiarity with a category
– 3 rounds of judging (only first round used)
• Scores are averaged from 3 or more judges
• Converted scores into good (top 33%), average (middle
34%), and poor (bottom 33%)
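The conversion from averaged judge scores to three classes can be sketched as a rank-based split. The function `label_by_rank`, the site names, and the scores below are hypothetical; how ties and rounding were actually handled is not stated on the slide, so the choices here are assumptions.

```python
def label_by_rank(scores, top=0.33, bottom=0.33):
    """Map each site to 'good', 'average', or 'poor' by rank of its mean judge score."""
    means = {site: sum(v) / len(v) for site, v in scores.items()}
    ranked = sorted(means, key=means.get, reverse=True)  # best first
    n = len(ranked)
    n_good = round(n * top)      # assumption: round to the nearest whole site
    n_poor = round(n * bottom)
    labels = {}
    for i, site in enumerate(ranked):
        if i < n_good:
            labels[site] = "good"
        elif i >= n - n_poor:
            labels[site] = "poor"
        else:
            labels[site] = "average"
    return labels


# Made-up scores on the 1-10 scale, three judges per site.
scores = {
    "site_a": [9, 8, 9],
    "site_b": [7, 8, 6],
    "site_c": [5, 6, 5],
    "site_d": [6, 6, 7],
    "site_e": [3, 4, 2],
    "site_f": [4, 5, 4],
}
print(label_by_rank(scores))
```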
12
Webby Awards 2000
 6 criteria
– Content
– Structure & navigation
– Visual design
– Functionality
– Interactivity
– Overall experience
 Scale: 1–10 (highest)
 Nearly normally
distributed
13
Example Page from Good Site
14
Example Page from Avg. Site
15
Example Page from Poor Site
16
The Data Set
 Downloaded pages from sites
– Downloads informational pages at multiple
levels of the site
 Computed measures for the sample
– Processes static HTML, English pages
• Measures for 5346 pages
• Measures for 333 sites
– Categorize by
• Topic: education, health, finance, …
• Page Type: content, homepage, link page, …
17
Step 3: Creating Prediction Models
 Statistical analysis of quantitative measures
– Methods
• Classification & regression tree, linear discriminant classification, & K-means clustering analysis
– Context sensitive models
• Content category, page style, etc.
– Models identify a subset of measures relevant for each prediction
[Figure: a page of unknown quality (?) to be classified as Good, Average, or Poor]
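One way to see how a tree-based model singles out relevant measures is to look at a single split. The sketch below finds the one measure and threshold that best separate two classes by weighted Gini impurity, which is the building block of a classification tree. It is a simplified illustration in plain Python, not the authors' C&RT models; the functions `gini` and `best_split`, the two measures, and the data are all hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (impurity, measure index, threshold) for the best single split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Made-up pages described by [word count, link count].
rows = [[120, 30], [150, 12], [400, 28], [380, 10], [90, 25], [420, 15]]
labels = ["poor", "poor", "good", "good", "poor", "good"]
impurity, measure_index, threshold = best_split(rows, labels)
print(impurity, measure_index, threshold)
```

In this toy data only the first measure separates the classes cleanly, so the split selects it and ignores the other. A full tree repeats the search within each resulting group.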
18
Page-Level Models (5346 Pages)
                                                          Accuracy
Model                                            Method   Good   Avg.   Poor
Overall page quality (~1782 pgs/class)           C&RT     96%    94%    93%
Content category quality (~297 pgs/class & cat)  LDC      92%    91%    94%
ANOVAs showed that all differences in measures were
significant (good vs. avg, good vs. poor, etc.)
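Accuracy is reported separately for good, average, and poor pages. A minimal sketch of how such per-class accuracy could be computed from actual and predicted labels; the function `per_class_accuracy` and the labels are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def per_class_accuracy(actual, predicted):
    """Fraction of pages in each actual class that were predicted correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {c: correct[c] / totals[c] for c in totals}


# Made-up labels: four pages per class.
actual = ["good"] * 4 + ["average"] * 4 + ["poor"] * 4
predicted = [
    "good", "good", "good", "average",
    "average", "average", "poor", "average",
    "poor", "poor", "poor", "poor",
]
print(per_class_accuracy(actual, predicted))
```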
19
Page-Level Models (5346 Pages)
Page Type Classifier (decision tree)
– Home page, content, form, link, other
– 1770 manually-classified pages, 84% accurate
                                                          Accuracy
Model                                            Method   Good   Avg.   Poor
Overall page quality                             C&RT     96%    94%    93%
Content category quality                         LDC      92%    91%    94%
Page type quality (~356 pgs/class & type)        LDC      84%    78%    84%
ANOVAs showed that all differences in measures were
significant (good vs. avg, good vs. poor, etc.)
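For readers unfamiliar with ANOVA, the sketch below computes the one-way F statistic for a single measure across three groups of pages. The function `one_way_anova_f` and the values are hypothetical. Turning F into a significance level requires the F distribution, which is not shown here.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up values of one measure for good, average, and poor pages.
good = [5, 6, 7]
average = [2, 3, 4]
poor = [1, 2, 3]
print(one_way_anova_f([good, average, poor]))
```

A large F means the group means differ by more than the spread within groups would suggest.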
20
Clustering Good Pages
 K-means clustering to identify 3 subgroups
 ANOVAs revealed key differences
– # words on page, HTML bytes, table count
 Characterize clusters as:
– Small-page cluster (1008 pages)
– Large-page cluster (364 pages)
– Formatted-page cluster (450 pages)
 Use for detailed analysis of pages
[Figure: the three clusters, labeled Small page, Large page, and Formatted page]
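A minimal sketch of K-means (Lloyd's algorithm) in plain Python, grouping pages by two measures. The function `kmeans`, the starting centroids, and the page data are hypothetical. The measures are left unscaled to keep the example short; in practice they would usually be standardized so that word count does not dominate the distance.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(centroids[c])  # keep an empty cluster where it was
        if new == centroids:
            break  # converged: centroids did not move
        centroids = new
    return centroids, assign


# Made-up pages described by (word count, table count).
pages = [(100, 1), (120, 2), (110, 1), (900, 3), (950, 2), (400, 12), (420, 14)]
initial = [pages[0], pages[3], pages[5]]  # assumption: hand-picked starting centroids
final_centroids, assignments = kmeans(pages, initial)
print(assignments)
print(final_centroids)
```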
21
Step 4: Evaluate Other Sites
 Make predictions for an existing design
– good, average, poor
– How do the scores on the metrics vary from good
pages?
22
Example
Site drawn from Yahoo Education/Health
– Discusses training programs on numerous health
issues
– Chose one that looked good at first glance, but on
further inspection seemed to have problems.
– Only 9 pages were available, at level 0 and 1
– Not present in the original study
23
Sample Content Page
(Before)
24
25
Page-Level Assessment
 Decision tree predicts: all 9 pages consistent
with poor pages
– Content page does not have accent color; has
colored, bolded body text words
• Avoid mixing text attributes (e.g., color, bolding, and size)
[Flanders & Willis98]
• Avoid italicizing and underlining text [Schriver97]
26
Page-Level Assessment
 Cluster mapping
– All pages mapped into the small-page cluster
– Deviated on key measures, including
• text link, link cluster, interactive object, content link word, ad
• Most deviations can be attributed to using graphic links without
corresponding text links
– Use corresponding text links [Flanders & Willis98,Sano96]
[Figure: top deviant measures for content page: Link Count, Text Link Count, Good Link Word Count, Font Count, Sans Serif Word Count, Display Word Count]
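One simple way to rank how far a page deviates from a cluster of good pages is to express each measure as a number of standard deviations from the cluster mean. This is an assumption made for illustration; the slides do not say how deviation was scored. The function `top_deviations` and all values below are hypothetical.

```python
def top_deviations(page, cluster_mean, cluster_std, n=3):
    """Rank measures by how many standard deviations the page is from the cluster mean."""
    z = {
        name: (page[name] - cluster_mean[name]) / cluster_std[name]
        for name in page
        if cluster_std.get(name)  # skip measures with zero or missing spread
    }
    return sorted(z.items(), key=lambda item: abs(item[1]), reverse=True)[:n]


# Made-up cluster statistics and page measures.
cluster_mean = {"link_count": 40, "text_link_count": 30, "font_count": 5}
cluster_std = {"link_count": 10, "text_link_count": 8, "font_count": 2}
page_measures = {"link_count": 15, "text_link_count": 2, "font_count": 4}
print(top_deviations(page_measures, cluster_mean, cluster_std, n=2))
```

In this toy example the text link count deviates most, which mirrors the kind of finding reported for the example site (graphic links without corresponding text links).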
27
Page-Level Assessment
 Compared to models for health and
education categories
– All pages found to be poor for both models
 Compared to models for the 5 page
styles
– All 9 pages were considered poor pages by
page style (after correcting predicted
types)
28
Improving the Site
 Eventually want to automate the translation
from differences to recommendations
 Revised the pages by hand as follows:
– To improve color count and link count:
• Added a link text cluster that mirrors the content of the
graphic links
– To improve text element and text formatting
variation
• Added headings to break up paragraphs
• Added font variations for body text and headings and
made the copyright text smaller
– Several other changes based on small-page cluster
characteristics
29
Sample Content Page
(After)
30
31
After the Changes
 All pages now classified correctly by
style
 All pages rated good overall
 All pages rated good health pages
 Most pages rated as average education
pages
 Most pages rated as average by style
32
Profile Evaluation
 Small user study
– Page-level comparisons (15 page pairs)
• Participants preferred modified pages (57.4% vs. 42.6%
of the time, p = .038)
– Site-level ratings (original and modified versions of
2 sites)
• Participants rated modified sites higher than original sites
(3.5 vs. 3.0, p = .025)
• Non Web designers had difficulty gauging Web design
quality
– Freeform Comments
• Subtle changes result in major improvements
33
Summary
[Figure labels: Measures, Data, Models, Evaluate, Validate]
 Goal:
– Provide automated, context-sensitive
suggestions for improving web design.
 Approach:
– Compute statistics over large collection of
rated web sites
– From these build models of good sites
– Use these to suggest changes.
34
Advantages and Limitations
 Advantages
– Derived from empirical data
– Context-sensitive
– More insight for improving designs
– Evolve over time
– Applicable to other types of UIs
 Limitations
– Based on expert ratings
– Correlation, not causality
– Not a substitute for usability studies
35
Next Steps
 Update the profiles (Webby 02 data)
 Develop tool to facilitate interpretation of predictions
 Examine the profiles in more detail
– Factor analysis to highlight design patterns
– See which guidelines are valid empirically (studies)
• Moving from predictions to recommendations
 Incorporate assessments of content quality (text
analysis & studies)
 Improve site-level measures and models
– Incorporate page-level predictions
 New page-level measures (spatial properties)
 Develop interactive Web design tool
36
Thank You
 For more information
– http://webtango.berkeley.edu
Research supported by the following grants:
Hellman Faculty Fund,
Microsoft Research Grant,
Gates Millennium Fund,
GAANN Fellowship,
Lucent Cooperative Research Fellowship Program
Thanks to:
Webby Awards (Maya Draisin & Tiffany Shlain)
Rashmi Sinha
37
Do Webby Ratings Reflect
Usability?
 Do the profiles assess usability or something else?
 User study (30 participants)
– Usability ratings (WAMMI scale) for 57 sites
• Two conditions – actual and perceived usability
– Contrast to judges’ ratings
 Results
– Some correlation between users’ and judges’ ratings
– Not a strong finding
– Virtually no difference between actual and perceived
usability ratings
• Participants thought it would be easier to find info in the perceived
usability condition
38
