eu_20060904_topes

A Lightweight Model for End Users’ Domain-Specific Data
Christopher Scaffidi
Carnegie Mellon University
VL/HCC Graduate Consortium 2006
Consider automating repetitive actions in a
web browser.
• Our recent contextual inquiry
revealed that administrative
assistants fill out many expense
reports.
• Given a location and date, they
used a government site to find
the per diem rate.
Web macros cannot automate this task.
Existing macro tools cannot convert from two-letter state
abbreviation to full state name.
Motivation
Nor can existing macro tools convert dates from MM/DD/YYYY to Month DD.
The world is full of “user-level” data.
• Examples:
– Dates
– Credit card numbers
– Person names
– Quantities of RAM
– Product codes
– State names
– US phone numbers
– Bus route numbers
– Dewey decimal numbers
– Etc…
• Such data are
– “bigger” than floats and strings.
– “smaller” than a database row.
– typically domain-specific.
Tools do not “understand” user-level
abstractions such as states and dates.
• Limited support for data manipulation
– Reformatting data in web macros or spreadsheets
– Transporting (transforming) data between applications
• Limited support for data validation
– Are any values mistyped?
– Does the dataset contain duplicates?
• Information Week respondents complained more about
data manipulation & interoperability problems than about
software reliability problems!
Problem
To be useful, representations of these
abstractions must meet 3 requirements.
Extensibility
Different people use different data.
→ Support creation of new abstractions by end users.
Shareability
Different people sometimes use the same kinds of data.
→ Help end users find & evaluate abstractions.
Flexibility
Data appear in many formats, with exceptions to every rule.
→ Support multiple formats, and permit exceptions.
Existing approaches do not meet the
requirements.
• Regexps / grammars / data detectors represent syntax,
not semantics (e.g.: how to represent “FL” = “Florida”?)
• Research on units (e.g.: Slate) typically applies only to
numeric data in certain applications (e.g.: spreadsheets).
• Knowledge systems (e.g.: ConceptNet) do not contain
representations of data formats.
• OO and formal types are too difficult for many end users
and typically disallow exceptions to type rules.
• And none of these has built-in support for helping users
decide which abstraction to trust, so sharing is impeded.
Existing approaches
A “tope” defines the basic semantics of a
single user-level abstraction.
A “tope” is a pair of functions defined by a user:
• isa: string → [0, 1]
returns an estimate of the probability that the string is an
instance of this tope
• eq: string × string → [0, 1]
returns an estimate of the probability that the strings are
equivalent, conditional on being instances of this tope
Topes will be defined in files and compiled, just like types.
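The isa/eq pair can be pictured with a minimal Python sketch. Everything named here (the `Tope` class, the `STATE_NAMES` table, the `us_state` tope) is a hypothetical illustration, not the proposed file format, and the scores are simplified to 0.0 or 1.0 where a real tope could return intermediate probabilities.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical lookup table; only a few states are listed for brevity.
STATE_NAMES = {"FL": "Florida", "PA": "Pennsylvania", "NY": "New York"}


@dataclass(frozen=True)
class Tope:
    """A tope as a pair of user-defined functions (illustrative shape only)."""
    name: str
    isa: Callable[[str], float]       # string -> [0, 1]
    eq: Callable[[str, str], float]   # string x string -> [0, 1]


def _canonical_state(s: str) -> Optional[str]:
    """Map an abbreviation or a full name to its two-letter abbreviation."""
    t = s.strip()
    if t.upper() in STATE_NAMES:
        return t.upper()
    for abbr, full in STATE_NAMES.items():
        if t.lower() == full.lower():
            return abbr
    return None


def state_isa(s: str) -> float:
    # Simplification: 1.0 for a known state, 0.0 otherwise.
    return 1.0 if _canonical_state(s) is not None else 0.0


def state_eq(a: str, b: str) -> float:
    # Two strings are equivalent when they name the same state.
    ca, cb = _canonical_state(a), _canonical_state(b)
    return 1.0 if ca is not None and ca == cb else 0.0


us_state = Tope("us_state", state_isa, state_eq)
```

With this sketch, `us_state.eq("FL", "Florida")` captures the “FL” = “Florida” semantics that a regular expression alone cannot express.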
Proposed model
Reformatting functions would transform
instances from one tope to another.
Two topes are “isotopes” if instances of one can be
reformatted into the other.
• fmt: string → string
treats the input as an instance of one tope and returns
an equivalent string that is in another format
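The two conversions from the per diem example could be written as fmt functions along these lines. This is a sketch under assumptions: the function names and the truncated `STATE_NAMES` table are hypothetical, and “Month DD” is read as a full month name followed by a two-digit day.

```python
# Hard-coded tables; STATE_NAMES is deliberately incomplete.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)
STATE_NAMES = {"FL": "Florida", "PA": "Pennsylvania", "NY": "New York"}


def fmt_date_mdy_to_month_day(s: str) -> str:
    """Treat s as MM/DD/YYYY and return the same date as 'Month DD'."""
    month, day, _year = s.strip().split("/")
    return f"{MONTHS[int(month) - 1]} {int(day):02d}"


def fmt_state_abbrev_to_name(s: str) -> str:
    """Treat s as a two-letter abbreviation and return the full state name."""
    return STATE_NAMES[s.strip().upper()]
```

Both functions assume the input really is an instance of the source tope; a tool would presumably check isa first.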
A meta-model is required to facilitate
sharing of topes.
• Topes could be implemented in arbitrary languages.
• The binaries would be stored in “repositories”.
• Each user might subscribe to multiple repositories:
– personal repository of custom topes
– university repository of organization-specific topes
– general repository of generic topes
• How do end users decide which topes to use?
→ Topes would be annotated with platform (e.g.: JDK1.5),
author names, and other meta information to facilitate
finding and choosing topes.
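One way to picture the meta-model is a record per tope plus a search over the repositories a user subscribes to. The sketch below is purely illustrative: `TopeRecord`, `Repository`, `find_topes`, and the sample entries are hypothetical names and data, and only the platform and author annotations come from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TopeRecord:
    """Meta information attached to one stored tope binary (assumed fields)."""
    name: str
    platform: str    # e.g. "JDK1.5"
    authors: tuple   # author names


@dataclass
class Repository:
    label: str
    records: list = field(default_factory=list)

    def find(self, name: str, platform: str) -> list:
        return [r for r in self.records
                if r.name == name and r.platform == platform]


def find_topes(subscriptions, name, platform):
    """Search subscribed repositories in order; return (label, record) pairs."""
    return [(repo.label, rec)
            for repo in subscriptions
            for rec in repo.find(name, platform)]


# Hypothetical sample data for the three kinds of repository on the slide.
personal = Repository("personal", [TopeRecord("project_code", "JDK1.5", ("alice",))])
university = Repository("university", [
    TopeRecord("building_name", "JDK1.5", ("bob",)),
    TopeRecord("us_state", "JDK1.5", ("bob",)),
])
general = Repository("general", [TopeRecord("us_state", "JDK1.5", ("carol",))])

matches = find_topes([personal, university, general], "us_state", "JDK1.5")
```

Here two repositories offer a `us_state` tope, which is exactly the situation where author names and other annotations would help the user choose.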
Proposed meta-model
Macro tools would download topes on
demand from repositories.
Back to the macro example…
1. The macro tool retrieves topes.
2. The tool tries to infer a tope for each value in the macro.
   • The user could override this assignment, of course.
3. The tool can now automatically reformat data if needed.
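Step 2 could be sketched as picking the tope whose isa score is highest, with a user override taking precedence. The tope names, the 0.5 threshold, and the `overrides` parameter are assumptions made for illustration; the slides do not specify an inference rule.

```python
import re


def isa_state_abbrev(s: str) -> float:
    # Hypothetical, incomplete set of abbreviations.
    return 1.0 if s.strip().upper() in {"FL", "PA", "NY"} else 0.0


def isa_date_mdy(s: str) -> float:
    return 1.0 if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", s.strip()) else 0.0


TOPES = {"state_abbrev": isa_state_abbrev, "date_mdy": isa_date_mdy}


def infer_tope(value, topes=TOPES, overrides=None, threshold=0.5):
    """Return the name of the best-scoring tope, or None if none fits."""
    if overrides and value in overrides:
        return overrides[value]          # the user's choice wins
    best_name, best_score = None, 0.0
    for name, isa in topes.items():
        score = isa(value)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Once each recorded value has a tope, the tool knows which reformatting function applies when the target field expects a different format.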
Tope implementation
Most isa functions could be implemented
with an augmented context-free grammar.
• We logged data from information workers’ web browsers.
• It appears that most data can be recognized using
probabilistic context-free grammars with constraints on
the grammar terms.
– E.g.: time → HH : MM ap
        HH, MM → ##
        {MM >= 0 && MM <= 59 && … }
• I will need to…
– Verify that an augmented grammar is expressive enough
– Identify what constraint primitives are necessary
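The time example can be approximated in Python by separating the shape of the string from the constraints on its terms. This is only a sketch: a regular expression stands in for the grammar, `ap` is assumed to mean am/pm, the range check on HH is an assumption (the slide elides the remaining constraints), and the score is a plain 0.0 or 1.0 rather than a probability.

```python
import re

# Shape: time -> HH : MM ap, with HH and MM two digits each.
TIME_SHAPE = re.compile(
    r"(?P<HH>\d{2}) ?: ?(?P<MM>\d{2}) ?(?P<ap>am|pm)", re.IGNORECASE
)

# Constraints on the grammar terms.
CONSTRAINTS = (
    lambda t: 0 <= t["MM"] <= 59,   # given on the slide
    lambda t: 1 <= t["HH"] <= 12,   # assumed; not stated on the slide
)


def time_isa(s: str) -> float:
    """Return 1.0 if s has the time shape and satisfies every constraint."""
    m = TIME_SHAPE.fullmatch(s.strip())
    if m is None:
        return 0.0
    terms = {"HH": int(m["HH"]), "MM": int(m["MM"])}
    return 1.0 if all(check(terms) for check in CONSTRAINTS) else 0.0
```

A string such as "09:75 am" matches the shape but fails the MM constraint, which is the case a syntax-only recognizer would miss.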
Most equivalence and reformatting
functions are built from very few primitives.
• Equivalence functions combine
– Lookup in hard-coded tables
– Arithmetic
– Numeric comparisons
– “Identicalness” comparison
– Case-insensitive comparison
• Reformatting functions combine
– Lookup in hard-coded tables
– Arithmetic
– Permutation
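Two small examples suggest how these primitives might combine. The first builds an equivalence function for quantities of RAM from a table lookup, arithmetic, and a numeric comparison; the second reformats a person name by permutation. The function names, the `UNIT_TO_MB` table, and the choice of 1 GB = 1024 MB are assumptions for illustration.

```python
# Hard-coded table (assumes binary units: 1 GB = 1024 MB).
UNIT_TO_MB = {"MB": 1, "GB": 1024}


def ram_eq(a: str, b: str) -> float:
    """Equivalence via table lookup + arithmetic + numeric comparison."""
    def to_mb(s: str) -> float:
        number, unit = s.strip().split()
        return float(number) * UNIT_TO_MB[unit.upper()]
    return 1.0 if to_mb(a) == to_mb(b) else 0.0


def fmt_name_last_first_to_first_last(s: str) -> str:
    """Reformatting by permutation: 'Last, First' becomes 'First Last'."""
    last, first = [part.strip() for part in s.split(",")]
    return f"{first} {last}"
```

For instance, `ram_eq("1 GB", "1024 MB")` treats the two strings as equivalent even though they are neither identical nor equal up to case.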
A prototype system will help users
create, compile, and share topes.
• I will need to provide a prototype system with…
– A user-friendly editor for end users to define topes.
– A program that turns these definitions into binary modules.
– A repository server to store binaries and the meta-model.
• Remember: Sophisticated end users (or professionals)
can fall back on an arbitrary language to create topes.
• The simple grammar language is for the common case.
Equipping tools with topes will help users
create programs of higher quality.
• During end user programming, tools would download
useful topes from repositories.
– Web macros could perform reformatting automatically.
   • Improved composability of web macros
– Spreadsheets could be checked for malformed values.
   • Improved correctness of spreadsheets
– Web applications could validate & reformat data from web.
   • Improved security of applications
Applications of topes
Providing a tope system to users will
improve data interoperability.
• When receiving data, users could define a new tope
(or use an existing tope) and apply it to validate data.
– Users could reformat values to a uniform format.
– Users could find and remove duplicate values.
• Data could be validated automatically when entering
programs.
• Data could carry along tope definitions, particularly if the
representation is “secure” (e.g.: context-free grammar)
→ a form of self-describing data.
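Validation and duplicate removal follow directly from isa and eq. The sketch below assumes a simplified state tope with a truncated lookup table and a 0.5 threshold; `find_invalid` and `deduplicate` are hypothetical helper names, not part of the proposed system.

```python
# Simplified, incomplete state tope used only for this illustration.
STATE_NAMES = {"FL": "Florida", "PA": "Pennsylvania"}


def _canonical(s: str):
    t = s.strip()
    if t.upper() in STATE_NAMES:
        return t.upper()
    for abbr, full in STATE_NAMES.items():
        if t.lower() == full.lower():
            return abbr
    return None


def state_isa(s: str) -> float:
    return 1.0 if _canonical(s) is not None else 0.0


def state_eq(a: str, b: str) -> float:
    ca, cb = _canonical(a), _canonical(b)
    return 1.0 if ca is not None and ca == cb else 0.0


def find_invalid(values, isa, threshold=0.5):
    """Values that are probably mistyped, judged by the tope's isa."""
    return [v for v in values if isa(v) < threshold]


def deduplicate(values, eq, threshold=0.5):
    """Keep the first of each group of values that the tope's eq treats as equal."""
    kept = []
    for v in values:
        if not any(eq(v, k) >= threshold for k in kept):
            kept.append(v)
    return kept


received = ["FL", "Florida", "PA", "Flrida"]
invalid = find_invalid(received, state_isa)
valid = [v for v in received if v not in invalid]
unique = deduplicate(valid, state_eq)
```

In this run the mistyped "Flrida" is flagged, and "FL" and "Florida" collapse into one entry because eq compares meaning rather than spelling.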
Thank You…
• To VL/HCC for this opportunity to present.
• To Mary Shaw, Brad Myers, Martin Erwig, Sebastian
Elbaum, Margaret Burnett, Henry Lieberman, Allen
Cypher, Robin Abraham, Andy Ko, Jeff Stylos, Andhy
Koesnandar, Josh Gross, Michael Coblenz, Ericka
Orrick, and many others for helpful discussions.
• To NSF for funding