Empirical Evaluations of
Organizational Memory
Information Systems
Felix-Robinson Aschoff & Ludger van Elst
Empirical Evaluations of OMIS
1. Evaluation: Definition and general approaches
2. Contributions from related fields
3. Implications for FRODO
What is Empirical Evaluation?
Empirical evaluation refers to the appraisal of
a theory by observation in experiments.
Chin, 2001
Experiment or not?
Experiment
Advantages:
• influencing variables can be controlled
• causal statements can be inferred
Problems:
• artificial – transfer to the normal user context is difficult
• requires concrete hypotheses
• subjects who participate have to be found and paid

Less controlled exploratory study
Advantages:
• more realistic (higher external validity)
• can be easier and faster to design
• cooperation with people during their everyday work
Problems:
• influencing variables cannot be controlled
Artificial Intelligence vs Intelligence Amplification
AI (expert systems)
Development goal: mind-imitating, user-independent system working by itself
Evaluation: focus on „technical“ evaluation: does the system meet its requirements?

IA (OMIS, FRODO)
Development goal: hybrid solution: cooperation between system and human user, constant interaction
Evaluation: focus must be on the cooperation of system and user; human-in-the-loop studies
Empirical Evaluations of OMIS
1. Evaluation: Definition and general approaches
2. Contributions from related fields
3. Implications for FRODO
Contributions from related fields
1. Knowledge Engineering
1.1 General Approaches
- The Sisyphus Initiative
- High Performance Knowledge Bases
- Essential Theory Approach
- Critical Success Metrics
1.2 Knowledge Acquisition
1.3 Ontologies
2. Human Computer Interaction
3. Information Retrieval
4. Software Engineering (Goal-Question-Metric Technique)
The Sisyphus Initiative
A series of challenge problems for the development of KBS by different
research groups with a focus on PSM:
Sisyphus-I: Room allocation
Sisyphus-II: Elevator configuration
Sisyphus-III: Lunar igneous rock classification
Sisyphus-IV: Integration over the web
Sisyphus-V: High quality knowledge base initiative (hQkb) (Menzies, '99)
Problems of the Sisyphus Initiative
Sisyphus I + II:
• No „higher referees“
• No common metrics
• Focus on modelling of knowledge. Effort to build a model of the
domain knowledge was usually not recorded.
• Important aspects like the accumulation of knowledge and cost-effectiveness calculations received no attention.
Sisyphus III:
• Funding
• Willingness of researchers to participate
„...none of the Sisyphus experiments have yielded much evaluation information
(though at the time of this writing Sisyphus-III is not complete)“ (Shadbolt et al. '99)
High Performance Knowledge Bases
• run by the Defence Advanced Research Project Agency (DARPA)
in the USA
• goal: to increase the rate at which knowledge can be modified in a KBS
• three groups of researchers:
1) challenge problem developers
2) technology developers
3) integration teams
HPKB Challenge Problem
International Crisis Scenario in the Persian Gulf:
Hostilities between Saudi Arabia and Iran
Iran closes the Strait of Hormuz to international shipping
Integration of the following KBs:
1) the HPKB upper-level ontology (Cycorp)
2) the World Fact Book knowledge base (Central Intelligence Agency)
3) the Units and Measures Ontology (Stanford)
Example questions the system should be able to answer:
• With what weapons is Iran capable of firing upon tankers in the Strait of Hormuz?
• What risk would Iran face in closing the strait to shipping?
The answer key to the second question contains, for example:
Economic sanctions from {Saudi Arabia, GCC, U.S., UN}, because Iran
violates an international norm promoting freedom of the seas.
Source: The Convention on the Law of the Sea
HPKB Evaluation
The system's answers were rated on four official criteria
by challenge problem developers and subject matter experts
Scale: 0 – 3
1) the correctness of the answer
2) the quality of the explanation of the answer
3) the completeness and quality of the cited sources
4) the quality of the representation of the question
Two-phase, test-retest schedule
Essential Theory Approach
Menzies & van Harmelen, 1999
Different schools of knowledge engineering
Technical evaluation of ontologies
Gómez-Pérez, 1999
1) Consistency
2) Completeness
3) Conciseness
4) Expandability
5) Sensitiveness
Errors in developing taxonomies:
• Circularity errors (see the detection sketch below)
• Partition errors
• Redundancy errors
• Grammatical errors
• Semantic errors
• Incompleteness errors
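As one illustration of how such taxonomy checks could be automated, here is a minimal sketch in Python (the function name and the example taxonomy are hypothetical, not part of Gómez-Pérez's method or any FRODO tool) that detects circularity errors in a subclass hierarchy:

```python
from typing import Dict, List, Set

def find_circularity_errors(subclass_of: Dict[str, List[str]]) -> Set[str]:
    """Return all concepts that are, directly or indirectly, subclasses of themselves.

    `subclass_of` maps each concept to the list of its direct superclasses,
    e.g. {"Dog": ["Mammal"], "Mammal": ["Animal"]}.
    """
    errors: Set[str] = set()
    for start in subclass_of:
        stack = list(subclass_of.get(start, []))   # superclasses still to visit
        seen: Set[str] = set()
        while stack:
            concept = stack.pop()
            if concept == start:                   # reached the start again: cycle
                errors.add(start)
                break
            if concept not in seen:
                seen.add(concept)
                stack.extend(subclass_of.get(concept, []))
    return errors

# Example with a circularity error: Agent is (indirectly) a subclass of itself.
taxonomy = {"Employee": ["Person"], "Person": ["Agent"], "Agent": ["Employee"]}
print(find_circularity_errors(taxonomy))           # {'Employee', 'Person', 'Agent'}
```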
Related Fields
Knowledge Acquisition
Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation
of knowledge acquisition techniques and methods: history, problems and
new directions. International Journal of Human-Computer Studies, 51,
729-755.
Human Computer Interaction
„HCI is the study of how people design, implement, and use
interactive computer systems, and how computers affect
individuals and society.“ (Myers et al. 1996)
- facilitate interaction between users and computer systems
- make computers useful to a wider population
Information Retrieval
- Recall and Precision
- e.g. keyword-based IR vs. ontology-enhanced IR
  (Aitken & Reid, 2000)
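To make the two measures concrete, here is a small sketch (the document sets are invented for illustration) that computes precision and recall for a keyword-based and an ontology-enhanced retrieval run against the same set of relevant documents:

```python
from typing import Set, Tuple

def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """precision = |retrieved ∩ relevant| / |retrieved|,
    recall    = |retrieved ∩ relevant| / |relevant|."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}                 # judged relevant documents
keyword_run = {"d1", "d5", "d6"}                    # keyword-based IR result
ontology_run = {"d1", "d2", "d3", "d7"}             # ontology-enhanced IR result
print(precision_recall(keyword_run, relevant))      # (0.33..., 0.25)
print(precision_recall(ontology_run, relevant))     # (0.75, 0.75)
```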
Empirical Evaluations of OMIS
1. Evaluation: Definition and general approaches
2. Contributions from related fields
3. Implications for FRODO
Guideline for Evaluation
• Formulate the main purposes of your framework or application.
• Formulate precise hypotheses.
• Define clear performance metrics.
• Standardize the measurement of your performance metrics.
• Be thorough when designing your (experimental) research design.
• Consider the use of inference statistics (Cohen, 1995); a small sketch follows below.
• Meet common standards for reporting your results.
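As a concrete illustration of the inference-statistics point, here is a minimal sketch (all efficiency scores are invented) of an independent two-sample t-test comparing two groups of users:

```python
from scipy import stats

# Invented efficiency scores for two groups of subjects.
frodo_group = [0.82, 0.75, 0.91, 0.68, 0.77, 0.85]
classic_group = [0.61, 0.70, 0.58, 0.66, 0.72, 0.63]

# Independent two-sample t-test; H0: both groups have the same mean efficiency.
t, p = stats.ttest_ind(frodo_group, classic_group)
print(f"t = {t:.2f}, p = {p:.3f}")   # reject H0 at the usual 5% level if p < 0.05
```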
Evaluation of Frameworks
Frameworks are general in scope and designed to cover a wide
range of tasks and problems.
The systematic control of influencing variables
becomes very difficult
„Only a whole series of experiments across a number of
different tasks and a number of different domains could controll
for all the factors that would be essential to take into account.“
Shadbolt et al. 1999
Approaches:
• Sisyphus Initiative
• Essential Theory Approach (Menzies & van Harmelen, 1999)
Problems with the Evaluation of FRODO
• Difficulty of controlling influencing variables when evaluating
entire frameworks
• FRODO is not a running system (yet)
• Only a few prototypical implementations are based on
FRODO
• FRODO is probably underspecified for evaluation in many
areas
Goal-Question-Metric Technique
[Figure: GQM tree. Each Goal is refined into several Questions, and each
Question is answered by one or more Metrics; a metric can contribute to
more than one question.]
Basili, Caldiera & Rombach, 1994
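To illustrate the structure of a GQM plan, here is a minimal sketch of the goal-question-metric hierarchy as a data structure; the concrete goal, questions and metric are placeholders based on the workflow goal discussed later in these slides:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Metric:
    name: str

@dataclass
class Question:
    text: str
    metrics: List[Metric] = field(default_factory=list)

@dataclass
class Goal:
    purpose: str
    issue: str
    obj: str
    viewpoint: str
    questions: List[Question] = field(default_factory=list)

workflow_goal = Goal(
    purpose="compare", issue="efficiency",
    obj="task completion with workflows", viewpoint="end-user",
    questions=[
        Question("Efficiency of task completion with FRODO weakly-structured workflows?",
                 [Metric("quality of result / completion time")]),
        Question("Efficiency of task completion with a-priori strictly-structured workflows?",
                 [Metric("quality of result / completion time")]),
    ],
)
print(len(workflow_goal.questions), "questions,",
      sum(len(q.metrics) for q in workflow_goal.questions), "metrics")
```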
Informal FRODO Project Goals
• FRODO will provide a flexible, scalable framework for evolutionary growth
of distributed OMs
• FRODO will provide a comprehensive toolkit for the automatic or semi-automatic
construction and maintenance of domain ontologies
• FRODO will improve information delivery by the OM by developing more
integrated and more easily adaptable DAU techniques
• FRODO will develop a methodology and tool for business-process oriented
knowledge management relying on the notion of weakly-structured workflows
• FRODO is based on the assumption that a hybrid solution, where the system
supports humans in the decision-making process, is more appropriate for
OMIS than mind-imitating AI systems (IA > AI)
Task Type and Workflows
Knowledge-intensive tasks (KiTs): negotiation, co-decision making, unique,
low volume, communication intensive, projects
→ FRODO wf > classical wf
Classical workflow processes: repetitive, high volume, heads down
→ FRODO wf >= classical wf
FRODO GQM – Goal concerning workflows
Conceptual level (goals)
GQM-Goals should specify:
a Purpose
a quality Issue
a measurement Object
a Viewpoint
Object of Measurement can be:
Products
Processes
Resources
GQM of FRODO
Purpose: compare
Quality issue: efficiency
Object (process): task completion with workflows
Viewpoint: viewpoint of the end-user
Context: knowledge-intensive tasks
GQM Abstraction Sheet for FRODO
Quality factors: efficiency of task completion
Variation factors: task types as described in Abecker 2001
(dimension: negotiation, co-decision making, projects, workflow processes)
Baseline hypothesis: the experimental design will provide
a control group for comparison
Impact of variation factors: KiTs are more successfully supported
by weakly-structured flexible workflows based on FRODO than by
a-priori strictly structured workflows.
GQM Questions and Metrics
Question
What is the efficiency of task completion using FRODO weakly-structured
flexible workflows for KiTs?
What is the efficiency of task completion using a-priori strictly-structured
workflows for KiTs?
What is the efficiency of task completion using FRODO weakly-structured
flexible workflows for classical workflow processes?
Metric
Efficiency of task completion: quality of result [expert judgement]
divided by the time needed for completion of the task.
user-friendliness judged by users
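A minimal sketch of how this efficiency metric could be computed (the quality score and completion time below are invented):

```python
def task_efficiency(quality: float, minutes: float) -> float:
    """Efficiency = expert-judged quality of the result divided by completion time."""
    return quality / minutes

# e.g. quality rated 8 out of 10 by an expert, task finished in 45 minutes
print(task_efficiency(quality=8.0, minutes=45.0))   # ~0.18 quality points per minute
```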
Hypotheses
H1: For KiTs, weakly-structured flexible workflows as proposed by FRODO
will yield higher efficiency of task completion than a-priori strictly-structured
workflows.
H2: For classical workflow processes, FRODO weakly-structured flexible
workflows will be as good as a-priori strictly-structured workflows or better.
Experimental Design
2 x 2 factorial experiment
independent variables: workflow, task type
Dependent variable: efficiency of task completion
Cells (workflow x task type):
• weakly-structured flexible wf / KiT
• strictly-structured wf / KiT
• weakly-structured flexible wf / classical wf process
• strictly-structured wf / classical wf process
Within Subject Design vs. Between Subject Design
Randomized Groups (15-20 for statistical inference)
Possibilities: Degradation Studies, Benchmarking
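One way the data from such a 2 x 2 factorial experiment could be analysed is a two-way ANOVA with an interaction term; the following sketch uses statsmodels, and all observations are invented for illustration:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Invented observations: 5 subjects per cell of the 2 x 2 design.
df = pd.DataFrame({
    "workflow":   ["weak", "weak", "strict", "strict"] * 5,
    "task_type":  ["KiT", "classical"] * 10,
    "efficiency": [0.80, 0.60, 0.50, 0.70, 0.90, 0.55, 0.45, 0.65,
                   0.85, 0.60, 0.50, 0.70, 0.75, 0.62, 0.48, 0.66,
                   0.82, 0.58, 0.52, 0.68],
})

# Two-way ANOVA: main effects of workflow and task type plus their interaction.
model = ols("efficiency ~ C(workflow) * C(task_type)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```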
Empirical Evaluation of Organizational Memory Information Systems
Felix-Robinson Aschoff & Ludger van Elst
1 Introduction
2 Contributions from Related Fields
2.1 Knowledge Engineering
2.1.1 General Methods and Guidelines
(Essential Theories, Critical Success Metrics, Sisyphus, HPKB)
2.1.2 Knowledge Acquisition
2.1.3 Ontologies
2.2 Human Computer Interaction
2.3 Information Retrieval
2.4 Software Engineering (Goal-Question-Metric Technique)
3 Implications for Organizational Memory Information Systems
3.1 Implications for the evaluation of OMIS
3.2 Relevant aspects of OMs for evaluations and
rules of thumb for conducting evaluative research
3.3 Preliminary sketch of an evaluation of FRODO
References
Appendix A: Technical evaluation of Ontologies
References
Aitken, S. & Reid, S. (2000). Evaluation of an ontology-based information retrieval tool. Proceedings of 14th European Conference on Artificial
Intelligence. http://delicias.dia.fi.upm.es/WORKSHOP/ECAI00/accepted-papers.html
Basili, V.R., Caldiera, G. & Rombach, H.D. (1994). Goal question metric paradigm. In John J. Marciniak, editor, Encyclopedia of Software Engineering,
volume 1, 528-532. John Wiley & Sons.
Berger, B., Burton, A.M., Christiansen, T., Corbridge, C., Reichelt, H. & Shadbolt, N.R. (1989). Evaluation criteria for knowledge acquisition,
ACKnowledge project deliverable ACK-UoN-T4.1-DL-001B. University of Nottingham, Nottingham.
Chin, D. N. (2001). Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction, 11: 181-194
Cohen, P. (1995). Empirical Methods for Artificial Intelligence. Cambridge: MIT Press.
Cohen, P.R., Schrag,R., Jones E., Pease, A., Lin, A., Starr, B., Easter, D., Gunning D., & Burke, M. (1998). The DARPA high performance knowledge
bases project. Artificial Intelligence Magazine. Vol. 19, No. 4, pp.25-49.
Gómez-Pérez, A. (1999). Evaluation of taxonomic knowledge in ontologies and knowledge bases. Proceedings of KAW'99.
http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Grüninger, M. & Fox, M.S. (1995) Methodology for the design and evaluation of ontologies, Workshop on Basic Ontological Issues in Knowledge
Sharing, IJCAI-95, Montreal.
Hays, W. L. (1994). Statistics. Orlando: Harcourt Brace.
Kagolovsky, Y., Moehr, J.R. (2000). Evaluation of Information Retrieval: Old problems and new perspectives. Proceedings of 8th International
Congress on Medical Librarianship. http://www.icml.org/tuesday/ir/kagalovosy.htm
Martin, D.W. (1995). Doing Psychological Experiments. Pacific Grove: Brooks/Cole.
Menzies, T. (1999a). Critical success metrics: evaluation at the business level. International Journal of Human-Computer Studies, 51, 783-799.
Menzies, T. (1999b). hQkb - The high quality knowledge base initiative (Sisyphus V: learning design assessment knowledge). Proceedings of
KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Menzies, T. & van Harmelen, F. (1999). Editorial: Evaluating knowledge engineering techniques. International Journal of Human-Computer
Studies, 51, 715-727.
Myers, B., Hollan, J. & Cruz, I. (Ed.) (1996). Strategic directions in human computer interaction. ACM Computing Surveys, 28, 4
Nick, M., Althoff, K., & Tautz, C. (1999). Facilitating the practical evaluation of knowledge-based systems and organizational memories using
the goal-question-metric technique. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new
directions. International Journal of Human-Computer Studies, 51, 729-755.
Tallis, M., Kim, J., & Gil, Y. (1999). User studies of knowledge acquisition tools: methodology and lessons learned. Proceedings of KAW'99.
http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Tennison, J., O’Hara, K., Shadbolt, N. (1999) Evaluating KA tools: Lessons from an experimental evaluation of APECKS. Proceedings of KAW’99
http://sern.ucalgary.ca/KSI/KAW/KAW99/papers/Tennison1/
Tasks for Workflow Evaluation
Possible tasks for the workflow evaluation experiment:
KiT: Please write a report about your personal greatest learning
achievements during the last semester. Find sources related to
these scientific areas in the Internet. Prepare a PowerPoint
presentation.
To help you with this task you will be provided with
FRODO weakly-structured wf / classical workflow
Simple structured task: Please install Netscape on your computer
and use the Internet to find all universities in Iowa that offer
computer science. Use e-mail to ask for further information.
To help you with this task you will be provided with
FRODO weakly-structured wf / classical workflow
GQM – Goals for CBR-PEB
Conceptual level (goals)
GQM-Goals should specify:
a Purpose
a quality Issue
a measurement Object
a Viewpoint
Object of Measurement can be:
Products
Processes
Resources
GQM of CBR-PEB Goal 2 „Economic utility“ (Nick, Althoff & Tautz, 1999)
Analyze: retrieved information
For the purpose of: monitoring
With respect to: economic utility
From the viewpoint of: CBR system developers
In the context of: decision support for CBR developers
GQM – Abstraction Sheet for CBR-PEB
Goal 2 „Economic Utility“ for CBR-PEB
Quality factors:
1. Similarity of retrieved information as modeled in CBR-PEB (Q-12)
2. Degree of maturity (desired: max.) [development, prototype, pilot use] (Q-13)
[...]
Variation factors:
1. Amount of background knowledge
   a. number of attributes (Q-8.1.1)
   [...]
2. Case origin [university, industrial research, industry]
[...]
Baseline hypothesis:
1. M.M.: 0.2; N.N.: 0.5 (scale: 0..1)
[...]
The estimates are averages.
Impact of variation factors:
1. The higher the amount of background knowledge, the higher the similarity. (Q-8)
2. The more „industrial“ the case origin, the higher the degree of maturity. (Q-9) [...]
GQM – Questions and Metrics
GQM plan for CBR-PEB Goal 2 „Economic Utility“
Q-9 What is the impact of the case origin on the degree of
maturity?
Q-9.1 What is the case origin?
M-9.1.1 per retrieval attempt: for each chosen case:
case origin [university, industrial research, industry]
Q-9.2 What is the degree of maturity of the system?
M-9.2.1 per retrieval attempt: for each chosen case:
case attribute „status“ [„prototype“, „being developed“,
„pilot system“, „application in practical use“;
„unknown“]
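A hypothetical sketch of how Q-9 could be answered from such per-retrieval-attempt logs, cross-tabulating case origin against the maturity („status“) attribute; the log entries below are invented:

```python
from collections import Counter

attempts = [  # invented log entries; fields follow M-9.1.1 and M-9.2.1
    {"origin": "industry",            "status": "application in practical use"},
    {"origin": "university",          "status": "prototype"},
    {"origin": "industrial research", "status": "pilot system"},
    {"origin": "industry",            "status": "pilot system"},
]

counts = Counter((a["origin"], a["status"]) for a in attempts)
for (origin, status), n in sorted(counts.items()):
    print(f"{origin:20s} {status:32s} {n}")
```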
FRODO GQM-Goal concerning Ontologies
For the circumstances FRODO is designed for, hybrid solutions are more
successful than AI solutions
Purpose: Compare
Issue: the efficiency of
Object (process): ontology construction and use
with respect to: Stability, Sharing Scope, Formality of Information
Viewpoint: from the user‘s viewpoint
GQM Abstraction Sheet for FRODO (Ont.)
Quality factors: efficiency of ontology construction and use
Variation factors:
• Sharing Scope
• Stability
• Formality
Baseline hypothesis: the experimental design will provide a
control group for comparison
Impact of variation factors:
high Sharing Scope, medium Stability, low Formality -> FRODO more successful
low Sharing Scope, high Stability, high Formality -> AI more successful
GQM Questions and Metrics
What is the efficiency of the ontology construction and use process using
FRODO for a situation with high sharing scope, medium stability and low
formality?
What is the efficiency of the ontology construction and use process using
FRODO for a situation with low sharing scope, high stability and high
formality?
What is the efficiency of the ontology construction and use process using
AI systems for these situations?
Metrics:
efficiency of ontology construction: number of definitions / time
efficiency of ontology use: Information Retrieval (Recall and Precision)
Hypotheses
H1: For Situation 1 (high sharing scope, medium stability, low formality),
FRODO will yield a higher efficiency of ontology construction and use.
H2: For Situation 2 (low sharing scope, high stability and high formality),
an AI system will yield higher efficiency of ontology construction and use.
Experimental Design
2 x 2 factorial experiment
independent variables: situation (1/2), system (FRODO/AI)
Dependent variable: efficiency of ontology construction and use
Cells (situation x system):
• Situation 1 / FRODO
• Situation 2 / FRODO
• Situation 1 / AI system
• Situation 2 / AI system
Within Subject Design vs. Between Subject Design
Randomized Groups (15-20 for statistical inference)
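For a between-subject design, subjects have to be assigned to the four cells at random; a minimal sketch (subject IDs and group size are invented) could look like this:

```python
import random

cells = ["Situation 1 / FRODO", "Situation 2 / FRODO",
         "Situation 1 / AI system", "Situation 2 / AI system"]
subjects = [f"S{i:02d}" for i in range(1, 61)]   # e.g. 60 subjects -> 15 per cell

random.shuffle(subjects)                          # randomized group assignment
groups = {cell: subjects[i::len(cells)] for i, cell in enumerate(cells)}
for cell, members in groups.items():
    print(cell, "->", len(members), "subjects, e.g.", members[:3])
```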
Big evaluation versus small evaluation
Van Harmelen, ‘98
Distinguish different types of evaluation:
• Big evaluation = evaluation of KA/KE methodologies
• Small evaluation = evaluation of KA/KE components (e.g. a particular PSM)
• Micro evaluation = evaluation of KA/KE product (e.g. a single system)
Some are more interesting than others:
• Big evaluation is impossible to control
• Micro evaluation is impossible to generalize
• Small evaluation might just be the only option
Knowledge Acquisition
Problems with the Evaluation of the KA process (Shadbolt et al., 1999)
• the availability of human experts
• the need for a „gold standard“ of knowledge
• the question of how many different domains and tasks should be included
• the difficulty of isolating the value-added of a single technique or tool
• how to quantify knowledge and knowledge engineering effort
Knowledge Acquisition
4) the difficulty of isolating the value-added of a single technique or
tool
a) Conduct a series of experiments.
b) Test different implementations of the same technique against each other or
against a paper-and-pencil version.
c) Test groups of tools in complementary pairings or different orderings of the
same set of tools.
d) Test the value of single sessions against multiple sessions and the effect of
feedback in multiple sessions.
e) Exploit techniques from the evaluation of standard software to control for
effects from interface, implementation etc.
Problem: Scale-up of experimental programme
Essential Theory Approach
1) Identify a process of interest.
2) Create an essential theory T for that process.
3) Identify some competing process description, ¬T.
4) Design a study that explores core pathways in both ¬T and T.
5) Acknowledge that your study may not be definitive.
Advantage: broad conceptual approach; results are of interest for
the entire community
Problem: interpretation of results is difficult (due to the KE school or due
to concrete technology like implementation, interface etc.?)
Three Aspects of Ontology Evaluation
Three aspects of evaluating ontologies:
• the process of constructing the ontology
• the technical evaluation
• end user assessment and ontology-user interaction
Assessment and Ontology-User Interaction
„Assessment is focused on judging the understanding, usability,
usefulness, abstraction, quality and portability of the definitions
from the user‘s point of view.“ (Gómez-Pérez, 1999)
Ontology-user interaction in OMIS:
• more dynamic
• the success of an OMIS relies on active use
• users with heterogeneous skills, backgrounds and tasks