Increasing the Information Density in Digital Library Results

IndoUS Workshop
Gio Wiederhold
Stanford University
22 June 2003
[Map, courtesy of the Univ. of Pittsburgh DL Project]
June 2003
IndoUS
Gio 2
Attention is the issue.
"What information consumes is rather obvious;
it consumes the attention of its recipients.
Hence a wealth of information creates
a poverty of attention, and
a need to allocate that attention
efficiently among the overabundance of
information sources that might consume it."
[Herb Simon]
Complementary objective:
Don't waste the attention afforded to information
My focus: Science & business use
(I ignore now the artistic aspects, also important)
Required by customer:
1. knowledge to process information, and
2. tools to facilitate that process
– Locate
– Select
– Articulate, not Integrate
– Summarize
– Project - exploit data mining
Technologies to filter Information
Survey of Technologies from common to rare
1. Ranking
2. Eliminate redundancy
3. Assure novelty
4. Abstraction
5. Data mining
6. Reduction for visual presentation
7. Modeling
8. Prediction
9. Finding Abnormal Events
1. Ranking
Assumption:
The consumer only considers a few
documents on the top of the list.
a. Ranking by authority.
• Select sites that are valued in a context,
• a journal versus a workshop report,
• a recent document.
b. Ranking by reference authority
• recursive value by references to it (Google)
• extracts global communal knowledge
c. Rank by customer's context
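Ranking by reference authority (item b on the ranking slide) can be illustrated with a small power iteration in which each document passes a share of its value to the documents it references. This is only a sketch under stated assumptions: the function name `reference_rank`, the damping value 0.85, and the three sample documents are hypothetical and are not taken from the talk or from any search engine's actual implementation.

```python
def reference_rank(links, damping=0.85, iterations=50):
    """Power-iteration rank: a document's value comes from the documents citing it.

    `links` maps each document id to the list of documents it references.
    """
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Every document starts each round with a small base value.
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A document with no references spreads its value evenly.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical reference graph, for illustration only.
links = {
    "journal_paper": ["survey"],
    "workshop_report": ["journal_paper", "survey"],
    "survey": ["journal_paper"],
}
ranks = reference_rank(links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this sample the workshop report is referenced by nobody, so it ends up at the bottom of `ordered`, which matches the assumption that the consumer only looks at the top of the list.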
2. Eliminate redundancy
a. If similar documents are retrieved
• present the latest one
• present the highest ranked one, per a suitable criterion, i.e., user's context.
b. Only report differences among documents
• look for additional material
• decide what are significant differences
• abstract differences (see later)
• show differences in layout only as maps
• compute metric of difference
• deal with many documents
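One way to picture item a of the redundancy slide (among similar documents, present the latest one) is a word-overlap comparison. The sketch below assumes Jaccard similarity over word sets and a threshold of 0.6; both are illustrative choices, and the names `jaccard` and `drop_redundant` and the sample documents are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two word sets: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_redundant(docs, threshold=0.6):
    """Keep one document per group of similar documents, preferring the latest.

    `docs` is a list of (doc_id, year, text) tuples.
    """
    kept = []
    # Visit newest first so the latest member of each group is the one kept.
    for doc_id, year, text in sorted(docs, key=lambda d: d[1], reverse=True):
        words = set(text.lower().split())
        if all(jaccard(words, kept_words) < threshold for _, kept_words in kept):
            kept.append((doc_id, words))
    return [doc_id for doc_id, _ in kept]


# Hypothetical retrieved documents, for illustration only.
docs = [
    ("report_v1", 2001, "digital library ranking by reference authority"),
    ("report_v2", 2003, "digital library ranking by reference authority revisited"),
    ("other", 2002, "metabolic models of drug effects on organisms"),
]
kept_ids = drop_redundant(docs)
```

Sorting by rank instead of by year would give the second variant on the slide, presenting the highest ranked document of each group.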
3. Value is in the Novelty
a. Information relative to a document collection
– Exploits prior technologies
b. Information relative to a customer.
– What is the knowledge held by an individual?
– Can it be captured?
Domain recognition to determine context
Avoid (unsolvable?) problem of `common knowledge'
4. Abstraction
a. Only present essentials of textual documents
1. Domain-independent abstraction selecting sentences
that appear to represent the contents;
2. Domain-specific text can be effectively abstracted
a. pathology reports -- being done
b. automatic annotation of gene-sequences from papers.
b. Abstracting contents of document collections
1. Classify
2. Differentiate (2.b)
3. Integrate
4. Semantic matching if the sources are autonomous
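Item a.1 of the abstraction slide, domain-independent selection of sentences that appear to represent the contents, can be sketched with a simple word-frequency score. This is an assumed, minimal technique chosen for illustration, not the method used in the work the slide refers to; the stopword list, the function name `select_sentences`, and the sample text are hypothetical.

```python
import re
from collections import Counter

# A tiny, assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "that", "for"}


def select_sentences(text, count=1):
    """Return the `count` sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    chosen = sorted(ranked[:count])  # restore document order
    return [sentences[i] for i in chosen]


text = (
    "Digital libraries return many documents. "
    "Ranking documents helps the reader of digital libraries. "
    "The weather was pleasant."
)
summary = select_sentences(text, count=1)
```

The off-topic third sentence shares no frequent words with the rest of the text, so it scores lowest and is never selected first.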
5. Data mining
Out of scope for digital library research, but
a. Linking data-mining results with information
from textual sources
strengthens users' explanatory capabilities.
b. Data-mining develops models that can be
further exploited
6. Reduction for Visualization
Motivated by modern customers’ settings
a. Reduce numeric data for visual presentation
1. Common
2. Can be automated, but rarely done well
b. Reduce textual information into visuals
Requires
1. Abstraction
2. Placing the result into some model:
I.e., temporal or spatial aspects:
• Progress notes for a patient – disease model
• Description of an exploratory journey – attach to a map
• Progress of a scientific project – versus proposal
7. Modeling
Models of a domain allow analysis &
manipulation
a. to discern novelty
b. representation of normal behavior
• corporate finances from 10-K
• ecological processes, and global change
• metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.
8. Prediction
Current information technologies,
{ databases, data-mining, digital libraries }
provide only background information for decision-making
Today: decision maker
1. copies results into a spreadsheet
2. adds formulas to make extrapolations into the future
a. Continue model scenarios into the possible futures
1. Investments - monetary, personnel, research, . . .
2. Probabilities of outcomes etc.
b. Allow comparison of alternatives
Information systems should not end their support with the past,
but should also extrapolate the results with the models used for analysis
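The spreadsheet step described on this slide, extrapolating past results and comparing alternatives, can be sketched as a least-squares trend line continued to a future year. The straight-line model is an assumption chosen for brevity; the function names, the alternatives `hire_staff` and `buy_equipment`, and all figures are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, future_x):
    """Continue the fitted trend to a future point."""
    slope, intercept = fit_line(xs, ys)
    return slope * future_x + intercept


# Hypothetical yearly results for two alternative investments.
years = [2000, 2001, 2002, 2003]
alternatives = {
    "hire_staff": [10.0, 12.0, 14.0, 16.0],
    "buy_equipment": [12.0, 13.0, 14.0, 15.0],
}
forecast = {name: extrapolate(years, values, 2005) for name, values in alternatives.items()}
best = max(forecast, key=forecast.get)
```

A fuller treatment would attach probabilities to outcomes, as item a.2 suggests; this sketch produces a single point estimate per alternative.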
9. Finding Abnormal Events
• A hard challenge is discovering abnormal situations.
– I.e., looking for terrorists.
Note: observables are the effect of many good and a few bad scenarios
• Traditional data-mining finds frequent relationships
– serves marketing folk,
• Intelligence tasks seek unusual or abnormal behavior
abduct the processes that generate those data
1. Use model based on recent incidents,
• flight-schools enrollments of terrorists
2. Create and use a reasonable, but hypothetical model
• shipping containers can carry nuclear devices into the US
3. Create a model of normal findings
9+. Create & exploit a normal state model
Prerequisite for finding abnormal events
abnormalities can only be identified if normality can be quantified
a. Populate an initial model with normal findings
– Coverage: all likely causes of some observable(s)
b. Identify variation not due to known causes
• Temporal tracking is better than static schemes
c. Increase coverage as needed - feedback to b.
d. Maintain models to recognize unexplainables
Such models will be large since observed data are the aggregate of
activities from many domains,
travel patterns: business, holidays, and family visits, emergencies.
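The statement that abnormalities can only be identified if normality can be quantified can be made concrete with a very small baseline model. The sketch below assumes that normal findings are summarized by a mean and a standard deviation, and that anything beyond three deviations counts as unexplained; both assumptions, the function names, and the sample counts are hypothetical, and a realistic model would cover many causes and track them over time.

```python
from statistics import mean, stdev


def build_normal_model(observations):
    """Summarize normal findings as a mean and a standard deviation."""
    return mean(observations), stdev(observations)


def find_abnormal(model, new_observations, max_sigma=3.0):
    """Return observations further than `max_sigma` deviations from normal."""
    center, spread = model
    return [x for x in new_observations if abs(x - center) > max_sigma * spread]


# Hypothetical weekly counts of some observable under normal conditions.
normal_weeks = [100, 104, 98, 101, 97, 103, 99, 102]
model = build_normal_model(normal_weeks)
flagged = find_abnormal(model, [101, 96, 140, 105])
```

Step c on the slide corresponds to refitting the model after a flagged value has been explained by a newly covered cause.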
Benefits
A `business model' for justifying ongoing DL research is needed [Y.T.Chien]
• A business model includes benefits and costs
1. Benefits:
– Broad access to knowledge
– Education of the next generation
– Preservation of cultural heritage
– Mutual, inter-cultural understanding, reduction of conflicts
– Improved decision-making (callout: focus of prior slides)
2. Costs
– Time and money spent on information systems
• Technology
• Contents
– Time spent on obtaining the information
– Time spent on analyzing the information
– Due to errors
(second callout on the Costs list: focus of prior slides)
Cost of Errors -- balance
• Type 1 errors: Omitted relevant information
• Lost opportunities
• Unperceived risk
• Suboptimal choices
• Cost: f (variance)
– High if the variance is high
– Low if the variance is low

• Type 2 errors: Excess irrelevant information
• Overload
• Inability to analyze all
• Risk of being misled
• Cost: delay, human
– High if excess is high
• human time is valuable
– Low if precision is high
• purchasing
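The balance between the two error types on this slide can be expressed with the usual retrieval measures: omissions lower recall, and excess lowers precision. The sketch below is illustrative only; the function names, the per-error cost weights, and the document ids are hypothetical, and the linear cost is an assumption rather than the cost function f named on the slide.

```python
def error_profile(retrieved, relevant):
    """Count both kinds of error for one result set.

    Omissions are relevant items that were not retrieved (the slide's Type 1);
    excess is retrieved items that are not relevant (the slide's Type 2).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return {
        "omitted": len(relevant - retrieved),
        "excess": len(retrieved - relevant),
        "recall": len(hits) / len(relevant) if relevant else 1.0,
        "precision": len(hits) / len(retrieved) if retrieved else 1.0,
    }


def total_cost(profile, cost_per_omission, cost_per_excess):
    """Weigh the two error types against each other (assumed linear cost)."""
    return profile["omitted"] * cost_per_omission + profile["excess"] * cost_per_excess


# Hypothetical result set and relevance judgments.
profile = error_profile(retrieved=["d1", "d2", "d3", "d4"], relevant=["d1", "d2", "d5"])
cost = total_cost(profile, cost_per_omission=10.0, cost_per_excess=1.0)
```

Changing the two weights shifts the balance: where human time is the scarce resource, the cost per excess item would be set higher.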
Exploiting Information
[Diagram labels: Data and their relationships; knowledge; ?; Decision; Action 1, Action 2, Action 3; Effects]
Callout on the diagram: Has not been an explicit focus of DL research. It is the point that generates benefits.
The Major Feedback Loop
[Diagram labels:]
user exploiting communities: distilled knowledge, categorized
knowledge
computer scientists will provide tools
user contributing communities: human knowledge, validated by data
Conclusion
• Much work is left to be done with digital libraries
• Exploiting the results will motivate more investment
– In technology
– In content breadth and depth
• Customer's expectations will change
– Global access is here
– Heterogeneity will remain, cause errors
– Ubiquitous access is near
Optional discussion points
• Interfaces
• Personalization
• Heterogeneity
• Computer scientists and their customers
• Data versus relationships
• Disruptive factors
New user interface settings
The new generation
• is more comfortable with screen displays
• can navigate to analyses, backup
• considers paper to be heavy and awkward
• is poor in handwriting and spelling
• is facile in brief keyboard messages
• expects simple voice command technology
Personalization, 2 models
1. Everything about an individual
– learn all about the individual
Context
• slow, delayed, lags
2. An individual as a member of groups
• learn about the likely memberships
{ 8th grader, ...; carpenter, ..; opera goer, ...; ...}
• learn and assign knowledge to group
• inherit knowledge collected in those groups
• leads so that individual also benefits
Interoperation/Interoperability
• Heterogeneity is a fact, and attempts at
enforcing consistency are misguided
• natural consistency will be an outcome
of collaboration,
Data and their relationships
• Data are verifiable first-order objects
– observable
– automatic acquisition is common
• Relationships are also first-order objects
– defined by metadata in context
{ schemas, references, dependencies, is-as, causality, ... }
– Hard to discover
– Instances verifiable in contexts
– Needed for exploitation
Customers and Computer Scientists
• Mutual arrogance fed by misunderstandings
• Differing scientific paradigms
– Mathematical: formal, definite
– Social, biological: case-based, indefinite
Disruptive factors [June 03 NSF meet]
• Technologies
– ubiquitous access
– community empowerment
• data & semantics contribution
– Machine translation of modest quality
• Sociological
– Imposed privacy constraints
– TIA reactions national/international
– Commercial pressures
• skimming the cream
Roadblocks [Y.T. Chien]
• lack of a business model
• matching technology to user needs
• define a research pipeline [NAS HPCC report]