PPT Slides - Center for Computational Sciences

Download Report

Transcript PPT Slides - Center for Computational Sciences

Automating the Committee
Meeting:
Intelligent Integration of Information
From Diverse Sources
Pedrito Maynard-Zhang
Department of Computer Science & Systems Analysis
Information Integration
Information integration is ubiquitous:
• Committee meetings
• Research papers
• Information retrieval on the web
• Assessing intelligence on the battlefield
•…
Information Integration
Outline
• Introduction
• Automating Information Integration
– Database Integration
– Model Integration
– Conflict Resolution and Meta-Information
• Integrating Learned Probabilistic
Information
• Conclusion and Current Work
Multi-Disciplinary Research
• Databases (e.g., Halevy’s group at U. of
Washington)
• Artificial Intelligence (e.g., Stanford’s
Knowledge Systems Laboratory)
• Business (e.g., MIT-Sloan’s Aggregators
Group)
• Decision Analysis (e.g., Clemen &
Winkler’s work at Duke)
Database Integration
Mediation Layer
bioinformatics query
OMIM
Genes
GeneClinics
Proteins
Nucleotide
Sequences
LocusLink
Source Databases
Entrez
Database Integration
• Application: Querying distributed
databases
• Examples
– Bioinformatics
– Corporate data management
– Question-answer systems on the web
– Detecting bioterrorism
Model Integration
super model
if cancer
then operate
…
expert system
cdi = CIRDE
BML
…
probabilistic model mathematical model
Model Integration
• Applications: Diagnosis and prediction
• Examples:
– Medical diagnosis
– NASA spacecraft design and diagnosis
– Expert system integration
– Combining commonsense knowledge bases
Challenges
• Efficient query processing and
optimization
– Parsing XML
• Defining expressive yet tractable
mediator languages
• Handling heterogeneous source
languages
– Wrapper technology development
Challenges
• Resolving ontological differences
– e.g., realizing that the field “Name” for one
source stores the same information as
“First Name” and “Last Name” for another.
• Detecting conflicts
• Resolving conflicts
– Resolution done manually in practice
– We can automate more!
Uninformed Integration
What’s the weather like?
raining
sunny
raining
Intelligent Integration
What’s the weather like?
raining
meteorologist
sunny
practical joker
raining
own eyes
Types of Meta-Information
• Credibility, experience, political clout
• Areas of expertise
• How source acquired information:
– Source’s sources
– Processes source used to accumulate
information
• Structure of the data representation
Outline
• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
–
–
–
–
–
Medical Scenario
Semantic Framework
LinOP-Based Aggregation
Aggregating Bayesian Networks
Experimental Validation
• Conclusion and Current Work
Medical Expert System
Scenario
Expert system
20 years experience
10 years experience
3 years experience
Source Meta-Information
• Doctors learned probabilistic models
from patient data using some known
standard learning algorithm.
• We know the relative amount of
experience doctors have had (i.e., years
of practice).
Popular Aggregation Approaches
• Intuition approach: Take simple
weighted averages, etc.  unexpected
behavior
• Axiomatic approach: Find aggregation
algorithm satisfying certain “obvious”
properties  impossibility results
• Problem: Not semantically grounded
Aggregation Semantics
learning
algorithm
p1
p2
learning
algorithm
…
…
M samples
generated from the
true distribution p
aggregation
algorithm
…
pL
learning
algorithm
learning
algorithm
aggregate
distribution
p^

optimal
distribution
p*
not available in practice
Linear Opinion Pool (LinOP)
• LinOP: Weighted sum of joint distributions.
• Precisely, for joint distributions pi and joint
variable instantiation w,
LinOP(p1, p2, …, pL)(w) =
i ipi(w).
• i weights: relative experience.
• Satisfies unanimity, non-dictatorship, and
marginalization.
• Doesn’t preserve shared independences.
LinOP and Joint Learning
If
– sources learn joint distributions using
maximum likelihood or MAP learning and
– the same learning framework would be
used on the combined data set to learn p*
then
p*  LinOP(p1, p2, …, pL).
Bayesian Network (BN)
• Summary: Compact, graphical
representation of a probability
distribution.
• Definition: Directed acyclic graph (DAG)
over nodes (random variables); each
node has a local conditional probability
distribution (CPD) associated with it.
• Exploits causal structure in the domain.
Alarm BN
P(B)
.001
Burglary
Earthquake
Alarm
A P(J)
+ .90
- .05
JohnCalls
B
+
+
-
E
+
+
-
P(A)
.95
.94
.29
.001
MaryCalls
P(E)
.002
A P(M)
+ .70
- .01
BN Advantages
• Compact representation and graph
encodes conditional independences.
• Elicitation easy in practice.
• Inference efficient in practice.
• Can be learned from data.
• Deployed successfully – medical
diagnosis, Microsoft Office, NASA
Mission Control, and more.
BN Learning
• Idea: Select BN most likely to have
generated data.
• Standard algorithm:
– Search over structures by adding, deleting,
and reversing edges.
– Parameterize and score structures using
statistics from the data.
– Penalize complex structures.
Aggregating BNs
• Each source i learns BN pi.
• p* is the BN we would learn from the
combined data set.
• We want to approximate p* as closely
as possible by aggregating p1, …, pL.
• Source information: estimates for the
relative experience of the sources and
the total amount of data seen (M).
AGGR: BN Aggregation Algorithm
• Idea: Use BN learning algorithm.
• Problem: We don’t have data.
• Key observation: We can use LinOP
to approximate the statistics needed for
the parameterization and scoring steps!
• Also, we can use LinOP properties to
make algorithm reasonably efficient.
Asia BN
Visit to
Asia
Tuberculosis
Smoking
Lung Cancer
Abnormality
in Chest
X-Ray
Bronchitis
Dyspnea
Experimental Setup
• Generate data for sources from wellknown ASIA BN which relates smoking,
visiting Asia, and lung cancer.
• Compare our algorithm AGGR against
the optimal algorithm OPT that has
access to the combined data set.
• Accuracy measure: KL divergence from
generating distribution.
Sensitivity to M Experiments
• Sensitivity to M
– Size of the combined data set M varies.
– AGGR’s estimate of M is accurate.
• Sensitivity to Estimate of M
– Size of the combined data set M is fixed.
– AGGR’s estimate of M varies.
KL Divergence
Sensitivity to M
0.100
0.090
0.080
0.070
0.060
0.050
0.040
0.030
0.020
0.010
0.000
S1
S2
OPT
AGGR
200
600
1000
1400
1800
M
2200
2600
3000
Sensitivity to Estimate of M
0.0016
S1
S2
OPT
AGGR
KL Divergence
0.0014
0.0012
0.0010
0.0008
0.0006
0.0004
0.0002
0.0000
-1.00 -0.75 -0.50 -0.25 0.00 0.25
M'/M (log10)
M=10k
0.50
0.75 1.00
Subpopulations
• Each source’s data may come from a
different subpopulation P(D|Si), where
D is the data.
• We want to learn P(D).
• P(D) = LinOP(P(D|S1), P(D|S2), …, P(D|SL))
with sources’ weights based on P(Si).
• We can apply the same algorithm.
Subpopulations Experiments
• In the Asia network domain, one doctor
practices in San Francisco, another in
Cincinnati.
• Subpopulations have different priors for
smoking and having visited Asia, so
doctors’ beliefs are biased.
• The aggregate distribution comes much
closer to the original distribution.
Doctor
Asia BN
Visit to
Asia
Tuberculosis
Smoking
Lung Cancer
Abnormality
in Chest
X-Ray
Bronchitis
Dyspnea
KL Divergence
Subpopulations
0.18
0.16
0.14
0.12
0.10
0.08
0.06
0.04
0.02
0.00
S1
S2
OPT
AGGR
200
600 1000 1400 1800 2200 2600 3000
M
Contributions
• A semantic framework for
aggregating learned probabilistic
models.
• A LinOP-based algorithm for
aggregating learned BNs.
• Experiments showing algorithm
behaves well.
Outline
• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic
Information
• Conclusion and Current Work
Conclusion
• Conflict resolution is key in automated
information integration.
• This is a difficult task in general.
• However, information about sources is
often readily available.
• Principled use of this information can
greatly enhance the ability to resolve
conflicts intelligently.
Current Work
• Allow dependence between sources’ data sets
in probabilistic aggregation work.
• Apply semantic framework to aggregation in
other learning paradigms.
• Explore application of algorithms to database
integration, RoboCup, stock market
prediction, etc.
• Making committee meetings obsolete!
Multi-Agent Research Zone
• Research interests:
– Information integration
– Multi-agent machine learning
– RoboCup soccer simulation league testbed
• Masters students
– Jian Xu: Information integration in medical
informatics
– Linxin Gan: Ensemble learning in stock market
prediction
CSA Graduate Program
• Masters in Computer Science
• Research areas include:
– machine learning, KRR, and MAS
– information retrieval, databases, and NLP
– networking and virtual environments
– simulation and evolutionary computation
– software engineering and formal methods
http://unixgen.muohio.edu/~maynarp/
[email protected]