ECOR: European Centre for Ontological Research
Basic Introduction to Ontology-based Language Technology (LT) for the Biomedical Sciences
(1st year Biomedicine, UG, Belgium)
Werner Ceusters
European Centre for Ontological Research
Universität des Saarlandes
Saarbrücken, Germany
Purpose of this lecture
• Introduce some keywords
• Give just a taste of ontology-based LT in Biomedicine
• Spark interest in further research
Biomedicine: A Great Area for LT
• Educated users
• High utility of NLP
• Doesn’t require solution to general problem
• Complex and interesting (not just IE)
• Recent surge in data
• Knowledge bases available
Hinrich Schütze, Novation Biosciences
Russ Altman, Stanford University
Biomedical Data Mining and DNA Analysis
• DNA sequences: 4 basic building blocks (nucleotides):
adenine (A), cytosine (C), guanine (G), and thymine (T).
• Gene: a sequence of hundreds of individual nucleotides
arranged in a particular order
• Humans were once estimated to have around 100,000 genes; current estimates are closer to 20,000 protein-coding genes
• Tremendous number of ways that the nucleotides can be
ordered and sequenced to form distinct genes
• Semantic integration of heterogeneous, distributed
genome databases
– Current: highly distributed, uncontrolled generation and
use of a wide variety of DNA data
– Data cleaning and data integration methods developed
in data mining will help
Jiawei Han and Micheline Kamber
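The "tremendous number of ways" on this slide can be made concrete: with 4 nucleotides, there are 4^n possible sequences of length n. The sketch below only does that arithmetic; the function name `count_sequences` is ours and not from the lecture.

```python
def count_sequences(length: int, alphabet: str = "ACGT") -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return len(alphabet) ** length


# Growth is exponential in the sequence length.
for n in (1, 10, 100):
    print(n, count_sequences(n))
```

Even a modest sequence of a few hundred nucleotides has more possible orderings than could ever be enumerated.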
DNA Analysis: Examples
• Similarity search and comparison among DNA sequences
– Compare the frequently occurring patterns of each class (e.g.,
diseased and healthy)
– Identify gene sequence patterns that play roles in various diseases
• Association analysis: identification of co-occurring gene sequences
– Most diseases are not triggered by a single gene but by a
combination of genes acting together
– Association analysis may help determine the kinds of genes that
are likely to co-occur together in target samples
• Path analysis: linking genes to different disease development stages
– Different genes may become active at different stages of the
disease
– Develop pharmaceutical interventions that target the different
stages separately
• Visualization tools and genetic data analysis
Jiawei Han and Micheline Kamber
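Two of the analyses listed above can be sketched in a few lines. This is a toy illustration under our own assumptions, not the method of Han and Kamber: similarity is approximated by the Jaccard overlap of k-mers (substrings of length k), and association analysis by counting how often two genes occur in the same sample. The function names and the gene labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmers(seq: str, k: int = 3) -> set:
    """Set of all substrings of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)


def co_occurrence(samples) -> Counter:
    """Count, for each pair of genes, the number of samples containing both."""
    counts = Counter()
    for genes in samples:
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


print(jaccard("ACGTAC", "ACGTTC"))
print(co_occurrence([["g1", "g2"], ["g2", "g1", "g3"]]).most_common(1))
```

Real sequence comparison would normally rely on alignment-based tools; the k-mer overlap is used here only because it fits in a few lines.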
Task descriptions
• Sequence similarity searching
– Nucleic acid vs nucleic acid: 28
– Protein vs protein: 39
– Translated nucleic acid vs protein: 6
– Unspecified sequence type: 29
– Search for non-coding DNA: 9
• Functional motif searching: 35
• Sequence retrieval: 27
• Multiple sequence alignment: 21
• Restriction mapping: 19
• Secondary and tertiary structure prediction: 14
• Other DNA analysis including translation: 14
• Primer design: 12
• ORF analysis: 11
• Literature searching: 10
• Phylogenetic analysis: 9
• Protein analysis: 10
• Sequence assembly: 8
• Location of expression: 7
• Miscellaneous: 7
Stevens R, Goble C, Baker P, and Brass A. A Classification of Tasks in Bioinformatics. Bioinformatics 2001; 17(2):180-188.
Three major challenges
• Analyse massive amounts of data:
– E.g., high-throughput technologies based upon cDNA or
oligonucleotide microarrays for analysis of gene
expression, analysis of sequence polymorphisms and
mutations, and sequencing
• Appropriately link clinical histories to molecular or
other biomarker data generated by genomic and
proteomic technologies.
• Develop user-friendly computer-based platforms
– that can be accessed and utilized by the average
researcher for searching, retrieval, manipulation, and
analysis of information from large-scale datasets
BUT !!!
• The majority of the data is buried in
– huge amounts of text
– incompatibly annotated databases
Text overload
– According to a conservative estimate, the
number of digital libraries is more than 10^5.
• [Norbert Fuhr 03]
– Google indexed over 4.28 billion web pages;
• from Google press release.
– But, any single engine is prevented from
indexing more than one-third of the “indexable
web”.
• from Science, Vol. 285, No. 5426.
Objectives of LT in
Biomedical Informatics
• Make large volumes of scientific texts more
accessible
• Assist annotation of genome and phenome
to allow better linking of the data
– CSB: Computational Systems Biology
• Link biomedical data with patient record
data
Knowledge discovery and use
Text Mining Technologies
for Biomedicine
[Figure: text mining technologies plotted by cost (low to high) against effectiveness (low to high). Labels: Keyword-based Retrieval (PubMed); Primary Literature Reading; Structure Mining; Information Extraction (Fastus); Artificial Intelligence / Manual Knowledge Representation (Cyc, Riboweb)]
Hinrich Schütze, Novation Biosciences
Russ Altman, Stanford University
Scientists in areas such as molecular biology and biochemistry
aim to discover new biological entities and their functions.
Typical cases could be discoveries of the implications of new
proteins and genes in an already known process, or implication
of proteins with previously characterized functions in a
separate process.
The use of available information (published papers, etc.) is a
key step for the discovery process, since in many cases weak or
indirect evidences about possible relations hidden in the
literature are used to substantiate working hypothesis that are
experimentally explored.
[C.Blaschke, A.Valencia: 2001]
Text-based
knowledge discovery
• Goal:
Finding “new” biomedical scientific knowledge
through the combination of existing knowledge
as represented in the medical literature
• Motivation:
Avoid reinventing the wheel; reuse specific
knowledge outside its original domain of
discovery
Swanson
Substance A: Fish oil
Effects B: High blood viscosity, Platelet aggregation
Disease C: Raynaud’s disease
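The A-B-C pattern on this slide can be sketched as a search for indirect links: given pairs that the literature reports together, propose every C that is reachable from A through some B but is not linked to A directly. This is a minimal sketch under our own assumptions; the function name `abc_candidates` and the pair representation are hypothetical, and the pairs do not encode the direction of an effect (for example, that a substance lowers rather than raises a value).

```python
from collections import defaultdict


def abc_candidates(links, start):
    """Map each indirectly linked term C to the intermediate terms B.

    links: iterable of (source, target) pairs reported in the literature.
    start: the A term to start from.
    """
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)

    direct = set(graph[start])
    found = defaultdict(set)
    for b in direct:
        for c in set(graph[b]):
            # Keep only terms that are not already directly linked to A.
            if c != start and c not in direct:
                found[c].add(b)
    return dict(found)


links = [
    ("fish oil", "high blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("high blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
]
print(abc_candidates(links, "fish oil"))
```

In practice the pairs would come from co-occurrence or relation extraction over abstracts, and the candidate list would need ranking and expert review.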
Protein-Protein
Interaction extracted
from texts
by C. Blaschke
Steps of Knowledge Discovery
• Training data gathering
• Feature generation
– k-grams, domain know-how, ...
• Feature selection
– Entropy, 2, CFS, t-test, domain know-how...
• Feature integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Some classifiers/learning methods
Limsoon Wong
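The first two steps above can be illustrated with a small sketch: generate k-grams as features, then score each feature against a binary class label with the chi-square statistic for a 2x2 table. The function names, the toy sequences and the labels are our own assumptions; the final step (feature integration with a classifier such as an SVM) is not shown.

```python
def kgrams(text: str, k: int = 3) -> list:
    """All overlapping substrings of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def chi_square(docs, labels, feature) -> float:
    """Chi-square score of a feature's presence against a binary label.

    docs: list of sets of features; labels: list of 0/1 class labels.
    """
    n11 = n10 = n01 = n00 = 0
    for doc, label in zip(docs, labels):
        present = feature in doc
        if present and label:
            n11 += 1
        elif present:
            n10 += 1
        elif label:
            n01 += 1
        else:
            n00 += 1
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # feature or class is constant: no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Toy data: two sequences per class, represented by their 2-grams.
sequences = ["ACGT", "ACGA", "TTTT", "TTGG"]
labels = [1, 1, 0, 0]
docs = [set(kgrams(s, 2)) for s in sequences]
features = sorted(set().union(*docs))
ranked = sorted(features, key=lambda f: chi_square(docs, labels, f), reverse=True)
print(ranked[:3])
```

Features with a high score separate the two classes well and would be kept for the classifier.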
Functional components
for text-based
feature generation system
• Basic use components (end-user):
– Corpus Management tool
– Parser
– Export module
• Management components (super user, exporter):
– Corpus editor
– Grammar building workbench
– Domain Ontology editor
– Parser generator
– Linguistic ontology (multi-lingual use)
What does it take
to build such a system ?
• Short term: single domain
– Corpus collection & analysis
– Domain model design & implementation
– Grammar Development
– Corpus Manipulation Engine
– Integration in Biomining package
• Long term: generic system
– Grammar Building Workbench
– Parser Generator
– Documentation
22 page full paper
A “statistics only system”
ABSTRACT ONLY
Relative Concept/Node identification (real)
[Figure: relative identification (0 to 0.4) of concepts and nodes, plotted against the number of words (0 to 5000)]
Statistical analysis is powerful, but not enough
Clean separation of
knowledge
for deep understanding
The Galen view:
– linguistic knowledge
– conceptual knowledge
– pragmatic knowledge
– criteria knowledge
– terminological knowledge
The LT view:
– phonologic knowledge
– morphologic knowledge
– syntactic knowledge
– semantic knowledge
– pragmatic knowledge
– world knowledge
One word – multiple meanings
• Abbreviation Extraction (Schwartz 2003)
– Extracts short and long form pairs
Short form: AA
Long forms:
– Alcoholic Anonymous
– American
– Americans
– Arachidonic acid
– arachidonic acid
– amino acid
– amino acids
– anaemia
– anemia
– ...
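The extraction of short form and long form pairs can be sketched as follows. This is a simplified rendering, from memory, of the right-to-left character matching used in the Schwartz and Hearst approach, restricted to the pattern "long form (SF)"; the function names, the candidate window and the short-form filter are our assumptions and may differ from the published algorithm.

```python
import re


def find_best_long_form(short_form: str, candidate: str):
    """Match short-form characters right to left inside the candidate text.

    Returns the shortest matching long form, or None if no match exists.
    """
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        current = short_form[s_index].lower()
        if not current.isalnum():
            s_index -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must also sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != current) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_pairs(sentence: str) -> list:
    """Find (short form, long form) pairs of the shape 'long form (SF)'."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        if not (2 <= len(short) <= 10 and short[0].isalnum()):
            continue
        words = sentence[:match.start()].split()
        # Candidate window: a few words more than the short form has letters.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("The release of arachidonic acid (AA) was measured."))
print(extract_pairs("Chains of amino acids (AA) fold into proteins."))
```

The two calls show the point of the slide: the same short form AA resolves to different long forms depending on the text it occurs in.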
Syntactic variant detection
• Corpus
– MEDLINE: the largest collection of abstracts in
the biomedical domain
• Rule learning
– 83,142 abstracts
– Obtained rules: 14,158
• Evaluation
– 18,930 abstracts
– Count the occurrences of each generated
variant.
[Tsuruoka et al., SIGIR 03]
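The evaluation step (generate variants, then count how often each occurs) can be sketched as below. The three rewrite rules are hand-written and hypothetical; in the cited work the rules and their probabilities are learned from MEDLINE, which this sketch does not attempt. Function names and the toy abstract are also ours.

```python
import re


def hyphenate_prefix(term: str) -> str:
    """Insert a hyphen after a leading 'anti' (hypothetical rule)."""
    lowered = term.lower()
    if lowered.startswith("anti") and not lowered.startswith("anti-"):
        return term[:4] + "-" + term[4:]
    return term


def pluralise(term: str) -> str:
    """Append 's' unless the term already ends in 's' (hypothetical rule)."""
    return term if term.endswith("s") else term + "s"


def capitalise(term: str) -> str:
    """Upper-case the first character (hypothetical rule)."""
    return term[:1].upper() + term[1:]


RULES = [hyphenate_prefix, pluralise, capitalise]


def generate_variants(term: str, rules=RULES) -> list:
    """Apply each rule once to every variant generated so far."""
    variants = [term]
    for rule in rules:
        for variant in list(variants):
            new = rule(variant)
            if new not in variants:
                variants.append(new)
    return variants


def count_occurrences(variants, abstracts) -> dict:
    """Count whole-term, case-sensitive occurrences of each variant."""
    counts = {}
    for variant in variants:
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        counts[variant] = sum(len(re.findall(pattern, a)) for a in abstracts)
    return counts


abstracts = ["An anti-inflammatory effect was seen. The anti-inflammatory effects persisted."]
variants = generate_variants("antiinflammatory effect")
for variant, count in count_occurrences(variants, abstracts).items():
    print(count, variant)
```

A variant with a zero count, such as a capitalised form that never appears, corresponds to the zero-frequency rows one would expect in a results table for this kind of evaluation.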
Results: “antiinflammatory effect”
Generation Probability   Generated Variants          Frequency
1.0 (input)              antiinflammatory effect     7
0.462                    anti-inflammatory effect    33
0.393                    antiinflammatory effects    6
0.356                    Antiinflammatory effect     0
0.286                    antiinflammatory-effect     0
0.181                    anti-inflammatory effects   23
:                        :                           :
Results: “tumour necrosis factor alpha”
Generation Probability   Generated Variants             Frequency
1.0 (input)              tumour necrosis factor alpha   15
0.492                    tumor necrosis factor alpha    126
0.356                    tumour necrosis factor-alpha   30
0.235                    Tumour necrosis factor alpha   2
0.175                    tumor necrosis factor alpha    182
0.115                    Tumor necrosis factor alpha    8
:                        :                              :
Biomedical NE Task
(Collier Coling00, Kazama ACL02, Kim ISMB02)
• Recognize “names” in the text
– Technical terms expressing proteins, genes,
cells, etc.
Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.
Tags: CIITA = PROTEIN; class II genes = DNA; B cell = CELLTYPE; class II promoters = DNA
Identify and classify
Junichi Tsujii
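The "identify and classify" task can be illustrated with the simplest possible baseline: a dictionary lookup that prefers the longest matching name. This is a swapped-in technique for illustration only; the systems cited on the slide use statistical learning, not a lookup table. The lexicon and the function name `tag_entities` are hypothetical.

```python
import re

# Hypothetical toy lexicon: surface form -> entity class.
LEXICON = {
    "CIITA": "PROTEIN",
    "class II genes": "DNA",
    "class II promoters": "DNA",
    "B cell": "CELLTYPE",
}


def tag_entities(text: str, lexicon=LEXICON) -> list:
    """Return (name, class) pairs in text order, longest names first."""
    names = sorted(lexicon, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in names)
    return [(m.group(0), lexicon[m.group(0)]) for m in re.finditer(pattern, text)]


sentence = (
    "Thus, CIITA not only activates the expression of class II genes "
    "but recruits another B cell-specific coactivator to increase "
    "transcriptional activity of class II promoters in B cells."
)
for name, label in tag_entities(sentence):
    print(label, name)
```

A lookup like this fails on unseen names and on ambiguous ones, which is why the task is usually treated as a learning problem.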
Text mining and classification
[Figure: ontology graph for the sentence below. Nodes: Generalised Possession, Human, Healthcare phenomenon, Having a healthcare phenomenon, Patient, Malignant neoplasm, Cancer patient, lung carcinoma. Links: IS-A, Has-possessor, Has-possessed, Is-possessor-of, Has-Healthcare-phenomenon]
Mr. Smith has a pulmonary carcinoma
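One way to read the example is as classification over an IS-A hierarchy: the surface term "pulmonary carcinoma" is mapped to a concept, the concept is found to be a malignant neoplasm, and the person is therefore classified as a cancer patient. The sketch below assumes a small hierarchy loosely based on the node labels of the slide; the exact edges, the synonym table and the function names are our assumptions.

```python
# Linguistic knowledge: surface terms mapped to concepts (hypothetical).
SYNONYMS = {"pulmonary carcinoma": "lung carcinoma"}

# Conceptual knowledge: child -> parent IS-A links (hypothetical fragment).
IS_A = {
    "lung carcinoma": "malignant neoplasm",
    "malignant neoplasm": "healthcare phenomenon",
    "cancer patient": "patient",
    "patient": "human",
}


def is_a(concept, ancestor, hierarchy=IS_A) -> bool:
    """True if the concept equals the ancestor or inherits from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = hierarchy.get(concept)
    return False


def classify_person(terms) -> str:
    """Classify a person from the terms for the phenomena they have."""
    concepts = [SYNONYMS.get(t, t) for t in terms]
    if any(is_a(c, "malignant neoplasm") for c in concepts):
        return "cancer patient"
    if any(is_a(c, "healthcare phenomenon") for c in concepts):
        return "patient"
    return "human"


print(classify_person(["pulmonary carcinoma"]))
```

Keeping the synonym table apart from the hierarchy mirrors the separation of linguistic and conceptual knowledge that the lecture argues for.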
Data integration approaches
at least, the beginnings of ...
• Protein interaction databases
• Small molecule databases
• Genome databases
• Pathway databases
• Protein databases
• Enzyme databases
Gene Ontology
Data Integration approaches / System Integration approaches
1. Data Warehousing: Data from various data sources are converted, merged and stored in a centralized DBMS. Example: Integrated Genomic Database.
2. Hyperlinking approaches: Links are set up between related information and data sources. Examples: SRS, Entrez (NCBI).
3. Standardization: Efforts which address the need for a common metadata model for various application domains.
4. Integration systems: Systems that can gather and integrate information from multiple sources. Some of these systems have a Mediator-Wrapper Architecture; others are language-based systems like Bio-Kleisli.
5. Federated Database: Cooperating, yet autonomous, databases map their individual schemas to a single global schema. Operations are performed against the federated schema.
Steve Brady
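The federated idea (autonomous sources mapped to one global schema, with queries run against that schema) can be sketched with plain dictionaries. All source names, field names and records below are hypothetical; a real federation would wrap actual database connections rather than in-memory lists.

```python
# Two autonomous sources with their own field names (hypothetical data).
SOURCES = {
    "protein_db": [{"acc": "P1", "gene_symbol": "TNF"}],
    "genome_db": [{"id": "G7", "symbol": "TNF"}, {"id": "G8", "symbol": "IL6"}],
}

# Per-source mapping from local field names to the global schema.
MAPPINGS = {
    "protein_db": {"acc": "record_id", "gene_symbol": "gene"},
    "genome_db": {"id": "record_id", "symbol": "gene"},
}


def to_global(source: str, record: dict) -> dict:
    """Translate one local record into the global schema."""
    mapping = MAPPINGS[source]
    mapped = {mapping[key]: value for key, value in record.items() if key in mapping}
    mapped["source"] = source
    return mapped


def federated_query(gene: str) -> list:
    """Run one query against the global schema across all sources."""
    results = []
    for source, records in SOURCES.items():
        for record in records:
            mapped = to_global(source, record)
            if mapped.get("gene") == gene:
                results.append(mapped)
    return results


for row in federated_query("TNF"):
    print(row)
```

The sources keep their own schemas; only the mappings have to be maintained when a source changes.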
CoMeDIAS (France)
GenesTrace™: Biological Knowledge Discovery via Structured Terminology
The XML misconception
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="cr-radio.xsl" ?>
<CR-RADIOLOGIE><ENTETE>
<INFORMATION-SERVICE>
<HOPITAL>Groupe hospitalier Léonard Devintscie</HOPITAL>
<SERVICE>Radiologie Centrale</SERVICE><MEDECIN>Dr. Bouaud</MEDECIN>
<TITRE-EXAMEN>Phlébographie des membres inférieurs</TITRE-EXAMEN>
</INFORMATION-SERVICE>
<INFORMATION-DEMANDE>
<SERVICE>Sce Pr. Charlet</SERVICE><MEDECIN>Dr. Brunie</MEDECIN>
<DATE>29-10-99</DATE>
</INFORMATION-DEMANDE>
<INFORMATION-PATIENT ID="236784020"><NOM>Donald</NOM>
<PRENOM>Duck</PRENOM></INFORMATION-PATIENT></ENTETE>
<BODY>
<INDICATION>Suspicion de phlébite de jambe gauche</INDICATION>
<TECHNIQUE>Ponction bilatérale d’une veine du dos du pied et injection
de 180cc de produit de contraste</TECHNIQUE>
<RESULTATS>image lacunaire endoluminale visible au niveau des veines péronières
gauche. Absence d’opacification des veines tibiales antérieures et postérieures gauches.
Les veines illiaques et la veine cave inférieure sont libres.
</RESULTATS>
<CONCLUSION>Trombophlébite péronière et probablement tibiale antérieure et
postérieure gauche.</CONCLUSION>
</BODY>
</CR-RADIOLOGIE>
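One reading of the slide's point is that XML markup gives structure, not meaning: a parser treats tag names as opaque labels. The sketch below parses a shortened copy of the header above with Python's standard `xml.etree.ElementTree`. It finds both `MEDECIN` elements equally, and telling the performing physician from the requesting one requires knowledge that the programmer supplies, not the markup. The fragment and the function name are ours.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the report header shown above.
FRAGMENT = """<CR-RADIOLOGIE><ENTETE>
<INFORMATION-SERVICE>
<SERVICE>Radiologie Centrale</SERVICE><MEDECIN>Dr. Bouaud</MEDECIN>
</INFORMATION-SERVICE>
<INFORMATION-DEMANDE>
<SERVICE>Sce Pr. Charlet</SERVICE><MEDECIN>Dr. Brunie</MEDECIN>
</INFORMATION-DEMANDE>
</ENTETE></CR-RADIOLOGIE>"""

root = ET.fromstring(FRAGMENT)

# The parser returns every MEDECIN element; it cannot say who did what.
doctors = [element.text for element in root.iter("MEDECIN")]
print(doctors)


def doctors_by_section(root) -> dict:
    """Group doctors by the tag of the enclosing header section."""
    result = {}
    for section in root.find("ENTETE"):
        doctor = section.find("MEDECIN")
        if doctor is not None:
            result[section.tag] = doctor.text
    return result


# The section tags are still only strings; their meaning is in our heads.
print(doctors_by_section(root))
```

Renaming every tag to an arbitrary string would leave the program's behaviour unchanged apart from the names, which is why machine-readable semantics needs more than markup.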
Towards Machine Readable Semantics
[Figure: table of "data about" five aspects of a document, each with its own type definition: Form (Style Type Definition), Structure (Document Type Definition), Meaning (Information Type Definition), Function (Knowledge Type Definition), Usage (Workflow Type Definition). Example entries: Bold, Centred, Align Left; Title, Paragraph, Heading1; Subject, isPartOf, Date; Utility, affectedBy, Actor; Formalism, Cases; Static, Dynamic, Standard; Blink, Play, After_value; Receive, Protect. Further labels: Layout, Outline, Content, Behaviour; Receival, Maintenance, Archival, Process]
Hao Ding, Ingeborg T. Sølvberg
Triadic models of meaning: The Semiotic/Semantic triangle
Reference: Concept / Sense / Model / View
Sign: Language / Term / Symbol
Referent: Reality / Object
There is ontology and “ontology”
• Ontology in Information Science:
– “An ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
• Ontology in Philosophy:
– “Ontology is the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.”
[Figure labels: concept, definition, term, referent]
Why are concepts
not enough?
• Why must our theory address also the
referents in reality?
– Because referents are observable fixed
points in relation to which we can work out
how the concepts used by different
communities relate to each other ;
– Because only by looking at referents can
we establish the degree to which concepts
are good for their purpose.
Or you get nonsense:
Definition of “cancer gene”
Take home message:
Language Technology requires a clean separation of knowledge AND (the right sort of) ontology
• Pragmatic knowledge: what users usually say or think, what they consider important, how to integrate in software
• Knowledge of classification and coding systems: how an expression has been classified by such a system
• Knowledge of definitions and criteria: how to determine if a concept applies to a particular instance
• Surface linguistic knowledge: how to express the concepts in any given language
• Conceptual knowledge: the knowledge of sensible domain concepts
• Ontology: what exists and how what exists relates to each other