Transcript Slide 1

National Centre for Text Mining
• Mission
To provide TM tools for users, in particular, scientists and
researchers
To coordinate activities in the TM community
• Core Partners
University of Manchester: NLP and DM
Salford University: Terminology
Liverpool University: IR and Digital Archive
• External Partners
San Diego SC, UC Berkeley, University of Geneva,
University of Tokyo
1
National Centre for Text Mining
• Mission
To provide TM tools for users, in particular, scientists and
researchers
Biomedical domain
To coordinate activities in the TM community
• Core Partners
University of Manchester: NLP and DM
Salford University: Terminology
Liverpool University: IR and Digital Archive
• External Partners
San Diego SC, UC Berkeley, University of Geneva,
University of Tokyo
2
Strategy and Roadmap
for TM in Biomedicine
Vast number of Google/Yahoo users, satisfied
Huge Demand for specialized tools for
TM in Bio-Medical Domains
Small number of users, unsatisfied
The current TM tools, though successful in some business applications,
do not meet requirements of users in bio-medical domains.
More publicity and marketing
More demand-oriented approach
What are the requirements for TM for users in bio-medical domains?
What technologies should be integrated in future TM for science?
Is the nature of TM in scientific fields different from that of business
applications?
3
From technological seeds
4
Science: Knowledge
Raw Data
Unstructured Information (Text)
Semi-structured Information (XML+Text)
Structured Information (Data bases)
Effective management of text and knowledge is the key
Natural Language
Processing
Intelligent
Text Management
System
Ontology-based
KMS
5
Intelligent TM systems
Retrieval
Intelligent Information Retrieval and Question Answering
Integration
Integration of Text with Data and Knowledge
Discovery
Text Mining and Knowledge Discovery
6
From Text to Knowledge
Non-Trivial Mappings
Ontology
Relationships among concepts
Metabolic Pathways
Signal Pathways
Association between Diseases and Genes
……
Terminology
NLP
Paraphrasing
Language Domain
Knowledge Domain
Motivated independently of language
7
Examples of Technical Seeds
• Term Variants
– Terms (names of proteins, genes, diseases,
symptoms, etc.) denote basic conceptual units in the
knowledge domain.
• Syntactic Variants
– Relationships and complex conceptual units are
mapped to sentences.
• Term Acquisition from Text
– New terms (basic conceptual units) are constantly
introduced. Resource building for specialized
domains is crucial.
8
Hypernym
acronym
NF-kappa B
NF kappa B
NFKB factor
NF-KB
NF kB
Spelling variation
Expanded form
Synonym
nuclear factor-kappa B
nuclear-factor kappa B
nuclear factor kappa B
nuclear factor κB
Nuclear Factor kappa B
………..
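Spelling variants like the ones listed on this slide can often be grouped by a simple normalization step. The following is a minimal sketch, not the method used by the presenters: `normalize_term`, `group_variants`, and the `GREEK_NAMES` table are hypothetical names introduced here for illustration.

```python
import re

# Greek letters that commonly alternate with their spelled-out names.
GREEK_NAMES = {"κ": "kappa", "α": "alpha", "β": "beta"}

def normalize_term(term: str) -> str:
    """Map a term to a key that ignores case, hyphens, and spacing."""
    text = term.lower()
    for symbol, name in GREEK_NAMES.items():
        text = text.replace(symbol, name)
    text = re.sub(r"[-_./]", " ", text)   # punctuation used as a separator
    return "".join(text.split())          # drop all whitespace

def group_variants(terms):
    """Group terms that share the same normalized key."""
    groups = {}
    for term in terms:
        groups.setdefault(normalize_term(term), []).append(term)
    return groups

variants = [
    "NF-kappa B", "NF kappa B", "NF-KB", "NF kB",
    "nuclear factor-kappa B", "nuclear factor κB", "Nuclear Factor kappa B",
]
groups = group_variants(variants)
```

Normalization of this kind only merges spelling variation. The acronym "NF-KB" and the expanded form "nuclear factor kappa B" still end up under different keys, so linking them needs a dictionary or a separate acronym-matching step.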
10
Automatically Generated Term Variants (1)
1.000  NF kappa B                                128
0.500  Transcription Factor NF kappa B             0
0.429  NF-kappa B                                912
0.286  NF kB                                       0
0.286  Immunoglobulin Enhancer-Binding Protein     0
0.286  Immunoglobulin Enhancer Binding Protein     0
0.286  Transcription Factor NF-kB                  0
0.286  Transcription Factor NF kB                  0
0.286  Factor NF-kB, Transcription                 0
0.286  nuclear factor kappa beta                   2
0.286  NF kappaB                                   1
0.273  NF kappa B chain                            0
0.273  NF kappa B subunit                          0
0.214  Transcription Factor NF-kappa B             0
0.214  NF-kB, Transcription Factor                 0
0.214  NF-kB                                      67
0.200  Neurofibromatosis Type kappa B              0
11
Automatically Generated Term Variants (2)
1.000  tumor necrosis factor A                     0
0.316  TNF A                                       1
0.200  tumor necrosis factor                    1653
0.158  TNF alpha                                 358
0.133  TNFA                                       32
0.133  TNF                                      2631
0.133  Tumour necrosis factor alpha               14
0.133  Tumor Necrosis Factor alpha                 2
0.133  Tumor Necrosis Factor-Alpha                 0
0.133  TUMOR NECROSIS FACTOR.ALPHA                 0
0.133  Tumor necrosis factor alpha                52
0.133  Tumor Necrosis Factor-alpha                 8
0.133  TNF-Alpha                                   0
0.133  TNF-alpha                                6899
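One way to use a list like this is to keep only the generated variants that are actually attested. The sketch below assumes that the three groups of values on the slide are a score, the generated variant, and an occurrence count; the slide does not label them, so this reading is an assumption. `attested_variants` and `min_count` are hypothetical names.

```python
# (score, variant, count) rows as read from the slide.
# Assumption: the column meanings are not labeled in the source.
ROWS = [
    (1.000, "tumor necrosis factor A", 0),
    (0.316, "TNF A", 1),
    (0.200, "tumor necrosis factor", 1653),
    (0.158, "TNF alpha", 358),
    (0.133, "TNFA", 32),
    (0.133, "TNF", 2631),
    (0.133, "Tumour necrosis factor alpha", 14),
    (0.133, "Tumor Necrosis Factor alpha", 2),
    (0.133, "Tumor Necrosis Factor-Alpha", 0),
    (0.133, "TUMOR NECROSIS FACTOR.ALPHA", 0),
    (0.133, "Tumor necrosis factor alpha", 52),
    (0.133, "Tumor Necrosis Factor-alpha", 8),
    (0.133, "TNF-Alpha", 0),
    (0.133, "TNF-alpha", 6899),
]

def attested_variants(rows, min_count=1):
    """Keep variants whose count reaches min_count, most frequent first."""
    kept = [(variant, count) for _, variant, count in rows if count >= min_count]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

frequent = attested_variants(ROWS)
```

Under this reading, a high score does not imply frequent use: the top-scored variant has a count of 0, while "TNF-alpha" has the largest count.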
12
Examples of Technical Seeds
• Term Variants
– Terms (names of proteins, genes, diseases, symptoms, etc.)
denote basic conceptual units in the knowledge domain.
• Syntactic Variants
– Relationships and complex conceptual units in the knowledge
domain are mapped to sentences in the language domain.
• Term Acquisition from Text
– New terms (basic conceptual units) are constantly introduced.
Resource building for specialized domains is crucial.
13
Syntactic Variants
[A] protein activates [B]
(Pathway extraction)
Full-strength Staufen lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Non-trivial Mapping
Spelling Variants
Synonyms
Acronyms
Same relations with different Structures
Language Domain
Knowledge Domain
Motivated independently of language
14
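The point that the same relation appears under different sentence structures can be illustrated with a toy extractor. This is a sketch under strong assumptions: it uses two hand-written surface patterns for "activate", whereas a real system would work on parser output. The names `ACTIVE`, `PASSIVE`, and `extract_activation` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Two surface patterns for the same relation: [A] activates [B].
ACTIVE = re.compile(
    r"^(?P<agent>.+?) (?:could |can |may )?activates? (?P<theme>.+?)\.?$"
)
PASSIVE = re.compile(
    r"^(?P<theme>.+?) (?:is|are|was|were) activated by (?P<agent>.+?)\.?$"
)

def extract_activation(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (agent, 'activate', theme) if a pattern matches, else None."""
    for pattern in (PASSIVE, ACTIVE):
        match = pattern.match(sentence.strip())
        if match:
            return (match.group("agent"), "activate", match.group("theme"))
    return None

active = extract_activation("PHO2 activates PHO5.")
passive = extract_activation("PHO5 is activated by PHO2.")
```

Both calls yield the same triple, which is the mapping from language-domain variants to a single knowledge-domain relation. Patterns like these break down quickly on sentences with subordinate clauses or coordination, which is why deeper syntactic analysis is needed.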
Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
[Figure: parse tree for "The protein is activated by it" (POS tags: DT NN VBZ VBN IN PRP; phrase nodes: s, np, vp, pp, dt), with predicate-argument links labeled arg1, arg2, and mod.]
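A predicate-argument structure for the slide's example sentence can be sketched as a small data structure. The code below is not Enju's API or output format; it is a heuristic over Penn Treebank-style tags, and it assumes that arg1 marks the deep subject ("it") and arg2 the deep object ("The protein"). `PredArg`, `BE_FORMS`, and `passive_to_pred_arg` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class PredArg:
    predicate: str
    arg1: str  # assumed: deep subject (the activator)
    arg2: str  # assumed: deep object (the thing activated)

BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}

def passive_to_pred_arg(tagged: List[Tuple[str, str]]) -> Optional[PredArg]:
    """Find 'be + VBN + by' and read the arguments off the word order."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    for i in range(1, len(tagged) - 1):
        if (tags[i] == "VBN"
                and words[i - 1].lower() in BE_FORMS
                and words[i + 1].lower() == "by"):
            return PredArg(
                predicate=words[i],
                arg1=" ".join(words[i + 2:]),   # phrase after "by"
                arg2=" ".join(words[:i - 1]),   # phrase before the auxiliary
            )
    return None

sentence = [("The", "DT"), ("protein", "NN"), ("is", "VBZ"),
            ("activated", "VBN"), ("by", "IN"), ("it", "PRP")]
structure = passive_to_pred_arg(sentence)
```

The resulting structure is independent of whether the sentence was active or passive, which is what makes predicate-argument structures useful as a target for relation extraction.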
15
Text Archive with Feature Objects
Managing texts, data representation and their semantics
Text DB: Text
Data Base Module: DB of Feature Objects (Data representation, Semantics)
Copy and Unification

Text: "Ubiquitin E is bound with"
  content: Event [Pred: bind, agent: Ubiquitin]
  (further example values on the slide: problem (問題); content: nuclear development (内容 核開発))

Feature object:
  extent: [textid: wsj 02, startp: 10, endp: 30]    (Text ID; Start and End Position of the region)
  dc:creator: ninomi                                 (Annotator)
  content: Event [Pred: bind]                        (Content)

Specialization by unification:
  extent: [textid: wsj 02, startp: 10, endp: 30]
  dc:creator: ninomi
  content: protein interaction [event type: bind, agent: ubiquitin]

Fine grained units of
information
Context dependency
Persistent nature of
knowledge and information
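The "copy and unification" and "specialization by unification" steps can be sketched with nested dictionaries standing in for feature objects. This is a simplified illustration, not the archive's implementation: it has no types and no structure sharing, and `unify` and `UnificationError` are hypothetical names. The feature names and values are taken from the slide.

```python
import copy

class UnificationError(ValueError):
    """Raised when two feature structures carry conflicting atomic values."""

def unify(a, b):
    """Return the unification of two feature structures given as nested dicts."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = copy.deepcopy(a)              # copy, so the inputs stay unchanged
        for feature, value in b.items():
            if feature in result:
                result[feature] = unify(result[feature], value)
            else:
                result[feature] = copy.deepcopy(value)
        return result
    if a == b:
        return a
    raise UnificationError(f"cannot unify {a!r} with {b!r}")

# Annotation on a text region, as on the slide.
base = {
    "extent": {"textid": "wsj 02", "startp": 10, "endp": 30},
    "dc:creator": "ninomi",
    "content": {"Pred": "bind"},
}
# More specific information about the same event.
more_specific = {"content": {"agent": "ubiquitin"}}

specialized = unify(base, more_specific)
```

Unifying `base` with `more_specific` adds the agent to the content while leaving `base` itself untouched, so earlier annotations persist. Unifying two structures with different `Pred` values raises `UnificationError`.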
16
Demo
(The website demo is not available now.)
17