PacRim 2002 invited presentation


Unstructured Machine Learning:
Providing the link between Genetic Data and
Published Research
Dr Tony C Smith
Reel Two, Inc.
9 Hartley Street
Hamilton, New Zealand
+64 7 839 7808
www.reeltwo.com
What is Machine Learning?
• creating computer programs that get better with experience
• learn how to make expert judgments
• discover previously hidden, potentially useful information (data mining)
How does it work?
• user provides learning system with examples of concept to be learned
• induction algorithm infers a characteristic model of the examples
• model is used to predict whether or not future novel instances are also examples – and it does this very consistently, and very, very quickly!
Structured Learning
Mushroom Data
Weight   Damage   Dirt    Firmness   Quality
heavy    high     mild    hard       poor
heavy    high     mild    soft       poor
normal   high     mild    hard       good
light    medium   mild    hard       good
light    clear    clean   hard       good
normal   clear    clean   soft       poor
heavy    medium   mild    hard       poor
...
[Figure: decision tree learned from the data. Root node: weight (branches heavy, normal, light); inner nodes: dirt (mild, clean) and firmness (hard, soft); leaves: good, poor]
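As a rough illustration of how an induction algorithm could turn the mushroom data on this slide into a decision tree, here is a minimal ID3-style sketch in Python. It is an assumption that the slide's tree came from an algorithm of this family; the names `entropy`, `split`, `build` and `predict` are invented for this example, and only the seven rows visible on the slide are used.

```python
import math
from collections import Counter

# The seven visible rows of the slide's table; the last field is the class (Quality).
ATTRS = ["weight", "damage", "dirt", "firmness"]
ROWS = [
    ("heavy", "high", "mild", "hard", "poor"),
    ("heavy", "high", "mild", "soft", "poor"),
    ("normal", "high", "mild", "hard", "good"),
    ("light", "medium", "mild", "hard", "good"),
    ("light", "clear", "clean", "hard", "good"),
    ("normal", "clear", "clean", "soft", "poor"),
    ("heavy", "medium", "mild", "hard", "poor"),
]

def entropy(rows):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split(rows, i):
    """Group rows by the value of attribute number i."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return groups

def build(rows, attr_idx):
    """Return a leaf label, or (attribute_name, {value: subtree})."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1 or not attr_idx:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def remainder(i):
        # Weighted entropy left after splitting on attribute i (lower is better).
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, i).values())

    best = min(attr_idx, key=remainder)
    rest = [i for i in attr_idx if i != best]
    return (ATTRS[best], {v: build(g, rest) for v, g in split(rows, best).items()})

def predict(tree, example):
    """Walk the tree; raises KeyError for an attribute value never seen in training."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]
    return tree

tree = build(ROWS, list(range(len(ATTRS))))
novel = {"weight": "heavy", "damage": "clear", "dirt": "clean", "firmness": "soft"}
print(tree[0], predict(tree, novel))
```

On these seven rows the first split is on weight, which appears to match the root of the slide's tree. Lower splits may differ from the slide, because the table is truncated ("...") and the full data set is not shown.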
Unstructured Learning
• data does not have fixed fields with specific values
• examples: images, continuous signals, expression data, text
• learning proceeds by correlating the presence or absence of any and all salient attributes
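To make "presence or absence of attributes" concrete for text, the sketch below turns free text into binary feature vectors. It assumes, purely for illustration, that the salient attributes are lower-cased word tokens; the two sample strings and the helper names `tokens` and `presence_vector` are invented and do not come from the presentation.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric word tokens found in text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def presence_vector(text, vocabulary):
    """1 where a vocabulary term occurs in text, 0 where it does not."""
    present = tokens(text)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical documents: unlike the mushroom table, there are no fixed fields.
docs = [
    "Activation of p38 MAP kinase",
    "Protein modification by a kinase cascade",
]

# The attribute set is discovered from the data rather than fixed in advance.
vocabulary = sorted(set().union(*(tokens(d) for d in docs)))
vectors = [presence_vector(d, vocabulary) for d in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each document becomes a row of 0s and 1s over whatever vocabulary the collection happens to contain, which is the form a learning algorithm can then correlate with the target concept.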
Document Classification
• given examples of documents covering some topic, learn a semantic model that can recognize whether or not other documents are relevant
• prioritize them: i.e. quantify “how relevant” documents are to the topic
• not limited to keywords (nor is it misled by them)
• adapt to the user’s needs (ephemeral or long-term)
How Text Mining Works
Users supply the system with training data
• Documents that are good examples of the desired category
The system builds ‘classifiers’
• Statistical models based on the training data
The system classifies novel data
• Identifies other documents about the desired category
Results are displayed or stored
• Files can be viewed, routed to end users or stored in databases
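The train, build, classify and rank steps above can be sketched in a few lines of Python. The slides do not say which statistical model the system uses, so a multinomial naive Bayes classifier with add-one smoothing is used here as a stand-in; the class name `NaiveBayes` and all sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lower-cased alphanumeric word tokens, with repeats."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    """Two-class (relevant / not relevant) multinomial naive Bayes."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        """Add one labelled training document."""
        self.doc_counts[relevant] += 1
        self.word_counts[relevant].update(tokens(text))

    def score(self, text):
        """Log-odds that text is relevant; higher means more relevant."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        n_docs = sum(self.doc_counts.values())
        logp = {}
        for label in (True, False):
            total = sum(self.word_counts[label].values())
            # Add-one smoothing on the class prior and on word likelihoods.
            lp = math.log((self.doc_counts[label] + 1) / (n_docs + 2))
            for w in tokens(text):
                if w in vocab:  # words never seen in training are ignored
                    lp += math.log((self.word_counts[label][w] + 1) / (total + len(vocab)))
            logp[label] = lp
        return logp[True] - logp[False]

# 1. The user supplies training data (made-up examples).
clf = NaiveBayes()
clf.train("p38 MAP kinase activation in the MAPK cascade", True)
clf.train("protein kinase phosphorylation and protein modification", True)
clf.train("quarterly sales report for the retail division", False)
clf.train("travel itinerary and hotel booking details", False)

# 2-3. The model scores novel documents; 4. results are ranked by relevance.
novel = [
    "hotel and travel expenses report",
    "kinase cascade drives protein modification",
]
ranked = sorted(novel, key=clf.score, reverse=True)
for doc in ranked:
    print(round(clf.score(doc), 2), doc)
```

Because the score is a continuous log-odds value rather than a yes/no answer, the same model supports both classification (score above zero) and prioritization (sort by score).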
Classification System
Client-specific categories
Familiar Windows-style interface
Drag-and-drop documents to create custom categories
Classified documents are ranked by relevance
View contents of individual documents – sentences are highlighted by their relevance to the category
The Gene Ontology – A Good First Step
The Initial Problem: Individual curators evaluate data differently
While scientists can agree to use the word "kinase," they must also agree to support this by stating how and why they use "kinase," and consistently apply it. Only in this way can they hope to compare gene products and find out if and how they are related.
[Figure labels: Activation of p38 MAP Kinase; Protein Modification; MAPK-KK Cascade]
The Initial Solution: The Gene Ontology (GO) – A controlled vocabulary with defined relationships between items.
GO consists of more than 13,000 nodes, or ‘GO Terms’, divided into three main trees: Biological Process, Cellular Component and Molecular Function
Of these, only about 3,800 GO Terms are ‘active’ – that is, terms annotated with more than just one or two publications.
The Gene Ontology Knowledge Discovery System
GO is only a partial solution
• Enormous gap between GO-annotated docs (27,000) and full MEDLINE database (12 million entries).
• Updates lag behind.
• Scientists must understand and agree to use the GO
• Knowledge changes and alters definitions.
Using GO “as is” takes too long and delivers too little
GO KDS – Filling the gaps in GO
• GO KDS bridges the gap by classifying all of MEDLINE.
• New documents are classified as they’re added
• Scientists can now annotate gene targets quickly and reliably
• GO KDS is updated along with GO and MEDLINE
GO KDS Interface Tour
[Screenshot of the GO KDS interface, with callouts:]
• All sub-terms for the listed term: click on a term to further refine your search
• Current GO term(s) open
• Location of listed term in GO
• Enter a keyword to search in this GO category
• Opens abstract in separate window
• Color of stars identifies the GO branch; number of stars indicates confidence of category placement
• KDS discovers novel classifications
• Original GO classifications (by domain-expert)
GO KDS Key Benefits
www.go-kds.com
• Quickly sort documents into the categories most relevant to the user
• Replace laborious annotation by domain experts with a trainable, automated system
• Discover conceptual links between previously unrelated scientific domains
• Identify key articles for pertinent research
• Integrate public, private and proprietary documents
How is document classification useful?
Life Science Research
• Finding relevant literature
• Prioritizing articles/reports
• Discovering hidden connections
• Distributing information
Patent preparation
• Searching patent databases
• Collecting relevant documents
• Synthesizing information
Drug Approval
• Collecting information
• Organizing/Collating documents
• Satisfying approval criteria
Intelligent Text Mining: Therapeutic Courses
One Reel Two client is using Classification System to rapidly sort through large volumes of medical documentation in disparate therapeutic areas.
The Problem: Client must generate E-Learning Courses from hundreds of pages of reports, literature and product documentation supplied by client
Old Solution: Manually read through documents to find paragraphs related to ‘Diagnosis’, ‘Etiology’, ‘Epidemiology’, etc.
New Solution: Use Reel Two Classification System to build a custom taxonomy, then automatically classify and extract relevant document sections into Therapeutic Area categories
Intelligent Text Mining – Patent Analysis
Search patent filings for the ideas or concepts behind one’s analysis
– Explore state of prior art, competitive landscape or ‘innovation gaps’
– Overcome intentionally vague language in patent filings
Example Project: Identifying ‘Mechanism of Action’ in life science patents
Patents are classified according to a taxonomy built by the client:
Alzheimer’s Patents
MoA: 5-HT Inhibitor
MoA: Acetylcholinesterase
MoA: Antioxidant
MoA: Antiviral…
Sample Output:
ACTIVITY - Analgesic; neuroprotective; nootropic; antiparkinsonian; neuroleptic; tranquilizer; antiinflammatory; antidepressant; anabolic; anorectic; anticonvulsant; uropathic; gastrointestinal; antiaddictive; gynecological.
MECHANISM OF ACTION - Neurotransmitter release modulator.
In an in vitro assay, 2-chloro-5-(3-(R)-pyrrolidinylmethoxy)-3-pyridinecarbaldoxime (Ia) exhibited a Ki value for binding to neuronal nicotinic acetylcholine receptors of 0.012 nM.
The Mechanism of Action listed for this patent is "Neurotransmitter release modulator." However, Classification System identified that this chemical modulator binds to the acetylcholine receptor, which is the true mechanism of action, and classified this patent in “MoA: Acetylcholinesterase”.
“Life Science Information Management will form the largest unmet need for IT companies in the 21st Century”
Caroline Kovac, General Manager, IBM Life Sciences
Appendix: GO KDS Interface
1. Search for a particular GO term by opening one of the main branches.
2. ‘Drill down’ through the taxonomy to find a term of interest. Click on that term.
3. Select the desired GO term. ‘Open’ the category by clicking on ‘new search with this term.’
4. Scroll down to view abstracts.
5. Discover conceptual links to other GO categories. Click on the category to add the term to your search.
6. View the data intersection between GO categories. Scroll through to view abstract.
7. GO terms identify concepts embodied in the abstracts, enabling quick review.
8. Select an abstract of interest, and click to open the complete abstract.
9. The abstract will open in a new window, allowing you to continue with your search, or to link directly to the journal.