
Biological literature mining
• Information retrieval (IR): retrieve papers relevant
to specific keywords
• Entity recognition (ER): specific biological
entities (e.g., genes) identified in papers
• Information extraction (IE): enable specific facts
to be automatically pulled out of papers
Example sentence
“Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly
phosphorylated Swe1 and this modification served as a priming
step to promote subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation”
Its context is the cell cycle of the yeast Saccharomyces cerevisiae
and it allows us to demonstrate the powers and pitfalls of current
literature-mining approaches.
Information Retrieval: finding the papers
• Aim is to identify text segments pertaining to
a particular topic (here, “yeast cell cycle”)
• Topic may be a user provided query
– ad hoc IR
• Topic may be a set of papers
– text categorization
Ad hoc IR
• PubMed is an example
• Supports the “Boolean model” as well as the
“vector model”
• Boolean model: combination of terms using
logical operations (OR, AND)
• Vector model: We’ll see more of this later
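As a rough illustration of the Boolean model, the sketch below builds a tiny inverted index and evaluates one query with set operations. The corpus, the function names, and the whitespace tokenization are assumptions made for this example; it is not a description of how PubMed works internally.

```python
# Toy corpus: document id -> text (made-up examples, not real PubMed records).
DOCS = {
    1: "Cdc28 controls the yeast cell cycle",
    2: "Swe1 degradation in Saccharomyces cerevisiae",
    3: "Cell cycle checkpoints in human cells",
}

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def lookup(index, term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term.lower(), set())

INDEX = build_index(DOCS)

# Query: (yeast OR saccharomyces) AND cycle
# OR maps to set union (|), AND maps to set intersection (&).
hits = (lookup(INDEX, "yeast") | lookup(INDEX, "saccharomyces")) & lookup(INDEX, "cycle")
print(sorted(hits))
```

Document 2 mentions the organism but not "cycle", and document 3 mentions "cycle" but not the organism, so only document 1 satisfies the whole query.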
Ad hoc IR: tricks
• Lessons learned from regular IR also applicable
to biomedical literature
• Removal of “stop words” such as the, it, etc.
• Truncating common word endings such as -ing, -s
• Use of thesaurus to automatically “expand” query
– e.g., “yeast AND cell cycle” => “(yeast OR
Saccharomyces cerevisiae) AND cell cycle”
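The three tricks above can be sketched in a few lines. The stop-word list, the suffix list, and the one-entry thesaurus are illustrative assumptions, and the suffix truncation is deliberately crude (a real system would more likely use a proper stemmer).

```python
STOP_WORDS = {"the", "it", "a", "an", "of", "in", "and", "is"}  # illustrative subset
SUFFIXES = ("ing", "s")  # the two endings named on the slide
THESAURUS = {"yeast": ["saccharomyces cerevisiae"]}  # hypothetical one-entry thesaurus

def strip_suffix(word):
    """Crude truncation: drop one known suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, remove stop words, then truncate common endings."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [strip_suffix(w) for w in words]

def expand(term):
    """Return the term plus its thesaurus synonyms, to be OR-ed together."""
    return [term] + THESAURUS.get(term, [])

print(preprocess("The cyclins regulating it"))
print(expand("yeast"))
```

Here "cyclins" and "regulating" are reduced to "cyclin" and "regulat", and the query term "yeast" is expanded to include its synonym.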
Ad hoc IR
“Even with these improvements, current ad hoc IR systems
are not able to retrieve our example sentence when
they are given the query ‘yeast cell cycle’. Instead, this
could be achieved by realizing that ‘yeast’ is a synonym
for S. cerevisiae, that ‘cell cycle’ is a Gene Ontology term,
that the word ‘Cdc28’ refers to an S. cerevisiae protein
and finally, by looking up the Gene Ontology terms
that relate to Cdc28 to connect it to the yeast cell cycle.”
Entity recognition (ER)
• Goal: to identify biological entities (e.g.,
genes, proteins) in text
• Two sub-goals:
– recognition of the words in text that represent
these entities
– unique identification of these entities (the
synonym problem)
ER goals
• In our example, Clb2, Cdc28, Cdk1, Swe1,
Cdc5 should be recognized as gene or
protein names
• Additionally, they should be identified by
their respective “Saccharomyces Genome
Database” accession numbers
• Perhaps the most difficult task in
biomedical text mining
ER approaches: rule based
• Manually built rules that look for typical
features of names, e.g., names followed by
numbers, the ending “-ase”, or occurrences of the
words “gene”, “receptor”, etc. in proximity
• Automatically built rules using machine
learning techniques
ER approaches: dictionary based
• Comprehensive list of gene names and their
synonyms
• Matching algorithms that allow variations in
those names, e.g., ‘CDC28’, ‘Cdc28’, ‘Cdc28p’ or
‘cdc-28’.
• Advantage: they can also associate the
recognized entity with its unique identifier
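A minimal sketch of dictionary-based matching that tolerates the name variations listed above. The dictionary entries and the identifiers are placeholders invented for illustration (they are not real database accession numbers), and the normalization rules are one possible choice among many.

```python
import re

# Hypothetical dictionary: normalized name -> placeholder identifier.
DICTIONARY = {
    "cdc28": "GENE:0001",
    "cdk1": "GENE:0001",  # synonym mapped to the same entity
    "swe1": "GENE:0002",
}

def normalize(token):
    """Lowercase, drop hyphens, and strip a trailing protein suffix 'p'."""
    name = token.lower().replace("-", "")
    if name.endswith("p") and name[:-1] in DICTIONARY:
        name = name[:-1]
    return name

def recognize(text):
    """Return (surface form, identifier) pairs for every dictionary hit."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9-]+", text):
        ident = DICTIONARY.get(normalize(token))
        if ident:
            hits.append((token, ident))
    return hits

print(recognize("CDC28, Cdc28p and cdc-28 phosphorylate Swe1"))
```

All three spellings resolve to the same identifier, which is the advantage noted in the last bullet: recognition and unique identification happen in one step.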
Why ER is difficult
• Each gene has several names and abbreviations,
e.g., ‘Cdc28’ is also called ‘Cyclin-dependent
kinase 1’ or ‘Cdk1’
• Gene names may also be
– common English words, e.g., hairy
– biological terms, e.g., SDS
– names of other genes, e.g., ‘Cdc2’ refers to two
different genes in budding yeast and in fission yeast
Information Extraction (IE)
• IR extracts texts on particular topics
• IE extracts facts about relationships between
biological entities
• e.g., deduce that
– Cdc28 binds Clb2,
– Swe1 is phosphorylated by the Cdc28–Clb2 complex
– Cdc5 is involved in Swe1 phosphorylation
IE approaches: co-occurrence
• Identify entities that co-occur in a sentence,
abstract, etc.
• Two co-occurring entities may be unrelated,
but if they co-occur repeatedly, they are likely
related. Therefore, some statistical analysis
is used
• Finds related entities but not necessarily the
type of relationship
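The slides do not say which statistical analysis is used. As one possible choice, the sketch below scores entity pairs with pointwise mutual information (PMI) over sentence-level co-occurrence. The toy sentences and the function name are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations

# Each sentence is represented by the set of entities already recognized
# in it (toy data for illustration).
SENTENCES = [
    {"Cdc28", "Clb2"},
    {"Cdc28", "Clb2", "Swe1"},
    {"Cdc28", "Clb2"},
    {"Swe1", "Cdc5"},
    {"Cdc5"},
]

def cooccurrence_pmi(sentences):
    """Return {(a, b): PMI} for every entity pair seen together at least once."""
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        p_a, p_b = single[a] / n, single[b] / n
        # PMI > 0: the pair co-occurs more often than chance would predict.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

scores = cooccurrence_pmi(SENTENCES)
for pair_key, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair_key, round(value, 3))
```

Cdc28 and Clb2 co-occur in three of five sentences and get a positive score, while the single co-occurrence of Cdc28 and Swe1 scores below zero. As the next bullet says, the score indicates that two entities are related but not how.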
IE approaches: NLP
• Natural Language Processing (NLP)
• Tokenize text and identify word and sentence
boundaries
• Part-of-speech tag (e.g., noun/verb) for each word
• Syntax tree for each sentence, delineating noun
phrases and their interrelationships
• ER used to assign semantic tags for biological
entities (e.g., gene/protein names)
• Rules applied to syntax tree and semantic labels to
extract relationships between entities
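To make the last step concrete, here is a much-simplified sketch of a rule applied to tagged tokens. The tags are written by hand, standing in for the output of a real tagger and entity recognizer, and the rule works on a flat token sequence rather than a syntax tree. The verb list and function name are assumptions.

```python
# Tokens carry (word, part-of-speech tag, semantic tag).
TAGGED = [
    ("Cdc28", "NOUN", "PROTEIN"),
    ("directly", "ADV", None),
    ("phosphorylated", "VERB", None),
    ("Swe1", "NOUN", "PROTEIN"),
]

# Illustrative mapping from verb forms in text to relation names.
RELATION_VERBS = {"phosphorylated": "phosphorylates", "bound": "binds"}

def extract_relations(tagged):
    """Rule: PROTEIN, relation VERB, PROTEIN, ignoring adverbs in between."""
    relations = []
    content = [t for t in tagged if t[1] != "ADV"]
    for left, verb, right in zip(content, content[1:], content[2:]):
        if (left[2] == "PROTEIN" and right[2] == "PROTEIN"
                and verb[1] == "VERB" and verb[0] in RELATION_VERBS):
            relations.append((left[0], RELATION_VERBS[verb[0]], right[0]))
    return relations

print(extract_relations(TAGGED))
```

Unlike co-occurrence, this kind of rule recovers the type and direction of the relationship, at the cost of needing correct tags and hand-written patterns.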
Summary
• Information retrieval: getting the texts
• Entity recognition: identifying genes,
proteins etc.
• Information extraction: recovering reported
relationships between entities
Automatically Generating Gene Summaries from Biomedical Literature
(Ling et al. PSB 2006)
CS 466
Outline
• Introduction
– Motivation
• System
– Keyword Retrieval Module
– Information Extraction Module
• Experiments and Evaluations
• Conclusion and Future Work
Motivation
• Finding all the information we know about
a gene from the literature is a critical task in
biology research
• Reading all the relevant articles about a
gene is time consuming
• A summary of what we know about a gene
would help biologists to access the already-discovered knowledge
An Ideal Gene Summary
• http://flybase.org/reports/FBgn0000017.html
[Figure: FlyBase report with sections labeled GP, EL, SI, GI, MP, WFPI]
Above summary is from ca. 2006
Problem with Manual Procedure
• Labor-intensive
• Hard to keep updated with the rapid growth of
the literature
How can we generate such summaries automatically?
The solution
• Structured summary on 6 aspects
1. Gene products (GP)
2. Expression location (EL)
3. Sequence information (SI)
4. Wild-type function and phenotypic information (WFPI)
5. Mutant phenotype (MP)
6. Genetic interaction (GI)
• 2-stage summarization
– Retrieve relevant articles by keyword match
– Extract most informative and relevant sentences for 6 aspects.
System Overview: 2-stage
IE = Information Extraction; KR = Keyword Retrieval
Keyword Retrieval Module (IR)
• Dictionary-based keyword retrieval: to
retrieve all documents containing any
synonyms of the target gene.
– Input: gene name
– Output: relevant documents for that gene
1. Gene SynSet Construction
2. Keyword-based retrieval
KR module
Gene SynSet Construction & Keyword Retrieval
• Gene SynSet: a set of synonyms of the target gene
• Issues in constructing SynSet
– Variation in gene name spelling
• gene cAMP dependent protein kinase 2:
PKA C2, Pka C2, Pka-C2,…
• normalized to “pka c 2”
– Short names are sometimes ambiguous, e.g., gene name
“PKA” is also a chemical term
– Require retrieved document to have at least one
synonym that is >= 5 characters long
• Retrieving documents based on keywords:
Enforce the exact match of the token sequence
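The normalization and matching rules above might look roughly like this. The function names are hypothetical, and two details are assumptions not stated on the slide: tokens are split at letter/digit boundaries to get "pka c 2", and the 5-character threshold is applied to the raw synonym string.

```python
import re

def normalize(name):
    """Lowercase and split at spaces, punctuation, and letter/digit boundaries."""
    return re.findall(r"[a-z]+|[0-9]+", name.lower())

def contains_sequence(tokens, pattern):
    """True if `pattern` occurs in `tokens` as an exact contiguous run."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))

def matches_gene(document, synset, min_len=5):
    """Keep a document only if it matches a synonym of at least `min_len` characters."""
    doc_tokens = normalize(document)
    matched = [s for s in synset if contains_sequence(doc_tokens, normalize(s))]
    return any(len(s) >= min_len for s in matched)

SYNSET = ["PKA", "Pka-C2", "cAMP dependent protein kinase 2"]

print(normalize("Pka-C2"))
print(matches_gene("Mutations in PKA C2 alter behavior", SYNSET))
print(matches_gene("The pKa of the buffer was measured", SYNSET))
```

The second document mentions only the ambiguous short name "PKA" (here the chemical term), so it is rejected even though a synonym technically matches.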
Information Extraction Module
• Takes a set of documents returned from
the KR module, and extracts sentences
that contain useful factual information
about the target gene.
– Input: relevant documents
– Output: gene summary
1. Training data generation
2. Sentence extraction
IE module
Training Data Generation
• Construct a training data set consisting of “typical”
sentences for describing a category (e.g., sequence
information)
• Training data is not about the gene to be
summarized. It is about a “type” of information in
general.
• These sentences come from a manually curated
database
– e.g., FlyBase has separate sections for each category.
Sentence Extraction
• Extract sentences from the documents related to
our gene
• Then try to identify key sentences talking about a
certain aspect of the gene (“category”)
• In determining the importance of a sentence,
consider 3 factors
– Relevance to the specified category (aspect)
– Relevance to its source document
– Sentence location in its source abstract
Scoring strategies
• Category relevance score (Sc):
– “Vector space model”
– Construct “category term vector” Vc for each category c
– Weight of term ti in this vector is wij = TFij × IDFi
• TFij is the frequency of ti in all training sentences of category j
• IDFi is the “inverse document frequency” = 1 + log(N/ni), where N = total number of
documents and ni = number of documents containing ti.
• TF measures how relevant the term is, IDF measures how rare it is
– Similarly, vector Vs for each sentence s
– Category relevance score Sc = cosine(Vc, Vs)
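A small sketch of the category relevance score, using the TF × IDF weighting and cosine similarity defined above. The toy documents and training tokens are invented for illustration. The natural logarithm and the choice of collection over which IDF is computed are assumptions, since the slide does not specify them.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF = 1 + log(N / n_i); 0 for terms that appear in no document."""
    n_i = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / n_i) if n_i else 0.0

def tfidf_vector(tokens, documents):
    """Sparse term vector: term -> TF * IDF."""
    counts = Counter(tokens)
    return {t: tf * idf(t, documents) for t, tf in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy collection used only to compute IDF.
DOCUMENTS = [
    ["gene", "expressed", "in", "embryo"],
    ["sequence", "encodes", "a", "protein"],
    ["mutant", "embryo", "lethal"],
]
# Pooled tokens of the training sentences for one category (say, EL).
CATEGORY_TOKENS = ["expressed", "embryo", "expressed", "tissue"]

v_c = tfidf_vector(CATEGORY_TOKENS, DOCUMENTS)
score_el = cosine(v_c, tfidf_vector(["expressed", "in", "embryo"], DOCUMENTS))
score_other = cosine(v_c, tfidf_vector(["sequence", "encodes", "protein"], DOCUMENTS))
print(score_el > score_other)
```

The first sentence shares terms with the category vector and scores above zero. The second shares none, so its cosine is exactly zero.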
Scoring strategies
• Document relevance score (Sd):
– Sentence should also be related to this document.
– Vd for each document, Sd = cos(Vd, Vs )
• Location score (Sl):
– News: early sentences are more useful for summarization
– Scientific literature: last sentence of abstract
– Sl = 1 for the last sentence of an abstract, 0 otherwise.
• Sentence Ranking: S = 0.5 Sc + 0.3 Sd + 0.2 Sl
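The location score and the weighted combination are simple enough to write out directly. The function names and the example values are hypothetical; the weights are the ones given on the slide.

```python
def location_score(index, n_sentences):
    """Sl = 1 for the last sentence of an abstract, 0 otherwise (0-based index)."""
    return 1.0 if index == n_sentences - 1 else 0.0

def combined_score(sc, sd, sl, weights=(0.5, 0.3, 0.2)):
    """S = 0.5*Sc + 0.3*Sd + 0.2*Sl."""
    wc, wd, wl = weights
    return wc * sc + wd * sd + wl * sl

# Example: the last sentence of a 6-sentence abstract,
# with made-up category and document relevance scores.
sl = location_score(5, 6)
print(combined_score(0.4, 0.5, sl))
```

With Sc = 0.4, Sd = 0.5 and Sl = 1, the terms contribute 0.20, 0.15 and 0.20, for a total of 0.55 (up to floating-point rounding).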
Summary generation
• Keep only 2 top-ranked categories for each
sentence.
• Generate a paragraph-long summary by
combining the top sentence of each
category
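One way to read these two steps is sketched below. The input layout and function name are assumptions, and so is the handling of repeats: the slide does not say what happens when one sentence tops two categories, so this sketch simply avoids emitting the same sentence twice.

```python
def build_summary(scores, categories, keep=2):
    """scores: {sentence text: {category: combined score}}."""
    # Step 1: each sentence keeps only its `keep` best-scoring categories.
    pruned = {}
    for sentence, by_cat in scores.items():
        best = sorted(by_cat, key=by_cat.get, reverse=True)[:keep]
        pruned[sentence] = {c: by_cat[c] for c in best}
    # Step 2: per category, take the top remaining sentence.
    summary = []
    for cat in categories:
        candidates = [(s[cat], text) for text, s in pruned.items() if cat in s]
        if candidates:
            top = max(candidates)[1]
            if top not in summary:  # assumption: do not repeat a sentence
                summary.append(top)
    return " ".join(summary)

SCORES = {
    "A is expressed in the embryo.": {"EL": 0.9, "GP": 0.2, "SI": 0.1},
    "A encodes a kinase.": {"GP": 0.8, "SI": 0.3, "EL": 0.25},
}
print(build_summary(SCORES, ["GP", "EL", "SI"]))
```

Pruning to two categories per sentence stops a sentence from being chosen for a category where it only scores weakly, such as the first sentence for SI.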
Experiments
• 22092 PubMed abstracts on “Drosophila”
• Implementation on top of Lemur Toolkit
– Variety of information retrieval functions
• 10 genes are randomly selected from
FlyBase for evaluation
Evaluation
• Precision of the top k sentences for a category evaluated
• Three different methods evaluated:
– Baseline run (BL): randomly select k sentences
– CatRel: use Category Relevance Score to rank sentences and select
the top-k
– Comb: Combine three scores to rank sentences
• Ask two annotators with domain knowledge to judge the
relevance for each category
• Criterion: A sentence is considered to be relevant to a
category if and only if it contains information on this aspect,
regardless of its extra information, if any.
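Precision of the top k sentences can be computed as below. The function name and sample data are hypothetical, and dividing by k even when fewer than k sentences are available is an assumption about an edge case the slide does not cover.

```python
def precision_at_k(ranked_sentences, relevant, k):
    """Fraction of the top-k ranked sentences judged relevant to the category."""
    if k <= 0:
        return 0.0
    top_k = ranked_sentences[:k]
    return sum(1 for s in top_k if s in relevant) / k

RANKED = ["s1", "s2", "s3", "s4"]   # system output, best first
RELEVANT = {"s1", "s3"}             # sentences the annotators judged relevant

print(precision_at_k(RANKED, RELEVANT, 2))
print(precision_at_k(RANKED, RELEVANT, 4))
```

The same function serves all three runs (BL, CatRel, Comb); only the ranking passed in differs.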
Precision of the top-k sentences
Discussion
• Improvements over the baseline are most
pronounced for EL, SI, MP, GI categories.
– These four categories are more specific and thus easier to
detect than the other two GP, WFPI.
• Problem of predefined categories
– Not all genes fit into this framework. E.g., gene Amy-d,
as an enzyme involved in carbohydrate metabolism, is
not typically studied by genetic means, thus low
precision of MP, GI.
– Not a major problem: low precision on some occasions is
probably caused by the fact that there is little research on
this aspect.
Summary example (Abl)
Summary example (Camo|Sod)
Conclusion and future work
• Proposed a novel problem in biomedical text mining:
automatic structured gene summarization
• Developed a system using IR techniques to automatically
summarize information about genes from PubMed abstracts
• Dependency on the high-quality training data in FlyBase
– Incorporate more training data from other model
organism databases and resources such as GeneRIF in
Entrez Gene
– Mixture of data from different resources will reduce the
domain bias and help to build a general tool for gene
summarization.
Vector Space Model
• Term vector: reflects the use of different words
• wi,j: weight of term ti in vector j