powerpoint - Stanford University

CS276B
Text Information Retrieval, Mining, and
Exploitation
Lecture 16
Bioinformatics II
March 13, 2003
(includes slides borrowed from J. Chang, R. Altman, L. Hirschman, A.
Yeh, S. Raychaudhuri)
Bioinformatics Topics

Last week



Basic biology
Why text about biology is special
Text mining case studies


Microarray analysis, Abbreviation mining
Today

Combined text mining and data mining I




Text-enhanced homology search
Text mining in biological databases
KDD cup: Information extraction for biojournals
Combining text mining and data mining II
Text-Enhanced Homology Search
(Chang, Raychaudhuri, Altman)
Sequence Homology Detection



Obtaining sequence information is easy;
characterizing sequences is hard.
Organisms share a common basis of
genes and pathways.
Information can be predicted for a novel
sequence based on sequence similarity:
Function
 Cellular role
 Structure

PSI-BLAST



Used to detect protein sequence
homology. (Iterated version of universally
used BLAST program.)
Searches a database for sequences with
high sequence similarity to a query
sequence.
Creates a profile from similar sequences
and iterates the search to improve
sensitivity.
PSI-BLAST Problem: Profile
Drift


At each iteration, PSI-BLAST could find non-homologous (false positive) proteins.
False positives create a poor profile,
leading to more false positives.
Addressing Profile Drift

PROBLEM: Sequence similarity is
only one indicator of homology.


More clues, e.g. protein functional
role, exist in the literature.
SOLUTION: we incorporate MEDLINE
text into PSI-BLAST.
Modification to PSI-BLAST




Before including a sequence, measure similarity
of literature. Throw away sequences with least
similar literatures to avoid drift.
Literature is obtained from SWISS-PROT gene
annotations to MEDLINE (text, keywords).
Define domain-specific “stop” words (words occurring in the literature of
< 3 sequences or > 85,000 sequences): 80,479 out of 147,639 words.
Use similarity metric between literatures (for
genes) based on word vector cosine.
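The word-vector cosine and the domain-specific stop list can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy literatures, and the toy document-frequency cutoff are assumptions made for the example. The thresholds on the slide (< 3 or > 85,000 sequences) apply to the real SWISS-PROT/MEDLINE data.

```python
import math
from collections import Counter

def doc_frequency(literatures):
    """For each word, count how many sequences' literatures contain it."""
    df = Counter()
    for words in literatures.values():
        df.update(set(words))
    return df

def cosine(a, b, stop_words=frozenset()):
    """Cosine of two word-count vectors, ignoring stop words."""
    va = Counter(w for w in a if w not in stop_words)
    vb = Counter(w for w in b if w not in stop_words)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy literatures (hypothetical words, not real MEDLINE text).
literatures = {
    "seqA": "kinase phosphorylation signal protein".split(),
    "seqB": "kinase signal cascade protein".split(),
    "seqC": "membrane transport channel protein".split(),
}
df = doc_frequency(literatures)
# Toy cutoff: treat words found with more than 2 sequences as stop words.
stop_words = {w for w, n in df.items() if n > 2}

sim_ab = cosine(literatures["seqA"], literatures["seqB"], stop_words)
sim_ac = cosine(literatures["seqA"], literatures["seqC"], stop_words)
```

In this toy data "protein" occurs with every sequence and is filtered out, so seqA and seqB are similar only through the words they genuinely share, while seqA and seqC have no overlap left.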
Evaluation


Created families of homologous proteins
based on SCOP (gold standard site for
homologous proteins: http://scop.berkeley.edu/)
Select one sequence per protein family:
Families must have >= five members
 Associated with at least four references
 Select sequence with worst performance
on a non-iterated BLAST search

Evaluation
Compared homology search
results from original and our
modified PSI-BLAST.
 Dropped lowest 5%, 10% and 20%
of literature-similar genes during
PSI-BLAST iterations
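The filtering step, dropping the least literature-similar candidates before they enter the profile, might look like the sketch below. The function name, the candidate list, and the rounding rule (round the number of dropped hits down) are assumptions for illustration; the slides do not specify them.

```python
def drop_least_similar(candidates, fraction):
    """Discard the `fraction` of candidates with the lowest literature similarity.

    candidates: list of (sequence_id, literature_similarity) pairs,
                in the order the sequence search returned them.
    Returns the surviving sequence ids in their original order.
    """
    n_drop = int(len(candidates) * fraction)  # assumption: round down
    ranked = sorted(candidates, key=lambda pair: pair[1])
    kept_ids = {seq_id for seq_id, _ in ranked[n_drop:]}
    return [seq_id for seq_id, _ in candidates if seq_id in kept_ids]

# Hypothetical hits from one iteration, with made-up similarity scores.
hits = [("s1", 0.62), ("s2", 0.05), ("s3", 0.48), ("s4", 0.71), ("s5", 0.33),
        ("s6", 0.55), ("s7", 0.12), ("s8", 0.67), ("s9", 0.40), ("s10", 0.58)]

kept_20 = drop_least_similar(hits, 0.20)  # drops s2 and s7
kept_05 = drop_least_similar(hits, 0.05)  # too few hits for 5% to drop any
```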

Results




46/54 families had identical performance
2 families suffered from PSI-BLAST drift,
avoided with text-PSI-BLAST.
3 families did not converge for PSI-BLAST,
but converged well with text-PSI-BLAST
2 families converged for both, with slightly
better performance by regular PSI-BLAST.
Discussion
Profile drift is rare in this test
set and can sometimes be
alleviated when it occurs.
 Overall PSI-BLAST precision
can be increased using text
information.

Mining Text in
Biological Databases
Where is the Information?
What is the Data?
GenBank – genetic sequences
 Swiss-prot – protein sequences
 DNA chips / microarrays
 Metabolic pathways
 Signaling pathways / regulatory
networks
 Medline – biomedical literature
 Taxonomies / Ontologies

Genetic Information in GenBank
[Figure: growth of GenBank, 1983 to 1998; base pairs and sequences on a logarithmic axis from 1 to 100,000,000,000]
•Numbers are for all species.
•Biology is fundamentally an information
Species represented in GENBANK

Entries   Bases        Species
4323294   7028540140   Homo sapiens
2595599   1385749133   Mus musculus
166778    488340565    Drosophila melanogaster
182124    247830592    Arabidopsis thaliana
114669    203787073    Caenorhabditis elegans
189000    165542107    Tetraodon nigroviridis
159412    136005048    Oryza sativa
219183    107771966    Rattus norvegicus
166688    75404535     Bos taurus
155647    68679866     Glycine max
109941    56390403     Lycopersicon esculentum
70448     51527034     Hordeum vulgare
104773    51202716     Medicago truncatula
91352     50512383     Trypanosoma brucei
56416     49410018     Giardia intestinalis
77536     47598841     Strongylocentrotus purpuratus
49939     44524589     Entamoeba histolytica
86706     42479448     Danio rerio
79696     37899117     Zea mays
71318     37381894     Xenopus laevis
Complete Genomes
Aquifex aeolicus
Archaeoglobus fulgidus
Bacillus subtilis
Borrelia burgdorferi
Chlamydia trachomatis
Escherichia coli
Haemophilus influenzae
Methanobacterium thermoautotrophicum
Caulobacter crescentus
Helicobacter pylori
Methanococcus jannaschii
Mycobacterium tuberculosis
Mycoplasma genitalium
Mycoplasma pneumoniae
Pyrococcus horikoshii
Treponema pallidum
Saccharomyces cerevisiae
Drosophila melanogaster
Arabidopsis thaliana
Homo sapiens

Protein Sequences

Swiss-prot (as of 3/03)
122,564 sequences
 Almost 45,000,000 total amino
acids
 103,486 references


http://www.expasy.ch/sprot/
Three-Dimensional Structures


Protein three-dimensional Structures
Protein Data Bank (PDB), as of March
27, 2001
13,158 proteins
 939 nucleic acids
 616 protein/nucleic acid complex
 18 carbohydrates


http://www.rcsb.org/pdb/
Complete yeast genome (6000 genes) on a chip.
Online access to DNA chip Data






http://genome-www4.stanford.edu/MicroArray/SMD/
O(10) data sets available from Stanford site
10,000 to 40,000 genes per chip
Each set of experiments involves 3 to 40
“conditions”
Each data set is therefore near 1 million data
points.
People gearing up for these measurements
everywhere…
A Reaction in EcoCYC
KEGG
Signaling Pathways
Where’s the Information?
Medical Literature on line.
 Online database of published literature
since 1966 = Medline = PubMED resource
 4,000 journals
 10,000,000+ articles (most with
abstracts)
 www.ncbi.nlm.nih.gov/PubMed/

PubMed / SwissProt: 103,000 references; 100s of MB of text; 100,000s of unique words

Abstracts Referenced in SP37
Number of abstracts associated
with sequences in Swiss Prot.
(# sequences truncated at 100)
(as of 2001)
MESH = Medical Subject Headings
Controlled vocabulary for indexing
biomedical articles.
 19,000 “main headings” organized
hierarchically
 Browser at

http://www.nlm.nih.gov/mesh/MBrowser.html
MESH
UMLS: Semantic Model of
Biomedical Language



Representing more of semantics of
words and more relationships.
UMLS = Unified Medical Language
System
http://www.nlm.nih.gov/research/umls/
UMLS Elements




Semantic concepts (475K) = specific terms
connected to semantic categories (e.g.
Munchausen syndrome linked to
Behavioral-Dysfunction)
Concept maps (1,000K) = mapping from a
terminology to a semantic concept (e.g.
ICD-9 Billing code to Munchausen
syndrome)
Categorizations = relate semantic concepts
Conceptual links (7K) = relate two semantic
concepts with a semantic relationship
Gene Ontology
(http://www.geneontology.org/)

A controlled listing of three types of
function:
Molecular Function
 Biological Process
 Cellular Component


Vision: universal language for
molecular biology across species
Molecular Function








<molecular_function ; GO:0003674
%anti-toxin ; GO:0015643
%lipoprotein anti-toxin ; GO:0015644
%anticoagulant ; GO:0008435
%antifreeze ; GO:0016172
%ice nucleation inhibitor ; GO:0016173
%antioxidant ; GO:0016209
%glutathione reductase (NADPH) ; GO:0004362 ; EC:1.6.4.2 % flavin-containing electron transporter ; GO:0015933 % oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:0016654
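Lines in the flat-file format shown above can be pulled apart with a few string operations. The sketch below is an assumption-laden illustration rather than an official GO parser: the function name is hypothetical, and reading the entries after further " % " markers as additional parent terms is my interpretation of the format.

```python
import re

def parse_go_line(line):
    """Parse one line of the GO flat-file excerpt shown above.

    Returns (term_name, go_id, extras), where extras is a list of
    (name, go_id) pairs found after further ' % ' markers on the line.
    """
    body = line.strip()[1:]  # drop the leading '<' or '%' marker
    parsed = []
    for segment in re.split(r"\s%\s", body):
        fields = [f.strip() for f in segment.split(" ; ")]
        name = fields[0].replace("\\,", ",")  # undo the escaped commas
        go_ids = [f for f in fields[1:] if f.startswith("GO:")]
        parsed.append((name, go_ids[0] if go_ids else None))
    (name, go_id), extras = parsed[0], parsed[1:]
    return name, go_id, extras

simple = parse_go_line("%antifreeze ; GO:0016172")
multi = parse_go_line(
    r"%glutathione reductase (NADPH) ; GO:0004362 ; EC:1.6.4.2 "
    r"% flavin-containing electron transporter ; GO:0015933 "
    r"% oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:0016654"
)
```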
Current Genome Annotations
http://www.geneontology.org
KDD Cup 2002:
Information Extraction for
Biological Text
Task Background: Flybase

Flybase project
 Curates biomedical publications on the fruitfly
 Uses GO (gene ontology) as ontology
 Fruitfly (Drosophila melanogaster) is one of the key “model organisms”
Flybase goals
 Distillation of literature on the fruitfly
 Table of contents function
 Support search of literature
Current methodology: Manual curation
 Curators read the literature and manually update flybase
Goal of KDD Cup 2002: Can this be (partially) automated?
FlyBase: Example of Data
Curation
Curators Cannot Keep Up with
the Literature!
FlyBase References By Year
Task Rationale and Description

FlyBase provided the


Data annotation (plus biological expertise)
Input on the task formulation


What can be useful to the curators
Start fairly simple. Try to help automate part of
what one group of FlyBase curators needs to
do:


Determine which papers need to be curated for
fruit fly gene expression information
Want to curate those papers containing
experimental results on gene products
(RNA transcripts and proteins)
Some Data (Text) Preparation
Challenges

Abstracts are not enough, need the full
papers

E.g., for one paper on Appl proteins
(PubMed ID #8764652), FlyBase lists 19
“when-where” pairs for Appl protein
expression

A “when-where” pair indicates when in the life
cycle and where in the body some transcript or
protein is found



“When-where” pair example: adult-brain
Only 2 of the 19 pairs (11%) are mentioned in the
abstract. The rest are only mentioned in the body
of the full paper
So need full papers in electronic form
Some Data (Text) Preparation
Challenges

Full papers are copyrighted by publishers


For the contest, only use “free” papers
As a result of all these complications, out of
the ~7100 papers in FlyBase that were of
interest only ~1100 were used
Some Data (Text) Preparation
Challenges (Continued)
Plain text is not enough, also need things
like superscripts, subscripts, italics, Greek
letters (in English text)
 E.g., represent alleles (variants of a gene)
with superscripts

Some Appl gene alleles: Applᵈ, Applˢ, Applˢᵈ
If the superscripts are lost, these appear as:
Appld, Appls, Applsd

This would make it harder to determine that these
refer to the same gene

Need to know what suffixes to remove before
trying to match
Some Data (Text) Preparation
Challenges (Continued)

FlyBase has certain conventions to
represent superscripts, etc. in ASCII


E.g., represent those alleles as
Appl[d], Appl[s], Appl[sd]
In general, gene and protein names are
already hard to match because they often
have a complicated word structure
(morphology)

One needs to know what morphological
transformations (like prefix or suffix
removal) to perform before attempting to
match the names
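A small sketch of why the bracket convention helps: with Appl[d] the allele can be split off mechanically, whereas the flattened form Appld gives no cue where the gene name ends. The regular expression and function names below are assumptions for illustration, not FlyBase tooling.

```python
import re

# Gene name followed by one bracketed allele, e.g. "Appl[sd]".
ALLELE = re.compile(r"^(?P<gene>[^\[\]]+)\[(?P<allele>[^\[\]]+)\]$")

def split_allele(name):
    """Split 'Appl[d]' into ('Appl', 'd'); names without brackets have no allele."""
    match = ALLELE.match(name)
    if match:
        return match.group("gene"), match.group("allele")
    return name, None

def same_gene(a, b):
    """True when two names agree once any bracketed allele is removed."""
    return split_allele(a)[0] == split_allele(b)[0]

bracketed = same_gene("Appl[d]", "Appl[s]")  # alleles of the same gene
flattened = same_gene("Appld", "Appls")      # superscripts lost: no match
```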
Information Extraction Task


Given for each paper
 The full text of that paper
 A list of the genes mentioned in that paper
Determine for each paper
 For each gene mentioned in the paper, does
that paper have experimental results for
 Transcript(s) of that gene (Yes/No)?
 Protein(s) of that gene (Yes/No)?
Task is Harder Than It First
Appears




Interested in results applicable to “regular” (found in the
wild) flies, not mutants
Genes have multiple names (synonyms)
 Given a list of the known synonyms
 But list may be incomplete
Some names can refer to multiple genes
 E.g., “Clk” is a symbol for one gene (Clock) and is
also a synonym for another gene (period, symbol is
“per”)
Contestants given evidence of experimental results found
in the training data,
 But only in the form that is recorded in the FlyBase
database
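The synonym problem above can be made concrete with a name index that maps every known name to the set of gene symbols it may denote. The two records use only the Clk/per example from the slide; the record layout, the function name, and the choice to index full names alongside synonyms are assumptions for this sketch.

```python
from collections import defaultdict

def build_name_index(genes):
    """Map each symbol or synonym to the set of gene symbols it can refer to."""
    index = defaultdict(set)
    for symbol, record in genes.items():
        index[symbol].add(symbol)
        for name in record["synonyms"]:
            index[name].add(symbol)
    return index

# Hypothetical record layout; the names come from the slide's example.
genes = {
    "Clk": {"synonyms": ["Clock"]},
    "per": {"synonyms": ["period", "Clk"]},
}
index = build_name_index(genes)

# Names that point at more than one gene need context to resolve.
ambiguous = sorted(name for name, symbols in index.items() if len(symbols) > 1)
```

Because the synonym list may be incomplete, a lookup miss in such an index does not show that a name is not a gene mention.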
Training Data in Flybase


Database (DB) records what evidence is found in a
training paper, but not where in that paper
The evidence is often recorded in a “normalized” form
and domain knowledge is needed to find the
corresponding text, e.g.,
 DB: Assay mode: “immunolocalization”
Text (PubMed ID#9006979):
“Figure 12. …Whole-mount tissue staining using an affinity-purified anti-PHM antibody in the CNS … This view
displays only a portion of the CNS”
 Term “immunolocalization” is not in the text
 Instead, text describes the process of performing
an immunolocalization
Typical NLP Training Data:
More Detailed

These systems assume every mention of an
entity or relation of interest in the text is
annotated
So anything not annotated is not a mention
 E.g.,
Annotations to train a “Northern blot” detector:
Paper #7540168:
...
transcripts on Northern analyses, raising questions whether @norpA@
...
@Northern Blots@
...
Northern blots were carried out as described by Zhu @et al.@(1993)
...
@Northern Analysis of Adult RNA@
...
Figure 3: Northern blot analysis of @norpA@ transcripts in adult
...
This paper has a total of 19 mentions.
Task Details

Task has 3 sub-tasks, that contribute equally to
the overall score
 1. Ranked-list of papers (curatable before
non-curatable)
 2. Yes/No decisions on the papers being
curatable (having any results of interest)
 3. Yes/No decisions for having results for
each type of product (transcript, protein) for
each gene mentioned in a paper
Some Numbers






Training set: 862 articles
Test set: 213 articles (non-public!)
Time Allowed
 Release training set, wait ~6 weeks
 Release test set, results due ~2 weeks later
18 teams submitted 32 entries
Entries from 7 “countries”:
 Japan, Taiwan, Singapore, India, UK,
Portugal, USA
 About equal numbers of universities and
companies
Evaluation measure: F measure
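For reference, the F measure combines precision and recall into one number. The slides do not say which weighting was used, so the sketch below assumes the balanced form (beta = 1) by default; the counts in the example are invented.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented counts: 30 correct "Yes" decisions, 10 spurious, 20 missed.
# Precision is 0.75 and recall is 0.60.
score = f_measure(tp=30, fp=10, fn=20)
```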
Results

Winner: a team from ClearForest and Celera
Used manually generated rules and patterns
to perform information extraction
 Also had the best score in each of the 3
sub-tasks
                       Best  Median
Ranked-list:           84%   69%
Yes/No curate paper:   78%   58%
Yes/No gene products:  67%   35%

Summary
Reliance on partial annotations is key.
 “Information retrieval” task easiest to solve
and immediately useful.
 Electronic availability of full-text is big
issue.
 Mundane format problems (subscripts etc)
are a big issue.
 Best results were 67% for information
extraction.

Curated Databases
Flybase is an example of a curated
database.
 A lot of biological research is organized
around such databases (cf. building and
publishing software packages in CS)
 There are hundreds (thousands?) of curated
databases.



13 important databases just for one area:
nuclear receptors.
Maintaining curated databases is labor-intensive.
Curated Databases

Text mining can be used for:




Cost savings
Time savings
Consistency
Freshness
Curated Databases: Uses

Protein-protein interactions
 Which proteins interact with X?
Support information retrieval
 Find all transcription factors that are involved in cell death
Interpretation of data-intensive experiments
 Microarray case study presented last week
In silico biology
 E-Cell (http://e-cell.org/)
Curated Databases: Uses (cont.)
Summary/selection of what is known
 Support search
 Knowledge discovery



Contradictory findings
Nobel Prize


He/She who first points out a critical gene-disease link wins the Nobel Prize.
You had better do a thorough literature search.
Combining
Text Mining and Data Mining
Combining Text and Links

Recall: Classifying a web document based on




The text it contains
The categories of other pages pointing to it
The categories of other pages it is pointing to
Also

Usage information (Pitkow et al.)
Clustering: Example
(Eisen et al.)
Combining Gene Expression & Text


Clustering of genes in a microarray
experiment
Last week





Clustering based on text only, or:
Clustering based on gene expression only
What about combining the two?
There is a large number of “good
clusterings” for a particular problem
Use literature to guide clustering
Comments





Yeast: genes were grouped by expression.
Functional labels guided us to find key subgroups.
Once key subgroups are identified, supervised approaches can
refine the identification process.
Cancer: cell lines were grouped by semantic category (hypoxia
versus normoxia).
Used supervised approaches to refine the identification process.
Literature as a guide

Free text documentation is widely available

Patient records to describe pathological specimens

~20,000 documents describing specific yeast genes

May have the information to guide us in searching
for similarities in genes and expression
Goal of algorithm


To identify subgroups of genes with
commonalities in gene expression and in
biological function.
Literature is the means by which we identify
functional commonalities
Projections in Linear Discriminant Analysis

A normal
distribution is
estimated for the
features of each
population of the
training set.


Each distribution
is centered at the
mean of the
population
Linear
discriminant
analysis assumes
a pooled
covariance
matrix.
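The projection described on this slide can be sketched for two groups in two dimensions: estimate a mean per group, pool the covariance, and project onto the direction given by the inverse pooled covariance applied to the difference of the means. The data points and function names are made up for illustration, and the sketch uses plain Python rather than any particular statistics library.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, mu):
    """2x2 sum of outer products of deviations from the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def lda_direction(group_a, group_b):
    """Projection direction w = pooled_cov^-1 (mean_a - mean_b)."""
    mu_a, mu_b = mean(group_a), mean(group_b)
    sa, sb = scatter(group_a, mu_a), scatter(group_b, mu_b)
    dof = len(group_a) + len(group_b) - 2
    pooled = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [[pooled[1][1] / det, -pooled[0][1] / det],
           [-pooled[1][0] / det, pooled[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(point, w):
    """Scalar projection of a point onto the direction w."""
    return point[0] * w[0] + point[1] * w[1]

# Made-up 2-D "expression" features for two groups of genes.
group_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (2.0, 1.5)]
group_b = [(6.0, 5.0), (7.0, 7.0), (8.0, 6.5), (7.0, 5.5)]

w = lda_direction(group_a, group_b)
proj_a = [project(p, w) for p in group_a]
proj_b = [project(p, w) for p in group_b]
```

On this toy data every projected point of the first group lies above every projected point of the second, which is the kind of separating projection the approach looks for.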
Our approach



Look for projections that separate specific
groups of genes
In a good projection, the separated genes have
some functional commonalities
These commonalities should be evident in the
gene literature
Challenges



C1 : Can we identify biologically meaningful
concepts from simple text representations?
C2 : In a group of genes with some biological
similarity, can we detect that similarity in the
literature?
C3 : Can we then find projections in the
expression data that group genes appropriately?
Resources





NLP sessions of PSB: psb.stanford.edu
www.bionlp.com
bioperl.org, biopython.org
National Library of Medicine:
www.nlm.nih.gov
http://www.ai.ucsd.edu/rik/annblast/abbm.html (out of date, but still
comprehensive)
Links to Today’s Topics






http://www.smi.stanford.edu/projects/helix/psb01/chang.pdf
Pac Symp Biocomput. 2001;:374-83. PMID: 11262956
Blast: http://www.ncbi.nlm.nih.gov/BLAST/
http://www-smi.stanford.edu/projects/helix/psb03
Genome Res 2002 Oct;12(10):1582-90 Using text analysis
to identify functionally coherent gene groups.
Raychaudhuri S, Schutze H, Altman RB
www.biostat.wisc.edu/~craven/kddcup/
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome (complete genomes)
Links to Today’s Topics

http://www.nlm.nih.gov/mesh/meshhome.html