Opportunities & Challenges in
Applying IR Techniques to
Bioinformatics
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
Includes slides from NCBI training tutorials and from
the website of the book “An Intro. to Bioinformatics Algorithms”
Where in the US is UIUC?
Chicago
Picture from Netherlands Consulate Website http://www.netherlands-embassy.org/
DIR 2006 Keynote Talk, 2006 © ChengXiang Zhai
Outline
• What is Bioinformatics?
• Typical Problems in Bioinformatics
• Information Retrieval & Bioinformatics
• Biomedical Literature Access & Mining
• Summary
What is Bioinformatics
• Management & Exploitation of Biological
Data/Info
– Biological information (DNA, Gene expression,
Proteins, Literature….)
– Information management (search, organization,
classification)
– Information exploitation (pattern analysis, data
mining)
Why is Bioinformatics Important?
• Biology perspective
– More and more biological information is available
– Need to effectively access and use the information
– Information analysis supplements (and may even replace) wet
lab experiments
• Computer science perspective
– Excellent application domain
– Poses special computational challenges
– Brings computer science closer to scientific discovery
• Currently growing …
Bioinformatics and Other Fields
[Figure: Bioinformatics shown at the intersection of Computer Science, Biology, and Applied Mathematics & Statistics. Labels: Information Management, Optimization, Theoretical CS, Machine Learning, Data Mining; Biochemistry, Molecular Biology, Biophysics]
Some background about
molecular biology…
Life begins with Cell
• A cell is the smallest structural unit of an organism
that is capable of independent functioning
• Cells have some common features
All Cells have common Cycles
• Born, eat, replicate, and die
Example of cell signaling
Some Terminology
• The genome is an organism’s complete set of DNA.
– a small bacterial genome contains about 600,000 DNA base pairs
– human and mouse genomes have some 3 billion.
• Gene
– basic physical and functional unit of heredity.
– specific sequence of DNA bases that encodes
instructions on how to make a protein.
• Protein
– Makes up the cellular structure
– large, complex molecules made up of smaller
subunits called amino acids.
Life Depends on 3 Critical Molecules
• DNAs
– Hold information on how the cell works
• RNAs
– Act to transfer short pieces of information to different
parts of the cell
– Provide templates to synthesize proteins
• Proteins
– Form enzymes that send signals to other cells and
regulate gene activities
– Form body’s major components (e.g. hair, skin, etc.)
The Central Dogma
• Central Dogma:
DNA → RNA → protein
• Transcription:
DNA → mRNA
• Translation:
mRNA → protein
Biology Research Questions
• Are two genes the same?
• How are genes regulated?
• What are the relations between gene functions
and transcription factors?
• How can we detect gene regulation networks?
• How can we determine protein structures?
• How can we determine protein functions?
• ….
The Central Dogma & Biological Data
Original DNA Sequences
(Genomes)
Expressed DNA sequences
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tags
(ESTs)
Protein Sequences
-Inferred
-Direct sequencing
Protein structures
-Experiments
-Models (homologues)
Literature information
Entrez Integrates Most Biological DBs
[Figure: Entrez at the center, linked to the databases CancerChromosomes, Gene, UniSTS, UniGene, HomoloGene, SNP, Genome, PopSet, Nucleotide, GEO, Books, MeSH, PubMed, OMIM, Taxonomy, GEO Datasets, Protein, PMC, Journals, Domains, Structure, 3D Domains]
Web Access: http://www.ncbi.nlm.nih.gov
Number of Users and Hits Per Day
[Chart: number of users per day by year, 1997 to 2003 (y-axis 0 to 450,000), with dips on Christmas & New Year’s Days]
Currently more than 10,000,000 to 50,000,000 hits per day!
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, RefSNP, GEO Datasets,
UniGene, TPA, NCBI Protein, Structure,
Conserved Domain
Primary vs. Derivative Sequence Databases
[Figure: labs and sequencing centers submit sequences (e.g., TATAGCCG, AGCTCCGATA, CCGATGACAA) to GenBank, which is updated ONLY by submitters. RefSeq and UniGene are derived through curators, genome assembly, and algorithms, and are updated continually by NCBI.]
A Traditional GenBank Record: The Flatfile Format

Header
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.

Feature Table
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"

Sequence
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
Bioinformatics Tools
Topics in Bioinformatics
[Figure: four kinds of biological data and the areas that study them]
• Biology literature (retrieval & text mining)
“…In this paper, we report the discovery of a new gene that affects DNA reproduction in …”
• Genes: DNA sequences (genomics)
AATTCATGAAAATCGTATACTGGTCTGGTACCGGC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTA
TCTGGTAAAGACGTCAACACCATCAACGTGTC
ACATCGATGAACTGCTGAACGAAGATATCCTG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGG
• Gene expression & regulation: microarray data (transcriptomics)
1.2 2.2 ... 1.5
3.2 2.0 ... 5.6
....
0.5 1.5 ... 4.3
• Proteins (function): protein sequences (proteomics)
MKIVYWSGTGNTEKMAELIAKGIIESGKDV
DELLNEDILILGCSAMGDEVLEESEFEPFIE
KVALFGSYGWGDGKWMRDFEERMNGYG
PDEAEQDCIEFGKKIANI
Typical Problems in Bioinformatics
• Sequence alignment
– Pairwise
– Multisequence
• Motif finding
• Gene finding
• Protein structure/function prediction
• Protein motif function prediction
• Literature access & mining
• …
This is a quite incomplete list…
Topic 1: Sequence Search
• Biological problems
– How do we know whether two genes are similar?
– Given a gene, how can we find similar genes in the genome of another
organism?
– Given a protein, how can we find similar proteins?
• Computational problems
– Sequence matching/alignment/search
– Given a query sequence, retrieve similar sequences from a database
of sequences
• Related IR techniques
– Inverted index, sequence similarity, sequence retrieval
• Biologists' sequence search engine is BLAST, one of the most
widely used bioinformatics tools, which many biologists use
routinely!
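The link to inverted indexes can be made concrete with a small sketch: index every k-mer ("word") of each database sequence, then rank sequences by how many query k-mers they share. This is only the seed-lookup idea, not BLAST itself, which goes on to extend seed hits into alignments and score them statistically. The database sequences, the names `build_index` and `search`, and the choice k = 4 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq: str, k: int) -> list:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(db: dict, k: int) -> dict:
    """Inverted index: k-mer -> set of ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def search(query: str, index: dict, k: int, top_n: int = 3) -> list:
    """Rank database sequences by the number of distinct shared k-mers."""
    hits = Counter()
    for kmer in set(kmers(query, k)):
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return hits.most_common(top_n)


# Toy database (hypothetical sequences).
db = {
    "seqA": "AAAAGAGTCAGACATCG",
    "seqB": "GGCGCCACAGTCCGCG",
    "seqC": "TTTGAGTCATTT",
}
index = build_index(db, k=4)
print(search("GAGTCAGA", index, k=4))
```

In this toy run, seqA shares all five query 4-mers, seqC shares three, and seqB shares one, so the ranking mirrors how much of the query each sequence contains.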
Topic 2: Multiple Sequence Alignment
Biological problem: Given a family of proteins, how can we
characterize their function domains/motifs?
Computational problem: Given a set of sequences, find the
best alignment
Related IR techniques: Summarization
Topic 3: Motif Finding
• Biological problem:
– Given a set of genes with similar functions
– Find the common transcription factor binding site
• Computational problem:
– Given a positive set of sequences and a
background set of sequences, find a common
pattern that is shared by all/many of the positive
sequences, but not common in the background
• Related IR techniques: relevance feedback
Motif Discovery
• Motif = subsequence pattern
…. G-G-T-C-C-T-G-G …
• Motif discovery
– Given a target set of sequences (and possibly a
background set of sequences)
– Find motifs that characterize the target set
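One simple way to picture target-versus-background motif discovery is to rank fixed-length k-mers by how much more often they occur in the target set than in the background set, much as relevance feedback ranks terms. The sketch below uses a smoothed log ratio of "document" frequencies. It is not AlignACE or any published motif finder, and the sequences, the function names, and the smoothing constants are assumptions chosen for illustration.

```python
import math
from collections import Counter


def sequence_frequency(seqs: list, k: int) -> Counter:
    """For each k-mer, count how many sequences contain it at least once."""
    freq = Counter()
    for s in seqs:
        freq.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return freq


def enriched_kmers(target: list, background: list, k: int, top_n: int = 5) -> list:
    """Rank k-mers by smoothed log(P(kmer | target) / P(kmer | background))."""
    t_freq = sequence_frequency(target, k)
    b_freq = sequence_frequency(background, k)
    scores = {}
    for kmer, t in t_freq.items():
        p_target = (t + 0.5) / (len(target) + 1)
        p_background = (b_freq[kmer] + 0.5) / (len(background) + 1)
        scores[kmer] = math.log(p_target / p_background)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical sequences: every target contains TGACTCA, no background does.
target = ["ACGTGACTCAAT", "TTGACTCAGG", "CCTGACTCATA"]
background = ["ACGTACGTAC", "TTTTGGGGCC", "CCATATATAG"]
print(enriched_kmers(target, background, k=7, top_n=3))
```

Real motif finders allow mismatches and variable positions, which exact k-mer counting cannot capture; the sketch only shows the target/background contrast.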
AlignACE Example: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence
per gene are searched in
Saccharomyces cerevisiae.
AlignACE Example: The Target Motif
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37 (maximum)
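The eight aligned sites above can be summarized as per-position base counts and a consensus sequence. The sketch below does only that; it does not reproduce AlignACE's sampling procedure or its MAP score, and the function names are made up for illustration.

```python
from collections import Counter

# The aligned sites listed on the slide.
SITES = [
    "AAAAGAGTCA",
    "AAATGACTCA",
    "AAGTGAGTCA",
    "AAAAGAGTCA",
    "GGATGAGTCA",
    "AAATGAGTCA",
    "GAATGAGTCA",
    "AAAAGAGTCA",
]


def position_counts(sites: list) -> list:
    """One Counter of base counts per alignment column."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites: list) -> str:
    """Most frequent base in each column."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


print(consensus(SITES))
for j, counts in enumerate(position_counts(SITES), start=1):
    print(j, dict(counts))
```

The last six columns are almost perfectly conserved (G, A, G, T, C, A), while the first four vary, which is the kind of pattern a position-specific count matrix makes visible.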
Topic 4: Microarray Data Analysis
[Figure label: Functional group]
Biological problem: Given expression values of a set of genes
in different conditions, how do we detect genes that are
co-expressed/co-regulated?
Computational problem: Given 2-D matrix data, perform clustering
Related IR techniques: clustering
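As a minimal sketch of clustering a gene-by-condition matrix, the code below groups genes whose expression profiles have a Pearson correlation above a threshold, merging groups whenever any pair of members is that similar (single-linkage style). The expression values, gene names, threshold, and function names are hypothetical; real analyses typically use hierarchical clustering or k-means from a statistics library.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def coexpression_clusters(expr: dict, threshold: float = 0.9) -> list:
    """Merge genes into one cluster whenever a pair correlates >= threshold."""
    genes = list(expr)
    cluster_of = {gene: i for i, gene in enumerate(genes)}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(expr[g], expr[h]) >= threshold:
                old, new = cluster_of[h], cluster_of[g]
                for gene in cluster_of:
                    if cluster_of[gene] == old:
                        cluster_of[gene] = new
    clusters = {}
    for gene, cluster_id in cluster_of.items():
        clusters.setdefault(cluster_id, []).append(gene)
    return sorted(sorted(members) for members in clusters.values())


# Rows = genes, columns = conditions (made-up values).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.0],
}
print(coexpression_clusters(expr))
```

Correlation is used instead of Euclidean distance so that genes with the same trend but different magnitudes (geneA and geneB here) still end up together.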
Topic 5: Profile HMM
• Biological problem: How do we know if a new
protein has the same function as any known
proteins?
• Computational problem:
– Given a set of proteins in the same function family,
build an HMM profile for the family
– Given examples of proteins in different families,
learn a classifier to classify new proteins
• Related IR techniques: text categorization,
HMMs
An Example Profile HMM
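[Figure: profile HMM with Begin and End states and, for each position j, a match state Mj, an insertion state Ij, and a deletion state Dj]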
• 3 kinds of states: Match, Insertion, Deletion
• Output symbols: amino acids
• Can be trained with aligned multiple sequences
Uses of Profile HMMs
• Detecting potential membership in a family
– Matching a sequence to the profile HMMs
– Score a sequence S by p(S|HMM)/p(S|Random)
• Return top k best matching profile HMMs for a given sequence
• Given an HMM, find additional sequences in the family
• Aligning a sequence to an existing family
– Decoding the sequence using Viterbi
– Using the state transition path to align the sequence with
the existing sequences in the family
• HMMs have many other uses (e.g., gene finding)
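The log-odds score p(S|HMM)/p(S|Random) can be illustrated with a deliberately stripped-down model: a profile with match states only, so there are no insertions, deletions, or transition probabilities, and sequences must have the profile's length. A real profile HMM would compute p(S|HMM) with the forward algorithm over all three state types. The training sequences are the KPD..GL motif instances from the protein motif slide in this talk; the pseudocount, the uniform background, and the function names are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_match_profile(aligned: list, pseudocount: float = 1.0) -> list:
    """Per-column emission probabilities, smoothed with a pseudocount."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile


def log_odds(seq: str, profile: list) -> float:
    """log( p(seq | profile) / p(seq | uniform random model) )."""
    if len(seq) != len(profile):
        raise ValueError("match-only profile requires equal length")
    background = 1.0 / len(AMINO_ACIDS)
    return sum(math.log(profile[j][a] / background) for j, a in enumerate(seq))


family = ["KPDMRGL", "KPDSKGL", "KPDLPGL", "KPDKMGL"]
profile = train_match_profile(family)
print(log_odds("KPDAAGL", profile))   # positive: fits the family pattern
print(log_odds("WWWWWWW", profile))   # negative: looks like background
```

A positive score means the sequence is more likely under the family model than under the random model, which is the decision rule the slide describes for detecting potential family membership.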
Protein Motif Function Prediction
[Figure: protein sequences (MAPVRKPDMRGLAVFIS, DIRNCKPDSKGLEAEVKR, TEIRESIAS, …, MLQPAKPDLPGLCIYPSVKE, FMLKPDKMGLLTDFGQIA) are input to SPLASH, which outputs potential motif patterns (KPD . . GL, LQ . . D. . FTD, KAVFS . . . . GQIA, ……). Functions = ? (Tyrosine kinase? Signal transducer? ……)]
How to determine the function of
a new protein motif?
Motif Function Prediction Method
[Figure: motifs (APLL..VQY, KCI..SP..LR, GSGSGS) are matched against protein sequences (e.g., fb|FBgn0028428), which are annotated with GO terms: GO:0005221 "intracellular cyc. n. a. action channel", GO:0005244 "voltage-gated ion channel", GO:0005886 "plasma membrane"]
• Exploit the correlation between motif matching and GO
assignment
• Which GO term is most strongly correlated with “APLL..VQY”?
• Related IR Techniques: Cross-Lingual IR, Mutual Information
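The correlation between motif matching and GO assignment can be measured with mutual information between two binary variables over a set of proteins: "protein matches the motif" and "protein is annotated with the term". The sketch below ranks GO terms for one motif that way. The protein ids, the annotations, and the function names are hypothetical; only the GO identifiers are taken from the slide.

```python
import math


def mutual_information(has_motif: list, has_term: list) -> float:
    """Mutual information (in bits) between two equal-length boolean lists."""
    n = len(has_motif)
    mi = 0.0
    for m in (True, False):
        for t in (True, False):
            joint = sum(1 for a, b in zip(has_motif, has_term) if a == m and b == t) / n
            p_m = sum(1 for a in has_motif if a == m) / n
            p_t = sum(1 for b in has_term if b == t) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_m * p_t))
    return mi


def rank_go_terms(motif_hits: set, annotations: dict) -> list:
    """Return (score, term) pairs, most strongly associated term first."""
    proteins = sorted(annotations)
    has_motif = [p in motif_hits for p in proteins]
    terms = sorted({t for ts in annotations.values() for t in ts})
    scored = [
        (mutual_information(has_motif, [t in annotations[p] for p in proteins]), t)
        for t in terms
    ]
    return sorted(scored, key=lambda st: (-st[0], st[1]))


# Hypothetical data: which proteins match the motif, and their GO annotations.
motif_hits = {"p1", "p2", "p3"}
annotations = {
    "p1": {"GO:0005244", "GO:0005886"},
    "p2": {"GO:0005244", "GO:0005886"},
    "p3": {"GO:0005244"},
    "p4": {"GO:0005886"},
    "p5": {"GO:0005886"},
    "p6": set(),
}
for score, term in rank_go_terms(motif_hits, annotations):
    print(term, round(score, 3))
```

Mutual information is symmetric in direction: it is also high when a term is strongly anti-correlated with the motif, so a practical system would check the sign of the association as well.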
Opportunities for Applying IR Techniques
DNA Sequences
Protein Sequences
Protein Structures
Microarray Data
Text
Search
Filtering
Clustering
Classification
Summarization
Functional Annotations
…
Text Mining
…
Text Information Management
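[Figure: layered view of text information management. Bottom: Text, processed by Natural Language Content Analysis. Middle: Information Access (Search, Filtering), Information Organization (Categorization, Clustering), Knowledge Acquisition (Extraction, Mining), with Summarization and Visualization. Top: Search Applications and Mining Applications.]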
Sequence Information Management
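[Figure: the same layered view applied to sequences. Bottom: Sequences, processed by Sequence Content Analysis. Middle: Information Access (Search, Filtering), Information Organization (Categorization, Clustering), Knowledge Acquisition (Extraction, Mining), with Summarization and Visualization. Top: Search Applications and Mining Applications.]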
Challenges in Applying IR to Bioinformatics
• Domain expertise barrier
– Causes difficulty in problem definition & evaluation
• Signal/noise ratio is poor
– Unlike English, which we know well, the “DNA language” is
largely unknown
– Techniques working well for English text may not work well
for DNA sequences
• Inaccuracy and errors inevitably exist in biological data
– Measurement errors (e.g., sequencing errors)
– Very few derived data (e.g., annotations) have been validated
Challenges in Applying IR to Bioinformatics
• Exploiting all available information about a problem is
critical
– How to incorporate domain/prior knowledge? (Need to
formalize a biologist’s knowledge)
– Many resources are available, but figuring out how to
appropriately take advantage of them is a challenge
• Variation of problem formulation
– While a problem may be similar to one in IR at a high level,
it is often quite different at the low level
– E.g., sequence search differs from text search in two ways:
• Query is different
• Matching criterion is different (need alignment)
– Direct applications of standard IR techniques may not be
effective
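To show what "need alignment" means as a matching criterion, here is a minimal sketch of local alignment scoring in the style of Smith-Waterman with a linear gap penalty. Unlike term matching in text retrieval, the score rewards the best-matching pair of substrings while tolerating mismatches and gaps. The scoring parameters (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# An exact occurrence of the query inside a longer sequence.
print(local_alignment_score("GAGTCA", "TTGAGTCATT"))
# Two sites from the AlignACE example that differ at one position.
print(local_alignment_score("AAATGAGTCA", "AAAAGAGTCA"))
```

The second pair would fail an exact-match lookup entirely, yet still scores close to the maximum, which is the behavior a sequence search engine needs.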
Biomedical Literature Access & Mining:
General Problems
• Basic literature search
– High accuracy, vocabulary matching/switching,
entity recognition
• Integrative information access
– Literature linked with databases
• Information/Knowledge extraction
– Genes, relations, networks, inferences
• Hypothesis generation/testing
– Exploratory analysis & QA
Some of Our Work
• Applying language models to biomedical
literature retrieval (TREC 2003 & 2005)
• Applying entity recognition to gene name
recognition (HLT/NAACL 2006)
• Applying summarization to gene
summarization (PSB 2006)
• Developing an integrated and exploratory
biological information system ($5M NSF project)
Biomedical Literature Retrieval
• Task: Given an ad hoc query, find relevant
literature abstracts from Medline
• Challenge: Semi-structured queries
• Standard language models are not directly
applicable
• Solution: Semi-structured query language
models
Semi-Structured Queries
• TREC-2003 Genomics Track, Topic 1:
Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1
Bag-of-word Representation:
activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1
• Problems with unstructured representation
– Intuitively, matching “ATF2” should be counted more than
matching “transcription”
– Such a query is not a natural sample of a unigram
language model, violating the assumption of the language
modeling retrieval approach
Semi-Structured Language Models
Semi-structured query: $Q = (Q_1, \ldots, Q_k)$
Semi-structured query model: $\theta_Q$ with components $\theta_1, \ldots, \theta_k$
$$p(w \mid \theta_Q) = \sum_{i=1}^{k} \lambda_i \, p(w \mid \theta_i)$$
Semi-structured LM estimation:
Fit a mixture model to pseudo feedback documents using EM
             TREC 2003 (Uniform weights)        TREC 2005 (Estimated weights)
Query Model  Unstruct   Semi-struct   Imp.      Unstruct   Semi-struct   Imp.
MAP          0.16       0.185         +13.5%    0.242      0.258         +6.6%
Pr@10docs    0.14       0.154         +10%      0.382      0.412         +7.8%
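The effect of mixing per-field models, rather than pooling all query words into one bag, can be sketched as follows. Each field gets its own unigram model and the query model is their weighted sum. This sketch estimates each component directly from the field text by maximum likelihood, whereas the talk estimates the model by fitting a mixture to pseudo feedback documents with EM. Grouping the four alias symbols into one field, the uniform weights, and the function names are assumptions.

```python
from collections import Counter


def unigram_model(text: str) -> dict:
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def mix_query_model(fields: list, weights: list) -> dict:
    """p(w | Q) = sum_i weight_i * p(w | field_i)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    mixed = Counter()
    for field_text, weight in zip(fields, weights):
        for word, prob in unigram_model(field_text).items():
            mixed[word] += weight * prob
    return dict(mixed)


# TREC-2003 Genomics Track Topic 1, split into three fields.
fields = [
    "activating transcription factor 2",   # OFFICIAL_GENE_NAME
    "ATF2",                                # OFFICIAL_SYMBOL
    "HB16 CREB2 TREB7 CRE-BP1",            # ALIAS_SYMBOLs (grouped: assumption)
]
structured = mix_query_model(fields, [1 / 3, 1 / 3, 1 / 3])
flat = unigram_model(" ".join(fields))
print(structured["atf2"], structured["transcription"])
print(flat["atf2"], flat["transcription"])
```

In the bag-of-words model "atf2" and "transcription" get the same probability, while in the mixed model "atf2" gets four times the weight of "transcription", matching the intuition stated on the Semi-Structured Queries slide.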
Biomedical Named Entity Recognition
• Task: Recognizing gene names in biomedical
literature
• Challenge: Irregular name variations
• Standard machine learning suffers from overfitting
• Solution: Domain-Aware adaptation
Challenges in Recognizing Gene Names
• No complete dictionaries
– Biologists constantly name newly discovered genes
• Long, descriptive gene names
– muscle-specific Xenopus cardiac actin gene promoter
• Ambiguity
– Synonyms: octopamine receptor (oa1, oar, amoa1)
– Lexical variations: MIP-1-alpha, MIP-1alpha, (MIP)-1alpha
– Confused with common English words: for (foraging), at
(arctops)
Domain overfitting problem
• When a learning based gene tagger is applied to
a domain different from the training domain(s),
the performance tends to decrease significantly.
• The same problem occurs in other types of text,
e.g., named entities in news articles.
Training domain   Test domain   F1
mouse             mouse         0.541
fly               mouse         0.281
Reuters           Reuters       0.908
Reuters           WSJ           0.643
Adapting Biological Named Entity Recognizer
[Figure: training data T1, …, Tm and test data E. Feature selection is run for D0, D1, …, Dm, giving individual domain feature rankings O1, …, Om and the top d0, d1, …, dm features for each. Feature re-ranking produces a ranking O’ that distinguishes generalizable features from domain-specific features. The top d features, with d = λ0d0 + (1 – λ0)(λ1d1 + … + λmdm) and weights λ0, λ1, …, λm, are used in learning the entity recognizer, which is then applied in testing.]
Preliminary Evaluation Results
• Recognizing gene names
• Maximum entropy/Logistic regression recognizer
• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast), each contributing 5,000
sentences, 2,500 of which contain gene mentions

Training Set    Test Set   Baseline   Domain   Improvement
Fly, Mouse      Yeast      0.537      0.549    +2.2%
Fly, Yeast      Mouse      0.418      0.454    +8.6%
Mouse, Yeast    Fly        0.148      0.215    +45%
Gene Summarization
• Task: Automatically generate a text summary
for a given gene
• Challenge: Need to summarize different
aspects of a gene
• Standard summarization methods would
generate an unstructured summary
• Solution: A new method for generating semi-structured summaries
An Ideal Gene Summary
• http://flybase.bio.indiana.edu/.bin/fbidq.html?FBgn0000017
[Figure: FlyBase gene report with passages labeled by summary aspect: GP, EL, SI, GI, MP, WFPI]
Semi-structured Text Summarization
Summary example (Abl)
BeeSpace Project
• $5M National Science Foundation Project
• A campus-wide collaborative project involving
computer scientists, biologists, and
information scientists
• Develop an integrative exploratory information
system allowing a user to navigate from
biological experiment data to literature for
functional information about honeybee’s social
behavior
• URL: http://www.beespace.uiuc.edu/
Summary
• Bioinformatics already involves a lot of IR
– PubMed search
– Entrez integrated search (e.g., find related articles)
– BLAST
• Many IR techniques can be either directly applied to
or adapted to biomedical literature access & mining
• High similarities between bioinformatics problems
and text mining problems.
• High similarities between the methods used in
bioinformatics and in IR (should be mutually
beneficial)
• Many opportunities for an IR researcher to contribute
to bioinformatics research/development
Stepping into Bioinformatics
• Find a biologist partner and treat him/her as your “customer”
• Learn basic molecular biology to eliminate “language barrier”
• Attend bioinformatics conferences (ISMB, ECCB, RECOMB,
PSB, CSB, …)
• Start with biomedical literature access & mining
– Participate in TREC Genomics Track
– Apply/Adapt existing IR techniques
– Develop new IR techniques
• Move to information integration (Text + Databases)
• Look for methodology connections
– Language models (especially HMMs, translation models)
– IR heuristics (TF-IDF, pseudo feedback)
– Machine learning
• Build systems! (Biologists love easy-to-use software tools)
The Road to Bioinformatics…
….
Good luck!
$1 × 10^9
$1 × 10^6
$100
$10
IR
Biomedical science
Human health
…
Thank You!