Lab 1 - personal homepage server for the University of Michigan


Bioinfo/Stat 545 Biostat646
Data Analysis in Molecular Biology
Lab 1: Bioinformatics Online Resources
Dongxiao Zhu
Overview
Main types of biological data
 - Sequence Data
 - Interaction Data
 - Microarray and gene expression data
 - Others: macromolecule structure data, human genes and disease data
Information Retrieval Strategies
Part I. Online Biological Data Resources
2004 Nucleic Acids Research database issue
http://www3.oup.co.uk/nar/database/cap/ (database list)
Total 548 databases listed, 162 more than last year
Main types of biomedical data
Sequence Data (DNA and Protein Sequence)
 - Gene sequencing, “Whole genome shotgun” and Lander & Waterman Assembly Algorithm
 - Protein sequencing, de novo sequencing from tandem Mass Spectra
 - Gene Prediction, Sequence alignment and BLAST
 - Gene Annotation and Gene Ontology
 - Protein/RNA secondary/tertiary structure prediction
Interaction data – Biological pathway and network
Microarray and Gene Expression Data
Others: structure data, human genes and disease
Gene/Protein sequencing: data acquisition and data accuracy
Whole genome shotgun [1]
Double end sequencing
 - short reads off both ends of large inserts
 - additional information for assembly
Clone coverage vs. sequence coverage
Scaffolds
 - ordered and oriented contigs
 - sequence gaps
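The relation between sequencing depth and sequence gaps can be sketched with the Lander & Waterman model named earlier. This is a minimal illustration, not the assembly procedure of [1]: it assumes reads of equal length placed uniformly at random, and the function name and parameters are chosen here for illustration.

```python
import math

def lander_waterman(genome_len, read_len, n_reads, min_overlap=0):
    """Expected shotgun statistics under the Lander-Waterman model.

    Assumes equal-length reads placed uniformly at random (an idealization).
    """
    coverage = n_reads * read_len / genome_len         # c = N * L / G
    sigma = 1.0 - min_overlap / read_len               # fraction of a read not needed for overlap
    frac_covered = 1.0 - math.exp(-coverage)           # expected fraction of bases hit by a read
    n_contigs = n_reads * math.exp(-coverage * sigma)  # expected number of contigs (islands)
    return coverage, frac_covered, n_contigs

# Example: 10,000 reads of 500 bp over a 1 Mb genome gives 5x sequence coverage.
cov, covered, contigs = lander_waterman(1_000_000, 500, 10_000)
print(f"coverage={cov:.1f}x covered={covered:.4f} contigs={contigs:.1f}")
```

Even at 5x coverage the model still predicts dozens of contigs, which is why paired-end information is needed to order and orient them into scaffolds.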
De novo protein sequencing from Tandem Mass Spectra [2]
Accuracy issues:
 - Large scale repeats
 - Missing and contaminating data
 - Plasmids and minichromosomes
 - Signature of tandem repeats
 - Polymorphism
Gene Prediction, Annotation and Gene Ontology
GENSCAN web service [3]
 - http://genes.mit.edu/GENSCAN.html
 - Sensitive in recognizing at least one exon
Biochemical Functional Annotation (Biochemical View)
 - Clone, expression and functional studies
 - Database homolog/ortholog search
 - Sequence alignment (similar seq -> similar function)
 - Structure alignment (similar structure -> similar function)
Protein sub-cellular location prediction using primary sequence alone (Cellular View)
 - Codon usage bias in differently localized proteins
 - Signal peptide
Gene ontology – consistent descriptions of gene products in different databases
Sequence Alignment/BLAST and Literature Search –
Bioinformatics approaches to gene annotation
Why BLAST?
 - The number of novel sequences is increasing explosively. Even in E. coli, arguably the best characterized organism, about half of the ~4200 proteins have not been studied experimentally. Moreover, every newly sequenced genome encodes hundreds to thousands of novel proteins.
 - There is a need to infer the functional roles of these novel proteins.
 - Compare novel sequences with previously characterized genes to annotate function.
BLAST algorithm[4]

http://www.bioinformatics.med.umich.edu/Courses/526/lecturenotes.html
BLAST program selection guide

http://www.ncbi.nlm.nih.gov/BLAST/producttable.shtml
BLAST tutorial

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Literature Search (Part II)
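The idea "similar sequence -> similar function" rests on scoring local similarity between two sequences. The sketch below is not BLAST itself: it computes the exact Smith-Waterman local alignment score, which BLAST approximates heuristically for speed. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# The shared core "ACGT" drives the score; the flanking bases do not match.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

A high score against a characterized gene is only a hint about function; the annotation still has to be confirmed from the literature or by experiment.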
Gene ontology (GO)[5]
Why GO?
Use of GO terms by several collaborating databases facilitates uniform queries across them
Hierarchically structured to allow querying the vocabulary at different levels.
 - For example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases
Allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product
http://www.geneontology.org/index.shtml#downloads
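Querying at different levels of the hierarchy can be sketched with a toy term graph. Everything in this example is hypothetical: the term names, the gene identifiers and the parent links are made up for illustration and are not taken from the GO files at the URL above.

```python
# Toy child -> parents map; term names and gene IDs are illustrative only.
PARENTS = {
    "signal transduction": ["biological process"],
    "receptor tyrosine kinase signaling": ["signal transduction"],
    "cAMP biosynthesis": ["biological process"],
}
ANNOTATIONS = {
    "geneA": ["receptor tyrosine kinase signaling"],
    "geneB": ["signal transduction"],
    "geneC": ["cAMP biosynthesis"],
}

def ancestors(term):
    """Return every term reachable by following parent links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def genes_for(term):
    """Gene products annotated to `term` or to any of its descendants."""
    return sorted(
        gene for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )

print(genes_for("signal transduction"))                 # broad query
print(genes_for("receptor tyrosine kinase signaling"))  # zoomed-in query
```

Because a term may have several parents, the parent links are stored as lists and the traversal keeps a `seen` set rather than assuming a tree.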
What is GO [5]?
GO is designed to be a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. GO is used to annotate genes and gene products.
Three categories of GO
 - Biological Process: a biological objective to which the gene or gene product contributes. A process is accomplished via one or more ordered assemblies of molecular functions. E.g. “cell growth and maintenance”, “signal transduction”, “cAMP biosynthesis”.
 - Molecular Function: the biochemical activity of a gene product. E.g. “enzyme”, “ligand”, “Toll receptor ligand”.
 - Cellular Component: the place in the cell where a gene product is active. E.g. “ribosome” or “proteasome”, “nuclear membrane”.
An interesting analogy for GO
Statistician’s view
 - A multivariate definition
DB developer’s view
 - An entity/attribute definition in a DB schema
Biologist’s view
 - A nomenclature accepted by Biochemists/Molecular Biologists, Cell Biologists, Geneticists, Neuroscientists and Developmental Biologists
What GO is NOT?
GO is not a database of gene sequences, nor a catalog of gene
products. Rather, GO describes how gene products behave in a
cellular context.
GO is not a way to unify biological databases (i.e. GO is not a
'federated solution'). Sharing vocabulary is a step towards
unification, but is not, in itself, sufficient. Reasons for this include the
following.
a. Knowledge changes and updates lag behind.
b. Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
c. GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure, evolution and expression are not described by GO.
GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus.
Protein/RNA secondary/tertiary structure prediction
Protein secondary/tertiary structure prediction
 - Server list: http://www.embl-heidelberg.de/predictprotein/doc/explain_meta.html#list
 - Prediction methodologies: sliding window based and machine learning based
 - Easier and feasible at this moment: prediction of 2D topology for some functionally important proteins with simple patterns, e.g. transmembrane proteins [7].
RNA secondary/tertiary structure prediction
 - Algorithms: Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998, p. 267
 - Michael Zuker’s prediction server [6]
 - http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form1.cgi
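As a simple illustration of RNA secondary structure prediction by dynamic programming, the sketch below maximizes the number of base pairs (the Nussinov approach). It is not the method used by the mfold server, which minimizes free energy; the pairing rules and the minimum loop length of 3 are assumptions made here.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Assumes Watson-Crick and G-U wobble pairs and at least `min_loop`
    unpaired bases inside every hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]  # dp[i][j]: best pair count for seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

# A small hairpin: a three-pair stem closed by the loop AAA.
print(nussinov("GGGAAAUCC"))
```

Counting pairs ignores stacking and loop energies, so the result should be read as a teaching example rather than a realistic fold.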
Interaction Data – Biological Pathway and Network
Three main types of interaction data
 - Signal transduction or transcription regulation
 - Protein-protein interaction
 - Metabolic pathway (best in terms of studying network topology)
Interaction databases
 - KEGG database, metabolic pathways and signal transduction pathways in 107 organisms
 - http://dip.doe-mbi.ucla.edu/dip/Links.cgi
Network model (random vs. scale free, small world)
Network analysis and visualization software
 - http://www-personal.umich.edu/~mejn/courses/2004/cscs535/syllabus.pdf
 - Pajek, AT&T DOT etc.
Metabolic Network in Homo sapiens
Summary statistics of network analysis in 16 organisms
(rows are grouped by Domain, Kingdom and Phylum)

Organism          Nodes  Edges  Max Kout  Max Kin  Single edges  Mutual edges  Roots     Leaves
Eukarya
  H.sapiens        1040   1528        12       11           572           478  0.130769  0.161538
  R.norvegicus      763   1028        10        8           348           340  0.138925  0.165138
  C.elegans         706    974        10        9           324           325  0.15864   0.157224
  S.cerevisiae      748   1072         9       10           396           338  0.129679  0.140374
Bacteria, Proteobacteria, gamma
  E.coli            893   1365        12       14           459           453  0.139978  0.113102
  V.cholerae        738   1076        12       12           370           353  0.150407  0.123306
Bacteria, Proteobacteria, beta
  R.solanacearum    864   1238        11       12           406           416  0.138889  0.118056
Bacteria, Firmicutes, Bacillales
  B.subtilis        787   1151        12       12           401           375  0.133418  0.125794
Bacteria, Firmicutes, Lactobacillales
  L.lactis          545    778        11       11           280           249  0.157798  0.12844
Bacteria, Actinobacteria
  S.coelicolor      814   1154        12       12           406           374  0.14742   0.135135
Bacteria, Cyanobacteria
  T.elongatus       509    697        12       12           237           230  0.143418  0.133595
Archaea, Euryarchaeota
  M.acetivorans     489    633         8        7           209           212  0.143149  0.134969
  T.acidophilum     458    593         8        8           197           198  0.170306  0.135371
Archaea, Crenarchaeota
  S.solfataricus    586    730         8        7           256           237  0.187713  0.151877
  S.tokodaii        522    651         8        7           229           211  0.180077  0.149425
  P.aerophilum      482    622         8        7           204           209  0.161826  0.120332
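The summary columns (nodes, edges, max Kout, max Kin, single and mutual edges, roots, leaves) can be computed from a directed edge list along the following lines. The definitions are assumptions, since the slides do not state them: a root is taken to be a node with no incoming edge, a leaf a node with no outgoing edge, both reported as fractions of all nodes, and a mutual edge is a pair u->v, v->u counted once. Self-loops are assumed absent.

```python
from collections import Counter

def network_summary(edges):
    """Summary statistics for a directed network given as (source, target) pairs."""
    edge_set = set(edges)  # ignore duplicate edges
    nodes = {n for edge in edge_set for n in edge}
    out_deg = Counter(u for u, _ in edge_set)
    in_deg = Counter(v for _, v in edge_set)
    # A mutual edge is a reciprocated pair; count each pair once via u < v.
    mutual = sum(1 for u, v in edge_set if u < v and (v, u) in edge_set)
    single = sum(1 for u, v in edge_set if (v, u) not in edge_set)
    return {
        "nodes": len(nodes),
        "edges": len(edge_set),
        "max_kout": max(out_deg.values(), default=0),
        "max_kin": max(in_deg.values(), default=0),
        "single_edges": single,
        "mutual_edges": mutual,
        "roots": sum(1 for n in nodes if in_deg[n] == 0) / len(nodes),
        "leaves": sum(1 for n in nodes if out_deg[n] == 0) / len(nodes),
    }

# Hypothetical four-node network, for illustration only.
stats = network_summary([("a", "b"), ("b", "a"), ("b", "c"), ("d", "b")])
print(stats)
```

Under these definitions the number of directed edges equals single edges plus twice the mutual edges, which is a quick consistency check on any row of such a table.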
Microarray and Gene Expression Data
Assumptions
 - Measured signal is proportional to the amount of corresponding cDNA/mRNA
 - Amount of mRNA determines amount of protein, i.e. there is no regulation at the translation level
 - Neither assumption has been proven yet.
DNA microarray databases (useful links)
 - http://industry.ebi.ac.uk/~alan/MicroArray/
 - http://genome-www5.stanford.edu/resources.html
 - http://www.ebi.ac.uk/microarray/
 - There are many more; explore them yourself!
Download Gene Expression Data from SMD – An example
Stanford Microarray Database (SMD)
Retrieving public data from SMD
Retrieving data for an organism
 - ftp://genome-ftp.stanford.edu/pub/smd/organisms
 - One directory per organism, named with the two-letter code used by SMD
 - Under each directory, one file per experiment
 - Three ways to retrieve
Web client, e.g. IE, Netscape, etc.
Graphical ftp client, e.g. Flashget, etc.
Command line ftp client
 - ftp -i genome-ftp.stanford.edu (-i turns off per-file prompting, so mget fetches everything)
 - Name: anonymous Password: XX@
 - cd pub/smd/organisms/SC
 - mget *gz
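The command line session above can also be scripted. The sketch below uses Python's standard ftplib; the helper name `fetch_gz_files` and the local directory are made up for illustration, and the host and path are simply the ones quoted on the slide, which may no longer be reachable.

```python
import ftplib
import os

def fetch_gz_files(ftp, remote_dir, local_dir):
    """Download every file ending in 'gz' from remote_dir (like `mget *gz`)."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    saved = []
    for name in ftp.nlst():
        if not name.endswith("gz"):
            continue
        path = os.path.join(local_dir, os.path.basename(name))
        with open(path, "wb") as fh:
            ftp.retrbinary("RETR " + name, fh.write)
        saved.append(path)
    return saved

def download_yeast_data():
    """Usage sketch; not called automatically because it needs network access."""
    with ftplib.FTP("genome-ftp.stanford.edu") as ftp:
        ftp.login()  # anonymous login, as in the manual session
        return fetch_gz_files(ftp, "pub/smd/organisms/SC", "smd_SC")
```

Passing the connection object into the helper keeps the download logic separate from the login step, so it can be tried against any FTP server.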
Continued
Retrieving all public data for a publication
 - Go to http://genome-www5.stanford.edu/cgibin/tools/display/listMicroArrayData.pl?tableName=publication
 - Click any entry in column “Data in SMD”
 - Click “view” to read a brief description of the experiment design
 - Click “display data” to do an experiment-wise query.
 - Click “Data Retrieval and Analysis” to filter and retrieve data
Part II. Information Retrieval in Bioinformatics
Mastering effective information retrieval techniques keeps your research thinking and work up to date
My steps in doing biomedical research
 - Identify an interesting topic and raise a scientific hypothesis
 - Start from NCBI Entrez, the life science search engine. http://www.ncbi.nlm.nih.gov/Entrez/
 - Enter the keyword or phrase into the query box and click GO
 - The number of records retrieved from each database is displayed
 - Briefly go through each kind of resource
NCBI Entrez (Good starting point)
 - Common retrieval interface to many databases
 - Controlled links between databases
 - Maintained at the National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM)
PubMed and related IR Strategies: biomedical literature and books
What is PubMed?
PubMed is a web-based database of bibliographic information drawn primarily from the life sciences literature
PubMed tutorial:
http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html
Search Mechanisms
PubMed uses an Automatic Term Mapping feature
 - Look first in the MeSH Translation Table (translates keywords into MeSH terms, e.g. from “renal transplant” to “kidney transplant”)
 - Then look in the Journal Translation Table
 - Finally look in the Author Index
As soon as PubMed finds a match, the mapping stops. That is, if a term matches in the MeSH Translation Table, PubMed does not continue looking in the next table. It is therefore absolutely necessary to specify the “Limit” in NCBI when a term is ambiguous. E.g. “cell” is a MeSH term and also a journal name
PubMed - Continued
What if “no match” is found?
 - PubMed is unable to match a search term with either of the translation tables or the Author Index
 - PubMed will then search the individual words in All Fields.
 - Individual terms will be combined (ANDed) together.
 - Example: TATA Box associated transcription factor
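The lookup order described above can be sketched as a chain of table lookups that stops at the first hit. This is a simplified model, not PubMed's actual implementation: the three tables hold a few made-up entries, and the real mapping is more elaborate than a single dictionary lookup.

```python
# Tiny stand-ins for PubMed's tables; the contents are illustrative only.
MESH_TABLE = {"renal transplant": "kidney transplant", "cell": "cell"}
JOURNAL_TABLE = {"cell": "Cell"}
AUTHOR_INDEX = {"states dj"}

def map_term(query):
    """Translate a query the way the slide describes: first match wins."""
    key = query.lower()
    if key in MESH_TABLE:        # 1. MeSH Translation Table
        return f'"{MESH_TABLE[key]}"[MeSH Terms]'
    if key in JOURNAL_TABLE:     # 2. Journal Translation Table
        return f'"{JOURNAL_TABLE[key]}"[Journal]'
    if key in AUTHOR_INDEX:      # 3. Author Index
        return f"{query}[Author]"
    # No match: search each word in All Fields and AND the results together.
    return " AND ".join(f"{word}[All Fields]" for word in query.split())

print(map_term("cell"))                 # the journal "Cell" is never reached
print(map_term("TATA Box associated"))  # falls through to ANDed words
```

The first call shows why an ambiguous term needs an explicit limit: the MeSH match ends the lookup before the journal table is consulted.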
Phrase Searching
 - The formats for phrase searching (e.g. enclosing the phrase in double quotes) instruct PubMed to bypass Automatic Term Mapping. Instead PubMed looks for the phrase in its Index of searchable terms. If the phrase is in the Index, PubMed will retrieve citations that contain the phrase.
 - PubMed may fail to find a phrase because it is not in the Index.
 - Your phrase may actually appear in citation and abstract data, but may not be in the Index. If this is the case, the double quotes are ignored and the phrase is processed using Automatic Term Mapping.
MeSH database (“GO” in
literature search)
Database of indexing terms
Entry example
NF-kappa B
Ubiquitous, inducible, nuclear transcriptional activator that binds
to enhancer elements in many different cell types and is
activated by pathogenic stimuli. The NF-kappa B complex is a
heterodimer composed of two DNA-binding subunits: NF-kappa
B1 and relA.
Year introduced: 1991
Entrez => MeSH
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh
NLM => MeSH
http://www.nlm.nih.gov/mesh/meshhome.html
Structure of MeSH
(Combination of EC and GO)
Divisions
Anatomy [A]
Organisms [B]
Diseases [C]
Chemicals and Drugs [D]
Analytical, Diagnostic and Therapeutic
Techniques and Equipment [E]
Psychiatry and Psychology [F]
Biological Sciences [G]
Physical Sciences [H]
Anthropology, Education, Sociology and Social
Phenomena [I]
Technology and Food and Beverages [J]
Humanities [K]
Information Science [L]
Persons [M]
Health Care [N]
Geographic Locations [Z]
Hierarchy with Multiple Inheritance
Amino Acids, Peptides, and Proteins [D12]
  Proteins [D12.776]
    DNA-Binding Proteins [D12.776.260]
      NF-kappa B [D12.776.260.600]
Amino Acids, Peptides, and Proteins [D12]
  Proteins [D12.776]
    Nuclear Proteins [D12.776.660]
      NF-kappa B [D12.776.260.600]
Amino Acids, Peptides, and Proteins [D12]
  Proteins [D12.776]
    Transcription Factors [D12.776.930]
      NF-kappa B [D12.776.260.600]
MeSH Full Listing
NF-kappa B
Ubiquitous, inducible, nuclear transcriptional activator that binds to enhancer elements in many different cell types and is activated by pathogenic stimuli. The NF-kappa B complex is a heterodimer composed of two DNA-binding subunits: NF-kappa B1 and relA.
Year introduced: 1991
Subheadings:
administration and dosage, agonists, analysis, antagonists and inhibitors, biosynthesis, blood, cerebrospinal fluid, chemistry, classification, deficiency, diagnostic use, drug effects, genetics, immunology, isolation and purification, metabolism, pharmacokinetics, pharmacology, physiology, radiation effects, secretion, therapeutic use, toxicity, ultrastructure
Restrict Search to Major Topic headings only
Do Not Explode this term (i.e., do not include MeSH terms found below this term in the MeSH tree).
Entry Terms:
NF-kB
NF kB
Nuclear Factor kappa B
kappa B Enhancer Binding Protein
Immunoglobulin Enhancer-Binding Protein
Enhancer-Binding Protein, Immunoglobulin
Immunoglobulin Enhancer Binding Protein
Transcription Factor NF-kB
Factor NF-kB, Transcription
NF-kB, Transcription Factor
Transcription Factor NF kB
Ig-EBP-1
Ig EBP 1
Previous Indexing:
DNA-Binding Proteins (1987-1990)
Transcription Factors (1987-1990)
See Also:
I-kappa B
All MeSH Categories
Chemicals and Drugs Category
Amino Acids, Peptides, and Proteins
Proteins
DNA-Binding Proteins
NF-kappa B
All MeSH Categories
Chemicals and Drugs Category
Amino Acids, Peptides, and Proteins
Proteins
Nuclear Proteins
NF-kappa B
All MeSH Categories
Chemicals and Drugs Category
Amino Acids, Peptides, and Proteins
Proteins
Transcription Factors
NF-kappa B
Tips for increasing your searching sensitivity and specificity
Split the query yourself with the logical operators AND and OR, look terms up yourself in the MeSH database, and use MeSH terms in your query
Use tags to search efficiently
 - [au], “author”, e.g. States DJ[au].
 - [dp], “date of publication”, e.g. 2004[dp].
 - [ad], “address”, e.g. Ann Arbor[ad], etc.
 - [MeSH], “MeSH term”, e.g. Transcription factor[MeSH]
Select the “Limited to” option to prevent the search from stopping prematurely
Use phrase searching “” if you don’t want the words of your phrase to be searched separately.
Entrez Clipboard and Address Issue
Send to “clipboard”
Place to save results collected from multiple searches
Saved for ~1 hr
Task: Find a local expert on NF kappa B
“NF kappa B” AND (48109 [ad] OR “Ann Arbor” [ad] NOT Pfizer [ad])
(scan results for the most common senior author)
Need to think about all the ways people write addresses
 - “University of Michigan” fails to pick up “Univ. Mich.” or “UMMS” etc.
 - Zip codes are very specific, but only retrieve about 70% of the articles
 - Won’t catch co-authored articles with a remote collaborator
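Scanning the results for the most common senior author can be automated once the author lists are at hand. The sketch below assumes each retrieved record is available as a list of author names and that the senior author is listed last; both are assumptions, and the sample records are invented.

```python
from collections import Counter

def most_common_senior_authors(records, top=3):
    """Tally the last-listed author of each record, most frequent first."""
    counts = Counter(authors[-1] for authors in records if authors)
    return counts.most_common(top)

# Hypothetical author lists, for illustration only.
records = [
    ["Doe A", "Smith J"],
    ["Roe B", "Smith J"],
    ["Poe C", "Lee K"],
    [],  # a record without authors is skipped
]
print(most_common_senior_authors(records))
```

The tally inherits the address problems listed above: if the query misses articles, the counts are biased in the same way.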
IR Strategies
Term search
Simple search for term matches (exact or stemmed)
“Find articles containing ‘p53’”

Boolean
Logical combination of term matches
“Find articles containing ‘p53’ AND ‘apoptosis’”

Statistical neighboring
Assume that articles on the same subject will use similar words

Rank articles by similarity of word use
“Find articles using vocabulary similar to the vocabulary in this title/abstract”

Deeper parsing
Natural language processing and deeper understanding

The field is still in its infancy
“Find articles describing the mechanism of p53 activation in apoptosis”

Boolean Searches
Entrez attempts to intelligently parse your query
Query: dna binding transcription factor macrophage
Details => (((("dna"[MeSH Terms] OR dna[Text Word]) AND
(("pharmacokinetics"[MeSH Subheading] OR "pharmacokinetics"[MeSH Terms])
OR binding[Text Word]))
AND ("transcription factors"[MeSH Terms] OR transcription factor[Text Word]))
AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))
You can force a Boolean search
Query: “dna binding” AND “transcription factor” AND macrophage
Details => (("dna binding"[All Fields] AND "transcription factor"[All Fields])
AND ("macrophages"[MeSH Terms] OR macrophage[Text Word]))
Phrase Searching
Specify with quotes
 - “transcription factor” vs. “transcription” “factor”
Precomputed
 - Fast
 - Often mapped to synonyms and MeSH terms
Just because you get a “phrase not found” message does not mean it is not present
Text Neighboring
Related articles link (single or multiple articles)
Term usage similarity
 - Articles talking about the same thing are likely to use the same words
Good recall (sensitivity)
Precomputed and fast
Limitations
Strictly algorithmic, no understanding
 - “Ras activates PI3K” vs. “PI3K activates Ras”
Historical and author biases in vocabulary
Poor precision (specificity)
Ranking cannot satisfy everyone
Computational Issues in Statistical Text Retrieval
Stop words
 - Simple words like “the” and “and” are not worth scoring
Term weights
 - Should weight matches of rare words more heavily than matches of common words
Stemming and synonyms
 - Need to stem verbs and plural forms
 - May or may not be able to reduce to a normalized set of synonyms
Normalizing for length
 - Don’t want to exclude short articles or articles without an abstract
All vs. all comparison is not feasible
 - 10^7 articles => 10^14 comparisons, not feasible
 - Compute demands of the task are growing faster than Moore’s law
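Several of these issues (stop words, term weights, length normalization) show up in even the smallest statistical neighboring scheme. The sketch below uses TF-IDF weights with cosine similarity, a common textbook choice and not necessarily what Entrez uses; the stop word list is a made-up sample and no stemming is done.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "a", "to"}  # tiny illustrative list

def tokenize(text):
    """Lowercase words with stop words removed (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """One unit-length TF-IDF vector (word -> weight) per document."""
    tokens = [tokenize(d) for d in docs]
    doc_freq = Counter(w for t in tokens for w in set(t))
    n = len(docs)
    vectors = []
    for t in tokens:
        # Rare words get a larger weight via log(n / document frequency).
        vec = {w: c * math.log(n / doc_freq[w]) for w, c in Counter(t).items()}
        norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
        vectors.append({w: x / norm for w, x in vec.items()})  # length normalization
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

docs = ["Ras activates PI3K", "PI3K activates Ras", "The p53 pathway and apoptosis"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two titles score as identical because word order is ignored, which is the "strictly algorithmic, no understanding" limitation in miniature. Comparing every pair this way costs on the order of n^2 similarity computations, hence 10^14 for 10^7 articles.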
Acknowledgements
Some slides in Part II are taken from Dr. States’ Bioinfo 526 class
http://www.bioinformatics.med.umich.edu/Courses/526
Dr. Zhaohui (Steve) Qin for helpful discussion
All authors of references that I have cited