Biology and computers


NCBI data, sliding window programs and dot plots
Sept. 25, 2012
Learning objectives: Become familiar with OMIM and
PubMed. Understand the difference between a research
article and a review article. Understand the concept of
sliding window programs. Understand the difference between
identity, similarity and homology. Appreciate that proteins
can be modular.
Workshop: Learn how to use OMIM and obtain the DNA and
protein sequences associated with diseases. Perform a
sliding window analysis to compute %(G+C) as a function of
position in a sequence.
Homework due Tuesday, Oct. 2nd.
Primary public domain bioinformatics servers
Public domain bioinformatics facilities:
• National Center for Biotechnology Information (NCBI), United States: databases and analysis tools
• European Bioinformatics Institute (EBI), United Kingdom: databases and analysis tools
• GenomeNet (KEGG & DDBJ), Japan: databases and analysis tools
NCBI ENTREZ
A platform that provides access to and links to
databases with biological information.
ENTREZ connects: PubMed, MedLine, GenBank, Protein databases, Genomes, PopSet, Taxonomy, OMIM
NCBI ENTREZ
• MedLine: Literature database
• OMIM: Database of human genes and genetic disorders
• GenBank: Database of all publicly available DNA sequences
• Protein databases: Database of amino acid sequences from Uniprot, Protein Research Foundation, PDB
• Genomes: Database of genomes from organisms and viruses
• PopSet: Database of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
• Taxonomy: Database of names of organisms with sequences in GenBank
Literature Databases
• Medline/PubMed
• OMIM
• CSULA Library
• Bookshelf (from NCBI)
• Melvyl (Books at UC Libraries)
Other molecular life science databases
• Science Direct
• PubMed Central
• Free Medical Journals
• LinkOut Journals
• Wiley InterScience
OMIM: Online Mendelian Inheritance in Man
• A catalog of human genes linked to diseases
• Started by Victor A. McKusick at Johns Hopkins University
• A good place to start when you want to research a certain disease or biological molecule
• Cross-referenced to PubMed and other NCBI-based databases
Sliding window
A sliding window gathers information about the properties of
nucleotides or amino acids over a short stretch of a sequence.
[Figure: the sequence GCATATGCGCATATCCCGTCAATACCA repeated three times, labeled 4, 5, and 6.]
A simple example is to calculate the %(G+C) content
within a window, then move the window one nucleotide
and repeat the calculation.
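The %(G+C) calculation described above can be sketched in a few lines of Python. This is only a minimal illustration, not the workshop's program: the function names `gc_percent` and `sliding_gc` and the window size of 5 are assumptions chosen for the example. The sequence is the one shown on the slide.

```python
def gc_percent(window: str) -> float:
    """Return the percentage of G and C bases in a stretch of DNA."""
    window = window.upper()
    gc = window.count("G") + window.count("C")
    return 100.0 * gc / len(window)


def sliding_gc(sequence: str, window_size: int) -> list[tuple[int, float]]:
    """Slide a window one nucleotide at a time and compute %(G+C).

    Returns (start_position, percent_gc) pairs; positions are 1-based.
    """
    if window_size < 1 or window_size > len(sequence):
        raise ValueError("window_size must be between 1 and the sequence length")
    results = []
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]
        results.append((start + 1, gc_percent(window)))
    return results


if __name__ == "__main__":
    seq = "GCATATGCGCATATCCCGTCAATACCA"  # sequence from the slide
    # window_size=5 is an arbitrary choice for illustration
    for position, percent in sliding_gc(seq, window_size=5):
        print(f"{position:2d}  {percent:5.1f}")
```

A sequence of length N with a window of size w gives N - w + 1 values, so the 27-nucleotide example with a window of 5 produces 23 points to plot.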
Sliding window
If the window is too small, it is difficult to detect the trend
of the measurement. If it is too large, you could miss meaningful
data.
[Figure: two plots of %(G+C) versus sequence number, one with a small window size and one with a large window size.]
Sliding window
[Figure adapted from Zhao et al., BMC Genomics. 2007 Nov 7;8:403.]
Amino acid characteristics
Amino acid   Hydropathy value
A             1.8
C             2.5
D            -3.5
E            -3.5
F             2.8
G            -0.4
H            -3.2
I             4.5
K            -3.9
L             3.8
M             1.9
N            -3.5
P            -1.6
Q            -3.5
R            -4.5
S            -0.8
T            -0.7
V             4.2
W            -0.9
Y            -1.3
Four levels of protein structure
1) Primary
Linear sequence: AGHIPLLQ
2) Secondary
Initial folding patterns:
AGHIPLLQ
aaaTTTbb
3) Tertiary
Complex folding patterns
4) Quaternary
Interactions between polypeptides
Kyte-Doolittle Hydropathy
A sliding window software program [J. Mol. Biol. 157:105-132 (1982)].
[Figure: hydropathy plot.] The seven known membrane-spanning regions are numbered 1-7 in red on the plot. Note that this particular software program (http://www.vivo.colostate.edu/molkit/hydropathy/index.html) averaged the hydropathy values in the window, whereas the original program by Kyte and Doolittle summed them.
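A hydropathy profile of this kind can be sketched as follows. The values are those from the amino acid table in these slides. The function name `hydropathy_profile`, the default window size of 9, and the `average` switch are assumptions made for this illustration; they are not taken from either program mentioned above.

```python
# Hydropathy values from the amino acid table in these slides
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def hydropathy_profile(protein: str, window_size: int = 9,
                       average: bool = True) -> list[float]:
    """Return one hydropathy score per window position.

    average=True divides each window sum by the window size;
    average=False returns the plain sum of the values in the window.
    A residue that is not in the table raises KeyError.
    """
    values = [HYDROPATHY[aa] for aa in protein.upper()]
    profile = []
    for start in range(len(values) - window_size + 1):
        total = sum(values[start:start + window_size])
        profile.append(total / window_size if average else total)
    return profile


if __name__ == "__main__":
    # Short example peptide from the protein structure slide;
    # window_size=3 is chosen only because the peptide is so short.
    for score in hydropathy_profile("AGHIPLLQ", window_size=3):
        print(f"{score:6.2f}")
```

Summing and averaging give plots with the same shape; only the scale of the vertical axis differs, by a factor equal to the window size.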
Dot Plot with window = 1
[Figure: dot plot comparing two nucleotide sequences with window = 1; a dot (●) marks each position where the two sequences have the same nucleotide.]
Note that 25% of the table will be filled due to random
chance: there is a 1 in 4 chance of a match at each position.
Dot Plot with window = 3
[Figure: dot plot comparing two nucleotide sequences with window = 3; a dot (●) marks each position where three consecutive nucleotides match.]
The larger the window, the more noise can be filtered.
What is the percent chance that you will receive a match randomly?
One in (four)^3 chance: (1/4)^3 × 100 = 1.56%
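The effect of the window size on a dot plot can be tried out with a small sketch like the one below. It is an illustration under simple assumptions, not the workshop software: the names `dot_plot`, `render`, and `random_match_percent` are made up for this example, the two short sequences are arbitrary, and the random-match estimate assumes that the four nucleotides are equally likely.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1) -> list[list[bool]]:
    """Compare every window of seq_b (rows) with every window of seq_a (columns).

    A cell is True when the two windows are identical.
    """
    rows = len(seq_b) - window + 1
    cols = len(seq_a) - window + 1
    return [
        [seq_b[i:i + window] == seq_a[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix: list[list[bool]]) -> str:
    """Draw the plot as text: a dot for a match, a period otherwise."""
    return "\n".join(
        "".join("●" if cell else "." for cell in row) for row in matrix
    )


def random_match_percent(window: int, alphabet_size: int = 4) -> float:
    """Percent chance that two random windows match exactly,
    assuming all letters of the alphabet are equally likely."""
    return 100.0 * (1.0 / alphabet_size) ** window


if __name__ == "__main__":
    a = "AATGTAGGCCGC"  # arbitrary example sequences
    b = "ATGTAGTGCCTA"
    for w in (1, 3):
        print(f"window = {w}, random match = {random_match_percent(w):.2f}%")
        print(render(dot_plot(a, b, window=w)))
        print()
```

With window = 1 the plot is crowded with chance matches; with window = 3 most of them disappear and shared stretches stand out as diagonal runs of dots.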
Do workshop #2
Answer questions 1-3
Evolutionary Basis of Sequence
Alignment
1. Identity: Quantity that describes how much
two sequences are alike in the strictest terms.
2. Similarity: Quantity that relates how much
two amino acid sequences are alike.
3. Homology: A conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
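The difference between identity and similarity can be made concrete with a short sketch for two sequences that are already aligned and have the same length. The grouping of amino acids in `GROUPS` is a simplified assumption made for this example only; real programs use a scoring table instead. The function names are also made up for the illustration. Homology is not computed here because it is a conclusion drawn from the data, not a quantity.

```python
# Hypothetical, simplified groups of chemically similar amino acids.
# This grouping is an assumption for illustration, not a standard table.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ"]


def _check(a: str, b: str) -> None:
    """Require two non-empty aligned sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")


def is_similar(x: str, y: str) -> bool:
    """True if the residues are identical or belong to the same group."""
    return x == y or any(x in group and y in group for group in GROUPS)


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with exactly the same residue."""
    _check(a, b)
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def percent_similarity(a: str, b: str) -> float:
    """Percentage of positions with identical or similar residues."""
    _check(a, b)
    return 100.0 * sum(is_similar(x, y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    first, second = "AGHIPLLQ", "AGKLPLIW"  # second sequence is invented
    print(f"identity:   {percent_identity(first, second):.1f}%")
    print(f"similarity: {percent_similarity(first, second):.1f}%")
```

Similarity is never lower than identity, because every identical position also counts as similar.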
Purpose of finding differences and similarities
of amino acids in two proteins:
• Infer structural information
• Infer functional information
• Infer evolutionary relationships
Modular nature of proteins
Proteins possess local regions of similarity.
Proteins can be thought of as assemblies of
modular domains.
Two proteins that are similar in certain regions:
• Tissue plasminogen activator (PLAT)
• Coagulation factor 12 (F12)
Baxevanis and Ouellette, Bioinformatics, Wiley-Interscience, New York, 2001
The Dotter Program
Program consists of three components:
• Sliding window
• A table that gives a score for each amino acid match
• A graph that converts the score to a dot of certain density
(the higher the score, the higher the dot density)
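The three components listed above can be imitated in a toy sketch. This is not the Dotter program or its scoring table: `pair_score` is a hypothetical stand-in (2 for identical residues, 1 for residues in the same assumed group, 0 otherwise), and the characters in `SHADES` stand in for dots of increasing density.

```python
# Hypothetical groups of similar amino acids (an assumption for this sketch)
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ"]
SHADES = " .:*#"  # lowest score on the left, highest on the right
MAX_PAIR_SCORE = 2


def pair_score(x: str, y: str) -> int:
    """Toy score table: 2 = identical, 1 = same group, 0 = otherwise."""
    if x == y:
        return 2
    if any(x in group and y in group for group in GROUPS):
        return 1
    return 0


def window_scores(seq_a: str, seq_b: str, window: int) -> list[list[int]]:
    """Sum the pair scores inside a window for every pair of start positions.

    Rows follow seq_b, columns follow seq_a.
    """
    rows = len(seq_b) - window + 1
    cols = len(seq_a) - window + 1
    return [
        [sum(pair_score(seq_a[j + k], seq_b[i + k]) for k in range(window))
         for j in range(cols)]
        for i in range(rows)
    ]


def to_shades(scores: list[list[int]], window: int) -> str:
    """Convert each score to a character; higher scores get denser marks."""
    max_score = MAX_PAIR_SCORE * window
    top = len(SHADES) - 1
    return "\n".join(
        "".join(SHADES[(score * top) // max_score] for score in row)
        for row in scores
    )


if __name__ == "__main__":
    a, b = "AGHIPLLQAGHI", "KLPLIWAGKLPL"  # invented example sequences
    print(to_shades(window_scores(a, b, window=3), window=3))
```

Because similar residues also earn points, regions that are related but not identical still show up as faint diagonals instead of vanishing.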
[Figure: dot plot of sequence alignment highlighting Kringle domain alignments. Adapted from Baxevanis, Ouellette: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Edition.]