it_health_summary - Center for Biological Sequence Analysis

It & Health 2009
Summary
Thomas Nordahl Petersen
Teachers
Thomas Nordahl Petersen
Bent Petersen
Rasmus Wernersson
Ramneek Gupta
Lisbeth Nielsen Fink
Thomas Blicher
Anders Gorm Pedersen
Outline of the course
• Topics will cover a general introduction to bioinformatics
– Evolution
– DNA / Protein
– Alignment and scoring matrices
• How does it work & what are the numbers
– Visualization of multiple alignments
• Phylogenetic trees and logo plots
– Commonly used databases
• Uniprot/Genbank & Genome browsers
– Protein 3D-structure
– Artificial neural networks & case stories
– Practical use of bioinformatics tools
• Preparation for exam
Topics covered - (some of them)
Information flow in biological systems
Amino Acids
Amine and carboxyl groups. Sidechain ‘R’ is attached to the C-alpha carbon.
The amino acids found in living organisms are L-amino acids.
Amino Acids - peptide bond
N-terminal
C-terminal
1 and 3-letter codes
1. There are 20 naturally occurring amino acids
2. Normally the one- or three-letter codes are used
Ala - A
Cys - C
Asp - D
Glu - E
Phe - F
Gly - G
His - H
Ile - I
Lys - K
Leu - L
Met - M
Asn - N
Pro - P
Gln - Q
Arg - R
Ser - S
Thr - T
Val - V
Trp - W
Tyr - Y
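The code table above can be kept as a small lookup. The following is a minimal sketch; the helper names `to_one_letter` and `to_three_letter` are illustrative and not part of the course material.

```python
# Three-letter to one-letter codes, exactly as listed in the table above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}
# Reverse lookup: one-letter to three-letter codes.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues):
    """Convert a list of three-letter codes (any case) to a one-letter string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


def to_three_letter(sequence, sep="-"):
    """Convert a one-letter sequence to three-letter codes joined by sep."""
    return sep.join(ONE_TO_THREE[c] for c in sequence.upper())


print(to_one_letter(["Met", "Ala", "Trp"]))
print(to_three_letter("NVIF"))
```

An unknown code raises a `KeyError`, which is usually preferable to silently skipping a residue.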
Theory of evolution
Charles Darwin
1809-1882
Phylogenetic tree
Global versus local alignments
Global alignment: align full length of both sequences
(the “Needleman-Wunsch” algorithm).
Global alignment
Local alignment: find best partial alignment of two sequences
(the “Smith-Waterman” algorithm).
Seq 1
Local alignment
Seq 2
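A local alignment can be sketched with the Smith-Waterman recurrence: the same dynamic-programming table as a global alignment, except that cell scores never drop below zero and the traceback starts at the highest-scoring cell. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the linear gap penalty are assumptions made for illustration; real tools normally use a substitution matrix and affine gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # score[i][j] = best local alignment score ending at a[i-1], b[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

In this toy example only the shared core of the two sequences is reported; the unrelated flanks are ignored, which is the point of a local alignment.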
Pairwise alignment: the solution
”Dynamic programming”
(the Needleman-Wunsch algorithm)
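The dynamic-programming idea can be sketched as follows: fill a table where each cell holds the best score for aligning two prefixes, then trace back from the bottom-right corner. This is a minimal sketch with assumed scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and a linear gap penalty, not the exact scheme used in the course exercises.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several alignments share the optimal score, this traceback prefers the diagonal move, so it reports only one of them.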
Sequence alignment - Blast
Blosum & PAM matrices
• Blosum matrices are the most commonly used substitution matrices.
• Blosum50, Blosum62, Blosum80
• PAM - Percent Accepted Mutations
• PAM-0 is the identity matrix.
• PAM-1: diagonal entries deviate slightly from 1, off-diagonal entries deviate slightly from 0
• PAM-250 is PAM-1 multiplied by itself 250 times.
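The relation between PAM-0, PAM-1 and PAM-250 can be illustrated with repeated matrix multiplication. The sketch below uses a made-up 3-letter alphabet and a hypothetical `TOY_PAM1` matrix (each row sums to 1); the real PAM-1 is a 20 x 20 matrix estimated from data, and its values are not reproduced here.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(M)
    # Power 0 gives the identity matrix, which corresponds to PAM-0.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical PAM-1-like mutation probabilities for a toy 3-letter alphabet:
# diagonal close to 1, off-diagonal close to 0, every row sums to 1.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

Each row of the result is still a probability distribution, but the diagonal is much smaller than in PAM-1, reflecting the larger evolutionary distance.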
Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
Log-odds scores
• BLOSUM is a log-likelihood (log-odds) matrix:
• Likelihood of observing j given you have i is
– $P(j|i) = P_{ij}/Q_i$
• The prior likelihood of observing j is
– $Q_j$, which is simply the frequency of j
• The log-likelihood score is
– $S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$
– where $\log_2(x) = \ln(x)/\ln(2)$
– S has been normalized to half bits, hence the factor 2
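The log-odds calculation can be tried on a small example. The sketch below uses a hypothetical two-letter alphabet with made-up pair frequencies (`pair_freq`) and background frequencies (`background`); these numbers are assumptions for illustration and are not taken from any BLOSUM matrix. Published BLOSUM scores are additionally rounded to integers.

```python
import math


def log_odds_half_bits(pair_freq, background):
    """S_ij = 2 * log2(P_ij / (Q_i * Q_j)), i.e. scores in half-bit units."""
    scores = {}
    for (i, j), p_ij in pair_freq.items():
        scores[(i, j)] = 2 * math.log2(p_ij / (background[i] * background[j]))
    return scores


# Hypothetical observed pair frequencies P_ij (they sum to 1).
pair_freq = {
    ("A", "A"): 0.4, ("A", "B"): 0.1,
    ("B", "A"): 0.1, ("B", "B"): 0.4,
}
# Background frequencies Q_i, here the row sums of pair_freq.
background = {"A": 0.5, "B": 0.5}

scores = log_odds_half_bits(pair_freq, background)
for pair, s in sorted(scores.items()):
    print(pair, round(s, 2))
```

Pairs seen more often than expected by chance get a positive score, and pairs seen less often get a negative score.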
BLAST Exercise
Genome browsers - UCSC
Intron - Exon structure
Single Nucleotide polymorphism - SNP
SNPs
Protein 3D-structure
Protein structure
Primary structure: Amino acid sequence
Secondary structure: Helix/Beta sheet
Tertiary structure: Fold, 3D coordinates
Protein structure
3-10 helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
π-helix: 4.1 residues/turn - very rare
Protein structure
β-strand/sheet
Protein folds
Class: the 4th class is ‘few secondary structures’
Architecture: overall shape of a domain
Topology: shared secondary structure connectivity
Protein 3D-structure
Neural Networks
From knowledge to information
Protein sequence
Biological feature
Use of artificial neural networks
• A data-driven method to predict a feature, given a set of training data
• In biology input features could be amino acid sequence or nucleotides
• Secondary structure prediction
• Signal peptide prediction
• Surface accessibility
• Propeptide prediction
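A neural network needs numbers as input, so the amino acid sequence has to be encoded first. One common approach, sketched here as an assumption rather than as the encoding used by the tools in this course, is a sparse (one-hot) encoding of a sliding window centred on the residue to be predicted. The window size of 7 and the extra padding symbol `X` are illustrative choices; the window size is assumed to be odd.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes


def one_hot(residue):
    """Encode one residue as 21 zeros/ones (last slot: padding or unknown)."""
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec


def window_inputs(sequence, window=7):
    """Build one input vector per residue from a window centred on it."""
    half = window // 2
    # Pad both ends so that the first and last residues get full windows.
    padded = "X" * half + sequence + "X" * half
    inputs = []
    for center in range(len(sequence)):
        residues = padded[center:center + window]
        inputs.append([bit for res in residues for bit in one_hot(res)])
    return inputs


inputs = window_inputs("NVIFEDEEKSK", window=7)
print(len(inputs), len(inputs[0]))  # 11 147
```

Each vector would be paired with the known feature of the central residue (for example helix or not helix) to form the training data.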
[Figure: protein from N-terminal to C-terminal: signal peptide, propeptide, mature/active protein]
Prediction of biological features
Surface accessible
Predict surface accessibility from amino acid sequence only.
Logo plots
Information content: how is it calculated, and what does it mean?
Logo plots - Information Content
Calculate Information Content
I = a palog2pa + log2(4), Maximal value is 2 bits
Sequence-logo
Completely conserved
~0.5 each
• Total height at a position is the ‘Information Content’ measured in bits.
• Height of each letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.
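The information content and letter heights of a logo can be computed column by column. This is a minimal sketch for DNA (alphabet size 4) that applies the formula I = sum over a of p_a log2(p_a) + log2(4) directly; it assumes a column without gaps and applies no small-sample correction, both of which real logo tools may handle differently.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """I = sum_a p_a * log2(p_a) + log2(alphabet_size), in bits."""
    total = len(column)
    counts = Counter(column)
    entropy_term = sum((c / total) * math.log2(c / total)
                       for c in counts.values())
    return entropy_term + math.log2(alphabet_size)


def letter_heights(column, alphabet_size=4):
    """Height of each letter = its frequency times the information content."""
    ic = information_content(column, alphabet_size)
    total = len(column)
    return {letter: ic * c / total for letter, c in Counter(column).items()}


print(information_content("AAAA"))  # 2.0 (completely conserved)
print(information_content("ACGT"))  # 0.0 (all four letters equally frequent)
print(letter_heights("AACC"))       # {'A': 0.5, 'C': 0.5}
```

A column with two equally frequent letters carries 1 bit in total, so each letter is drawn about 0.5 bits tall.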