Genes - Lyle School of Engineering

Download Report

Transcript Genes - Lyle School of Engineering

Introduction to Bioinformatics
CSE 8331
Spring 2010
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
CSE 8331, Sp2010
1
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
2
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
3
Outline – Topics to be Covered
Working with Genes
 Finding Genes
 Unusual “Genes”
 Microarrays
 Analyzing Gene Expression Data
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
4
Bioinformatics Definition
 “The use of computer science, mathematics, and
information theory to model and analyze biological
systems, especially systems involving genetic material.”
http://www.answers.com/topic/bioinformatics
 "The mathematical, statistical and computing methods
that aim to solve biological problems using DNA and
amino acid sequences and related information."
Fredj Tekaia - http://www.bioinformatics.org/wiki/Bioinformatics
 “Bioinformatics derives knowledge from computer
analysis of biological data”
http://www.pasteur.fr/recherche/unites/Binfs/definition/bioinformatics_definiti
on.html
CSE 8331, Sp2010
5
Related Terms
 Computational Biology – “Biology that involves
computation”
http://www.bioinformatics.org/wiki/Bioinformatics
 Medical Informatics – “the use of computers and
software to solve clinical or health care
problems and the use of algorithms to improve
communication, understanding and
management of medical information”
http://www.mondofacto.com/facts/dictionary?medical+informatics
 Clinical Informatics - “scientific study of the
effective use of information in patient care,
clinical research and medical education”
http://e-healthexpert.org/defclininfo
CSE 8331, Sp2010
6
Performing Biology Experiments
 In vivo – In a living organism
 In vitro – In glass; outside a living
organism; in an artificial environment
 In silico – Using computers
In bioinformatics, experiments are
performed in silico. However they
may be verified in vivo or in vitro.
CSE 8331, Sp2010
7
Central Dogma: DNA -> RNA -> Protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
Amino Acids
CSE 8331, Sp2010
www.bioalgorithms.info; chapter 6; Gene Prediction
8
DNA
 Deoxyribonucleic Acid
 Basic building blocks of
organisms
 Located in nucleus of cells
 Composed of 4 nucleotides




Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
 Two strands bound together
 Contains genetic information
CSE 8331, Sp2010
Image source:
http://www.visionlearning.com/library/m
odule_viewer.php?mid=63
9
DNA Sequence
 Describe 5’ to 3’ order
 Backbone
 Deoxyribose Sugar
 Phosphate group (PO4)
 Hydroxyl group (OH)
 DNA Pairing: A-T and G-C
 DNA Structure
http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structu
re/DNA_Tutorial.HTML#deoxyribose
 DNA Visualization
http://www.youtube.com/watch?v=4PKjF7OumYo
CSE 8331, Sp2010
10
DNA Terms (Eukaryotes) [1]
 Gene – Sequence of DNA that is
transcribed (converted) into protein
(one per 100,000 bp). Less than 5%
codes for protein. (Will look at later)
 Chromosome – Linear string of DNA
(about 100 million bp)
 Genome – Complete identification of
DNA sequence for an organism
(about 670,000 million bp)
CSE 8331, Sp2010
11
Inside the cell
http://www.ams.org/featurecolu
mn/archive/primer1.html
CSE 8331, Sp2010
12
Transcription
 Copying a DNA sequence into equivalent RNA
 T converted to U
 RNA (mRNA) is then free to move out of the
nucleus
 RNA Polymerase – Enzyme that produces mRNA
 Enzyme is itself a protein that causes or assists in
chemical reactions.
CSE 8331, Sp2010
13
Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/
ibbio/chem/nucleic/chpt15/transcripti
on.gif
CSE 8331, Sp2010
14
RNA
Ribonucleic Acid
Contains A,C,G but U (Uracil) instead
of T
Needed to create proteins
Single Stranded
May fold back on itself by pairing
nucleotides.
Hairpin
RNA folding
CSE 8331, Sp2010
15
Translation
Synthesis of Proteins from mRNA
Coding regions are converted into proteins
Nucleotide sequence of mRNA converted
in amino acid sequence of protein
Twenty amino acids
Codon – Group of 3 nucleotides
Amino acids have many codings
http://algoart.com/aatable.htm
Protein – sequence of amino acids
Proteins actually do the work
CSE 8331, Sp2010
16
Biology vs Computer Program
Biology
Nucleotide
DNA
Nucleus
Transcription
RNA
Outside
Nucleus
Translation
Protein
CSE 8331, Sp2010
Program
Machine Instruction
Executable
Disk
Code
Library
Copy executable into
memory
Executable
RAM
Execution
Program in Execution
17
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
CSE 8331, Sp2010
18
Human Genome
Scientists originally thought there would
be about 100,000 genes
Appear to be about 20,000
WHY?
Almost identical to that of Chimps.
What makes the difference?
Answers appear to lie in the noncoding
regions of the DNA (formerly thought to
be junk)
CSE 8331, Sp2010
19
More Questions
If each cell in an organism contains the
same DNA –
 How does each cell behave differently?
 Why do cells behave differently during
childhood development?
 What causes some cells to act differently
– such as during disease?
DNA contains many genes, but only a few
are being transcribed – why?
One answer - miRNA
CSE 8331, Sp2010
20
miRNA
Short (20-25nt) sequence of noncoding RNA
Single strand
Previously assumed to be garbage
Impact/Prevent translation of mRNA
Bind to target areas in mRNA – Problem is
that this binding is not perfect (particularly in
animals)
mRNA may have multiple (nonoverlapping)
binding sites for one miRNA
CSE 8331, Sp2010
21
miRNA Functions
Causes some cancers
Embryo Development
Cell Differentiation
Cell Death
Prevents the production of a
protein that causes lung
cancer
Control brain development in
zebra fish
Associated with HIV
CSE 8331, Sp2010
22
Outline – Topics to be Covered
Introduction
Online Resources
 PubMed/Medline
 BLAST
 ClustalW
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
23
NCBI
 National Center for Biotechnology
Information
http://www.ncbi.nlm.nih.gov/
 Data Mining Tools
http://www.ncbi.nlm.nih.gov/About/tools/restable_d
ata.html
CSE 8331, Sp2010
24
PubMed/MEDLINE
 PubMed - Biomedical citations and
search engine
 MEDLINE – Database containing
articles
 There are many other databases:
 http://www.ncbi.nlm.nih.gov/Databa
se/index.html
 Entrez is the general search engine
connecting to all online NCBI
resources.
CSE 8331, Sp2010
25
Nucleotide Sequence Resources
 International Nucleotide Sequence
Database Collection
http://www.insdc.org/
 http://www.ncbi.nlm.nih.gov/Genbank/index.html
CSE 8331, Sp2010
26
FASTA Format
>Name
CAGTGG
>
 DNA/RNA/Protein Sequence Format
 Default format for BLAST/CLUSTALW
CSE 8331, Sp2010
27
BLAST
 Pairwise Alignment
 Basic Local Alignment Search Tool
www.ncbi.nlm.nih.gov/BLAST/
 BLAST Algorithm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorit
hm.html
 Tutorials
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.ht
ml
http://www.digitalworldbiology.com/BLAST/index.html
CSE 8331, Sp2010
28
ClustalW
 Multiple Sequence Alignment
http://www.ebi.ac.uk/Tools/clustalw2/index.html
 ClustalW and BLAST
https://www.its.caltech.edu/~bi1/files_2009/sections/week9/S
ection_Week9.pdf
CSE 8331, Sp2010
29
Other Sites
 Wormbase
 Data for c elegans and other worms
http://www.wormbase.org/
 Protein Sequences
 Swiss-Prot Database
www.expasy.org/sprot
CSE 8331, Sp2010
30
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
(Previous) String Matching Algorithms
 Comparing Sequences/Alignment
 Sequence Visualization
Working with Genes
Organism Classification
Phylogenetic Trees

CSE 8331, Sp2010
31
String Matching Problem
 Input:
 Pattern – length m
 Text string – length n
 Find one (next, all) occurrences of
string in pattern
 Ex:
 String:
00110011011110010100100111
 Pattern: 011010
CSE 8331, Sp2010
32
String Matching Algorithms
Brute Force
Knuth-Morris Pratt
Boyer Moore
CSE 8331, Sp2010
33
Brute Force String Matching
Brute Force
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html


Space O(m+n)
Time O(mn)
00110011011110010100100111
011010
011010
011010
CSE 8331, Sp2010
34
FSR
CSE 8331, Sp2010
35
Creating FSR
 Create FSM:
 Construct the “correct” spine.
 Add a default “failure bus” to state
0.
 Add a default “initial bus” to state 1.
 For each state, decide its
attachments to failure bus, initial
bus, or other failure links.
CSE 8331, Sp2010
36
Knuth-Morris-Pratt
 Apply FSM to string by processing characters
one at a time.
 Accepting state is reached when pattern is
found.
 Space O(m+n)
 Time O(m+n)
 Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
CSE 8331, Sp2010
37
Boyer-Moore
 Scan pattern from right to left
 Skip many positions on illegal character string.
 O(mn)
 Expected time better than KMP
 Expected behavior better
 Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
CSE 8331, Sp2010
38
String-to-String Correction
 Measure of similarity between strings
 Can be used to determine how to
convert from one string to another
 Cost to convert one to the other
 Transformations



Match: Current characters in both strings
are the same
Delete: Delete current character in input
string
Insert: Insert current character in target
string into string
CSE 8331, Sp2010
39
Distance Between Strings
CSE 8331, Sp2010
40
Approximate String Matching
 Find patterns “close to” the string
 Fuzzy matching
 Applications:
 Spelling checkers
 IR
 Define similarity (distance) between
string and pattern
CSE 8331, Sp2010
41
String Matching vs Sequence
Comparison
 Scalability is an issue with longer sequences
 Exact matching not needed
 Usually need a similarity or distance score
 Some regions more important
 Biological function
 Inheritance
CSE 8331, Sp2010
42
Sequence Search Applications
 Find a target sequence in a larger sequence
 Find binding sites for miRNA
 Finding Motif
 Common subsequence
 Biological function
 Finding Genes
CSE 8331, Sp2010
43
Homologous Sequences
 Evolved from common ancestors
 Similar structure
 May have same function
 Rule of Thumb [1]
 >= 25% amino acids the same
 >= 70% nt overlap for DNA
CSE 8331, Sp2010
44
Alignment
 Finding common DNA/RNA/protein
sequences in multiple organisms
 Local alignment
 Global alignment
 Pairwise alignment
 Multiple sequence alignment
 Dynamic Programming may be used
CSE 8331, Sp2010
45
Alignment cont’d
 Traditionally used before comparing sequences
 Identify maximum similarity between sequences
 EMBOSS
http://www.ebi.ac.uk/Tools/emboss/align/index.html
 Scoring used to determine similarity
 Usually based on probability of pair occuring
during alignment
 Gap penalties
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html

BLOSUM
http://en.wikipedia.org/wiki/BLOSUM
CSE 8331, Sp2010
46
Similarity Scores
 Comparing homologous sequences
 Values of scores may be based on the likelihood
of that mutation
 Substitution
 Insert/Delete
 http://cromatina.icb.ufmg.br/FMG/blast/Eddy.pdf
 Substitution Matrices
CSE 8331, Sp2010
47
Dot Matrix Alignment
 Global & Pairwise
 Dot placed in conversion matrix where matches
 Shows all possible alignments
 Best alignment not shown
 Mismatch – point mutation
 Gaps – indel (insert delete mutations)
 O(n2)
 http://www.dnabaser.com/articles/sequence%20alignment/index.html
 http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes&part=A8863&ren
dertype=figure&id=A8903
CSE 8331, Sp2010
48
Dot Plot
•Visually shows alignment
•Diagonal lines show
matches
http://lectures.molgen.mpg.de/Pairwise/DotPlots/index.html
CSE 8331, Sp2010
49
Dynamic Programming Alignment
 Global - Needleman Wunsch (1970) [5]
 Levenshtein Distance
http://en.wikipedia.org/wiki/Levenshtein_distance
 Local – Smith Waterman (1981) [6]
 Scores for each match, mismatch, gap
 Best alignment is maximum sum of scores.
 Score may be generated by PAM or BLOSUM
 “String Alignment using Dynamic
Programming” by Gina M. Cannarozzi
http://www.biorecipes.com/DynProgBasic/code.html
CSE 8331, Sp2010
50
2-D vs 3-D Alignment Grid
V
W
2-D edit graph
3-D edit graph
CSE 8331, Sp2010
www.bioalgorithms.info; chapter 6; Multiple Alignment
51
Multiple Sequence Alignment
 Align multiple sequences
 Motivation
 By finding large subsequences that
match, you can identify sections that are
functionally similar
 Find gene conservation across species
 Structure prediction for RNA
 Dynamic Programming
CSE 8331, Sp2010
52
Hidden Markov Model
 Many bioinformatics applications
 Alignment algorithms
 RNA structure prediction
 RNA (miRNA) prediction
 Binding prediction
An Introduction to Bioinformatics Algorithms; www.bioalgorithms.info; Ch 11
CSE 8331, Sp2010
53
Edit Graph vs Profile HMM
An Introduction to Bioinformatics Algorithms; www.bioalgorithms.info; Ch 11
CSE 8331, Sp2010
54
BLAST





Basic Local Alignment Search Tool
Most popular alignment tool
Protein or DNA/RNA sequences
Local alignment
Search known sequence databases using
input query
 http://www.ncbi.nlm.nih.gov/blast/
CSE 8331, Sp2010
55
Odds Ratio
 “the odds of an event occurring is the
probability of the event divided by the
probability of an event not occurring.”
(http://stats.org/stories/2008/odds_ratios_april4_2008.html)
 Measures association between two groups
 Ratio of odds for first group and odds for
second group:
OR = (p/(1-p))/(q/(1-q))
 Examples

http://www.socialresearchmethods.net/tutorial/Kim/oddsratio.html
 Value of 1 means each is equally likely
CSE 8331, Sp2010
56
Odds Score
 Estimates how likely these two amino acids
(nucleotides) would occur at same position randomly
 (observed frequency of co-occurrence)/ (expected
frequency)
 Assumes order or position immaterial
 Log odds used so scores are summed
 http://cromatina.icb.ufmg.br/FMG/blast/Eddy.pdf
CSE 8331, Sp2010
57
E-Value
 Significance of similarity alignment value
 Likelihood that sequence similarity is due to
chance.
 Lowest E-value best and most significant.
 “Expectation value. The number of different
alignents with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the
more significant the score. “
(http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html)
CSE 8331, Sp2010
58
Word Alignment
 Search for matched short words then
expand region
 BLAST
 Local Alignment
 Approximate E-Value
 Underestimates
-4 desireable [1]
 Lower than 10
CSE 8331, Sp2010
59
Logo Representation
 Size of letter in sequence indicates
relative frequency.
 Can use to show conservation among
sequences with different letters
possible in each column.
 http://weblogo.berkeley.edu/
CSE 8331, Sp2010
60
RNA Secondary Structure
 Many possible structures
 Stability of structure
depends on strength of
chemical bonds
 Secondary Structure used
to predict tertiary structure
 Prediction of structure
important research area
CSE 8331, Sp2010
http://www.bioinfo.rpi.edu/~zukerm/lectures
/RNAfoldhtml/node1.html#SECTION000110000
00000000000
61
RNA Tertiary Structure
 Three dimensional
shape of molecule
 Determines function
of molecule
 Harder to predict than
secondary structure
http://www.santafe.edu/~pth/rna.html
CSE 8331, Sp2010
62
Predicting RNA Secondary
Structure
http://www.tbi.univie.ac.at/
CSE 8331, Sp2010
 Multiple potential foldings
 GU Wobble
 Techniques
 Deterministic –
optimize chemical
bonding.
 Stochastic - Simulated
annealing
 Alignment
63
RNA Structure
 2D structure
 Watson-Crick pairing
 Predicted based on strength of chemical
bonds.
 http://intranet.cs.man.ac.uk/ai/Software/phas
e/manual/node98.html
CSE 8331, Sp2010
64
16S RNA
http://www.ncbi.nlm.nih.go
v/bookshelf/br.fcgi?book=g
enomes&part=A7594
CSE 8331, Sp2010
65
Protein Structure
 http://www.contexo.info/DNA_Basics/Protein_Structure.htm
 Protein sequence – primary structure
 Secondary structure
 Coil, helical (spring), extended , …
 Often predicted using HMM or NN.

http://bioinf.mpiinf.mpg.de/conferences/ismb99/WWW/TUTORIALS/Brutla
g/brutlag.html
 Tertiary structure
 Functionality dictated by 3D structure
http://chemistry.umeche.maine.edu/MAT500/Proteins10.ht
CSEml
8331, Sp2010
66

Assembling Sequences
 DNA divided into fragments
 Fragments sequenced separately
 Fragments reassembled together
 Shotgun sequence
 500-1000 nt
 Read
 Find overlaps
 http://www.learner.org/courses/mathilluminated/intera
ctives/dna/
 CAP3
http://pbil.univ-lyon1.fr/cap3.php
CSE 8331, Sp2010
67
Chaos Game Representation (CGR)
Scatter plot showing occurrence of
patterns of nucleotides.
CSE 8331, Sp2010
University of the Basque Country
http://insilico.ehu.es/genomics/my_words/
68
FCGR
Chaos Game Representation
 2D technique to visually
(CGR)
see
the distribution of
subpatterns
 Our technique is based on
the following:



Generate totals for
each subpattern
Scale totals to a [0,1]
range. (Note scaling
can be a problem)
Convert range to
red/blue
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
• 0-0.5: White to
Blue
• 0.5-1: Blue to Red
CSE 8331, Sp2010
70
FCGR Example
Homo sapiens – all mature miRNA
Patterns of length 3
UUC
GUG
CSE 8331, Sp2010
72
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Finding Genes
 Unusual “Genes”
 Microarrays
 Analyzing Gene Expression Data
Organism Classification
Phylogenetic Trees

CSE 8331, Sp2010
73
Finding Genes
 Open Reading Frame - area of DNA that
become protein
 Finding ORF
http://www.ncbi.nlm.nih.gov/projects/gorf/
 Start codon: http://en.wikipedia.org/wiki/Start_codon
 Stop codon: http://en.wikipedia.org/wiki/Stop_codon
CSE 8331, Sp2010
74
Gene Prediction: Computational Challe
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatg
acaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccg
atgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgc
ggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaa
gctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctggga
Gene
tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctat
gctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc
tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggg
aatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcg
gctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgct
aagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgaca
atgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgc
ggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttg
gaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttacc
ttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggat
ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactat
gctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
CSE 8331, Sp2010
www.bioalgorithms.info; chapter 6; Gene Prediction
75
Gene Prediction
 Determine where a gene is in the genome
 Start (TAA, TAG, TGA )and stop codons
(ATG)
 Matches to motifs are not exact
 Coding segments (exons) interrupted by
noncoding segments (introns)
 Noncoding make up about 95% of genome
 Techniques
 Statistical distribution of nucleotides in
exons and introns
 Motifs
 Similarity to known genes (conservation
across species)
 Start and stop codons
CSE 8331, Sp2010
76
Motif
 Pattern
 Problems
 Fuzzy Match
 Complementary Sequence
CSE 8331, Sp2010
77
Gene Prediction Tools
 GENSCAN
 HMMs
 GenomeScan
 MIT
 Adds similarity to GENSCAN
 TwinScan
 HMM
 Alignment (similarity)
 Glimmer
 Gene Locator and Interpolated Markov
ModelER
 Markov Chain
 GenMark
 Markov Chain
CSE 8331, Sp2010
78
RNA Interference (RNAi)
 Short non-coding RNAs interfere with gene
expression
 Shorter RNA binds to section of larger RNA
 Binding based on complementarity of sequences
 siRNA
 Double stranded
 Primarily in plants kills expression of gene
 miRNA
 Single stranded
 Stem loop fold
CSE 8331, Sp2010
79
MiRNA Applications
 Finding/Prediction
 No known start/stop codons
 Predicting Function
 Finding Binding Site
CSE 8331, Sp2010
80
Motivation
2000bp Flanking Upstream Region
mir-258.2 in C elegans
a) All 2000 bp
CSE 8331, Sp2010
b) First 240 bp
b) Last 240 bp
81
Temporal CGR (TCGR)
 Temporal version of Frequency CGR

In our context temporal means the starting location of a window
 2D Array

Each Row represents counts for a particular window in sequence
• First row – first window
• Last row – last window
• We start successive windows at the next character location

Each Column represents the counts for the associated pattern in
that window
• Initially we have assumed order of patterns is alphabetic
Size of TCGR depends on sequence length and subpattern lengt
 As sequence lengths vary, we only examine complete windows
 We only count patterns completely contained in each window.

CSE 8331, Sp2010
82
TCGR Example
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
Pos 0-8
Pos 1-9
A
2
1
C
3
3
G
3
3
T
1
2
4
2
1
C
0.6
0.6
G
0.6
0.6
T
0.2
0.4
0.8
0.4
0.2
…
Pos 34-42 2
Pos 0-8
Pos 1-9
A
0.4
0.2
…
Pos 34-42 0.4
CSE 8331, Sp2010
83
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
CSE 8331, Sp2010
84
TCGR Example (cont’d)
ACGT
Window 0: Pos 0-8
Window 1: Pos 1-9
acgtgcacg
cgtgcacgt
Window 17: Pos 17-25
Window 18: Pos 18-26
tccggaacc
ccggaacca
Window 34: Pos 34-42
ccacgtcga
CSE 8331, Sp2010
85
TCGR – Mature miRNA
(Window=5; Pattern=3)
C. elegans
Homo sapiens
Mus musculus
All Mature
CSE 8331, Sp2010
ACG
CGC
GCG
UCG
87
Microarray
http://www.carleton.ca/catalyst/2006s/ima
ges/dk-PersMed3.jpg
 Animation
http://www.bio.davidson.edu/Courses/genomics/chip/chip.html
CSE 8331, Sp2010
88
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
CSE 8331, Sp2010
89
Microarray Data Analysis






Each probe location associated with gene
Measure the amount of mRNA
Color indicates degree of gene expression
Compare different samples (normal/disease)
Track same sample over time
Questions
 Which genes are related to this disease?
 Which genes behave in a similar
manner?
 What is the function of a gene?
 Clustering
 Hierarchical
 K-means
CSE 8331, Sp2010
90
Microarray Clustering
Hierarchical
Clustering of
normal and
cancer tissue
from 11 patients.
http://www.nature.com/app_notes/nmeth/
2007/072606/fig_tab/nmeth1067_F3.html
CSE 8331, Sp2010
91
Microarray Problems
 Noisy data
 Inaccurate data
 Bleeding
 Normalizing readings
 Timing
 Based on snapshot reading
 No continuous values
 Expensive
CSE 8331, Sp2010
92
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
93
Classification
Place an unknown
organism into the
classification hierarchy.
 http://www.infoplease.com/cig/bi
ology/systematics-taxonomyclassification.html
 Excellent Tutorial
http://www.google.com/search?sourceid
=navclient&ie=UTF8&rlz=1T4GGLL_enUS348US348&q=
organism+classification+burgeis
CSE 8331, Sp2010
Wikipedia, “Biological Classification,”
http://en.wikipedia.org/wiki/Biological_c
lassification, 6/24/09.
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
95
Phylogenetic Tree
 Evolutionary Tree
 Hierarchy that shows organism evolution.
 Similar to dendrogram – not same as
classification tree
 Shows organism diversity and similarity.
 Unrooted – no ancestry information
 Rooted
 Closest parents most recent ancesters
 May have edges labeled or length to show
time.
CSE 8331, Sp2010
96
Parsimony
 Build tree to optimize number of
substitutions (minimum) among
sequences
 Candidates trees assigned cost based
on number of substitutions
CSE 8331, Sp2010
97
Construction
 Much like hierarchical classification
 Uses evolutionary distance or
similarity measures
 Can’t measure accuracy!
CSE 8331, Sp2010
98
Bibliography
1.
2.
3.
4.
5.
6.
7.
8.
Jean-Michel Claverie and Cedric Notredame, Bioinformatics for Dummies 2nd Edition, Wiley
Publishing, 2007.
D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and
Computatiotional Biology, University Press, 1997.
P. A. Pevzner, Computational Molecular Biology: An Algorithmic Approach, MIT Press, 2000.
M. S. Waterman, Introduction to Computational Biology, Chapman & Hall, 1995.
S. B. Needleman and C. D. Wunsch, “A General Method Applicable to the Search for Similarities
in the Amino Acid Sequence of Two Proteins,” Journal of Molecular Biology, vol 48, 1970, pp
443-453.
T. F. Smith and M. S. Waterman, “Identification of Common Molecular Subsequences,” Journal
of Molecular Biology, vol 147, 1981, pp 195-197.
David W. Mount, Bioinformatics Sequence and Genome Analysis, Cold Spring Harbor
Laboratory Press, 2001.
Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison, Biological Sequence
Analysis Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
CSE 8331, Sp2010
99
Online Sequence Data
Resources
 GenBank
 National Center for Biotechnology
Information (NCBI)
 Biological sequence data
 www.ncbi.nih.gov/genbank
 DNA Data Bank of Japan (DDBJ)
 http://www.ddbj.nig.ac.jp/
 EMBL Nucleotide Sequence Database
 European Bioinformatics Institute
 http://www.ebi.ac.uk/embl/
CSE 8331, Sp2010
100
Genome Sites
 WormBase
 www.wormbase.org
 Ensemble Genome Browser
 www.ensembl.org
 UCSC Genome Browser
 www.genome.ucsc.edu
 AceDB
 www.acedb.org
 Microbial Data
 www.tigr.org/tigrscripts/CMR2/CMRHomePage.spl
 FlyBase
 http://flybase.bio.indiana.edu
CSE 8331, Sp2010
101
Gene Databases
 RefSeq

mRNA sequences

www.ncbi.nlm.nih.gov/RefSeq/
 Transcribed Sequences

www.allgenes.org
 GDB

Human Genome database

www.gdb.org
 miRNA Registry

http://microrna.sanger.ac.uk/sequences/
 Stanford Microarray Database

http://genome-www.stanford.edu/microarray/
 KEGG

www.genome.ad.jp/kegg/
CSE 8331, Sp2010
102
Protein Databases
 InterPro

Protein families

www.ebi.ac.uk/interpro
 PIR

Protein Information Resource

http://pir.georgetown.edu
 SWISS-PROT/TrEMBL

www.expasy.ch/sprot
 UniProt

www.expasy.uniprot.org/index.shtml
 PDB

Protein Data Bank

http://www.rcsb.org/pdb/home/home.do
CSE 8331, Sp2010
103
Other Useful Links
 Bioinformatics Analysis Tools Links

http://www.geocities.com/bioinformaticsweb/toollink.html
 Molecular Visualization Resources

http://www.umass.edu/microbio/chime/
 Links from Washington University

http://ccb.wustl.edu/indexfor2ndpage/links.html
 ISCB (International Society for Computational Biology)

http://www.iscb.org
 NCBI (National Center for Biotechnology Information)

http://ncbi.nlm.nih.gov
 http://www.bioinformatics.org
 Online Bioinformatics Courses

http://lectures.molgen.mpg.de/online_lectures.html

www.bioalgorithms.info with An Introduction to
Bioinformatics Algorithms
CSE 8331, Sp2010
104