Genes - Lyle School of Engineering
Download
Report
Transcript Genes - Lyle School of Engineering
Introduction to Bioinformatics
CSE 8331
Spring 2010
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]
CSE 8331, Sp2010
1
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
2
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
3
Outline – Topics to be Covered
Working with Genes
Finding Genes
Unusual “Genes”
Microarrays
Analyzing Gene Expression Data
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
4
Bioinformatics Definition
“The use of computer science, mathematics, and
information theory to model and analyze biological
systems, especially systems involving genetic material.”
http://www.answers.com/topic/bioinformatics
"The mathematical, statistical and computing methods
that aim to solve biological problems using DNA and
amino acid sequences and related information."
Fredj Tekaia - http://www.bioinformatics.org/wiki/Bioinformatics
“Bioinformatics derives knowledge from computer
analysis of biological data”
http://www.pasteur.fr/recherche/unites/Binfs/definition/bioinformatics_definiti
on.html
CSE 8331, Sp2010
5
Related Terms
Computational Biology – “Biology that involves
computation”
http://www.bioinformatics.org/wiki/Bioinformatics
Medical Informatics – “the use of computers and
software to solve clinical or health care
problems and the use of algorithms to improve
communication, understanding and
management of medical information”
http://www.mondofacto.com/facts/dictionary?medical+informatics
Clinical Informatics - “scientific study of the
effective use of information in patient care,
clinical research and medical education”
http://e-healthexpert.org/defclininfo
CSE 8331, Sp2010
6
Performing Biology Experiments
In vivo – In a living organism
In vitro – In glass; outside a living
organism; in an artificial environment
In silico – Using computers
In bioinformatics, experiments are
performed in silico. However they
may be verified in vivo or in vitro.
CSE 8331, Sp2010
7
Central Dogma: DNA -> RNA -> Protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
Amino Acids
CSE 8331, Sp2010
www.bioalgorithms.info; chapter 6; Gene Prediction
8
DNA
Deoxyribonucleic Acid
Basic building blocks of
organisms
Located in nucleus of cells
Composed of 4 nucleotides
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
Two strands bound together
Contains genetic information
CSE 8331, Sp2010
Image source:
http://www.visionlearning.com/library/m
odule_viewer.php?mid=63
9
DNA Sequence
Describe 5’ to 3’ order
Backbone
Deoxyribose Sugar
Phosphate group (PO4)
Hydroxyl group (OH)
DNA Pairing: A-T and G-C
DNA Structure
http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structu
re/DNA_Tutorial.HTML#deoxyribose
DNA Visualization
http://www.youtube.com/watch?v=4PKjF7OumYo
CSE 8331, Sp2010
10
DNA Terms (Eukaryotes) [1]
Gene – Sequence of DNA that is
transcribed (converted) into protein
(one per 100,000 bp). Less than 5%
codes for protein. (Will look at later)
Chromosome – Linear string of DNA
(about 100 million bp)
Genome – Complete identification of
DNA sequence for an organism
(about 670,000 million bp)
CSE 8331, Sp2010
11
Inside the cell
http://www.ams.org/featurecolu
mn/archive/primer1.html
CSE 8331, Sp2010
12
Transcription
Copying a DNA sequence into equivalent RNA
T converted to U
RNA (mRNA) is then free to move out of the
nucleus
RNA Polymerase – Enzyme that produces mRNA
Enzyme is itself a protein that causes or assists in
chemical reactions.
CSE 8331, Sp2010
13
Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/
ibbio/chem/nucleic/chpt15/transcripti
on.gif
CSE 8331, Sp2010
14
RNA
Ribonucleic Acid
Contains A,C,G but U (Uracil) instead
of T
Needed to create proteins
Single Stranded
May fold back on itself by pairing
nucleotides.
Hairpin
RNA folding
CSE 8331, Sp2010
15
Translation
Synthesis of Proteins from mRNA
Coding regions are converted into proteins
Nucleotide sequence of mRNA converted
in amino acid sequence of protein
Twenty amino acids
Codon – Group of 3 nucleotides
Amino acids have many codings
http://algoart.com/aatable.htm
Protein – sequence of amino acids
Proteins actually do the work
CSE 8331, Sp2010
16
Biology vs Computer Program
Biology
Nucleotide
DNA
Nucleus
Transcription
RNA
Outside
Nucleus
Translation
Protein
CSE 8331, Sp2010
Program
Machine Instruction
Executable
Disk
Code
Library
Copy executable into
memory
Executable
RAM
Execution
Program in Execution
17
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
CSE 8331, Sp2010
18
Human Genome
Scientists originally thought there would
be about 100,000 genes
Appear to be about 20,000
WHY?
Almost identical to that of Chimps.
What makes the difference?
Answers appear to lie in the noncoding
regions of the DNA (formerly thought to
be junk)
CSE 8331, Sp2010
19
More Questions
If each cell in an organism contains the
same DNA –
How does each cell behave differently?
Why do cells behave differently during
childhood development?
What causes some cells to act differently
– such as during disease?
DNA contains many genes, but only a few
are being transcribed – why?
One answer - miRNA
CSE 8331, Sp2010
20
miRNA
Short (20-25nt) sequence of noncoding RNA
Single strand
Previously assumed to be garbage
Impact/Prevent translation of mRNA
Bind to target areas in mRNA – Problem is
that this binding is not perfect (particularly in
animals)
mRNA may have multiple (nonoverlapping)
binding sites for one miRNA
CSE 8331, Sp2010
21
miRNA Functions
Causes some cancers
Embryo Development
Cell Differentiation
Cell Death
Prevents the production of a
protein that causes lung
cancer
Control brain development in
zebra fish
Associated with HIV
CSE 8331, Sp2010
22
Outline – Topics to be Covered
Introduction
Online Resources
PubMed/Medline
BLAST
ClustalW
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
23
NCBI
National Center for Biotechnology
Information
http://www.ncbi.nlm.nih.gov/
Data Mining Tools
http://www.ncbi.nlm.nih.gov/About/tools/restable_d
ata.html
CSE 8331, Sp2010
24
PubMed/MEDLINE
PubMed - Biomedical citations and
search engine
MEDLINE – Database containing
articles
There are many other databases:
http://www.ncbi.nlm.nih.gov/Databa
se/index.html
Entrez is the general search engine
connecting to all online NCBI
resources.
CSE 8331, Sp2010
25
Nucleotide Sequence Resources
International Nucleotide Sequence
Database Collection
http://www.insdc.org/
http://www.ncbi.nlm.nih.gov/Genbank/index.html
CSE 8331, Sp2010
26
FASTA Format
>Name
CAGTGG
>
DNA/RNA/Protein Sequence Format
Default format for BLAST/CLUSTALW
CSE 8331, Sp2010
27
BLAST
Pairwise Alignment
Basic Local Alignment Search Tool
www.ncbi.nlm.nih.gov/BLAST/
BLAST Algorithm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorit
hm.html
Tutorials
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.ht
ml
http://www.digitalworldbiology.com/BLAST/index.html
CSE 8331, Sp2010
28
ClustalW
Multiple Sequence Alignment
http://www.ebi.ac.uk/Tools/clustalw2/index.html
ClustalW and BLAST
https://www.its.caltech.edu/~bi1/files_2009/sections/week9/S
ection_Week9.pdf
CSE 8331, Sp2010
29
Other Sites
Wormbase
Data for c elegans and other worms
http://www.wormbase.org/
Protein Sequences
Swiss-Prot Database
www.expasy.org/sprot
CSE 8331, Sp2010
30
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
(Previous) String Matching Algorithms
Comparing Sequences/Alignment
Sequence Visualization
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
31
String Matching Problem
Input:
Pattern – length m
Text string – length n
Find one (next, all) occurrences of
string in pattern
Ex:
String:
00110011011110010100100111
Pattern: 011010
CSE 8331, Sp2010
32
String Matching Algorithms
Brute Force
Knuth-Morris Pratt
Boyer Moore
CSE 8331, Sp2010
33
Brute Force String Matching
Brute Force
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html
Space O(m+n)
Time O(mn)
00110011011110010100100111
011010
011010
011010
CSE 8331, Sp2010
34
FSR
CSE 8331, Sp2010
35
Creating FSR
Create FSM:
Construct the “correct” spine.
Add a default “failure bus” to state
0.
Add a default “initial bus” to state 1.
For each state, decide its
attachments to failure bus, initial
bus, or other failure links.
CSE 8331, Sp2010
36
Knuth-Morris-Pratt
Apply FSM to string by processing characters
one at a time.
Accepting state is reached when pattern is
found.
Space O(m+n)
Time O(m+n)
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
CSE 8331, Sp2010
37
Boyer-Moore
Scan pattern from right to left
Skip many positions on illegal character string.
O(mn)
Expected time better than KMP
Expected behavior better
Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
CSE 8331, Sp2010
38
String-to-String Correction
Measure of similarity between strings
Can be used to determine how to
convert from one string to another
Cost to convert one to the other
Transformations
Match: Current characters in both strings
are the same
Delete: Delete current character in input
string
Insert: Insert current character in target
string into string
CSE 8331, Sp2010
39
Distance Between Strings
CSE 8331, Sp2010
40
Approximate String Matching
Find patterns “close to” the string
Fuzzy matching
Applications:
Spelling checkers
IR
Define similarity (distance) between
string and pattern
CSE 8331, Sp2010
41
String Matching vs Sequence
Comparison
Scalability is an issue with longer sequences
Exact matching not needed
Usually need a similarity or distance score
Some regions more important
Biological function
Inheritance
CSE 8331, Sp2010
42
Sequence Search Applications
Find a target sequence in a larger sequence
Find binding sites for miRNA
Finding Motif
Common subsequence
Biological function
Finding Genes
CSE 8331, Sp2010
43
Homologous Sequences
Evolved from common ancestors
Similar structure
May have same function
Rule of Thumb [1]
>= 25% amino acids the same
>= 70% nt overlap for DNA
CSE 8331, Sp2010
44
Alignment
Finding common DNA/RNA/protein
sequences in multiple organisms
Local alignment
Global alignment
Pairwise alignment
Multiple sequence alignment
Dynamic Programming may be used
CSE 8331, Sp2010
45
Alignment cont’d
Traditionally used before comparing sequences
Identify maximum similarity between sequences
EMBOSS
http://www.ebi.ac.uk/Tools/emboss/align/index.html
Scoring used to determine similarity
Usually based on probability of pair occuring
during alignment
Gap penalties
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
BLOSUM
http://en.wikipedia.org/wiki/BLOSUM
CSE 8331, Sp2010
46
Similarity Scores
Comparing homologous sequences
Values of scores may be based on the likelihood
of that mutation
Substitution
Insert/Delete
http://cromatina.icb.ufmg.br/FMG/blast/Eddy.pdf
Substitution Matrices
CSE 8331, Sp2010
47
Dot Matrix Alignment
Global & Pairwise
Dot placed in conversion matrix where matches
Shows all possible alignments
Best alignment not shown
Mismatch – point mutation
Gaps – indel (insert delete mutations)
O(n2)
http://www.dnabaser.com/articles/sequence%20alignment/index.html
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes&part=A8863&ren
dertype=figure&id=A8903
CSE 8331, Sp2010
48
Dot Plot
•Visually shows alignment
•Diagonal lines show
matches
http://lectures.molgen.mpg.de/Pairwise/DotPlots/index.html
CSE 8331, Sp2010
49
Dynamic Programming Alignment
Global - Needleman Wunsch (1970) [5]
Levenshtein Distance
http://en.wikipedia.org/wiki/Levenshtein_distance
Local – Smith Waterman (1981) [6]
Scores for each match, mismatch, gap
Best alignment is maximum sum of scores.
Score may be generated by PAM or BLOSUM
“String Alignment using Dynamic
Programming” by Gina M. Cannarozzi
http://www.biorecipes.com/DynProgBasic/code.html
CSE 8331, Sp2010
50
2-D vs 3-D Alignment Grid
V
W
2-D edit graph
3-D edit graph
CSE 8331, Sp2010
www.bioalgorithms.info; chapter 6; Multiple Alignment
51
Multiple Sequence Alignment
Align multiple sequences
Motivation
By finding large subsequences that
match, you can identify sections that are
functionally similar
Find gene conservation across species
Structure prediction for RNA
Dynamic Programming
CSE 8331, Sp2010
52
Hidden Markov Model
Many bioinformatics applications
Alignment algorithms
RNA structure prediction
RNA (miRNA) prediction
Binding prediction
An Introduction to Bioinformatics Algorithms; www.bioalgorithms.info; Ch 11
CSE 8331, Sp2010
53
Edit Graph vs Profile HMM
An Introduction to Bioinformatics Algorithms; www.bioalgorithms.info; Ch 11
CSE 8331, Sp2010
54
BLAST
Basic Local Alignment Search Tool
Most popular alignment tool
Protein or DNA/RNA sequences
Local alignment
Search known sequence databases using
input query
http://www.ncbi.nlm.nih.gov/blast/
CSE 8331, Sp2010
55
Odds Ratio
“the odds of an event occurring is the
probability of the event divided by the
probability of an event not occurring.”
(http://stats.org/stories/2008/odds_ratios_april4_2008.html)
Measures association between two groups
Ratio of odds for first group and odds for
second group:
OR = (p/(1-p))/(q/(1-q))
Examples
http://www.socialresearchmethods.net/tutorial/Kim/oddsratio.html
Value of 1 means each is equally likely
CSE 8331, Sp2010
56
Odds Score
Estimates how likely these two amino acids
(nucleotides) would occur at same position randomly
(observed frequency of co-occurrence)/ (expected
frequency)
Assumes order or position immaterial
Log odds used so scores are summed
http://cromatina.icb.ufmg.br/FMG/blast/Eddy.pdf
CSE 8331, Sp2010
57
E-Value
Significance of similarity alignment value
Likelihood that sequence similarity is due to
chance.
Lowest E-value best and most significant.
“Expectation value. The number of different
alignents with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the
more significant the score. “
(http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html)
CSE 8331, Sp2010
58
Word Alignment
Search for matched short words then
expand region
BLAST
Local Alignment
Approximate E-Value
Underestimates
-4 desireable [1]
Lower than 10
CSE 8331, Sp2010
59
Logo Representation
Size of letter in sequence indicates
relative frequency.
Can use to show conservation among
sequences with different letters
possible in each column.
http://weblogo.berkeley.edu/
CSE 8331, Sp2010
60
RNA Secondary Structure
Many possible structures
Stability of structure
depends on strength of
chemical bonds
Secondary Structure used
to predict tertiary structure
Prediction of structure
important research area
CSE 8331, Sp2010
http://www.bioinfo.rpi.edu/~zukerm/lectures
/RNAfoldhtml/node1.html#SECTION000110000
00000000000
61
RNA Tertiary Structure
Three dimensional
shape of molecule
Determines function
of molecule
Harder to predict than
secondary structure
http://www.santafe.edu/~pth/rna.html
CSE 8331, Sp2010
62
Predicting RNA Secondary
Structure
http://www.tbi.univie.ac.at/
CSE 8331, Sp2010
Multiple potential foldings
GU Wobble
Techniques
Deterministic –
optimize chemical
bonding.
Stochastic - Simulated
annealing
Alignment
63
RNA Structure
2D structure
Watson-Crick pairing
Predicted based on strength of chemical
bonds.
http://intranet.cs.man.ac.uk/ai/Software/phas
e/manual/node98.html
CSE 8331, Sp2010
64
16S RNA
http://www.ncbi.nlm.nih.go
v/bookshelf/br.fcgi?book=g
enomes&part=A7594
CSE 8331, Sp2010
65
Protein Structure
http://www.contexo.info/DNA_Basics/Protein_Structure.htm
Protein sequence – primary structure
Secondary structure
Coil, helical (spring), extended , …
Often predicted using HMM or NN.
http://bioinf.mpiinf.mpg.de/conferences/ismb99/WWW/TUTORIALS/Brutla
g/brutlag.html
Tertiary structure
Functionality dictated by 3D structure
http://chemistry.umeche.maine.edu/MAT500/Proteins10.ht
CSEml
8331, Sp2010
66
Assembling Sequences
DNA divided into fragments
Fragments sequenced separately
Fragments reassembled together
Shotgun sequence
500-1000 nt
Read
Find overlaps
http://www.learner.org/courses/mathilluminated/intera
ctives/dna/
CAP3
http://pbil.univ-lyon1.fr/cap3.php
CSE 8331, Sp2010
67
Chaos Game Representation (CGR)
Scatter plot showing occurrence of
patterns of nucleotides.
CSE 8331, Sp2010
University of the Basque Country
http://insilico.ehu.es/genomics/my_words/
68
FCGR
Chaos Game Representation
2D technique to visually
(CGR)
see
the distribution of
subpatterns
Our technique is based on
the following:
Generate totals for
each subpattern
Scale totals to a [0,1]
range. (Note scaling
can be a problem)
Convert range to
red/blue
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
• 0-0.5: White to
Blue
• 0.5-1: Blue to Red
CSE 8331, Sp2010
70
FCGR Example
Homo sapiens – all mature miRNA
Patterns of length 3
UUC
GUG
CSE 8331, Sp2010
72
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Finding Genes
Unusual “Genes”
Microarrays
Analyzing Gene Expression Data
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
73
Finding Genes
Open Reading Frame - area of DNA that
become protein
Finding ORF
http://www.ncbi.nlm.nih.gov/projects/gorf/
Start codon: http://en.wikipedia.org/wiki/Start_codon
Stop codon: http://en.wikipedia.org/wiki/Stop_codon
CSE 8331, Sp2010
74
Gene Prediction: Computational Challe
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatg
acaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccg
atgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgc
ggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaa
gctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctggga
Gene
tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctat
gctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc
tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggg
aatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcg
gctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgct
aagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgaca
atgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgc
ggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttg
gaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttacc
ttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggat
ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactat
gctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
CSE 8331, Sp2010
www.bioalgorithms.info; chapter 6; Gene Prediction
75
Gene Prediction
Determine where a gene is in the genome
Start (TAA, TAG, TGA )and stop codons
(ATG)
Matches to motifs are not exact
Coding segments (exons) interrupted by
noncoding segments (introns)
Noncoding make up about 95% of genome
Techniques
Statistical distribution of nucleotides in
exons and introns
Motifs
Similarity to known genes (conservation
across species)
Start and stop codons
CSE 8331, Sp2010
76
Motif
Pattern
Problems
Fuzzy Match
Complementary Sequence
CSE 8331, Sp2010
77
Gene Prediction Tools
GENSCAN
HMMs
GenomeScan
MIT
Adds similarity to GENSCAN
TwinScan
HMM
Alignment (similarity)
Glimmer
Gene Locator and Interpolated Markov
ModelER
Markov Chain
GenMark
Markov Chain
CSE 8331, Sp2010
78
RNA Interference (RNAi)
Short non-coding RNAs interfere with gene
expression
Shorter RNA binds to section of larger RNA
Binding based on complementarity of sequences
siRNA
Double stranded
Primarily in plants kills expression of gene
miRNA
Single stranded
Stem loop fold
CSE 8331, Sp2010
79
MiRNA Applications
Finding/Prediction
No known start/stop codons
Predicting Function
Finding Binding Site
CSE 8331, Sp2010
80
Motivation
2000bp Flanking Upstream Region
mir-258.2 in C elegans
a) All 2000 bp
CSE 8331, Sp2010
b) First 240 bp
b) Last 240 bp
81
Temporal CGR (TCGR)
Temporal version of Frequency CGR
In our context temporal means the starting location of a window
2D Array
Each Row represents counts for a particular window in sequence
• First row – first window
• Last row – last window
• We start successive windows at the next character location
Each Column represents the counts for the associated pattern in
that window
• Initially we have assumed order of patterns is alphabetic
Size of TCGR depends on sequence length and subpattern lengt
As sequence lengths vary, we only examine complete windows
We only count patterns completely contained in each window.
CSE 8331, Sp2010
82
TCGR Example
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
Pos 0-8
Pos 1-9
A
2
1
C
3
3
G
3
3
T
1
2
4
2
1
C
0.6
0.6
G
0.6
0.6
T
0.2
0.4
0.8
0.4
0.2
…
Pos 34-42 2
Pos 0-8
Pos 1-9
A
0.4
0.2
…
Pos 34-42 0.4
CSE 8331, Sp2010
83
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
CSE 8331, Sp2010
84
TCGR Example (cont’d)
ACGT
Window 0: Pos 0-8
Window 1: Pos 1-9
acgtgcacg
cgtgcacgt
Window 17: Pos 17-25
Window 18: Pos 18-26
tccggaacc
ccggaacca
Window 34: Pos 34-42
ccacgtcga
CSE 8331, Sp2010
85
TCGR – Mature miRNA
(Window=5; Pattern=3)
C. elegans
Homo sapiens
Mus musculus
All Mature
CSE 8331, Sp2010
ACG
CGC
GCG
UCG
87
Microarray
http://www.carleton.ca/catalyst/2006s/ima
ges/dk-PersMed3.jpg
Animation
http://www.bio.davidson.edu/Courses/genomics/chip/chip.html
CSE 8331, Sp2010
88
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
CSE 8331, Sp2010
89
Microarray Data Analysis
Each probe location associated with gene
Measure the amount of mRNA
Color indicates degree of gene expression
Compare different samples (normal/disease)
Track same sample over time
Questions
Which genes are related to this disease?
Which genes behave in a similar
manner?
What is the function of a gene?
Clustering
Hierarchical
K-means
CSE 8331, Sp2010
90
Microarray Clustering
Hierarchical
Clustering of
normal and
cancer tissue
from 11 patients.
http://www.nature.com/app_notes/nmeth/
2007/072606/fig_tab/nmeth1067_F3.html
CSE 8331, Sp2010
91
Microarray Problems
Noisy data
Inaccurate data
Bleeding
Normalizing readings
Timing
Based on snapshot reading
No continuous values
Expensive
CSE 8331, Sp2010
92
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
93
Classification
Place an unknown
organism into the
classification hierarchy.
http://www.infoplease.com/cig/bi
ology/systematics-taxonomyclassification.html
Excellent Tutorial
http://www.google.com/search?sourceid
=navclient&ie=UTF8&rlz=1T4GGLL_enUS348US348&q=
organism+classification+burgeis
CSE 8331, Sp2010
Wikipedia, “Biological Classification,”
http://en.wikipedia.org/wiki/Biological_c
lassification, 6/24/09.
Outline – Topics to be Covered
Introduction
Online Resources
Sequence Analysis
Working with Genes
Organism Classification
Phylogenetic Trees
CSE 8331, Sp2010
95
Phylogenetic Tree
Evolutionary Tree
Hierarchy that shows organism evolution.
Similar to dendrogram – not same as
classification tree
Shows organism diversity and similarity.
Unrooted – no ancestry information
Rooted
Closest parents most recent ancesters
May have edges labeled or length to show
time.
CSE 8331, Sp2010
96
Parsimony
Build tree to optimize number of
substitutions (minimum) among
sequences
Candidates trees assigned cost based
on number of substitutions
CSE 8331, Sp2010
97
Construction
Much like hierarchical classification
Uses evolutionary distance or
similarity measures
Can’t measure accuracy!
CSE 8331, Sp2010
98
Bibliography
1.
2.
3.
4.
5.
6.
7.
8.
Jean-Michel Claverie and Cedric Notredame, Bioinformatics for Dummies 2nd Edition, Wiley
Publishing, 2007.
D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and
Computatiotional Biology, University Press, 1997.
P. A. Pevzner, Computational Molecular Biology: An Algorithmic Approach, MIT Press, 2000.
M. S. Waterman, Introduction to Computational Biology, Chapman & Hall, 1995.
S. B. Needleman and C. D. Wunsch, “A General Method Applicable to the Search for Similarities
in the Amino Acid Sequence of Two Proteins,” Journal of Molecular Biology, vol 48, 1970, pp
443-453.
T. F. Smith and M. S. Waterman, “Identification of Common Molecular Subsequences,” Journal
of Molecular Biology, vol 147, 1981, pp 195-197.
David W. Mount, Bioinformatics Sequence and Genome Analysis, Cold Spring Harbor
Laboratory Press, 2001.
Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison, Biological Sequence
Analysis Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
CSE 8331, Sp2010
99
Online Sequence Data
Resources
GenBank
National Center for Biotechnology
Information (NCBI)
Biological sequence data
www.ncbi.nih.gov/genbank
DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
EMBL Nucleotide Sequence Database
European Bioinformatics Institute
http://www.ebi.ac.uk/embl/
CSE 8331, Sp2010
100
Genome Sites
WormBase
www.wormbase.org
Ensemble Genome Browser
www.ensembl.org
UCSC Genome Browser
www.genome.ucsc.edu
AceDB
www.acedb.org
Microbial Data
www.tigr.org/tigrscripts/CMR2/CMRHomePage.spl
FlyBase
http://flybase.bio.indiana.edu
CSE 8331, Sp2010
101
Gene Databases
RefSeq
mRNA sequences
www.ncbi.nlm.nih.gov/RefSeq/
Transcribed Sequences
www.allgenes.org
GDB
Human Genome database
www.gdb.org
miRNA Registry
http://microrna.sanger.ac.uk/sequences/
Stanford Microarray Database
http://genome-www.stanford.edu/microarray/
KEGG
www.genome.ad.jp/kegg/
CSE 8331, Sp2010
102
Protein Databases
InterPro
Protein families
www.ebi.ac.uk/interpro
PIR
Protein Information Resource
http://pir.georgetown.edu
SWISS-PROT/TrEMBL
www.expasy.ch/sprot
UniProt
www.expasy.uniprot.org/index.shtml
PDB
Protein Data Bank
http://www.rcsb.org/pdb/home/home.do
CSE 8331, Sp2010
103
Other Useful Links
Bioinformatics Analysis Tools Links
http://www.geocities.com/bioinformaticsweb/toollink.html
Molecular Visualization Resources
http://www.umass.edu/microbio/chime/
Links from Washington University
http://ccb.wustl.edu/indexfor2ndpage/links.html
ISCB (International Society for Computational Biology)
http://www.iscb.org
NCBI (National Center for Biotechnology Information)
http://ncbi.nlm.nih.gov
http://www.bioinformatics.org
Online Bioinformatics Courses
http://lectures.molgen.mpg.de/online_lectures.html
www.bioalgorithms.info with An Introduction to
Bioinformatics Algorithms
CSE 8331, Sp2010
104