Introduction to bioinformatics I617

Introduction to bioinformatics
(I617)
Haixu Tang
School of Informatics
Email: [email protected]
Office: EIG 1008
Tel: 812-856-1859
Textbook
• A Primer of Genome Science (2nd Edition) by
Greg Gibson and Spencer V. Muse, Sinauer
Associates, 2004
• Suggested reading materials will be posted on
the class wiki page:
http://cheminfo.informatics.indiana.edu/djwild/I617_2006_wiki/index.php/Main_Page
• Office hours: MW 11:00-12:00, EIG 1008, or by
appointment
Grading
• Class project: selected from one of the four
covered areas (bioinformatics, chemical
informatics, laboratory informatics, and
health informatics): 25%
– Suggested Bioinformatics topics will be
posted on the class wiki page
• Homework in bioinformatics: 25%
– 4 assignments, each 6.25%
Bioinformatics = BIOlogy +
informatics?
• Not really: it is a term (somewhat arbitrarily
chosen) for a multi-disciplinary area that
combines life sciences, physical sciences, and
computer science / informatics;
• It addresses biological problems using
theoretical informatics approaches, not vice
versa;
• It is transforming classical biology into an
information science.
The birth of bioinformatics
• A revolution in biology research: the
emergence of Genome Science
• Technology advancement in both biology
and information science
Genome science: a revolution of
biology
• Classical Biology: Hypothesis → Data → Knowledge
(hypothesis-driven approach)
• Genome Science: Data → Hypothesis → Knowledge
(data-driven approach)
Bioinformatics: from data analysis
to data mining
• Classical Biology: Hypothesis → Data
– Low throughput data
– Hypothesis confirmation / rejection
• Genome Science: Data → Hypothesis 1, 2, 3…
– High throughput data
– Hypothesis generation
Bioinformatics: in the driver’s seat
• Classical Biology: Hypothesis → Data → (data analysis) → Knowledge
• Genome Science: Data → (data mining) → Hypothesis → Knowledge
Key technology advancements
• High throughput biotechnologies
– Genome sequencing techniques
– DNA microarray
– Mass spectrometry
• Large-scale experiments
– HGP, HapMap
– Omics / Systems Biology
• Massive data generation, storage, exchange
and analysis
– CPU, storage, etc.
– High speed network (Internet)
– Bioinformatics
Bioinformatics: mutually beneficial
• For biologists
– Fragment assembly in genome sequencing
– Genome comparison
– Gene clustering in DNA microarray analysis
– Protein identification in proteomics
• For computer scientists
– String algorithms / Tree algorithms
– Alternative Eulerian path (BEST theorem)
– Reversal distances
– Probabilistic graphical models (HMMs, BNs, etc.)
Two origins of bioinformatics
• Combinatorial pattern matching in
theoretical computer science
– DNA and protein sequence analysis
• Physical and analytical chemistry of
Biomolecules
– Protein structure analysis → Structural
bioinformatics
– Bio-analytical chemistry → Proteomics
Bioinformatics addresses computational
challenges in life and medical sciences
• New computational problems for automatic data analysis
– Genome sequencing
– Proteomics
– Transcriptomics
• Data representation and visualization
– Genome Browser
• Solving biological problems by in silico approaches
– Reformulation of old problems using new high throughput data
    • Gene finding
    • Protein structure and function
– Formulating new problems using high throughput data
    • Comparative genomics
    • Polymorphisms / Population genetics
    • Systems Biology
Bioinformatics resources
• Databases
– Nucleic Acids Research (NAR) annual database issue
• Organization
– ISCB (International Society for Computational Biology)
• Conferences
– ISMB
– RECOMB
– Many other smaller or regional conferences, e.g.,
ECCB, CSB, and PSB, including the local Indiana
Bioinformatics Conference
A case study
• How does bioinformatics help and transform
classical biological topics?
• Molecular evolutionary studies: from
anatomical features to molecular
evidence
• Genome evolution: comparison of gene
orders
Early Evolutionary Studies
• Anatomical features were the dominant
criteria used to derive evolutionary
relationships between species from
Darwin's time until the early 1960s
• The evolutionary relationships derived
from these relatively subjective
observations were often inconclusive.
Some were later proved incorrect.
Evolution and DNA Analysis:
the Giant Panda Riddle
• For roughly 100 years scientists were unable to
figure out which family the giant panda belongs
to
• Giant pandas look like bears but have features
that are unusual for bears and typical for
raccoons, e.g., they do not hibernate
• In 1985, Stephen O’Brien and colleagues solved
the giant panda classification problem using
DNA sequences and bioinformatics algorithms
Evolutionary Tree of Bears and Raccoons
Evolutionary Trees: DNA-based Approach
• 40 years ago: Emile Zuckerkandl and
Linus Pauling brought reconstructing
evolutionary relationships with DNA into
the spotlight
• In the first few years after Zuckerkandl and
Pauling proposed using DNA for
evolutionary studies, the possibility of
reconstructing evolutionary trees by DNA
analysis was hotly debated
• Now it is a dominant approach to studying
evolution.
Evolutionary Trees
How are these trees built from DNA
sequences?
– leaves represent existing species
– internal vertices represent ancestors
– root represents the common evolutionary
ancestor
Rooted and Unrooted Trees
In an unrooted tree, the position of
the root (“common ancestor”) is
unknown. Otherwise, it is like a
rooted tree.
Distances in Trees
• Edges may have weights reflecting:
– Number of mutations on evolutionary path
from one species to another
– Time estimate for evolution of one species
into another
• In a tree T, we often compute
$d_{ij}(T)$: the length of the path between leaves i and j,
also called the tree distance between i and j
Distance in Trees: an Example
[Figure: weighted tree with leaves i and j]
$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$
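The path-length computation in this example can be sketched in Python. This is only an illustration under stated assumptions: the figure is not available here, so the tree topology, the internal vertex names (`a` to `d`), and the weights of the side branches are hypothetical. They are chosen so that the path between leaves 1 and 4 reproduces the sum 12 + 13 + 14 + 17 + 12.

```python
from collections import defaultdict


def tree_distance(edges, i, j):
    """Return the sum of edge weights on the unique path between i and j."""
    adjacency = defaultdict(list)
    for u, v, weight in edges:
        adjacency[u].append((v, weight))
        adjacency[v].append((u, weight))

    # Depth-first search. A tree has exactly one simple path between any
    # two vertices, so the first time we reach j we have the tree distance.
    stack = [(i, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for neighbor, weight in adjacency[node]:
            if neighbor != parent:
                stack.append((neighbor, node, dist + weight))
    raise ValueError("vertices are not connected")


# Hypothetical tree: leaves are the integers 1..6, internal vertices "a".."d".
edges = [
    (1, "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", 4, 12),
    ("a", 2, 5), ("b", 5, 3), ("c", 6, 6), ("d", 3, 4),  # assumed side branches
]

print(tree_distance(edges, 1, 4))  # 68
```

Because the edge list is undirected, the same function works for rooted and unrooted trees.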
Distance Matrix
• Given n species, we can compute the $n \times n$
distance matrix $D_{ij}$
• $D_{ij}$ may be defined as the edit distance
between a gene in species i and the same gene in species j,
where the gene of interest is sequenced for
all n species.
$D_{ij}$: edit distance between i and j
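As a rough sketch of how such a matrix could be filled in, the Python below computes pairwise edit distances with the standard dynamic-programming recurrence, assuming unit cost for insertions, deletions, and substitutions. The three short sequences are made-up toy data, not real genes, and the function names are illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit (Levenshtein) distance between strings s and t."""
    # prev[b] holds the distance between the processed prefix of s and t[:b].
    prev = list(range(len(t) + 1))
    for a_idx, a in enumerate(s, start=1):
        curr = [a_idx]
        for b_idx, b in enumerate(t, start=1):
            substitution = prev[b_idx - 1] + (0 if a == b else 1)
            deletion = prev[b_idx] + 1
            insertion = curr[b_idx - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def distance_matrix(sequences):
    """Return the n x n matrix of pairwise edit distances."""
    n = len(sequences)
    return [[edit_distance(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]


# Toy sequences standing in for the same gene in three species (assumed data).
genes = ["ACGTAC", "ACGTTC", "AGGTTC"]
D = distance_matrix(genes)
for row in D:
    print(row)
# [0, 1, 2]
# [1, 0, 1]
# [2, 1, 0]
```

The matrix is symmetric with zeros on the diagonal, as expected for a distance.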
Fitting Distance Matrix
• Given n species, we can compute the $n \times n$
distance matrix $D_{ij}$
• The evolution of these genes is described by a
tree that is unknown to us.
• We need an algorithm to construct a tree
that best fits the distance matrix $D_{ij}$
Reconstructing a 3-Leaved Tree
• Tree reconstruction for any 3x3 matrix is
straightforward
• We have 3 leaves i, j, k and a center
vertex c
Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$
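These three equations in three unknowns can be solved directly. Adding the first two and subtracting the third gives twice the length of the edge from i to c, and the other two edges follow by symmetry. A minimal Python sketch follows. The function name and the sample distances are illustrative assumptions.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Return the edge lengths (d_ic, d_jc, d_kc) of the 3-leaved tree."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Assumed example distances between leaves i, j, k.
d_ic, d_jc, d_kc = three_leaf_tree(D_ij=5, D_ik=7, D_jk=8)
print(d_ic, d_jc, d_kc)  # 2.0 3.0 5.0

# The reconstructed edges reproduce the original pairwise distances.
assert d_ic + d_jc == 5 and d_ic + d_kc == 7 and d_jc + d_kc == 8
```

The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality.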
Turnip vs Cabbage: Look and Taste
Different
• Although cabbages and turnips share a
recent common ancestor, they look and
taste different
Turnip vs Cabbage: Comparing Gene Sequences
Yields No Evolutionary Information
Turnip vs Cabbage: Almost Identical
mtDNA gene sequences
• In the 1980s, Jeffrey Palmer studied the
evolution of plant organelles by
comparing the mitochondrial genomes of
cabbage and turnip
• 99% similarity between genes
• These nearly identical gene
sequences differed in gene order
• This study helped pave the way to
analyzing genome rearrangements in
molecular evolution
Turnip vs Cabbage: Different mtDNA Gene Order
• Gene order comparison:
[Figure: gene order before and after]
Evolution is manifested as the divergence in
gene order
Transforming Cabbage into Turnip
Reversal distance
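The reversal distance is the minimum number of reversals needed to turn one gene order into another. As a hedged illustration, the Python sketch below finds it by breadth-first search, which is practical only for very short gene orders. It assumes genes carry an orientation and models a gene order as a signed permutation, where a reversal flips a contiguous segment and negates every gene in it. The example permutation is a toy input, not data taken from these slides.

```python
from collections import deque


def reversals(perm):
    """Yield every signed permutation reachable from perm by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            # Reverse the segment perm[i..j] and flip each gene's orientation.
            segment = tuple(-gene for gene in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def reversal_distance(source, target):
    """Minimum number of reversals transforming source into target (BFS)."""
    source, target = tuple(source), tuple(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        for nxt in reversals(perm):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("target is not a signed permutation of source")


# Toy signed gene order (assumed for illustration).
print(reversal_distance((1, -5, 4, -3, 2), (1, 2, 3, 4, 5)))  # 3
```

Exhaustive search visits up to n! * 2^n gene orders, so this sketch only scales to a handful of genes.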
History of Chromosome X
Rat Consortium, Nature, 2004