Overview of bioinformatics


Tentative definition of bioinformatics
Bioinformatics, often also called genomics,
computational genomics, or computational biology, is a
new interdisciplinary field at the intersection of biology,
computer science, statistics, and mathematics. Its
subject matter is the extraction of biologically useful
information from large sets of molecular data, such as
DNA or protein sequence data or gene expression data.
The term “bioinformatics” is currently used mainly to
refer to the extraction of information from sequence
data, while the creation and analysis of gene expression
data is called functional genomics.
Biology’s dilemma: There is too much to know about living things
Roughly 1.5 million species of organisms have been
described and given scientific names to date. Some
biologists estimate that the total number of all living
species may be several times higher. It is impossible to
learn everything about all these organisms. Biologists
solve the dilemma by focusing on some species, so-called
model organisms, and trying to find out as much as they
can about these model organisms.
Some important model organisms
Mammals: Human, chimpanzee, mouse, rat
Fish: Zebrafish, Pufferfish
Insects: Fruitfly (Drosophila melanogaster)
Roundworms: Caenorhabditis elegans
Protista: Malaria parasite (Plasmodium falciparum)
Fungi: Baker’s yeast (Saccharomyces cerevisiae)
Plants: Thale cress (Arabidopsis thaliana), corn, rice
Bacteria: Escherichia coli, Mycoplasma genitalium
Archaea: Methanococcus jannaschii
Let’s find out everything about some species
What would it mean to learn everything about a given
species? All available evidence indicates that the complete
blueprint for making an organism is encoded in the
organism’s genome. Chemically, the genome consists of
one or several DNA molecules. These are long strings
composed of pairs of nucleotides. There are only four
different nucleotides, denoted by A, C, G, T. The
information about how to make the organism is encoded
by the order in which the nucleotides appear.
Some genome sizes
HIV2 virus                   9671 bp
Mycoplasma genitalium        5.8 · 10^5 bp
Haemophilus influenzae       1.83 · 10^6 bp
Saccharomyces cerevisiae     1.21 · 10^7 bp
Caenorhabditis elegans       10^8 bp
Drosophila melanogaster      1.65 · 10^8 bp
Homo sapiens                 3.14 · 10^9 bp
Some amphibians              8 · 10^10 bp
Amoeba dubia                 6.7 · 10^11 bp
Sequencing Genomes
Contemporary technology makes it possible to completely
sequence entire genomes, that is, determine the sequence
of A’s, C’s, G’s, and T’s in the organism’s genome. The
first viral genomes were sequenced in the 1970s, the first
bacterium (Haemophilus influenzae) in 1995, the first
multicellular organism (Caenorhabditis elegans) in 1998.
A draft of the human genome was announced in 2000.
Where to store all these data?
In databases, of course. Some of the sequence data are
stored in proprietary databases, but most of them are
stored in the public database Genbank and can be
accessed via the World Wide Web. In fact, most relevant
journals require proof of submission to Genbank before an
article discussing sequence data will be published.
The URL for Genbank is:
http://www.ncbi.nlm.nih.gov/Genbank/
What’s in the databases?
In 1981, Genbank contained less than 500,000 bp of info.
In 1986, Genbank contained 9,615,371 bp of info.
In 1991, Genbank contained 71,947,426 bp of info.
In 1996, Genbank contained 651,972,984 bp of info.
In 2001, Genbank contained 15,849,921,438 bp of info.
In 2004, Genbank contained 37,893,844,733 bp of info.
In 2009, Genbank contained 106,533,156,756 bp of info.
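The growth figures above can be turned into an average doubling time. The following is a minimal sketch that assumes roughly exponential growth between two of the quoted years; the names GENBANK_BP and doubling_time are hypothetical and chosen only for this illustration.

```python
import math

# Base-pair counts quoted on this slide (year -> bp stored in Genbank).
GENBANK_BP = {
    1986: 9_615_371,
    1991: 71_947_426,
    1996: 651_972_984,
    2001: 15_849_921_438,
    2004: 37_893_844_733,
    2009: 106_533_156_756,
}


def doubling_time(year_a, year_b, counts=GENBANK_BP):
    """Average doubling time in years between two listed years.

    Assumption: growth between the two years is exponential.
    """
    growth = counts[year_b] / counts[year_a]
    return (year_b - year_a) * math.log(2) / math.log(growth)


if __name__ == "__main__":
    print(f"1986-2009: doubling every {doubling_time(1986, 2009):.2f} years")
    print(f"2004-2009: doubling every {doubling_time(2004, 2009):.2f} years")
```

Under this assumption the database doubled roughly every 1.7 years between 1986 and 2009, and more slowly over the last five years listed.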
What’s in the databases?
On March 18, 2005 there were 1791 completely sequenced
viruses, 204 completely sequenced bacteria,
21 completely sequenced archaea, and 9 complete
genomes of Eukaryotes, among them two yeasts, the
roundworm C. elegans, the fruitfly Drosophila
melanogaster, the mosquito A. gambiae, the malaria
parasite P. falciparum, and the plant Arabidopsis thaliana
(thale cress). There are also drafts of 11 other genomes
of eukaryotes, most notably of the human genome.
What’s in the databases?
On December 17, 2010 there were
3518 completely sequenced viruses,
952 completely sequenced bacteria,
68 completely sequenced archaea,
and 73 complete genomes of Eukaryotes,
among them cow, wolf, horse, human, a
monkey, pig, chimpanzee.
First challenge: Sequencing large genomes
Currently, much of the sequencing process is automated.
However, contemporary sequencing machines can only
sequence stretches of DNA that are a few hundred base
pairs long at a time. The process of assembling these
stretches of sequence into a whole genome poses some
interesting mathematical problems.
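To give a flavor of the assembly problem, here is a toy sketch of greedy overlap merging: repeatedly join the two reads with the longest suffix-prefix overlap. It assumes error-free reads from a single strand and no repeats, which real data do not satisfy; it is not the method used by any particular sequencing project, and the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len letters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap; returns the remaining contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three overlapping reads taken from the made-up sequence ATGCGTACGTTAGC.
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

The greedy strategy can go wrong when the genome contains repeats, which is one reason assembly is mathematically interesting.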
First challenge: Sequencing large genomes
For example, the publicly financed Human Genome Project
used an approach called genome mapping to facilitate
sequence assembly. Celera Genomics, a private
enterprise, announced that it would be able to complete
the sequencing of the entire human genome much faster
by using an approach called shotgun sequencing. There
was much debate over the feasibility of the latter
approach, but it apparently worked. At its core, this was a
debate over the mathematics of sequence assembly.
You have sequenced your genome - what do you do with it?
Extracting biological information from a sequenced genome
is known as genome analysis or sequence analysis.
At present, most of bioinformatics is concerned with
sequence analysis. Here are some of the questions
studied in sequence analysis:
• gene finding
• protein 3D structure prediction
• gene function prediction
• prediction of important sites in proteins
• reconstruction of phylogenies
Genes and proteins
The genome controls the making and workings of an
organism by telling the cell which proteins to manufacture
under which conditions. Proteins are the workhorses of
biochemistry and play a variety of roles.
A gene is a stretch of DNA that codes for a given protein.
Where are the genes?
The objective of gene finding is to identify the regions of
DNA that are genes. Ideally, we want to make statements
like: “Positions 28,354 through 29,536 of this genome code
for a protein.”
The mathematical challenge here is to identify patterns in
DNA that reliably indicate where a gene starts and ends,
especially in eukaryotes.
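The simplest pattern of this kind is an open reading frame (ORF): a start codon followed, in the same reading frame, by a stop codon. The sketch below scans the three forward reading frames for such stretches. It assumes the standard start codon ATG and stop codons TAA, TAG, TGA, ignores the reverse strand and introns, and is therefore at best a first approximation for bacteria, not a gene finder for eukaryotes. The name find_orfs and the min_codons cutoff are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=10):
    """Return ORFs on the forward strand as (start, end) positions.

    Positions are 1-based and inclusive; the length counted against
    min_codons includes both the start and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    # Made-up sequence with one short ORF: ATG AAA TTT GGG TAA.
    print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Long ORFs are unlikely to occur by chance, so the cutoff min_codons trades missed short genes against false positives.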
Protein structure prediction
When a protein is manufactured in the cell, it assumes a
characteristic 3D structure or fold. It is very costly to
determine the 3D structure of a protein experimentally (by
NMR or X-ray crystallography). It would be much cheaper
if we could predict the 3D structure of a protein directly
from its primary structure, i.e., from the sequence of its
amino acids. This is known as the protein folding problem.
Many approaches have been proposed to develop
algorithms for solving this problem; so far results are
mixed.
Prediction of protein function
Suppose you have identified a gene. What is its role in the
biochemistry of its organism? Sequence databases can
help us in formulating reasonable hypotheses.
• Search the database for proteins with similar amino acid
sequences in other organisms.
• If the functions of the most similar proteins are known
and if they tend to be the same function (e.g., “enzyme
involved in glucose metabolism”), then it is reasonable
to conjecture that your gene also codes for an enzyme
involved in glucose metabolism.
Prediction of protein function: homology searches
Given a protein or DNA sequence, searching the
database(s) for similar sequences is known as a “homology
search”. The most popular software tool for performing
these searches is called BLAST; therefore biologists often
speak of “BLAST searches”. There are two interesting
problems here:
• How to measure “similarity” of two sequences.
• How much similarity constitutes evidence of biologically
meaningful homology as opposed to random chance?
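One common way to make the first question precise is an alignment score: reward matching letters, penalize mismatches and gaps, and take the best total over all ways of lining up the two sequences. The sketch below computes such a global alignment score by dynamic programming (the Needleman-Wunsch recurrence). The scoring values are arbitrary assumptions for illustration, and this is not how BLAST itself works; BLAST uses a faster heuristic based on local alignments.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Assumed scoring scheme: +1 per match, -1 per mismatch, -2 per gap.
    Only the previous row of the dynamic programming table is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

The second question can then be approached by comparing the observed score with the scores obtained for shuffled or unrelated sequences.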
Prediction of important sites in proteins
Not all parts of a protein are equally important; the
function of most of its amino acids is often just to maintain
an appropriate 3D structure, and mutations of those less
crucial amino acids often don't have much effect.
However, most proteins have crucial parts such as
binding sites. Mutations occurring at binding sites tend to
be lethal and will be weeded out by evolution.
How to predict binding sites from sequence data:
• Get a collection of proteins of similar amino acid
sequences and analogous biochemical function from
your database.
• Align these sequences amino acid by amino acid.
• Check which regions of the protein are highly conserved
in the course of evolution.
• The binding site should be in one of the highly
conserved regions.
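The last two steps above can be sketched as follows, assuming the sequences have already been aligned (equal length, with "-" marking gaps). The conservation measure used here, the fraction of sequences sharing the most common residue in a column, is only one simple choice among many, and the function names, threshold and toy sequences are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue, per column.

    Gaps ("-") never count toward the most common residue.
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


def conserved_regions(alignment, threshold=1.0, min_len=3):
    """Runs of at least min_len columns whose conservation >= threshold.

    Returns (start, end) column ranges, 1-based and inclusive.
    """
    scores = column_conservation(alignment)
    regions, start = [], None
    for i, s in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start + 1, i))
            start = None
    return regions


if __name__ == "__main__":
    # Made-up aligned protein fragments.
    toy = ["MKTAHCGHWLV", "MRTAHCGHYLV", "MKSAHCGH-LI"]
    print(conserved_regions(toy))
```

In the toy example only columns 4 through 8 are identical in all three sequences, so that stretch would be the candidate for a binding site.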
The importance of being aligned
DNA and protein molecules evolve mostly by three
processes: point mutations (exchange of a single letter for
another), insertions, and deletions. If a group of
homologous proteins from different organisms has been
identified, it is assumed that these proteins have evolved
from a common ancestor. The process of multiple
sequence alignment aims at identifying loci in the
individual molecules that are derived from a common
ancestral locus. These form the columns of the alignment.
Example of a multiple alignment
A T G   - - T T C G   G A C   T
A C G   A A T C C A   G - C   T
- C G   A A T C C T   A A C   C
- T G   A G C A C T   A A C   C
Reconstruction of phylogenetic trees
A phylogenetic tree depicts the evolutionary history of a
group of species. By observing similarities and differences
between species, we may be able to reconstruct their
phylogeny. Classically, the degree of similarity between
two species has been assessed from morphological
characters. By comparing genomic sequence data, we
actually can quantify the degree of similarity between any
two species, and use these degrees of similarity as a basis
for reconstructing phylogenetic trees.
Reconstruction of phylogenetic trees
The most common approach to using genomic data for
reconstruction of phylogenetic trees is to look at genes
with analogous function and thus supposedly common
ancestry and see how far the genes taken from the extant
organisms have diverged.
The observed differences in the amino acid or nucleotide
sequences are then used to reconstruct the phylogeny. The
current partition of organisms into eubacteria, archaea and
eukarya was discovered in this way by analyzing ribosomal
RNA (rRNA) sequences.
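As a rough illustration of how sequence differences can be turned into a tree, the sketch below computes the proportion of differing sites between aligned sequences and then repeatedly joins the two closest clusters (the UPGMA method). This is one simple distance-based method chosen for brevity, not necessarily the one used in the rRNA studies mentioned above; the sequences and names are made up.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(seqs):
    """Build a tree topology from a dict name -> aligned sequence.

    Returns nested tuples, e.g. (("A", "B"), ("C", "D")).
    """
    clusters = {name: 1 for name in seqs}  # cluster label -> number of leaves
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        na, nb = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))


if __name__ == "__main__":
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACCA", "D": "TCGAACCT"}
    print(upgma(toy))
```

UPGMA implicitly assumes that all lineages evolve at the same rate, which is often not true for real data.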
The new frontier: Functional genomics
It is fashionable nowadays to talk about functional
genomics. Many people use this term as if it were a new
discipline separate from bioinformatics, but I think it is
more appropriate to consider it a new subfield of
bioinformatics.
The ultimate aim of functional genomics is to understand
what genes do, when they do it, and how they do it.
Ideally, we would like to understand the cell, or organism,
as a giant network of chemical pathways that regulate
each other.
Microarrays (gene chips)
Microarrays, or gene chips, make it possible to monitor the
level of activity of all the genes represented on the chip
simultaneously under a variety of environmental
conditions, in various organs, and at various stages of
development.
There are two types of challenges here: To determine
when a change in activity level detected by the chip is
statistically significant, and to use the data so obtained to
make inferences about gene regulation.
What do we do with all these data?
The bread and butter method of microarray data
analysis is clustering. For a sequence of experiments
on the same set of genes under various conditions,
clustering makes it possible to identify groups of genes
that are up- or down-regulated simultaneously. It is believed
that genes acting in the same chemical pathway
would normally belong to the same cluster. Some
algorithms for clustering will be discussed in this course.
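A very simple form of clustering can be sketched as follows: treat each gene's measurements across the experiments as a profile and put two genes in the same group whenever their profiles are strongly correlated. This is a single-linkage grouping with an arbitrary correlation threshold, shown only as an illustration and not as one of the specific algorithms of the course. The gene names and expression values are made up, and profiles are assumed not to be constant.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cluster_by_correlation(profiles, threshold=0.9):
    """Group genes whose profiles correlate at or above the threshold.

    profiles: dict gene name -> list of expression values, one per experiment.
    A gene joins (and merges) every existing group containing a correlated gene.
    """
    clusters = []
    for gene in profiles:
        linked = [c for c in clusters
                  if any(pearson(profiles[gene], profiles[g]) >= threshold
                         for g in c)]
        merged = {gene}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


if __name__ == "__main__":
    toy = {
        "geneA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8.5],
        "geneC": [4, 3, 2, 1],
        "geneD": [8, 6.5, 4, 2],
    }
    print(cluster_by_correlation(toy))
```

Here geneA and geneB rise together while geneC and geneD fall together, so the sketch returns two groups.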