On bioinformatics


Genomics, Proteomics, and Bioinformatics
Biology 224
Instructor: Tom Peavy
August 30, 2010
What is bioinformatics?
• Interface of biology and computers
• Analysis of genomes, genes, mRNA
and proteins using computer algorithms
and computer databases
What is Genomics?
What is Proteomics?
What is the Transcriptome?
On bioinformatics
“Science is about building causal relations between natural
phenomena (for instance, between a mutation in a gene and
a disease). The development of instruments to increase our
capacity to observe natural phenomena has, therefore,
played a crucial role in the development of science - the
microscope being the paradigmatic example in biology. With
the human genome, the natural world takes an
unprecedented turn: it is better described as a sequence of
symbols. Besides high-throughput machines such as
sequencers and DNA chip readers, the computer and the
associated software becomes the instrument to observe it,
and the discipline of bioinformatics flourishes.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl 1):S1,
introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation
Assessment Project
What do you want out of this course?
Themes throughout the course:
gene/protein families
Retinol-binding protein 4 (RBP4)
--member of the lipocalin family
--small, abundant carrier protein
We will study it in a variety of contexts including
--homologs in various species
--sequence alignment
--gene expression
--protein structure
--phylogeny
[Diagram labels: bioinformatics, medical informatics, public health informatics; tool-users; tool-makers; algorithms, databases, infrastructure]
[Diagram: DNA (genomic DNA databases) → RNA (cDNA, ESTs, UniGene, microarrays) → protein (protein sequence databases) → phenotype]
There are three major public DNA databases
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ: housed in Japan
Growth of GenBank
[Figure: sequences (millions) and base pairs of DNA (billions) by year, 1982-2002. Updated 8-12-04: >40 billion base pairs.]
Growth of GenBank + Whole Genome Shotgun (1982-November 2008)
[Figure: number of sequences in GenBank (millions), base pairs of DNA in GenBank (billions), and base pairs in GenBank + WGS (billions) by year.]
Taxonomy at NCBI:
~200,000 species are represented in GenBank (11/08)
2010: 230,682 species
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
The most sequenced organisms in GenBank
(Updated 11-6-08; GenBank release 168.0; excluding WGS, organelles, metagenomics)

Organism                         Bases (billions)
Homo sapiens                     13.1
Mus musculus                     8.4
Rattus norvegicus                6.1
Bos taurus                       5.2
Zea mays                         4.6
Sus scrofa                       3.6
Danio rerio                      3.0
Oryza sativa (japonica)          1.5
Strongylocentrotus purpuratus    1.4
Nicotiana tabacum                1.1
Go to NCBI website
http://www.ncbi.nlm.nih.gov/
PubMed
• National Library of Medicine's search service
• 12 million citations in MEDLINE
• links to participating online journals
• PubMed Central has access to full articles
• Entrez integrates the scientific literature; DNA and protein sequence databases;
3D protein structure data; population study data sets; assemblies of complete
genomes; etc.
Entrez is a search and retrieval system
that integrates NCBI databases
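As a rough illustration of how an Entrez search can be expressed programmatically, the sketch below builds a query URL for NCBI's E-utilities ESearch endpoint. The helper name build_esearch_url and the example search term (RBP4 in human) are illustrative assumptions, not part of the slides. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for an Entrez database and query term.

    build_esearch_url is a hypothetical helper written for this example.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Example query (an assumption for illustration): human RBP4 protein records.
url = build_esearch_url("protein", "RBP4[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Pasting the printed URL into a browser should return a list of matching record identifiers, which is essentially what the Entrez web interface does behind its search box.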
BLAST: Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day
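To make the idea of "local alignment" concrete, here is a minimal sketch of Smith-Waterman scoring, the classic dynamic-programming method for local alignment. This is not how BLAST itself works (BLAST uses a faster heuristic), and the match, mismatch, and gap scores below are arbitrary values assumed for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman recurrence with a linear gap penalty. Scores never
    drop below zero, so an alignment can start and end anywhere.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # first column is always zero
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# The two sequences share only the short stretch "ACG".
print(local_alignment_score("TTACGTT", "GGACGGG"))
```

Because the score is reset to zero whenever it would go negative, unrelated flanking sequence does not penalize a well-matching region. That is the sense in which the alignment is "local".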
OMIM: Online Mendelian Inheritance in Man, a catalog of human genes and genetic disorders
OMIA: Online Mendelian Inheritance in Animals
Structure site includes: Molecular Modeling Database (MMDB); biopolymer structures
obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); Vector Alignment
Search Tool (VAST); and other protein structure resources
Review of Genetics, Biochemistry & Evolution
Human Genome Project
What is a typical genomic structure for a eukaryotic gene?
Synonymous vs. nonsynonymous changes
Proline: CCT, CCC, CCA, CCG (fourfold degenerate amino acid)
Arginine: CGT
Synonymous changes
Nonsynonymous changes
Synonymous substitution
Non-synonymous substitution
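The distinction can be checked mechanically: a substitution is synonymous if the mutated codon still encodes the same amino acid, and nonsynonymous otherwise. The sketch below assumes the standard genetic code written with DNA bases. The function name classify_substitution is made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change at a 0-based codon position."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "nonsynonymous"


# Third-position change in a proline codon: CCT -> CCA (still proline).
print(classify_substitution("CCT", 2, "A"))
# Second-position change: CCT (proline) -> CGT (arginine).
print(classify_substitution("CCT", 1, "G"))
```

Any third-position change in a proline codon comes out synonymous, which is what "fourfold degenerate" means for that site.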
Central Dogma
• DNA → RNA → protein
• sequence → structure → function → evolution
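The DNA, RNA, protein flow can be sketched in a few lines. This example assumes the input is the coding (sense) strand, so transcription amounts to replacing T with U, and it uses the standard genetic code. The sequence ATGCCTCGTTAA is made up for illustration.

```python
from itertools import product

# Standard genetic code for RNA codons, enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
RNA_CODONS = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA corresponding to a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = RNA_CODONS[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("ATGCCTCGTTAA")
print(mrna)             # AUGCCUCGUUAA
print(translate(mrna))  # MPR
```

Real eukaryotic genes add steps this sketch ignores, such as intron splicing and the other mRNA modifications raised in the next slide.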
What kinds of modifications are made to eukaryotic mRNAs?
RNA Modifications
What are cDNAs?
Protein structures
• X-ray crystallography and Nuclear
magnetic resonance (NMR)
• Primary structure – linear amino acid (AA) sequence
• Secondary structure – alpha helix and beta sheet
• Tertiary structure – 3-D fold that exposes binding domains, etc.
Linkage maps
• YAC (yeast artificial chromosome) and BAC (bacterial artificial chromosome)
-used to clone large pieces of DNA
-overlapping clones
• Are genes linked?
Organization of genomes
• Groups of genes within a species
-Comparative Genomics
• plastid genomes and mt genomes
How do we determine the functions of genes?
• Expression patterns
– Northerns
– RT-PCR
– SAGE
– Microarrays
• Transgenics
– insert genes: what results?
• Mutants
– classical genetics
– molecular genetics
• Functional protein assays
Charles Darwin
• Descent with modification
– species change through time and are related to a
common ancestor
• Natural Selection is the process by which
this change occurs
Understanding natural selection
• acts on individuals, though consequences occur in populations
– an individual’s phenotype is the reason it survived and reproduced
– after a time this will change the distribution in
the population,
– what ultimately changes?
• Gene pool
New alleles
• A point change is all that is needed
– not always a "big deal"
• neutral change
– can be, as in sickle cell anemia
Gene duplication
• creates an additional copy of a gene
– unequal cross-over
– X-rays
• Are these duplicates maintained in
populations?
– Pseudogenes
Polyploidy
• additional set of chromosomes
– Found in plants
– Amphibians, invertebrates
• Through a type of parthenogenesis
– Triploid
• Poor fertility
• Hybridization or meiosis malfunction
Homology
• study of likeness (literal)
• Similarity between species (or genes) that
results from inheritance of traits from a
common ancestor
– Unless a common ancestor is known, be careful when using
this word.
Orthologous vs Paralogous Genes
[Diagram: ancestral gene a → gene duplication → genes a and b → speciation → Species 1 (a, b) and Species 2 (a, b)]
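The scenario in the diagram (a duplication producing copies a and b, followed by a speciation event) can be written as a small classifier. This is a sketch for that specific scenario only. It assumes the duplication happened before the speciation, and the (species, copy) tuple representation and the name classify_pair are invented for this example.

```python
def classify_pair(gene1, gene2):
    """Classify two genes given as (species, copy) tuples.

    Assumes one duplication (giving copies "a" and "b") that occurred
    before a single speciation event.
    """
    species1, copy1 = gene1
    species2, copy2 = gene2
    if copy1 != copy2:
        return "paralogous"   # lineages split at the gene duplication
    if species1 != species2:
        return "orthologous"  # lineages split at the speciation
    return "same gene"


print(classify_pair(("Species 1", "a"), ("Species 2", "a")))  # orthologous
print(classify_pair(("Species 1", "a"), ("Species 1", "b")))  # paralogous
print(classify_pair(("Species 1", "a"), ("Species 2", "b")))  # paralogous
```

Note that a in Species 1 and b in Species 2 are paralogs even though they sit in different species: what matters is whether the two lineages diverged at a duplication or at a speciation.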
Species
• All organisms alive today can trace their
ancestry back to the origin of life some 3.8
billion years ago
– Since then millions if not billions of branching
events have occurred
• Mechanisms have to be in place for change
to occur
– genetic drift and natural selection
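Genetic drift can be illustrated with a small simulation. The sketch below uses the Wright-Fisher model, a standard textbook idealization that the slides do not name: each generation, every gene copy in a fixed-size population is drawn at random from the previous generation's allele frequencies. The population size, starting frequency, and seed are arbitrary assumptions.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate drift of one allele's frequency under random sampling.

    p0 is the starting allele frequency and pop_size is the number of
    gene copies sampled each generation. Returns the list of
    frequencies, including the starting value.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Each copy in the next generation carries the allele with probability p.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        p = count / pop_size
        freqs.append(p)
    return freqs


trajectory = wright_fisher(p0=0.5, pop_size=50, generations=20)
print(trajectory)
```

No selection is involved here: the frequency changes by sampling chance alone, and in small populations an allele can be lost or fixed purely by drift.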