Bio background
BNFO 615 Data Analysis in
Bioinformatics
Instructor
Zhi Wei
Outline
Cell
Genome
Gene
mRNA
Proteins
Systems biology
Cells
Fundamental working units of
every living system.
Every organism is composed of one of two radically
different types of cells:
prokaryotic cells or eukaryotic cells.
Prokaryotes and Eukaryotes are descended from
the same primitive cell.
All extant prokaryotic and eukaryotic cells are
the result of a total of 3.5 billion years of
evolution.
Prokaryotes vs. Eukaryotes
Different Structures
Different Components
Different biological processes
Prokaryotes vs. Eukaryotes
Prokaryotes                                  Eukaryotes
Single cell                                  Single or multi cell
No nucleus                                   Nucleus
No membrane-bound organelles                 Organelles
One piece of circular DNA                    Chromosomes
No mRNA post-transcriptional modification    Exons/Introns splicing
Prokaryotes vs. Eukaryotes
Prokaryotes (bacteria, archaea). Example: E. coli cell
5 × 10^6 base pairs
> 90% of DNA encodes protein
5,400 genes
Lacks a membrane-bound nucleus
Circular DNA
No known histones
Eukaryotes (plants, animals, protista, and fungi). Example: yeast cell
12.4 × 10^6 base pairs
A small fraction of the total DNA encodes protein. Many repeats of non-coding sequences
5,800 genes
All chromosomes are contained in a membrane-bound nucleus
DNA is divided between 16 chromosomes
A set of five histones: DNA packaging and gene expression regulation
Cell chemical composition
Chemical composition by weight:
70% water
7% small molecules: salts, lipids, amino acids, nucleotides
23% macromolecules: proteins, polysaccharides, lipids
We have different cells
Cells differ in size, shape
and weights
Q: what is the biggest cell in
the human body?
Cell Cycle
Born, eat, replicate, and die
Lodish et al. Molecular Cell Biology (5th ed.). W.H. Freeman & Co., 2003.
Sexual Reproduction vs. Cell Division
Cell Division: Cells reproduce by duplicating their contents and dividing in two.
Sexual Reproduction
Formation of a new individual by a combination of two haploid sex cells (gametes).
Gametes for fertilization usually come from separate parents.
Both gametes are haploid, with a single set of chromosomes. The new individual is called a zygote, with two sets of chromosomes (diploid).
Meiosis is the process that converts a diploid cell into haploid gametes and reshuffles the genetic information to increase diversity in the offspring.
Meiosis vs. mitotic cell division
Genome
A genome is an organism’s complete set of DNA (including its genes).
However, in humans less than 3% of the genome actually encodes proteins.
Part of the rest of the genome serves as control regions (though that is also a small part).
The function of the rest of the genome is unknown (junk DNA? An open question).
Comparison of Different Organisms
Organism    Genome size (bp)    Num. of genes
E. coli     0.05 × 10^8         5,400
Yeast       0.12 × 10^8         5,800
Worm        0.15 × 10^8         18,400
Fly         1.8 × 10^8          13,600
Human       30 × 10^8           25,000
Plant       1.3 × 10^8          25,000
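One way to read the comparison is as gene density, that is, genes per million base pairs. The sketch below is only an illustration: the dictionary and function names are hypothetical, and the numbers are copied from the slide as listed rather than taken from a curated database.

```python
# Genome size (bp) and gene counts, copied from the comparison slide as listed.
GENOMES = {
    "E. coli": (0.05e8, 5_400),
    "Yeast": (0.12e8, 5_800),
    "Worm": (0.15e8, 18_400),
    "Fly": (1.8e8, 13_600),
    "Human": (30e8, 25_000),
    "Plant": (1.3e8, 25_000),
}


def genes_per_megabase(size_bp, n_genes):
    """Return the average number of genes per million base pairs."""
    return n_genes / (size_bp / 1e6)


if __name__ == "__main__":
    for organism, (size_bp, n_genes) in GENOMES.items():
        density = genes_per_megabase(size_bp, n_genes)
        print(f"{organism:8s} {density:8.1f} genes/Mb")
```

With these figures the bacterial genome packs in roughly a thousand genes per megabase while the human genome has fewer than ten, which matches the point that only a small fraction of human DNA is coding.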
What is a gene?
Genomic DNA: Promoter, Protein coding sequence, Terminator
DNA: Deoxyribonucleic acid
Example of a Gene: Gal4 DNA
ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAG
TGCTCCAAAGAAAAACCGAAGTGCGCCAAGTGTCTGAAGAACAACTGGGAGTGTCGCTAC
TCTCCCAAAACCAAAAGGTCTCCGCTGACTAGGGCACATCTGACAGAAGTGGAATCAAGG
CTAGAAAGACTGGAACAGCTATTTCTACTGATTTTTCCTCGAGAAGACCTTGACATGATT
TTGAAAATGGATTCTTTACAGGATATAAAAGCATTGTTAACAGGATTATTTGTACAAGAT
AATGTGAATAAAGATGCCGTCACAGATAGATTGGCTTCAGTGGAGACTGATATGCCTCTA
ACATTGAGACAGCATAGAATAAGTGCGACATCATCATCGGAAGAGAGTAGTAACAAAGGT
CAAAGACAGTTGACTGTATCGATTGACTCGGCAGCTCATCATGATAACTCCACAATTCCG
TTGGATTTTATGCCCAGGGATGCTCTTCATGGATTTGATTGGTCTGAAGAGGATGACATG
TCGGATGGCTTGCCCTTCCTGAAAACGGACCCCAACAATAATGGGTTCTTTGGCGACGGT
TCTCTCTTATGTATTCTTCGATCTATTGGCTTTAAACCGGAAAATTACACGAACTCTAAC
GTTAACAGGCTCCCGACCATGATTACGGATAGATACACGTTGGCTTCTAGATCCACAACA
TCCCGTTTACTTCAAAGTTATCTCAATAATTTTCACCCCTACTGCCCTATCGTGCACTCA
CCGACGCTAATGATGTTGTATAATAACCAGATTGAAATCGCGTCGAAGGATCAATGGCAA
ATCCTTTTTAACTGCATATTAGCCATTGGAGCCTGGTGTATAGAGGGGGAATCTACTGAT
ATAGATGTTTTTTACTATCAAAATGCTAAATCTCATTTGACGAGCAAGGTCTTCGAGTCA
A sequence of A,C,G,T
Example of a Gene: Gal4 AA
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESR
LERLEQLFLLIFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPL
TLRQHRISATSSSEESSNKGQRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDM
SDGLPFLKTDPNNNGFFGDGSLLCILRSIGFKPENYTNSNVNRLPTMITDRYTLASRSTT
SRLLQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQILFNCILAIGAWCIEGESTD
IDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHSFSIRMAISLG
LNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTT
TGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQ
MDISTTALTNLLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQS
YEVKRCSIMLSDAAQRTVMSVSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSN
AENNETAQLLQQINTVLMLLKKLATFKIQTCEKYIQVLEEVCAPFLLSQCAIPLPHISYN
NSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSVGPSPVPLKSGASFSDLVKLL
SNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANFNQSGNIADSS
A sequence of 20 amino acids {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
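The two Gal4 listings are related by translation: each DNA triplet maps to one amino acid. As a minimal sketch (the function and constant names are hypothetical, and the standard genetic code is assumed), the first 60 bases of the DNA listing can be translated and compared with the start of the amino acid listing:

```python
from itertools import product

# Standard genetic code in TCAG order (DNA alphabet); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 60 bases of the Gal4 DNA listing.
GAL4_START = "ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAG"

if __name__ == "__main__":
    print(translate(GAL4_START))  # MKLLSSIEQACDICRLKKLK
```

The result matches the first 20 residues of the Gal4 amino acid sequence.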
The Central Dogma
DNA → RNA: Gene Transcription
promoter
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Gene Transcription
transcription factor, binding site, RNA polymerase
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Transcription factors recognize transcription factor binding sites and bind to them, forming a complex.
RNA polymerase binds the complex.
Gene Transcription
The two strands are separated
Gene Transcription
An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template
Gene Transcription
DNA: 5’ G A T T A C A . . . 3’
DNA: 3’ C T A A T G T . . . 5’
pre-mRNA: 5’ G A U U A C A . . . 3’
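The transcription step can be sketched in a few lines, assuming the template strand is written 3’→5’ as on the slide. The function name and pairing table are hypothetical helpers for illustration; real transcription also involves the promoter, transcription factors, and RNA polymerase described above.

```python
# Pairing from a DNA template base to the RNA base placed opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA (read 5'->3') built on a DNA template strand given 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


if __name__ == "__main__":
    print(transcribe("CTAATGT"))  # GAUUACA
```

The output equals the non-template strand (GATTACA) with every T replaced by U, which is why the RNA is called a copy of the 5’→3’ sequence.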
RNA Processing (Eukaryotes)
5’ cap, polyadenylation, exon, intron, splicing, UTR
mRNA: 5’ cap, 5’ UTR, exons (introns spliced out), 3’ UTR, poly(A) tail
Mammalian Gene Structure
Figure labels (5’ to 3’): promoter, 5’ UTR, exons, introns, 3’ UTR; coding and non-coding regions
Regulatory regions: up to 50 kb upstream of +1 site
Exons:
protein coding and untranslated regions (UTR)
1 to 178 exons per gene (mean 8.8)
8 bp to 17 kb per exon (mean 145 bp)
Introns:
splice donor (GU) and acceptor (AG) sites, junk DNA
average 1 kb to 50 kb per intron
Gene size:
Largest: 2.4 Mb (Dystrophin). Mean: 27 kb.
Only about 1.5% of human DNA is protein-coding!
Identifying Genes in Sequence Data
Predicting the start and end of genes as well as the introns and exons in each gene is one of the basic problems in computational biology.
Gene prediction methods look for ORFs (Open Reading Frames).
These are (relatively long) DNA segments that start with the start codon, end with one of the stop codons, and do not contain any other in-frame stop codon in between.
Splice site prediction has received a lot of attention in the literature.
Comparative genomics
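The ORF definition above translates fairly directly into a scan. The following is a minimal sketch, not a real gene predictor: it looks only at the forward strand, ignores introns, and the `min_codons` threshold is a hypothetical stand-in for "relatively long".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each forward-strand ORF.

    An ORF here starts at ATG and runs in frame to the first stop codon.
    `end` is exclusive, and the stop codon is included in the sequence.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                break
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATGGTAACC"):
        print(orf)
```

In the toy input the second ATG never reaches an in-frame stop codon, so only the ORF starting at position 2 is reported. A practical tool would also scan the reverse complement and use a much larger length threshold.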
RNA
RNA is chemically similar to DNA. It is usually only a single strand. T(hymine) is replaced by U(racil).
Some forms of RNA can form secondary structures by “pairing up” with themselves. This can change their properties dramatically.
DNA and RNA can pair with each other.
tRNA linear and 3D view:
http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
RNA, continued
Several types exist, classified by function
mRNA – messenger RNA. This is what is usually being referred to when a bioinformatician says “RNA”. It carries a gene’s message out of the nucleus.
tRNA – transfer RNA. Adaptor that matches each mRNA codon to its amino acid, turning the message into an amino acid sequence.
rRNA – ribosomal RNA. Part of the ribosome, which is involved in translation.
Messenger RNA
Basically, an intermediate product
Transcribed from the genome and
translated into protein
Number of copies correlates well with
number of proteins for the gene.
Unlike DNA, the amount of messenger
RNA (as well as the number of proteins)
differs between different cell types and
under different conditions.
Complementary base-pairing
mRNA is transcribed from the DNA
mRNA (like DNA, but unlike proteins) binds to its
complement
Quantify mRNA levels
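Because mRNA binds to its complement, a complementary DNA strand can serve as a probe for a given message, which is the property hybridization-based measurements rely on. A minimal sketch of computing such a complement follows; the function name is hypothetical and the example sequence is the short pre-mRNA fragment from the transcription slides.

```python
# RNA base -> DNA base that pairs with it.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def dna_probe_for(mrna_5_to_3):
    """Return the DNA strand (written 5'->3') that base-pairs with the given mRNA."""
    # Strands pair antiparallel, so complement each base and reverse the order.
    return "".join(COMPLEMENT[base] for base in reversed(mrna_5_to_3))


if __name__ == "__main__":
    print(dna_probe_for("GAUUACA"))  # TGTAATC
```

Read 3’→5’, the probe is CTAATGT, the same sequence as the template strand the mRNA was transcribed from.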
Proteins: Workhorses of the Cell
Proteins are polypeptide chains of amino acids.
20 different amino acids
different chemical properties cause the protein chains to
fold up into specific three-dimensional structures that
define their particular functions in the cell.
Proteins do all essential work for the cell
build cellular structures
digest nutrients
execute metabolic functions
Mediate information flow within a cell and
among cellular communities.
Genes Make Proteins
genome → genes → proteins (form cellular structures and carry out life functions) → pathways & physiology
Genes Encode Proteins
The genetic code (first letter: row group; second letter: column; third letter: row within the group)
First    Second letter                                      Third
letter   U          C          A           G                letter
U        UUU Phe    UCU Ser    UAU Tyr     UGU Cys          U
         UUC Phe    UCC Ser    UAC Tyr     UGC Cys          C
         UUA Leu    UCA Ser    UAA STOP    UGA STOP         A
         UUG Leu    UCG Ser    UAG STOP    UGG Trp          G
C        CUU Leu    CCU Pro    CAU His     CGU Arg          U
         CUC Leu    CCC Pro    CAC His     CGC Arg          C
         CUA Leu    CCA Pro    CAA Gln     CGA Arg          A
         CUG Leu    CCG Pro    CAG Gln     CGG Arg          G
A        AUU Ile    ACU Thr    AAU Asn     AGU Ser          U
         AUC Ile    ACC Thr    AAC Asn     AGC Ser          C
         AUA Ile    ACA Thr    AAA Lys     AGA Arg          A
         AUG Met or START  ACG Thr  AAG Lys  AGG Arg        G
G        GUU Val    GCU Ala    GAU Asp     GGU Gly          U
         GUC Val    GCC Ala    GAC Asp     GGC Gly          C
         GUA Val    GCA Ala    GAA Glu     GGA Gly          A
         GUG Val    GCG Ala    GAG Glu     GGG Gly          G
Abbreviations: Phe = Phenylalanine, Leu = Leucine, Ile = Isoleucine, Met = Methionine, Val = Valine, Ser = Serine, Pro = Proline, Thr = Threonine, Ala = Alanine, Tyr = Tyrosine, His = Histidine, Gln = Glutamine, Asn = Asparagine, Lys = Lysine, Asp = Aspartic acid, Glu = Glutamic acid, Cys = Cysteine, Trp = Tryptophan, Arg = Arginine, Gly = Glycine
Triplet → one Amino Acid
4^3 = 64 combinations mapped to 20 Amino Acids (plus STOP)
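Since 64 triplets map onto only 20 amino acids plus STOP, most amino acids are encoded by more than one codon. A short sketch that builds the standard table in the same UCAG order as the slide and counts codons per amino acid (one-letter codes; the names are hypothetical):

```python
from collections import Counter
from itertools import product

# Standard genetic code in UCAG order (RNA alphabet); '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_counts():
    """Count how many codons encode each amino acid ('*' stands for STOP)."""
    return Counter(CODON_TABLE.values())


if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")
    for aa, n in sorted(codon_counts().items(), key=lambda item: -item[1]):
        print(aa, n)
```

Leucine, serine, and arginine each have six codons, while methionine and tryptophan have only one.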
Open Reading Frames
G C U U G U U U A C G A A U U A G
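The same sequence can be split into codons in three different ways, depending on where reading starts. A minimal sketch (hypothetical function name) that lists the three forward reading frames of the sequence above:

```python
def reading_frames(rna):
    """Split a sequence into complete codons for each of the three forward frames."""
    frames = []
    for offset in range(3):
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames


if __name__ == "__main__":
    for offset, codons in enumerate(reading_frames("GCUUGUUUACGAAUUAG")):
        print(f"frame {offset + 1}:", " ".join(codons))
```

For this sequence the third frame ends in the stop codon UAG, while the first two frames contain no stop codon.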
Synonymous Mutation (A → G)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGU UUG CGA AUU AG    Ala Cys Leu Arg Ile
Missense Mutation (U → G)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGG UUA CGA AUU AG    Ala Trp Leu Arg Ile
Nonsense Mutation (U → A)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGA UUA CGA AUU AG    Ala STOP
Frameshift (one U deleted)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGU UAC GAA UUA G     Ala Cys Tyr Glu Leu
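The four mutation types can be told apart by translating the original and mutated sequences and comparing the peptides. The sketch below is a simplified illustration with hypothetical names; it assumes the standard genetic code, reads from the first base, and does not handle cases such as loss of a stop codon.

```python
from itertools import product

# Standard genetic code in UCAG order (RNA alphabet); '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna):
    """Return the peptide up to (not including) the first stop codon."""
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def classify(original, mutant):
    """Label a mutation by its effect on the encoded peptide."""
    if len(original) != len(mutant):
        # Insertions or deletions that are not a multiple of 3 shift the frame.
        return "frameshift" if (len(original) - len(mutant)) % 3 else "in-frame indel"
    before, after = translate(original), translate(mutant)
    if after == before:
        return "synonymous"
    if len(after) < len(before):
        return "nonsense"  # a new stop codon truncates the peptide
    return "missense"


if __name__ == "__main__":
    original = "GCUUGUUUACGAAUUAG"
    for mutant in ("GCUUGUUUGCGAAUUAG", "GCUUGGUUACGAAUUAG",
                   "GCUUGAUUACGAAUUAG", "GCUUGUUACGAAUUAG"):
        print(mutant, classify(original, mutant), translate(mutant))
```

Applied to the slide examples, this labels the four mutants synonymous, missense, nonsense, and frameshift, in that order.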
Protein Structure
Proteins work together with
other proteins or nucleic
acids as "molecular
machines"
structures fit together and
function in highly specific,
lock-and-key ways.
Four levels of structure:
Primary Structure: The
sequence of the protein
Secondary structure: Local
structure in regions of the
chain. (alpha helix, beta
sheet)
Tertiary Structure: Three
dimensional structure
Quaternary Structure:
multiple subunits
Assigning Function to Proteins
While 25,000 genes have been identified in the human genome, relatively few have known functional annotation.
Determining the function of a protein can be done in several ways.
Sequence similarity to other (known) proteins
Using domain information
Using three-dimensional structure
Based on high-throughput experiments (when it functions and what it interacts with)
Summary: DNA (Gene) → RNA → Protein
Replication (DNA → DNA)
Transcription (DNA → RNA)
Translation (RNA → Protein)
Biological pathway/gene networks
Instead of having brains, cells make decisions through complex networks of interactions, called pathways
Synthesize new materials
Break other materials down for spare parts
Signal to eat or die
In order to fulfill their function, proteins interact
with other proteins in a number of ways including:
Regulation
Signaling Pathways, for example A -> B -> C
Post translational modifications
Forming protein complexes
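A signaling pathway such as A -> B -> C can be represented as a directed graph, and "what does A ultimately affect?" becomes a graph traversal. The sketch below is a toy illustration: the node names come from the example above, and the dictionary and function names are hypothetical.

```python
from collections import deque

# Toy signaling cascade A -> B -> C stored as an adjacency list.
PATHWAY = {"A": ["B"], "B": ["C"], "C": []}


def downstream(network, source):
    """Return every node reachable from `source`, in breadth-first order."""
    seen = {source}
    queue = deque([source])
    order = []
    while queue:
        node = queue.popleft()
        for target in network.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    print(downstream(PATHWAY, "A"))  # ['B', 'C']
```

Real pathways add branches, feedback loops, and edge types (activation, inhibition), but the adjacency-list idea stays the same.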
An Example
Systems Biology
We now have many sources of data, each
providing a different view on the activity
in the cell
Sequence (genes)
DNA motifs
Gene expression
Protein interactions
Protein-DNA interaction
Etc.
Putting it all together: Systems Biology
Next week
Introduction to R programming
You need to do in-class exercises
Acknowledgments
Ziv Bar-Joseph: for some of the slides
adapted or modified from his lecture slides
at Carnegie Mellon University
Neil Jones: for some of the slides adapted
or modified from his slides for the book An
Introduction to Bioinformatics Algorithms