Bio background

BNFO 615 Data Analysis in Bioinformatics
Instructor: Zhi Wei

Outline
- Cell
- Genome
- Gene
- mRNA
- Proteins
- Systems biology

Cells
- Fundamental working units of every living system.
- Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells.
- Prokaryotes and eukaryotes are descended from the same primitive cell.
- All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.

Prokaryotes vs. Eukaryotes
- Different structures
- Different components
- Different biological processes

Prokaryotes vs. Eukaryotes
Prokaryotes                                 Eukaryotes
Single cell                                 Single or multi cell
No nucleus                                  Nucleus
No membrane-bound organelles                Organelles
One piece of circular DNA                   Chromosomes
No mRNA post-transcriptional modification   Exon/intron splicing

Prokaryotes vs. Eukaryotes

Prokaryotes (bacteria, archaea). Example: E. coli cell
- 5 × 10^6 base pairs
- > 90% of DNA encodes protein
- 5,400 genes
- Lacks a membrane-bound nucleus
- Circular DNA
- No histones

Eukaryotes (plants, animals, protista, and fungi). Example: yeast cell
- 12.4 × 10^6 base pairs
- A small fraction of the total DNA encodes protein; many repeats of non-coding sequences
- 5,800 genes
- All chromosomes are contained in a membrane-bound nucleus
- DNA is divided among 16 chromosomes
- A set of five histones: DNA packaging and gene expression regulation

Cells: chemical composition
Chemical composition by weight
- 70% water
- 7% small molecules: salts, lipids, amino acids, nucleotides
- 23% macromolecules: proteins, polysaccharides, lipids

We have different cells
- Cells differ in size, shape, and weight.
- Q: What is the biggest cell in the human body?

Cell Cycle
- Cells are born, eat, replicate, and die.
Lodish et al. Molecular Cell Biology (5th ed.). W.H. Freeman & Co., 2003.

Sexual Reproduction vs. Cell Division
- Cell division: cells reproduce by duplicating their contents and dividing in two.
- Sexual reproduction:
  - Formation of a new individual by a combination of two haploid sex cells (gametes).
  - Gametes for fertilization usually come from separate parents.
  - Both gametes are haploid, with a single set of chromosomes. The new individual is called a zygote, with two sets of chromosomes (diploid).
  - Meiosis is a process that converts a diploid cell to haploid gametes and changes the genetic information to increase diversity in the offspring.

[Figure: Meiosis vs. mitotic cell division]

Genome
- A genome is an organism’s complete set of DNA (including its genes).
- However, in humans less than 3% of the genome actually encodes proteins.
- Part of the rest of the genome serves as control regions (though that is also a small part).
- The function of the rest of the genome is unknown (junk DNA? An open question).

Comparison of Different Organisms
Organism  Genome size (bp)   Num. of genes
E. coli   0.05 × 10^8        5,400
Yeast     0.12 × 10^8        5,800
Worm      0.15 × 10^8        18,400
Fly       1.8 × 10^8         13,600
Human     30 × 10^8          25,000
Plant     1.3 × 10^8         25,000
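
The comparison above invites a quick calculation: how densely are genes packed in each genome? The sketch below is illustrative only. It copies the genome sizes and gene counts from the table as listed, and the names `GENOMES` and `genes_per_megabase` are made up for this example.

```python
# Genome size (bp) and gene count per organism, copied from the table above.
GENOMES = {
    "E. coli": (0.05e8, 5_400),
    "Yeast": (0.12e8, 5_800),
    "Worm": (0.15e8, 18_400),
    "Fly": (1.8e8, 13_600),
    "Human": (30e8, 25_000),
    "Plant": (1.3e8, 25_000),
}


def genes_per_megabase(size_bp, gene_count):
    """Gene density: number of genes per 10^6 base pairs."""
    return gene_count / (size_bp / 1e6)


density = {
    name: genes_per_megabase(size_bp, gene_count)
    for name, (size_bp, gene_count) in GENOMES.items()
}

# Print organisms from the most to the least gene-dense genome.
for name, value in sorted(density.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<8} {value:8.1f} genes per Mb")
```

With these numbers the human genome comes out as the least gene-dense of the six, which fits the point that only a small fraction of human DNA encodes protein.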
What is a gene?
[Figure: a gene on genomic DNA, consisting of a promoter, a protein-coding sequence, and a terminator]
DNA: deoxyribonucleic acid
Example of a Gene: Gal4 DNA
ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAG
TGCTCCAAAGAAAAACCGAAGTGCGCCAAGTGTCTGAAGAACAACTGGGAGTGTCGCTAC
TCTCCCAAAACCAAAAGGTCTCCGCTGACTAGGGCACATCTGACAGAAGTGGAATCAAGG
CTAGAAAGACTGGAACAGCTATTTCTACTGATTTTTCCTCGAGAAGACCTTGACATGATT
TTGAAAATGGATTCTTTACAGGATATAAAAGCATTGTTAACAGGATTATTTGTACAAGAT
AATGTGAATAAAGATGCCGTCACAGATAGATTGGCTTCAGTGGAGACTGATATGCCTCTA
ACATTGAGACAGCATAGAATAAGTGCGACATCATCATCGGAAGAGAGTAGTAACAAAGGT
CAAAGACAGTTGACTGTATCGATTGACTCGGCAGCTCATCATGATAACTCCACAATTCCG
TTGGATTTTATGCCCAGGGATGCTCTTCATGGATTTGATTGGTCTGAAGAGGATGACATG
TCGGATGGCTTGCCCTTCCTGAAAACGGACCCCAACAATAATGGGTTCTTTGGCGACGGT
TCTCTCTTATGTATTCTTCGATCTATTGGCTTTAAACCGGAAAATTACACGAACTCTAAC
GTTAACAGGCTCCCGACCATGATTACGGATAGATACACGTTGGCTTCTAGATCCACAACA
TCCCGTTTACTTCAAAGTTATCTCAATAATTTTCACCCCTACTGCCCTATCGTGCACTCA
CCGACGCTAATGATGTTGTATAATAACCAGATTGAAATCGCGTCGAAGGATCAATGGCAA
ATCCTTTTTAACTGCATATTAGCCATTGGAGCCTGGTGTATAGAGGGGGAATCTACTGAT
ATAGATGTTTTTTACTATCAAAATGCTAAATCTCATTTGACGAGCAAGGTCTTCGAGTCA
A sequence over the four nucleotides A, C, G, T
Example of a Gene: Gal4 AA
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESR
LERLEQLFLLIFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPL
TLRQHRISATSSSEESSNKGQRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDM
SDGLPFLKTDPNNNGFFGDGSLLCILRSIGFKPENYTNSNVNRLPTMITDRYTLASRSTT
SRLLQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQILFNCILAIGAWCIEGESTD
IDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHSFSIRMAISLG
LNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTT
TGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQ
MDISTTALTNLLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQS
YEVKRCSIMLSDAAQRTVMSVSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSN
AENNETAQLLQQINTVLMLLKKLATFKIQTCEKYIQVLEEVCAPFLLSQCAIPLPHISYN
NSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSVGPSPVPLKSGASFSDLVKLL
SNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANFNQSGNIADSS
A sequence over the 20 amino acids {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}
The Central Dogma
DNA → RNA: Gene Transcription
Key terms: promoter, transcription factor, binding site, RNA polymerase

Double-stranded DNA downstream of the promoter:
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’

Gene Transcription
1. Transcription factors recognize transcription factor binding sites and bind to them, forming a complex. RNA polymerase binds the complex.
2. The two strands are separated.
3. An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template.

Resulting pre-mRNA:
5’ G A U U A C A . . . 3’
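
The base-pairing step of transcription can be sketched in a few lines. This is a minimal illustration, not a model of the real process: it ignores promoters, transcription factors, and RNA polymerase, and the function names are invented here. It uses the two strands shown on the slide.

```python
# Watson-Crick pairing from a DNA template base to the RNA base it specifies.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5):
    """Build the RNA (5'->3') that pairs with a template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


def coding_to_rna(coding_5_to_3):
    """Shortcut: the RNA equals the coding strand with T replaced by U."""
    return coding_5_to_3.upper().replace("T", "U")


template = "CTAATGT"  # bottom strand on the slide, written 3'->5'
coding = "GATTACA"    # top strand on the slide, written 5'->3'

pre_mrna = transcribe(template)
print(pre_mrna)  # GAUUACA
```

Both routes give the same RNA, which is why sequence files usually store only the coding strand.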
RNA Processing (Eukaryotes)
Key terms: 5’ cap, polyadenylation, exon, intron, splicing, UTR
[Figure: mature mRNA from 5’ to 3’: 5’ cap, 5’ UTR, exons (introns spliced out), 3’ UTR, poly(A) tail]
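
As a rough illustration of these processing steps, the sketch below splices a toy transcript and marks the cap and poly(A) tail. The sequence and intron coordinates are hypothetical, and the cap is represented only by the text marker `m7G-`. Real splicing is carried out by the spliceosome and is not a simple string operation.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as half-open (start, end) index pairs."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)


def process(pre_mrna, introns, poly_a_length=10):
    """Toy RNA processing: splice, then mark the cap and add a poly(A) tail."""
    return "m7G-" + splice(pre_mrna, introns) + "A" * poly_a_length


# Hypothetical toy transcript: exon, intron (GU...AG), exon.
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUUCAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
introns = [(len(exon1), len(exon1) + len(intron))]

mature = process(pre_mrna, introns, poly_a_length=5)
print(mature)
```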
Mammalian Gene Structure
[Figure: gene layout from 5’ to 3’: promoter, 5’ UTR, exons (coding) separated by introns (non-coding), 3’ UTR]
- Regulatory regions: up to 50 kb upstream of +1 site
- Exons:
  - protein coding and untranslated regions (UTR)
  - 1 to 178 exons per gene (mean 8.8)
  - 8 bp to 17 kb per exon (mean 145 bp)
- Introns:
  - splice donor (GU) and acceptor (AG) sites, junk DNA
  - average 1 kb to 50 kb per intron
- Gene size:
  - Largest: 2.4 Mb (Dystrophin). Mean: 27 kb.
- Only 1.5% of human DNA is protein coding!
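
A back-of-the-envelope check on these figures: multiplying the mean exon count by the mean exon length estimates how much of an average gene is exonic. This is a rough sketch only, since a product of means is an approximation, and the constant names are chosen for this example.

```python
# Mean values quoted in the gene-structure list above.
MEAN_EXONS_PER_GENE = 8.8
MEAN_EXON_LENGTH_BP = 145
MEAN_GENE_SIZE_BP = 27_000

# Rough estimate of exonic sequence in an average gene.
mean_exonic_bp = MEAN_EXONS_PER_GENE * MEAN_EXON_LENGTH_BP
exonic_fraction = mean_exonic_bp / MEAN_GENE_SIZE_BP

print(f"about {mean_exonic_bp:.0f} bp of exon per gene")
print(f"about {exonic_fraction:.1%} of an average gene is exon")
```

On these averages fewer than one in twenty bases of a typical gene is exonic, so most of the gene is intron.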
Identifying Genes in Sequence Data
- Predicting the start and end of genes, as well as the introns and exons in each gene, is one of the basic problems in computational biology.
- Gene prediction methods look for ORFs (open reading frames).
- These are (relatively long) DNA segments that start with the start codon, end with one of the stop codons, and do not contain any other in-frame stop codon in between.
- Splice site prediction has received a lot of attention in the literature.
- Comparative genomics
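
The ORF definition above translates fairly directly into a scan over the three forward reading frames. The following is a minimal sketch under simplifying assumptions: it scans the forward strand only, ignores introns, and uses a hypothetical `min_codons` threshold. Real gene finders do considerably more.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons too.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning
    return orfs


toy_dna = "GGATGAAACCCTAGTT"  # hypothetical sequence with one short ORF
for start, end, frame in find_orfs(toy_dna):
    print(frame, toy_dna[start:end])
```

Note that the `TGA` at positions 3 to 5 is not treated as a stop for this ORF because it lies in a different frame.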
RNA
- RNA is chemically similar to DNA. It is usually only a single strand. T(hymine) is replaced by U(racil).
- Some forms of RNA can form secondary structures by “pairing up” with themselves. This can change their properties dramatically.
- DNA and RNA can pair with each other.
- tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif

RNA, continued
Several types exist, classified by function:
- mRNA – this is what is usually being referred to when a bioinformatician says “RNA”. It carries a gene’s message out of the nucleus.
- tRNA – transfer RNA. Matches each mRNA codon to its amino acid, transferring the genetic information from mRNA to an amino acid sequence.
- rRNA – ribosomal RNA. Part of the ribosome, which is involved in translation.

Messenger RNA
- Basically, an intermediate product
- Transcribed from the genome and translated into protein
- Number of copies correlates well with number of proteins for the gene.
- Unlike DNA, the amount of messenger RNA (as well as the number of proteins) differs between different cell types and under different conditions.

Complementary base-pairing
- mRNA is transcribed from the DNA
- mRNA (like DNA, but unlike proteins) binds to its complement
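
Binding to a complement can be made concrete by computing the strand that would pair with a given mRNA. The sketch below is illustrative only. `reverse_complement` is a name chosen here, and the example reuses the short sequence from the transcription slides.

```python
# RNA base pairing: A pairs with U, C pairs with G.
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(rna):
    """Return the strand (written 5'->3') that pairs with `rna` (5'->3')."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna.upper()))


mrna = "GAUUACA"
partner = reverse_complement(mrna)
print(partner)  # UGUAAUC
```

The result is reversed because paired strands run in opposite directions. Taking the reverse complement twice returns the original sequence.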
Quantify mRNA levels
Proteins: Workhorses of the Cell
- Proteins are polypeptide chains of amino acids.
- 20 different amino acids: their different chemical properties cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.
- Proteins do all essential work for the cell:
  - build cellular structures
  - digest nutrients
  - execute metabolic functions
  - mediate information flow within a cell and among cellular communities
Genes Make Proteins
genome -> genes -> proteins (form cellular structures and carry out life functions) -> pathways & physiology
Genes Encode Proteins
The genetic code (rows: first letter; columns: second letter; last column: third letter)

1st   U                C         A          G          3rd
U     UUU Phe          UCU Ser   UAU Tyr    UGU Cys    U
U     UUC Phe          UCC Ser   UAC Tyr    UGC Cys    C
U     UUA Leu          UCA Ser   UAA STOP   UGA STOP   A
U     UUG Leu          UCG Ser   UAG STOP   UGG Trp    G
C     CUU Leu          CCU Pro   CAU His    CGU Arg    U
C     CUC Leu          CCC Pro   CAC His    CGC Arg    C
C     CUA Leu          CCA Pro   CAA Gln    CGA Arg    A
C     CUG Leu          CCG Pro   CAG Gln    CGG Arg    G
A     AUU Ile          ACU Thr   AAU Asn    AGU Ser    U
A     AUC Ile          ACC Thr   AAC Asn    AGC Ser    C
A     AUA Ile          ACA Thr   AAA Lys    AGA Arg    A
A     AUG Met (START)  ACG Thr   AAG Lys    AGG Arg    G
G     GUU Val          GCU Ala   GAU Asp    GGU Gly    U
G     GUC Val          GCC Ala   GAC Asp    GGC Gly    C
G     GUA Val          GCA Ala   GAA Glu    GGA Gly    A
G     GUG Val          GCG Ala   GAG Glu    GGG Gly    G

Abbreviations: Phe = Phenylalanine, Ser = Serine, Tyr = Tyrosine, Cys = Cysteine, Leu = Leucine, Trp = Tryptophan, Pro = Proline, His = Histidine, Arg = Arginine, Gln = Glutamine, Ile = Isoleucine, Thr = Threonine, Asn = Asparagine, Lys = Lysine, Met = Methionine, Val = Valine, Ala = Alanine, Asp = Aspartic acid, Glu = Glutamic acid, Gly = Glycine

Triplet → one amino acid
4^3 = 64 combinations mapped to 20 amino acids (plus STOP)
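
The codon table can be put to work on the Gal4 example from earlier. The sketch below builds the standard genetic code compactly and translates the first line of the Gal4 DNA shown above. The names `CODON_TABLE` and `translate` are made up for this example, and amino acids are written in the one-letter code with `*` for STOP.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate DNA or RNA from position 0 until a stop codon or the end."""
    rna = sequence.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # STOP codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# First line of the Gal4 DNA sequence shown earlier (60 bases, 20 codons).
gal4_dna_start = "ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAG"
print(translate(gal4_dna_start))
```

The output should match the first 20 residues of the Gal4 amino acid sequence listed earlier.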
Open Reading Frames
Sequence: G C U U G U U U A C G A A U U A G
The same sequence can be read in three frames:
Frame 1: GCU UGU UUA CGA AUU AG
Frame 2: G CUU GUU UAC GAA UUA G
Frame 3: GC UUG UUU ACG AAU UAG
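
Splitting a sequence into its three forward reading frames is a short exercise. This is a minimal sketch using the slide's sequence. `reading_frames` is a name chosen here, and incomplete codons at the end of a frame are simply dropped.

```python
def reading_frames(sequence):
    """Return the codons of the three forward reading frames."""
    frames = []
    for offset in range(3):
        codons = [
            sequence[i:i + 3]
            for i in range(offset, len(sequence) - 2, 3)
        ]
        frames.append(codons)
    return frames


rna = "GCUUGUUUACGAAUUAG"
for number, codons in enumerate(reading_frames(rna), start=1):
    print(f"Frame {number}: {' '.join(codons)}")
```

Only the third frame of this sequence ends in a stop codon (UAG), which shows why the choice of frame matters when looking for ORFs.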
Synonymous Mutation (A → G)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGU UUG CGA AUU AG    Ala Cys Leu Arg Ile

Missense Mutation (U → G)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGG UUA CGA AUU AG    Ala Trp Leu Arg Ile

Nonsense Mutation (U → A)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGA UUA CGA AUU AG    Ala STOP

Frameshift (one U deleted)
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutant:   GCU UGU UAC GAA UUA G     Ala Cys Tyr Glu Leu
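
The four mutation types above can be told apart mechanically by comparing translations. The sketch below is a simplified classifier written for this example, with made-up function names. It assumes translation starts at position 0, treats any length change that is not a multiple of three as a frameshift, and does not cover every real-world case.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna):
    """Translate from position 0; keep '*' if a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


def classify(original, mutant):
    """Label a mutation by its effect on the translated protein."""
    length_change = len(mutant) - len(original)
    if length_change % 3 != 0:
        return "frameshift"
    if length_change != 0:
        return "in-frame indel"
    before, after = translate(original), translate(mutant)
    if before == after:
        return "synonymous"
    if after.endswith("*") and len(after.rstrip("*")) < len(before.rstrip("*")):
        return "nonsense"  # premature stop codon
    return "missense"


ORIGINAL = "GCUUGUUUACGAAUUAG"
MUTANTS = [
    "GCUUGUUUGCGAAUUAG",  # A -> G in the third codon
    "GCUUGGUUACGAAUUAG",  # U -> G in the second codon
    "GCUUGAUUACGAAUUAG",  # U -> A in the second codon
    "GCUUGUUACGAAUUAG",   # one U deleted
]
for mutant in MUTANTS:
    print(mutant, classify(ORIGINAL, mutant))
```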
Protein Structure
- Proteins work together with other proteins or nucleic acids as "molecular machines":
  - structures fit together and function in highly specific, lock-and-key ways.
- Four levels of structure:
  - Primary structure: the sequence of the protein
  - Secondary structure: local structure in regions of the chain (alpha helix, beta sheet)
  - Tertiary structure: three-dimensional structure
  - Quaternary structure: multiple subunits

Assigning Function to Proteins
- While 25,000 genes have been identified in the human genome, relatively few have known functional annotation.
- Determining the function of a protein can be done in several ways:
  - Sequence similarity to other (known) proteins
  - Using domain information
  - Using three-dimensional structure
  - Based on high-throughput experiments (when it functions and what it interacts with)
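
The simplest form of sequence similarity is percent identity between two sequences that are already aligned. The sketch below is only a toy: it assumes equal-length, gap-free sequences, the query sequence is hypothetical, and real tools first compute an alignment, which this code does not attempt.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions in two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


known = "MKLLSSIEQA"      # first 10 residues of the Gal4 sequence shown earlier
candidate = "MKLLSSVEQA"  # hypothetical query with one substitution

print(f"{percent_identity(known, candidate):.1f}% identical")
```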
Summary: DNA (Gene) → RNA → Protein
- Replication: DNA → DNA
- Transcription: DNA → RNA
- Translation: RNA → protein
Biological pathway/gene networks
- Instead of having brains, cells make decisions through complex networks of interactions, called pathways:
  - Synthesize new materials
  - Break other materials down for spare parts
  - Signal to eat or die
- In order to fulfill their function, proteins interact with other proteins in a number of ways, including:
  - Regulation
  - Signaling pathways, for example A -> B -> C
  - Post-translational modifications
  - Forming protein complexes
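
A pathway such as A -> B -> C is naturally represented as a directed graph. The following sketch is illustrative only: the node names and the `PATHWAY` dictionary are hypothetical. It uses a breadth-first search to list everything downstream of a given protein.

```python
from collections import deque

# Hypothetical signaling pathway: each key activates the listed targets.
PATHWAY = {"A": ["B"], "B": ["C"], "C": []}


def downstream(pathway, source):
    """Return every node reachable from `source`, in breadth-first order."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in pathway.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(downstream(PATHWAY, "A"))  # ['B', 'C']
```

The same structure extends to branching networks by listing more than one target per node.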
Systems Biology
- We now have many sources of data, each providing a different view on the activity in the cell:
  - Sequence (genes)
  - DNA motifs
  - Gene expression
  - Protein interactions
  - Protein-DNA interaction
  - Etc.
- Putting it all together: Systems Biology

Next week
- Introduction to R programming
- You need to do in-class exercises

Acknowledgments
- Ziv Bar-Joseph: for some of the slides adapted or modified from his lecture slides at Carnegie Mellon University
- Neil Jones: for some of the slides adapted or modified from his slides for the book An Introduction to Bioinformatics Algorithms
