Why teach a course in bioinformatics?


Introduction to Molecular Biology
G-C and A-T pairing.
A & G = Purines
C & T = Pyrimidines
Important terms:
• Nucleotide Pair = Base pair (bp)
• 1000 base pairs = 1 kilobase pairs (kb)
• 1,000,000 base pairs = 1 megabase pairs (Mb)
• 1,000,000,000 base pairs = 1 gigabase pairs (Gb)
Double-stranded DNA is peeled apart to replicate DNA
• The 2 daughter molecules are identical to each other and exact duplicates of the original (assuming error-free replication).
• One chromosome is one long, twisted, dramatically compacted DNA molecule.
• The average length of a human chromosome is 130 million bp.
Genes are defined segments of DNA
• The information content of the DNA molecule consists of the order of bases (A, C, G, and T) along the length of the molecule.
Nucleic Acids: DNA vs. RNA
• RNA is quite similar to DNA, but usually single-stranded. Both are nucleic acids.
• In RNA, “U” replaces “T”.
Important Concepts
• DNA and RNA have polarity: each strand has a 5’ end and a 3’ end. (The 2 strands of DNA are antiparallel.)
• The common convention is to list only one strand of DNA, in the 5’ to 3’ direction:
5’ AGTCGTAGTCGTAGTCGTAGTCTG 3’
(3’ TCAGCATCAGCATCAGCATCAGAC 5’)
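The pairing rules and the 5’ to 3’ convention above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names are made up for the example.

```python
# Complement map for the four DNA bases (A pairs with T, G pairs with C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Partner strand, base by base, in the same left-to-right order (so it reads 3' to 5')."""
    return strand.upper().translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """Partner strand written in the conventional 5' to 3' direction."""
    return complement(strand)[::-1]

top = "AGTCGTAGTCGTAGTCGTAGTCTG"   # the 5' to 3' strand from the slide
print(complement(top))             # the bracketed 3' to 5' strand from the slide
print(reverse_complement(top))     # the same partner strand, rewritten 5' to 3'
```

Because the two strands are antiparallel, the partner strand has to be reversed before it can be listed in the usual 5’ to 3’ direction.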
How Genes are Expressed: the Central Dogma.
Transcription = RNA synthesis
Translation = Protein synthesis
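Eukaryotic transcription operates ‘gene by gene’.
For a given gene, only one strand of DNA is used as the template (the antisense, or noncoding, strand). The RNA produced has the same sequence as the other strand (the sense, or coding, strand), with U in place of T. Which strand serves as the template varies from gene to gene.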
Transcription produces an RNA ‘copy’ of a gene (DNA)
Important Term:
• Transcription = RNA synthesis
• Quiz question: how does the sequence of an mRNA compare to the sequence of the noncoding strand of DNA?
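One way to explore the quiz question is to build the mRNA from each strand in code. The sketch below is illustrative only: it assumes the noncoding strand is the template that RNA polymerase reads, and the function names and the 6-base example are hypothetical.

```python
def transcribe_coding(coding_strand: str) -> str:
    """mRNA read off the coding (sense) strand: same sequence, U in place of T."""
    return coding_strand.upper().replace("T", "U")

def transcribe_template(template_5to3: str) -> str:
    """mRNA (5' to 3') built by base-pairing against a template strand given 5' to 3'."""
    pairs = {"A": "U", "T": "A", "G": "C", "C": "G"}
    # The template is read 3' to 5', so walk it in reverse.
    return "".join(pairs[base] for base in reversed(template_5to3.upper()))

coding = "ATGCCC"     # toy coding strand, 5' to 3'
template = "GGGCAT"   # its reverse complement, 5' to 3'
print(transcribe_coding(coding))
print(transcribe_template(template))
```

Both calls give the same mRNA, which is complementary (and antiparallel) to the noncoding strand.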
mRNAs are translated in the cytoplasm
Three consecutive bases in the mRNA form one codon
No exceptions: the genetic code is a triplet code.
tRNA are the ‘bilingual’ molecules
The genetic code is the codon-amino acid conversion table
http://academy.d20.co.edu/kadets/lundberg/DNA_animations/protein.mov
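The codon to amino acid conversion table can be written out as a lookup. The sketch below assumes the standard genetic code and stops at the first Stop codon; the `translate` function and the short example mRNA are illustrative, not from the slides.

```python
from itertools import product

# Standard genetic code, built in U, C, A, G order ("*" marks a Stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Read an mRNA three bases at a time and stop at the first Stop codon."""
    mrna = mrna.upper().replace("T", "U")   # tolerate DNA-style input
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGCCCAAGUAA"))   # AUG CCC AAG UAA
```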
The immediate product of translation is the primary protein structure
The primary sequence dictates the secondary and tertiary structure of the protein
Important Term:
• Translation = Protein synthesis
There are 2 basic types of genes:
• Protein-coding genes:
(DNA → mRNA → protein)
• RNA-specifying genes:
(DNA → tRNA)
(DNA → rRNA)
(DNA → small RNA)
Genetic information, stored in DNA, is conveyed as proteins
Protein sequences are also represented linearly.
• Each of the 20 amino acids can be represented by a 3-letter code:
Ser Tyr Met Glu His
• In bioinformatics, each of the 20 amino acids is commonly represented by a 1-letter code:
MDETSGHLKPWECVGH . . .
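Converting between the two notations is a simple lookup. The sketch below uses the standard 3-letter and 1-letter codes for the 20 amino acids; the function names are illustrative.

```python
# Standard 3-letter to 1-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}

def to_one_letter(three_letter_seq: str) -> str:
    """'Ser Tyr Met' -> 'SYM' (residues separated by whitespace)."""
    return "".join(THREE_TO_ONE[res.capitalize()] for res in three_letter_seq.split())

def to_three_letter(one_letter_seq: str) -> str:
    """'SYM' -> 'Ser Tyr Met'."""
    return " ".join(ONE_TO_THREE[res] for res in one_letter_seq.upper())

print(to_one_letter("Ser Tyr Met Glu His"))
print(to_three_letter("MDETSGH"))
```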
Genetic information, stored in DNA, is conveyed as proteins
In sickle-cell anemia, one nucleotide change is responsible for the one amino acid change.
Sickle-cell anemia is caused by one amino acid change.
A single base-pair mutation is often the cause of a human genetic disease.
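The effect of a single-base change can be shown with a toy example. The slides do not give the actual codons; the sketch below assumes the commonly cited sickle-cell change, GAG (Glu) to GTG (Val), and uses a short illustrative DNA fragment rather than the real gene sequence. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code on DNA codons ("*" marks a Stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate a coding-strand fragment codon by codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def point_mutate(dna: str, position: int, new_base: str) -> str:
    """Replace the base at a 0-based position."""
    return dna[:position] + new_base + dna[position + 1:]

normal = "CCTGAGGAG"                     # illustrative fragment: Pro-Glu-Glu
mutant = point_mutate(normal, 4, "T")    # assumed change: GAG -> GTG
print(translate(normal), translate(mutant))
```

One changed base in the middle codon swaps a single amino acid and leaves the rest of the fragment untouched.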
How to find a gene?*
• One way is to search for an open reading frame (ORF).
• An ORF is a sequence of codons in DNA that starts with a Start codon, ends with a Stop codon, and has no other Stop codons inside.
* = inexact science
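The ORF definition above translates almost directly into a scan. This is a minimal sketch, assuming ATG as the Start codon and TAA, TAG, TGA as the Stop codons; it looks only at the strand it is given, and `find_orfs` and `min_codons` are names invented for the example.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2):
    """Return (frame, start, end, sequence) for each ORF on the given strand.

    start and end are 0-based, end-exclusive; frame is 1, 2 or 3.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open an ORF at the first Start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3, dna[start:i + 3]))
                start = None                   # the first Stop codon closes the ORF
    return orfs

seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgagtaa"
for orf in find_orfs(seq):
    print(orf)
```

To cover the other strand, the same function can be run on the reverse complement of the sequence. A real gene finder would also weigh ORF length and nearby features, which is why the slide calls this an inexact science.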
Each strand has 3 possible reading frames.
5' atgcccaagctgaatagcgtagaggggttttcatcatttgagtaa 3'
Frame 1: atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag taa
          M   P   K   L   N   S   V   E   G   F   S   S   F   E   *
Frame 2:  tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agt
           C   P   S   *   I   A   *   R   G   F   H   H   L   S
Frame 3:   gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gta
            A   Q   A   E   *   R   R   G   V   F   I   I   *   V
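The three-frame reading of the example sequence can be reproduced in code. The sketch below assumes the standard genetic code and translates straight through Stop codons (shown as `*`); `translate_frame` is an illustrative name.

```python
from itertools import product

# Standard genetic code on DNA codons ("*" marks a Stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_frame(dna: str, frame: int) -> str:
    """Translate every complete codon starting at offset frame (0, 1 or 2)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))

seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgagtaa"
for frame in range(3):
    print(frame + 1, translate_frame(seq, frame))
```

Only frame 1 runs from a Start codon to a single Stop codon at the end; frames 2 and 3 are interrupted by internal Stop codons. Running the same loop on the reverse complement would give the other strand's three frames.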
Eukaryotic Genomes
• Finding a gene is much more difficult in eukaryotic genomes than in prokaryotic genomes.
WHY??
Prokaryotic (bacterial) genomes:
• Are much smaller than eukaryotic genomes
E. coli = 4,639,221 bp (4.6 Mb)
Human = ~3,300 Mb
• Contain a small amount of noncoding DNA
E. coli = ~11%
Human = >95%
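The figures above can be turned into a rough back-of-the-envelope comparison. The sketch uses only the numbers on the slide; treating "> 95%" as exactly 95% is an assumption, so the human result is an upper bound on the coding portion.

```python
ECOLI_BP = 4_639_221            # E. coli genome size from the slide
HUMAN_BP = 3_300 * 1_000_000    # ~3,300 Mb from the slide

def coding_bp(genome_bp: int, noncoding_fraction: float) -> float:
    """Bases left after removing the noncoding fraction."""
    return genome_bp * (1 - noncoding_fraction)

print(round(HUMAN_BP / ECOLI_BP))          # how many times larger the human genome is
print(round(coding_bp(ECOLI_BP, 0.11)))    # E. coli, ~11% noncoding
print(round(coding_bp(HUMAN_BP, 0.95)))    # human, assuming exactly 95% noncoding
```

The human genome is several hundred times larger, yet most of that extra DNA is noncoding, which is part of why gene finding is harder in eukaryotes.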
Eukaryotic transcripts (mRNA) are processed and leave the nucleus
Exon = Genetic code
Intron = Non-essential DNA??
• The mechanism of splicing is not well understood.
Alternative splice sites generate various protein isoforms (HGP estimate = 35%)
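Splicing and exon skipping can be mimicked by joining exon intervals. The transcript and coordinates below are made up for illustration (three 6-base exons separated by two 12-base introns), and `splice` is a hypothetical helper, not a model of the real splicing machinery.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) and drop everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy pre-mRNA: exon1 + intron1 + exon2 + intron2 + exon3 (hypothetical sequence).
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "AAAGGG" + "GUAAGUCCCCAG" + "UUUUAA"
exons = [(0, 6), (18, 24), (36, 42)]

full_length = splice(pre_mrna, exons)                   # all three exons
exon2_skipped = splice(pre_mrna, [exons[0], exons[2]])  # one alternative isoform
print(full_length)
print(exon2_skipped)
```

The same gene yields two different mature mRNAs, and therefore two protein isoforms, depending on which exons are kept.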
Variable mutation rate?
• Most mutations in introns and intergenic DNA are (apparently) harmless.
• Consequently, intron and intergenic DNA sequences diverge much more quickly than exons.
Bacterial cells are different:
• Prokaryotic cells: no splicing (i.e., no split genes)
• Eukaryotic cells: intronless genes are rare (avg. # of introns in HG is 3-7, highest # is 234); dystrophin gene is > 2.4 Mb.
How to confirm the identification of a gene?
• Possible answer: identify the gene by identifying its promoter.
Promoters are DNA regions that control when genes are activated.
Exons encode the information that determines what product will be produced.
Promoters encode the information that determines when the protein will be produced.
Nucleotides of a particular gene are often numbered:
Demonstration of a consensus sequence.
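A consensus sequence can be computed by taking the most common base in each column of a set of aligned sequences. The sequences below are invented for the example and are not the ones shown on the slide; `consensus` is an illustrative name, and ties are broken arbitrarily (by first occurrence).

```python
from collections import Counter

def consensus(aligned):
    """Most common base at each column of equal-length, already aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must all have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))

# Hypothetical aligned sites, one per line.
sites = [
    "TATAAT",
    "TATGAT",
    "TACAAT",
    "TATAAT",
    "GATAAT",
]
print(consensus(sites))
```

No single site has to match the consensus exactly; it summarizes what most sites have at each position.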
How to find a gene?
• Look for a substantial ORF and associated ‘features’.
• Two nucleic acids that are exact complements of each other will hybridize.
• Two nucleic acids that are mostly complementary (some mismatches) will hybridize under the right conditions.
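The idea of exact versus mostly complementary strands can be sketched as a mismatch count. This is a simplification: `max_mismatches` is a hypothetical stand-in for "the right conditions", and real hybridization also depends on factors such as temperature and salt, which are not modeled here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mismatches(strand_a: str, strand_b: str) -> int:
    """Count unpaired positions when two equal-length 5' to 3' strands are aligned antiparallel."""
    if len(strand_a) != len(strand_b):
        raise ValueError("this sketch only compares strands of equal length")
    # The sequence strand_b would pair with perfectly, written 5' to 3'.
    perfect_partner = strand_b.upper().translate(COMPLEMENT)[::-1]
    return sum(x != y for x, y in zip(strand_a.upper(), perfect_partner))

def could_hybridize(strand_a: str, strand_b: str, max_mismatches: int = 0) -> bool:
    """True if the strands pair with no more than the tolerated number of mismatches."""
    return mismatches(strand_a, strand_b) <= max_mismatches

probe = "AGTCGTAG"
exact = "CTACGACT"     # exact reverse complement of the probe
near = "CTACGTCT"      # one base changed
print(could_hybridize(probe, exact))
print(could_hybridize(probe, near))
print(could_hybridize(probe, near, max_mismatches=1))
```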
Recombinant DNA techniques?
• Many popular tools of recDNA rely on the principle of DNA hybridization.
• In large mixes of DNA molecules, complementary sequences will pair.
Hybridization ‘in silico’
• Algorithms have been written that compare two nucleic acid sequences. Two similar DNA sequences (ones that would hybridize in solution) are said ‘to match’ when the software determines that their similarity is significant.
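A very small version of such a comparison slides a short query along a longer sequence and reports the best ungapped window by percent identity. This is only a sketch: real similarity-search tools also allow gaps and attach a statistical significance to each match, neither of which is attempted here. The function names and example sequences are illustrative.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that are identical in two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)

def best_ungapped_hit(query: str, target: str):
    """Slide query along target; return (offset, percent identity) of the best window."""
    best_offset, best_identity = 0, 0.0
    for offset in range(len(target) - len(query) + 1):
        identity = percent_identity(query, target[offset:offset + len(query)])
        if identity > best_identity:
            best_offset, best_identity = offset, identity
    return best_offset, best_identity

offset, identity = best_ungapped_hit("GATTACA", "CCGATTTCACC")
print(offset, round(identity, 1))
```

Whether a given percent identity counts as a ‘match’ depends on a threshold that the software, or the user, has to choose.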
Protein-protein similarity searches?
• Many algorithms have been designed to compare strings of amino acids (single-letter amino acid code) and find those of a defined degree of similarity.
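One classic way to compare two amino acid strings is a global alignment scored by dynamic programming (the Needleman-Wunsch approach). The sketch below is an assumption-laden toy: it scores +1 for a match, -1 for a mismatch and -2 per gap, whereas practical protein searches use substitution matrices that score each amino acid pair differently. The slides do not name a specific algorithm.

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of two sequences under a simple scoring scheme."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("MDETSGH", "MDETSGH"))   # identical sequences
print(global_alignment_score("MDETSGH", "MDESGH"))    # one residue missing
```

A higher score means the two strings can be lined up with more matches and fewer gaps; a search tool would report the pairs whose score passes a chosen cutoff.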
Significance of sequence similarity
• DNA similarity suggests:
• Similar function
• Similar structure
• Evolutionary relationship
The End