DataSci_MolBioBckgrndx
Things that may help with comprehension of bioinformatics issues in general and Rosalind problems in particular
Problems
1. Counting DNA nucleotides
2. Transcribing DNA into RNA
3. Complementing a strand of DNA
4. Rabbits and Recurrence Relations
• Population growth model
5. GC Content
Problems
1 Counting DNA nucleotides
• Intro to nucleotides
– What are they? Put together in genome as a string
– Some cool features of genome structure
2 Transcribing DNA into RNA
• The protein coding parts
– The central Dogma
– Codons and other structural elements of genes
3 Complementing a strand of DNA
• Issues of complementarity
4 Rabbits and Recurrence Relations
• Population growth model
5 GC Content
• Signatures of different parts of genomes and differences among genomes
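The outline only names the population growth model, so the following is a minimal sketch under an assumption: that the model is the usual rabbit recurrence, in which every pair takes one month to mature and each mature pair then produces k new pairs per month, giving F(n) = F(n-1) + k·F(n-2). The function name `rabbit_pairs` and its parameters are illustrative, not taken from the slides.

```python
def rabbit_pairs(n: int, k: int = 1) -> int:
    """Rabbit pairs alive in month n, assuming F(n) = F(n-1) + k * F(n-2).

    Assumption: start with one newborn pair, pairs mature after one month,
    every mature pair yields k new pairs per month, and no rabbits die.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    prev, curr = 1, 1  # months 1 and 2: a single pair (newborn, then mature)
    for _ in range(n - 2):
        # Next month: everyone alive now, plus k offspring per pair alive last month.
        prev, curr = curr, curr + k * prev
    return curr


print(rabbit_pairs(5, 3))  # 19
```

With k = 1 this reduces to the Fibonacci numbers; keeping only the last two values avoids the exponential cost of naive recursion.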
This course pays a bit of extra attention to data applications in the life
sciences, such as DNA sequencing.
Bioinformatics
• “the science of collecting and analyzing complex
biological data such as genetic codes.”
• “conceptualizing biology in terms of macromolecules
(in the sense of physical-chemistry) and then applying
"informatics" techniques (derived from disciplines such
as applied maths, computer science, and statistics) to
understand and organize the information associated
with these molecules, on a large-scale.”
http://www.ncbi.nlm.nih.gov/news/01-23-2015-genbank-trillion-bases/
http://www.nature.com/scitable/resource?action=showFullImageForTopic&imgSrc=/scitable/content/ne0000/ne0000/ne0000/ne0000/78429/Databases_Fig1_FULL.jpg
Characteristics of DNA relevant for
bioinformatics
• A linear string
• Exists as double stranded helix
– Each strand has directionality
– Rules for pairing
• Genome has different regions
– Protein coding
• Genetic code
– RNA coding
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
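The records above are in FASTA format: a header line starting with ">" followed by one or more lines of sequence. As a rough sketch of how such data might be read and its GC content computed (the helper names `parse_fasta` and `gc_content` are illustrative, and the parser assumes well-formed input):

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its sequence, joined across lines."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def gc_content(seq: str) -> float:
    """Percentage of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""

for name, seq in parse_fasta(sample).items():
    print(name, round(gc_content(seq), 6))
```

The record with the highest GC content can then be picked with `max(records, key=lambda n: gc_content(records[n]))`.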
RNA and DNA each contain 4 Nitrogenous Bases
Bases, sugars, and phosphates combine to form “nucleotides”.
RNA and DNA differ in the nature of the sugar molecule that they contain.
The sugar has 5 carbons; the 5’ and 3’ carbons give each strand its direction.
The building blocks of DNA (and RNA) are
Nucleotides (=nucleoside triphosphates)
Nucleotides are linked together in chains of RNA or DNA by phosphodiester (phosphate-sugar) bonds.
RNA and DNA differ in two ways:
1. the sugar molecule they use
2. one base: uracil in RNA, but thymine in DNA.
The other bases (adenine, guanine, and cytosine) occur in both molecules.
Hydrogen bonds connect A with T and G with C.
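These pairing rules are what make it possible to derive one strand from the other. A minimal sketch, assuming the input contains only the letters A, C, G, and T (the names `complement` and `reverse_complement` are illustrative):

```python
# Pairing rules as a translation table: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base partner of seq, written in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Partner strand written 5' to 3', given seq written 5' to 3'."""
    return complement(seq)[::-1]


print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

The reversal is needed because the two strands run in opposite directions: the complement read left to right is the partner strand written 3’ to 5’.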
Watson and Crick
Based on X-ray crystallography data from Franklin and Wilkins, W&C proposed a double-helix model of DNA.
Rosalind Franklin
THE CENTRAL DOGMA
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/central_dogma.html
Transcribing a gene in more detail:
Making sense of anti-sense
Sense strand: What we think of as the coding sequence for a gene. Its sequence matches the mRNA sequence, with T in place of U. Also called “plus strand” or “non-template strand”.
Anti-sense strand: The strand actually read by RNA polymerase to create the mRNA. It is read 3’ to 5’, anti-parallel to the RNA, which is built 5’ to 3’. Also called “minus strand” or “template strand”.
Problem solving
The DNA for a given gene reads as follows:
3’ TACGGTACTATC 5’
5’ ATGCCATGATAG 3’
The bottom strand is the coding (non-template) strand.
A. What should the newly synthesized RNA read?
5’ AUGCCAUGAUAG 3’
B. Which strand will RNA polymerase attach to, and in which direction will it read?
The top strand, reading 3’ to 5’
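The worked problem can be checked in code. The sketch below shows two equivalent routes to the same RNA: copying the coding strand with U in place of T, or pairing bases against the template strand as the polymerase does. The function names are illustrative, and both assume the strand is written in the orientation stated in the comments.

```python
def transcribe_coding(coding_5to3: str) -> str:
    """RNA (5' to 3') from the coding strand written 5' to 3': swap T for U."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_template(template_3to5: str) -> str:
    """RNA (5' to 3') from the template strand written 3' to 5'.

    Each template base is paired with its RNA partner: A-U, T-A, G-C, C-G.
    """
    pairs = str.maketrans("ACGT", "UGCA")
    return template_3to5.upper().translate(pairs)


coding = "ATGCCATGATAG"    # bottom strand, written 5' to 3'
template = "TACGGTACTATC"  # top strand, written 3' to 5'

print(transcribe_coding(coding))      # AUGCCAUGAUAG
print(transcribe_template(template))  # AUGCCAUGAUAG
```

No reversal is needed in `transcribe_template` because the template is already written 3’ to 5’, the direction in which it is read.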
THE CENTRAL DOGMA!
The ‘universal’ degenerate code
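“Degenerate” means that most amino acids are specified by more than one codon: 64 codons map to 20 amino acids plus stop signals. A minimal sketch, assuming the standard genetic code and a reading frame that starts at the first base (it does not search for a start codon); the names `CODON_TABLE` and `translate` are illustrative:

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
# One-letter amino acid codes; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate mRNA from its first base, stopping at the first stop codon."""
    protein = []
    mrna = mrna.upper()
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AUGCCAUGAUAG"))            # MP
print(Counter(CODON_TABLE.values())["L"])   # 6 codons encode leucine
```

The example mRNA is the one from the transcription problem: AUG (Met), CCA (Pro), then the stop codon UGA.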
Bioinformatics sites
• Translate DNA to protein
– http://web.expasy.org/translate/
• Search NCBI database to identify sequence
– http://blast.ncbi.nlm.nih.gov/blast/Blast.cgi
An unknown sequence for you to play
around with as you wish
GGCACGAGAAAAGACTAGTTGCTCACTGGAAAAAGTC
TAAAAATGAGGTTTCTCGTTGGAGCAGTATTAGTTGTTG
TGTTGGTGGCTTGTGCCACGGCATTCGAAAGTGATGCC
GAAACTTTTAAATCTCTTGTTGTAGAAGAAAGAAAATG
CCACGGAGATGGTTCCAAGGGCTGTGCCACAAAGCCT
GATGACTGGTGCTGCAAGAATACACCTTGCAAGTGCCC
CGCCTGGTCCTCCACAAGTGAGTGCAGGTGCGCAATG
GACTGCAGCCGAAGATGCAAAGGCAAACGAGCATTGT
TGTTGCCAGTTGAGACTCACCGACTACTCTTCCCTGAAC
AATGGTGAAGCCATTGACATCGATATATCATCTACTGTTA
TGTACTGTAAAAACAAATAAAGTTACTTATGCAGTAAAA
AAAAAAAAAAAAAAAAAAAAA
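One simple way to start playing with a sequence like this is to count its nucleotides. The sketch below uses only the first line of the sequence above as input; `count_nucleotides` is an illustrative name, and any character other than A, C, G, or T is ignored.

```python
from collections import Counter


def count_nucleotides(seq: str) -> dict[str, int]:
    """Number of A, C, G, and T in seq (case-insensitive)."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


# First line of the unknown sequence; paste in the rest to analyze all of it.
fragment = "GGCACGAGAAAAGACTAGTTGCTCACTGGAAAAAGTC"
print(count_nucleotides(fragment))
```

The same counts feed directly into a GC-content calculation: (G + C) divided by the total length.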