DNA sequence analysis
School B&I TCD Bioinformatics
May 2010
A, T/U, C, G
• Simple code, lots of sequence
• Sequence analysis
– Computer intensive
• BLAST homology searching
• Gene/exon prediction
• Multiple sequence alignment
• Alignments in general
– “Trivial”
Trivial
• Could be done by hand
– Computers
• Quicker
• More reliable
• Examples
– Translate DNA
– Restriction sites
– Synonymous codon usage
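As a rough illustration of why synonymous codon usage counts as "trivial", the sketch below tallies in-frame codons and reports how often each alanine codon is used. The function names and the example sequence are hypothetical, chosen only for this illustration.

```python
from collections import Counter

def codon_counts(dna):
    """Count in-frame codons, reading from the first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return Counter(dna[i:i + 3] for i in range(0, usable, 3))

def synonymous_usage(counts, codons):
    """Fraction of each codon among a set of synonymous codons."""
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {}
    return {c: counts[c] / total for c in codons}

# Hypothetical toy gene: Met, five Ala codons, stop.
toy_gene = "ATGGCTGCCGCTGCGGCTTAA"
ALA_CODONS = ("GCT", "GCC", "GCA", "GCG")

counts = codon_counts(toy_gene)
ala_usage = synonymous_usage(counts, ALA_CODONS)
for codon in ALA_CODONS:
    print(codon, counts[codon], round(ala_usage[codon], 2))
```

The same counts taken over a whole genome are the raw material for codon usage tables.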
Sequence formats
• Fasta Format
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
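A minimal sketch of reading this format: a line starting with ">" opens a record, and the following lines are joined into one sequence. The function name `parse_fasta` is an assumption for illustration, and only the first two sequence lines of the record are used.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header)
print(len(sequence))
```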
• Phylip Format
 4 131
IXI_234    TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235    TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236    TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237    TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT
CLUSTAL W(1.4) multiple sequence alignment
IXI_234    TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235    TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236    TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237    TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
• Interconvert: http://thr.cit.nih.gov/molbio/readseq/
DNA sequence analysis
• Google EMBOSS
– A suite of programs with the same look and feel
– Does pretty much everything you need
– Can be installed locally
Translation
• DNA is anti-parallel
– One strand, read 5'-3', pairs with the complementary strand, read 3'-5'
– Transcription and translation always proceed 5'-3'
• Six possible translations, three on each strand
• ATGCCCGCATTTGAATAA
• ATGCCCGCATTTGAATAA
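The six translations of the example sequence can be sketched as below, assuming the standard genetic code written in DNA letters (T instead of U). The helper names `translate`, `reverse_complement` and `six_frames` are illustrative, not taken from any particular package; "*" marks a stop codon.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, written 5'-3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

def six_frames(dna):
    """Translate three reading frames on each strand."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames

frames = six_frames("ATGCCCGCATTTGAATAA")
for name, protein in frames.items():
    print(name, protein)
```

Frame +1 reads MPAFE*, an open reading frame that ends at the TAA stop. Shifting by two bases (frame +3) gives ARI*I, with a stop codon in the middle.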
Frameshift errors
• ATGCCCGCATTTGAATAA Frameshift mutations
• Stop codons underlined
Genetic code
The “Universal” Genetic Code.
Phe UUU UUC
Leu UUA UUG
Ser UCU UCC UCA UCG
Tyr UAU UAC
ter UAA UAG
Cys UGU UGC
ter UGA
Trp UGG
Leu CUU CUC CUA CUG
Pro CCU CCC CCA CCG
His CAU CAC
Gln CAA CAG
Arg CGU CGC CGA CGG
Ile AUU AUC AUA
Met AUG
Thr ACU ACC ACA ACG
Asn AAU AAC
Lys AAA AAG
Ser AGU AGC
Arg AGA AGG
Val GUU GUC GUA GUG
Ala GCU GCC GCA GCG
Asp GAU GAC
Glu GAA GAG
Gly GGU GGC GGA GGG
Exceptions to the code
• #1: Yeast Mitochondrial Code: CUN=T AUA=M UGA=W
• #2: Mitochondrial Code of Vertebrates: AGR=* AUA=M UGA=W
• #3: Mitochondrial Code of Filamentous fungi: UGA=W
• #4: Mitochondrial Code of Insects and platyhelminths: AUA=M UGA=W AGR=S
• #5: Nuclear Code of Candida cylindracea: CUG=S (*)
• #6: Nuclear Code of Ciliata: UAR=Q
• #7: Nuclear Code of Euplotes: UGA=C
• #8: Mitochondrial Code of Echinoderms: UGA=W AGR=S AAA=N
• #9: Mitochondrial Code of Ascidiacea: UGA=W AGR=G AUA=M
• #10: Mitochondrial Code of Platyhelminthes: UGA=W AGR=S UAA=Y AAA=N
• #11: Nuclear Code of Blepharisma: UAG=Q
(*) See Nature 341:164.
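One way to handle these exceptions in software is to start from the standard table and override only the codons that differ. The sketch below does this for exception #2 (vertebrate mitochondria), with codons written in DNA letters (T for U, R expanded to A and G). The names `make_code` and `translate`, and the test sequence, are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def make_code(overrides=None):
    """Copy the standard code and apply codon reassignments."""
    table = dict(STANDARD)
    table.update(overrides or {})
    return table

# Exception #2 above: AGR=* AUA=M UGA=W
VERTEBRATE_MITO = make_code({"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"})

def translate(dna, table):
    """Translate in frame 1 with the given codon table."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

# Hypothetical sequence containing each reassigned codon: ATG ATA TGA AGA TAA
toy = "ATGATATGAAGATAA"
standard_protein = translate(toy, STANDARD)
mito_protein = translate(toy, VERTEBRATE_MITO)
print(standard_protein, mito_protein)
```

The same five codons read MI*R* under the standard code and MMW** under the vertebrate mitochondrial code, so choosing the wrong table changes both the residues and where the protein appears to end.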
Start codons
• ATG is the “universal” start codon … but
• About 10% of E. coli genes start with GTG
• About 1% start with TTG
• Bioinformaticians only make predictions
• Molecular biologists verify
Restriction sites
• Essential for the construction of plasmids
• A key tool for molecular biology
• Hundreds available commercially
– Need to decide which to order
– Costs range from $3.80 to $500 per 1000 units
BamHI
5' G^GATCC 3'
3' CCTAG^G 5'
EcoRI
5' G^AATTC 3'
3' CTTAA^G 5'
Blunt end: AluI
5' AG^CT 3'
3' TC^GA 5'
(^ marks the cut site)
• http://tools.neb.com/NEBcutter2/index.php
• Usually need an enzyme that cuts once
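Finding an enzyme that cuts once is a simple pattern search. The sketch below, using only the three recognition sites shown above, lists where each site occurs and which enzymes cut exactly once. All three sites are palindromic, so scanning one strand is enough. The function names and the test sequence are hypothetical, and this is not how NEBcutter itself works.

```python
import re

# Recognition sites from the examples above (5'-3').
SITES = {"BamHI": "GGATCC", "EcoRI": "GAATTC", "AluI": "AGCT"}

def find_sites(dna, site):
    """Return 0-based start positions of every occurrence of a site."""
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer(f"(?={site})", dna.upper())]

def single_cutters(dna, sites=SITES):
    """Names of enzymes whose site occurs exactly once in the sequence."""
    return sorted(name for name, site in sites.items()
                  if len(find_sites(dna, site)) == 1)

# Hypothetical insert: one BamHI site, two AluI sites, no EcoRI site.
insert = "TTGGATCCAAAGCTTTAGCTGG"
for name, site in SITES.items():
    print(name, find_sites(insert, site))
print("Cut once:", single_cutters(insert))
```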
Promoter Prediction
• To find the start of a transcript (97% of the human genome is non-coding)
• False positive rate is too high
– Predicted promoters: 1 per kb; gene density: 1 per 100 kb
• RNA pol II transcribes DNA into RNA
– Needs general transcription factors (GTFs)
• Also needs specific (species, tissue, developmental stage) TFs
• TF binding sites are short and “fuzzy”
• 7% of vertebrate genes encode TFs
Promoters 2
NF-AT4 matrix (3 known sites) and consensus:
Position  1 2 3 4 5 6 7 8
A         0 0 3 3 3 0 0 1
C         1 2 0 0 0 0 0 2
G         0 0 0 0 0 1 1 0
T         2 1 0 0 0 2 2 0
Most frequent base at each position: TCAAATTC
Consensus YYAAAKKM = [CT](2)AAA[GT](2)[AC]
Predicts five sites in the 3 kb upstream of human IL-11:
Bp 007 TTAAAGGC
Bp 248 ACAAATTC
Bp1959 GAGTTTGA
Bp2154 TCAAAGGA
Bp2181 GACTTTTA
Ask whether a TF site relevant to your cell type is present.
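A minimal sketch of how a count matrix like the NF-AT4 one can be used: derive the IUPAC consensus from the bases seen at each position, and score a candidate site by summing the counts of its bases. The scoring scheme (a plain sum of counts) and the names `consensus`, `score` and `scan` are assumptions for illustration. Real promoter tools use more elaborate weighting.

```python
# Count matrix from the three known NF-AT4 sites (columns = positions 1-8).
COUNTS = {
    "A": [0, 0, 3, 3, 3, 0, 0, 1],
    "C": [1, 2, 0, 0, 0, 0, 0, 2],
    "G": [0, 0, 0, 0, 0, 1, 1, 0],
    "T": [2, 1, 0, 0, 0, 2, 2, 0],
}
WIDTH = 8
# IUPAC codes for one or two observed bases; anything else becomes N.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus():
    """IUPAC consensus built from the bases with a non-zero count."""
    letters = []
    for i in range(WIDTH):
        seen = frozenset(b for b in "ACGT" if COUNTS[b][i] > 0)
        letters.append(IUPAC.get(seen, "N"))
    return "".join(letters)

def score(site):
    """Sum of matrix counts for the bases of an 8-bp site."""
    return sum(COUNTS[base][i] for i, base in enumerate(site.upper()))

def scan(dna, threshold):
    """0-based positions of windows scoring at least the threshold."""
    dna = dna.upper()
    return [i for i in range(len(dna) - WIDTH + 1)
            if score(dna[i:i + WIDTH]) >= threshold]

print(consensus())
for site in ("TCAAATTC", "TTAAAGGC", "ACAAATTC"):
    print(site, score(site))
```

With only three known sites the scores separate candidates poorly, which is one reason short, fuzzy binding sites produce so many false positives.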
Primer design
• You will be asked to design primers for sequencing, PCR etc.
• Manual pages cover this
• Computationally trivial, so there is a wide choice of websites
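To show how simple the basic primer checks are, the sketch below computes GC content and a rough melting temperature with the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees C. The rule is only a rough guide for short oligonucleotides, and using the 18-base example sequence from the translation slides as the "primer" is purely illustrative.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primer = "ATGCCCGCATTTGAATAA"
tm = wallace_tm(primer)
gc = gc_fraction(primer)
print(f"length={len(primer)} GC={gc:.2f} Tm~{tm} C")
```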
Not-trivial
• Nucleic acid (NA) secondary structure
– EMBOSS einverted for short palindromes
– mFOLD
• Huge database of 16S rRNA structures
• miRNA sites
Secondary Structure
• DNA (and RNA) can form base-pairs.
• Not all of these pair with a complementary strand; a single strand can also pair with itself.
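A toy illustration of within-strand pairing: the sketch below looks for a short stem whose reverse complement appears a few bases downstream, which is the signature of a possible hairpin. It uses exact matching only and is not the algorithm of einverted or mFOLD. The function names, parameters and test sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]

def find_hairpins(dna, stem=5, min_loop=3, max_loop=8):
    """Return (start, stem_sequence, loop_length) for exact inverted repeats."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2 * stem - min_loop + 1):
        left = dna[i:i + stem]
        target = reverse_complement(left)  # what the 3' arm must read
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if dna[j:j + stem] == target:
                hits.append((i, left, loop))
    return hits

# Hypothetical sequence: stem GCACG, loop AAAA, then the pairing arm CGTGC.
toy = "TTGCACGAAAACGTGCTT"
hits = find_hairpins(toy)
print(hits)
```

Real folding programs also allow mismatches and bulges and rank structures by free energy, which is why the problem is far from trivial.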
Bioinformatic view = a cartoon
Closer to reality
16S rRNA: Gram -ve and Gram +ve
Evolutionary consequences? Coordinated/dependent mutational change
RDP
• Ribosomal Database Project-II Release 9 Notes
• RDP Release 9.42 (Release 9, update 42) consists of 262,030 aligned and annotated 16S rRNA sequences, along with five online analysis tools.