No Slide Title
Download
Report
Transcript No Slide Title
An Introduction to Bioinformatics
Sequence information and file formats
AIMS
To understand the conventions regarding the presentation of
DNA and protein sequence information
To understand the logic underlying these conventions
To become familiar with the commonly used sequence file
formats
To become familiar with the READSEQ programme for the
interconversion of file formats
OBJECTIVES
Present a nucleotide or protein sequence according to
accepted conventions
Recognize different sequence files formats
Interconvert files between formats
INTRODUCTION
Virtually all the information one deals with in computational
molecular biology is either in the form of DNA or protein
sequences
There are conventions applying to the presentation/storage of
sequence information
The way in which sequence information is stored, retrieved and
manipulated varies
There are different computer file types for sequence information
DNA
The DNA of living organisms is normally
double stranded
It is the convention to show only one strand
of the DNA
Which strand do you show?
Which way round do you show it?
The orientation of a DNA strand is determined by which end
has a 5'-phosphate group and which has a 3'-hydroxyl group
It is usually the case that either strand can be the template
(or coding) strand at any particular point
Given that the two strands are anti-parallel, the genes on the
two strands will face in opposite directions
RNA polymerase in all organisms moves along the template
strand of the DNA in the 3'-5' direction producing RNA that
grows in the 5'-3' direction
The RNA sequence will be identical to that of the nontemplate strand, except for the presence of uracil instead
of thymine
3’ GGCATAGCAGGTACGTTATGCCAGCATTG 5’ template
5’ CCGTATCGTCCATGCAATACGGTCGTAAC 3’ non-template
5’ CCGUAUCGUCCAUGCAAUACGGUCGUAAC 3’ mRNA
The convention is to show the non-template strand of the DNA
because it resembles the RNA
For purely cultural reasons the sequence is shown running
from right to left on the page, with the 5' end of the sequence
on the right
Sometimes when sequencing projects are in the draft state
there are still ambiguities in the sequence
IUPAC have defined a standard table for the nucleotide
ambiguity codes
R = A or G
K = G or T
S = G or C
Y = C or T
M = A or C
W = A or T
B = not A
H = not G
D = not C
V = not T
N = any
PROTEIN
Polypeptides have a polarity, with an N-terminal and
C-terminal ends possessing a free amino group and
carboxyl group respectively
A polypeptide is presented with its N-terminus on the left of
and its C-terminus on the right.
H2N-Methionine-Valine-Tyrosine-Glycine-Isoleucine-Lysine-COOH
To keep the polypeptide information in a form that can be
conveniently handled by computers the amino acids are
each given a single letter code
Myoglobin
Glycine G
Isoleucine I
Alanine A
Cysteine C
Tryptophan W
Arginine R
Phenylalanine F Threonine T
Proline P
Lysine K
Valine V
Tyrosine Y
Methionine M
Aspartate D
Histidine H
Leucine L
Serine S
Asparagine N
Glutamate G
Glutamine Q
H2N-Methionine-Valine-Tyrosine-Glycine-Isoleucine-Lysine-COOH
MVYGIK
FILE TYPES
Many software packages have been developed for the
analysis of DNA and protein sequences
A variety of different file formats have been developed to
store/analyse DNA and protein sequence information
The various software packages will usually only accept a
specific file format
The situation is made worse by the fact that different
databases hold the information in different file formats
An essential skill is be able to recognize the different formats
and to be able to interconvert files between formats
Common File Formats
IG/Stanford
Fitch
Plain/Raw
GenBank/GB
Fasta/Pearson
PIR/CODATA
NBRF
Zuker
MSF
EMBL
Olsen
ASN 1.8
GCG
Phylip 3.2
PAUP/NEXUS
DNAStrider
Phylip
Pretty
PLAIN SEQUENCE FORMAT
A sequence in plain format may contain only IUPAC characters and
spaces (no numbers!).
Note: A file in plain sequence format may only contain one sequence,
while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA
CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC
CGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTG
CCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC
TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
FASTA FORMAT
A sequence file in FASTA format can contain several sequences.
One sequence in FASTA format begins with a single-line description, followed by
lines of sequence data. The description line must begin with a greater-than (">")
symbol in the first column.
An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
EMBL FORMAT
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further
annotation lines. The start of the sequence is marked by a line starting with "SQ"
and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID
XX
AC
XX
DE
DE
XX
SQ
//
AA03518
standard; DNA; FUN; 237 BP.
U03518;
Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg
catccgtgtc
ggcgcctctg
gcgtgcagtc
ttccggc
60
120
180
237
GENBANK FORMAT
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word LOCUS
and a number of annotation lines. The start of the sequence is marked by a line
containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:
LOCUS
DEFINITION
AAU03518
237 bp
DNA
PLN
04-FEB-1995
Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
U03518
41 a
77 c
67 g
52 t
ACCESSION
BASE COUNT
ORIGIN
1 aacctgcgga
61 tattgtaccc
121 ccccccgggc
181 tgagttgatt
//
aggatcatta
tgttgcttcg
ccgtgcccgc
gaatgcaatc
ccgagtgcgg
gcgggcccgc
cggagacccc
agttaaaact
gtcctttggg
cgcttgtcgg
aacacgaaca
ttcaacaatg
cccaacctcc
ccgccggggg
ctgtctgaaa
gatctcttgg
catccgtgtc
ggcgcctctg
gcgtgcagtc
ttccggc