Sequencing a genome and Basic Sequence Alignment

Lecture 8
Global Sequence
1
Introduction
• Determining DNA sequences
• Sequencing whole genomes: the shotgun approach
• Sequence alignment (sequence matching)
Annotation of sequences
• As discussed before, once a gene's sequence (DNA and/or mRNA) has been determined, the data must be annotated (Klug 2010):
– which sequences correspond to UTRs, exons/introns, coding sequences (CDS) and the polyA signal
– other sequences of interest, including promoter sites and other regulatory regions (enhancers…)
• Annotation also contains important supplementary material: other organisms that have the same gene, the corresponding protein sequence, and journal articles related to the sequences.
How are sequences of genes and genomes obtained?
• Recombinant DNA technology is essential for producing the DNA that is then sequenced to determine [chapter 17 (Klug 2010)]:
– sequences in genes
– regulatory sequences
– large DNA strands
• Some of the important terms in this field include:
– Cloning DNA: making copies of DNA.
– Restriction enzymes: cut DNA at specific recognition sites, which vary in length from 4 bp to 8 bp or longer. In a random sequence a 4 bp site occurs on average once every 4^4 = 256 bp, and an 8 bp site once every 4^8 = 65,536 bp, so these are the average fragment sizes. Example: the EcoRI site is GAATTC.
– Restriction maps: maps of restriction enzyme sites (refer to Figure 17.5, Klug).
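The fragment-size arithmetic and the idea of cutting at a recognition site can be sketched in a few lines of Python. This is only an illustration: the function names `expected_fragment_length` and `digest` are hypothetical, the 4^n estimate assumes all four bases are equally likely, and the cut position (after the first G of GAATTC) is the usual EcoRI cut on one strand, with the second strand ignored.

```python
def expected_fragment_length(site_length):
    """Average spacing between sites of this length in random DNA (4 equally likely bases)."""
    return 4 ** site_length


def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a single strand at every occurrence of `site`.

    cut_offset is how many bases into the site the cut falls
    (1 means G^AATTC, assumed here for EcoRI).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        fragments.append(sequence[start:pos + cut_offset])
        start = pos + cut_offset
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # whatever is left after the last cut
    return fragments


print(expected_fragment_length(4))   # 4 bp site
print(expected_fragment_length(8))   # 8 bp site
print(digest("AAGAATTCTTGAATTCGG"))
```

Joining the fragments returned by `digest` back together gives the original strand, which is a quick way to check that no bases were lost at the cut points.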
DNA recombinant technology
– Plasmid vectors: carry the DNA fragment that needs to be cloned into a host cell. Inside the host cell both the vector and the inserted DNA fragment are cloned (copied). In the example, a DNA fragment is inserted into the plasmid; the plasmid is then introduced into host cells, where it produces many copies of itself.
– The lacZ gene is used as a marker. If the marker is disrupted, the host cell contains a plasmid carrying an inserted fragment (a recombinant plasmid).
Sequencing DNA strands
• dATP is the normal adenine nucleotide (deoxyadenosine triphosphate).
• ddATP is a modified adenine nucleotide (a dideoxynucleotide) with a coloured fluorescent marker attached. It has the added property of terminating elongation when it is incorporated instead of dATP.
• During the process, chains of all possible lengths are produced.
• The chains are separated by length and analysed to give the complementary sequence of the template strand. [Note the sequences in part 1 and part 4.]
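The chain-termination idea above can be mimicked with a small, idealised Python sketch. It assumes the template is written 3'→5' so that the new strand can be read left to right, that every possible terminated length is present exactly once, and that the last base of each fragment is the labelled one; the names `sanger_fragments` and `read_sequence` are made up for this illustration.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_fragments(template):
    """All chain-terminated products: every prefix of the newly synthesised strand.

    Each prefix ends in the labelled dideoxynucleotide that stopped elongation.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    return [new_strand[:k] for k in range(1, len(new_strand) + 1)]


def read_sequence(fragments):
    """Sort fragments by length (the separation step) and read the terminal labels."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


fragments = sanger_fragments("TACGGA")
random.shuffle(fragments)          # the reaction mixture has no particular order
print(read_sequence(fragments))    # complement of the template
```

The read-out is the complement of the template, which is why the sequence obtained has to be complemented again to recover the template strand itself.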
Expressed sequence tags
• Refer to Box 9.1 in Understanding Bioinformatics
GENOMES: Sequencing and assembling
• Plasmids and other recombinant DNA technologies only produce relatively small DNA segments.
• To sequence an organism's entire genome, the “shotgun” approach must be used.
• The shotgun approach requires two genetic technologies and one computational technique:
– Restriction enzymes: cut the genomic DNA into fragments
– Fast DNA sequencing of the fragments
– Combining overlapping contiguous DNA sequences (contigs)
Overlapping Contiguous Fragments
Adapted from [1] p. 377
Overlapping Fragments: example
• Original sentence: This is DT228 bioinformatics course.
• Cut two copies of the sentence into fragments:
– This is
– course
– DT228 bioinformatics
– This is DT228
– bioinformatics course
Overlapping Fragments: example
• Check for overlaps (prefix and suffix):
– This is
– This is DT228
– DT228 bioinformatics
– bioinformatics course
– course
• Result of aligning the fragments:
– This is DT228 bioinformatics course
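The prefix/suffix overlap check can be written as a small greedy assembler. This is a sketch of one simple strategy (repeatedly merge the pair with the longest suffix-prefix overlap), not the method used by real assemblers, which must also cope with sequencing errors and repeats. The names `overlap` and `assemble` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedy assembly: always merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # A fragment wholly contained in another adds no new information.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: several separate contigs remain
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


pieces = ["This is", "course", "DT228 bioinformatics",
          "This is DT228", "bioinformatics course"]
print(assemble(pieces))
```

If the fragments do not all overlap, the function returns more than one string, which corresponds to ending up with several contigs instead of one complete sequence.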
Overlapping Fragments: example
• Reconstruct the sentence from the following fragments, which were obtained by randomly fragmenting two copies of the same sentence:
– molto questa lingua.
– mondo. ho dodici anni e
– sono nato nel posto
– inglese e mi piace
– migliore nel
– parlo io un pochino
– nel mondo. Ho dodici anni e parlo
– piace molto questa lingua.
– nel posto migliore
– un pochino inglese e mi
– sono nato
• Solution will be discussed in class
Example of Contigs alignment:
The diagram shows a DNA example of how overlapping contiguous sequences are aligned. It is an oversimplification, however: actual segments are many times larger than shown, and overlaps do not always occur at the ends of segments. Adapted from: Klug 7th, p. 378
Sequence Alignment (pairwise): a simple global match
• Sequence alignment is the assignment of residue-to-residue correspondences between sequences.
• A global match aligns all of one sequence with all of another.
• The figure shows two nucleic acid sequences. Where both strands have the same base at a position there is a match, represented by a vertical line and a blue highlight.
• Positions with no vertical line and no blue highlight are unmatched pairs and correspond to substitutions.
• In DNA, transitions (A ↔ G and T ↔ C) are more common than transversions.
• The figure, adapted from Klug, compares the leptin gene from a dog (top) and a human, Homo sapiens (bottom).
• Global alignment matching is important in comparative genomics, homologous gene analysis and the development of evolutionary trees.
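A position-by-position comparison like the one in the figure can be sketched as follows for two sequences of equal length. The short sequences are invented for illustration (they are not the leptin sequences), and the names `classify` and `compare` are hypothetical. A transition swaps a purine for a purine or a pyrimidine for a pyrimidine; anything else is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label one aligned pair of bases."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def compare(top, bottom):
    """Return the marker line ('|' for a match) and counts for each kind of pair."""
    if len(top) != len(bottom):
        raise ValueError("this simple comparison needs sequences of equal length")
    counts = {"match": 0, "transition": 0, "transversion": 0}
    markers = []
    for x, y in zip(top, bottom):
        kind = classify(x, y)
        counts[kind] += 1
        markers.append("|" if kind == "match" else " ")
    return "".join(markers), counts


top, bottom = "GATTACA", "GACTATA"
markers, counts = compare(top, bottom)
print(top)
print(markers)
print(bottom)
print(counts)
```

Printing the marker line between the two sequences reproduces the vertical-line layout described above.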
Global alignment: different size sequences
• A global alignment between sequences of different sizes requires the inclusion of gaps [dashes] in order to optimise the matching.
• Example 1, with no gaps, produces a much lower number of matches than Example 2, which includes dashes.
• The assumption is that both strands are homologous [have a common ancestor; were once the same sequence] but now differ through a series of substitutions [mismatches] and deletions/insertions [gaps].

Example 1
I am from Cork
I am not from Cork
*****      *
(6 matches out of 18, counting spaces as characters; based on the length of the bottom string)

Example 2
I am---- from Cork
I am not from Cork
****    **********
(14 matches out of 18; based on the length of the bottom string)
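One standard way to place the gaps automatically is dynamic programming (the Needleman-Wunsch algorithm). The slides do not name an algorithm, so the sketch below is a swapped-in illustration, and the scoring values (+1 match, -1 mismatch, -1 per gap character) and the names `count_matches` and `global_align` are assumptions chosen for this example.

```python
def count_matches(top, bottom):
    """Count identical characters position by position (gap characters never match)."""
    return sum(1 for x, y in zip(top, bottom) if x == y and x != "-")


def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


short, long_ = "I am from Cork", "I am not from Cork"
print(count_matches(short, long_))        # no gaps allowed
aligned_short, aligned_long, s = global_align(short, long_)
print(aligned_short)
print(aligned_long)
print(count_matches(aligned_short, aligned_long), s)
```

Counting every character, spaces included, the ungapped comparison finds 6 matching positions and the gapped alignment finds 14. Several gap placements give the same score, so the exact position of the dashes in the output may differ from a hand-drawn alignment.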
Example of alignment Nucleic acids
Adapted from Klug p. 384
Sequence alignment: Amino Acids
• Key: “*” match; “-” gap; “:” conserved substitution; “.” semi-conserved substitution.
• In DNA, the sequence itself matters most for function; in proteins, the final 3-D structure is most significant. That structure depends on the sequence and also on the properties of the amino acids, which play a significant part in the final configuration (refer to lecture 3, slide 5).
• Amino acids with similar properties/structure have similar effects on the final 3-D structure of the protein.
• The types of substitution must therefore be extended to take this into account, which gives conserved and semi-conserved substitutions.
Sequence Alignment: pairwise: a local match
• A local match finds a region in one sequence that matches a region of another; overhangs at the ends are not treated as gaps.
• A local match is generally used if there is a large difference in size between the sequences.
• In the example:
– the global score is 9 out of 13
– the local score is 8 out of 10 (no overhangs…)
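A common way to compute a local match is the Smith-Waterman algorithm, which is the global dynamic-programming method with scores never allowed to drop below zero, so unmatched overhangs cost nothing. The slides do not name this algorithm; the sketch below, its scoring values (+2 match, -1 mismatch, -2 gap), the made-up test sequences and the name `local_align` are all assumptions for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment; returns (region_of_a, region_of_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment start afresh anywhere: overhangs are free.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


print(local_align("CCCCGATTACACCCC", "TTGATTACATT"))
```

Only the shared region is reported; the flanking C and T runs are ignored, which is the behaviour described above for overhangs.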
Sequence Alignment: pairwise: a motif match
Motif (small region) match
• A motif match can find a “perfect” match between a small sequence and one or more regions in a larger sequence.
• This plays an important part in looking for repeating sequences [tandem repeats] and other “relatively small” regions that may be conserved between organisms.
• Like the other types of match, a motif match does not have to be “perfect”: it can include deletions/insertions.
Example
• You are not from Cork
• You are not normal
*** ***
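For the “perfect match” case, finding every place a short motif occurs in a longer sequence needs only a simple scan. The sketch below handles exact matches only (no insertions or deletions), allows overlapping occurrences, and uses the hypothetical name `find_motif`; the DNA string is invented for illustration.

```python
def find_motif(sequence, motif):
    """Return the 0-based start positions of every exact occurrence of motif."""
    if not motif:
        raise ValueError("motif must not be empty")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Move on by one character so overlapping occurrences are also found.
        start = sequence.find(motif, start + 1)
    return positions


print(find_motif("You are not from Cork", "You are"))
print(find_motif("CACACAGT", "CA"))   # a short tandem repeat of CA
```

Start positions that are exactly one motif length apart, as in the second call, indicate a tandem repeat.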
Multiple alignment: many sequences
• Similar to the previous cases, except that you look for areas conserved between all the sequences in the alignment:
– My name is denis and I am from cork
– My name is kieran and I am not from cork
– We name the dog “canis familiaris”
– Conserved in all three: name
• Programs like ClustalW are used to align multiple sequences. The alignment can be used to check for conserved motifs/sequences in many species, which helps to determine phylogenetic relationships and protein functionality.
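The sentence example can be reproduced with a toy word-level check for what is conserved across all inputs. This is not how ClustalW works (it builds a real multiple alignment of residues and respects their order); the sketch only intersects word sets, and the name `conserved_words` is hypothetical.

```python
def conserved_words(sentences):
    """Words that appear in every sentence (word order is ignored)."""
    word_sets = [set(sentence.split()) for sentence in sentences]
    return set.intersection(*word_sets)


sentences = [
    "My name is denis and I am from cork",
    "My name is kieran and I am not from cork",
    'We name the dog "canis familiaris"',
]
print(conserved_words(sentences))
```

Dropping the third sentence leaves many more shared words, which illustrates how adding more distantly related sequences narrows the conserved region.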
Exam Questions
• Explain, using suitable examples, the “shotgun” genomic alignment approach and why it has become the dominant method for analysing genomes.
• DNA sequence alignment matching can take a number of forms; describe the different types of matching.
• Explain how the different types of point mutation are incorporated into the sequence alignment matching process.
• Discuss why the inclusion of “gaps” in an alignment increases the degree of matching, and explain what these gaps mean and what they imply for the aligned sequences.