DNA Sequences Analysis

Hasan Alshahrani CS6800
• Statistical background: HMMs.
• What is a DNA sequence?
• How to get a DNA sequence.
• DNA sequence formats.
• Analysis methods and tools.
• What is next?
HMMs
A hidden Markov model (HMM) is a very useful statistical model in
molecular biology, although it was originally intended for speech
recognition.
An HMM can be used as a statistical profile for a family of related
sequences (proteins or DNA) and hence used to search a database for
other family members or similar sequences.
Q1: How can HMMs be used in DNA analysis?
To calculate the probability of the sequence ACTTCG, we multiply the
probabilities, where each factor after the first is the conditional
probability that a certain nucleotide appears at a position, given the
nucleotide at the previous position:
P(ACTTCG…) = P1(A) * P2(C|A) * P3(T|C) * P4(T|T) * P5(C|T) * P6(G|C) * …
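The product above can be sketched in a few lines of Python. This is a minimal illustration assuming a first-order Markov chain; the initial and transition probabilities are hypothetical placeholders (all set to 0.25), not values estimated from real data, and the name sequence_probability is made up for this example.

```python
NUCLEOTIDES = "ACGT"

# Hypothetical parameters: every start and every transition is equally likely.
initial = {n: 0.25 for n in NUCLEOTIDES}
transition = {prev: {nxt: 0.25 for nxt in NUCLEOTIDES} for prev in NUCLEOTIDES}

def sequence_probability(seq, initial, transition):
    """P(seq) = P(x1) * product of P(x_i | x_{i-1}) for i = 2..n."""
    if not seq:
        return 1.0
    p = initial[seq[0]]
    # Walk over consecutive pairs (previous nucleotide, current nucleotide).
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

p_uniform = sequence_probability("ACTTCG", initial, transition)
print(p_uniform)  # with uniform parameters this is 0.25 ** 6
```

With probabilities estimated from a real sequence family, the same function would score how well a new sequence fits that family.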
More formally, the states of an HMM cannot be observed directly, but we
can infer the hidden state q_t from the random observation Y_t.
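One common way to infer the hidden states from the observations is the Viterbi algorithm, which the slides do not name. The sketch below assumes a toy two-state model ("AT_rich" and "GC_rich" regions). Every probability in it is hypothetical and chosen only for illustration.

```python
import math

# Hypothetical two-state model: all numbers below are illustrative only.
states = ("AT_rich", "GC_rich")
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
emit_p = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

path = viterbi("ATATGCGC", states, start_p, trans_p, emit_p)
print(path)
```

Log-probabilities are used so that long sequences do not underflow to zero.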
What is a DNA sequence?
• DNA consists of two long interwoven strands that form the famous “double helix”. Each
strand is built from a small set of molecules called nucleotides.
• The length of double-stranded DNA is often expressed in units of base pairs (bp),
kilobase pairs (kb), or megabase pairs (Mb), so that, for example, a size of 5 × 10^6 bp
can be expressed equivalently as 5,000 kb or 5 Mb.
• The haploid human genome (23 chromosomes) consists of approximately 3 × 10^9 bp of DNA,
so the 46 chromosomes in one human cell collectively hold about 6 × 10^9 bp.
How to get a DNA sequence
• By using chemical methods that determine the order of the nucleotide bases
(adenine, guanine, cytosine, and thymine) in a molecule of DNA.
• Sequencing is used in many fields and applications, such as forensics and the study of biological systems.
• Why don’t we use the powerful text-searching algorithms and tools to search
DNA databases?
DNA can be sequenced by a chemical procedure that breaks a terminally labelled DNA
molecule partially at each repetition of a base.
DNA sequencing can be done by different methods:
1. Maxam-Gilbert sequencing
2. Chain-termination methods
3. Dye-terminator sequencing
4. Automation and sample preparation
5. Large-scale sequencing strategies
Q2: Name four DNA sequencing methods.
Example: a chain-termination method
A DNA sequencing printout. The sequence is represented by a series of peaks,
one for each nucleotide position. In this example, a red peak is an A, blue is a C,
orange is a G, and green is a T.
DNA Sequence formats:
• Plain sequence format
• EMBL format
• FASTA format
• GCG format
• GCG-RSF (rich sequence format)
• GenBank format
• IG format
FASTA format:
• FASTA is a standard text-based format in bioinformatics for representing either
nucleotide sequences or peptide sequences.
• The format uses single-letter codes and allows sequence names and comments.
• A FASTA record consists of a single-line description at the beginning (starting with ">"),
followed by sequence data on one or more lines.
• Each line of sequence data should not exceed 80 characters.
• Sequence identifiers commonly follow conventions defined by NCBI.
Q3: What is the FASTA format?
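The record structure described above (a ">" description line followed by sequence lines) can be read with a small parser. This is a minimal sketch: the function name parse_fasta and the two example records are made up for illustration, and real files can have complications that it does not handle.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {description: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records

# Hypothetical example records, not real database entries.
example = """>seq1 hypothetical example record
GATTACAGATTACA
GATTACA
>seq2 another made-up record
ACGTACGT
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```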
NCBI database:
• The National Center for Biotechnology Information (NCBI, www.ncbi.nlm.nih.gov) in the US
maintains a huge collection of DNA and protein sequences.
• Each sequence in NCBI is stored in a separate record with a unique identifier called
an accession number.
• Example: by visiting the NCBI website and searching for the accession NC_001477, we can
retrieve the DNA sequence of the Dengue virus, which causes Dengue fever.
NCBI (cont.)
The database query can be done either directly from the website or by using
the R functions choosebank() and query() from the SeqinR package.
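A third route, not mentioned in the slides, is NCBI's E-utilities web service. The Python sketch below is a stand-in for the R functions. It builds an efetch request for the accession NC_001477 and assumes the nuccore database with FASTA output. The helper names build_efetch_url and fetch_fasta are hypothetical, and the download itself needs network access, so it is left commented out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(accession):
    """Build an E-utilities efetch URL that requests a nucleotide record as FASTA."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)

def fetch_fasta(accession, timeout=30):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(build_efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")

url = build_efetch_url("NC_001477")
print(url)
# fasta_text = fetch_fasta("NC_001477")  # uncomment to download the record
```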
Analysis methods
The analysis methods fall into five main groups:
• Knowledge-based single sequence analysis.
• Pairwise sequence comparison.
• Multiple sequence alignment.
• Sequence motif discovery in multiple alignments.
• Phylogenetic inference.
Q4: What are the main methods of DNA sequence analysis?
Analysis methods: alignment
• Alignment: comparing a sequence with sequences that have already been reported and
stored in a database.
• Alignment can be global or local.
• Local alignments reveal regions that are highly similar, but do not necessarily provide a
comparison across the entire length of the two sequences.
• Global alignments compare one whole sequence with another entire sequence.
Alignment Examples:
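As a hedged illustration of local alignment, the sketch below computes a Smith-Waterman score. It assumes a simple scoring scheme of match +2, mismatch -1 and a linear gap penalty of -2. Both the scheme and the two test sequences are made up for this example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets a local alignment restart anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# The shared region GATTACA (7 matches) dominates; the flanks do not align.
score = local_alignment_score("TTTGATTACATTT", "CCCGATTACACCC")
print(score)  # 14
```

A global alignment differs in two places: cell scores are not floored at 0, and the result is read from the last cell instead of the maximum.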
Alignment tools: BLAST
• The most common local alignment tool is BLAST (Basic Local Alignment
Search Tool), developed by Altschul et al. (1990, J Mol Biol 215:403).
“BLAST is a set of algorithms that attempt to find a short fragment of a query
sequence that aligns perfectly with a fragment of a subject sequence found in a
database.”
• The initial alignment must score above a neighborhood score threshold (T);
the fragment is then used as a seed to extend the alignment in both directions.
In practice, this means the BLAST algorithm first breaks the query into short
words of a specific length and looks for those words in the database.
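The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not the real BLAST implementation. It assumes exact word matches as seeds and extends only while the bases keep matching, whereas BLAST scores the extension and uses thresholds. The function names, the word size of 4 and the sequences are all hypothetical.

```python
def find_seeds(query, subject, word_size=4):
    """Return (query_pos, subject_pos) for every exact word shared by both."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds

def extend_seed(query, subject, i, j, word_size=4):
    """Extend a seed left and right while the bases are identical.

    Returns (query_start, subject_start, length) of the extended match.
    """
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q

query = "ACGTTGCA"
subject = "TTACGTTGCATT"
seeds = find_seeds(query, subject)
# Several overlapping seeds collapse into the same extended hit.
hits = {extend_seed(query, subject, i, j) for i, j in seeds}
print(seeds)
print(hits)
```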
Joshua Naranjo
Q5: What is the BLAST algorithm? State its steps.
Can R help?
• Yes.
• It has many useful packages for processing DNA sequences.
• It can be used to access BLAST as well.
Examples:
DNA sequence composition
1. GC fraction:
GC content, the percentage of Gs and Cs in a sequence, is one of the fundamental properties of a
genome sequence. It can be computed in two ways:
• The lengthy way is to count the Gs and Cs directly and divide by the total length of the
sequence.
• The other way is to use the function GC() from the R package SeqinR.
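The "lengthy" counting approach is short enough to sketch directly. The Python function below is an illustrative stand-in, not the SeqinR GC() function, and the name gc_fraction is made up for this example.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (multiply by 100 for a percentage)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

gc = gc_fraction("GAATTC")  # 2 of the 6 bases are G or C
print(round(gc, 3))
```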
2. DNA words:
This is the same idea as counting the frequency of single nucleotides such as A or G,
but with longer words like “AA” or “CA”. Words can be 2 nucleotides long, such as
“GC”, 3 nucleotides long, like “AAA”, or 4 nucleotides long, and so on.
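Word counting can be sketched with a sliding window. This Python example is a stand-in for the R tooling the slides rely on. It assumes overlapping windows that advance one base at a time, and the input sequence is made up.

```python
from collections import Counter

def count_words(seq, word_size=3):
    """Count overlapping DNA words of length word_size in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))

counts = count_words("AAAAGCAAA", 3)
for word, n in sorted(counts.items()):
    print(word, n)
```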
3. Finding the score of the optimal global alignment between the
sequences ‘GAATTC’ and ‘GATTA’.
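A global alignment score for these two sequences can be computed with the Needleman-Wunsch recurrence. The sketch below assumes match +2, mismatch -1 and a linear gap penalty of -2. These parameters are chosen here only for illustration, so the score will differ from what an R call with other settings reports.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best end-to-end alignment of a and b."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]

score = global_alignment_score("GAATTC", "GATTA")
print(score)  # 5 under the assumed scoring scheme
```

One alignment that achieves this score is GAATTC over GA-TTA: four matches, one gap and one mismatch.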
4. Comparing two sequences using a dot plot, for example with the SeqinR function dotPlot().
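The idea behind a dot plot can be shown as text: put one sequence on each axis and mark every cell where the two bases are identical. This minimal Python sketch is not the SeqinR function, and the name dot_matrix is made up. It uses a window of a single base, whereas real tools usually apply a larger window to reduce noise.

```python
def dot_matrix(a, b):
    """One row per base of a; '*' marks positions where it equals the base of b."""
    return ["".join("*" if x == y else "." for y in b) for x in a]

seq1, seq2 = "GAATTC", "GATTA"
rows = dot_matrix(seq1, seq2)
print("  " + seq2)
for base, row in zip(seq1, rows):
    print(base, row)
```

Diagonal runs of marks indicate regions where the two sequences agree.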
Is it that easy? No.
• It is not simply a matter of giving the sequences to R and getting the results.
• Alignment is an art that requires a degree of skill.
• It means fitting the sequences being compared to a form that reflects some
shared quality. For example:
- How they look structurally,
- How they evolved from a common ancestor, or
- Optimization of a mathematical construct.
What is next?
Are we monkeys?