DNA Sequence Analysis

Transcript DNA Sequence Analysis

DNA Sequence Analysis
 5.1 Introduction
1. Terms in common use are defined, and the genetic
code is reviewed.
2. EST-Expressed Sequence Tag as a unit of sequence
data, derived from rapid sequencing of cDNA
libraries.
3. Three examples of producers of EST databases are
profiled.
5.2 Why analysis DNA?

The most sensitive comparisons between
sequences are made at the protein level; detection
of distantly related sequences is easier in protein
translation, because the redundancy of the
genetic code of 64 codons is reduced to 20
distinct amino acids. However, the loss of
degeneracy at this level is accompanied by a loss
of information about evolutionary process,
because proteins are a functional abstraction of
genetic events in DNA.
Table 5.1 The Genetic Code
Box 5.1 Family Analysis at DNA Level
5.3 Gene structure and
DNA sequences
1. DNA sequence databases contain genomic
sequence data,which includes information at the
level of the untranslated sequence, introns and
exons, mRNA, cDNA , and translations.
2. Untranslated regions(UTRs): occur in both DNA
and RNA; they are portions of the sequence
flanking the CDS that are not translated into
protein.It is highly specific at the 3’ end both to
the gene and the species from which the sequence
is derived.
Box 5.2 The Central Dogma
3. Six-Frame Translation: There are three
forward frames, which are achieved by
beginning to translate at the first,second and
third bases respectively; the three reverse
frames are determined by reversing the DNA
sequence and again beginning on the first,
second and third bases. Thus, for any piece
of DNA, the result of a six-frame translation
is six potential protein sequences.
Fig. 5.1 Six-Frame Translation
5.4 Features of DNA sequence
analysis
1.
Detecting open reading frames (ORF) :
Initial codon: ATG
Stop codon: TGA, TAA, TAG
2. Several features may be used as indicators of potential
protein coding regions in DNA:
a. Sufficient ORF length
b. Recognition of flanking Kozak sequence
c. Patterns of codon usage
d. A general preference for G/C over A/T in the third base
(wobble) position of a codon
e. Ribosome binding sites
f. Alignment with a homologous protein sequences
Table 5.2 Percentage use of codons for
serine in a variety of model organisms
3. DNA sequence assembly: The rapid
accumulation of DNA sequence data has
been expedited by the introduction of
fluorescent sequencing technology.The
output consists of a series of color-coded
peaks, beneath which is a string of base
symbols-the particular base shown is
determined by the highest peak at that
position of the trace.
Box 5.3 Fluorescent sequence
chromatogram interpretation
5.5 Issues in the interpretation of
EST searches
1. A large part of currently available DNA data is made up of
partial sequence, the majority of which are Expressed
Sequence Tags (ESTs).
2. In analyzing ESTs the following points should be borne in
mide:
a. The EST alphabet is five characters:ACGTN.
b. There may be phantom INDELs resulting in translation
frameshifts.
c. The EST will often be a sub-sequence of any other
sequence in the databases.
d. The EST may not represent part of the CDS of any gene.
3. The EST alphabet
4. The existence of splice variants has particular
consequences for database searches with EST
queries.
5.6 Two approaches to gene
hunting
 Position cloning: The chromosome linked to
the disease in question is established by
analyzing a population of subjects. Once a link
to a chromosomal region has been established,
a large part of the chromosome in the vicinity
of this region(locus) is sequenced, yielding
several megabases of DNA. Such a locus can
contain many individual genes, only one of
which is likely to be involved in diseases.
Ultimately, several genes will need to be
expressed, and further experimentation will be
required to confirm which gene is actually
involved in the disease. Although genes
discovered in this way can be illuminating from
an academic point of view, they do not
necessarily represent good drug targets.The
whole process is lengthy, time-consuming and
labor intensive.
 RNA transcript analysis: This approach
requiring much less sequencing effort and
relying more heavily on the powerful search
capabilities of current computer systems,
examines the genes that are actually expressed
in healthy and diseased tissue.This process
analyses the mRNA and allows a comparison to
be performed between the two states, and a
process of reasoning applied to arrive at a
potential drug target in a more direct way.

1.
2.
3.
The hierarchy of genomic information: The human
genome is complex, containing of about 3 billion basepairs of DNA. Yet only 3% of the DNA is coding
sequence. Thus, in simple terms, we have three levels of
genomic information:
The chromosomal genome-the genetic information
common to every cell in the organism.
The expressed genome-the part of genome that is
expressed in a cell at a specific stage in its development.
The proteome-the protein molecules that interact to give
the cell its individual character.
5.7 cDNA libraries and ESTs
 Obtained a sample of cells
RNA extraction
Reversed transcribed to cDNA
cDNA library
Sequence
1. The sequences that emerge successfully from this
process are called ESTs.
2. Good libraries contain at least 1 million clones, and
the actual number of distinct genes expressed in a
cell may be a few thousand; the number varies
according to cell type.
5.8 Different approaches to EST
analysis

There are three major sources of EST information.
Much of the publicly available data are collected
together into the EST sections of the EMBL Data
Library and GenBank (dbEST).
1. Merck/IMAGE: In 1994, Merck&Co. funded a
research project to sequence 300,000 ESTs from a
variety of normalised libraries. As of May 1997,
484421 ESTs had been submitted by the project to
dbEST.(Table 5.4)
2. Incyte: Incyte Pharmaceuticals Inc. produces a
database, LifeSeq, emphasizing the quantitative
information derived by sequencing standard cDNA
libraries. The goal is to provide information on
transcribed genes in health and diseased tissues, to
facilitate the elucidation of potential therapeutic
targets.In April 1998, the size of LifeSeq was 2.5
million ESTs, representing 80,000-120,000
different genes.
3. TIGR: The Institute for Genomic Research is a
research organization with interests in structural,
functional and comparative analysis of genomes
and gene products.
TIGR Human Gene Index(HGI)
5.9 EST analysis tools
 There are three publicly avaiable tools for the
analysis of ESTs:
1. Sequence similarity search tools- The BLAST
series of programs has variants that will
translate DNA databasees(TBLASTN),
translate the input sequence(BLASTX), or
both(TBLASTX).FastA provides a similar
suite of options.
2. Sequence assembly tools-When a search of the
databases reveals several ESTs matching with
a probe sequence, the ESTs must be aligned with
each other to reveal the consensus sequence.
3. Sequence clustering tools- Programs that take a
large set of sequences and divide them into
subsets, or clusters, based on the extent of
shared sequence identity in a minimum overlap
region. A reliable mechanism for clustering
ESTs will reduce redundancy in the dataset,
and save search time.
Clustering an EST library
5.10 A practical example of
EST analysis

DNA Sequence Analysis

Transcript DNA Sequence Analysis

Directory