Sequencing - Amirkabir University of Technology

Sequence Analysis
Sequence Analysis - Topics
• Comparison of gene sequences for similarities and
defining homologies from phylogenetic analysis
• Identification of gene structure, including reading
frames, exon-intron distribution and regulatory
elements
• Prediction of protein structural elements
• Genome mapping (linear arrangement of genes on
chromosomes) and its assessment within the
context of metabolic pathways
Sequence Analysis
• Helps understand evolution of life
• Reveals relationships between the DNA and protein
sequences of different organisms
• Facilitates the collection, storage, organization and
annotation of raw data and construction of
secondary and tertiary databases
• Necessary to achieve the goal of bioinformatics
Goal of Bioinformatics
• Organization of sequence databases with
bibliographic and biological annotations
• Support via software for the alignment of
sequences
• Identification of genes
• Translation of DNA sequences into amino acid
sequences
• Search for homologs (evolutionarily related
sequences)
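The translation step listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the standard genetic code, reads only the first forward frame, and the function name `translate` is a hypothetical choice made for this example.

```python
# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        # "X" marks a codon that contains an ambiguous base such as N.
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

Building the table from a 64-character string keeps the example short; a real pipeline would also handle the reverse strand and alternative genetic codes.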
History of Seq Analysis
• Fifteen years ago – people read DNA or
amino acid sequences to each other over the telephone
• This introduced errors at an estimated “mutation rate”
that far exceeded that of natural DNA replication or
transcription
Computational tools for
Sequence Analysis
• Extremely easy
• Fast
• Virtually error free
Database Submissions
• Information is submitted to NCBI, EBI,
DDBJ
• GenBank staff scientists assign accession
numbers for immediate release to public
• Daily exchanges between GenBank, EBI
and DDBJ ensure information is non-redundant
(each sequence needs to be submitted only once)
Database submissions
• Authors can update original information
• Specialized submission procedures exist for
ESTs (expressed sequence tags),
STSs (sequence-tagged sites) and
GSSs (genome survey sequences)
EST (Expressed sequence tags)
• ESTs are short sequences of 300-500 bp and
represent actually expressed genes.
• These are markers that are helpful in
locating (mapping) genes on chromosomes
• EST submissions therefore include both
sequence and mapping information
STS (Sequence-tagged sites)
• Provide unique identifiers within a given
genome, identifiable by PCR
• Similar to ESTs in length and in the number of
sequences submitted per batch
• STS sequences will soon outnumber ESTs
because they also cover the non-coding regions
of the genomes
Processing of submissions
• Submissions are processed on a daily
basis, and sequences can be submitted before
they are finished
• The processing at NCBI includes 3 phases:
1) unfinished, unordered; 2) unfinished,
ordered; 3) high-quality finished sequences
with no gaps
Annotation
• Annotation of sequences is important – helps in
predicting structures, drug discovery , establishing
phylogenetic relationships etc
• Erroneous annotation results in erroneous
interpretation and conclusions and reduces the
reliability of the data
• NCBI’s staff continuously screen biomedical
journals for published sequence and structure data
and use it for annotation purposes
Data Retrieval
• The volume of DNA and protein sequence data is
enormous – searching it is dubbed “biological
data mining”.
• Sequences are retrieved based on specific
criteria (similarity or identity between
sequences)
Search Engines
• Perform simple string searches for
information retrieval of stored data
(GenBank nucleotides and proteins;
PubMed’s MEDLINE; 3-D structures,
genomes and taxonomy databases)
• Perform similarity searches (e.g., BLAST)
to retrieve, align and compare sequences or
structures
Steps in Retrieval
• First step includes retrieving sequences
based on specific criteria (similarity or
identity between sequences)
• If no sequence is known or available,
NCBI’s databases can be searched at the
nucleotide or protein level by typing in a
keyword – the name of the protein, the author
or the accession number
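A keyword search of this kind can also be issued programmatically. The sketch below only builds the request URL for NCBI's E-utilities `esearch` endpoint and does not contact the server. The helper name `build_search_url` and the example search terms are hypothetical; the `[Author]` and `[Accession]` field tags follow Entrez query syntax and should be checked against the current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database: str, term: str, max_hits: int = 20) -> str:
    """Build an esearch URL for a keyword, author or accession query."""
    params = {"db": database, "term": term, "retmax": max_hits}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Search the protein database by protein name.
print(build_search_url("protein", "hemoglobin"))
# Search the nucleotide database by author (hypothetical author name).
print(build_search_url("nucleotide", "Smith J[Author]"))
# Search by accession number (illustrative accession).
print(build_search_url("nucleotide", "U00001[Accession]"))
```

Opening one of the printed URLs returns a list of matching record identifiers, which can then be passed to a separate fetch request to retrieve the sequences themselves.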
Results of Data retrieval
• The level of reported similarity indicates potential
biological relationships across species and
taxonomic divisions
• The significance of a hit is reported as an E-value: the
number of hits with an equal or better score expected
by chance in a database of that size
• E-values of zero or close to zero indicate hits that are
unlikely to be random, while values of one or more
indicate a hit that could easily have arisen by chance
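For reference, BLAST-style searches derive the E-value from the alignment score through the Karlin-Altschul relation. The symbols below are the conventional ones and do not come from the slides: $m$ is the query length, $n$ the total database length, $S$ the raw alignment score, and $K$ and $\lambda$ are parameters that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
\qquad\text{or, with the bit score } S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad
E = m \, n \, 2^{-S'}
```

Because $E$ is an expected count of chance hits rather than a probability, it shrinks toward zero for strong matches and can exceed one for weak matches.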
Sequence Alignment
• Pair-wise comparison of sequences
• First step in assessing the properties of a
newly sequenced gene
• Finding homologs in other organisms
• Identifying new sequences as novel
• BLAST 2 – Compare two sequences
• ClustalW – Multiple sequence alignment
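To make pairwise comparison concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch method). It is not how BLAST or ClustalW work internally; both use faster heuristics and biological scoring matrices. The scoring values (match +1, mismatch -1, gap -2) and the function name `global_align` are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (aligned_a, aligned_b, score) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


aligned_a, aligned_b, total = global_align("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths. That cost is why database searches rely on heuristics such as BLAST.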
Results of Sequence Alignment
• Several sequences can be submitted and
different output settings can be selected
• Identities from pairwise alignments are
shown
• Sequence pairs are also listed in order from
most identical to least identical
• Phylogenetic trees (graphical description)
are also included
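The identity ranking described above can be sketched as follows. The example assumes the sequences are already aligned to equal length with "-" for gaps, and it defines percent identity as matching columns divided by all columns in which at least one sequence has a residue. Alignment programs differ in the denominator they use, so this definition is one convention among several. The sequence names and data are made up for illustration.

```python
from itertools import combinations


def percent_identity(x: str, y: str) -> float:
    """Percent identity between two rows of the same alignment."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both rows have a gap.
    columns = [(p, q) for p, q in zip(x, y) if not (p == "-" and q == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for p, q in columns if p == q)
    return 100.0 * matches / len(columns)


def rank_pairs(aligned: dict) -> list:
    """List every sequence pair from most identical to least identical."""
    pairs = [
        (first, second, percent_identity(aligned[first], aligned[second]))
        for first, second in combinations(aligned, 2)
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical aligned sequences.
alignment = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTACGA",
    "seqC": "AC-TTCGA",
}
for first, second, identity in rank_pairs(alignment):
    print(first, second, identity)
```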
What Sequence Reveals
• The Biological function of a Gene
• Related sequences in database
• Structure prediction / comparison with X-ray structure
• ORF (open reading frame) if function is
unknown
• Domain structure
What Sequence Reveals
• Transmembrane segments
• Signal sequence
• Alternate nomenclature
• Genetic information – regulatory sequences
• Translation
• 2-D gels, pI (charge), molecular weight
• Bibliography
Identification of Gene
• Software identifies ORFs (Open reading frames)
or URFs (unidentified reading frames)
• Searches for long stretches of sequence between a
start and a stop codon
• The length of the ORF is directly related to the size
or molecular weight of the encoded protein
• Similarity between two or more sequences is a
good indicator of the biological function of a
gene
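The ORF search described above can be sketched as a scan for ATG-to-stop stretches in each forward reading frame. This is a simplified illustration: it ignores the reverse strand and introns, the `min_codons` threshold is an arbitrary assumption, and the mass estimate uses the common rule of thumb of roughly 110 Da per amino acid residue. All function and constant names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
AVERAGE_RESIDUE_MASS_DA = 110  # rough rule of thumb, not an exact value


def find_orfs(dna: str, min_codons: int = 30) -> list:
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop; keep only long stretches.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


def estimated_mass_kda(orf_start: int, orf_end: int) -> float:
    """Approximate protein mass from ORF length, excluding the stop codon."""
    residues = (orf_end - orf_start) // 3 - 1
    return residues * AVERAGE_RESIDUE_MASS_DA / 1000


# Toy sequence: ATG, four GCT codons, then a TAA stop, with flanking bases.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
for start, end, frame in find_orfs(toy, min_codons=3):
    print(start, end, frame, estimated_mass_kda(start, end))
```

Real gene finders add codon-usage statistics and splice-site models, since a long ORF alone does not prove that a region is expressed.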
Redundancy
• Scientists work independently – results in
repetitive naming of identical genes and
proteins
• Similar to having one person listed as 3 entries in
a telephone book – under first, middle and last
name
• Redundancy is useful - an unintentional
quality control
Human Genome Project
• The ultimate physical map of the human genome is the
complete DNA sequence, that is, the determination of all base
pairs on each chromosome. The completed map will provide
biologists with a Rosetta stone for studying human biology
and enable medical researchers to begin to unravel the
mechanisms of inherited diseases.
• A major focus of the Human Genome Project is the
development of automated sequencing technology that can
accurately sequence 100,000 or more bases per day at a
cost of less than $0.50 per base. Specific goals include the
development of sequencing and detection schemes that are
faster and more sensitive, accurate, and economical.
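A quick calculation shows what these targets imply for a whole genome. The sketch assumes a genome size of about 3 billion base pairs, a figure that does not appear in the slides, and a single pass with no repeat coverage. The throughput and cost values are the targets stated above.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate human genome size
BASES_PER_DAY = 100_000         # target throughput from the text
COST_PER_BASE_USD = 0.50        # target cost ceiling from the text


def single_pass_estimate(genome_bp, bases_per_day, cost_per_base):
    """Return (days, years, total_cost) for reading every base exactly once."""
    days = genome_bp / bases_per_day
    years = days / 365.25
    total_cost = genome_bp * cost_per_base
    return days, years, total_cost


days, years, cost = single_pass_estimate(
    GENOME_SIZE_BP, BASES_PER_DAY, COST_PER_BASE_USD
)
print(f"{days:,.0f} days (about {years:.0f} years) on one instrument")
print(f"${cost:,.0f} at the target cost per base")
```

Under these assumptions a single instrument would need about 30,000 days, roughly 82 years, and the run would cost about $1.5 billion, so the targets only make sense with many instruments running in parallel.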
Human Genome Project
• Second-generation (interim) sequencing
technologies will enable speed and accuracy to
increase by an order of magnitude (i.e., 10 times
greater) while lowering the cost per base. Some
important disease genes will be sequenced with
such technologies as
– (1) high-voltage capillary and ultrathin electrophoresis
to increase fragment separation rate and
– (2) use of resonance ionization spectroscopy to detect
stable isotope labels.
Human Genome Project
• Third-generation gel-less sequencing technologies,
which aim to increase efficiency by several orders
of magnitude, are expected to be used for
sequencing most of the human genome. These
developing technologies include
– (1) enhanced fluorescence detection of individual
labeled bases in flow cytometry,
– (2) direct reading of the base sequence on a DNA strand
with the use of scanning tunneling or atomic force
microscopies,
– (3) enhanced mass spectrometric analysis of DNA
sequence, and
– (4) sequencing by hybridization to short panels of
nucleotides of known sequence.