Bioinformatics - University of Hawaii


Bioinformatics
ABE 2007
Kent Koster
Group 3
Why bioinformatics?

“Other techniques raise more questions
than they answer. Bioinformatics is what
answers the questions those techniques
generate.”
Outline
Bioinformatics Defined
Evolution of Bioinformatics
Bioinformatics History
Common Uses of Bioinformatics
Procedures and Tools of Bioinformatics
Our Procedure
Our Results
Resources

Bioinformatics Defined



Bioinformatics is a broad term covering the use of
computer algorithms to analyze biological data.
It differs from “computational biology”: while
computational biology is the use of computer
technology to solve a single, hypothesis-based
question, bioinformatics is the omnibus use of
computerized statistical analysis to make
statistical or comparative inferences,
i.e. converting “data” to “information.”
The nebulous genesis of
bioinformatics




1977 – Φ-X174 Phage Genome sequenced
1990 – Paper published in the Journal of
Molecular Biology describes sequence
alignment search algorithm
1990s – Software used to find fragment overlap
for the Human Genome Project
1992 – NCBI takes over GenBank DNA
sequence database in response to the growing
number of gene patents
The nebulous genesis of
bioinformatics



1994 – “Entrez” Global Query Cross-Database
Search System allows users to search GenBank
database
1995 – Dr. Owen White writes software to help
find gene elements (promoters, start and stop
codons, etc.) in the sequenced Haemophilus
influenzae genome
1996 – NCBI-BLAST created to provide powerful
heuristic searches against the GenBank
database
Genomics to Proteomics through
Bioinformatics





Because proteins are ultimately the tool of all* gene
expression, proteomics is, in effect, the “product” science
made possible by bioinformatics
A proteome is the collection of all proteins expressed in a
cell at a given time
Every organism has 1 genome, but many proteomes
In addition to “high throughput” protein analysis,
proteomics is researched through cDNA analysis (RT-PCR)
Proteomics represents a methodical addition of “large
scale biology” to traditional molecular biology, made
possible by bioinformatics
Common Uses of Bioinformatics

Homology and Comparative Modeling


Protein or gene homology refers to nucleotide or
amino acid sequences or domains that are shared
between different genes or proteins, whether
they come from the same or different
organisms
Gene or Protein Identification

Searching databases for nucleotide or amino
acid sequences that match sequences in
unknown samples
So, how do ya do it?
DNA Sequencing
Sequence Formats
Sequence Homology Software Tools
Aligning Tools
Annotated Information
Protein Folding

DNA Sequencing

Sanger Method

New nucleotide chains of DNA being
replicated by DNA Polymerase are stopped
when di-deoxy nucleotides (added to the
reaction mixture in ~1/100 ratio) are
incorporated into the chain
DNA Sequencing
Fluorescent dyes are bound to the
ddNTPs, allowing the molecule to be detected
when it is excited by a laser
Terminated DNA chains are run on a gel,
and fragments are resolved by size
By combining the fluorescence readings
from each size of nucleotide chain, the DNA
sequence is computed
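The read-out logic described above can be sketched in a few lines of Python. This is a toy model only: it assumes exactly one terminated fragment per position and a perfect dye signal, and the function name `sanger_read` is made up for illustration.

```python
import random

def sanger_read(new_strand, seed=0):
    """Toy model: recover a sequence from dye-terminated fragments."""
    rng = random.Random(seed)
    # One terminated fragment per position, recorded as
    # (fragment length, dye colour of the terminal ddNTP).
    fragments = [(i + 1, base) for i, base in enumerate(new_strand)]
    # In the reaction tube the fragments are mixed in no particular order.
    rng.shuffle(fragments)
    # Electrophoresis resolves the fragments by size, shortest first.
    fragments.sort(key=lambda fragment: fragment[0])
    # Reading the dye of each successive size gives the sequence.
    return "".join(base for _, base in fragments)

print(sanger_read("ATGGACCTGA"))
```

Real base calling also has to deal with noisy and overlapping peaks, which is why chromatogram quality is checked before a read is trusted.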

Example Sequence Chromatogram
Sequence Analysis






First Things First – Sequence File Formats:
Most common for nucleotides: FASTA / Multi-FASTA
“>” followed by any unicode text, entire line read as sequence title
Carriage return followed by continuous 5’- 3’ nucleotide sequence or
protein sequence using 1-letter codes
Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
GTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATC
TAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTAA
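A minimal sketch of reading this format in Python follows. The function name `parse_fasta` and the two example records are hypothetical; the parser assumes each title occupies a single line beginning with ">" and that the sequence may wrap over several lines.

```python
def parse_fasta(text):
    """Parse FASTA or multi-FASTA text into a list of (title, sequence)."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are concatenated until the next title.
            chunks.append(line.upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

example = """>seq1 hypothetical first record
ATGGACCTGA
TCACAAATGC
>seq2 hypothetical second record
MKTAYIAKQR
"""
for title, sequence in parse_fasta(example):
    print(title, len(sequence))
```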
Sequence Homology Software

NCBI-BLAST





Run by the National Center for Biotechnology
Information
BLAST uses a heuristic algorithm that approximates the
Smith-Waterman local alignment algorithm
Algorithm searches database for short words taken from
the query (default word length 11 for nucleotide searches),
then when it detects a match (the seed), searches for shared
nucleotides at each end of the seed to extend the
match
Gaps are taken into account, then the matches are
presented in order of statistical significance
http://www.ncbi.nlm.nih.gov/BLAST/
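The seed-and-extend idea can be illustrated with a short Python sketch. This is not the NCBI implementation: it extends only through exact matches and ignores scoring, gaps, and statistics. The function name `seed_and_extend` and the test sequences are assumptions made for illustration.

```python
def seed_and_extend(query, subject, word_size=11):
    """Return (query_start, subject_start, length) of the longest
    exact-match region grown from a shared word, or None."""
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    best = None
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            # Extend the seed to the left while bases agree.
            q_start, s_start = i, j
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            # Extend the seed to the right while bases agree.
            q_end, s_end = i + word_size, j + word_size
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            hit = (q_start, s_start, q_end - q_start)
            if best is None or hit[2] > best[2]:
                best = hit
    return best

core = "ATGGACCTGATCACAAATGCG"
print(seed_and_extend("CCC" + core + "TTT", "GGGGG" + core + "AAAA"))
```

Indexing the words first is what makes the search fast: only positions that already share a full word are ever examined in detail.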
Different Types of BLAST

Nucleotide-nucleotide BLAST (BLASTN):

Basic nucleotide sequence searches
The BLAST that you used for your sequences

Protein-protein BLAST (BLASTP):

Similar technology used to search amino acid
sequences

Position-Specific Iterative BLAST (PSI-BLAST):

A more advanced protein BLAST useful for analyzing
relationships between divergently evolved proteins.
Different Types of BLAST

BLASTX and TBLASTN:


Use six-frame translation of the nucleotide query
(BLASTX) or of the nucleotide database (TBLASTN) so
that the comparison is made at the protein level
MegaBLAST:

Used for BLASTing several sequences at
once to cut down on processing load and
server reporting-time
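Six-frame translation, mentioned above for the translated searches, means translating a nucleotide sequence in all three reading frames on both strands. A minimal Python sketch, assuming the standard genetic code and using made-up function names:

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to proteins."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames

for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

The "*" character marks a stop codon, so frames that are interrupted by many stops are unlikely to be the real coding frame.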
Interpreting BLAST Results

Max/Total Score


Calculated from the number of matches and gaps.
Higher relative to your query length is better
E Value: $E = Kmn\,e^{-\lambda S}$



Translation: the E Value is the number of matches
scoring this well or better that would be expected by
random chance in a search of a database of this size.
e.g. E=1e-6 means that about one such chance match
would be expected per 1,000,000 searches like this one
Smaller E Values are better
As a rule of thumb, matches with E Values larger than
about 1e-5 are too likely to be due to chance
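The E Value formula can be evaluated directly. In it, m is the query length, n is the database length, S is the alignment score, and K and λ are scaling parameters that depend on the scoring system. The sketch below uses placeholder values for K and λ that are assumptions chosen only to show how the number behaves; they are not taken from any particular BLAST run.

```python
import math

def expected_hits(score, query_len, db_len, lam, k):
    """Evaluate E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

# Placeholder parameters (assumptions, for illustration only).
K_ASSUMED = 0.5
LAMBDA_ASSUMED = 1.0

for score in (20, 30, 40):
    e_value = expected_hits(score, query_len=500, db_len=1_000_000_000,
                            lam=LAMBDA_ASSUMED, k=K_ASSUMED)
    print(score, f"{e_value:.3g}")
```

Two behaviours follow from the formula: E falls exponentially as the score rises, and it grows in proportion to both the query length and the database size, so the same alignment is less significant in a larger database.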
Interpreting BLAST Results

Query Coverage


The percent of the query sequence matched
by the database entry
Max Ident

The percent identity, i.e. the percent of aligned
positions that are identical between the query and
the database entry within the matched region
(mismatches, deletions, and insertions reduce this value)
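Both numbers can be computed by hand from a single gapped alignment, as the Python sketch below shows. It is a simplified illustration: the function name and example alignment are made up, and the NCBI report may combine several aligned regions when it calculates query coverage.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for one alignment.

    The two aligned strings must have equal length, with '-' for gaps.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    # Identical, non-gap columns count as matches.
    matches = sum(q == s and q != "-"
                  for q, s in zip(aligned_query, aligned_subject))
    # Query bases that take part in the alignment.
    query_bases = sum(q != "-" for q in aligned_query)
    identity = 100 * matches / columns
    coverage = 100 * query_bases / query_length
    return identity, coverage

# A 12-base query of which 9 bases were aligned, with one gap and one mismatch.
identity, coverage = alignment_stats("ACGT-ACGTA", "ACGTTACCTA", query_length=12)
print(identity, coverage)
```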
Sequence Aligning Software

Clustal (free)
ClustalX – Software
ClustalW – Web
http://www.ebi.ac.uk/clustalw/

DNAStar ($$$)
Functionality is similar, but difference is in
interface, tools, and speed of algorithms

SMART
Simple Modular Architecture Research Tool
Run by EMBL (European Molecular
Biology Laboratory)
While BLAST compares nucleotide
sequences and then informs you of any
domains that may have been annotated to
them, SMART compares by domains

PFAM






Protein domain database
Manually curated, trading volume for quality
Uses “hidden Markov models” for domain
pattern recognition
Run by Sanger Institute in the UK
Heuristic server-load analysis predicts when key
protein analysis report is due and crashes server
http://www.sanger.ac.uk/Software/Pfam/
Interpro
Database of protein domains and
functional sites
Best source of annotation
Other tools sometimes draw annotation
from Interpro
Run by the European Bioinformatics
Institute
http://www.ebi.ac.uk/interpro/

Protein Folding

Lowest energy state folding
Ab initio: tremendously resource heavy, can
only be done for tiny proteins
Distributed computing is used for mid-sized
proteins

Folding@Home
Human Proteome Folding Project
Rosetta@Home
Predictor@Home
Protein Folding

Software-assisted manual folding


Use knowledge of biochemistry to fold protein
into predicted structure, then software to find
lowest energy state
Commercial Programs:
Protein Shop
Profold

Manual Motif Verification

Ramachandran Plot – plot of the backbone dihedral
angles Ψ against Φ for each amino acid residue
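Each point on a Ramachandran plot is a pair of dihedral angles. Φ is defined by the backbone atoms C(i-1), N, Cα, C of a residue, and Ψ by N, Cα, C, N(i+1). The Python sketch below computes a dihedral angle from four 3D points; the coordinates used are made-up test geometry, not atoms from a real structure.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up geometry: the central bond runs along the z axis.
p0, p1, p2 = (1, 0, 0), (0, 0, 0), (0, 0, 1)
print(dihedral(p0, p1, p2, (-1, 0, 1)))  # trans arrangement
print(dihedral(p0, p1, p2, (1, 0, 1)))   # cis arrangement
```

Applying the same function to the Φ and Ψ atom sets of every residue gives the coordinates that are plotted.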
Our Procedure

Colonies were selected from nutrient plates




Each group selected two colonies to sequence
Colonies which survived ampicillin treatment were
possibly transformed by the vector, which contained
an ampicillin resistance gene
Presence of PDI insert was expected to disrupt ccdB
(lethal protein) and LacZα gene expression in vector
plasmid
LacZα expression resulted in some blue colonies, as
the colonies were able to cleave X-Gal substrate into
blue product
Initial Questions Guiding Colony
Selection






How did some blue colonies survive?
Did all blue colonies come from the PCR product?
Did the white colonies contain the PDI inserts?
Were some colonies able to survive without the
ampicillin resistance plasmid?
What was the actual sequence of the commercial
positive control insert?
Some samples were transformed with inserts collected
from PCR instead of gel electrophoresis. Could
non-PDI sequences have ligated to the vector and been
inserted into bacteria?
Procedure
Samples were prepared with T3 and T7
(forward and reverse) primers in solution
for sequencing
Samples were sent to UH Manoa lab for
sequencing
Chromatogram results were viewed with
Finch TV to determine quality

Procedure

Sequences were trimmed at 5’ and 3’
ends, then Finch TV was used to try to
locate the restriction enzyme sites on the
vector
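Locating restriction sites in a trimmed read is a plain text search, which can also be done in a few lines of Python. The enzymes used in this experiment are not named here, so the sketch uses the EcoRI recognition sequence GAATTC purely as an example; the read and the function names are also made up.

```python
def find_sites(sequence, site):
    """Return the 0-based start positions of every occurrence of site."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

def extract_insert(read, left_site, right_site):
    """Return the bases between the first left_site and the next right_site,
    or None if either site is missing."""
    read = read.upper()
    left = read.find(left_site.upper())
    if left == -1:
        return None
    insert_start = left + len(left_site)
    right = read.find(right_site.upper(), insert_start)
    if right == -1:
        return None
    return read[insert_start:right]

ECORI = "GAATTC"  # example site only; not necessarily used in this experiment
read = "TTGAATTCACGTACGTGAATTCAA"  # hypothetical trimmed read
print(find_sites(read, ECORI))
print(extract_insert(read, ECORI, ECORI))
```

The length of the extracted region is what tells you whether a full insert, a partial insert, or no insert sits between the sites.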
Procedure





Sequences were exported in FASTA format
Procedure was repeated for the other strands
Pair-wise alignment was performed for both
strands of each sample with EBI’s tools
Consensus sequence from pair-wise alignment
was used as the query for a BLAST search
Gene information was located from BLAST
annotation and TAIR website
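The pair-wise step can be sketched in Python. Because the two reads come from opposite strands, one is reverse-complemented before aligning. The code below is a plain Needleman-Wunsch global alignment with arbitrary scores; it is not the EBI tool, and the scoring values, function names, and the choice to mark disagreements with "N" are all assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA read."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]

def consensus(aligned_a, aligned_b):
    """Keep bases on which both reads agree; mark everything else 'N'."""
    return "".join(x if x == y else "N" for x, y in zip(aligned_a, aligned_b))

forward_read = "ATGGACCTGA"   # hypothetical read from one primer
opposite_read = "TCAGGTCCAT"  # hypothetical read from the other primer
aligned_a, aligned_b, total = global_align(forward_read, reverse_complement(opposite_read))
print(consensus(aligned_a, aligned_b), total)
```

Positions where the two reads disagree are the ones worth checking against the chromatograms before the consensus is submitted to BLAST.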
Results

General Remarks



Because colonies were selected before the identity of
the positive control insert was questioned, no control
colonies were sequenced
All sequenced white colonies definitively had PDI
gene insert, save for one interesting exception
Some blue colonies showed multiple nucleotide
chromatogram readings, suggesting either sample
contamination or separately transformed E. coli
growing as one colony
Group 3 Results
Sequenced 1 blue and 1 white colony from
same plate
Colonies were transformed with PCR
product, not gel-recovered DNA
White colonies had PDI insert
Blue colonies had 154 bp partial insert,
disrupting ccdB gene, but remaining in-frame
and allowing for a partially functional
LacZ alpha gene to be expressed

Group 3 White Colony

T7 strand definitively showed the presence
of a PDI insert
Group 3 White Colony

T3 and T7 strand consensus sequence
also showed PDI gene presence
Group 3 Blue Colony

Blue colony T3 showed multiple signals
Group 3 Blue Colony
However, T7 strand was salvageable
A 154 nucleotide sequence was found
between the restriction sites

Group 1 Results
White Colony from PCR product showed
PDI gene in both T3 and T7 strands
White colony from gel purification:

T7 strand sequenced as multiple signals
T3 strand sequenced excellently

Group 1 Gel White Colony

T3 sequence showed only nucleotides
1540-2320 of the vector
Group 2 Results

White Colony from gel purification


White colonies sequenced with PDI gene
Blue w/ White Ring Colony from PCR

Both T3 and T7 strand sequencing showed
consistent multiple signals
Group 4 Results
1 white colony from PCR and 1 white
colony from gel purification were
sequenced
Both showed PDI gene

Final Remarks




All white colonies had the PDI gene, except one with a modified vector
All blue colonies were transformed with the direct PCR product (not gel
purified)
Group 3 showed that a small (154 bp) insert that stays in-frame with the
LacZ gene can knock out ccdB, while still allowing the expression of an
at least partially functioning LacZ gene
Some blue colonies with white rings could be 2 separate lines living
together


Bacteria transformed with ampicillin resistance gene could deplete area of
ampicillin, allowing bacteria without the gene to crowd the white bacteria out of
the area of depleted ampicillin
How could bacteria without the insert survive both ccdB expression and ampicillin
selection in broth?



ccdB gene could be lost due to mutation
Bacteria could have cut the plasmid, deleting ccdB but possibly retaining the LacZ and
ampicillin resistance genes
No group sequenced the positive control insert – sequence still a mystery!
Resources










http://www.bioinformatics.org
http://syntheticbiology.org/Tools.html
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
SMART: http://smart.embl-heidelberg.de/
PFAM: http://www.sanger.ac.uk/Software/Pfam/
Interpro: http://www.ebi.ac.uk/interpro/
Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot):
http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
Finch TV: http://www.geospiza.com/finchtv/
EBI Pair-wise alignment:
http://www.ebi.ac.uk/emboss/align/index.html
TAIR: http://www.arabidopsis.org