Tools for BioInformatics - Computer Science
Tools for BioInformatics
Eileen Kraemer
Computer Science Dept.
The University of Georgia
Types of Tools
Lab samples
Production Sequencing Software
Sequence data
Databases, Database Search Tools
Production Sequencing
Software
used throughout the sequencing
procedure from preparation of the DNA
through to the finishing of clones.
Example: Sanger Centre,
Shotgun Sequencing of typical human
clone
Data collection
Transfer to UNIX
Gel image processing
Sequence pre-processing
DNA Fragment Assembly
Editing
Finishing Services
Quality Control and Assessment
Databases
Swiss-Prot
EMBL
Entrez
GDB
GenBank
GSDB
PDB
& more -- see links at:
http://www.public.iastate.edu/~pedro/rt_1.html
Species-specific
Databases
See: http://genetics.about.com for both:
Non-human and human genome projects
Examples:
PomBase is a compilation of data relating to
the organism Schizosaccharomyces pombe.
Wormpep contains predicted proteins from the C.
elegans genome sequencing project.
Annotation Tools
Annotation of sequences with info such as homologies
to known genes, possible gene locations, gene signals
such as promoters, etc.
Example: Genotator (Nomi Harris) -- developing a
workbench for automatic sequence annotation and
annotation viewing and editing. The goal is to run a
series of sequence analysis tools and display the results
in such a way that the various predictions can be
compared, and the researcher decides what to
include.
Database Software
ACEDB is an acronym for "A
Caenorhabditis elegans DataBase". It can
refer to a database and data concerning
the nematode C. elegans, or to the
database software alone.
Other groups may adapt existing software or
create their own. For example, David Hall’s
workflow project at UGA for Neurospora.
Types of Tools
Sequence
Structure
Function
Gene Prediction
Caution: accuracy <= ~ 70%
Good review: Snyder and Stormo,
(chapter 11 of the book Nucleic Acid and Protein Sequence
Analysis: A Practical Approach, second edition, 1994. )
Gene Prediction
GRAIL(Xgrail, JavaGrail, etc.)
Geneid
Netgene
GeneMark
Fexon, Hexon
GENSCAN
xpound
Genefinder (University of Washington)
GRAIL
Predicts coding regions
Uses a neural network which combines a series
of coding prediction algorithms.
recognizes coding potential within a fixed
size (100 base) window; evaluates coding
potential without looking for additional
features
later versions incorporate additional info
human and other species
GeneMark
Based on inhomogeneous Markov models
predicts coding and non-coding regions
based on statistical patterns in
dinucleotide frequencies … more next
week from Mark B.
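As a rough illustration of the "statistical patterns in dinucleotide frequencies" idea, the sketch below counts dinucleotides separately for each of the three codon positions. That position dependence is what makes a Markov model "inhomogeneous". This is not GeneMark's algorithm; the function name phased_dinucleotide_freqs is made up for this example.

```python
from collections import Counter

def phased_dinucleotide_freqs(seq):
    """Relative dinucleotide frequencies, tallied per codon phase (0, 1, 2).

    Hypothetical helper for illustration only; a real gene finder trains
    such statistics on known coding and non-coding sequence.
    """
    seq = seq.upper()
    counts = [Counter(), Counter(), Counter()]
    for i in range(len(seq) - 1):
        # Phase is the position of the dinucleotide's first base within a codon.
        counts[i % 3][seq[i:i + 2]] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({k: v / total for k, v in c.items()} if total else {})
    return freqs

if __name__ == "__main__":
    for phase, table in enumerate(phased_dinucleotide_freqs("ATGATGATG")):
        print(phase, table)
```

In coding DNA the three phase tables tend to differ from one another, while in non-coding DNA they tend to look alike; a predictor can exploit that difference.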
Sequence Alignment
Pairwise alignments
Multiple sequence alignments
Pairwise Alignments
SIM (Protein only) - k best non-intersecting alignments
(EXPASY)
ALIGN - optimal global alignment with no short-cuts
(EERIE)
LALIGN - calculates the N-best local alignments
(EERIE)
LFASTA - local similarity searches showing local
alignments (EERIE)
BLAST 2 - local alignment using BLAST (NCBI)
LAP2 - local DNA to protein alignment with LAP2 (MTU)
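To make "optimal global alignment with no short-cuts" concrete, here is a minimal dynamic-programming sketch in the style of Needleman-Wunsch with a linear gap penalty. The scoring values (match=1, mismatch=-1, gap=-2) are assumptions for the example, and this is not the code used by ALIGN or any of the servers listed above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b.

    Fills the full dynamic-programming table row by row, keeping only
    the previous row in memory. Scoring parameters are illustrative.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

A local alignment (as in LALIGN) differs mainly in clamping each cell at zero and taking the maximum over the whole table rather than the final cell.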
Multiple Sequence
Alignments
ClustalW 1.7 (DNA/Protein) - Global progressive (BCM)
CAP Sequence Assembly (DNA) - Contig Assembly
MAP (DNA/Protein) - Global progressive in linear space
PIMA 1.4 (Protein only) - Pattern-Induced (local) Multiple
Alignment (BCM)
MSA 2.1 (Protein only) - Near-optimal sum-of-pairs
global (WashU)
BLOCK MAKER (Protein only) - Finds conserved blocks
in seq sets (FHCRC)
MEME 2.2 (DNA/Protein) - Multiple EM for Motif
Elicitation (SDSC)
Similarity Searching
BLAST -- (BLASTP, TBLASTN, etc.)
a nucleotide or protein sequence sent to the
BLAST server is compared against the sequence
databases, and a summary of matches is returned
to the user.
allows all combinations of DNA or protein
query sequences with searches against DNA
or protein databases:
BLAST variations
blastp compares an amino acid query sequence against a
protein sequence database.
blastn compares a nucleotide query sequence against a
nucleotide sequence database.
blastx compares the six-frame conceptual translation
products of a nucleotide query sequence (both strands) against
a protein sequence database.
tblastn compares a protein query sequence against a
nucleotide sequence database dynamically translated in all
six reading frames (both strands).
tblastx compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide
sequence database.
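The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be sketched as follows: three reading frames on the given strand plus three on its reverse complement. This is only an illustration using the standard genetic code; the function names are made up here and none of this is BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Frames +1, +2, +3 followed by -1, -2, -3."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(s[offset:]) for s in strands for offset in range(3)]

if __name__ == "__main__":
    for frame in six_frame_translations("ATGGCCTAA"):
        print(frame)
```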
Types of Tools
Sequence
Structure
Function
Protein Structure
Prediction
Ab initio -- based on energy minimization
fold recognition -- sequence -> secondary
structure, then align secondary structures
with corresponding secondary structures
in related proteins, etc.
statistical -- based on “hidden patterns”;
similar patterns -> similar structure
Protein Secondary
Structure Prediction
Coils - prediction of coiled coil regions
nnPredict - uses a 2 layer neural network
PSSP / SSP - segment-oriented prediction
PSSP / NNSSP - nearest-neighbor prediction
SAPS - statistical analysis of protein sequences
Paircoil - predicts coiled coil regions from pairwise residue
correlations
Protein Hydrophilicity /Hydrophobicity
SOPM - self optimized prediction method
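Hydrophilicity / hydrophobicity tools of the kind listed above typically average a per-residue score over a sliding window. The sketch below assumes the Kyte-Doolittle hydropathy scale and a window of 9 residues; the slides do not say which scale or window any particular server uses, so both are assumptions.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Mean hydropathy of each window of consecutive residues.

    Returns one value per window position; sequences shorter than the
    window give an empty list. Only the 20 standard residues are handled.
    """
    scores = [KD[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

if __name__ == "__main__":
    print(hydropathy_profile("MKTLLILAVVAAALA", window=9))
```

Stretches of consistently high values are candidate membrane-spanning or buried regions; consistently low values suggest surface-exposed regions.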
Types of Tools
Sequence
Structure
Function
Protein Function
Prediction
Pfam - groups of proteins with similar function are
aligned and an HMM is generated for each “cluster”
An HMM is generated for a protein of unknown function
and compared to the HMMs of known proteins to
predict its functional classification
Pfam components
PROTEIN HMM SEARCH - Analyze a protein query
sequence to find Pfam domain matches.
DNA HMM SEARCH - Analyze a DNA query sequence
to find Pfam domain matches. (Uses the GeneWise
server at the Sanger Centre.)
BROWSE PFAM - View Pfam annotation and
alignments.
TEXT SEARCH - Query Pfam by keywords.
BROWSE SWISSPFAM - View the domain organization
of any SWISSPROT/TrEMBL sequence according to
Pfam.
Types of Tools
Sequence
Across organisms …
Phylogeny
Reconstruction
Phylogeny Reconstruction
Construct evolutionary trees based on
divergences that occur in related
sequences
parsimony, minimum distance, etc.
parsimony -- construct tree so that number
of mutation events is minimized
PHYLIP, PAUP, others, some interactive
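The parsimony criterion above can be illustrated by counting the minimum number of mutation events a given tree needs at a single alignment site. The sketch uses Fitch's small-parsimony method on a binary tree written as nested tuples; that representation and the function name are assumptions for this example, and it is not how PHYLIP or PAUP are implemented.

```python
def fitch_score(tree):
    """Return (possible_states, min_mutations) for one site.

    A leaf is a one-character string (the observed base); an internal
    node is a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    lset, lcost = fitch_score(left)
    rset, rcost = fitch_score(right)
    common = lset & rset
    if common:
        # The children can agree on a state: no extra mutation at this node.
        return common, lcost + rcost
    # The children disagree: one mutation is needed on a branch below this node.
    return lset | rset, lcost + rcost + 1

if __name__ == "__main__":
    # Two candidate trees for four taxa whose bases at this site are A, A, C, C.
    grouped = (("A", "A"), ("C", "C"))
    mixed = (("A", "C"), ("A", "C"))
    print(fitch_score(grouped)[1], fitch_score(mixed)[1])
```

Summing this count over all sites and choosing the tree with the smallest total is the parsimony criterion; the hard part in practice is searching the space of candidate trees.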
Visualization Tools
Database viewers
Sequence viewers
Molecular viewers
Physical Mapping Software
used to physically locate genetic markers.
FPC - Software for FingerPrinting Contigs.
Image 3.x - Software for processing fingerprint gel
images.
RHServer - Web interface that positions one or more
markers on the 1998 International Gene Map (GB4).
SAM - System for Assembling Markers. SAM takes as input
a set of clones and their associated markers, and
outputs a partially ordered marker map.
Z-RHMAPPER - Extensions to the RHMAPPER (Whitehead)
Radiation Hybrid Mapping Package.
Good Resources
Pedro’s BioMolecular Research
http://www.public.iastate.edu/~pedro/rt_1.html
BCM pages
www.hgsc.bcm.tmc.edu/SearchLauncher/index.html
Sanger Centre
www.sanger.ac.uk/Software/Sequencing/overview.shtml
Mining Co. Web Site
genetics.miningco.com
& many others