BioInformatics
Abstract
In order to help predict the way proteins will
act in an organism, biologists compare sequences of
amino acids from many proteins. There are 20 standard
amino acids, and proteins often consist of 300
or more amino acids. A “multiple alignment” is performed
on a collection of sequences to maximize the areas where
the amino acids are similar across all sequences. Web-based
tools are currently available to accomplish this task.
Once the multiple alignment is complete, a
tedious process begins of searching for contiguous
subsequences of the aligned group of protein sequences
that may be useful in determining properties of the
proteins’ functions. Subsequences that are selected for
further analysis are called “primers.” The primer search
process is often done by hand and can take hours even for
short sequences.
This project comprises a Java program that
automates the primer search process and a database
that organizes the results obtained after primers are generated.
The software allows the user to examine multiple primers
at once and to adjust primer lengths. Once the primers are
generated, lab tests are performed on the primers and the
results are entered into the database. The database can be
queried to find results that might be useful to a biologist.
What is a Protein Sequence?
A string of amino acids, each represented by a single letter
There are 20 different amino acids
Typical proteins are about 300 amino acids long
EXAMPLE:
…ILVKMUTANKVKMU…
Multiple Alignment
Example
Shaded areas show regions of exact match.
A dash is placed in the shorter protein
sequence to achieve the alignment.
Duplicate amino acids in each column are then
removed.
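The search for exact-match regions described above can be sketched in Java. This is a minimal illustration, not the poster's actual program: the class name `ConservedRegions`, its methods, and the three short sequences are all made up for the example, and it assumes every aligned sequence has the same length, with `-` marking a gap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, written for illustration only.
class ConservedRegions {

    // True if every sequence has the same non-gap letter at this column.
    static boolean isExactMatch(List<String> aligned, int col) {
        char first = aligned.get(0).charAt(col);
        if (first == '-') {
            return false; // a gap never counts as a match
        }
        for (String s : aligned) {
            if (s.charAt(col) != first) {
                return false;
            }
        }
        return true;
    }

    // Collects each contiguous run of exact-match columns that is at least
    // minLength long, returned as the matching substring.
    static List<String> conservedRuns(List<String> aligned, int minLength) {
        List<String> runs = new ArrayList<>();
        int width = aligned.get(0).length();
        int start = -1; // start column of the current run, or -1 if none
        for (int col = 0; col <= width; col++) {
            boolean match = col < width && isExactMatch(aligned, col);
            if (match && start < 0) {
                start = col;
            } else if (!match && start >= 0) {
                if (col - start >= minLength) {
                    runs.add(aligned.get(0).substring(start, col));
                }
                start = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Made-up aligned sequences; the dash pads the shorter one.
        List<String> aligned = List.of("MKVLAT-GH", "MKVLSTWGH", "MKVLATWGH");
        System.out.println(conservedRuns(aligned, 2));
    }
}
```

Runs found this way are only candidates; which of them make useful primers still depends on further criteria such as degeneracy.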
Degeneracy Example
The codons are listed for each
corresponding amino acid to
determine how many different
ways each amino acid can be
encoded in DNA.
The total degeneracy is the
product of each amino acid’s
codon count. The higher this number is,
the less certain we can be about which
DNA sequence the primer originated from, and the
less useful it is in
experiments.
Inspection Window
This window allows
the user to
manipulate one
particular primer
chosen from a
multiple alignment.
The control buttons located at the bottom allow the length
and position of the primer to be changed with degeneracy
updated automatically.
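As a rough sketch of how degeneracy could be recomputed whenever the primer's position or length changes, the Java below multiplies the number of codons for each amino acid in the selected window. The class and method names are hypothetical, and the codon counts are those of the standard genetic code; the poster does not say which table its program uses. Gap characters and letters outside the 20 standard amino acids are rejected.

```java
// Hypothetical helper, written for illustration only.
class PrimerDegeneracy {

    // One-letter codes for the 20 standard amino acids, with the number of
    // codons for each in the standard genetic code (same index order).
    static final String AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV";
    static final int[] CODON_COUNTS =
            {4, 6, 2, 2, 2, 2, 2, 4, 2, 3, 6, 2, 1, 2, 4, 6, 4, 1, 2, 4};

    // Total degeneracy: the product of the codon counts of every amino acid.
    static long degeneracy(String primer) {
        long total = 1;
        for (char aa : primer.toCharArray()) {
            int idx = AMINO_ACIDS.indexOf(aa);
            if (idx < 0) {
                throw new IllegalArgumentException("Not a standard amino acid: " + aa);
            }
            total *= CODON_COUNTS[idx];
        }
        return total;
    }

    // Degeneracy of the primer starting at 'start' with the given length,
    // as it would be recalculated after the window is moved or resized.
    static long windowDegeneracy(String sequence, int start, int length) {
        return degeneracy(sequence.substring(start, start + length));
    }

    public static void main(String[] args) {
        String sequence = "MKVLATWGH"; // made-up example sequence
        System.out.println(windowDegeneracy(sequence, 0, 4)); // window MKVL
        System.out.println(windowDegeneracy(sequence, 4, 3)); // window ATW
    }
}
```

For the window MKVL the product is 1 × 2 × 4 × 6 = 48, and for ATW it is 4 × 4 × 1 = 16, so the second window would be the less degenerate choice.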
Database of Primer Results
Biological Description of the Gene
Name of Gene
Nucleotide Sequence for Gene
Information for the Experiment
By clicking on Oligos, you can
choose which oligos were present
in the reaction.
Amino Acid Sequence
Oligos Contained in the Gene
By clicking on Observations,
you can record results about
each reaction.
Reactions for the Experiment
Data Mining
We want to find association rules, based on data collected about
primers, to predict which primers to use
Association rules have the form LHS → RHS
Interpretation: If every item in LHS occurs, then it is likely that all of
the items in RHS will also occur
Example:
LHS = protein sequence A contains primers 1, 2 & 3
RHS = protein sequence A contains primers 4 & 5
Data Mining:
Support & Confidence
Support
How often do LHS & RHS occur together?
Confidence
Whenever LHS occurs, how often does RHS occur
as well?
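The two measures can be written as simple ratios. The sketch below assumes each record is the set of primer numbers found in one protein sequence; that representation, the class name `AssociationRuleStats`, and the four sample records are assumptions made for illustration, not the poster's database schema or data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, written for illustration only.
class AssociationRuleStats {

    // Number of records that contain every item in 'items'.
    static long countContaining(List<Set<Integer>> records, Set<Integer> items) {
        return records.stream().filter(r -> r.containsAll(items)).count();
    }

    // Support: fraction of all records in which LHS and RHS occur together.
    static double support(List<Set<Integer>> records, Set<Integer> lhs, Set<Integer> rhs) {
        if (records.isEmpty()) {
            return 0.0;
        }
        Set<Integer> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return (double) countContaining(records, both) / records.size();
    }

    // Confidence: of the records that contain LHS, the fraction that also contain RHS.
    static double confidence(List<Set<Integer>> records, Set<Integer> lhs, Set<Integer> rhs) {
        long lhsCount = countContaining(records, lhs);
        if (lhsCount == 0) {
            return 0.0;
        }
        Set<Integer> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return (double) countContaining(records, both) / lhsCount;
    }

    public static void main(String[] args) {
        // Made-up records: the primers found in four protein sequences.
        List<Set<Integer>> records = List.of(
                Set.of(1, 2, 3, 4, 5),
                Set.of(1, 2, 3),
                Set.of(1, 2, 3, 4, 5),
                Set.of(2, 4));
        Set<Integer> lhs = Set.of(1, 2, 3);
        Set<Integer> rhs = Set.of(4, 5);
        System.out.println("support = " + support(records, lhs, rhs));
        System.out.println("confidence = " + confidence(records, lhs, rhs));
    }
}
```

In this made-up data, LHS and RHS occur together in 2 of 4 records (support 0.5), and 2 of the 3 records containing LHS also contain RHS (confidence about 0.67).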
Scope
The data set is small compared to online databanks
Drawing on larger data sources in the future should increase
the support of any predictions made