Transcript here - ADUG

Sequence Search
The Problem as Defined in the Windows 3.1 Days
• Search for a sequence in a database several
megabytes in size, on a machine with 640 KB of
memory, as quickly as possible and
return all matching sequences
• If the sequence does not exist then add it to the
data file
• Each sequence will be given a unique identifier
• A sequence may be a subset of another
sequence
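The requirements above can be sketched as a small in-memory model. This is only an illustration, not the original implementation: the class name `SequenceStore` and the method `search_or_add` are hypothetical, "matching" is assumed to mean "contains the query as a contiguous run", and "does not exist" is assumed to mean "no stored sequence equals the query exactly". The minimum length of 4 comes from the givens. The sketch ignores the 640 KB memory limit entirely.

```python
from itertools import count


class SequenceStore:
    """Toy in-memory stand-in for the sequence data file (hypothetical)."""

    def __init__(self):
        self._by_id = {}      # unique identifier -> tuple of entry numbers
        self._ids = count(1)  # source of unique identifiers

    @staticmethod
    def _contains(haystack, needle):
        """True if needle occurs in haystack as a contiguous run."""
        n = len(needle)
        return any(
            haystack[i:i + n] == needle
            for i in range(len(haystack) - n + 1)
        )

    def search_or_add(self, query):
        """Return (ids of matching sequences, new id or None).

        Assumption: a stored sequence matches when it contains the query.
        The query is added only if no stored sequence equals it exactly.
        """
        query = tuple(query)
        if len(query) < 4:
            raise ValueError("sequences need at least 4 amino acids")
        matches = [
            sid for sid, seq in self._by_id.items()
            if self._contains(seq, query)
        ]
        new_id = None
        if not any(self._by_id[sid] == query for sid in matches):
            new_id = next(self._ids)
            self._by_id[new_id] = query
        return matches, new_id
```

A linear scan like this only shows the required behaviour; it says nothing about how the original system achieved its speed.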
Givens
• Unlimited disk space
• Sequences made up of amino acids from
a growing set
• Each amino acid in the database given an
entry number
• Sequences are made up of at least 4
amino acids and may be of any length
upwards
Limitations
• Max network speed 2 Mb/sec, on LANtastic
• No SQL databases, only Paradox
available
Current Situation
• Amino acid table consists of approximately
700 entries
• Over 38000 unique sequences
• Sequences occupy over 11 MB of a Paradox
table
• 2 auxiliary tables, 17 MB in total
• Negative result returned almost instantly
Implementation
• Each amino acid is represented by a letter or by its
entry number in the AA table, e.g. ABCFTR or
ABS(123)DFR
• The sequence as entered is converted to a hex-triple
representation, e.g. 00100200301F
• Hex-triple chosen as it requires only 3
characters per amino acid to represent up to
4095 distinct amino acids, which makes for
shorter sequence representations
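A minimal sketch of the hex-triple conversion described above, under stated assumptions. The letter-to-entry-number mapping (`LETTER_TO_ENTRY`, A=1 to Z=26) is invented for illustration, since the real amino acid table is not shown. The function names are hypothetical. Entry numbers are assumed to start at 1, which is consistent with the 4095 figure (three hex digits span 001 to FFF).

```python
import re

# Hypothetical lookup: single-letter codes map to entry numbers 1..26.
# The real amino acid table is not shown in the slides.
LETTER_TO_ENTRY = {chr(ord("A") + i): i + 1 for i in range(26)}

# A token is either one uppercase letter or an entry number in parentheses.
TOKEN = re.compile(r"([A-Z])|\((\d+)\)")


def parse_sequence(text):
    """Turn input such as 'ABS(123)DFR' into a list of entry numbers."""
    entries = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at position {pos}")
        if m.group(1):
            entries.append(LETTER_TO_ENTRY[m.group(1)])
        else:
            entries.append(int(m.group(2)))
        pos = m.end()
    return entries


def to_hex_triples(entries):
    """Encode each entry number as exactly 3 uppercase hex digits."""
    parts = []
    for n in entries:
        if not 1 <= n <= 0xFFF:
            raise ValueError(f"entry number {n} does not fit in a hex triple")
        parts.append(f"{n:03X}")
    return "".join(parts)


def from_hex_triples(encoded):
    """Decode a hex-triple string back into entry numbers."""
    if len(encoded) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return [int(encoded[i:i + 3], 16) for i in range(0, len(encoded), 3)]


print(to_hex_triples([1, 2, 3, 31]))  # 00100200301F
```

Because every amino acid occupies exactly three characters, a substring match on the encoded form is only meaningful when it starts at an offset that is a multiple of 3.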
Do we need to update the system?
• Yes, we want to be rid of Paradox