Mining Single Nucleotide Polymorphisms from public sequence databases
Gary Barker
IACR Long Ashton
What are Single Nucleotide Polymorphisms (SNPs)?
ATGGTAAGCCTGAGCTGACTTAGCGT-AT
ATGGTAAACCTGAGTTGACTTAGCGTCAT
       ^      ^           ^
      snp    snp        indel
SNPs result from replication errors and DNA damage
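
As an illustration of the two-sequence alignment on this slide, the sketch below walks the aligned columns and labels each difference as a SNP or an indel. The function name `classify_differences` is made up for this example, and it assumes both sequences are already aligned to the same length with `-` marking a gap.

```python
def classify_differences(seq_a, seq_b, gap="-"):
    """Return (position, kind, base_a, base_b) for every differing column.

    Positions are 1-based alignment columns. A column is an "indel" when
    exactly one sequence has a gap there, otherwise it is a "snp".
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    differences = []
    for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a == b:
            continue  # identical column, nothing to report
        kind = "indel" if gap in (a, b) else "snp"
        differences.append((position, kind, a, b))
    return differences


# The two sequences shown on the slide.
seq_1 = "ATGGTAAGCCTGAGCTGACTTAGCGT-AT"
seq_2 = "ATGGTAAACCTGAGTTGACTTAGCGTCAT"

for position, kind, a, b in classify_differences(seq_1, seq_2):
    print(position, kind, a, b)
# 8 snp G A
# 15 snp C T
# 27 indel - C
```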
Why are these polymorphisms useful?
It’s sometimes possible to correlate a
SNP with a particular trait.
This is known as association genetics.
Genotype all individuals for thousands of SNPs.
Disease resistant population:   geneX  ATGATTATAG
Disease susceptible population: geneX  ATGTTTATAG
Resistant people all have an ‘A’ at position 4 in geneX,
while susceptible people have a ‘T’.
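
A minimal sketch of the geneX example: tally the allele at one position separately for each phenotype group. The six individuals below are hypothetical and simply repeat the two geneX sequences from the slide; `allele_counts_by_group` is a name invented for this example. A real association study would follow the tally with a statistical test over many individuals and loci.

```python
from collections import Counter


def allele_counts_by_group(genotypes, position):
    """Count the allele at a 1-based position, separately per phenotype."""
    counts = {}
    for phenotype, sequence in genotypes:
        allele = sequence[position - 1]
        counts.setdefault(phenotype, Counter())[allele] += 1
    return counts


# Hypothetical individuals: (phenotype, geneX sequence).
genotypes = [
    ("resistant", "ATGATTATAG"),
    ("resistant", "ATGATTATAG"),
    ("resistant", "ATGATTATAG"),
    ("susceptible", "ATGTTTATAG"),
    ("susceptible", "ATGTTTATAG"),
    ("susceptible", "ATGTTTATAG"),
]

counts = allele_counts_by_group(genotypes, position=4)
for phenotype, alleles in counts.items():
    print(phenotype, dict(alleles))
# resistant {'A': 3}
# susceptible {'T': 3}
```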
To use SNPs, you first have to find them.
Poorly studied organisms:
Sequence many ‘loci’ (different places in the genome)
for many individuals.
Many well studied organisms:
The required data are already present in public sequence databases;
they just need to be processed.
Number of ESTs* in EMBL database (07-11-02)

Search string        (common)    ESTs in EMBL
Homo sapiens         (man)       4,798,137
Hordeum vulgare      (barley)      308,301
Triticum aestivum    (wheat)       264,910
Zea mays             (maize)       181,164
Oryza sativa         (rice)        112,240
*ESTs (expressed sequence tags) are single-pass (often partial) gene sequences
Mining SNPs from EST sequences in the database
AutoSNP (a Perl script) can find likely SNPs in data sets
downloaded from public databases.
1) Marks up only those polymorphisms where each allele is
supported by at least two independent sequences.
This filters out most sequencing errors.
2) Adds further confidence scores based on co-segregation.
3) Writes results to HTML reports.
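
The redundancy rule in step 1 can be sketched as follows. This is not AutoSNP's own code (which is a Perl script); it is a small Python illustration that assumes the reads are already aligned to equal length, that each read counts as one independent sequence, and that gaps and `N` calls are ignored. The name `candidate_snps` and the example reads are made up for this sketch.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_support=2, ignore="-N"):
    """Return {column: {allele: count}} for likely SNP columns.

    A column qualifies only when at least two different alleles are each
    seen in at least `min_support` reads. Columns are 1-based.
    """
    if not aligned_reads:
        return {}
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must have equal length")
    snps = {}
    for column in range(length):
        counts = Counter(
            read[column] for read in aligned_reads if read[column] not in ignore
        )
        supported = {a: n for a, n in counts.items() if n >= min_support}
        if len(supported) >= 2:
            snps[column + 1] = supported
    return snps


# Hypothetical aligned reads. Column 4 has two well-supported alleles;
# the lone 'C' in column 8 looks like a sequencing error and is filtered out.
reads = [
    "ATGATTATAG",
    "ATGATTATAG",
    "ATGTTTATAG",
    "ATGTTTATAG",
    "ATGTTTACAG",
]

print(candidate_snps(reads))
# {4: {'A': 2, 'T': 3}}
```

Raising `min_support` trades sensitivity for fewer false positives; the value 2 mirrors the "at least two independent sequences" rule stated above.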
Accessing AutoSNP results
1) Search by accession number
2) Search with a query sequence
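Current AutoSNP approach:
Offline: Cluster sequences (d2cluster) -> Align and find SNPs (cap3)
-> store Accession # / SNP report # pairs in a MySQL database, e.g.
gi|11117503 | snip_1.htm
gi|12217138 | snip_2.htm
Query with accession: Accession -> MySQL database
-> Links to existing SNP reports
Sequence query: Sequence -> Blast client -> Matching accessions
-> MySQL database -> Links to existing SNP reports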
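
The accession-to-report lookup can be sketched as below. The slide names MySQL; this example swaps in Python's built-in `sqlite3` so that it runs without a server, and the table name `snp_report`, its column names, and the helper functions are all assumptions made for illustration. The two stored rows are the ones shown on the slide; `gi|00000000` is a made-up accession with no report.

```python
import sqlite3


def build_report_index(rows):
    """Create an in-memory table mapping accession -> SNP report file."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE snp_report ("
        "accession TEXT PRIMARY KEY, report TEXT NOT NULL)"
    )
    connection.executemany("INSERT INTO snp_report VALUES (?, ?)", rows)
    connection.commit()
    return connection


def reports_for(connection, accessions):
    """Return (accession, report) pairs for accessions that have a report."""
    found = []
    for accession in accessions:
        row = connection.execute(
            "SELECT report FROM snp_report WHERE accession = ?", (accession,)
        ).fetchone()
        if row is not None:
            found.append((accession, row[0]))
    return found


index = build_report_index(
    [("gi|11117503", "snip_1.htm"), ("gi|12217138", "snip_2.htm")]
)

# Accessions as they might come back from a Blast search.
hits = reports_for(index, ["gi|12217138", "gi|00000000"])
print(hits)
# [('gi|12217138', 'snip_2.htm')]
```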
Desirable:
Client supplied query sequence (ATAGCGTACG……)
1) Blast search (data direct from EBI?): data and processing power (large)
2) Build contigs of results: processing power (medium)
3) Detect eSNPs: processing power (small)
Target: < 10 seconds
Client gets SNP report(s) (html)
for all sequences matching query
Conclusions
SNPs (single nucleotide polymorphisms) are abundant and
useful genetic markers.
Software exists to mine them from public data sets, but it
does not run in real time.
GRID technology could help to deliver up-to-date alignments
to users for any query sequence with putative SNPs marked up.
Related useful features would include bootstrapped trees
for each alignment, generated on the fly.
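
As a rough illustration of the resampling step behind bootstrapped trees, the sketch below draws alignment columns with replacement to produce bootstrap replicates. Tree building itself is not shown; each replicate would be passed to whatever tree-building program is in use. The function name, the three short sequences, and the choice of 100 replicates are assumptions for this example.

```python
import random


def bootstrap_alignment(aligned_sequences, rng):
    """Resample alignment columns with replacement.

    Every sequence is rebuilt from the same randomly drawn columns, so the
    replicate has the same number of sequences and the same length.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in aligned_sequences]


# Hypothetical three-sequence alignment.
alignment = ["ATGGTAAGCC", "ATGGTAAACC", "ATGGTAAACC"]

# One seeded generator per replicate keeps the run reproducible.
replicates = [
    bootstrap_alignment(alignment, random.Random(seed)) for seed in range(100)
]
print(len(replicates), len(replicates[0]), len(replicates[0][0]))
# 100 3 10
```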