Evidence - iPlant Pods

Transcript Evidence - iPlant Pods

How can we find genes?
Search for them
Look them up
How do I get from this…
>mouse_ear_cress_1080
GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT
CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC
CTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG
AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG
TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT
TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA
GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA
GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC
AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG
ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT
TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC
TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA
GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC
CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG
AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT
GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA
TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT
CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA
AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC
TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT
TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA
TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT
TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA
ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA
CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA
TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT
…to this?
Meaning?
Mathematical Tools (Code; statistics)
Comparative Tools (Database searches)
What do we know about genes?
• Expressed (Transcribed)
– Transcriptional start & termination sites (TXSS, TXTS)
– Transcription artefacts (cDNA & ESTs)
• Regulated
– Promoters (TATAAA)
– Transcription Factor Binding Sites
– CpG (cytosine methylation)
• Meaningful (Translated)
– 3n basepairs
– Codon usage
– Translational start & stop/termination codons (TLSS, TLTS)
– Translation artefacts (proteins)
• Spliced
– Splice sites (GT-AG)
• Derived (Homology: Paralogy/Orthology)
– Search for known genes, proteins (BLAST)
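Several of the features listed above are short sequence signals that can be located with plain string searches. The following is a minimal sketch, not a gene finder: the function names find_motif and cpg_observed_expected are hypothetical, and the CpG observed/expected ratio used here (CG count times length, divided by C count times G count) is one common convention assumed for illustration.

```python
import re


def find_motif(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    pattern = f"(?={re.escape(motif.upper())})"
    return [m.start() for m in re.finditer(pattern, seq.upper())]


def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: count(CG) * length / (count(C) * count(G))."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    demo = "ccTATAAAggGTaaAGcgcg"  # made-up sequence, for illustration only
    print("TATAAA at:", find_motif(demo, "TATAAA"))
    print("GT (candidate splice donor) at:", find_motif(demo, "GT"))
    print("AG (candidate splice acceptor) at:", find_motif(demo, "AG"))
    print("CpG obs/exp:", round(cpg_observed_expected(demo), 3))
```

Most GT and AG hits are not real splice sites, and most TATAAA hits are not real promoters, so signals like these only become useful once they are combined with other evidence.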
How might this knowledge help to find genes?
• Predict genes
– Look for potential starts and stops.
– Connect them into open reading frames (ORFs).
– Filter for “correct” length & codon usage.
• Search databases
– Known genes: UniGene
– Known proteins: UniProt
• Use transcript evidence
– cDNA
– ESTs
– proteins
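The "filter for length & codon usage" step above can be illustrated with a small sketch. The helpers codon_usage and passes_length_filter are hypothetical names, and the thresholds are placeholders; a real predictor would compare the observed frequencies against a species-specific codon usage table, which is not shown here.

```python
from collections import Counter


def codon_usage(orf):
    """Return the relative frequency of each codon in an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def passes_length_filter(orf, min_codons, max_codons):
    """True if the ORF is a whole number of codons within the given bounds."""
    n_codons, remainder = divmod(len(orf), 3)
    return remainder == 0 and min_codons <= n_codons <= max_codons


if __name__ == "__main__":
    orf = "ATGGCTGCTTAA"  # made-up ORF: ATG GCT GCT TAA
    print(codon_usage(orf))
    print(passes_length_filter(orf, min_codons=3, max_codons=10))
```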
Operating computationally
• Go to beginning of sequence → start SCAN
• If ATG → register putative TLSS; then
– Move in 3-steps & count steps (= COUNTS)
– If 3-step = (TAA or TAG or TGA) → register putative TLTS
– If register → evaluate COUNTS (= triplets)
If COUNTS < minimum → discard; then go behind ATG above and start SCAN
If COUNTS > maximum → discard; then go behind ATG above and start SCAN
If minimum < COUNTS < maximum → record as GENE with TLSS, TLTS; then go behind ATG above and start SCAN
• Arrive at end of sequence → stop SCAN
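The scan described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a production gene finder: scan_orfs is a hypothetical name, only the forward strand is scanned, COUNTS is taken to include both the ATG and the stop codon, and an ATG with no in-frame stop before the end of the sequence is simply skipped.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_orfs(seq, min_codons, max_codons):
    """Return (tlss, tlts_end, counts) for each ORF with min < counts < max.

    tlss is the 0-based index of the A in ATG; tlts_end is the index just
    past the stop codon, so seq[tlss:tlts_end] is the whole ORF.
    """
    seq = seq.upper()
    genes = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)  # SCAN for the next putative TLSS
        if start == -1:
            break  # arrived at end of sequence: stop SCAN
        counts = 0
        # Move in steps of 3 from the ATG and count triplets.
        for i in range(start, len(seq) - 2, 3):
            counts += 1
            if seq[i:i + 3] in STOP_CODONS:  # putative TLTS
                if min_codons < counts < max_codons:
                    genes.append((start, i + 3, counts))
                break  # too short or too long ORFs are discarded
        pos = start + 3  # go behind the ATG and resume SCAN
    return genes


if __name__ == "__main__":
    # Made-up sequence: ATG AAA TTT TAG starting at index 2.
    print(scan_orfs("CCATGAAATTTTAGGG", min_codons=2, max_codons=10))
```

Because the scan resumes directly behind each ATG rather than behind the stop codon, nested and overlapping ORFs in any of the three forward frames are also reported.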
Annotation workflow
[Workflow diagram; stage labels: Mathematical evidence, Browse results, Find gene families, Get/Generate sequence, Browse in context, Biological evidence, Construct gene models, Analyze large data sets]
Annotation Cheat Sheet
A. DNA Subway
• Open existing project or generate new (Red square)
• Run RepeatMasker
• Generate evidence (Predictions, BLAST searches)
• Synthesize evidence into gene models (Apollo)
• Browse results locally and in context (Phytozome)
• Conduct functional analysis (link from Browser)
• Prospect for gene family (Yellow Line from Browser)
B. Apollo
• Select region that holds biological gene evidence
• Optimize work space and zoom to region (View tab)
• Expand all tiers (Tiers tab)
• Drag evidence item(s) onto workspace (mouse)
• Edit to match biological evidence (right-click item for tools)
• Record what was done in Annotation Info Editor
• Assess necessity to build alternative model(s)
Predictors (mathematical evidence)
• Utilize predominantly mathematical methods (statistical).
• Search for patterns
– Some score starts, stops, splice sites (GenScan).
– Some score nucleotides (Augustus, FGenesH).
• Few incorporate EST data and/or known genes/proteins.
• Require optimization for each new species (training).
• Accuracy:
– False positives (scoring non-genes as genes): 5%-50%.
– False negatives (missed genes): 5%-40%.
– Weak at, or incapable of, determining first and last exons and UTRs.
• Specific for gene models (spliced genes, non-spliced genes).
• Specialty predictors (tRNAscan-SE, RepeatMasker).
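The false-positive and false-negative figures above can be made concrete with a small calculation. This is a sketch under assumptions: prediction_error_rates is a hypothetical helper, genes are compared as exact (start, end) coordinate pairs, the false-positive rate is taken as the fraction of predictions absent from the reference set, and the false-negative rate as the fraction of reference genes that were not predicted. Published evaluations may use other definitions, for example per-exon or per-nucleotide comparisons.

```python
def prediction_error_rates(predicted, reference):
    """Return (false_positive_rate, false_negative_rate) for two gene sets."""
    predicted, reference = set(predicted), set(reference)
    false_positives = predicted - reference  # predicted, but not a known gene
    false_negatives = reference - predicted  # known gene that was missed
    fp_rate = len(false_positives) / len(predicted) if predicted else 0.0
    fn_rate = len(false_negatives) / len(reference) if reference else 0.0
    return fp_rate, fn_rate


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    predicted = [(100, 400), (900, 1500), (2000, 2600), (3000, 3300)]
    reference = [(100, 400), (900, 1500), (5000, 5600)]
    fp, fn = prediction_error_rates(predicted, reference)
    print(f"false positives: {fp:.0%}, false negatives: {fn:.0%}")
```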
Search tools (biological evidence)
• Search databases of sequenced molecules (tangible evidence):
– Known genes
– Known proteins
– cDNAs & ESTs
• Utilize alignment methods (BLAST, BLAT).
• Reliability:
– Good at determining gene locations and general gene structures.
– Weak at determining exact exon/intron borders.
– Unlikely to correctly determine TXSS and TXTS.
– Should be used with cDNA/ESTs from the same species as the genome.
Sequence & course material repository
http://gfx.dnalc.org/files/evidence
Don’t open items; save them to your computer!
• Annotation (sequences & evidence)
• Manuals (DNA Subway, Apollo, JalView)
• Presentations (.ppt files)
• Prospecting (sequences)
• Readings (Bioinformatics tools, splicing, etc.)
• Worksheets (Word docs, handouts, etc.)
• BCR-ABL (temporary; not course-related)