Legume Information Network


Legume Information Network: A Component of the Virtual Plant Information Network
National Center for Genome Resources
University of Minnesota – Center for Computational Genomics and Bioinformatics
United States Department of Agriculture – Agricultural Research Service
Gregory D. May
Atlanta
October 2007
Current State of Bioinformatics Resources
• Hundreds of project web sites and DBs
• Project DBs are distributed, autonomous and ephemeral
• Inconsistent user interfaces
Kinds of resources: CODs, MODs, automatic annotation shops, TIGR Gene Indices, static repositories, LIMs
Stein et al. (2006) Plant Biology Databases: A Needs Assessment by the NSF-USDA Working Group on Long-Lived Databases.
The promises of 30+ high-throughput ‘omics’ technologies
• Improved crops: nutrition, novel traits, resistance, yield, sustainability
• Improved animal production
• Improved human health: biomarker diagnostics, personalized medicines and therapies
• Improved environment: bioremediation, carbon sequestration, energy independence
[Figure: multiple plot panels, each labeled "TIC: 03040204.D"; y-axis "Abundance", x-axis "Time" (ticks from 20.00 to 60.00)]
The need

The legume biologist still must navigate
multiple information resources for many
research questions

“Develop a virtual, easy-to-navigate ‘one-stop’
legume information network. By ‘one-stop’ we
refer by analogy to Google and how it can be seen
as a single, yet non-exclusive, information
resource.”
Gepts et al., Report from the CATG meeting.
Plant Physiology (2005) 137:1228.
Virtual Plant Information Network
• Establish an architecture based on semantic web technologies to support an interoperable (database) network
• Standardize data formats and user interfaces to support machine-readable representation of genomes, genetic maps, polymorphisms, QTL, expression, proteins, metabolites and phenotypes
• Develop breeder’s toolboxes with visual interfaces similar to that depicted in GEYSIR
Goals
• Design a solution for integrating disparate data sources
• Develop a prototype, the Legume Information Network, demonstrating the capabilities of semantic web technologies
• Have the legume community take a leadership role in data and tool integration using semantic-MOBY
The Requirements
Devise a way in which resources can be described,
discovered, and invoked on the web using:
• a common syntax – so machines can parse the data and
services of each other
• a public semantic – so machines can make determinations
on suitability-for-purpose
• a discovery service – so machines can find data and
services across the web based on the semantics of the
resources being offered and the needs of the task at hand
The Approach: Keep it simple
Actors: Client, Discovery Server, Provider
Clients, Providers, and even Discovery Servers all read and contribute to the same set of statements.
All actors understand a single, mutable graph which embeds an explicit logic necessary and sufficient to describe, query, discover, invoke, and satisfy resources and requests.
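The idea of one shared set of statements can be illustrated with a small sketch. This is not the Semantic MOBY protocol itself (which builds on OWL/RDF); it is a toy in-memory graph of subject-predicate-object statements, and every name in it (`Graph`, `discover`, the predicates `accepts`, `produces`, `providedBy`, and the data class names) is a hypothetical choice made only for illustration.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    """One subject-predicate-object statement in the shared graph."""
    subject: str
    predicate: str
    obj: str


class Graph:
    """A toy mutable graph that all actors read from and write to."""

    def __init__(self) -> None:
        self._statements: set[Statement] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._statements.add(Statement(subject, predicate, obj))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> list[Statement]:
        """Return statements matching the pattern; None acts as a wildcard."""
        return [
            st for st in self._statements
            if (subject is None or st.subject == subject)
            and (predicate is None or st.predicate == predicate)
            and (obj is None or st.obj == obj)
        ]


def discover(graph: Graph, wanted_input: str, wanted_output: str) -> list[str]:
    """Discovery step: find services whose description fits the request."""
    accepting = {st.subject for st in graph.match(predicate="accepts", obj=wanted_input)}
    producing = {st.subject for st in graph.match(predicate="produces", obj=wanted_output)}
    return sorted(accepting & producing)


graph = Graph()

# Providers contribute descriptions of their services (hypothetical vocabulary).
graph.add("BlastN_Legume_BACs", "accepts", "NucleotideSequence")
graph.add("BlastN_Legume_BACs", "produces", "BlastReport")
graph.add("BlastN_Legume_BACs", "providedBy", "LIS")
graph.add("ClustalW_MSA", "accepts", "NucleotideSequence")
graph.add("ClustalW_MSA", "produces", "MultipleAlignment")
graph.add("ClustalW_MSA", "providedBy", "CCGB")

# A client states what it has and what it needs; discovery queries the same graph.
matches = discover(graph, "NucleotideSequence", "BlastReport")
providers = [st.obj for name in matches
             for st in graph.match(subject=name, predicate="providedBy")]
print(matches, providers)
```

The point of the sketch is only that description, discovery and the request all use the same statement form, so no actor needs a private format.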
Services

Data Provider Services
Service Description | Provider
GO Annotated Transcript Sequences | LIS
Medicago IMGAG Annotations | CCGB
Precomputed BlastX against NCBI's NR | LIS
Blocks precomputed analysis retrieval | LIS
GenScan precomputed gene predictions | LIS
Sequence Text Retrieval | LIS
GO Annotations Retrieval | LIS
InterPro precomputed analysis retrieval | LIS

Analysis Services
Service Description | Provider
ClustalW Multiple Sequence Alignment | CCGB
BlastN LIS Transcript Contigs | LIS
Blast sequences against KEGG Genes | CCGB
Blast sequences against TIGR TOG Sequence | CCGB
BlastN Legume BACs | LIS
BlastN Lotus finished BACs | LIS

Visualization Services
Service Description | Provider
Comparative Map and Trait Viewer | LIS
ISYS TableViewer | LIS
Alignment visualization using PFAAT | CCGB
LIN partners
Resources
A running Discovery Server: www.semanticMoby.org
The project web site: vpin.ncgr.org
Discussion forum: vpin.ncgr.org/mvnforum/forum
Collection of ontologies: ontologies.ncgr.org
Protocol documentation: ontologies.ncgr.org/OWLDocs/moby
Publications and other docs: vpin.ncgr.org/links.shtml
Developers’ resources: www.semanticmoby.org/developer/index.jsp
Provider Developer Kit: vpin.ncgr.org/provider.shtml
Client Developer Kit: vpin.ncgr.org/client.shtml
Generation of DNA Sequence Data
Cost/1000 bp
1990 ~ $10.00
2000 ~ $3.00
2005 ~ $1.00
2006 ~ $0.10
2007 ~ $0.03
Sequencing Platform Comparison
Feature | 454 FLX (Roche) | Solexa (Illumina) | AB SOLiD
Read length | 250bp (400) | 36bp (50) or 2x36bp | 25bp (30)
Number of reads/run | 420K | ~40M | 40M
Data output | ~100Mb | ~1Gb | ~2Gb
Reagent cost per run | ~$8,000 | ~$3,200 | ~$6,000
Reagent cost per 1Gb | ~$80,000 | ~$3,200 | ~$3,000
Error | 0.5% | 0.2% | 0.1%
Run time to 1Gb | 10 days | 2-3 days | 7-10 days
Ease of use | Most difficult | Least difficult | 454-like difficulty
Base calling | Flow Space | Nucleotide Space | Color Space
Applications:
Whole transcriptome shotgun | Yes | Yes | Yes
Whole genome shotgun | Yes | Yes | Yes
Paired-read | Yes | Yes | No
Small RNA | No | Yes | No
ChIP-Sequencing | No | Yes | No
Expression | Yes | Yes (1st application) | Yes
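The "Reagent cost per 1Gb" row follows from the per-run reagent cost divided by the data output per run. A minimal sketch of that arithmetic, using the approximate figures from the comparison above (the dictionary layout and the function name `cost_per_gb` are illustrative assumptions, not part of the original slides):

```python
# Approximate per-run figures taken from the platform comparison.
# Output is kept in Mb as integers so the division is exact.
platforms = {
    "454 FLX (Roche)":   {"output_mb": 100,  "reagent_cost_per_run": 8000},
    "Solexa (Illumina)": {"output_mb": 1000, "reagent_cost_per_run": 3200},
    "AB SOLiD":          {"output_mb": 2000, "reagent_cost_per_run": 6000},
}


def cost_per_gb(reagent_cost_per_run: float, output_mb: float) -> float:
    """Reagent cost to generate 1 Gb (1000 Mb) of sequence."""
    return reagent_cost_per_run * 1000 / output_mb


costs = {
    name: cost_per_gb(p["reagent_cost_per_run"], p["output_mb"])
    for name, p in platforms.items()
}

for name, cost in costs.items():
    print(f"{name}: ~${cost:,.0f} per Gb")
```

With these inputs the three platforms come out at roughly $80,000, $3,200 and $3,000 per Gb, which agrees with the table row.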
Alpheus: Cyberinfrastructure for medical and agricultural resequencing
• Nucleotide variant and splice isoform detection
  - 100s Gb-scale resequencing projects
  - Short reads (454, Solexa, SOLiD plus Sanger)
  - Paired and unpaired
  - Alignments to genomic and transcriptomic references
  - Greek mythology: cleansed the Augean stables and restored life to the soil
Pileup Visualization
Slidable window
Overview of transcript
Coding domain
| nsSNP
| SNP
| in/del
454 reads
indicated by a horizontal line, oriented from 5’ to 3’, from left to right), along with its associated CDS (in green). 394 454
reads from sample 1437 are displayed as arrows aligned against the transcript; their direction reflects their orientation
with respect to the transcript. Variants found in individual reads are displayed by hash marks at their relative position
on the read. Variants are characterized as substitutions, deletions or insertions in the individual sequence reads aligned. The
left panel displays all putative variants, while the right displays variants meeting the empiric rule-set (Fig. 8). A single,
coding-domain, synonymous SNP (C398T) is indicated that was present in > 4 reads and in more than 30% of reads
aligned at that position. The SNP was present in 7 of 13 reads aligned at that position in sample 1437, 9 of 18 reads in
sample 1438 and 20 of 21 reads in 1439. C398T is a validated SNP (dbSNP number rs7121). Of note, homozygous
398T has shown association with deficit schizophrenia [73].
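The two filter criteria quoted in the caption (a variant must appear in more than 4 reads and in more than 30% of the reads aligned at that position) can be sketched as a small predicate. The full empiric rule-set is not reproduced in this text, so the function below covers only those two stated criteria; the function and parameter names are hypothetical.

```python
def passes_read_support_filter(variant_reads: int,
                               aligned_reads: int,
                               min_reads: int = 4,
                               min_fraction: float = 0.30) -> bool:
    """True if the variant is seen in more than `min_reads` reads and in more
    than `min_fraction` of the reads aligned at that position."""
    if aligned_reads <= 0:
        return False
    return (variant_reads > min_reads
            and variant_reads / aligned_reads > min_fraction)


# Read counts for C398T in the three samples, as given in the caption.
c398t_counts = {"1437": (7, 13), "1438": (9, 18), "1439": (20, 21)}

results = {
    sample: passes_read_support_filter(variant, total)
    for sample, (variant, total) in c398t_counts.items()
}
print(results)
```

Under these two criteria the SNP passes in all three samples, while a variant seen in, say, only 3 of 40 reads would be filtered out.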
Dynamic Filtering
National Center for Genome Resources
13th June 2007
CONFIDENTIAL
Summary of Medicago ecotype F83005.5 Solexa resequencing
• With 1x coverage of a 540Mb genome
  - One SNP per ~600bp (no filtering)
  - ~45,000 high-stringency SNPs
Chromosome | Length (bp) | Aligned Reads | Uniquely Aligned Reads
MtChr0 | 16948850 | 737714 | 112883
MtChr1 | 31270187 | 731194 | 201785
MtChr2 | 27869060 | 632480 | 230301
MtChr3 | 38177656 | 971655 | 329274
MtChr4 | 39361788 | 605760 | 244861
MtChr5 | 37289045 | 815069 | 355206
MtChr6 | 19993879 | 289819 | 110912
MtChr7 | 32528341 | 716964 | 286807
MtChr8 | 35250466 | 420086 | 226924
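One way to read the per-chromosome counts is as the fraction of aligned reads that align uniquely. The sketch below computes that fraction from the numbers listed above; the variable names and the choice of summary are illustrative assumptions, not an analysis taken from the slides.

```python
# (length in bp, aligned reads, uniquely aligned reads) per chromosome,
# copied from the resequencing summary.
chromosomes = {
    "MtChr0": (16948850, 737714, 112883),
    "MtChr1": (31270187, 731194, 201785),
    "MtChr2": (27869060, 632480, 230301),
    "MtChr3": (38177656, 971655, 329274),
    "MtChr4": (39361788, 605760, 244861),
    "MtChr5": (37289045, 815069, 355206),
    "MtChr6": (19993879, 289819, 110912),
    "MtChr7": (32528341, 716964, 286807),
    "MtChr8": (35250466, 420086, 226924),
}

# Fraction of aligned reads that are uniquely aligned, per chromosome.
unique_fraction = {
    name: unique / aligned
    for name, (_length, aligned, unique) in chromosomes.items()
}

total_aligned = sum(aligned for _l, aligned, _u in chromosomes.values())
total_unique = sum(unique for _l, _a, unique in chromosomes.values())

for name, frac in unique_fraction.items():
    print(f"{name}: {frac:.1%} of aligned reads are unique")
print(f"Overall: {total_unique / total_aligned:.1%}")
```

MtChr0 stands out with the lowest unique fraction and MtChr8 with the highest.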
Application of Next-Generation Sequencing Technologies for Variant Detection in Crop Plants and Pathogens
• Whole transcriptome shotgun re-sequencing
  - Expressed portions (or gene space) of the genome across populations in the absence of a reference genome
• Whole genome shotgun re-sequencing
  - Sequence across populations with available reference genomes
  - WGS skimming of transformation events
• Target genome re-sequencing across populations
  - Area under the QTL
  - Pooled long-PCR products to walk between markers
  - Restriction enzyme-anchored
GEYSIR
geysir.ncgr.org
(Genomic Explorer y Survey of Immune Response)
Screenshot callouts:
• Clickable LOD scores move selection windows
• Gene & Nucleotide View
• Gene Neighborhood View
• Chromosome Map
• CTRL-left mouse click takes you to the Gene detail page
• SNP markers
• Nucleotide slider window
• Map region selection windows (grab & slide)
• Zoom & pan buttons
View Selected Studies (across all chromosomes)
• Sample study 1
• Sample study 2
• Marker on linkage map (cM)
• Marker on physical map (Mb)
• Candidate genes in blue
• Exons in green
• Nucleotide slider window
• Marker titles visible in this 1.5 Mb region
• Slide-able feature neighborhood
• Click window on chromosome 22
• Clickable SNP bubbles take you to dbSNP
Acknowledgements

Funding
• LIS/LIN: USDA-ARS SCA 3625-21000-038-01
• GEYSIR: NIH-NIAID HHSN266200400064C
• VPIN: NSF-BDI 0516487

LIS Steering Committee
• Mark Burow
• Doug Cook
• Perry Cregan
• Rebecca Dickstein
• David Grant
• Randy Shoemaker
• Michael Udvardi
• Nevin Young

NCGR LIS
• Greg May
• Kamal Gajendran
• Andrew Farmer
• Michael Gonzales
• Selene Virk
• Bill Beavis

USDA-ARS LIS
• Randy Shoemaker
• David Grant
• Rich Wilson

NCGR GEYSIR
• Susan Baxter
• Faye Schilkey
• Neil Miller
• Dan Weems
• Lar Mader

USDA-ARS LIN
• Randy Shoemaker
• Michelle Graham

CCGB/U. Minn LIN
• Ernest Retzel
• Jim Johnson
• Michael Heuer
• John Crow

NCGR VPIN/LIN
• Damian Gessler
• Gary Schiltz
• Bill Beavis
• Andrew Farmer
S. Knapp
N. Young