Managing Data Modeling

GO Workshop
3-6 August 2010

Managing Data
- Functional modeling strategy
- Converting between database IDs
  - Ensembl BioMart
  - UniProt
  - DAVID
  - AgBase ArrayIDer
- Arrays
- Examples to work on

Types of data sets and modeling

- Commercial array data – more likely to have ID mapping to support functional modeling.
- Custom/USDA array data – may need to do your own ID mapping: see examples on workshop page.
- Proteomics data
- RNA-Seq data sets – computational pipelines to assign GO (GOanna is limited; contact AgBase).
- Real-time data or quantitative proteomics data – hypothesis testing.

Overview of Functional Modeling Strategy

[Figure: workflow diagram]
- Microarray IDs -> ArrayIDer -> protein/gene identifiers
- Protein/gene identifiers -> GORetriever -> genes/proteins with GO annotations
- Genes/proteins with no GO annotations -> GOanna
- GOSlimViewer summarizes GO function
- GO enrichment analysis: DAVID, EasyGO/AgriGO, Onto-Express, Onto-Express-to-go (OE2GO)
- Pathways and network analysis: Ingenuity Pathways Analysis (IPA), Pathway Studio, Cytoscape, DAVID
- GOModeler: hypothesis testing

Yellow boxes represent AgBase tools.
Green/purple boxes are non-AgBase resources.

Functional Modeling Considerations

Should I add my own GO?
- use GOSlimViewer to see how much GO is available for your species
- use GORetriever to see how much GO is available for your dataset

Should I do GO analysis and pathway analysis and network analysis?
- different functional modeling methods show different (complementary) aspects of your data
- is this type of data available for your species (or a close ortholog)?

What tools should I use?
- which tools have data for your species of interest?
- what type of accessions are accepted?
- availability (commercial and freely available)

- structurally and functionally re-annotated a microarray
- quantified the impact of this re-annotation based on GO annotations & pathways represented on the array
- tested using a previously published experiment that used this microarray
- re-annotation allows more comprehensive GO based modeling and improves pathway coverage
- re-annotation resulted in a different model from previously published research findings

Converting accessions

Depending on your data set & the tools you use, you are likely to need to convert between database accessions to do your functional modeling.
- UniProt database – ID mapping tab
- Ensembl BioMart
- Online analysis tools:
  - DAVID
  - g:profiler
  - GORetriever
- ArrayIDer – converts EST accessions for some species (by request)

ID Mapping

Data types:
- Commercial arrays
- EST arrays
- Custom arrays
- Proteomics
- RNA-Seq data

ID mapping resources:
- Commercial ID mapping, e.g. NetAffx
- Ensembl BioMart
- Online tools (g:convert, DAVID)
- ArrayIDer
- UniProt ID conversion

Working on your own data:

New to GO
- GO browser tutorials to familiarize yourself with the GO
- learn what GO is available for your species

Your own data set
- functional grouping to get overview (e.g. GOSlimViewer)
- GO enrichment analysis (tools available for your species)
- Pathway analysis

Example data sets available – use as worked examples

Most of these tools (including GO analysis and pathway tools) accept only certain database accessions, so you will need to convert accessions between databases.

Example: ID conversion

- Ensembl Plant BioMart tool
  - currently limited species, but Ensembl is adding more plants
  - BioMart allows sophisticated querying of genomic data
- DAVID ID conversion tool
  - allows users to convert IDs and do GO enrichment analysis
- UniProt ID conversion
  - highly annotated data
- ArrayIDer
  - links ESTs to public database IDs

http://plants.ensembl.org/index.html
NOTE: Ensembl is adding new plant species…
1. Ensembl BioMart

Clicking on the headings (FILTERS, ATTRIBUTES, RESULTS) allows you to set up searches.
- Selecting FILTERS gives you different filtering options: expand GENE and check “ID list limit” to select a defined list of accessions, then enter your list of accessions.
- Selecting ATTRIBUTES allows you to choose what information is reported: check accessions from external databases (UniProt & RefSeq).
- Clicking on RESULTS will show you the output information.
  - Output can be displayed online and/or downloaded (text, Excel).
  - Selecting FILTERS or ATTRIBUTES will allow you to go back and make changes.
  - Limited to species represented in Ensembl.
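
The FILTERS and ATTRIBUTES choices above can also be written down as a BioMart XML query, which is handy for keeping a record of exactly what was asked. The sketch below only builds the XML text and does not contact any server. The dataset, filter and attribute names in the usage lines are placeholders chosen for illustration (assumptions, not values taken from these slides); check the names your BioMart instance actually offers before using them.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, id_filter, accessions, attributes):
    """Return BioMart XML query text for a list of accessions."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "1",        # include column names
        "uniqueRows": "1",    # drop duplicate rows
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # FILTERS: limit the results to the supplied ID list (comma-separated).
    ET.SubElement(ds, "Filter",
                  {"name": id_filter, "value": ",".join(accessions)})
    # ATTRIBUTES: one element per column to be reported.
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


# Placeholder names only (assumptions): replace with real ones from BioMart.
xml_text = build_biomart_query(
    dataset="example_gene_dataset",
    id_filter="example_array_id_filter",
    accessions=["PROBE_ID_1", "PROBE_ID_2"],
    attributes=["example_gene_id", "example_uniprot_accession"],
)
print(xml_text)
```

Saving this text alongside your results gives you a repeatable description of the search, including the filter list and the reported columns.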

2. Online analysis tools

Database for Annotation, Visualization and Integrated Discovery (DAVID)
http://david.abcc.ncifcrf.gov/conversion.jsp
This tool works for a wide range of species.
- Paste in your accession list. (You can also upload a file of accessions.)
- Select accession type. NOTE: If you choose “Not Sure” the tool will try to decide what type of accession you have.
- Select gene list.
- Submit list.
- Select the type of accession you want to convert TO.
- Any ambiguous IDs are listed for you to decide.

3. UniProt ID Mapping
Paste accession list (>1000 may cause errors).
COMMENT: Note the difference between UniProt accessions and UniProt IDs.
UniProt accessions are short strings of letters and numerals, 6 or 10 characters long. UniProt IDs have a suffix related to the species name.
E.g. cassava hydroxynitrilase:
Accession: P52705
ID: HNL_MANES
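
Before pasting a list, it can help to check which identifiers are accessions and which are IDs, and to split long lists so that each submission stays under the size that may cause errors. The sketch below is one way to do both. The accession pattern follows the format described in the UniProt help pages; the ID pattern is a looser approximation (an assumption), and the function names are made up for this example.

```python
import re

# Accession format as described in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# UniProt ID (entry name): a mnemonic, an underscore, then a species code.
# This pattern is an approximation, not an official definition.
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def classify_uniprot(identifier):
    """Return 'accession', 'id' or 'unknown' for one identifier."""
    token = identifier.strip().upper()
    if ACCESSION_RE.match(token):
        return "accession"
    if ENTRY_NAME_RE.match(token):
        return "id"
    return "unknown"


def batches(items, size=1000):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


for name in ["P52705", "HNL_MANES"]:
    print(name, classify_uniprot(name))
```

Identifiers reported as 'unknown' are probably from another database (GenBank, RefSeq and so on) and need a different "from" type in the mapping form.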
Select the accession type you have and the accession type you want to convert to, then click on MAP.
The mapping link will display a tab-separated file that can be displayed in Excel:
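
The tab-separated mapping file can also be read with a short script, which makes it easier to spot accessions that mapped to several targets or to none. This is a minimal sketch that assumes a two-column layout with a header row (source ID first, mapped ID second); the real file may contain extra columns, and the identifiers in the usage lines are invented placeholders.

```python
import csv
import io
from collections import defaultdict


def read_mapping(tsv_text):
    """Group a two-column, tab-separated mapping by source ID."""
    mapping = defaultdict(list)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2 or not row[1].strip():
            continue  # ignore blank or incomplete lines
        source, target = row[0].strip(), row[1].strip()
        if target not in mapping[source]:
            mapping[source].append(target)  # keep one-to-many mappings
    return dict(mapping)


def unmapped(submitted, mapping):
    """Return the submitted IDs that have no mapped target."""
    return [acc for acc in submitted if acc not in mapping]


# Placeholder identifiers, for illustration only.
text = "From\tTo\nSRC_1\tTGT_A\nSRC_1\tTGT_B\nSRC_2\tTGT_C\n"
result = read_mapping(text)
print(result)
print(unmapped(["SRC_1", "SRC_2", "SRC_3"], result))
```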
4. AgBase: ArrayIDer

Maps ESTs to gene/protein accessions. Contact AgBase to request additional species.
- Upload a list of dbEST accessions or EST names.
- An email will be sent with a link to the results. Results are formatted as an Excel file.

For additional help with database accessions please contact AgBase.

Working on your own data:

NOTE:
- Always keep note of what tool you used to do the accession ID mapping/conversion and its version/update/date.
- Keep a copy of your original IDs and what they mapped to so that you can refer back to this during your modeling.
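
One way to follow both notes at once is to save the mapping in a plain text file that carries its own provenance. The sketch below writes the tool name, version or release, and date as comment lines above the original-to-mapped ID pairs. The file layout and the function name are illustrative choices (assumptions), not a format required by any of the tools.

```python
import csv
import io
from datetime import date


def write_mapping_with_provenance(handle, mapping, tool, version, run_date):
    """Write original-to-mapped ID pairs under '#' provenance lines."""
    handle.write(f"# tool: {tool}\n")
    handle.write(f"# version_or_release: {version}\n")
    handle.write(f"# date: {run_date.isoformat()}\n")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["original_id", "mapped_id"])
    for original, targets in mapping.items():
        # Keep unmapped IDs too, with an empty second column.
        for target in targets or [""]:
            writer.writerow([original, target])


# Placeholder values, for illustration only.
buffer = io.StringIO()
write_mapping_with_provenance(
    buffer,
    {"SRC_1": ["TGT_A", "TGT_B"], "SRC_2": []},
    tool="name of the conversion tool",
    version="release or update shown by the tool",
    run_date=date.today(),
)
print(buffer.getvalue())
```

Passing an open file instead of the in-memory buffer writes the same record to disk.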
Tutorial 1: ID conversion

The AgriGO GO enrichment analysis tool accepts the following inputs for rice:
- GenBank ID: AAP50233.1
- DDBJ ID: BAB11514.1
- EMBL ID: CAA18188.1
- UniProt ID: Q9LYA9
- RefSeq Peptide ID: NP_564434

We will convert a list of rice Affy IDs to these IDs for use in the AgriGO tool.

Arrays: ID Mapping

Arrays have an “annotation” file that shows which database accessions the probes were based on.
- Array annotation files may include multiple database IDs.
- Commercial arrays – may be updated regularly.
- Custom/research arrays – not updated as often.
- Always check when the last ID mapping was updated, as this data changes continually.
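
Because one probe may list several database IDs, it helps to split those fields when reading an annotation file. The sketch below assumes a comma-separated file with a header row, with multiple IDs in one field separated by "///" and "---" marking a missing value, as in some commercial annotation files. The column names and identifiers are hypothetical; check your own annotation file, and remove any leading comment lines before parsing.

```python
import csv
import io


def probe_to_accessions(annotation_text, probe_col, accession_col, sep="///"):
    """Map each probe ID to the list of accessions in one column."""
    table = {}
    reader = csv.DictReader(io.StringIO(annotation_text))
    for row in reader:
        raw = row.get(accession_col) or ""
        ids = [part.strip() for part in raw.split(sep)]
        # Drop empty entries and the '---' missing-value marker.
        table[row[probe_col]] = [i for i in ids if i and i != "---"]
    return table


# Hypothetical column names and identifiers, for illustration only.
text = (
    "Probe ID,Protein ID\n"
    "probe_1,ACC_1 /// ACC_2\n"
    "probe_2,---\n"
)
print(probe_to_accessions(text, "Probe ID", "Protein ID"))
```

Probes that come back with an empty list are the ones that will need mapping through another route.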

Array annotation available:
Array                                                          GEO platform
FHCRC chicken 13K                                              GPL2863
Agilent-015068 Chicken Gene Expression Microarray 4x44k        GPL8764
Avian Innate Immunity Microarray (AIIM)                        GPL1461
Affymetrix Chicken Genome Array                                GPL3213*
UIUC Bos taurus 13.2K 70-mer oligoarray                        GPL2853
Affymetrix Bovine Genome Array                                 GPL2112
Agilent-015354 Bovine Oligo Microarray (4x44K)
Equine Whole Genome Oligonucleotide (EWGO) array

Array annotation in progress:
Array                                                          GEO platform
ARK-Genomics G. gallus 20K v1.0                                GPL5480
FHCRC Chicken 13K v2.0                                         GPL1836
Chicken cDNA DDMET 1700 array version 1.0                      GPL3265

Tutorial 1: ID conversion

Work through tutorial 1 on the workshop website.
Alternatively, work on your own data set during this time, using the tutorial as a guide.