this powerpoint TUTORIAL - Processing of Prokaryotic Genome and

GSEA-Pro Tutorial
Anne de Jong
University of Groningen
Introduction

The main principle of a Gene Set Enrichment Analysis (GSEA) is to discover which
biological functions are overrepresented in a set of genes or proteins.

For such an analysis GSEA-Pro uses the Genome2D database, which describes the
relation between genes/proteins and functions (functional classification). For
example, all genes encoding enzymes for a specific metabolic pathway belong to the
same class.

GSEA-Pro uses multiple classifications: GO, InterPro, KEGG, COG, PFAM, SMART and
Superfamily.

GSEA-Pro uses locus-tags as IDs for genes as well as for proteins.
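
To make the idea of overrepresentation concrete, here is a minimal sketch of one common way to test it: a one-sided hypergeometric test. This tutorial does not state which statistical test GSEA-Pro applies, so the test itself, the function name hypergeom_pvalue and the example numbers are all assumptions for illustration.

```python
from math import comb


def hypergeom_pvalue(genome_size, class_size, set_size, hits):
    """Probability of seeing at least `hits` class members in a gene set.

    Assumes the gene set is a random draw of `set_size` genes from a genome
    of `genome_size` genes, of which `class_size` belong to the class.
    """
    max_hits = min(class_size, set_size)
    total = comb(genome_size, set_size)
    # Sum the upper tail: hits, hits + 1, ..., max_hits
    tail = sum(
        comb(class_size, k) * comb(genome_size - class_size, set_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 4000 genes in the genome, 40 in the class,
# a gene set of 100 genes, 8 of which belong to the class.
p = hypergeom_pvalue(4000, 40, 100, 8)
print(p)
```

With only about one class member expected by chance in this example, eight hits yield a small p-value, which is what "overrepresented" means here.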

Overview of Functional Analysis of Gene Sets
-omics data: Transcriptomics, Proteomics, Metagenomics
One or multiple sets of genes
Unravel the biological function of a “Gene Set”
Input

STEP 1: Select Genome

GSEA-Pro is integrated into the Genome2D web server, which contains classifications of all ‘complete’ genomes of the
NCBI.

Be sure to select the correct strain (check your locus-tags).

Preferably use the RefSeq locus-tags, but old locus-tags are also supported if a genome is selected from the
RefSeq database. The ‘old’ non-RefSeq NCBI genome database is also supported; it still contains the gene names and locus-tags that NCBI has discarded from the RefSeq database.

STEP 2: Four types of data tables can be used as input

Single list of locus-tags: This is a bare list of genes (as locus-tags) deduced from transcriptome or proteome analysis
results.

Single list of locus-tags with ratio values: The first column contains the locus-tags, the second contains the ratio
values generated by differential expression (DE) analysis.

Experiments: For time series or perturbation experiments, GSEA-Pro selects the gene set of each experiment on
the basis of the ratio data. Default threshold values can be changed on the web server.

Clustering: Clustering algorithms group genes that show similar behavior over perturbation experiments or time
series. GSEA-Pro handles each cluster as a gene set and shows the biological function of each cluster. The first
column of the input table should contain the locus-tags, and the column with cluster IDs should have the header
“clusterID” (this header name can be changed on the web server).
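
The selection of one gene set per experiment can be sketched as follows. This is only an illustration of the idea, not the web server's code: the function name gene_sets_from_ratios, the threshold of 1.0 and the rule "absolute ratio at or above the threshold" are assumptions, since the default thresholds are not given in this tutorial.

```python
import csv
import io


def gene_sets_from_ratios(table_text, threshold=1.0):
    """Return {experiment: [locus-tags]} from a tab-delimited ratio table.

    The first column holds the locus-tags; every other column is one
    experiment. A gene joins an experiment's set when the absolute value
    of its ratio is at or above `threshold` (an assumed rule).
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    experiments = header[1:]
    gene_sets = {name: [] for name in experiments}
    for row in reader:
        if not row:
            continue  # skip empty lines
        locus, values = row[0], row[1:]
        for name, value in zip(experiments, values):
            if value and abs(float(value)) >= threshold:
                gene_sets[name].append(locus)
    return gene_sets


# Three rows and two experiment columns taken from the example data
example = (
    "locus\tA_F71Y-WT\tNull-WT\n"
    "BSU40420\t-0.18052\t-1.50486\n"
    "BSU40340\t0.846962\t1.139989\n"
    "BSU39900\t1.802093\t5.355291\n"
)
sets = gene_sets_from_ratios(example, threshold=1.0)
print(sets)
```

With this threshold only BSU39900 is selected for A_F71Y-WT, while all three genes are selected for Null-WT.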

STEP 3: Examples of input data tables
Tables can be uploaded to the web server as a tab-delimited file or pasted directly from, e.g., Excel.
Single list
BSU40340
BSU40320
BSU40100
BSU40090
BSU39380
BSU38470
BSU37740
BSU37560
BSU36640
BSU35310
BSU34670
BSU33830
BSU33810
BSU33800
BSU33440
..
..
Single list + ratio data
locus Null-WT
BSU32600 9.823054
BSU32610 9.171172
BSU32590 8.934336
BSU32580 8.7597
BSU09460 7.679297
BSU02390 7.497631
BSU32570 7.258288
BSU03010 6.926733
BSU32090 6.846735
BSU32550 6.438756
BSU31540 6.313128
BSU19350 6.063705
BSU10450 5.88237
BSU10460 5.857612
..
Experiments
locus A_F71Y-WT B_R61K-WT C_R61H-WT Null-WT
BSU40420 -0.18052 -0.54343 -1.15383 -1.50486
BSU40340 0.846962 0.910176 1.078578 1.139989
BSU40320 0.530724 0.939465 1.06793 1.164206
BSU40180 1.193949 1.410571 2.207017 2.447594
BSU40100 0.593872 0.649456 1.197021 1.0736
BSU40090 0.762748 0.587133 1.146103 1.017818
BSU40022 0.873289 1.049594 1.582536 1.928704
BSU40021 1.076014 1.562787 1.779806 2.252712
BSU39930 0.02815 0.289389 1.00797 2.638345
BSU39920 0.193887 0.476066 1.322628 2.777654
BSU39910 0.89087 1.137493 2.184607 4.009186
BSU39900 1.802093 2.304714 3.464422 5.355291
BSU39890 1.150499 2.53483 3.751564 5.580559
BSU39880 0.418429 1.152457 2.205843 3.864053
..
Clustering [ value columns will be ignored ]
locus A_F71Y-WT B_R61K-WT C_R61H-WT Null-WT clusterID
BSU40420 -0.18052 -0.54343 -1.15383 -1.50486 1
BSU38810 -0.26891 -0.54053 -0.98884 -1.25233 1
BSU35920 -0.7741 -0.64672 -1.26974 -1.16388 1
BSU23560 -0.61699 -0.62814 -1.16469 -1.16779 1
BSU18120 -0.44501 -0.53468 -1.21979 -1.5308 2
BSU15560 -0.53203 -0.50688 -0.83918 -1.09821 2
BSU15550 -0.48608 -0.45498 -0.84874 -1.01399 2
BSU15540 -0.73774 -0.49082 -0.89019 -1.06595 2
BSU15530 -0.54945 -0.51365 -0.83166 -1.03166 3
BSU15520 -0.47401 -0.46463 -0.83956 -1.03379 3
BSU15510 -0.33067 -0.52 -0.82483 -1.04785 3
BSU15500 -0.30324 -0.51485 -0.80274 -1.04997 3
BSU15490 -0.34899 -0.46083 -0.83381 -1.053 3
BSU14700 -0.25436 -0.36245 -0.59358 -1.05428 3
..
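
How a clustering table is turned into gene sets can be sketched as below: the locus-tags are grouped by the value in the “clusterID” column and all other columns are ignored. The function name gene_sets_from_clusters is hypothetical, and the sketch assumes a tab-delimited table with a header row whose first column holds the locus-tags.

```python
import csv
import io
from collections import defaultdict


def gene_sets_from_clusters(table_text, cluster_header="clusterID"):
    """Group locus-tags by cluster ID; value columns are not used."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    locus_column = reader.fieldnames[0]  # first column holds the locus-tags
    clusters = defaultdict(list)
    for row in reader:
        clusters[row[cluster_header]].append(row[locus_column])
    return dict(clusters)


# Four rows from the clustering example, with one value column kept
example_clusters = (
    "locus\tNull-WT\tclusterID\n"
    "BSU40420\t-1.50486\t1\n"
    "BSU38810\t-1.25233\t1\n"
    "BSU18120\t-1.5308\t2\n"
    "BSU15530\t-1.03166\t3\n"
)
cluster_sets = gene_sets_from_clusters(example_clusters)
print(cluster_sets)
```

Cluster IDs are kept as text, so labels such as "1" or "up" work equally well. A different header name can be passed through cluster_header.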
Results

The results are normally ready within seconds and consist of 4 main tables:

Table 1: All combinations of class / experiment are represented in one table. Values are only shown if the p-value is lower
than the cutoff value (0.01). Within brackets: the number of genes of the class that are differentially expressed (TopHits). The
light to dark blue coloring represents low to high significance, respectively. The intensity of the color is based on
(TopHits/ClassSize) * -log2(adj-pvalue).
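
The color score described for Table 1 can be computed as in this small sketch. The formula and the 0.01 cutoff come from the text above; the function name color_intensity, returning None for cells that are not shown, and the example numbers are assumptions.

```python
import math


def color_intensity(top_hits, class_size, adj_pvalue, cutoff=0.01):
    """(TopHits / ClassSize) * -log2(adj-pvalue), or None above the cutoff.

    `adj_pvalue` must be greater than 0, because log2(0) is undefined.
    """
    if adj_pvalue >= cutoff:
        return None  # not significant, so the cell stays empty
    return (top_hits / class_size) * -math.log2(adj_pvalue)


# Hypothetical class: 8 of its 32 genes are in the gene set, adj. p = 2**-10
score = color_intensity(8, 32, 2 ** -10)
print(score)  # 2.5
```

A class with a larger fraction of hits or a smaller adjusted p-value gets a higher score and therefore a darker blue.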


Items in the ClassID column link to external databases describing the class IDs.

Items in the Experiment columns link to the genes and gene annotations that are members of that specific class / experiment combination.

The ClassSize column shows the total number of genes that are members of the classID in the selected organism.
Table 2: Heatmap of Class x Experiments; clicking it opens the ‘GSEA-Pro BarGraph’.

The GSEA-Pro BarGraph shows the overrepresented classes and their p-values (as -log).

A detailed table links to online information on the classIDs and to the genes found for each specific class.

Table 3: Heatmap of Class x Experiments; clicking it opens the full class table.

Table 4: Overview of the locus-tags of each experiment or cluster used for the GSEA

TreeMap: Global visualization and quick mining through the GSEA-Pro results