PaLS: Pathways and Literature Strainer
Filtering common literature, ontology terms and pathway information.
Andrés Cañada Pallarés
Instituto Nacional de Bioinformática
-Studies of differential expression and, especially, gene selection in the
context of classification and prediction with microarray data usually
output lists of “interesting genes”.
-Do some of the members of those lists have a function in common, or
do they belong to the same metabolic pathway?
-PaLS takes a list or a set of lists of gene or protein identifiers and
shows which ones share certain descriptors.
-Variable selection with microarray data (where number of
variables >> number of samples) can lead to many solutions.
Different runs of the same algorithm often return different
lists of “interesting genes”. This is a problem for the interpretability
of the results.
-PaLS allows us to try to discover the major biological themes that
are shared among different solutions, even if the identity of the genes
in each solution is different.
#Run.1.component.1
NM_002358
NM_001786
NM_003258
NM_001809
NM_003318
NM_020188
NM_004203
NM_004217
#Run.2.component.1
NM_001809
NM_001826
NM_001827
NM_003318
NM_020242
NM_003600
.
.
.
-Main input file: plain text
-One list or several lists of genes/proteins
-Each list can have its own name
-Type of identifiers accepted:
-Ensembl Gene IDs
-UniGene Cluster IDs
-Gene names (HUGO)
-GenBank accessions
-Clone IDs
-Affymetrix IDs
-EntrezGene IDs
-RefSeq_RNAs
-RefSeq_peptides
-SwissProt Names
-Organisms accepted:
-Human
-Mouse
-Rat
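The input layout shown earlier (a `#` line naming each list, followed by one identifier per line) could be read with a short parser. This is only a minimal sketch inferred from the example file, not the actual PaLS code: the function name `parse_gene_lists` and the `default_name` used for a list without a header are assumptions made for illustration.

```python
def parse_gene_lists(text, default_name="unnamed"):
    """Split a PaLS-style plain-text input into named identifier lists.

    Assumption: a line starting with '#' opens a new list and gives its
    name; every other non-empty line is one identifier of the current list.
    """
    lists = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("#"):
            current = line[1:].strip()
            lists.setdefault(current, [])
        else:
            if current is None:
                # Identifiers before any header go into a default list
                # (hypothetical behaviour, chosen for this sketch).
                current = default_name
                lists.setdefault(current, [])
            lists[current].append(line)
    return lists


example = """#Run.1.component.1
NM_002358
NM_001786
#Run.2.component.1
NM_001809
NM_001826
"""
gene_lists = parse_gene_lists(example)
```

Here `gene_lists` maps each list name to its identifiers in file order, which is the shape the filtering steps would need.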
-PaLS has three different methods of filtering annotations:
1.- Keep the descriptors referenced by more than a given percentage of the
identifiers in a list, giving results for each list separately. Intended
to discern which lists have published information in common showing that
their genes/proteins share a similar function.
2.- Merge all lists into one (removing duplicates) and display the
descriptors most referenced in the global list, to see commonalities
even if they are not seen within each list.
3.- Look for descriptors referenced by more than a given threshold of
identifiers in more than a given percentage of lists, to find
commonalities present within and among sets of lists.
-Threshold values are part of the required input. The default is 50%.
-Lower values are suggested.
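The three filtering methods can be illustrated with a small sketch. It is not the PaLS implementation: the function names, the `annotations` mapping (identifier to set of descriptors) and the toy data are all hypothetical, and it assumes that "more than" means a strict comparison and that percentages are taken over all identifiers in a list, annotated or not.

```python
def descriptor_fraction(ids, annotations):
    """Fraction of the (deduplicated) identifiers that carry each descriptor."""
    ids = list(dict.fromkeys(ids))  # drop duplicates, keep order
    counts = {}
    for gene in ids:
        for descriptor in annotations.get(gene, ()):
            counts[descriptor] = counts.get(descriptor, 0) + 1
    return {d: c / len(ids) for d, c in counts.items()}


def filter_per_list(gene_lists, annotations, threshold=0.5):
    """Method 1: apply the threshold within each list separately."""
    return {
        name: {d for d, f in descriptor_fraction(ids, annotations).items()
               if f > threshold}
        for name, ids in gene_lists.items()
    }


def filter_global(gene_lists, annotations, threshold=0.5):
    """Method 2: merge all lists (duplicates removed), then apply the threshold."""
    merged = [gene for ids in gene_lists.values() for gene in ids]
    return {d for d, f in descriptor_fraction(merged, annotations).items()
            if f > threshold}


def filter_across_lists(gene_lists, annotations,
                        id_threshold=0.5, list_threshold=0.5):
    """Method 3: descriptors passing id_threshold in more than
    list_threshold of the lists."""
    per_list = filter_per_list(gene_lists, annotations, id_threshold)
    counts = {}
    for descriptors in per_list.values():
        for d in descriptors:
            counts[d] = counts.get(d, 0) + 1
    return {d for d, c in counts.items() if c / len(per_list) > list_threshold}


# Toy data, invented for illustration only.
toy_lists = {"run1": ["g1", "g2", "g3"], "run2": ["g3", "g4"]}
toy_annotations = {
    "g1": {"cell cycle", "nucleus"},
    "g2": {"cell cycle"},
    "g3": {"nucleus"},
    "g4": {"nucleus", "mitosis"},
}
per_list = filter_per_list(toy_lists, toy_annotations)
merged_50 = filter_global(toy_lists, toy_annotations, threshold=0.5)
merged_40 = filter_global(toy_lists, toy_annotations, threshold=0.4)
across = filter_across_lists(toy_lists, toy_annotations)
```

With this toy data, "cell cycle" is missed by the merged-list filter at 50% but appears at 40%, which is the kind of effect behind the suggestion to try lower thresholds.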
-The most time-consuming step is the first search. After that, the user
can change the thresholds for each type of descriptor and filtering
method and obtain an answer quickly (Redo Analysis button, see
figure later).
-The output consists of lists of the descriptors that meet the threshold
criteria selected by the user. Every input identifier related to each
descriptor is linked to IDClight to give the user as much information as
possible.
-For lists of fewer than 100 nodes, graph plots that describe the data
structure of the lists are created. These plots show the genes/proteins
that share at least one descriptor. The more descriptors they share, the
closer they appear.
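The data behind such a plot can be sketched as a weighted edge list: one edge per pair of genes/proteins sharing at least one descriptor, weighted by how many they share. This is an illustrative assumption about the underlying structure, not the plotting code PaLS uses. The function name, the `annotations` mapping and the toy data are hypothetical, and "nodes" is taken here to mean identifiers.

```python
from itertools import combinations


def shared_descriptor_edges(ids, annotations, max_nodes=100):
    """Return {(a, b): number of shared descriptors} for pairs sharing
    at least one descriptor, or None when the list is too large to plot."""
    ids = list(dict.fromkeys(ids))  # drop duplicates, keep order
    if len(ids) >= max_nodes:
        return None  # plots are only described for lists under 100 nodes
    edges = {}
    for a, b in combinations(ids, 2):
        shared = annotations.get(a, set()) & annotations.get(b, set())
        if shared:
            edges[(a, b)] = len(shared)
    return edges


# Toy data, invented for illustration only.
toy_annotations = {
    "g1": {"cell cycle", "nucleus"},
    "g2": {"cell cycle", "nucleus"},
    "g3": {"nucleus"},
    "g4": {"apoptosis"},
}
edges = shared_descriptor_edges(["g1", "g2", "g3", "g4"], toy_annotations)
```

A layout algorithm could then treat larger weights as shorter target distances, so that g1 and g2 (two shared descriptors) end up closer together than either is to g3, while g4 stays unconnected.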
-Data set from van’t Veer et al. (Gene expression profiling predicts
clinical outcome of breast cancer. Nature, 415(6871), 530-536).
-Lists of genes obtained using our CNIO application SignS (Díaz-Uriarte,
R.).
-At the 50% threshold, GO terms in most lists refer to “nucleus”.
-At the 40% threshold, the term “cell cycle” appears in several of the
lists. As reported in the original van’t Veer et al. paper, genes involved
in the cell cycle are upregulated in the poor prognosis signature.
-At the 20% threshold, the term “mitosis” appears in most of the lists.
-If we examine the PaLS results from Reactome at the 20% threshold, we
see “Cell Cycle, Mitotic” in most of the lists.
-The list “6th. Cross-validation run” shows “E2F mediated regulation of
DNA replication”.
-Ramón Díaz-Uriarte. Structural Biology and Biocomputing. CNIO
-Andreu Alibés. EMBL-CRG Systems Biology Unit.
-Edward R. Morrissey. Systems Biology DTC. University of Warwick