Slide 1 - Rosario Michael Piro


Functional annotation and identification of candidate disease genes by
computational analysis of normal tissue gene expression data
L. Miozzi (1), U. Ala (1), R. Piro (2), F. Rosa (3), F. Di Cunto (1) and P. Provero (1)
(1) Dipartimento di Genetica, Biologia e Biochimica, Università di Torino, Torino, Italy; (2) INFN, Sezione di Torino, Torino, Italy; (3) ISI Foundation, Torino, Italy
Introduction
Among the open problems of molecular biology in the post-genomic era, the functional annotation of the human genome and the identification of genes involved in genetic
diseases are especially important. Expression data on a genomic scale have been available for several years thanks to a set of new experimental techniques, and are widely
believed to contain much information relevant to solving these problems.
Here we present the results of a computational analysis of publicly available expression data on human normal tissues, based on the integration of data obtained with the two
most important experimental platforms (microarrays and SAGE) and different measures of dissimilarity between expression profiles. The building blocks of the procedure are
the Gene Expression Neighborhoods (GEN), small sets of tightly coexpressed genes which are analyzed in terms of functional annotation and relevance to human diseases.
This analysis provides putative functional annotations for many genes, and identifies promising candidate disease genes for experimental verification.
The “guilt by association” principle:
This work is based on the following principle: “since there is a strong correlation between coexpression and functional relatedness, a gene found to be coexpressed
with several others involved in the same biological process can be putatively given the same functional annotation” (Brazma A. and Vilo J., 2000, FEBS Lett. 480:17-24).
Method
Publicly available expression data
In this work we analyze publicly available expression data on human normal tissues obtained with Affymetrix
microarrays (http://symatlas.gnf.org/SymAtlas/) and with SAGE (Serial Analysis of Gene Expression;
http://cgap.nci.nih.gov/).
We considered 158 experiments concerning 12109 genes for Affymetrix and 62 experiments concerning 11741
genes for SAGE.
Integration of different quantitative measures of dissimilarity between expression profiles
Different measures of dissimilarity between expression profiles have been defined and integrated: Euclidean
distance and Pearson linear dissimilarity for the microarray data, Euclidean distance and a dissimilarity measure
based on the Poisson distribution (developed in Van Helden J., 2004, Bioinformatics 20(3):399-406 in a different
context) for SAGE data.
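As a minimal sketch of two of the measures named above, the following computes the Euclidean distance and a Pearson-based dissimilarity between two expression profiles. The definition "1 minus the Pearson correlation coefficient" is an assumption, since the poster does not give the exact formula; the Poisson-based measure is not reproduced here because its definition is not stated in this text. The function names and toy profiles are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_dissimilarity(x, y):
    """1 - Pearson correlation (assumed definition, not taken from the poster)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("constant profile: correlation is undefined")
    return 1.0 - cov / math.sqrt(var_x * var_y)


# Toy profiles (hypothetical values): profile_b is profile_a scaled by 2.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean_distance(profile_a, profile_b))
print(pearson_dissimilarity(profile_a, profile_b))
```

The two toy profiles have the same shape but different magnitude, so they are far apart in Euclidean terms and essentially identical in Pearson terms, which illustrates why the two measures can pick out different neighbors.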
Identification of Gene Expression Neighborhoods (GEN)
The unit of functional analysis, named Gene Expression Neighborhood (GEN), has been defined as a gene plus its k
nearest expression neighbors, with k typically a rather small number (the results we report were obtained with k=6).
For each dataset and each choice of dissimilarity measure we identified a number of GENs equal to the number of
genes represented in the dataset.
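The GEN definition above (a gene plus its k nearest expression neighbors, one GEN per gene) can be sketched as follows. This is an illustrative brute-force version, not the authors' implementation; the function names, the toy profiles and the use of Euclidean distance as the default are assumptions.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_gens(profiles, dissimilarity=euclidean, k=6):
    """Return one GEN per gene: the gene itself followed by its k nearest neighbors.

    profiles: dict mapping gene name -> list of expression values.
    Ties are broken by gene name so the result is deterministic.
    """
    gens = {}
    for gene, profile in profiles.items():
        ranked = sorted(
            (dissimilarity(profile, other), name)
            for name, other in profiles.items()
            if name != gene
        )
        gens[gene] = [gene] + [name for _, name in ranked[:k]]
    return gens


# Toy data with hypothetical gene names; k=2 keeps the example small.
toy_profiles = {
    "g1": [0.0, 0.0],
    "g2": [1.0, 0.0],
    "g3": [0.0, 2.0],
    "g4": [10.0, 10.0],
    "g5": [11.0, 10.0],
}
toy_gens = build_gens(toy_profiles, k=2)
print(toy_gens["g1"])
```

This sketch compares every pair of genes, so its cost grows quadratically with the number of genes; for datasets of roughly 12000 genes a vectorized or indexed neighbor search would likely be preferable.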
GEN functional analysis using the controlled annotation vocabulary Gene Ontology
A GEN was considered functionally characterized if at least one Gene Ontology term
(http://www.geneontology.org/) was shared by the majority (K) of its genes (K=4 in the results presented; with k=6 a GEN
contains 7 genes). To avoid overly generic GO terms, the analysis was limited to terms associated with no more than a given
maximum number M of genes in the whole experimental dataset under investigation (M=300 in the results presented). This
limit ensures that the majority rule used to define functionally characterized GENs automatically implies statistically
significant overrepresentation of the GO term involved.
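The majority rule and the term-size limit can be sketched as below, together with a hypergeometric tail probability that illustrates why a cap on M makes a majority unlikely by chance. The function names, the toy annotations and the choice of the hypergeometric model are assumptions made for illustration; the poster does not state which test the authors used.

```python
from collections import Counter
from math import comb


def term_sizes(annotations):
    """Number of genes in the dataset associated with each GO term."""
    return Counter(term for terms in annotations.values() for term in terms)


def characterizing_terms(gen, annotations, sizes, min_shared=4, max_term_size=300):
    """GO terms shared by at least min_shared genes of the GEN, skipping generic terms."""
    shared = Counter(term for gene in gen for term in annotations.get(gene, ()))
    return {
        term
        for term, count in shared.items()
        if count >= min_shared and sizes[term] <= max_term_size
    }


def hypergeom_tail(n_genes, term_size, gen_size, min_shared):
    """P(at least min_shared of gen_size randomly drawn genes carry the term)."""
    upper = min(term_size, gen_size)
    favorable = sum(
        comb(term_size, i) * comb(n_genes - term_size, gen_size - i)
        for i in range(min_shared, upper + 1)
    )
    return favorable / comb(n_genes, gen_size)


# Toy GEN of 7 hypothetical genes; "GO:A" and "GO:B" are placeholder term names.
toy_gen = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
toy_annotations = {
    "g1": {"GO:A"}, "g2": {"GO:A"}, "g3": {"GO:A"}, "g4": {"GO:A"},
    "g5": {"GO:B"}, "g6": set(), "g7": set(),
}
toy_sizes = term_sizes(toy_annotations)
print(characterizing_terms(toy_gen, toy_annotations, toy_sizes))

# Worst case allowed by the limit: a term of size M=300 among 12109 genes.
p_value = hypergeom_tail(n_genes=12109, term_size=300, gen_size=7, min_shared=4)
print(p_value)
```

Under these assumptions the tail probability for the largest allowed term is on the order of 1e-5 for a single GEN and a single term, before any correction for multiple testing; smaller terms give smaller probabilities.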
Estimation of false discovery rate
The false discovery rate for the functionally characterized GENs has been estimated: random GENs have been
generated by reshuffling the gene names in the whole dataset (thus preserving the characteristics of the actual
GENs, such as their degree of self-overlapping) and subjected to the same functional analysis.
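The reshuffling procedure can be sketched as follows: gene names are permuted across the whole dataset, which keeps the GEN structure (sizes and overlaps) intact while breaking the link to the annotations. The estimator used here, the average number of characterized random GENs divided by the number of characterized real GENs, is an assumption, as are all names and the toy data; the term-size limit M is omitted for brevity.

```python
import random
from collections import Counter


def is_characterized(gen, annotations, min_shared=4):
    """True if some GO term is shared by at least min_shared genes of the GEN."""
    shared = Counter(term for gene in gen for term in annotations.get(gene, ()))
    return any(count >= min_shared for count in shared.values())


def shuffled_gens(gens, rng):
    """Relabel all genes with a random permutation of the gene names.

    Assumes every GEN member is also a key of gens (one GEN per gene).
    """
    genes = sorted(gens)
    permuted = genes[:]
    rng.shuffle(permuted)
    relabel = dict(zip(genes, permuted))
    return {
        relabel[center]: [relabel[gene] for gene in members]
        for center, members in gens.items()
    }


def estimate_fdr(gens, annotations, n_rounds=100, seed=0):
    """Mean count of characterized random GENs over the count of real ones."""
    rng = random.Random(seed)
    real = sum(is_characterized(m, annotations) for m in gens.values())
    if real == 0:
        return float("nan")
    random_total = 0
    for _ in range(n_rounds):
        randomized = shuffled_gens(gens, rng)
        random_total += sum(is_characterized(m, annotations) for m in randomized.values())
    return (random_total / n_rounds) / real


# Toy data: 10 hypothetical genes on a ring, each GEN = gene + next 4 genes.
names = [f"g{i}" for i in range(10)]
toy_gens = {names[i]: [names[(i + j) % 10] for j in range(5)] for i in range(10)}
toy_annotations = {name: {"GO:A"} for name in names[:5]}
fdr = estimate_fdr(toy_gens, toy_annotations)
print(fdr)
```

On such a tiny toy dataset the estimate is not meaningful in itself; the point is only that the random GENs go through exactly the same functional analysis as the real ones.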
Leave-one-out
A leave-one-out analysis has been performed to estimate how many known annotations the method can correctly
recover.
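One possible reading of the leave-one-out analysis is sketched below: each known gene-term annotation is hidden in turn, and it counts as recovered if the remaining genes of some GEN containing that gene still satisfy the majority rule for the term. The poster does not spell out the procedure, so this protocol, the function name and the toy data are all assumptions.

```python
def leave_one_out_recovered(gens, annotations, min_shared=4):
    """Return the set of (gene, term) pairs recovered when each is hidden in turn.

    A hidden annotation is recovered if, in at least one GEN containing the gene,
    min_shared or more of the *other* members carry the same term.
    """
    recovered = set()
    for members in gens.values():
        for gene in members:
            for term in annotations.get(gene, ()):
                support = sum(
                    1
                    for other in members
                    if other != gene and term in annotations.get(other, ())
                )
                if support >= min_shared:
                    recovered.add((gene, term))
    return recovered


# Toy data: 10 hypothetical genes on a ring, each GEN = gene + next 4 genes.
names = [f"g{i}" for i in range(10)]
toy_gens = {names[i]: [names[(i + j) % 10] for j in range(5)] for i in range(10)}
toy_annotations = {name: {"GO:A"} for name in names[:5]}
recovered = leave_one_out_recovered(toy_gens, toy_annotations)
print(sorted(recovered))
```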
Putative new GO functional annotations
Characterized GENs have been used to determine putative new functional annotations: for each functionally
characterized GEN and for each GO term associated with it (shared by the majority of its genes), the same GO term
has been putatively attributed to the genes in the GEN that were not already associated with it.
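The annotation transfer step can be sketched as follows: for every term that characterizes a GEN, the term is proposed for the members that do not yet carry it. The function name and the toy data are hypothetical, and the term-size limit M is left out to keep the sketch short.

```python
from collections import Counter


def putative_annotations(gens, annotations, min_shared=4):
    """Return the set of (gene, term) pairs proposed as new annotations."""
    proposed = set()
    for members in gens.values():
        shared = Counter(term for gene in members for term in annotations.get(gene, ()))
        for term, count in shared.items():
            if count >= min_shared:
                proposed.update(
                    (gene, term)
                    for gene in members
                    if term not in annotations.get(gene, ())
                )
    return proposed


# Toy data: 10 hypothetical genes on a ring, each GEN = gene + next 4 genes.
names = [f"g{i}" for i in range(10)]
toy_gens = {names[i]: [names[(i + j) % 10] for j in range(5)] for i in range(10)}
toy_annotations = {name: {"GO:A"} for name in names[:5]}
new_annotations = putative_annotations(toy_gens, toy_annotations)
print(sorted(new_annotations))
```

Returning a set means that a gene-term pair proposed by several overlapping GENs is counted once, which matches counting distinct annotations rather than distinct GENs.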
Integration with OMIM data
Finally, we looked for functionally characterized GENs containing at least 3 genes associated with a genetic disease
in the OMIM database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). When the relevant OMIM entries
were related to each other, the genes in the GEN not associated with OMIM entries have been considered as
interesting candidates to be involved in similar pathologies.
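The counting part of the OMIM integration step can be sketched as below: among functionally characterized GENs, those with at least 3 OMIM-associated members are kept, and their remaining members are listed as candidates. Judging whether the OMIM entries are related to each other is not automated here and is assumed to be a separate, manual step. The function name, gene names and disease labels are hypothetical placeholders.

```python
def omim_candidates(characterized_gens, disease_genes, min_disease=3):
    """Map each candidate gene to the diseases of the OMIM genes in its GENs.

    characterized_gens: dict center gene -> list of GEN members (already filtered
        to functionally characterized GENs).
    disease_genes: dict gene -> set of associated disease names.
    """
    candidates = {}
    for members in characterized_gens.values():
        known = [gene for gene in members if gene in disease_genes]
        if len(known) < min_disease:
            continue
        diseases = {d for gene in known for d in disease_genes[gene]}
        for gene in members:
            if gene not in disease_genes:
                candidates.setdefault(gene, set()).update(diseases)
    return candidates


# Toy data with placeholder names (not real OMIM entries).
toy_characterized = {
    "geneA": ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG"],
    "geneH": ["geneH", "geneA", "geneI", "geneJ", "geneK", "geneL", "geneM"],
}
toy_disease_genes = {
    "geneA": {"disease X"},
    "geneB": {"disease X"},
    "geneC": {"disease Y"},
}
toy_candidates = omim_candidates(toy_characterized, toy_disease_genes)
print(sorted(toy_candidates))
```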
Potential new disease genes (OMIM)
Dataset | Disease | Gene
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000069482
Microarray+Pearson | AORTIC ANEURYSM, FAMILIAL THORACIC 1 | ENSG00000149591
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000107796
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2G; CMT2G | ENSG00000166986
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, DOMINANT INTERMEDIATE A | ENSG00000166197
Microarray+Pearson | CONVULSIONS, BENIGN FAMILIAL INFANTILE, 2 | ENSG00000087258
Microarray+Pearson | CONVULSIONS, FAMILIAL INFANTILE, WITH PAROXYSMAL CHOREOATHETOSIS; ICCA | ENSG00000087258
Microarray+Pearson | DEAFNESS, NEUROSENSORY, AUTOSOMAL RECESSIVE 46; DFNB46 | ENSG00000101608
Microarray+Pearson | EPILEPSY, IDIOPATHIC GENERALIZED, SUSCEPTIBILITY TO, 3; EIG3 | ENSG00000078725
Microarray+Pearson | EPILEPSY, PARTIAL, WITH VARIABLE FOCI | ENSG00000100095
Microarray+Pearson | FACIOSCAPULOHUMERAL MUSCULAR DYSTROPHY 1A; FSHMD1A | ENSG00000154553
Microarray+Pearson | MUSCULAR DYSTROPHY, LIMB-GIRDLE, TYPE 1F; LGMD1F | ENSG00000128595
Microarray+Pearson | PARKINSON DISEASE 3, AUTOSOMAL DOMINANT LEWY BODY; PARK3 | ENSG00000075340
Microarray+Pearson | POLYDACTYLY, PREAXIAL II; PPD2 | ENSG00000106538
Microarray+Pearson | ROSSELLI-GULIENETTI SYNDROME | ENSG00000137699
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000139329
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG00000077009
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG00000099800
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000131808
Microarray+Pearson | BREAST CANCER, 11-22 TRANSLOCATION ASSOCIATED | ENSG00000137713
Microarray+Pearson | BREAST CANCER, DUCTAL, 1; BRCD1 | ENSG00000139618
Microarray+Pearson | ELECTROENCEPHALOGRAM, LOW-VOLTAGE | ENSG00000075043
Microarray+Pearson | EOSINOPHILIA, FAMILIAL | ENSG00000113721
Microarray+Pearson | MICROCEPHALY, PRIMARY AUTOSOMAL RECESSIVE, 4; MCPH4 | ENSG00000156970
Microarray+Pearson | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray+Pearson | TRIPHALANGEAL THUMB-POLYSYNDACTYLY SYNDROME | ENSG00000106538
Microarray+Pearson | TUMOR SUPPRESSOR GENE ON CHROMOSOME 11 | ENSG00000137713
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1F; CMD1F | ENSG00000118523
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1Q; CMD1Q | ENSG00000091136
Microarray+Pearson | DEAFNESS, AUTOSOMAL RECESSIVE 51; DFNB51 | ENSG00000026508
Microarray+Pearson | MYOPATHY, LIMB-GIRDLE, WITH BONE FRAGILITY | ENSG00000147872
Microarray+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
Microarray+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
Microarray+Euclidean | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray+Euclidean | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE+Euclidean | ANEURYSM, INTRACRANIAL BERRY, 3 | ENSG00000158747
SAGE+Euclidean | MYOPIA 5 | ENSG00000108821
SAGE+Euclidean | MYOPIA 6 | ENSG00000100122
SAGE+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG00000167971
SAGE+Euclidean | EXFOLIATIVE ICHTHYOSIS, AUTOSOMAL RECESSIVE, ICHTHYOSIS BULLOSA OF SIEMENS-LIKE | ENSG00000186081
SAGE+Euclidean | MACULAR DYSTROPHY, RETINAL, 2, BULL'S EYE | ENSG00000007062
SAGE+Euclidean | CATARACT, CONGENITAL NUCLEAR, AUTOSOMAL RECESSIVE 1; CATCN1 | ENSG00000105370
SAGE+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG00000129535
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG00000139988
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000109047
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000179036
SAGE+Euclidean | POSTERIOR COLUMN ATAXIA WITH RETINITIS PIGMENTOSA; AXPC1 | ENSG00000116703
SAGE+Euclidean | MYOPIA 6 | ENSG00000196431
SAGE+Euclidean | GLAUCOMA 3, PRIMARY INFANTILE, B; GLC3B | ENSG00000158747
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG00000197253
SAGE+Euclidean | DUPUYTREN CONTRACTURE | ENSG00000087245
SAGE+Euclidean | CORNEAL DYSTROPHY, CRYSTALLINE, OF SCHNYDER | ENSG00000158747
SAGE+Euclidean | CATARACT, AUTOSOMAL RECESSIVE, EARLY-ONSET, PULVERULENT | ENSG00000172014
SAGE+Euclidean | CATARACT, POSTERIOR POLAR 3 | ENSG00000125864
Table 3 - List of candidate genes potentially involved in human genetic diseases.
Results
• The leave-one-out analysis showed that 1026 correct GO annotations, involving 644 genes and 94 GO terms, would
have been identified by the method (see Table 1).
a)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 428 | 788 | / | 958 | 428
SAGE | 50 | / | 51 | 50 | 92
Microarray+SAGE | 468 | 788 | 51 | 992 | 504
b)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 318 | 546 | / | 598 | 318
SAGE | 48 | / | 48 | 48 | 82
Microarray+SAGE | 353 | 546 | 48 | 625 | 376
Table 1 - Leave-one-out analysis results showing the number of GO annotations (a) and annotated genes (b) correctly identified.
• The distribution of GO terms among the three Gene Ontology branches varies significantly across the
combinations of experimental dataset and dissimilarity measure, showing that different combinations capture
different aspects of coexpression.
Fig. 1 - Distribution of the correctly obtained GO annotations among the three GO branches (Biological process;
Molecular function; Cellular component). Panels: Microarray-Pearson, Microarray-Euclidean, SAGE-Euclidean, SAGE-Poisson.
• Different dissimilarity measures capture different aspects of coexpression, which correlate with different kinds
of functional annotation (see Tables 1 and 2): only a small fraction of GO annotations is predicted by more than
one combination of dataset and dissimilarity measure.
c)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 569 | 950 | / | 1240 | 569
SAGE | 173 | / | 216 | 173 | 362
Microarray+SAGE | 720 | 950 | 216 | 1378 | 892
d)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 688 | 1215 | / | 1731 | 688
SAGE | 188 | / | 230 | 188 | 407
Microarray+SAGE | 866 | 1215 | 230 | 1906 | 1081
Table 2 - Number of obtained putative new functional GO annotations (c) and new annotated genes (d).
• We obtained 2113 putative new GO annotations involving 1540 genes and 194 GO terms (see Table 2).
• The integration of our functional annotation results with the OMIM database allowed us to identify at least 59
interesting candidate genes potentially involved in human genetic diseases (see Table 3).
Conclusion
We have developed an approach to analyze and integrate information obtained with different experimental techniques and different dissimilarity measures,
which together explore several aspects of coexpression. The results show that this integration increases the amount of useful information obtained.