Results from an attempt to correlate protein
Correlating PPI node degree with
SNP counts
Michael Grobe
(This work supported in part by:
Research Technologies
Indiana University)
1
Do PPI nodes of high degree “have” more
or fewer SNPs?
Are hubs more or less susceptible to SNPs
over evolutionary time?
If so, why?
If not, why?
2
Hypothesis: The degrees of genes in a PPI
network will correlate inversely with their SNP
count.
This hypothesis will be tested using (parts of) the
following data resources:
- dbSNP from NCBI,
- the Disease Gene Network data collected
by Rual, et al., Stelzl, et al., and Goh, et al.
- several other NCBI resources
3
dbSNP
dbSNP is a large relational database maintained by the National
Center for Biotechnology Information (NCBI) on a Microsoft
SQLServer. (dbSNP seems to be misnamed.)
NCBI provides several public interfaces to dbSNP:
- a web-based interface for public use
http://www.ncbi.nlm.nih.gov/SNP/
- a set of web-accessible CGI scripts and (SOAP-based)
Web Services, known as the Entrez eUtils, and
- an FTP repository of the data exported from the MS SQLServer.
NCBI does NOT provide an interface for submitting SQL commands
directly to the SQLServer.
However, IUSM downloads the dbSNP data from the NCBI FTP
repository and loads it into a local MS SQLServer, where it is available
for use via JDBC. UITS makes it available via Web pages and
(SOAP-based) Web Services.
4
UITS maintains (on a DB2 database management system) a collection
of data resources called the Centralized Life Sciences Data (CLSD)
service that incorporates dbSNP via “data federation”.
dbSNP can be accessed via CLSD at
http://discern.uits.iu.edu:8421/access/index.html
and also via a SOAP-based interface to CLSD at
http://discern.uits.iu.edu:8421/axis/CLSDservice.jws?wsdl
dbSNP can also be accessed via JDBC, or through a direct JAX-RPC
interface, if necessary.
CLSD is described in detail at http://rac.uits.iu.edu/clsd/
5
List of CLSD data resources
BIND -- Pathways, Gene interactions
ENZYME -- Enzyme nomenclature
ePCR -- ePCR results of UniSTS vs Homo sapiens
SGD -- Saccharomyces Genome Database
DGN – The Disease Gene Network data from Goh, et al. (Provisional)
KEGG data sources:
+ LIGAND -- Pathways, Reactions, & Compounds
+ PATHWAY -- Pathway map coordinates
NCBI data sources:
+ LocusLink -- Genetic Loci. (retained for archival use.)
+ UniGene -- Gene clusters
Federated data sources, where the data is stored:
* at the originating site:
+ NCBI Nucleotide -- Nucleotide sequences
+ NCBI PubMed -- Journal abstracts
* on local (mirror) servers external to CLSD but housed at IU
* BLAST -- Basic Local Alignment Search Tool (mirrored at IU by UITS)
* Nucleotide data: NT
* Protein data: NR and Swiss-Prot
* dbSNP -- Single Nucleotide Polymorphisms (mirrored at IU by IUSM)
6
dbSNP is a relatively complex database. It includes
about 300 tables for each species, and the separate
species tables share about 80 additional tables.
dbSNP is also rather large: the Shared, Human, and Mouse
catalogs (circa early 2008) fill around 150 GB and hold
about 3 billion rows (of which about 2.8 billion are in
dbSNP128_human).
New versions come out every 6 months or so. This study
uses Build 128, although Build 129 has been quite
recently announced.
The tutorial
“Using dbSNP via SQL queries”
describes the structure and use of dbSNP via SQL.
7
The DGN PPI network
The Disease Gene Network data within CLSD includes 3 networks:
- a network of diseases that are “connected” when they involve the
same gene,
- a network of 1777 genes that are “connected” when they are
implicated in the same disease, and
- a Protein Protein Interaction (PPI) network built from networks
defined by two different groups: Rual, et al. and Stelzl, et al.
The PPI is defined in the table called PPI_RUAL_STELZL; it has
7533 unique genes and 22,052 edges (in a half-matrix form).
A companion table PPI_GENES lists every gene in the PPI network.
The PPI network was traversed to construct a list of shortest paths
from each node to each other node:
PPI_SHORTEST_PATH_LENGTHS. This is a kind of transitive
closure and contains about 53 M records.
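The traversal described above can be sketched as a breadth-first search started from every node. This is only a minimal sketch, not the code used to build PPI_SHORTEST_PATH_LENGTHS: the function names and the toy edge list are hypothetical, the edge list is assumed to be in half-matrix form (each undirected edge listed once), and self-interactions are assumed to be ignored.

```python
from collections import deque

def build_adjacency(edges):
    """Expand a half-matrix edge list (each edge listed once) into a symmetric adjacency map."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # assumption: ignore self-interactions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_path_lengths(edges):
    """Return {(source, target): length} for every ordered pair of distinct, connected nodes."""
    adj = build_adjacency(edges)
    lengths = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    lengths[(source, nbr)] = dist[nbr]
                    queue.append(nbr)
    return lengths

toy_edges = [(1, 2), (2, 3), (3, 4)]  # hypothetical gene IDs
paths = shortest_path_lengths(toy_edges)
```

The sketch stores one record per ordered pair, so each node's degree is simply the number of its records with length 1.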
8
SNPContigLocusID
The main dbSNP table used in this project is SNPContigLocusID
which contains information about the genes associated with each
SNP.
The Build 128 version of SNPContigLocusID contains about
13,129,868 rows (though about half of them specify “NW_” mRNA
segments and were ignored).
Here is a query that retrieves the records for 2 SNPs (among many
others) that appear within, or close to, the coding region for JAK3.
select
*
from
b126_SNPContigLocusId_36_1
where
snp_id in ( 3212724, 3212755 )
9
Query results:
Note that both of these SNPs have several records; SNP
ID is NOT a key. SNPs may even map to different
chromosomes!
10
Here is a table of the Function Class (FXN_CLASS) codes.
11
Number of (NT_) SNPs in each SNP function class
select
fxn_class, count(*)
from
dbSNP128_human.b128_SNPContigLocusId_36_2
where contig_acc like 'NT_%'
-- so not all 13 M rows will appear
GROUP BY fxn_class
ORDER BY fxn_class
FXN_CLASS    Count
3            78797
6            6008473
8            192868
13           168608
15           166205
41           2753
42           98053
44           15848
53           144123
55           27990
73           645
75           483
12
Get gene IDs, symbols, and SNP counts
The following query uses both DGN and dbSNP data to get a list of gene IDs, their
symbols, and the number of SNPs associated with each gene:
select
a.locus_id, b.locus_symbol, snp_counter
from
(select
locus_id, count(*) as snp_counter
from
dbsnp128_human.b128_SNPContigLocusId_36_2
where
contig_acc like 'NT_%'
and
locus_id in (select gene_id from disease_gene_net.ppi_genes )
group by locus_id) as a
join
(select
distinct locus_id, locus_symbol
from
dbsnp128_human.b128_SNPContigLocusId_36_2) as b
on b.locus_id = a.locus_id
order by snp_counter desc
13
Gene IDs, symbols, and SNP counts
Here is a list of PPI genes with the top 100 SNP counts:
Gene ID   Symbol      SNP count
1756      DMD         60069
5799      PTPRN2      30328
26047     CNTNAP2     21661
5071      PARK2       19464
5789      PTPRD       15867
1305      COL13A1     15719
9734      HDAC9       14461
8379      MAD1L1      12441
5152      PDE9A       12052
9586      CREB5       11921
2917      GRM7        11738
5649      RELN        11200
1523      CUTL1       9956
221935    SDK1        9194
9223      MAGI1       9091
23085     ERC1        9046
23236     PLCB1       8905
1129      CHRM2       8895
2272      FHIT        8366
2918      GRM8        8128
2139      EYA2        8091
29119     CTNNA3      8089
6487      ST3GAL3     8077
53616     ADAM22      8006
5890      RAD51L1     7786
2104      ESRRG       7647
1956      EGFR        7522
9369      NRXN3       7088
9215      LARGE       7046
56899     ANKS1B      7044
672       BRCA1       7024
8618      CADPS       6798
6938      TCF12       6772
10207     INADL       6721
1837      DTNA        6678
3084      NRG1        6392
1896      EDA         6350
23345     SYNE1       6343
4638      MYLK        6301
9378      NRXN1       6157
5558      PRIM2       5995
4897      NRCAM       5928
2898      GRIK2       5877
9844      ELMO1       5835
3784      KCNQ1       5777
6660      SOX5        5736
1740      DLG2        5616
1630      DCC         5606
5592      PRKG1       5574
3123      HLA-DRB1    5533
8224      SYN3        5495
9577      BRE         5455
8997      KALRN       5387
4212      MEIS2       5271
600       DAB1        5252
351       APP         5196
2887      GRB10       5190
93986     FOXP2       5184
800       CALD1       5073
3119      HLA-DQB1    5039
659       PDE4DIP     4999
10580     SORBS1      4872
273       AMPH        4844
2066      ERBB4       4736
6095      RORA        4644
79109     MAPKAP1     4643
57509     MTUS1       4603
4915      NTRK2       4562
1730      DIAPH2      4481
7518      XRCC4       4434
27185     DISC1       4413
1010      CDH12       4370
55714     ODZ3        4370
6091      ROBO1       4346
2895      GRID2       4332
1002      CDH4        4331
5884      RAD17       4286
23254     KIAA1026    4275
817       CAMK2D      4207
5602      MAPK10      4189
8464      SUPT3H      4135
84570     COL25A1     4101
10142     AKAP9       4072
10466     COG5        4018
64754     SMYD3       3988
7492      ARID1B      3981
27133     KCNH5       3910
1390      CREM        3891
6262      RYR2        3879
8038      ADAM12      3874
1501      CTNND2      3836
89797     NAV2        3798
10659     CUGBP2      3786
31        ACACA       3755
11214     AKAP13      3751
1301      COL11A1     3736
7273      TTN         3733
1838      DTNB        3698
7399      USH2A       3694
14
PPI node degree
Here is a query that uses
PPI_SHORTEST_PATH_LENGTHS
to get degree for each node:
select
source, count(*) as degree
from
disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
where
length = 1
and
source in
( select gene_id from DISEASE_GENE_NET.PPI_GENES )
group by source
order by degree
15
PPI node degree
Here is a query using that closure to get gene counts for each degree:
select
degree, count(*)
from
(select
source, count(*) as degree
from
disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
where
length = 1
and
source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
group by source ) as a
group by degree
order by degree
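The same degree counts could also be computed outside the database, directly from an edge list. The following is a minimal sketch under stated assumptions: the function names and toy data are hypothetical, each undirected edge is listed once (half-matrix form), and self-interactions do not add to a node's degree.

```python
from collections import Counter

def node_degrees(edges):
    """Degree of each node from a half-matrix edge list (each undirected edge listed once)."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # assumption: self-interactions do not add to degree
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}

def degree_distribution(degrees):
    """Map each degree value to the number of nodes having it, sorted by degree."""
    return dict(sorted(Counter(degrees.values()).items()))

toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3)]  # hypothetical gene IDs
degrees = node_degrees(toy_edges)
distribution = degree_distribution(degrees)
```

Using sets of neighbors means a duplicated edge is counted once, which may or may not match how the closure table was built.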
16
Degree and gene count for all genes in the PPI net:
Degree   Genes
1        2267
2        1217
3        849
4        589
5        465
6        343
7        248
8        198
9        176
10       145
11       119
12       99
13       106
14       88
15       59
16       52
17       58
18       34
19       38
20       23
21       26
22       21
23       24
24       16
25       15
26       16
27       19
28       13
29       12
30       13
31       12
32       13
33       8
34       7
35       7
36       5
37       7
38       9
39       2
40       3
41       7
42       3
43       4
44       1
45       3
46       3
47       2
48       3
49       4
50       8
51       6
53       1
54       2
55       3
56       2
57       2
58       4
59       4
60       4
62       4
63       1
64       1
65       1
67       1
69       1
73       1
75       2
76       4
77       1
78       4
79       3
80       1
82       1
83       1
84       1
87       1
89       2
94       1
95       1
97       1
99       1
103      2
105      1
118      1
123      1
124      1
129      1
151      1
153      1
176      1
17
To get PPI gene IDs, symbols, and degrees:
select
b.locus_id, b.locus_symbol, degree
from
(select
source, count(*) as degree
from
disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
where
length = 1
and
source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
group by source) as a
join
(select
distinct locus_id, locus_symbol
from
dbsnp128_human.b128_SNPContigLocusId_36_2) as b
on
b.locus_id = a.source
18
PPI gene IDs, symbols, and their degrees (top 100 degree values):
Gene ID   Symbol      Degree
7157      TP53        176
5829      PXN         153
2885      GRB2        151
11007     CCDC85B     129
7186      TRAF2       124
7414      VCL         123
4087      SMAD2       118
2130      EWSR1       105
4088      SMAD3       103
4093      SMAD9       103
1956      EGFR        99
4188      MDFI        97
6714      SRC         95
55791     C1orf103    94
2534      FYN         89
3725      JUN         89
5295      PIK3R1      87
2099      ESR1        84
5925      RB1         83
7094      TLN1        82
367       AR          80
1387      CREBBP      79
5764      PTN         79
6464      SHC1        79
2033      EP300       78
4089      SMAD4       78
7431      VIM         78
7704      ZBTB16      78
9094      UNC119      77
672       BRCA1       76
1499      CTNNB1      76
1915      EEF1A1      76
3065      HDAC1       76
6498      SKIL        75
9869      SETDB1      75
83755     KRTAP4-1    73
7329      UBE2I       69
5359      PLSCR1      67
5781      PTPN11      65
2908      NR3C1       64
1742      DLG4        63
1107      CHD3        62
1937      EEF1G       62
4086      SMAD1       62
57562     KIAA1377    62
998       CDC42       60
4110      MAGEA11     60
5335      PLCG1       60
10980     COPS6       60
2335      FN1         59
2547      XRCC6       59
6667      SP1         59
57473     ZNF512B     59
4067      LYN         58
5111      PCNA        58
7534      YWHAZ       58
55729     ATF7IP      58
5777      PTPN6       57
5970      RELA        57
6256      RXRA        56
6908      TBP         56
596       BCL2        55
1400      CRMP1       55
7088      TLE1        55
3320      HSP90AA1    54
10524     HTATIP      54
3717      JAK2        53
351       APP         51
3932      LCK         51
5594      MAPK1       51
5747      PTK2        51
9513      FXR2        51
11161     C14orf1     51
867       CBL         50
3866      KRT15       50
5879      RAC1        50
6303      SAT1        50
6774      STAT3       50
8648      NCOA1       50
26994     RNF11       50
55660     PRPF40A     50
25        ABL1        49
3064      HD          49
4035      LRP1        49
10241     CALCOCO2    49
4790      NFKB1       48
5894      RAF1        48
10399     GNB2L1      48
5371      PML         47
7917      BAT3        47
801       CALM1       46
5578      PRKCA       46
8655      DYNLL1      46
1051      CEBPB       45
2185      PTK2B       45
4609      MYC         45
11030     RBPMS       44
857       CAV1        43
5300      PIN1        43
19
Get SNP counts and degree values for each gene in the PPI:
select locus_id, degree, snp_counter from
(select
locus_id, count(*) as snp_counter
from
dbsnp128_human.b128_SNPContigLocusId_36_2
where
contig_acc like 'NT_%'
and
( fxn_class = 41 or fxn_class = 42 or fxn_class = 44 )
group by locus_id) as a
join
(select
source, count(*) as degree
from
disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
where
length = 1
and
source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
group by source
) as b
on source = locus_id
order by degree
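The shape of this join can be checked on toy data. The sketch below uses Python's built-in sqlite3 module with two small hypothetical tables (snp_locus and ppi_paths, standing in for the dbSNP and DGN tables); the rows are invented for illustration, and SQLite may differ from DB2 or SQL Server in syntax details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE snp_locus (snp_id INTEGER, locus_id INTEGER, contig_acc TEXT, fxn_class INTEGER);
    CREATE TABLE ppi_paths (source INTEGER, target INTEGER, length INTEGER);
    -- gene 10: two qualifying NT_ rows plus one NW_ row that must be ignored
    -- gene 20: one class-6 row (filtered out) and one class-44 row
    INSERT INTO snp_locus VALUES
        (1, 10, 'NT_000001', 42), (2, 10, 'NT_000001', 41),
        (3, 10, 'NW_000009', 42), (4, 20, 'NT_000002', 6),
        (5, 20, 'NT_000002', 44);
    -- gene 30 is in the network but has no SNP rows at all
    INSERT INTO ppi_paths VALUES
        (10, 20, 1), (20, 10, 1), (10, 30, 1), (30, 10, 1),
        (20, 30, 2), (30, 20, 2);
""")

rows = conn.execute("""
    SELECT a.locus_id, b.degree, a.snp_counter
    FROM (SELECT locus_id, COUNT(*) AS snp_counter
          FROM snp_locus
          WHERE contig_acc LIKE 'NT_%' AND fxn_class IN (41, 42, 44)
          GROUP BY locus_id) AS a
    JOIN (SELECT source, COUNT(*) AS degree
          FROM ppi_paths
          WHERE length = 1
          GROUP BY source) AS b
      ON b.source = a.locus_id
    ORDER BY b.degree, a.locus_id
""").fetchall()
```

Because this is an inner join, a gene with no SNPs in the selected classes (gene 30 here) drops out of the result, which may be why the number of genes varies with the function class selected.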
20
Initial results:
The previous query was used to derive correlations
between degree values and SNP counts per gene for
every gene in the PPI network:
Class       Genes   Degree Mean   SNP Mean   Correlation
All         7403    5.9           428        0.046
41,42,44    6569    6.0           8.5        0.062
Not 6       7397    5.9           55         0.094
13, 15      7383    5.9           18         0.054
6           7174    5.9           348        0.041
(Note that a few observations were omitted due to using the mer counting script for
non-mer work.)
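The slides do not say which correlation statistic was used. Assuming it was the Pearson product-moment correlation, the computation over the (degree, snp_counter) pairs can be sketched as follows; the function name and the sample rows are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (degree, snp_counter) rows, shaped like the query output
rows = [(1, 12), (2, 7), (3, 30), (5, 9), (8, 15)]
degrees = [d for d, _ in rows]
snp_counts = [s for _, s in rows]
r = pearson(degrees, snp_counts)
```

A value near 0, like those in the table, indicates little linear relationship; it does not rule out a nonlinear one.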
21
More initial results:
The same approach was used to derive correlations
for the 1195 or so disease genes that also appear in
the PPI net:
Class       Genes   Degree Mean   SNP Mean   Correlation
All         1193    7.5           592        0.086
41,42,44    1121    7.5           14.9       0.089
Not 6       1193    7.5           82.7       0.117
13, 15      1193    7.5           22.7       0.049
6           1161    7.5           523        0.041
(Note that a few observations were omitted due to
using the mer counting script for non-mer work.)
22
Perhaps a correlation can be found as a function of mer counts?
That is, perhaps:
“DNA bases in the gene per SNP” or
“RNA bases in the gene transcript per SNP” or
“amino acids in the protein product per SNP”
will correlate with degree, especially for certain SNP classes?
Testing these claims requires gene, mRNA transcript, and/or protein
product lengths (and maybe intron lengths).
Note that the SNPContigLocusId table includes pointers to mRNA
and protein records, and includes the NCBI UIDs for each record.
Scripts (get-mRNA-lengths.pl and get-protein-lengths.pl) were written
to access the mRNA and protein contig data from NCBI and to count
base pairs or amino acids, respectively.
23
Scripts to download mer (base and aa) data
The libwww-perl (LWP) module was used to interact with
the NCBI eUtils that were mentioned earlier and are
documented in “Using the NCBI eUtilities via CGI” at
http://mypage.iu.edu/~dgrobe/entrez-dogma.html
DNA lengths were obtained using a service at
http://discern.uits.iu.edu:8421/view-sequences.html
called “Get NCBI sequences for genes or specified
regions” that will fetch gene FASTA records given gene
names and/or NCBI UIDs.
NCBI asks users to limit access to one request every 3
seconds during off-peak hours and one every 15 seconds
otherwise. As a result, these runs took over 24 hours.
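Pacing requests to respect such a limit can be sketched as below. The original scripts were written in Perl with LWP; this Python version is only an illustration, and the Throttle class and estimated_hours helper are hypothetical names. The clock and sleep functions are injectable so the pacing logic can be exercised without actually waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

def estimated_hours(n_requests, seconds_per_request):
    """Lower bound on run time when requests are spaced seconds_per_request apart."""
    return n_requests * seconds_per_request / 3600.0
```

A script would create `Throttle(3.0)` once and call `wait()` before each request. As a rough check, and assuming one request per record, 32400 requests spaced 3 seconds apart need at least 27 hours, which is in line with runs taking over 24 hours.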
24
The resulting “mer files” have the following sizes:
- DNA length records: 22259
- mRNA length records: 32400
- Protein length records: 23803
There are frequently multiple mRNA and protein
records for a gene; mean lengths were computed
for each gene by downstream scripts.
A script (get-gene-mRNA-SNPs-mers-perSNP.pl) was written to compute mean lengths
and perform correlations on the mer data.
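The two steps described here, averaging multiple length records per gene and then dividing by the SNP count, can be sketched as follows. This is not the Perl script named above; the function names and sample data are hypothetical, and genes with no length record or no SNPs are assumed to be skipped.

```python
from collections import defaultdict

def mean_length_per_gene(records):
    """Average the lengths of all records (e.g. mRNA transcripts) for each gene."""
    lengths = defaultdict(list)
    for gene_id, length in records:
        lengths[gene_id].append(length)
    return {gene: sum(vals) / len(vals) for gene, vals in lengths.items()}

def bases_per_snp(mean_lengths, snp_counts):
    """Mean length divided by SNP count, for genes with a length and at least one SNP."""
    return {
        gene: mean_lengths[gene] / count
        for gene, count in snp_counts.items()
        if gene in mean_lengths and count > 0
    }

# Hypothetical (gene_id, mRNA length) records and per-gene SNP counts
records = [(10, 1200), (10, 1800), (20, 900)]
snp_counts = {10: 3, 20: 0, 30: 5}
per_snp = bases_per_snp(mean_length_per_gene(records), snp_counts)
```

In this toy data, gene 20 is skipped because it has no SNPs and gene 30 because it has no length record, so only gene 10 yields a bases-per-SNP value.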
25
Here are correlations between node degree
and mRNA bases per SNP:
This table is for ALL PPI genes showing SNPs in
the specified function class:
Class        Genes   Mean Degree   Bases per SNP   Correlation
Not 6        7406    5.9           96              -0.032
All          7412    5.9           428             -0.046
41, 42, 44   6576    6.0           922             0.001
Note that the correlation between base count and
SNP count was: -0.21.
26
Here are correlations between node degree
and DNA bases per SNP:
This table is for ALL PPI genes showing SNPs in
the specified function class :
Class   Genes   Mean Degree   Bases per SNP   Correlation
6       7174    5.9           348             -0.039
All     7403    5.9           198             -0.033
Note that the correlations between base count
and SNP count were -0.097 and -0.12.
27
Conclusion
This study found no relationship between
SNP count and PPI node degree, or
between measures of mer counts per SNP
and node degree.
28
Discussion
Are cell networks so robust that variation which “should” normally
disrupt functioning gets over-ridden?
If so, how?
Are there parallel/redundant pathways for important processes?
Are non-parallel pathways constructed to minimize the effects of
variation?
Do chaperone proteins (like HSP90) help make variant proteins safe
for use within the cell (à la Whitesell and Lindquist)? (Note: around
20% of HSP-connected genes appear in the list of 100 genes (< 2%)
with the highest degree.)
Would hub genes within Reaction networks (as opposed to PPI
networks) show SNP counts that correlate with their degree?
Would PPIs composed only of co-located proteins display node
degree-SNP count correlations?
Do lethal genes show fewer SNPs?
29
References
The dbSNP Build Process
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/helpsnpfaq/Build.pdf
Using dbSNP via SQL queries
http://mypage.iu.edu/~dgrobe/dbSNP/using-dbSNP-via-SQL.html
Using the relational and eUtils interfaces to dbSNP
http://mypage.iu.edu/~dgrobe/dbSNP/using-dbSNP-at-IU.html
Using the NCBI eUtilities via CGI
http://mypage.iu.edu/~dgrobe/entrez-dogma.html
Kwang-Il Goh, Michael E. Cusick, David Valle, Barton Childs, Marc Vidal,
and Albert-Laszlo Barabasi, The human disease network, PNAS, May 22, 2007, vol.
104, no. 21, 8685.
http://www.pnas.org/content/104/21/8685.abstract
Get contents of tables related to the Goh (2007) paper
http://discern.uits.iu.edu:8421/show-a-DISEASE_GENE_NET-Table.html
Whitesell, Luke, and Susan L. Lindquist, HSP90 and the chaperoning of cancer, Nat
Rev Cancer, 2005;5(10):761-772.
30