Knowledge Graph Search - Indiana University Bloomington

Measuring Scholarly Impact and
Beyond
Ying Ding
Indiana University
[email protected]
http://info.slis.indiana.edu/~dingying/index.html
Outline
• Assessing the credibility of researchers and
institutions
• Identifying significant scientific advancement/
emerging trends
• Other Research Analytics
• Tool: Data2Knowledge Platform
• Next Steps
Productivity
Top Journals
Top Researchers
Measuring Scholarly Impact in the field of Semantic Web
Data: 4,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951
papers with 571,911 citations from WOS (1960-2009)
Impact through citation
Impact
Top Journals
Top Researchers
Rising Stars
• In WOS, M. A. Harris (Gene Ontology-related research), T.
Harris (design and implementation of programming
languages) and L. Ding (Swoogle – Semantic Web Search
Engine) are ranked as the top three authors with the highest
increase of citations.
• In Scopus, D. Roman (Semantic Web Services), J. De Bruijn
(logic programming) and L. Ding (Swoogle) are ranked as top
three for the significant increase in number of citations.
Ding, Y. (2010). Semantic Web: Who is Who in the field, Journal of Information Science,
36(3): 335-356.
Popular vs. Prestigious
Ding, Y., & Cronin, B. (2011). Popular and/or Prestigious? Measures of Scholarly Esteem,
Information Processing and Management, 47(1), 80-96.
Data: 15,370 IR papers with 341,871 cited references from 1956 to 2008.
Academic Career Peak
PageRank and Weighted PageRank
$$PR_W(p) = (1 - d)\,\frac{W(p)}{\sum_{i=1}^{N} W(p_i)} + d \sum_{i=1}^{k} \frac{PR_W(p_i)}{C(p_i)}$$
• The weighted-PageRanks bring finer granularity to ranking
experts under various situations by including different
contextual information as weighted vectors to PageRank
algorithms.
– including an author’s total publications as the weighted vector,
PageRank can calculate a contextualized ranking reflecting the
scholar’s productivity;
– adding author’s expertise as the weighted vector, PageRank can
calculate a contextualized ranking reflecting the scholar’s domain
knowledge and research interest.
Ding, Y. (2011). Applying weighted PageRank to author citation network. Journal of the
American Society for Information Science and Technology, 62(2), 236-245.
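The weighted PageRank idea above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the cited paper: the function name `weighted_pagerank`, the toy citation data, and the choice to spread the score of authors who cite nobody according to the same weights are all assumptions made for the example. The weight vector W plays the role of the contextual information (e.g., publication counts) and replaces the uniform 1/N term of plain PageRank.

```python
def weighted_pagerank(citations, weights, d=0.85, iterations=100):
    """Power iteration for a weighted PageRank on an author citation network.

    citations: dict mapping each author to the list of authors they cite.
    weights:   dict mapping each author to a non-negative weight W(p),
               e.g. a publication count (hypothetical input format).
    """
    authors = sorted(
        set(citations)
        | {a for cited in citations.values() for a in cited}
        | set(weights)
    )
    total = sum(weights.get(a, 0.0) for a in authors)
    # Normalised weight vector W(p) / sum_i W(p_i)
    teleport = {a: weights.get(a, 0.0) / total for a in authors}
    pr = dict(teleport)
    for _ in range(iterations):
        new = {a: (1 - d) * teleport[a] for a in authors}
        for src in authors:
            cited = citations.get(src, [])
            if cited:
                share = d * pr[src] / len(cited)  # PR(p_i) / C(p_i)
                for target in cited:
                    new[target] += share
            else:
                # Assumption: authors with no outgoing citations spread
                # their score according to the weight vector.
                for a in authors:
                    new[a] += d * pr[src] * teleport[a]
        pr = new
    return pr


toy_citations = {"A": ["C"], "B": ["C"], "C": ["A"]}
uniform = weighted_pagerank(toy_citations, {"A": 1, "B": 1, "C": 1})
by_output = weighted_pagerank(toy_citations, {"A": 1, "B": 8, "C": 1})
print(uniform)
print(by_output)
```

With uniform weights this reduces to ordinary PageRank; giving author B a larger weight raises B's score even though nobody cites B, which is the "contextualized ranking" effect described on the slide.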
PageRank and other ranks
Rank by PageRank damping factor (0.05-0.95), citation count, and degree, betweenness, and closeness centrality:
Author            0.05  0.15  0.25  0.35  0.45  0.55  0.65  0.75  0.85  0.95  Citation  Degree  Betweenness  Closeness
SALTON G          1     1     1     1     1     1     1     1     1     1     1         1       1            1
ROBERTSON SE      3     2     2     2     2     2     2     2     2     2     2         11      21           10
ABITEBOUL S       2     3     3     3     3     4     4     5     8     13    3         4       7            71
BELKIN NJ         4     4     4     4     4     3     3     3     3     3     4         22      24           20
VANRIJSBERGEN CJ  5     5     5     5     5     5     5     4     4     4     5         5       13           4
RUI Y             6     6     6     6     7     8     9     9     10    15    6         24      31           23
SARACEVIC T       7     8     7     7     6     6     6     6     5     5     7         25      51           25
CROFT WB          9     9     9     9     9     7     7     7     6     6     8         15      14           14
SPINK A           13    13    13    12    12    11    11    11    9     8     9         57      93           57
JONES KS          12    10    10    10    10    10    8     8     7     7     10        17      30           18
SMITH JR          8     7     8     8     8     9     10    10    11    16    11        38      32           35
FALOUTSOS C       10    11    11    13    13    13    15    16    19    26    12        12      2            11
HARMAN D          11    12    12    11    11    12    12    12    12    11    13        3       14           3
VOORHEES EM       43    16    16    17    17    17    17    18    17    17    14        20      38           19
FLICKNER M        18    18    19    20    20    21    21    24    25    34    15        16      11           15
BATES MJ          15    17    17    15    16    16    16    15    15    12    16        78      81           78
CODD EF           14    14    14    16    18    19    20    22    28    38    17        74      23           74
BAEZAYATES R      34    43    45    48    47    50    53    54    55    55    18        6       16           6
FUHR N            19    19    20    19    19    18    18    19    18    21    19        2       9            2
JAIN AK           42    34    40    39    39    40    40    42    45    49    20        39      33           38
Ding, Y., Yan, E., Frazho, A., & Caverlee, J. (2009). PageRank for ranking authors in co-citation
networks. Journal of the American Society for Information Science and Technology, 60(11),
2229-2243.
Scatterplots
Topic-based Rank
(Productivity Rank)
Online
IR/Web IR
1956-1980
1981-1990
1991-2000
2001-2008
D.T. Hawkins, N.A. Stokolova, E.
Eisenbach, K. Yamanaka, T.
Radecki, R. Fugmann, J. Eyre,
D.H. Kraft, Z. Mazur, K. Hosono
P. Willett, S.P. Harter, C. Batt,
D. Ellis, M. Keen, S.E. Hocker, L.
Bronars, P.G. Enser, S.
Stigleman, B. Vickery
W.B. Croft, H.C. Chen, W.
Umstatter, C.A. Lynch, P. Martin,
D. Samson, N.J. Santora, C.
Womserhacker, N.J. Belkin, R.
Wagnerdobler
J. Han, D. Suciu, H.P. Kriegel, S.Y.
Su, K.L. Tan, G. Graefe, L. Wong,
L. Libkin, J.W. Su, P.Z. Revesz
M. Thelwall, C.C. Yang, A. Spink,
P. Jacso, I. Fourie, H.C. Chen, N.
Ford, H. Xie, G.G. Chowdhury, B.
Hjorland
Database and
Query
Processing
Evaluation
Medical IR
Multimedia
IR
G. Salton, A.G. Pickford, W.
Goffman, E. Garfield, G.K.
Thompson, W.S. Cooper, K.
Janda, F.W. Lancaster, R.
Fugmann, P. Willett
S.J. Martinez, M.G. Manzone,
C.M. Bowman, F.A. Landee, J.
Frome, I. Berghans, S.L. Visser, H.
Skolnik, Y.J. Lee, T.K.S. Engar
D.W. Stemple, R.H. Guting, A.
Sernadas, C. Katzeff, S.Y. Su, W.
Perrizo, J.S. Davis, C.T. Yu, B.S.
Goldshteyn, I.A. Macleod
C.L. Borgman, T. Radecki, G.
Salton, W.B. Croft, J.S. Ro, J.
Panyr, D.C. Blair, M.E. Maron, P.
Thompson, C.A. Lynch
J.Z. Li, F. Bry, H.J. Kim, D.
Papadias, K. Subieta, J. Van den
Bussche, D. Taniar, F. Geerts, M.
Song, Y.D. Chung
A. Spink, R.M. Losee, E. Levine, C.
Cole, P. Willett, W.R. Hersh, C.T.
Meadow, B. Hjorland, E. Garfield,
T. Cawkell
S.G. Aiken, I. Soutar, S. Barcza,
C.C. Tsai, W. Hersh, S.J.
Westerman, H.H. Emurian, L.L.
Consaul, H.J. Markowitsch, D.
Roberts
H.C. Chen, F. Crestani, A.K. Jain,
E. Wilhelm, J.I. Khan, B.S.
Manjunath, H.K. Kim, H.M. Wang,
S.F. Chang, S. Levialdi
R.N. Kostoff, U.J. Balis, G.
Eysenbach, R.B. Haynes, G.
Nilsson, H. Shatkay, N.L.
Wilczynski, C.R. Shyu, J.I.
Westbrook, G.O. Babnett
T.S. Huang, H.J. Zhang, G.J. Lu, J.
Li, C.C. Chang, E. Izquierdo, J.
Laaksonen, H. Burkhardt, C.J. Liu,
D. Ziou
Using the Author-Topic Modeling algorithm to rank authors on different topics
(Productivity Rank)
Data: Information Retrieval articles from 1956 to 2008 (15,367 papers with 350,750 citations)
Topic-based PageRank
(Citation Rank)
Online
IR/Web IR
1956-1980
I_PR
PR_t(.85)
PR_t(.5)
PR_t(.15)
D.T. Hawkins, N.A. Stokolova,
R.K. Summit, M.E. Williams, T.
Radecki, A. Macleodi, T.
Saracevic, R.S. Marcus, R.
Fugmann, C.T. Yu
G. Salton, D.T. Hawkins, M.E.
Williams, R.K. Summit, F.W.
Lancaster, A. Kent, N.A.
Stokolova, R. Fugmann, C.W.
Cleverdon, W.S. Cooper
D.T. Hawkins, N.A. Stokolova, R.
Fugmann, T. Radecki, G. Salton,
R.K. Summit, I.A. Macleod, J.
Farradane, M.E. Williams, A.M.
Rees
1981-1990
S.E. Robertson, D. Ellis, P.
Willett, P. Ingwersen, B.C.
Vickery, A.S. Pollitt, D.H. Kraft,
H.M. Brooks, A.F. Smeaton,
E.A. Fox
N.J. Belkin, W.B. Frakes, T.
Imielinski, G.W. Furnas, T.
Catarci, T. Kohonen, R.
Agrawal, S.K. Chang, H.C. Chen,
P. Valduriez
A. Spink, T. Saracevic, B.
Hjorland, S.E. Robertson, B.J.
Jansen, N.J. Belkin, E.M.
Voorhees, W.R. Hersh, P.
Ingwersen, P. Vakkari
G. Salton, A. Kent, M.E.
Williams, F.W. Lancaster, R.K.
Summit, D.T. Hawkins, C.W.
Cleverdon, D.B. Mccarn, W.S.
Cooper, H. Martint, C.P.
Bourne
G. Salton, A. Bookstein, S.E.
Robertson, T. Radecki, W.B.
Croft, C.J. Vanrijsbergen, C.T.
Yu, W.S. Cooper, P. Willett,
K.S. Jones
G. Salton, N.J. Belkin, S.E.
Robertson, S. Abiteboul, T.
Saracevic, C.J. Vanrijsbergen,
W.B. Croft, M.J. Bates, K.S.
Jones, D. Harman
G. Salton, A. Spink, N.J. Belkin,
T. Saracevic, S.E. Robertson, Y.
Rui, E.M. Voorhees, B.J. Jansen,
J.R. Smith, K.S. Jones
G. Salton, P. Willett, S.E.
Robertson, A. Bookstein, S.P.
Harter, W.B. Croft, T. Radecki,
C.J. Vanrijsbergen, C.T. Yu, D.
Ellis
G. Salton, N.J. Belkin, S.
Abiteboul, S.E. Robertson,
S.K. Chang, T. Saracevic, H.C.
Chen, C.J. Vanrijsbergen, W.B.
Croft, M.J. Bates
A. Spink, T. Saracevic, G.
Salton, H.C. Chen, B.J. Jansen,
B. Hjorland, N.J. Belkin, S.E.
Robertson, P. Vakkari, E.M.
Voorhees
P. Willett, S.P. Harter, D. Ellis,
S.E. Robertson, G. Salton, A.F.
Smeaton, P. Ingwersen, B.C.
Vickery, M.J. Bates, A.
Bookstein
H.C. Chen, N.J. Belkin, G.
Salton, S.K. Chang, N. Fuhr, T.
Saracevic, S.K.M. Wong, M.J.
Bates, S. Abiteboul, T. Catarci
1991-2000
2001-2008
A. Spink, H.C. Chen, B. Hjorland,
T. Saracevic, B.J. Jansen, P.
Vakkari, P. Borlund, S.E.
Robertson, F. Crestani, N.J.
Belkin
Ding, Y. (2011). Topic-based PageRank for author cocitation networks. Journal of the
American Society for Information Science and Technology, 62(3), 449-466.
Citing vs. Mentioning
Data: Full text JASIST journal articles (2000-2011), 866 articles and 32,496 references
Counting methods matter!
[Figure: scatterplot of CountX rank (y-axis, 0-400) against CountOne rank (x-axis, 0-100)]
Spearman r=0.589, p<0.01 (2-tailed)
Ding, Y., Liu, X., Guo, C., & Cronin, B. (2013). The distribution of references across texts:
Some implications for citation analysis. Journal of Informetrics, 7(3), 583-592.
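The contrast between the two counting methods can be illustrated with a small sketch. It assumes that CountOne credits a reference once per citing article and CountX credits it once per in-text mention; the input format, the function names, and the toy reference IDs are hypothetical. Spearman's correlation is computed by hand on average ranks so the block has no dependencies.

```python
from collections import Counter
from math import sqrt


def count_one_and_count_x(papers):
    """papers: list of lists, each holding the reference IDs mentioned in one
    citing article's full text, in order of mention (hypothetical format)."""
    count_one, count_x = Counter(), Counter()
    for mentions in papers:
        count_x.update(mentions)         # every in-text mention counts
        count_one.update(set(mentions))  # each reference counts once per article
    return count_one, count_x


def average_ranks(scores):
    """Rank keys by descending score; tied keys share the average rank."""
    ordered = sorted(scores, key=lambda k: -scores[k])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for key in ordered[i:j + 1]:
            ranks[key] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(scores_a, scores_b):
    """Spearman correlation = Pearson correlation of the two rank vectors.
    Assumes both inputs share the same keys and neither ranking is constant."""
    keys = sorted(scores_a)
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    xs, ys = [ra[k] for k in keys], [rb[k] for k in keys]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


toy_papers = [["r1", "r1", "r1", "r2"], ["r2", "r3"], ["r2", "r1", "r1"]]
one, x = count_one_and_count_x(toy_papers)
print(one, x, spearman(one, x))
```

In the toy data, r2 is cited by the most articles while r1 is mentioned most often, so the two rankings disagree at the top, which is the kind of divergence the scatterplot summarizes.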
Diversity Subgraph
Paths (current)
Tim Berners-Lee -> James Hendler -> V. S. Subrahmanian -> Laks V. S. Lakshmanan -> Jiawei Han (0.0653)
Tim Berners-Lee -> James Hendler -> Qiang Yang -> Hongjun Lu -> Jiawei Han (0.0513)
Tim Berners-Lee -> James Hendler -> Qiang Yang -> Bing Liu -> Jiawei Han (0.0428)
Tim Berners-Lee -> James Hendler -> Qiang Yang -> Ke Wang -> Jiawei Han (0.0428)
Tim Berners-Lee -> James Hendler -> Qiang Yang -> Jian Pei -> Jiawei Han (0.0428)
The constraint-based subgraph between Jiawei Han & Tim Berners-Lee with constraint on James Hendler
He, B., Ding, Y., Tang, J., Reguramalingam, V., & Bollen, J. (2013). Mining diversity subgraph in
multidisciplinary scientific collaboration networks: A meso perspective. Journal of
Informetrics, 7(1), 117-128
Institutions
Data: A total of 50,920 LIS articles written by 42,991 researchers published during 1955
to 2009
Country
He, B., Ding, Y., Yan, E. (2012). Mining patterns of author orders in scientific
publications. Journal of Informetrics, 6(3), 359-367
Outline
• Assessing the credibility of researchers and
institutions
• Identifying significant scientific advancement/
emerging trends
• Other Research Analytics
• Tool: Data2Knowledge Platform
• Next Steps
Mapping the field of IR
Author Co-citation Map in the field of Information Retrieval (1992-1997)
Data: 1,466 IR-related papers were selected from 367 journals with 44,836 citations.
Ding, Y., Chowdhury, G. and Foo, S. (1999). Mapping Intellectual Structure of
Information Retrieval: An Author Cocitation Analysis, 1987-1997, Journal of
Information Science, 25(1): 67-78.
Popular Topics in IR
Data: Information Retrieval articles from 1956 to 2008 (15,367 papers with 350,750 citations)
Popular Topics in IR
Online IR/Web
IR
1956-1980
1981-1990
1991-2000
2001-2008
system, online, language,
theory, query, computerized,
thesaurus, evaluation,
semantic, bibliography
online, systems, text, concepts,
reference, principles,
proceedings, practice,
knowledge, services
system, web, knowledge,
database, data, query, design,
text, management,
distributed
web, search, digital,
searching, knowledge,
system, query, user, model,
internet
query, language, query-processing, database, relational,
system, distributed, data,
database-system, comparison
query, database, databases,
data, object-oriented,
queries, processing,
relational, model, language
query, data, xml,
processing, queries,
databases, database,
efficient, web, querying
systems, document, full-text,
model, evaluation, fuzzy,
effectiveness, search, user,
expert
text, evaluation, systems,
searching, search, online,
relevance, library, user,
hypertext
Database and
Query
Processing
Evaluation
system, document, storage,
evaluation, data, automatic,
model, relevance, indexing,
online
Medical IR
system, data, storage,
computerized, chemical,
medical, literature, biomedical,
evaluation, management
Multimedia IR
database, medical, system,
clinical, patient, management,
health, identification,
automated, optical
database, medical, health,
clinical, management,
search, design, study,
support, knowledge
image, content-based,
system, indexing, databases,
multimedia, images, visual,
video, color
image, content-based,
learning, images,
relevance, color, feedback,
video, semantic, similarity
Ding, Y. (2011). Topic-based PageRank for author cocitation networks. Journal of the
American Society for Information Science and Technology, 62(3), 449-466.
Evolution of topics and communities
Li, D., Ding, Y., Sugimoto, C., He, B., Tang, J., Yan, E., Lin, N., Qin, Z. & Dong, T. (2011). Modeling
Topic and Community Structure in Social Tagging: the TTR-LDA-Community Model. Journal of
the American Society for Information Science and Technology, 62(9), 1849-1866.
Topic evolution in IR
Data: Information retrieval (IR) papers from Scopus for 2001-2007 (2001-2003,
12,194; 2004-2005, 19,145; and 2006-2007, 21,423).
Community Matching
Communities and Topics in IR
Yan, E., Ding, Y., Milojevic, S., & Sugimoto, C. R. (2012). Topics in dynamic research
communities: An exploratory study for the field of information retrieval. Journal of
Informetrics, 6(1), 140-153.
Lead-Lag Analysis
Data: astrophysics papers collected from WoS (166,191) and arXiv (117,913) over 20
years (1992-2011).
Lead-Lag Patterns
Hu, B., Dong, X., Zhang, C., Bowman, T., Ding, Y., Yan, E., Milojevic, S., Ni, C., & Lariviere, V.
(forthcoming). A lead-lag analysis of the topic evolution patterns for preprints and
publications. Journal of the Association for Information Science and Technology
Topic popularity
Mathematical Measures for Topic Popularity f’(t): Gaining Popularity; Losing
Popularity; Regaining Popularity; Duration of Popularity
Most topics in arXiv and WoS affect each other's popularity: when a topic becomes
popular in either arXiv or WoS, it becomes popular in the other in less than 4 years.
Very few topics become popular in one channel without becoming popular in the
other. Only 10 topics have a lead time between arXiv and WoS longer than 10 years
(i.e., the topic becomes popular in arXiv or WoS but is not popular in the other
channel in the next 10 years).
This work suggests that open access preprints have a stronger growth tendency
than traditional printed publications in astrophysics.
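A rough sketch of how such popularity measures might be computed from yearly topic shares. The slide only names the measures, so everything here is an assumption for illustration: f'(t) is approximated by the year-over-year difference, "popular" means the share reaches a chosen threshold, and the function names and toy series are hypothetical.

```python
def popularity_phases(series):
    """series: list of (year, share) pairs in chronological order.
    Labels each year by the sign of the first difference, a discrete
    stand-in for f'(t)."""
    phases = []
    for (_, prev_share), (year, share) in zip(series, series[1:]):
        delta = share - prev_share
        label = "gaining" if delta > 0 else "losing" if delta < 0 else "flat"
        phases.append((year, label))
    return phases


def first_popular_year(series, threshold):
    """First year in which the topic share reaches the threshold, else None."""
    for year, share in series:
        if share >= threshold:
            return year
    return None


def lead_time(leader, follower, threshold):
    """Years by which `leader` becomes popular before `follower`.
    Negative if the follower is actually first; None if either channel
    never reaches the threshold."""
    a = first_popular_year(leader, threshold)
    b = first_popular_year(follower, threshold)
    if a is None or b is None:
        return None
    return b - a


arxiv = [(2000, 0.01), (2001, 0.03), (2002, 0.06), (2003, 0.05)]
wos = [(2000, 0.01), (2001, 0.01), (2002, 0.02), (2003, 0.05)]
print(popularity_phases(arxiv))
print(lead_time(arxiv, wos, threshold=0.05))
```

In this toy example the topic crosses the threshold in arXiv one year before WoS, so arXiv leads by one year.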
Subject Categories
Our Data
Scopus data
Journal citation
15 Years, 5 slices
18,500+ Journals
4+ Million links
13+ Million Citations
Yan, E., Ding, Y., Cronin, B., & Leydesdorff, L. (2013). A bird's-eye view of scientific trading:
Dependency relations among fields of science. Journal of Informetrics, 7(2), 249-264.
Our Metaphor
Exported knowledge: fastest growths
Export/import ratio: highest ratios
Source of incoming citations (who cites you)
Source of outgoing citations (whom you cite)
Shortest path length: lowest shortest paths
Shortest path matrix (2011 data): knowledge source (from) by knowledge destination (to)
Critical knowledge path (2011 data)
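The trading metaphor can be made concrete with a small sketch. It assumes that a cited field "exports" knowledge to the citing field, so a field's exports are the citations it receives from other fields and its imports are the citations it gives to other fields. The function name, the input format, and the toy field names and counts are hypothetical.

```python
def trade_balance(citations):
    """citations: dict mapping (citing_field, cited_field) to a citation count.
    Returns {field: (exports, imports, export/import ratio)}.
    Self-citations within a field are ignored as internal trade."""
    exports, imports = {}, {}
    for (citing, cited), count in citations.items():
        if citing == cited:
            continue
        exports[cited] = exports.get(cited, 0) + count    # who cites you
        imports[citing] = imports.get(citing, 0) + count  # whom you cite
    balance = {}
    for field in set(exports) | set(imports):
        out_flow = exports.get(field, 0)
        in_flow = imports.get(field, 0)
        ratio = out_flow / in_flow if in_flow else float("inf")
        balance[field] = (out_flow, in_flow, ratio)
    return balance


toy = {
    ("Medicine", "Biology"): 30,
    ("Biology", "Chemistry"): 20,
    ("Biology", "Medicine"): 10,
    ("Biology", "Biology"): 100,
}
for field, row in sorted(trade_balance(toy).items()):
    print(field, row)
```

A ratio above 1 marks a net exporter of knowledge and a ratio below 1 a net importer; a field that cites no other field gets an infinite ratio in this sketch.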
Outline
• Assessing the credibility of researchers and
institutions
• Identifying significant scientific advancement/
emerging trends
• Other Research Analytics
• Tool: Data2Knowledge Platform
• Next Steps
Next Generation of Bibliometrics
• Newly developed methods allow in-depth
analysis of scholarly communication
– Topic modeling (e.g., Latent Dirichlet Allocation)
– Information Extraction (e.g., Entity Extraction)
– Social Network Analysis (e.g., Community Detection)
• Big data demonstrates the power of connected
data to enable knowledge discovery
– Structured data
– Unstructured data
– Social media data
Bibliometrics and Beyond
Content-based Impact Analysis
• There are two levels:
– Syntactic level (position)
• Papers cited in different sections of the articles
• How many times papers are mentioned in one
article
– Semantic level (semantics)
• Citation Sentiment Analysis: sentence level,
window-size
• Concept-based: knowledge concept level (e.g.,
topic, knowledge unit/entity, or bio-entities)
Ding, Y., Song, M., Wang, X., Zhang, G., Zhai, C., & Chambers, T. (2014). Content-based
citation analysis: The next generation of citation analysis. Journal of the American Society
for Information Science & Technology, 65(9), 1820-1833.
Semantic Level
• Concepts
– Topics
– Major entities in research
• Bio entities
• Knowledge unit (e.g., domain theories, well-established algorithms)
• Why not keywords?
– Ambiguous literal words
– Not normalized
– But can be a starting point to extract concepts.
– Concept (keywords, synonyms (students vs. pupils), antonyms
(birth vs. death), homonyms (pupil (student) vs. pupil (part of
eye)), etc.)
Jeong, Y., Song, M., & Ding, Y. (2014). Content-based Author co-citation analysis. Journal of
Informetrics, 8(1), 197-211.
Topic Modeling
• IR papers (1956-2008)
(No. of nodes, No. of edges)   Coauthorship network   Citation network
1956-1980 (Phase 1)            (930, 4256)            (6054, 11192)
1981-1990 (Phase 2)            (961, 2252)            (5978, 17084)
1991-2000 (Phase 3)            (6650, 24184)          (36411, 171814)
2001-2008 (Phase 4)            (13640, 63140)         (62636, 444203)
collaboration strength of productive authors within topics
% of collaboration   Direct Collaboration   Indirect Collaboration   Loose Collaboration   No Collaboration
1956-1980            0.70%                  0.11%                    0                     99.19%
1981-1990            0.32%                  0                        0                     99.68%
1991-2000            1.54%                  8.46%                    2.76%                 87.24%
2001-2008            1.79%                  36.84%                   11.05%                50.32%
Indirect: path length <= 6; loose: path length > 6
collaboration strength of productive authors across topics
% of collaboration   Direct Collaboration   Indirect Collaboration   Loose Collaboration   No Collaboration
1956-1980            0                      0.08%                    0                     99.92%
1981-1990            0.05%                  0.03%                    0                     99.93%
1991-2000            0                      2.01%                    2.16%                 95.83%
2001-2008            0.13%                  29.65%                   13.83%                56.40%
citation strength of productive authors within topics
% of citation   Direct Citation   Indirect Citation   Loose Citation   No Citation
1956-1980       3.31%             12.69%              3.10%            80.91%
1981-1990       3.89%             20.54%              5.28%            70.29%
1991-2000       7.28%             41.90%              7.36%            43.47%
2001-2008       9.47%             69.37%              9.79%            14.95%
Indirect: path length <= 3; loose: path length > 3
citation strength of productive authors across topics
% of citation   Direct Citation   Indirect Citation   Loose Citation   No Citation
1956-1980       0.98%             13.18%              18.35%           72.49%
1981-1990       0.83%             13.90%              8.80%            76.47%
1991-2000       1.20%             49.39%              8.51%            40.89%
2001-2008       1.45%             81.15%              4.05%            13.35%
Ding, Y. (2011). Scientific collaboration and endorsement: Network analysis of coauthorship
and citation networks. Journal of Informetrics, 5(1), 187-203
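The direct / indirect / loose / no collaboration categories in the tables above can be derived from shortest path lengths in the coauthorship network. The sketch below assumes an undirected, unweighted network stored as an adjacency dict and uses the thresholds stated on the slide (indirect: path length <= 6, loose: > 6); the function names and the "self" label for identical authors are hypothetical.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges on a shortest path,
    or None if the two authors are not connected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def collaboration_type(graph, a, b, indirect_max=6):
    """Classify an author pair by coauthorship distance."""
    dist = shortest_path_length(graph, a, b)
    if dist is None:
        return "none"
    if dist == 0:
        return "self"
    if dist == 1:
        return "direct"
    return "indirect" if dist <= indirect_max else "loose"


# Toy network: a chain n0 - n1 - ... - n8 plus one isolated author x.
names = [f"n{i}" for i in range(9)]
toy = {name: set() for name in names}
for u, v in zip(names, names[1:]):
    toy[u].add(v)
    toy[v].add(u)
toy["x"] = set()
print([collaboration_type(toy, "n0", other) for other in ("n1", "n6", "n7", "x")])
```

For the citation tables the same classification would apply with `indirect_max=3`, following the note on that slide.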
EntityMetrics
Entitymetrics is defined as using entities (i.e., evaluative entities or knowledge
entities) in the measurement of impact, knowledge usage, and knowledge
transfer, to facilitate knowledge discovery.
Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics:
Measuring the impact of entities. PLoS One, 8(8): 1-14.
EntityMetrics
[Figure: PubMed entities (drug, disease, protein, pathway, gene) connected by cite relations]
Entity Graph
• Heterogeneous Entity Graph
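[Figure: heterogeneous entity graph with node types Drug, Disease, Protein, Pathway, and Gene; example nodes: Bcl-2 Inhibitor, Metformin, Diabetes, Cancer, Breast Cancer, p53, STAT3, AMPK, PI3K; edge labels: ci (cite), co (co-occur), ci/co]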
Metformin related entity-entity
citation network
Data: 4,770 articles retrieved from PubMed Central with 134,844 references, and
1,969 bio-entities (i.e., 880 genes, 376 drugs, and 713 diseases)
Entity Citation Network vs. Entity Co-Occurrence Network
• Gene-Gene Co-Occurrence Network (GG) vs. Gene-Cite-Gene Network (GCG)
– The GCG network shares many genes with the GG network and,
as a result, is a competitive complement to the GG network.
– Using gene relationships based on citation relations relaxes the
assumption that gene interaction is limited to the same article
and opens up a new opportunity to analyze gene interaction
across a wider spectrum of datasets.
– 1,149 gene pairs from GCG were found in GG. A total of 164
of the 1,149 pairs appeared in GCG before 2005 but did not
appear in GG before 2005. In particular, the PARK2 and PINK1
gene pair ranks fifth by co-occurrence frequency in the GG
network, implying that the pair has been studied intensively
since 2005.
Song, M., Han, N., Kim, Y., Ding, Y., & Chambers, T. (2013). Discovering implicit entity
relation with the gene-citation-gene network. PLoS One, 8(12), e84639
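The difference between the two networks can be sketched as follows. This is an illustration only: it assumes each paper has already been annotated with the genes it mentions, treats gene pairs as unordered, and uses hypothetical function names and toy paper IDs. A GG edge links genes mentioned in the same paper, while a GCG edge links a gene in a citing paper to a gene in the paper it cites.

```python
from itertools import combinations, product


def cooccurrence_pairs(paper_genes):
    """Gene pairs mentioned together in at least one paper (GG network)."""
    pairs = set()
    for genes in paper_genes.values():
        pairs.update(frozenset(p) for p in combinations(sorted(genes), 2))
    return pairs


def citation_pairs(paper_genes, citations):
    """Gene pairs linked through a paper-to-paper citation (GCG network).
    citations: iterable of (citing_paper, cited_paper) tuples."""
    pairs = set()
    for citing, cited in citations:
        citing_genes = paper_genes.get(citing, ())
        cited_genes = paper_genes.get(cited, ())
        for g1, g2 in product(citing_genes, cited_genes):
            if g1 != g2:
                pairs.add(frozenset((g1, g2)))
    return pairs


toy_genes = {"p1": {"PARK2"}, "p2": {"PINK1"}, "p3": {"TP53", "MDM2"}}
toy_citations = [("p2", "p1")]
gg = cooccurrence_pairs(toy_genes)
gcg = citation_pairs(toy_genes, toy_citations)
print(sorted(sorted(pair) for pair in gcg - gg))
```

In the toy data PARK2 and PINK1 never share a paper, yet the citation from p2 to p1 connects them, so the pair appears in GCG only. Adding publication years to the papers would allow the "in GCG before GG" comparison described above.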
Big Data in Life Sciences
• There is now an incredibly rich resource of public information relating compounds, targets,
genes, pathways, and diseases. Just for starters, the public domain holds information on:
– 69 million compounds and 449,392 bioassays (PubChem)
– 59 million compound bioactivities (PubChem Bioassay)
– 4,763 drugs (DrugBank)
– 9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
– 14 million human nucleotide sequences (EMBL)
– 22 million life sciences publications - 800,000 new each year (PubMed)
– Multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …)
• Even more important are the relationships between these entities. For example, a chemical
compound can be linked to a gene or a protein target in a multitude of ways:
– Biological assay with percent inhibition, IC50, etc.
– Crystal structure of ligand/protein complex
– Co-occurrence in a paper abstract
– Computational experiment (docking, predictive model)
– Statistical relationship
– System association (e.g. involved in same pathways, cellular processes)
Wild, D. J., Ding, Y., Sheth, A. P., Harland, L., Gifford, E. M., & Lajiness, M. S. (2012). System
chemical biology and the Semantic Web: What they mean for the future of drug discovery
research. Drug Discovery Today (impact factor=6.422), 17(9-10), 469-474.
[Figure: data formats (Text, CSV, Table, HTML, XML) and entity levels (Patient, Disease, Tissue, Cell, Pathway, DNA, RNA, Protein, Drug)]
Chem2Bio2RDF
• NCI Human Tumor Cell Lines Data
• PubChem Compound Database
• PubChem Bioassay Database
• PubChem Descriptions of all PubChem bioassays
• Pub3D: A similarity-searchable database of minimized 3D structures for PubChem compounds
• Drugbank
• MRTD: An implementation of the Maximum Recommended Therapeutic Dose set
• Medline: IDs of papers indexed in Medline, with SMILES of chemical structures
• ChEMBL chemogenomics database
• KEGG Ligand pathway database
• Comparative Toxicogenomics Database
• PhenoPred Data
• HuGEpedia: an encyclopedia of human genetic variation in health and disease.
31m chemical structures
59m bioactivity data points
3m/19m publications
~5,000 drugs
Chen, B., Dong, X., Jiao, Dazhi, Wang, H., Zhu, Q., Ding, Y. and Wild, D. (2010).
Chem2Bio2RDF: A semantic framework for linking and mining chemogenomic and systems
chemical biology data. BMC Bioinformatics, 2010, 11, 255.
[Figure: Chem2Bio2RDF architecture; components shown: RDF triple store, SPARQL endpoints, dereferenceable URI, browsing, PlotViz visualization, Cytoscape plugin, linked path generation and ranking, third party tools, and linked sources Bio2RDF, LODD, uniprot, and others]
Chen, B., Ding, Y., & Wild, D. J. (2012). Improving integrative searching of systems
chemical biology data using semantic annotation. Journal of Cheminformatics, 4:6
(doi:10.1186/1758-2946-4-6).
SEMANTIC GRAPH MINING: PATH FINDING ALGORITHM
Dijkstra’s algorithm
[Figure: example graph with nodes numbered 1-26]
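For reference, a minimal sketch of Dijkstra's algorithm on a small weighted graph. The node names and edge weights are made up for illustration and are not taken from the slide's figure; the graph is a dict of dicts with non-negative edge weights.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source in a graph with non-negative weights.
    graph: {node: {neighbour: weight}}. Returns (distances, predecessors)."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    return dist, prev


def path_to(prev, source, target):
    """Rebuild the path from the predecessor map; None if unreachable."""
    path = [target]
    while path[-1] != source:
        if path[-1] not in prev:
            return None
        path.append(prev[path[-1]])
    return path[::-1]


toy = {
    "compound": {"substructure": 1, "target": 4},
    "substructure": {"target": 1, "disease": 5},
    "target": {"disease": 1},
    "disease": {},
}
distances, predecessors = dijkstra(toy, "compound")
print(distances["disease"], path_to(predecessors, "compound", "disease"))
```

In a semantic graph the edge weights could encode how reliable or specific each link type is, which is one way path ranking might be layered on top of plain path finding.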
Bio-LDA
• Latent Dirichlet Allocation (LDA)
– The core of the group of powerful
statistical modeling techniques for
automated extraction of latent topics
from large document collections
• Bio-LDA
– Extended LDA model with Bio-terms
as latent variable
– Bio-terms: compound, gene, drug,
disease, protein, side effect,
pathways
– Calculate bio-term entropies over topics
– Use the Kullback-Leibler divergence as the non-symmetric
distance measure for two bio-terms over topics
Example: Topic 10
Apply Bio-LDA on 336,899 PubMed article abstracts in 2009 and extract 50 topics
Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J. & Wild, D. (2011): Finding complex biological
relationships in recent PubMed articles using Bio-LDA. PLoS One 6(3): e17243.
doi:10.1371/journal.pone.0017243.
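The two measures named above have standard definitions, sketched here for a bio-term's probability distribution over topics. The topic distributions in the example are invented; in practice they would come from the fitted Bio-LDA model. The small `eps` floor in the divergence is an assumption added to avoid division by zero.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a distribution over topics.
    Low entropy means the bio-term is concentrated in few topics."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q). It is not symmetric:
    D(p || q) generally differs from D(q || p)."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


focused_term = [0.9, 0.1]  # hypothetical topic distribution of one bio-term
spread_term = [0.5, 0.5]   # hypothetical topic distribution of another
print(entropy(focused_term), entropy(spread_term))
print(kl_divergence(focused_term, spread_term),
      kl_divergence(spread_term, focused_term))
```

Two bio-terms with a small divergence in both directions have similar topic profiles, which is what makes the measure useful for suggesting related terms.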
Thiazolidinediones (TZDs) – revolutionary treatment for type II Diabetes
Troglitazone (Rezulin): withdrawn in 2000 (liver disease)
Rosiglitazone (Avandia): restricted in 2010 (cardiac disease)
Rosiglitazone bound into PPAR-γ
Pioglitazone: ???? (does decrease blood sugar levels, was associated with bladder tumors and
has been withdrawn in some countries.)
PPARG: TZD target
SAA2: Involved in inflammatory response implicated in
cardiovascular disease (Current Opinion in Lipidology, 15(3), 269-278, 2004)
APOE: Apolipoprotein E3 essential for lipoprotein catabolism.
Implicated in cardiovascular disease.
ADIPOQ: Adiponectin involved in fatty acid metabolism.
Implicated in metabolic syndrome, diabetes and cardiovascular
disease
CYP2C8: Cytochrome P450 present in cardiovascular tissue and
involved in metabolism of xenobiotics
CDKN2A: Tumor suppression gene
SLC29A1: Membrane transporter
Semantic Prediction
http://chem2bio2rdf.org/slap
Chen, B., Ding, Y., & Wild, D. (2012). Assessing Drug Target Association using Semantic
Linked Data. PLoS Computational Biology, 8(7): e1002574.
doi:10.1371/journal.pcbi.1002574.
Example: Troglitazone and PPARG
Association score: 2385.9
Association significance: 9.06 × 10^-6 =>
missing link predicted
Topology is important for association
[Figure: two example path topologies connecting Cmpd 1 and Protein 1 through Cmpd 2, using hasSubstructure and bind edges]
Semantics is important for association
[Figure: example paths between Cmpd 1 and Protein 1 that use different edge types: bind, hasSubstructure, hasSideEffect (e.g., hypertension), hasGO, and PPI, via Cmpd 2 or Protein 2]
SLAP Pipeline
Path filtering
Cross-check with SEA
• SEA analysis (Nature 462, 175-181,
2009) predicts 184 new
compound-target pairs, 30 of
which were experimentally tested
• 23 of these pairs were
experimentally validated (<15uM)
including 15 aminergic GPCR
targets and 8 which crossed major
receptor classification boundaries
• 9 of the aminergic GPCR target
pairings were correctly predicted
by SLAP (p<0.05); for the other 6,
the compounds were not present in
our set
• 1 of the 8 cross-boundary pairs
was predicted
Assessing drug similarity from
biological function
• Took 157 drugs with 10 known
therapeutic indications, and created
SLAP profiles against 1,683 human
targets
• Pearson correlation between profiles
> 0.9 from SLAP was used to create
associations between drugs
• Drugs with the same therapeutic
indication unsurprisingly cluster
together
• Some drugs with similar profile have
different indications – potential for
use in drug repurposing?
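The profile comparison described above can be sketched as follows. Each drug is represented by a vector of association scores against the same ordered list of targets, and two drugs are linked when the Pearson correlation of their profiles exceeds 0.9. The drug names, the profile values, and the function names are invented for the example; real SLAP profiles would cover 1,683 targets.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric profiles.
    Returns 0.0 when either profile is constant (an assumption for safety)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def similar_drug_pairs(profiles, threshold=0.9):
    """Drug pairs whose target-association profiles correlate above threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) > threshold
    ]


toy_profiles = {
    "drug_a": [1.0, 2.0, 3.0, 4.0],
    "drug_b": [2.0, 4.0, 6.0, 8.5],
    "drug_c": [4.0, 3.0, 2.0, 1.0],
}
print(similar_drug_pairs(toy_profiles))
```

Linked pairs that carry different therapeutic indications are the candidates the slide flags for possible repurposing.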
Outline
• Assessing the credibility of researchers and
institutions
• Identifying significant scientific advancement/
emerging trends
• Other Research Analytics
• Tool: Data2Knowledge Platform
• Next Steps
Data2Knowledge platform…
Data2Knowledge
AMiner: Mining knowledge from articles:
• Researcher profiling
• Expert search
• Topic analysis
• Reviewer suggestion
PMiner: Mining knowledge from patents:
• Competitor analysis
• Company search
• Patent summarization
SLAP: Mining drug discovery data
• Predicting targets
• Repurposing drugs
• Heterogeneous graph search
...: Mining more data…
AMiner
• Research profiling
• Integration
• Interest analysis
• Topic analysis
• Course search
• Expert search
• Association
• Disambiguation
• Suggestion
• Geo search
• Collaboration recommendation
Researchers: 31,222,410
Publications: 69,962,333
Conferences/Journals: 330,236
Citations: 133,196,029
Knowledge Concepts: 7,854,301
Expert Search
Basic Info.
Citation statistics
Research Interests
Publications
Social Network
Expertise
Search
Finding top experts,
top conferences,
and highly cited papers
for “data mining”
Geographic Search
Finding the hottest
regions for
“data mining”
Conference
Analysis
Which year is the most
successful year in the KDD’s
history?
Who are the highly cited
authors?
What is author nationality
distribution for the highly
cited KDD papers in the
past years?
Reviewer
Suggestion
Interest matching
COI avoidance
Load balancing
Forecast review quality
Cross-domain
Collaboration
Recommendation
What are the cross-domain
topics on which you can
work in the target domain?
Who are the best
collaborators on each of
these topics?
Topic Browser
200 topics have been discovered
automatically from the academic
articles
Academic Performance Measure
Academic Statistics
New Stars
Widely used
The largest publisher:
Elsevier
Conferences
KDD 2010
KDD 2011
KDD 2012
KDD 2013
KDD 2014
WSDM 2011
WSDM 2014
ICDM 2011
ICDM 2012
SocInfo 2011
ICMLA 2011
WAIM 2011
etc.
……
What is PMiner?
• Current patent analysis systems
focus on search
– Google Patent, WikiPatent,
FreePatentsOnline
• PMiner is designed for an in-depth
analysis of patent activity at the
topic-level
– Topic-driven modeling of patents
– Heterogeneous network co-ranking
– Intelligent competitive analysis
– Patent summarization
* Patent data:
> 3.8M patents
> 2.4M inventors
> 400K companies
> 10M citation relationships
* Journal data:
> 2k journal papers
> 3.7k authors
The crawled data is increasing
to >300 Gigabytes.
J. Tang, B. Wang, Y. Yang, P. Hu, Y. Zhao, X. Yan, B. Gao, M. Huang, P. Xu, W. Li, and A. K. Usadi. PatentMiner: Topic-driven Patent Analysis and
Patent Search
Topics of search
results
Top Patents
Top Inventors
Top Companies
Topic-based Analysis
for “Microsoft”
• A court decision in 08/2012: Samsung’s Galaxy
smartphone infringed a series of Apple’s iPhone
patents. Besides 4 appearance design patents, the
decision covered 3 software patents, the so-called
381, 915, and 163 patents, which respectively
cover “bounce back”, “pinch-to-zoom”, and
“tap-to-zoom”.
• These 3 software patents all belong to the
following three patent categories: active solid-state
devices (touch screen), computer
graphics processing (graph scaling), and
selective visual display systems (tap to select).
Demo
Y. Yang, J. Tang, J. Keomany, Y. Zhao, Y. Ding, J. Li, and L. Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks.
Outline
• Assessing the credibility of researchers and
institutions
• Identifying significant scientific advancement/
emerging trends
• Other Research Analytics
• Tool: Data2Knowledge Platform
• Next Steps
Challenges for Identifying Emerging
Trends
• Paradigm-changing discoveries have notoriously limited early impact because
the more a discovery deviates from the current paradigm, the longer it takes
to be appreciated by the community – Dashun Wang, Chaoming Song, and
Albert-László Barabási. 2013. Quantifying Long-Term Scientific Impact. Science
342 (6154). Three variables:
– Preferential attachment (# of citations)
– Citation decay (aging of citations)
– Community recognition (scholarly conformity, controlled by domain leaders)
• Domain leader recognition is critical (peer review or who cites this article)
• Long-term citation (t -> infinity): the # of citations is decided only by
community recognition
• Highly innovative papers usually still cite conventional approaches or
knowledge. For example, Newton presented his laws of gravitation using
geometry rather than calculus, and Darwin's Origin of Species used
conventional examples such as the breeding of dogs. – Brian Uzzi et al.
(2013), Atypical combinations and scientific impact, Science, 342, 468
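The point about long-term citations can be illustrated with the closed-form citation curve usually quoted from the Wang, Song, and Barabási paper: cumulative citations c(t) = m (exp(lambda * Phi((ln t - mu) / sigma)) - 1), where Phi is the standard normal cumulative distribution. The sketch below assumes that form; the parameter names, the default m = 30, and the example values are choices made for illustration, with lambda standing for community recognition, mu for how quickly citations arrive, and sigma for how slowly they decay.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cumulative_citations(t, fitness, immediacy, longevity, m=30.0):
    """Expected cumulative citations t years after publication (t > 0).
    fitness = lambda, immediacy = mu, longevity = sigma (assumed names)."""
    phi = normal_cdf((math.log(t) - immediacy) / longevity)
    return m * (math.exp(fitness * phi) - 1.0)


def ultimate_impact(fitness, m=30.0):
    """Limit of cumulative_citations as t goes to infinity.
    Only the fitness (community recognition) parameter remains."""
    return m * (math.exp(fitness) - 1.0)


for years in (1, 5, 10, 30):
    print(years, round(cumulative_citations(years, 2.0, 1.0, 1.0), 1))
print("ultimate:", round(ultimate_impact(2.0), 1))
```

Changing the immediacy or longevity parameters shifts when the citations arrive but not the limiting total, which matches the statement that long-term citations are decided only by community recognition.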
Proposed Approaches
• Step 1: identify early features of scientific
innovation, for example, by studying the features of
highly cited articles in their first 10 years:
– yearly citation patterns, citation increase rates
(popularity vs. prestige)
– citation (author, venue), reference (author, venue), coauthor, venue (z-score)
– adding topics (transdisciplinary vs. within discipline)
– collaboration: transdisciplinary team-authored
articles, features for high impact and high innovation
(z-score)
Proposed Approaches
• Step 2: Building mathematical models
– Categorizing learned features,
– Identifying variables
– Building math models
– Test and evaluation
• Step 3: Supervised machine learning methods
High impact articles vs.
low impact articles
High impact patent vs.
low impact patent
Knowledge is power!
Acknowledgements
Thanks to all the collaborators:
David Wild , Kyle Stirling, Judy Qiu (Indiana)
Jie Tang, Juanzi Li, Jing Zhang, Zhanpeng Fang, Yang Yang (Tsinghua)
Jim Walson (Panoscopix)
Bing He (Johns Hopkins)
Chengxiang Zhai, Chi Wang, Brian Foote (UIUC)
Eric M. Gifford, Huijun Wang (Merck)
Bin Chen (Stanford)
Michael S. Lajiness (Eli Lilly)