SCIENCE MAPS SI767 – W10 – Matthew P. Simmons

SCIENCE MAPS
SI767 – W10 – Matthew P. Simmons
Overview
• What are Science Maps?
o Various definitions.
o Usage and utility.
• What techniques are used?
o Types of techniques.
o Overview of certain techniques.
• What has been done on the topic?
o Review of papers.
o The small world of these readings.
What are Science Maps?
...and why do we care?
In the reading...
Science Maps are ...
• Topic models
o Detecting the finite number of themes/topics that characterize the
content of a knowledge domain.
• Scientometric analysis of the provenance of ideas
o Tracking the memetic flow in scientific literature by analyzing the
pattern of citations and collaborations.
• Models of the evolution of bibliometric networks
o Finding the parameters that generate networks with similar
structure to citation and coauthorship networks, and determining
why those parameters matter.
Various techniques/approaches;
Common theme
Find the hidden ontologies that organize data.
Organize scientific data by:
• Topic
o "Which articles are about the same thing as this one?"
• Influence
o "What are the five most important papers in topic modeling?"
• Provenance
o "Where are Bayesian inference theories coming from, and who is
using them?"
• "Hotness" of a field
o "Where are the research dollars in NLP today?"
Usage and Utility
• Enhance the ability to navigate data.
• Identify potential collaborators.
• Determine the impact of an author or paper.
• Identify the problems that need solving (and where the
money is...)
• Create tools to allow our allocation of attention to efficiently
scale with the massive increase of data.
• "Revealing implicit knowledge that is presently known only
to domain experts..." (Shiffrin et al. 2004)
What Techniques are Used?
Hint: Lots!
Answer: Lots!
• Support Vector Machines
• Clustering
• Latent Dirichlet Allocation
• Latent Semantic Analysis
• Mixture Models
• Network modeling
• Network analysis
• Network visualization
• Bibliometric analysis
...and that's just what's mentioned in Shiffrin et al.!
In the readings...
• Markov Chain Monte Carlo – Griffiths et al.
• Latent Dirichlet Allocation – Blei et al. & Griffiths et al.
• Pathfinder [Networks|Scaling] – Chen
• Bayesian Networks – Blei et al. & Griffiths et al.
• Bibliometric networks – Garfield, Börner et al. & Chen
...and more, but there is an easy way to group them.
Two main categories
Statistical models/methods
• Markov Chain
• Monte Carlo method
• Dirichlet Distribution
• Bayesian Inference
• Latent Dirichlet Allocation
Network models/methods
• Degree distributions
• Centrality measures
• Scale free networks
• Small world networks
• Clustering coefficients
• Pathfinder scaling
Stats Stuff
Everyone got all that?
Just kidding...
Let's start with Markov Chains
Markov Chains
• A system that moves among a set of possible states in
discrete steps.
• Each change of state is governed by transition probabilities
that satisfy the Markov property.
• The Markov property states that the state of the system at
time n+1 depends on the state of the system at time n,
but not on its state at any time < n. Hence the immediately
previous state is the only factor that determines the next
state of the system.
• Example: A random walk on the number line with an equal
probability of moving +1 or -1 at each step.
Largely from: http://en.wikipedia.org/wiki/Markov_chain
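The random walk example can be sketched in a few lines of Python. This is only an illustration of the Markov property, not something taken from the readings; the function name random_walk and its parameters are made up for this sketch.

```python
import random


def random_walk(n_steps, start=0, seed=None):
    """Simulate a symmetric random walk on the integers."""
    rng = random.Random(seed)
    position = start
    path = [position]
    for _ in range(n_steps):
        # Markov property: the next position depends only on the
        # current position, never on how the walk got there.
        position += rng.choice((-1, 1))
        path.append(position)
    return path


path = random_walk(n_steps=10, seed=42)
print(path)
```

Each entry in the returned list differs from the previous one by exactly 1, and the rule for picking the next entry never looks further back than the current one.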
Monte Carlo method
• A process that utilizes repeated random sampling to derive
an approximated result.
General Process:
1. Define a domain of possible inputs.
2. Generate inputs randomly from the domain using a certain
specified probability distribution.
3. Perform a deterministic computation using the inputs.
4. Aggregate the results of the individual computations into the
final result.
From: http://en.wikipedia.org/wiki/Monte_carlo_method
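A classic toy instance of the four steps above is estimating pi by throwing random points at a square. The sketch below is an illustration only; estimate_pi and its parameters are hypothetical names, not something from the readings.

```python
import random


def estimate_pi(n_samples, seed=None):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        # Steps 1 and 2: the domain is the square, sampled uniformly.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Step 3: deterministic check, is the point inside the unit circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate. The circle covers pi/4 of the square's area.
    return 4.0 * inside / n_samples


print(estimate_pi(200_000, seed=1))
```

The estimate gets closer to pi as n_samples grows, but only slowly: the error shrinks roughly with the square root of the number of samples.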
"By our powers combined..."
Markov Chain Monte Carlo Method
• Gibbs sampling is one common MCMC algorithm.
From:
Dirichlet Processes, Chinese Restaurant Processes and All That,
Michael I. Jordan 2005
http://www.cs.berkeley.edu/~jordan
David MacKay, Information theory, inference, and learning algorithms
(Cambridge, UK; New York: Cambridge University Press, 2003).
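To make the combination concrete, here is a minimal Gibbs sampling sketch for a toy target: a two-dimensional normal distribution with zero means, unit variances, and correlation rho. The target distribution and all names are assumptions chosen for illustration; the readings apply the same idea to topic models, which this sketch does not attempt.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    conditional_sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for i in range(burn_in + n_samples):
        # Markov chain: each update uses only the current values.
        # Monte Carlo: each update is a random draw from a conditional.
        x = rng.gauss(rho * y, conditional_sd)  # draw x given y
        y = rng.gauss(rho * x, conditional_sd)  # draw y given x
        if i >= burn_in:
            samples.append((x, y))
    return samples


def correlation(samples):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    var_y = sum((y - mean_y) ** 2 for _, y in samples)
    return cov / math.sqrt(var_x * var_y)


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
print(correlation(samples))
```

The sampler never draws from the joint distribution directly. It only alternates between the two one-dimensional conditionals, yet the collected samples approximate the joint.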
Bayesian Paradigm
From: Structured Bayesian Nonparametric Models with Variational
Inference
ACL Tutorial
Prague, Czech Republic
Percy Liang and Dan Klein
http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf
Bayesian Inference
Think of A as the topics in a document and B as the words observed.
The goal is to infer the most probable topic distribution given the
observed words.
From:
Robert Cowell, Introduction to inference for Bayesian networks, in
Learning in graphical models, ed. Michael Jordan (MIT Press, 1999),
9-27.
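A small numeric sketch of the idea, using Bayes' rule P(A|B) = P(B|A) P(A) / P(B). It is deliberately simplified: A is a single topic rather than a full topic distribution, words are assumed independent given the topic, and the topics, words, and probabilities are all invented for illustration.

```python
def posterior(prior, likelihood, observed_words):
    """Return P(topic | words) for each topic via Bayes' rule."""
    unnormalized = {}
    for topic, p_topic in prior.items():
        p_words = 1.0
        for word in observed_words:
            # Assumes words are conditionally independent given the topic.
            p_words *= likelihood[topic][word]
        unnormalized[topic] = p_topic * p_words  # P(B|A) * P(A)
    evidence = sum(unnormalized.values())  # P(B)
    return {topic: value / evidence for topic, value in unnormalized.items()}


# Hypothetical two-topic world with a three-word vocabulary.
prior = {"genetics": 0.5, "physics": 0.5}
likelihood = {
    "genetics": {"gene": 0.6, "quantum": 0.1, "model": 0.3},
    "physics": {"gene": 0.1, "quantum": 0.6, "model": 0.3},
}
result = posterior(prior, likelihood, ["gene", "gene", "model"])
print(result)
```

Seeing "gene" twice pushes almost all of the posterior mass onto the genetics topic, even though both topics started out equally likely.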
Latent Dirichlet Allocation
• A generative document model.
• Each document is composed of a number of words, each drawn
from one of the topics that comprise the document.
• Each document has a probability distribution over topics, and
each topic has a probability distribution over words.
...pictures help here...
Latent Dirichlet Allocation - Cont.
From:
David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent dirichlet
allocation, J. Mach. Learn. Res. 3 (2003): 993-1022.
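The generative story can be sketched directly: draw a topic mixture for the document, then for each word draw a topic and then a word from that topic. This only generates documents; it does not do inference. The vocabulary, topic-word probabilities, and function names below are invented for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probabilities, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_document(topic_word, alpha, n_words, vocabulary, rng):
    """Generate one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        topic = sample_index(theta, rng)             # pick a topic
        word = sample_index(topic_word[topic], rng)  # pick a word from it
        words.append(vocabulary[word])
    return theta, words


# Hypothetical vocabulary and two topics (each row sums to 1).
vocabulary = ["gene", "dna", "quantum", "particle"]
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # a "biology" topic
    [0.05, 0.05, 0.45, 0.45],  # a "physics" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, [0.5, 0.5], 8, vocabulary, rng)
print(theta, words)
```

Inference runs this story in reverse: given only the words, recover plausible values for theta and topic_word.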
Dirichlet Distribution
From: Structured Bayesian Nonparametric Models with Variational
Inference
ACL Tutorial
Prague, Czech Republic
Percy Liang and Dan Klein
http://www.cs.berkeley.edu/~pliang/papers/tutorial-acl2007.pdf
Network Analysis/Methods
From: Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A.,
et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3),
e4803. doi: 10.1371/journal.pone.0004803
Centrality
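• Betweenness
o Bridging.
o Number of shortest paths between other nodes that pass
through a node.
• Closeness
o Average distance to other nodes (often reported as its
inverse, so that closer nodes score higher).
• Degree
o Number of edges of one type or another (e.g. incoming
or outgoing).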
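The three measures can be computed on a toy graph with nothing but the standard library. The sketch below assumes an unweighted, undirected, connected graph stored as an adjacency dict; betweenness uses Brandes' algorithm and closeness is reported in its inverse form. The example graph is invented.

```python
from collections import deque


def degree_centrality(graph):
    """Number of edges attached to each node."""
    return {v: len(neighbors) for v, neighbors in graph.items()}


def closeness_centrality(graph):
    """Inverse of the average shortest-path distance to reachable nodes."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(graph):
    """Brandes' algorithm: shortest paths through each node (undirected)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Two triangles joined by a bridge between "c" and "d".
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
print(degree_centrality(graph))
print(closeness_centrality(graph))
print(betweenness_centrality(graph))
```

In this graph the two bridge nodes "c" and "d" have only slightly higher degree than the rest, but they carry all of the betweenness, which is the "bridging" idea in the first bullet.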
Pathfinder Scaling
• Network scaling (edge reduction) method.
• Generates a minimum spanning tree plus a number of redundant
edges that is tunable by parameter.
• Can use different metrics to determine which edges to
prune, such as Euclidean distance or edge weight.
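A minimal sketch of the pruning idea, under one specific parameter choice: path length is measured by the largest edge on the path (r = infinity) and paths of any length are allowed (q = n - 1). Under that choice an edge survives only if no alternative path between its endpoints has a smaller largest edge. The function name and the distance matrix are invented, and other parameter settings keep more edges than this one does.

```python
import math


def pathfinder_network(weights):
    """Return the edges kept by Pathfinder scaling with r = inf, q = n - 1.

    weights is a symmetric matrix of distances (smaller means more
    similar), with math.inf where there is no edge and 0 on the diagonal.
    """
    n = len(weights)
    # minimax[i][j]: over all paths from i to j, the smallest possible
    # value of the largest edge on the path (Floyd-Warshall variant).
    minimax = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    kept = []
    for i in range(n):
        for j in range(i + 1, n):
            # Keep a direct edge only if no indirect path beats it.
            if weights[i][j] < math.inf and weights[i][j] <= minimax[i][j]:
                kept.append((i, j))
    return kept


# Hypothetical triangle: the long edge 0-2 is redundant given 0-1-2.
distances = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
print(pathfinder_network(distances))
```

With distinct edge weights this setting leaves exactly a minimum spanning tree; ties between weights are what let redundant edges through.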
What's been done on the topic?
What did we learn today, class?
R. M. Shiffrin and K. Börner, Mapping knowledge domains,
Proceedings of the National Academy of Sciences 101, no. suppl_1
(2004): 5183-5185.
• Overview of the field and the PNAS articles.
• Important takeaway: there is a lot of research potential in
this area, and it benefits from interdisciplinary analysis
involving a variety of techniques.
T. L. Griffiths and Mark Steyvers, Finding scientific topics, Proceedings
of the National Academy of Sciences 101, no. suppl_1 (2004): 5228-5235.
• LDA
• Optimal Topics ~300
• LDA over PNAS abstracts 1991-2001
• ~3 million words total, ~20k terms in vocabulary
Contributions:
• Hot and Cold topics
• Topical clustering (heatmap)
• Demonstrate that content analysis can reveal topics
The word distribution within 10 topics.
30 randomly generated "documents"
generated from the above 10 topics.
Using LDA to derive the original topics from the
observed documents.
Griffiths et al. cont...
Cold Topics.
Hot Topics!
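One simple way to operationalize "hot" and "cold" is to fit a straight line to a topic's average share of each year's abstracts and look at the sign of the slope. The sketch below is a bare-bones version of that idea; the yearly proportions are made up, and a real analysis would also test whether the trend is statistically significant.

```python
def linear_trend(years, proportions):
    """Least-squares slope of topic proportion against year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(proportions) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, proportions))
    var = sum((x - mean_x) ** 2 for x in years)
    return cov / var


def label_topic(years, proportions):
    """Label a topic by the direction of its trend."""
    slope = linear_trend(years, proportions)
    if slope > 0:
        return "hot"
    if slope < 0:
        return "cold"
    return "flat"


# Hypothetical yearly shares for two topics.
years = [1991, 1992, 1993, 1994, 1995]
rising = [0.01, 0.02, 0.03, 0.04, 0.05]
falling = [0.05, 0.04, 0.03, 0.02, 0.01]
print(label_topic(years, rising), label_topic(years, falling))
```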
David M. Blei and John D. Lafferty, A correlated topic model of Science,
The Annals of Applied Statistics 1, no. 1 (2007): 17-35
• Correlated Topic Models
o Evolution of LDA.
o Introduces the notion that the probabilities of the topics
comprising a document are not necessarily independent.
o Replaces the Dirichlet distribution with a logistic normal
distribution that takes a covariance structure as a
parameter.
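The change from LDA can be sketched as a different way of drawing a document's topic proportions: draw a vector from a multivariate normal with a covariance matrix, then squash it onto the simplex with a softmax. This is a simplified illustration; the mean, the covariance matrix, and the function names are invented, and the covariance must be positive definite for the Cholesky step to work.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_topic_proportions(mean, covariance, rng):
    """Draw eta ~ Normal(mean, covariance) and return softmax(eta)."""
    lower = cholesky(covariance)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    eta = [
        mean[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
        for i in range(len(mean))
    ]
    peak = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical covariance: topics 0 and 1 tend to rise and fall together.
covariance = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
rng = random.Random(0)
theta = sample_topic_proportions([0.0, 0.0, 0.0], covariance, rng)
print(theta)
```

A Dirichlet draw has no parameter that can say "these two topics tend to appear together"; the off-diagonal entries of the covariance matrix do exactly that.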
Blei et al. cont...
Contributions:
• CTM outperforms LDA when the number of topics is larger.
• CTM also predicts words more accurately with less training
data than LDA.
• Both of these are credited to the effect of topic correlation on
the distribution of topics in a document.
• A science map!
o The covariance matrix is used to create a graph where
the topics are vertices and the edges represent some
level of covariance.
Blei et al.
Katy Börner, Jeegar T. Maru, and Robert L. Goldstone, The
simultaneous evolution of author and paper networks, Proceedings of
the National Academy of Sciences 101, no. suppl_1 (2004): 5266-5273.
• Bibliometric network encompassing coauthorship and
citation.
• Built to model the PNAS collaboration/citation network.
• 2 node types:
o Authors
o Papers
• Several edge types:
o directional information flow between paper:author and
paper:paper
o author and coauthor
Börner cont...
3 main parameters in the model:
1. Topics - i.e. scientific specializations
2. Aging - Meant to capture the bias to cite recent material.
3. Recursive Linking - The propensity to read the papers cited
by the papers you have read.
Iterative simulation that modeled the addition of new authors,
the removal of old ones, coauthorship, and the propensity of
authors to publish within their topic domain.
Looks like a decent fit...
Börner cont...
Contributions:
• Models the constraint of aging on preferential attachment in
scale free network formation.
• Models the "splintering" of science caused by specialization.
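The interplay of aging and preferential attachment can be illustrated with a toy growth model. This is not the model from the paper: it has no authors or topics, and the exponential aging weight, the half_life parameter, and all names are assumptions made for this sketch. Each new paper picks references with probability proportional to (citations received + 1) times a weight that halves every half_life steps.

```python
import random


def grow_citation_network(n_papers, refs_per_paper=3, half_life=5.0, seed=0):
    """Grow a citation network with aging-damped preferential attachment."""
    rng = random.Random(seed)
    citations = []  # (citing paper, cited paper) pairs
    in_degree = []  # citations received so far, indexed by paper
    for new in range(n_papers):
        if new > 0:
            weights = [
                # Preferential attachment term times aging term.
                (in_degree[old] + 1) * 0.5 ** ((new - old) / half_life)
                for old in range(new)
            ]
            wanted = min(refs_per_paper, new)
            chosen = set()
            while len(chosen) < wanted:
                chosen.add(rng.choices(range(new), weights=weights, k=1)[0])
            for old in sorted(chosen):
                citations.append((new, old))
                in_degree[old] += 1
        in_degree.append(0)
    return citations, in_degree


citations, in_degree = grow_citation_network(n_papers=200)
print(max(in_degree), sum(in_degree) / len(in_degree))
```

Lowering half_life makes papers stop collecting citations sooner, which limits how far early papers can run away with the citation count.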
Chaomei Chen, CiteSpace II: Detecting and visualizing emerging
trends and transient patterns in scientific literature, Journal of the
American Society for Information Science and Technology 57, no. 3
(2006): 359-377.
• Co-citation network.
• Clusters labeled based on Kleinberg's burst detection
algorithm (Kleinberg 2002).
9 steps:
1. Identify knowledge domain, e.g. "mass extinction".
2. Automated data collection. Uses PubMed and Web of Science.
3. Find burst terms. CiteSpace II scrapes [1-4]-grams.
4. Time slicing. Generate time series views of the network.
5. Choose thresholds (intellectual bases & research fronts).
6. Graph scaling. Reduce edges to improve visual clarity without sacrificing
critical visual data.
7. Layout. Typically a force directed layout to emphasize clustering.
8. Visual inspection. Tweak labels and display of metadata.
9. Verify pivot points.
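The "[1-4]-grams" in step 3 are all runs of one to four consecutive words. A minimal sketch of that extraction follows; the whitespace tokenization and the function name are assumptions for illustration, and how CiteSpace II itself tokenizes text is not described on the slide.

```python
def extract_ngrams(text, max_n=4):
    """Return every run of 1 to max_n consecutive words in the text."""
    tokens = text.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


print(extract_ngrams("mass extinction event"))
```

Counting how often each of these terms appears per time slice gives the time series that a burst detection step would then scan for sudden jumps.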
Chen cont...
Contributions:
• Detecting research fronts, intellectual bases, and pivots.
• Detecting trends in scientific research.
• Visualization of knowledge domain.
Overview of CiteSpace II
Tree ring view of citations over time
Eugene Garfield, Historiographic Mapping of Knowledge Domains
Literature, Journal of Information Science 30, no. 2 (April 1, 2004): 119-145.
• Co-founder of scientometrics.
• Bibliometric analysis and link tracking reveal impact of
papers on a field.
• Concept of local citation score (LCS) and global citation score (GCS).
• Adding group and time slicing to learn more about the effect
of those slices on bibliometric records.
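A local citation score can be sketched as a count of citations that come from inside the collection being studied, as opposed to a count over the whole citation index, which would have to be looked up externally. The collection below and the function name are invented for illustration.

```python
def local_citation_scores(collection):
    """Count, for each paper, the citations it receives from the collection.

    collection maps a paper id to the list of paper ids it cites.
    References to papers outside the collection are ignored.
    """
    scores = {paper: 0 for paper in collection}
    for paper, references in collection.items():
        for cited in set(references):
            if cited in scores and cited != paper:
                scores[cited] += 1
    return scores


# Hypothetical collection; "X" is cited but lies outside the collection.
collection = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B", "X"],
    "D": ["A", "C"],
}
print(local_citation_scores(collection))
```

A paper can score high locally while being little cited overall, or the reverse, which is why comparing the two scores is informative.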
Garfield cont...
Contributions:
• Bibliometrics...
• Finding out which papers were important at a certain time
vs. from a current perspective.
Thank You
Questions or Comments?
Bonus Bibliography Slide!
• Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35. doi: 10.1214/07-AOAS114
• Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res., 3, 993-1022.
• Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE, 4(3), e4803. doi: 10.1371/journal.pone.0004803
• Börner, K., Maru, J. T., & Goldstone, R. L. (2004). The simultaneous evolution of author and paper networks. Proceedings of the National Academy of Sciences, 101(suppl_1), 5266-5273. doi: 10.1073/pnas.0307625100
• Börner, K. (2007). Making sense of mankind’s scholarly knowledge and expertise: collecting, interlinking, and organizing what we know and different approaches to mapping (network) science. Environment and Planning B: Planning and Design, 34(5), 808-825. doi: 10.1068/b3302t
• Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377. doi: 10.1002/asi.20317
• Cowell, R. (1999). Introduction to inference for Bayesian networks. In M. Jordan (Ed.), Learning in graphical models (pp. 9-27). MIT Press.
• Garfield, E. (2004). Historiographic Mapping of Knowledge Domains Literature. Journal of Information Science, 30(2), 119-145. doi: 10.1177/0165551504042802
• Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1), 5228-5235. doi: 10.1073/pnas.0307752101
• Hall, D., Jurafsky, D., & Manning, C. D. (2008). Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 363-371). Honolulu, Hawaii: Association for Computational Linguistics. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1613715.1613763
• Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569-16572. doi: 10.1073/pnas.0507655102
• Janssens, F., Glänzel, W., & Moor, B. D. (2007). Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 360-369). San Jose, California, USA: ACM. Retrieved from http://portal.acm.org.proxy.lib.umich.edu/citation.cfm?id=1281192.1281233
Bibliography continued...
• Jordan, M. I. (1999). Learning in graphical models. MIT Press.
• Leicht, E. A., Clarkson, G., Shedden, K., & Newman, M. E. J. (2007). Large-scale structure of time evolving citation networks. 0706.0015. doi: 10.1140/epjb/e2007-00271-7
• MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge, UK; New York: Cambridge University Press.
• Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2009). Comparative study on methods of detecting research fronts using different types of citation. J. Am. Soc. Inf. Sci. Technol., 60(3), 571-580.
• Shiffrin, R. M., & Börner, K. (2004). Mapping knowledge domains. Proceedings of the National Academy of Sciences, 101(suppl_1), 5183-5185. doi: 10.1073/pnas.0307852100
• Torres-Moreno, J., St-Onge, P., Gagnon, M., El-Bèze, M., & Bellot, P. (2009, May 1). Automatic Summarization System coupled with a Question-Answering System (QAAS). ArXiv e-prints. Retrieved January 11, 2010, from http://adsabs.harvard.edu/abs/2009arXiv0905.2990T
• Zhu, D., & Porter, A. L. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 69, 495-506. doi: 10.1016/S0040-1625(01)00157-3