Author-Topic Modeling from Large Document Collections

Transcript Author-Topic Modeling from Large Document Collections

Algorithms for Data Mining and Querying with Graphs
Investigators: Padhraic Smyth, Sharad Mehrotra
University of California, Irvine
Students: Joshua O’ Madadhain, Dawit Seid, Jon Hutchins
JUNG: JAVA Universal Network/
Graph Framework
Link Prediction Algorithms
We have developed a general predictive learning
approach that can uses historical graph data to learn a
predictive model of whether a link is likely to exist
between any pair of nodes A and B in a future timeperiod. The prediction model utilizes information from
both structural graph features around A and B, as well
as individual node attributes for A and B. For example,
for co-author graphs, features can include distance in
the co-author graph of A from B, properties of A’s and
B’s graph neighborhoods, and topic models in the form
of probability distributions characterizing A’s and B’s
research interests.
Results on KDD Challenge/Biobase Data
This prediction competition in 2005 evaluated
different approaches for link prediction. The specific
problem was to predict new collaborations among
300,000 medical researchers in 2002, based on coauthor relations in 128,000 papers published from
1998-2001. The figure to the right shows the “lift
curve” the ratio of the number of true new
collaborations predicted by our models’ rankings
(relative to a random ranking). In the top 50
predictions for example, our models predict between
40 and 45 true collaborations (versus about 3 for a
random ranking).
Algorithms for Ranking Nodes in
Dynamic Networks
1 million emails, 21 months, 628 individuals
JUNG software is publicly
available at
http://jung.sourceforge.net
- active user/developer community
- 30,000 downloads, 1.3 million page visits
- ranked #60 out of 100k Sourceforge projects
- used in social network analysis, games, trust
metrics, upcoming version of HP Zoomgraph,
email visualization, and Netsight
Example of software built using JUNG:
Netsight, an interactive graph
visualization and analysis tool
Example of Rankings over Time
We have developed a novel algorithmic approach to the
problem of determining the importance of nodes in a
network where the links occur over time, e.g., an email
network or a co-author network. The concept is similar
to centrality ideas in social networks, and HITS and
PageRank for Web page ranking, but produces a
“dynamic rank” such that the rank of each node varies
over time as it receives messages in the network.
Data: Corporate Email History
- extensible, open source software library (API) for
graph/network modeling, analysis, and visualization
- can decorate graphs, vertices, edges with any JUNG
object
- complex filtering/transformation/subset management
- includes library of network and graph algorithms
- clustering, centrality, importance, paths, flows,
etc
- includes visualization API, or can use other
visualization APIs (e.g. prefuse)
- supports graphs, hypergraphs, parallel edges, mixedmode graphs, k-partite graphs
Email Rankings and Organizational Structure
GAAL: A General-Purpose
Graph Query Language
Algebraic Framework
We have developed a new query language called GAAL
that allows users to express complex relational
queries on attributed graphs, allowing for queries on
graph properties, aggregation operations, and
scalability to very large graphs. In 2005 we have
extended this approach to provide an algebraic
framework for spatio-temporal analysis of semantic
graphs.
Architecture
Graph
Query
Graph that embeds
input data
•
•
STProject
(G,P,O,T,F)
GProject(G,I,R)
STProject : projects out
spatial/temporal properties and
a target property of nodes/links
for spatial analysis.
GProject: Embeds a set of
nodes or a new relationship
type in the graph; nodes that
have the same type as those in
I are filtered out.
Point set,
Relationship set (pair of points),
Spatial/temporal value
STProject(TsunamiGraph,source_agency, report,
mentionedCity,,)
LEGEND:
G - base-graph
P - set of spatial/temporal properties
O - node/link type with the target property
T - set of target properties of nodes/edges (optional)
F - A filtering condition (optional)
I - spatial/temporal query output
R - relationships to be used in embedding of I
GRAPH SCHEMA
Source_
agency
Topic
About
Step 2: Find the top 3 cities with most damage:
mentions
Distinguish(topic=damage,,sum,3)
reports
References
Damage
related
topics
DBMS Specific Adapters
Extensible ORDBMS
Report
Replicates
filedBy
Differentiate these
mentionedCity
Pid
Multi-relational
Graph
(attributed graph)
Schema and
representation
Other metadata entity/event data
Temporal
Queries
Query Example
Step1: Spatial Projection:
GAAL Language
(Graph Querying Algebra)
Spatial
Queries
A triple of:
(target-property,
Spatial-property,
objectId)
country
Find the news source that had the most coverage of
most heavily damaged regions during the Tsunami
disaster ?
basedIn
Visualization/
Analysis Applications
(NetSight)
Graph Schema
Definition Interface
Node set,
Edge set
Property value
Graph
Graph-set
Rid Pid
Rid name URL
title
abs year
Paper
GProject(TsunamiGraph,topCities,range(mentionedCity))
pos
Write
Step 4: Find the top 3 sources using graph query
language:
Pid1 Pid2
Researcher
Rid Iid
Iid name type
WorksIn Institute
Step 3: Project into graph to find the sources:
Cite
SELECT ?source
FROM {?report,mentionedCity,?city,
?source, reports,?report}
WHERE ?city IN <topCities>
GROUP BY source_agency
AGGREGATE branchwise count(reports) INTO
?source.repCount)
fileHour
fileDate
fileLocation
city
city
Reporter
name
Hour
date
name

Author-Topic Modeling from Large Document Collections

Transcript Author-Topic Modeling from Large Document Collections

Directory