WWW Conference Paper Review
Jonathan
Artificial Intelligence Lab
University of Arizona
2015/4/8
Outline
• Overview of WWW
• Paper Review
• Summary
Overview
• Annual Conference
– 2008 Beijing, China
– 2009 Madrid, Spain
– 2010 Raleigh, US
– 2011 Hyderabad, India
• Submission Track
– Research Papers
– Poster
– Other
• Demo Proposal, Workshop Proposal etc.
Areas and Topics
• Data Mining and Machine Learning
– Deriving actionable insight from Web information sources:
query logs, Web graph, click trails, text documents, etc.
• Social Networks
– Models, algorithms, systems and issues around social
networks and collaborative environments.
• Internet Monetization
– Markets, auctions, games, pricing, advertising, and other
Web-specific economic activities.
Areas and Topics
• Security and Privacy
• Semantic Web
• Search
• Bridging Structured and Unstructured Data
• Software Architecture and Infrastructure
• Performance, Scalability and Availability
• Networking and Mobility
• User Interfaces and Rich Interaction
• Rich Media
• Web Services and Service-Oriented Computing
Major Groups in Data Mining
• Academia
– University of Illinois at Urbana-Champaign (11)
– Stanford University (4)
– Cornell University (3)
– Arizona State University (3)
• Industry
– Google (8)
– Microsoft (7)
– Yahoo! (5)
– IBM (2)
Best Papers
• 2008
– IRLbot: Scaling to 6 Billion Pages and Beyond, Hsin-Tsang Lee, Derek Leonard,
Xiaoming Wang, and Dmitri Loguinov, (Texas A&M University)
• 2009
– Hybrid Keyword Search Auctions. Ashish Goel (Stanford University), Kamesh
Munagala (Duke University)
• 2010
– Factorizing Personalized Markov Chains for Next-Basket Recommendation.
Steffen Rendle (Osaka University), Christoph Freudenthaler and Lars Schmidt-Thieme (University of Hildesheim).
Trends
• More and more research is innovating by combining and integrating different approaches to the same problem.
• In the data mining field, an increasing number of studies combine text mining with social network analysis.
Factorizing Personalized Markov Chains for Next-Basket Recommendation
S. Rendle (Osaka University, Japan)
C. Freudenthaler (University of Hildesheim, Germany)
L. Schmidt-Thieme (University of Hildesheim, Germany)
Overview
• 2010 Best Paper
• Research Question: Can we provide better product recommendations for different users?
– Matrix Factorization (MF)
– Markov Chain (MC)
• Methodology: Factorized Personalized Markov Chain (FPMC) model
• Conclusion: The proposed method outperforms the MF model and the non-personalized MC model.
[Figures: non-personalized MC; FPMC]
FPMC
• Task: estimate each unknown entry ("?") in the user × item × item transition cube.
– But there are too many unknown entries and too little data for each user.
• Solution: FPMC
– Factorize the cube, so that each user u's transition probability from item i to item j is informed by the other transitions of the same user, the other transitions from the same item i, and the other transitions to the same item j.
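The factorization idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a pairwise-interaction form in which the score of a candidate item is a user-item term plus a last-item-to-next-item term averaged over the previous basket. The factor matrices (`V_UI`, `V_IU`, `V_IL`, `V_LI`), the toy sizes and the random values are all assumptions, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 5, 4  # toy sizes, chosen only for illustration

# Factor matrices (random and untrained; a real model would learn them).
V_UI = rng.normal(scale=0.1, size=(n_users, k))  # user factors, paired with the next item
V_IU = rng.normal(scale=0.1, size=(n_items, k))  # next-item factors, paired with the user
V_IL = rng.normal(scale=0.1, size=(n_items, k))  # next-item factors, paired with the last item
V_LI = rng.normal(scale=0.1, size=(n_items, k))  # last-item factors, paired with the next item


def fpmc_scores(user, last_basket):
    """Score every candidate next item for `user` given the previous basket."""
    user_term = V_IU @ V_UI[user]                             # shape (n_items,)
    transition_term = V_IL @ V_LI[last_basket].mean(axis=0)   # averaged over the last basket
    return user_term + transition_term


scores = fpmc_scores(user=1, last_basket=[0, 3])
ranking = np.argsort(-scores)  # candidate items, best first
```

Because every user, every "from" item and every "to" item has its own small factor vector, a transition that was never observed for a given user still receives a score from the factors the three share with observed data.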
Take-away
• May be helpful when modeling sequential data that can be grouped (personalized).
• Potential application in AI Lab
– Combine authorship analysis and sequential text
mining.
– Predict the next word/sentence/paragraph of a
particular author.
Topic Modeling with Network
Regularization
Q. Mei (University of Illinois at Urbana-Champaign)
D. Cai (University of Illinois at Urbana-Champaign)
D. Zhang (University of Illinois at Urbana-Champaign)
C. Zhai (University of Illinois at Urbana-Champaign)
Overview
• Research Question: Can we improve topic modeling by incorporating knowledge of the network structure?
• Methodology: Topic Modeling with Network Structure (TMN). Network Probabilistic Latent Semantic Analysis (NetPLSA) is used as an example.
• Conclusion: The proposed approach outperforms both purely text-oriented and network-oriented methods.
Topic Modeling with Network Structure
• In general, TMN is a framework for combining an arbitrary topic model with network constraints.
– It builds an objective function that balances maximizing the likelihood of the topic model against minimizing the differences between the topic distributions of adjacent nodes in the network graph.
Geographic topic distribution for Hurricane Katrina
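The balance described on this slide can be written down as a small function. The sketch below is an assumption-laden illustration rather than the paper's code: it uses a PLSA-style likelihood, a squared-difference penalty over edges, and a trade-off weight `lam`; all function names and toy values are made up for the example.

```python
import numpy as np


def log_likelihood(counts, doc_topic, topic_word):
    """PLSA-style log-likelihood of word counts under a mixture of topics."""
    word_probs = doc_topic @ topic_word  # p(word | document), shape (docs, words)
    return float(np.sum(counts * np.log(word_probs)))


def network_penalty(doc_topic, edges):
    """Weighted squared difference between topic distributions of linked documents."""
    return 0.5 * sum(w * float(np.sum((doc_topic[u] - doc_topic[v]) ** 2))
                     for u, v, w in edges)


def regularized_objective(counts, doc_topic, topic_word, edges, lam):
    """Quantity to minimize: trades likelihood off against smoothness on the graph."""
    return (-(1.0 - lam) * log_likelihood(counts, doc_topic, topic_word)
            + lam * network_penalty(doc_topic, edges))


# Toy inputs: 3 documents, 2 topics, 4 words (all values are made up).
counts = np.array([[3, 1, 0, 0], [2, 2, 0, 1], [0, 0, 4, 2]])
doc_topic = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
topic_word = np.array([[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.5, 0.3]])
edges = [(0, 1, 1.0), (1, 2, 1.0)]  # (document u, document v, edge weight)

penalty = network_penalty(doc_topic, edges)
objective = regularized_objective(counts, doc_topic, topic_word, edges, lam=0.5)
```

With `lam=0` the objective reduces to the plain topic model; with `lam=1` only the network smoothness matters.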
Take-away
• When we deal with text data that has a network structure attached, the TMN framework may be helpful.
• Potential application in AI Lab
– Geopolitical topic modeling.
– Incorporate reply network into forum topic
modeling.
Exploiting Social Context for Review
Quality Prediction
Y. Lu (University of Illinois at Urbana-Champaign)
P. Tsaparas (Microsoft)
A. Ntoulas (Microsoft)
L. Polanyi (Microsoft)
Overview
• Research Question: Can we improve review
quality prediction by incorporating social
context into text features?
• Methodology: Linear regression with regularization constraints.
• Conclusion: Prediction accuracy is greatly
increased.
Regression with social context constraints
• Besides textual features, the regression model uses social context features as regularization constraints.
– Author consistency
– Trust consistency
– Co-citation consistency
• Minimize both the mean squared error and the violations of the three consistency conditions above.
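A minimal sketch of this kind of regularized regression follows, using only the author-consistency condition (reviews by the same author should receive similar predicted quality). It is an illustration under stated assumptions, not the paper's exact formulation: the function names, the ridge term `alpha`, the weight `beta` and the toy data are all hypothetical. Trust and co-citation consistency could be added as further penalty terms of the same shape.

```python
import numpy as np


def fit_with_author_consistency(X, y, same_author_pairs, beta, alpha=1e-3):
    """Minimize ||Xw - y||^2 + alpha*||w||^2 + beta * sum over pairs (pred_i - pred_j)^2."""
    n, d = X.shape
    L = np.zeros((n, n))  # graph Laplacian of the same-author graph
    for i, j in same_author_pairs:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    A = X.T @ X + alpha * np.eye(d) + beta * (X.T @ L @ X)
    return np.linalg.solve(A, X.T @ y)


def inconsistency(X, w, same_author_pairs):
    """Sum of squared prediction differences between reviews by the same author."""
    pred = X @ w
    return float(sum((pred[i] - pred[j]) ** 2 for i, j in same_author_pairs))


# Toy data: 4 reviews, a bias column plus one text feature (all values made up).
X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 0.1], [1.0, 0.8]])
y = np.array([2.0, 4.0, 1.0, 5.0])
pairs = [(0, 1), (2, 3)]  # reviews 0 and 1 share an author, as do 2 and 3

w_plain = fit_with_author_consistency(X, y, pairs, beta=0.0)
w_social = fit_with_author_consistency(X, y, pairs, beta=10.0)
```

Raising `beta` pulls the predictions for same-author reviews closer together, at the cost of a larger squared error on the labels.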
Take-away
• A good example of utilizing text data with a
network structure attached.
• When we want to assign numerical scores to textual data, we can use the relationships between items to adjust these scores.
• Potential application in AI Lab
– Sentiment analysis.
Topic Initiator Detection on the
World Wide Web
X. Jin (University of Illinois at Urbana-Champaign)
S. Spangler (IBM)
R. Ma (IBM)
J. Han (University of Illinois at Urbana-Champaign)
Overview
• Research Question: How can we find the initiator of a topic in online media?
• Methodology: InitRank
• Conclusion: Proposed
method outperforms
baseline models such as
sorting the documents by
time.
InitRank: TCL Graph
• After extracting initiator indicator attributes for all documents on a topic, a TCL graph is constructed.
– TCL = Time + Content + Link
• Two kinds of relationships exist between document nodes in the graph.
– Link (solid edge)
• Points to the referenced document.
– Document similarity (dashed edge)
• Points to the earlier document.
• Initiator values for the nodes are initialized from other attributes such as centrality, novelty, originality and document length.
• Then, these values are optimized on the graph.
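The graph construction above can be illustrated with a toy example. The slide does not give InitRank's actual update rule, so the propagation step below is a generic PageRank-style stand-in, not the authors' method; the documents, similarity values, `SIM_THRESHOLD` and `damping` are all hypothetical.

```python
import numpy as np

# Hypothetical documents: publication times, hyperlinks and content similarity.
times = [1, 2, 3]
links = [(1, 0), (2, 1)]                 # (citing document, referenced document)
similarity = {(0, 2): 0.8, (0, 1): 0.3}  # symmetric content similarity
SIM_THRESHOLD = 0.5

n = len(times)
W = np.zeros((n, n))
for src, dst in links:
    W[src, dst] += 1.0                   # link edge points to the referenced document
for (a, b), sim in similarity.items():
    if sim >= SIM_THRESHOLD:
        later, earlier = (a, b) if times[a] > times[b] else (b, a)
        W[later, earlier] += sim         # similarity edge points to the earlier document

# A document with no outgoing edge keeps its own score (self-loop).
for i in range(n):
    if W[i].sum() == 0:
        W[i, i] = 1.0
P = W / W.sum(axis=1, keepdims=True)     # row-normalized edge weights

initial = np.full(n, 1.0 / n)            # stand-in for attribute-based initial values
scores = initial.copy()
damping = 0.85
for _ in range(100):
    # Each document passes part of its score along its outgoing edges.
    scores = (1 - damping) * initial + damping * (P.T @ scores)
```

In this toy graph, score accumulates on document 0, the earliest document that the others link to or resemble.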
Take-away
• Again, a good example of combining text
mining and social network analysis.
• Potential application in AI Lab
– This framework may be useful in modeling the
"paths" in information diffusion.
AdHeat: An Influence-based Diffusion Model for
Propagating Hints to Match Ads
H. Bao (Google)
E. Chang (Google)
Overview
• Research Question: In a social network, is targeting ads to a user based on other users' influence better than targeting based on the user's own features?
– Empirically, a user with expertise in an area shows no interest in ads in that area.
– This research therefore attempts to target ads based on information from the other users who influence the target user most.
• Methodology: Heat diffusion model
• Conclusion: The influence-based model outperforms the traditional model in terms of click-through rate (CTR).
AdHeat Model
• 1) Social network construction
– a. Edge weights are calculated from relationship attributes.
– b. An influence score for each user is calculated using HITS (Hypertext Induced Topic Selection).
• 2) Hint-word generation: LDA (Latent Dirichlet Allocation)
• 3) Influence propagation: heat diffusion equation
[Figure: example social network of users u1 to u4 with weighted edges, and each user's hint words with weights]
u1: music, 0.4; guitar, 0.6
u2: movie, 0.14; art, 0.46; music, 0.2; guitar, 0.2
u3: basketball, 0.6; movie, 0.15; art, 0.25
u4: concert, 0.1; cooking, 0.12; music, 0.21; guitar, 0.25; movie, 0.14; art, 0.02; basketball, 0.16
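The influence-propagation step can be pictured with a discrete heat-diffusion sketch. It is a simplified illustration and not AdHeat's exact equation: it assumes symmetric edge weights, a graph-Laplacian form of the heat equation and explicit Euler steps. The three-user network, the edge weights, `rate` and `steps` are hypothetical.

```python
import numpy as np

words = ["music", "guitar", "basketball"]

# Hypothetical symmetric influence weights between three users.
W = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.4],
              [0.0, 0.4, 0.0]])
laplacian = np.diag(W.sum(axis=1)) - W

# Initial hint-word "heat" per user (rows) and word (columns); user 1 starts empty.
H0 = np.array([[0.4, 0.6, 0.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])


def diffuse(heat, laplacian, rate=0.1, steps=5):
    """Explicit Euler steps of dH/dt = -rate * L @ H."""
    heat = heat.copy()
    for _ in range(steps):
        heat = heat - rate * (laplacian @ heat)
    return heat


H = diffuse(H0, laplacian)
```

After diffusion, user 1 holds some weight on every word inherited from its neighbors, which is the sense in which hints propagate from influential users to the target user.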
Take-away
• Attributes of an instance (user) can also be modeled indirectly from other nodes by looking at their relationships.
• Potential application in AI Lab
– When clustering stakeholder groups, besides writing style, we can also look at which topics an author reads most to help identify the author's group.
Incorporating Site-Level Knowledge to
Extract Structured Data from Web Forums
J. Yang (Microsoft)
R. Cai (Microsoft)
Y. Wang (Chinese Academy of Sciences)
J. Zhu (Tsinghua University)
L. Zhang (Microsoft)
W. Ma (Microsoft)
Overview
• Research Question: Can we create a general forum crawler that extracts structured data from any forum?
• Methodology: Markov Logic Networks (MLN)
• Conclusion: Proposed
mechanism is shown to
be quite promising.
Markov Logic Networks
• A probabilistic extension of first-order logic.
– A Markov logic network contains multiple assertions called formulas, each of which is assigned a weight.
– An instance does not have to satisfy all the formulas to confirm the final assertion.
– This "fuzziness" handles the differences among forum designs and contributes to the generalizability of the forum crawler.
*Example of detecting thread title:
h: an HTML element
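The way weighted formulas combine can be shown with a tiny hand-rolled example. It is only a sketch of the log-linear idea behind Markov logic for a single query, IsTitle(h); the three formulas, their weights and the feature names (`is_link`, `has_long_text`, `in_table_row`) are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical weighted formulas, each of the form Feature(h) => IsTitle(h).
FORMULAS = [
    (1.5, "is_link"),
    (1.0, "has_long_text"),
    (0.8, "in_table_row"),
]


def implication_holds(feature_value, is_title):
    """Truth value of Feature(h) => IsTitle(h)."""
    return (not feature_value) or is_title


def world_weight(evidence, is_title):
    """exp(sum of the weights of the formulas satisfied in this possible world)."""
    total = sum(weight for weight, name in FORMULAS
                if implication_holds(evidence[name], is_title))
    return math.exp(total)


def prob_is_title(evidence):
    """Probability that element h is a thread title, given its observed features."""
    yes = world_weight(evidence, True)
    no = world_weight(evidence, False)
    return yes / (yes + no)


element = {"is_link": True, "has_long_text": True, "in_table_row": False}
p = prob_is_title(element)
```

An element that violates one formula is not ruled out; it only loses that formula's weight, which is the tolerance the slide refers to as "fuzziness".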
Take-away
• A promising framework to facilitate spidering and parsing in the future.
• MLN may be useful when you need to enhance the compatibility of a system.
• Potential application in AI Lab
– Employ MLN to process textual information
intelligently.
Summary
Situation: potentially useful model
• Sequential data that can be grouped: FPMC
• Text data with a clear network structure attached: TMN or regularized regression
• Modeling information diffusion in a noisy dataset: TCL
• Features that can be extracted from a class of entities are too limited: AdHeat
• Need for flexibility in decision making or compatibility of a system: MLN
References
• H. Bao, E. Y. Chang. 2010. AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads. In Proceedings of the 19th International Conference on World Wide Web.
• S. Goel, R. Muhamad, D. Watts. 2009. Social Search in "Small-World" Experiments. In Proceedings of the 18th International Conference on World Wide Web.
• X. Jin, S. Spangler, R. Ma, J. Han. 2010. Topic Initiator Detection on the World Wide Web. In Proceedings of the 19th International Conference on World Wide Web.
• Y. Lu, P. Tsaparas, A. Ntoulas, L. Polanyi. 2010. Exploiting Social Context for Review Quality Prediction. In Proceedings of the 19th International Conference on World Wide Web.
• S. Rendle, C. Freudenthaler, L. Schmidt-Thieme. 2010. Factorizing Personalized Markov Chains for Next-Basket Recommendation. In Proceedings of the 19th International Conference on World Wide Web.
• H. Lee, D. Leonard, X. Wang, D. Loguinov. 2008. IRLbot: Scaling to 6 Billion Pages and Beyond. In Proceedings of the 17th International Conference on World Wide Web.
• A. Goel, K. Munagala. 2009. Hybrid Keyword Search Auctions. In Proceedings of the 18th International Conference on World Wide Web.
• J. Yang, R. Cai, Y. Wang, J. Zhu, L. Zhang, W. Ma. 2009. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums. In Proceedings of the 18th International Conference on World Wide Web.
Social Search in "Small-World"
Experiments
S. Goel (Yahoo!)
R. Muhamad (Columbia University)
D. Watts (Yahoo!)
Overview
• 2009 Best Paper Nominee
• Research Question: Are individuals able to find the theoretically shortest path connecting them to anyone else in the social network?
– Every pair of individuals is connected by about 6 intermediaries.
• Topological distance
• Search distance
• Methodology: Message-forwarding experiment; logistic multilevel regression
• Conclusion: The mean chain length in the algorithmic sense is much larger than 6.
Attrition in Connectivity
• Attrition rate
– The probability that message forwarding stops at a given node.
• Motivation: estimate the real algorithmic chain length
– Chain length cannot be observed directly because, in the experiment, more than 99% of messages fail to reach their final recipients because of "attrition".
• The attrition rate can be affected by network topology and individual differences.
– People with high social status (educated, wealthy, etc.) tend to have a lower attrition rate.
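Why attrition hides long chains can be shown with a small reweighting sketch. It is a simplification, not the paper's model: it assumes one constant attrition rate per forwarding step, whereas the slide notes that the rate varies with topology and individuals. The function name, the chain counts and the rate are hypothetical.

```python
import math


def corrected_mean_chain_length(completed_counts, attrition_rate):
    """Reweight completed chains by the inverse of their completion probability."""
    survive = 1.0 - attrition_rate  # chance that a single forwarding step happens
    weights = {length: count / survive ** length
               for length, count in completed_counts.items()}
    total = sum(weights.values())
    return sum(length * w for length, w in weights.items()) / total


# Hypothetical counts of completed chains by length (number of forwarding steps).
completed = {3: 10, 4: 8, 5: 4, 6: 2}
observed_mean = sum(l * c for l, c in completed.items()) / sum(completed.values())
corrected_mean = corrected_mean_chain_length(completed, attrition_rate=0.5)
```

A chain of length L completes with probability (1 - r)^L under this assumption, so long chains are under-represented among completed ones and the corrected mean comes out larger than the observed mean.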
Take-away
• When modeling directed social relationships, we may take individual differences into account.
• Potential application in AI Lab
– Consider attrition in opinion diffusion models.