2010 DSSI Expert Search slides - Multimodal Information Access


Expert Search Group
Department of Computer Science
University of Illinois at Urbana-Champaign


Overall Goal:
▪ A system that takes as input a small amount of text (keywords or an abstract of a paper) and outputs a ranked list of people interested in the query text.
Application:
▪ A mailing list agent which determines whom to invite to on-campus talks.
Multimodal Information Access & Synthesis
Expert Finder
Data Mining Group
Mohammed Khalila
Sanvadra Chea
Rick Barber
Task Overview
▪ Given a list of faculty, gather data relevant to the task at hand
▪ Search for personal web pages and publications
▪ The data will be unstructured and noisy
▪ Preprocess the data and store relevant pages and publications in a database
Our Database
PersonIdentity (PK: UID)
  FirstName, MiddleName, LastName, OfficeAddress, OfficePhone, Email, Department, Title, Alias
PaperIdentity (PK: PaperID)
  Title, Content, Site, UIDOwner, URL
WebIdentity (PK: WebIndex)
  Title, Content, Site, UIDOwner, URL
Relationship: PersonIdentity (1) to PaperIdentity and WebIdentity (∞). Each author can have multiple papers and/or web pages.
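A minimal sketch of how the three tables on this slide could be created and queried. It uses Python's built-in sqlite3 as a stand-in for the MySQL database mentioned in the slides. The column types, the assignment of columns to tables, and the foreign-key link from UIDOwner to UID are assumptions, since the slide gives only table and column names. The helper names create_database and count_documents are hypothetical.

```python
import sqlite3

# Assumption: column types and the UIDOwner -> UID reference are guesses;
# the slide lists only table names, column names, and the primary keys.
SCHEMA = """
CREATE TABLE PersonIdentity (
    UID           INTEGER PRIMARY KEY,
    FirstName     TEXT,
    MiddleName    TEXT,
    LastName      TEXT,
    OfficeAddress TEXT,
    OfficePhone   TEXT,
    Email         TEXT,
    Department    TEXT,
    Title         TEXT,
    Alias         TEXT
);
CREATE TABLE PaperIdentity (
    PaperID  INTEGER PRIMARY KEY,
    Title    TEXT,
    Content  TEXT,
    Site     TEXT,
    UIDOwner INTEGER REFERENCES PersonIdentity(UID),
    URL      TEXT
);
CREATE TABLE WebIdentity (
    WebIndex INTEGER PRIMARY KEY,
    Title    TEXT,
    Content  TEXT,
    Site     TEXT,
    UIDOwner INTEGER REFERENCES PersonIdentity(UID),
    URL      TEXT
);
"""


def create_database(path=":memory:"):
    """Open a SQLite database and create the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def count_documents(conn, uid):
    """Return (paper_count, web_page_count) owned by one person."""
    papers = conn.execute(
        "SELECT COUNT(*) FROM PaperIdentity WHERE UIDOwner = ?", (uid,)
    ).fetchone()[0]
    pages = conn.execute(
        "SELECT COUNT(*) FROM WebIdentity WHERE UIDOwner = ?", (uid,)
    ).fetchone()[0]
    return papers, pages
```

The one-to-many relationship is expressed by storing the owner's UID on every paper and web page row, so one person can own any number of each.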
Web Crawling Methodology
▪ Crawl the uiuc.edu and illinois.edu domains for web pages only
▪ Facilitated by the Google Search API
▪ Limitation: 64 results, only 1,000 searches per day
▪ Limitation: Google’s idea of relevance is not the same as ours
▪ Insert results into a MySQL database
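A small sketch of the domain restriction described above, assuming search results arrive as a list of URL strings. It does not call any search API; the function names in_campus_domain and filter_results are hypothetical.

```python
from urllib.parse import urlparse

# The two domains named on the slide.
ALLOWED_DOMAINS = ("uiuc.edu", "illinois.edu")


def in_campus_domain(url):
    """True if the URL's host is one of the allowed domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def filter_results(urls):
    """Keep only the URLs that fall inside the allowed domains."""
    return [u for u in urls if in_campus_domain(u)]
```

Matching on the parsed host name rather than on a substring avoids accepting hosts such as notuiuc.edu.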
Publication Crawling Methodology
▪ Goal: find as many publications as possible
▪ From Google Scholar
– Use Google Scholar to search for publications based on the person’s name
– Google Scholar does not have an API, so we wrote our own crawler and parser
▪ From web pages
– For every page we found, extract the publications and their links, if any
– Note: many professors do not have web pages and thus have no publications to extract
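A rough sketch of the "extract the publications and their links" step for web pages, using only Python's standard library. The rule that a publication is any link ending in .pdf or .ps is a hypothetical heuristic, not the group's actual parser, and the names PublicationLinkParser and extract_publications are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PublicationLinkParser(HTMLParser):
    """Collect (link text, absolute URL) pairs for links that look like papers.

    Assumption: a link counts as a publication if its href ends in .pdf or .ps.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the publication link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith((".pdf", ".ps")):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((title, self._href))
            self._href = None


def extract_publications(html, base_url):
    """Return the publication links found in one HTML page."""
    parser = PublicationLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```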
Data Statistics
Web pages and papers: count statistics
Label       Counts     Percentage   Authors   Percentage
Papers      103,536    58.87%       3,055     89.46%
Web pages   72,345     41.13%       3,277     95.96%
Total       175,881                 3,415
[Chart: Papers and Webpages Contributions]
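The percentages on this slide can be re-derived from the counts. The short check below assumes that 103,536 is the paper count, 72,345 the web page count, and 3,415 the total number of authors; the last is an inference from the author percentages rather than something the slide labels. The helper name share is hypothetical.

```python
def share(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


papers, web_pages = 103_536, 72_345
total_documents = papers + web_pages          # 175,881 on the slide

authors_with_papers, authors_with_pages = 3_055, 3_277
total_authors = 3_415                         # assumption: total author count

paper_share = share(papers, total_documents)
web_share = share(web_pages, total_documents)
paper_author_share = share(authors_with_papers, total_authors)
web_author_share = share(authors_with_pages, total_authors)
```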
Imperfect Data
[Diagram: the database schema from the “Our Database” slide (PersonIdentity, PaperIdentity, WebIdentity) is shown again.]
What is Next?
▪ Relevance problem for web pages
– Heuristics exist for classifying home pages (TREC conferences)
– Modify these approaches to find pages with research interests (i.e. pages which are relevant to our task)
▪ Integrity problem for publications
– If we have relevant pages
• Query each publication against the body of relevant text
– If we don’t
• Cluster publications and discard unlikely clusters
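One way to read "query each publication against the body of relevant text" is as a similarity check between a publication and the text of a person's relevant pages. The sketch below uses cosine similarity over raw term counts. The similarity measure, the threshold value, and the function names are all assumptions; the slides do not say how the query would be scored.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def filter_publications(publications, relevant_text, threshold=0.1):
    """Keep publications whose text is similar enough to the relevant pages.

    The threshold is a hypothetical value that would need tuning on real data.
    """
    page_vector = term_vector(relevant_text)
    return [
        p for p in publications if cosine(term_vector(p), page_vector) >= threshold
    ]
```

A publication that shares no vocabulary with the person's pages scores 0.0 and is discarded, which is the intended effect when a common name pulls in another author's papers.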
Expert Finder
Topic Modeling/NLP Group
Alex van Esbroeck
Drew Onderko
Surya Kallumadi
Yanbo Xu
Goal of Topic Modeling & Learning
▪ Automatically organizing, understanding, searching, and summarizing large document archives.
▪ Identifying topic experts in a certain domain, e.g. who are the experts in the field of “syntactic parsing” in Natural Language Processing (NLP)?
Generative Model
- Topic Models, David Blei, Princeton, 2002
Overview
[Diagram: system overview. Inputs: Papers, Web Pages. Preprocessing: Extract text from PDF / Data cleaning / Multi-words / Text segmentation / ..., followed by Weighting data. Model: Author Topic Model (ATM), with +idf and +Feedback. Outputs: Information Retrieval, Visualization. Labels in the ATM plate diagram: β, η, Φt, Θa, A, W, Z, T, K, Ad, Nd, D.]
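A sketch of the usual generative story behind an author topic model, assuming that Θa in the diagram is each author's distribution over topics and Φt is each topic's distribution over words. For every word in a document, one of the document's authors is picked uniformly, a topic is drawn from that author's distribution, and a word is drawn from that topic's distribution. The Dirichlet priors and the Gibbs sampling used for inference are left out, and all function and parameter names are hypothetical.

```python
import random


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding


def generate_document(authors, theta, phi, vocab, n_words, rng):
    """Generate n_words words for a document written by the given authors.

    theta[a] is author a's topic distribution, phi[t] is topic t's word
    distribution, and vocab maps word indices to word strings.
    """
    words = []
    for _ in range(n_words):
        author = rng.choice(authors)            # author chosen uniformly
        topic = sample_index(theta[author], rng)  # topic from the author's distribution
        word = sample_index(phi[topic], rng)      # word from the topic's distribution
        words.append(vocab[word])
    return words
```

Inference runs this story in reverse: given the observed words and author lists, it estimates the per-author and per-topic distributions.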
Sample Results
ACL data set (2,084 authors, 1,326 papers)
Topic 20:
Term                     Probability   Author               Probability
word_segmentation        0.002763      Sharon Goldwater     0.553309
adaptor_grammar          0.002035      Bin Yu               0.388989
repair                   0.001766      Matthew Lease        0.334416
speech_repairs           0.001422      Anna Krasnyanskaya   0.322222
channel_model            0.001192      Mark Johnson         0.321784
adaptor_grammars         0.000962      William Schuler      0.288091
linear_models            0.000885      Daichi Mochihashi    0.194129
shifting                 0.000847      Izhak Shafran        0.180851
posterior_distribution   0.000808      Takeshi Yamada       0.156762
speech_repair            0.000770      Ying Lin             0.145747
Mark Johnson: Using Adaptor Grammars to Identify Synergies in the Unsupervised Acquisition of Linguistic Structure. ACL 2008: 398-406
Jianfeng Gao, Galen Andrew, Mark Johnson, Kristina Toutanova: A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing. ACL 2007
Sample Results
ACL data set (2,084 authors, 1,326 papers)
Term                  Probability   Author              Probability
translation           0.013         Chi-Ho Li           0.939
word_alignment        0.010         Kazuhide Yamamoto   0.921
alignment             0.009         Necip Ayan          0.920
alignments            0.008         Kazuteru Ohashi     0.906
aligned               0.008         Nicolas Stroppa     0.905
translation           0.007         Shouxun Lin         0.893
machine_translation   0.007         Boxing Chen         0.887
translation_model     0.007         Stephan Vogel       0.881
target_language       0.006         Yaser Al-Onaizan    0.861
translated            0.005         Bowen Zhou          0.822
Current work
▪ Out of box:
• Extracting text from PDF documents
• Segmenting papers and web pages into categories for weighting
• Generating term-frequency vectors from text documents
▪ Inside box:
• Author topic model (modified JGibbLDA, GibbsLDA++)
• Author topic model + idf
• Author topic model + non-uniform Dirichlet prior (for feedback)
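A minimal sketch of the term-frequency step and of an idf weight, using only the standard library. The tokenizer, the idf formula log(N / df), and the function names are assumptions; the slides do not say how idf is combined with the author topic model, so the sketch only shows how the weights themselves could be computed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (underscores kept for multi-words)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def term_frequencies(documents):
    """Return one term-count vector (a Counter) per document."""
    return [Counter(tokenize(doc)) for doc in documents]


def inverse_document_frequencies(tf_vectors):
    """Return idf(term) = log(N / df), where df is the number of documents containing the term."""
    n_docs = len(tf_vectors)
    doc_freq = Counter()
    for tf in tf_vectors:
        doc_freq.update(tf.keys())
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tf_vectors, idf):
    """Scale each term count by the term's idf weight."""
    return [{term: count * idf[term] for term, count in tf.items()} for tf in tf_vectors]
```

Terms that occur in every document get an idf of zero, so they stop contributing to the weighted vectors.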
Future Work
▪ Out of box:
• Extracting segments from papers and web pages
• Assigning variable weights for segments
• Author normalization/disambiguation
• Integrating feedback and supervised topic modeling
• Output for visualization
▪ Inside box:
• More feedback
▪ Evaluating our results and assigning names to topics
▪ Other topic models!
Expert Finder
Information Retrieval Group
Max Isenholt
Irwin Purifoy
Bhargavi Sriram
TARGET
▪ To develop a web page that takes text data (an abstract) as input and outputs a ranked list of people (professors at UIUC) interested in the input text (topics)
▪ Input text
– Can be the abstract of the talk that the speaker is going to give.
– Keywords from the abstract.
▪ Output
– List of UIUC professors (with their e-mail addresses) who will be interested in the talk.
▪ Input source
– Data from the data mining part
– Data from the topic modeling part
INFORMATION RETRIEVAL SYSTEM
[Diagram: the input query and the data from either Data Mining or Topic Modeling feed the information retrieval system, which outputs a ranked list of professors' e-mails.]
IDEAS FOR IMPLEMENTATION
– Simple Text Retrieval
• Given the keywords, retrieve documents with general text retrieval algorithms.
• Uses the Lemur Toolkit
– Topic Modeling Retrieval
• Retrieval using keywords from the abstract
$P(W, a) = \sum_{w_i} a_i \sum_{t} P(w_i, a \mid t)\, P(t) = \sum_{w_i} a_i \sum_{t} P(w_i \mid t)\, P(a \mid t)\, P(t)$
• Get the keywords from the abstract, match each keyword with the words under the different topics in the topic modeling output, and calculate the product of the probability of the author given the topic, the probability of the word given the topic, and the topic’s probability.
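A sketch of one reading of the retrieval formula above. It assumes the operator over the query keywords w_i is a sum weighted by a_i, that the inner operator is a sum over topics t, and that P(w_i, a | t) factorizes as P(w_i | t) P(a | t), as on the slide. The data layout (one dictionary per topic) and the names score_author and rank_authors are hypothetical.

```python
def score_author(author, keywords, p_word_given_topic, p_author_given_topic,
                 p_topic, weights=None):
    """Score one author for a keyword query.

    p_word_given_topic[t] maps words to P(w | t), p_author_given_topic[t] maps
    authors to P(a | t), and p_topic[t] is P(t). Missing entries count as 0.
    weights holds the per-keyword weights a_i (default: 1.0 for every keyword).
    """
    if weights is None:
        weights = [1.0] * len(keywords)
    score = 0.0
    for a_i, word in zip(weights, keywords):
        inner = sum(
            p_word_given_topic[t].get(word, 0.0)
            * p_author_given_topic[t].get(author, 0.0)
            * p_t
            for t, p_t in enumerate(p_topic)
        )
        score += a_i * inner
    return score


def rank_authors(authors, keywords, p_word_given_topic, p_author_given_topic,
                 p_topic, weights=None):
    """Return the authors sorted from highest to lowest score."""
    return sorted(
        authors,
        key=lambda a: score_author(
            a, keywords, p_word_given_topic, p_author_given_topic, p_topic, weights
        ),
        reverse=True,
    )
```

An author ranks highly when the query keywords are probable under topics in which that author is also probable.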
CURRENT STATUS
▪ Gather initial experimental data
– Text data for a few professors from the web (sample data)
– Toy topic modeling result
▪ Training on the Lemur Toolkit
– Simple text retrieval using sample data from the web
▪ Understanding the Topic-Author Model
– Generated a sample output (author retrieval) from the toy topic modeling result.
SAMPLE OUTPUT
▪ Given the keyword “unsupervised learning”, the output was:
[The output shown on the slide is not preserved in this transcript.]
FUTURE PLANS
▪ Implement author retrieval given an author as the input query
– Given an author, find other authors related to the given author in the topics extracted from the abstract.
▪ Generate a short summary for each retrieved author
– E.g. show the top 10 keywords per author
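A sketch of how a "top 10 keywords per author" summary could be produced from topic model output. It assumes each author has a distribution over topics and each topic a distribution over words, and scores a word for an author by summing P(word | topic) P(topic | author) over topics. The scoring rule and the name top_keywords are assumptions, not the group's stated plan.

```python
def top_keywords(author_topics, topic_words, k=10):
    """Return the k highest-scoring words for one author.

    author_topics[t] is P(t | author); topic_words[t] maps words to P(w | t).
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = {}
    for t, p_topic in enumerate(author_topics):
        for word, p_word in topic_words[t].items():
            scores[word] = scores.get(word, 0.0) + p_topic * p_word
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```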