Min-Yen KAN - NUS School of Computing

Web IR/NLP Group (WING) @ NUS
Min-Yen Kan
School of Computing
National University of Singapore
http://wing.comp.nus.edu.sg/
Web IR/NLP Group @ NUS
One of many groups doing this type of research at NUS
PI: Min-Yen KAN (NLP and IR/DL)
Postdoc:
• Su Nam KIM (Multiword Expressions)
PhDs:
• Hendra SETIAWAN (Stat MT)
• Long QIU (Scenario Templates)
• Yee Fan TAN (Web Record Linkage)
• Jin ZHAO (Math IR)
• Jesse PRABAWA (UI/HCI for DLs)
• Ziheng LIN (Summarization)
Support staff (undergraduate):
• System administrators
• System programmers
Undergraduate Projects
• 4 this year (ask me about topics)
Today I will go over NLP first, then DL
MSRA Web-Scale NLP Workshop (Daedeok, Korea)
Information Extraction
• Keyphrase Extraction
– Idea: Use section information as evidence (ICADL 07)
• Scenario Template Generation (Long Qiu)
– Aim: to generate database rows from similar news events
Charley landed further south on the Gulf Coast than predicted, … The hurricane …
was weakened and is moving over South Carolina
At least 21 missing after the storm hit … But Tokage had weakened by the time it
passed over Tokyo, where it had left little damage before moving out to sea.
– Model context and cluster to convergence using EM (EMNLP 06)
Using less data
• URL Classification (WWW 04)
http://www.usatoday.com/stories/080502/ent/hilton.html
http://www.cancersupportgroup.org/forum/230.html
– Classifies thousands of URLs per minute, at two-thirds of the accuracy of full-text classification
– Useful for focused crawling, web mining applications
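To make the URL-only idea concrete, here is a minimal sketch that classifies a page from the tokens in its URL alone. It is an assumption-laden illustration, not the WWW 04 feature set: the classifier is a plain multinomial Naive Bayes with add-one smoothing, the training URLs on example.com/.org/.net and the labels "news" and "health" are invented, and the names `url_tokens` and `UrlNaiveBayes` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (digits and punctuation are dropped)."""
    return re.findall(r"[a-z]+", url.lower())


class UrlNaiveBayes:
    """Multinomial Naive Bayes over URL tokens with add-one smoothing (toy sketch)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.class_counts = Counter()             # label -> number of training URLs
        self.vocab = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            toks = url_tokens(url)
            self.token_counts[label].update(toks)
            self.class_counts[label] += 1
            self.vocab.update(toks)
        return self

    def predict(self, url):
        toks = [t for t in url_tokens(url) if t in self.vocab]  # ignore unseen tokens
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in toks)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data, for illustration only.
TRAIN = [
    ("http://news.example.com/stories/2004/world/election.html", "news"),
    ("http://www.example.org/news/ent/stories/movie.html", "news"),
    ("http://www.example.net/cancer/support/forum/12.html", "health"),
    ("http://health.example.com/forum/supportgroup/faq.html", "health"),
]
clf = UrlNaiveBayes().fit([u for u, _ in TRAIN], [y for _, y in TRAIN])

print(clf.predict("http://www.usatoday.com/stories/080502/ent/hilton.html"))
print(clf.predict("http://www.cancersupportgroup.org/forum/230.html"))
```

Note that the sketch treats "cancersupportgroup" as a single unseen token; splitting such concatenated strings into words would be a natural next step, and it is left out here to keep the example short.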
Question-Answering (Hang Cui)
• Our Approaches to QA
– Use of external resources from Web & WordNet (SIGIR04)
– Employ dependency & SRL for answer extraction (SIGIR05, 06)
– Soft pattern analysis of definitional patterns (WWW 05)
– Explore temporal relationships and events
– Extend techniques to precise passage retrieval
– Came 2nd (in 2003, 2004 & 2005) in TREC QA Task
– Licensed technology to company in legal search
• Current focus
– Relation-based IE & QA – continue focus on linguistic knowledge
– Ontology-based Interactive QA – leverage domain knowledge
– Searching for answers and mining terminology from the Web
Summarization (Ziheng Lin)
• Document Concept Lattice Model (IPM 07)
– Aim: find the list of sentences that results in minimal information loss
– Extract key concept terms, and build concept lattice
– Perform sentence extraction that covers the maximum number of concept terms
– Participated in DUC, came in 1st (2005) and 2nd (2006)
• Pioneered iterative construction model for graph-based summarization (DUC 07)
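The "cover the most concept terms" step can be sketched with a simple greedy selection. This is a stand-in under stated assumptions, not the Document Concept Lattice model itself: no lattice is built, concept terms are given as a flat list, and the function name `greedy_concept_cover` and the example sentences are invented for illustration.

```python
import re


def greedy_concept_cover(sentences, concepts, max_sentences):
    """Greedily pick sentences so each pick covers the most not-yet-covered concept terms."""
    concept_set = {c.lower() for c in concepts}
    # Concept terms present in each sentence (simple lowercase word match).
    sent_concepts = [
        concept_set & set(re.findall(r"[a-z]+", s.lower())) for s in sentences
    ]
    covered, chosen = set(), []
    while len(chosen) < max_sentences:
        best_i, best_gain = None, 0
        for i, terms in enumerate(sent_concepts):
            if i in chosen:
                continue
            gain = len(terms - covered)  # new concept terms this sentence would add
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # nothing left adds new concepts
            break
        chosen.append(best_i)
        covered |= sent_concepts[best_i]
    # Return the picks in their original document order.
    return [sentences[i] for i in sorted(chosen)]


# Invented toy input.
sentences = [
    "The hurricane hit the coast.",
    "The storm weakened over land.",
    "The hurricane weakened and caused damage.",
    "Officials reported little damage.",
]
concepts = ["hurricane", "coast", "weakened", "damage"]
summary = greedy_concept_cover(sentences, concepts, max_sentences=2)
print(summary)
```

Greedy selection is a common approximation for coverage objectives; it trades optimality for speed and does not model the concept hierarchy that a lattice would capture.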
[Figure: iterative graph construction over documents doc1, doc2, doc3 and sentences s1, s2, s3]
Statistical Machine Translation (Hendra Setiawan)
• Function Word Based Reordering (ACL 07)
– Source: 表单 是 网页 上 的 数据 输 域 的 集合
– Glosses: 表单 = a form, 是 = is, 网页 = a page, 上 = on, 数据 输 域 = data entry fields, 的 = of, 集合 = a coll.
– 上 网页: on a page
– 数据 输 域 的 上 网页: data entry fields on a page
– 集合 的 数据 输 域 的 上 网页: a coll. of data entry fields on a page
– 表单 是 集合 的 数据 输 域 的 上 网页: a form is a coll. of data entry fields on a page
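The reordering in the example above can be reproduced with a small hand-written rule sketch. It is only a toy under stated assumptions: the ACL 07 approach is statistical, whereas the word sets `POSTPOSITIONS`, `LINKERS` and `MONOTONE` and the function `reorder` below are hypothetical and hard-coded for this one sentence. The tokens are copied as they appear on the slide.

```python
POSTPOSITIONS = {"上"}  # attaches to the phrase on its left; moved in front of it
LINKERS = {"的"}        # "X 的 Y" is reordered to "Y 的 X"
MONOTONE = {"是"}       # kept in place; closes the phrase built so far
FUNCTION_WORDS = POSTPOSITIONS | LINKERS | MONOTONE


def reorder(tokens):
    """Reorder source tokens into target-like order around a few function words."""
    out = []  # finished part of the sentence
    cur = []  # phrase currently being built, already in target-like order
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in MONOTONE:
            out.extend(cur)
            out.append(tok)
            cur = []
            i += 1
        elif tok in POSTPOSITIONS:
            cur = [tok] + cur
            i += 1
        elif tok in LINKERS:
            # Collect the content words to the right of the linker, then swap sides.
            j = i + 1
            right = []
            while j < len(tokens) and tokens[j] not in FUNCTION_WORDS:
                right.append(tokens[j])
                j += 1
            cur = right + [tok] + cur
            i = j
        else:
            cur.append(tok)
            i += 1
    out.extend(cur)
    return out


source = "表单 是 网页 上 的 数据 输 域 的 集合".split()
print(" ".join(reorder(source)))
```

The point of the sketch is that a handful of function words is enough to determine how the surrounding phrases move; a learned model would estimate these orientations from data rather than hard-code them.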
Commercial record linkage (Yee Fan Tan)
• Addresses
– Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802
– LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343
• Products
– Honda Fit vs. Honda Jazz
– Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
• Idea: use web as additional context for disambiguation and clustering (JCDL 06, WIDM 07)
• Placed 3rd in Web People Search Task (WEPS 2007)
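A rough sketch of the "web as additional context" idea: compare two records by the text a search engine returns for each, instead of by their surface strings. Everything here is an assumption for illustration. `fetch_snippets` stands in for a real search call, the snippets in `FAKE_SNIPPETS` are invented toy strings, and the similarity is a plain bag-of-words cosine rather than the method of the cited papers.

```python
import math
import re
from collections import Counter


def bag_of_words(texts):
    """Count lowercase alphanumeric tokens across a list of strings."""
    words = Counter()
    for text in texts:
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return words


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def name_similarity(name_a, name_b):
    """Baseline: compare the record strings themselves."""
    return cosine(bag_of_words([name_a]), bag_of_words([name_b]))


def web_context_similarity(name_a, name_b, fetch_snippets):
    """Compare the web text retrieved for each record (fetch_snippets is injected)."""
    return cosine(bag_of_words(fetch_snippets(name_a)), bag_of_words(fetch_snippets(name_b)))


# Invented snippets standing in for search results.
FAKE_SNIPPETS = {
    "Honda Fit": ["honda fit five door hatchback subcompact car",
                  "fit hatchback fuel economy review"],
    "Honda Jazz": ["honda jazz five door hatchback subcompact car",
                   "jazz hatchback fuel economy review"],
    "Apple iPod Nano 4GB": ["ipod nano portable music player flash memory"],
}


def fake_fetch(name):
    return FAKE_SNIPPETS[name]


print(name_similarity("Honda Fit", "Honda Jazz"))
print(web_context_similarity("Honda Fit", "Honda Jazz", fake_fetch))
print(web_context_similarity("Honda Fit", "Apple iPod Nano 4GB", fake_fetch))
```

Passing the fetch function in as a parameter keeps the comparison logic testable offline; a real system would plug in a search API and would likely weight terms (for example with TF-IDF) before clustering.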
Multi(ple) Extensions
• Multimodal Alignment
– Lyrics with Audio (ACM MM 04)
– Slides with Paper (JCDL 07)
• Current and future work:
– Extracted Terminology with User Tagging
[Figure panels: Text in Focus, Slide in Focus]
Focusing on the User
• Understanding user searches better
– Known item search (JCDL 2005)
– Faceted classification of web queries (WebQ 2007)
• Building better user interfaces (Jesse Prabawa)
– Revisiting library catalog interfaces to better support searching (JCDL 2007)
Putting it all together
• We’re building a niche academic research repository
– e.g., MS Libra, CiteSeer, DBLP, Google Scholar
• What? Another one? What’s the catch?
– User interaction and community involvement are central
– Overcome faults of imperfect machine learning
– Platform for researching how web-scale NLP can actively involve user feedback, and mechanisms for channeling it
• What about Web NLP / IR?
– My group emphasizes practical outcomes and deliverables
– Find research within industry and practical problems
– Multilingual, multimedia, web-as-data angles likely to continue
Other pointers (NUS-wide)
• Text Processing Seminar (with archived slides)
http://wing.comp.nus.edu.sg/chimetext
• Machine Learning (Graphical Models) Reading Group
http://groups.google.com/group/mlnus/
• NLP Reading Group
http://wing.comp.nus.edu.sg/NLPReading/index.php/Main_Page
<AD>
Shameless plug for my group: http://wing.comp.nus.edu.sg
</AD>
Thanks for listening!