ASQA: Academia Sinica

Implications of Web 2.0 on
Information Research
Wen-Lian Hsu
Academia Sinica, Taiwan
Wen-Lian Hsu (許聞廉), Institute of Information Science, Academia Sinica (中央研究院資訊所)
1
Outline
 What is Web 2.0?
 Web 2.0 and Research
 Human-based Computation
 Folksonomy (Social Tagging)
 Academic Data Analysis
 GEO-Info
 Conclusion
2
3
What is Web 2.0?
 Web 2.0 Conference (October 2004)
 Tim O'Reilly
 The Web As a Platform
 Harnessing Collective Intelligence
 Data is the Next Intel Inside
 End of the Software Release Cycle
 Lightweight Programming Models
 Software Above the Level of a Single Device
 Rich User Experiences
4
Key Web 2.0 services/applications
 Blogs
 Wikis
 Tagging and social bookmarking
 Multimedia sharing
 RSS and syndication
 Podcasting
 P2P
5
Social Bookmarking
Source: http://funp.com/push/
6
Source: http://digg.com/
Source: http://www.hemidemi.com/
7
[Figure: blog page screenshot with labeled parts: social bookmark, AdSense, blog content, comments.]
Source: http://carol.bluecircus.net/
8
Skype
Source: S.A Baset, H. Schulzrinne (September 14, 2004).
An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol. Technical Report. Columbia University.
9
Wikipedia
10
Second Life
11
Symbiosis (共生機制) is the Key
Blog
Social bookmark
12
The Web Changes in Several Dimensions
 Dynamics
 Heterogeneity
 Collaboration
 Composition
 Socialization
13
Current Research Activities
 Information Retrieval on Blogs
 NTCIR-7 CLIRB (Cross-Lingual Information Retrieval for Blog)
 Question Answering on Blogs
 TREC 2007 QA Track
 Question Answering on Wikipedia
 QA@CLEF 2007
 CLEF 2006 WiQA
 given a Wikipedia page, locate information snippets in Wikipedia
 PASCAL Ontology Learning Challenge
 Ontology construction
 Ontology extension
 Ontology population
 Concept naming
 LinkKDD2006, Textlink2007, MRDM2007
14
International Competition
 1st place (of 9) in the NTCIR5 2005 CLQA Chinese Question Answering Contest (44.5%)
 1st place (of 13) in the WS CityU closed track of the SIGHAN 2006 Word Segmentation Contest (97.2%)
 2nd place (of 10) in the WS CKIP closed track of the SIGHAN 2006 Word Segmentation Contest (95.7%)
 2nd place (of 8) in the NER CityU closed track of the SIGHAN 2006 Named Entity Recognition Contest (88%)
 1st place in the NTCIR6 2006 CLQA Chinese Question Answering Contest (55.3%)
 1st place in the NTCIR6 2006 CLQA English-Chinese Question Answering Contest (34%)
15
Factoid Questions
 PERSON:
請問芬蘭第一位女總統為誰？
Who is Finland's first woman president?
 LOCATION:
請問狂牛症最早起源於何國？
In which country did mad cow disease originate?
 ORGANIZATION:
請問收購南韓三星汽車的外國廠商為何？
Which foreign corporation bought South Korea's Samsung Motors?
 TIME
 NUMBER
 ARTIFACT
16
IASL QA Architecture
[Figure: IASL QA architecture. Blocks: Question Processing, Passage Retrieval, Answer Extraction, Answer Ranking, Answers. Components labeled in the diagram: SVM, InfoMap, ME, Mencius, AutoTag, Lucene, Filter. Data: documents, word index, char index.]
17
Chinese Question Taxonomy
for NTCIR CLQA Factoid Question Answering
18
Knowledge Representation of Chinese
Questions
Chinese Question:
2004年奧運在哪一個城市舉行?
(In which city were the Olympics held in 2004?)
[5 Time]:[3 Organization]:[7 Q_Location]:([9 LocationRelatedEvent])
19
QC by SVM
 Two types of features are used for Chinese question classification (CQC)
 Syntactic features
 Bag-of-Words
 character-based bigram (CB)
 word-based bigram (WB)
 Part-of-Speech (POS)
 AUTOTAG
 POS tagger developed by CKIP, Academia Sinica
 Semantic Features
 HowNet Senses
 HowNet Main Definition (HNMD)
 HowNet Definition (HND)
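As a rough illustration of the syntactic features listed above, the following Python sketch extracts character-based bigrams (CB) and word-based bigrams (WB) from a question. The function names and the hand-made word segmentation are assumptions for illustration only; the slides do not show the actual feature extractor or the SVM configuration.

```python
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count character-based bigrams (CB), ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def word_bigrams(words: list[str]) -> Counter:
    """Count word-based bigrams (WB) over an already segmented question."""
    return Counter(zip(words, words[1:]))


question = "2004年奧運在哪一個城市舉行?"
# Hypothetical segmentation; a real system would use a word segmenter.
segmented = ["2004年", "奧運", "在", "哪", "一", "個", "城市", "舉行", "?"]

cb = char_bigrams(question)
wb = word_bigrams(segmented)
```

The resulting counts could then be mapped to a sparse feature vector and handed to any SVM implementation.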
20
Question Classification Accuracy
[Figure: bar chart of Chinese Question Classification (CQC) accuracy for three approaches: Machine Learning (SVM), Knowledge-based (INFOMAP), and Hybrid (SVM + INFOMAP). Accuracy values labeled on the bars: 88.0%, 92.0%, 73.5%.]
21
Answer Extraction
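Retrieved passages:
廿一世紀美國總統 (US president of the 21st century)
總統父子檔美國第二對 (America's second father-and-son pair of presidents)
美國總統性事錄 (Record of US presidents' sex affairs)
翻開美國總統傳訊史 (A look at the history of US presidents being subpoenaed)
美國總統匆忙赴晚宴 (US president hurries to a dinner banquet)
陸文斯基瘋狂愛上美國總統 (Lewinsky fell madly in love with the US president)
美國總統大選選舉人票分析 (Analysis of electoral votes in the US presidential election)
前越南總統阮文紹病逝美國 (Former Vietnamese president Nguyen Van Thieu dies of illness in the US)
美國總統柯林頓表示 (US President Clinton said)
Candidates after Answer Extraction (Mencius, Filter):
陸文斯基 (Lewinsky)
阮文紹 (Nguyen Van Thieu)
柯林頓 (Clinton)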
22
Templates generated by local alignment
 ..因/Cbb/O 台中縣/Nc/LOC 議長/Na/OCC 顏清標/Nb/PER 涉嫌/VK/O..
.. 清朝/Nd/O 台灣/Nc/LOC 巡撫/Na/OCC 劉銘傳/Nb/PER 所/D/O..
 LOC OCC PER
(template contains only NEs)
 被/P/O 大陸/Nc/LOC 國家/Na/O 主席/Na/OCC 江澤民/Nb/O 形容為/VG/O..
/COMMA/O 香港/Nc/LOC 行政/Na/O 長官/Na/OCC 董建華/Nb/PER 近日..
俄羅斯/Nc/LOC 男子/Na/O 選手/Na/OCC 史莫契柯夫/Nb/O 在/P/O..
 LOC Na OCC Nb
(template contains POS tags)
 由/P/O 建業/Nc/O 所長/Na/OCC 張龍憲/Nb/PER 擔任/VG/O
由/P/O 安侯/Nb/O 所長/Na/OCC 魏忠華/Nb/PER 擔任/VG/O
 由 N 所長 PER 擔任
(template contains partial POS tags and words)
 在/P/O 卡達首都/Nc/LOC 多哈/D/PER,LOC 舉行/VC/O
於/P/O 國父紀念館/Nc/ORG 舉行/VC/O
在/P/O 國父紀念館/Nc/ORG 廣場/Nc/O 舉行/VC/O
 P Nc – 舉行
(template with gap ‘–’)
23
Answer Extraction from Template
 Question: 誰是台灣國防部長？ (Who is Taiwan's Minister of National Defense?)
 Q-Type: PERSON Q-KEYWORD: 台灣國防部長 (Taiwan Minister of National Defense)
 Tagged Passages
 前任/A/O 美國/Nc/LOC 國防部長/Na/OCC 溫柏格/Nb/PER 認為/VE/O ，/COMMACATEGORY/O
 美國/Nc/LOC 國防部長/Na/OCC 柯恩/Nb/PER 今天/Nd/O 表示/VE/O ，/COMMA/O 華府/Nc/ORG,LOC 當局/Na/O 正/D/O 設法/VF/O 釐清/VC/O 台灣/Nc/LOC
 【/PAR/O 路透/Nb/ORG 東京/Nc/LOC 十九日/Nd/TIME 電/VC/ART 】/PAREN/O 台灣/Nc/LOC 國防部長/Na/OCC 唐飛/Nb/PER 昨天/Nd/O
Template matching and Relation building
 Template: LOC OCC PER
 Relation:
 美國, 國防部長, 溫柏格, 柯恩 (US, Minister of Defense: Weinberger, Cohen)
 台灣, 國防部長, 唐飛 (Taiwan, Minister of Defense: Tang Fei)
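The template-matching and relation-building step on this slide can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `Token`, `parse`, and `match` names are hypothetical, the template is matched against named-entity labels in a contiguous window, and the tagged passages are shortened versions of the ones on the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    ne: frozenset  # named-entity labels; a token may carry several, e.g. ORG,LOC


def parse(tagged: str) -> list[Token]:
    """Parse space-separated 'word/POS/NE' items."""
    tokens = []
    for item in tagged.split():
        word, pos, ne = item.rsplit("/", 2)
        tokens.append(Token(word, pos, frozenset(ne.split(","))))
    return tokens


def match(tokens: list[Token], template: list[str]) -> list[tuple[str, ...]]:
    """Return the words of every contiguous window whose NE labels fit the template."""
    hits = []
    n = len(template)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(label in tok.ne for label, tok in zip(template, window)):
            hits.append(tuple(tok.word for tok in window))
    return hits


passages = [
    "前任/A/O 美國/Nc/LOC 國防部長/Na/OCC 溫柏格/Nb/PER 認為/VE/O",
    "美國/Nc/LOC 國防部長/Na/OCC 柯恩/Nb/PER 今天/Nd/O 表示/VE/O",
    "台灣/Nc/LOC 國防部長/Na/OCC 唐飛/Nb/PER 昨天/Nd/O",
]

# Relation building: (LOC, OCC) -> list of PER
relations = defaultdict(list)
for passage in passages:
    for loc, occ, per in match(parse(passage), ["LOC", "OCC", "PER"]):
        relations[(loc, occ)].append(per)

# The Q-KEYWORD 台灣國防部長 splits into LOC 台灣 and OCC 國防部長.
answers = relations[("台灣", "國防部長")]
```

Looking up the question keyword in the relation table then returns the PERSON candidates for that location and occupation.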
24
Answer Extraction from Template
 Question: 黛安娜王妃的死亡車禍事故發生在哪裡？ (Where did Princess Diana's fatal car accident happen?)
 Q-TYPE: LOCATION Q-KEYWORD: 黛安娜王妃死亡車禍事故發生 (Princess Diana, fatal car accident, happen)
 Tagged Passages
 .. 則/D/O 把/P/O 英國/Nc/LOC 黛安娜/Nb/PER 王妃/Na/O 的/DE/O 巴黎/Nc/LOC 死亡/VH/O 車禍/Na/O ，/COMMA/O 搬上/VC/O 舞台/Na/O ..
 .. 英國/Nc/LOC 王妃/Na/O 黛安娜/Nb/PER 離開/VC/O 人世/Nc/O 四個多月/Nd/TIME ..
Template matching and Relation building
 Template:
 PER Na DE LOC – Na
 LOC Na PER - VC
 Relation:
 黛安娜/PER, 王妃/Na, 巴黎/LOC, 車禍/Na (Diana, princess, Paris, car accident)
 英國/LOC, 黛安娜/PER, 王妃/Na, 離開/VC (UK, Diana, princess, leave)
25
Answer Ranking
 Features are combined as a weighted sum
 Answer Ranking Features
 IR Score
 Answer Frequency (voting)
 * QFocus adjacency:
 “美國總統[布希]表示”
 “前往[惠氏藥廠]參觀”
 * Question Term and Answer Term (QAT) Co-occurrence
 * Answer Template
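A weighted-sum combination of the ranking features above might look like the sketch below. The weights, the feature values, and the `score`/`rank` helper names are all made up for illustration; the slides do not give the actual weights or how they were tuned.

```python
# Hypothetical weights; the slides do not give the actual values.
WEIGHTS = {
    "ir_score": 0.3,
    "answer_frequency": 0.2,
    "qfocus_adjacency": 0.2,
    "qat_cooccurrence": 0.2,
    "answer_template": 0.1,
}


def score(features: dict[str, float]) -> float:
    """Weighted sum of feature values; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Order candidate answers by descending combined score."""
    return sorted(candidates, key=lambda a: score(candidates[a]), reverse=True)


# Made-up feature values in [0, 1] for three candidate answers.
candidates = {
    "柯林頓": {"ir_score": 0.8, "answer_frequency": 0.9, "qfocus_adjacency": 1.0,
              "qat_cooccurrence": 0.7, "answer_template": 1.0},
    "陸文斯基": {"ir_score": 0.7, "answer_frequency": 0.3, "qfocus_adjacency": 0.0,
               "qat_cooccurrence": 0.4, "answer_template": 0.0},
    "阮文紹": {"ir_score": 0.5, "answer_frequency": 0.1, "qfocus_adjacency": 0.0,
              "qat_cooccurrence": 0.2, "answer_template": 0.0},
}
ranking = rank(candidates)
```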
26
Web 2.0 and Research
 Human-based Computation
 Folksonomy (Social Tagging)
 Academic Data Analysis
 GEO-Info
27
Human-based Computation
28
Human-based Computation
 Social Search
 wayfinding tools informed by human judgment
 CAPTCHA
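 reversed Turing test (in a Turing test a human questions the system; here the system questions the user)
 Interactive Genetic Algorithm (IGA)
 a genetic algorithm informed by human judgment.
 the fitness function results are provided by humans
 Example: drawing a criminal's portrait. The system generates suspect portraits with a GA, an eyewitness scores which ones look more like the criminal, and the process repeats until the portrait approaches the criminal's appearance.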
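The interactive genetic algorithm loop described above can be sketched as follows. In a real IGA the `rate` callback would ask a person for a score; here a simulated eyewitness stands in so the example runs unattended. The genome encoding (eight binary "facial features"), the population size, and the mutation rate are arbitrary assumptions.

```python
import random
from typing import Callable

Genome = list[int]


def interactive_ga(rate: Callable[[Genome], float], genome_len: int = 8,
                   pop_size: int = 10, generations: int = 20,
                   seed: int = 0) -> Genome:
    """Evolve genomes using ratings from `rate`, which stands in for a human judge."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # The "human" scores every candidate; higher means a closer likeness.
        ranked = sorted(population, key=rate, reverse=True)
        parents = ranked[:pop_size // 2]
        children = [ranked[0]]  # elitism: keep the best-rated candidate
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]            # one-point crossover
            i = rng.randrange(genome_len)
            if rng.random() < 0.2:
                child[i] = 1 - child[i]          # occasional bit-flip mutation
            children.append(child)
        population = children
    return max(population, key=rate)


# Simulated eyewitness: rates a portrait by how many features match memory.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]


def simulated_witness(genome: Genome) -> float:
    return sum(g == t for g, t in zip(genome, TARGET)) / len(TARGET)


best = interactive_ga(simulated_witness)
```

The only difference from an ordinary genetic algorithm is where the fitness values come from, which is why the rating function is passed in as a parameter.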
29
CAPTCHA
Completely Automated Public Turing test to tell Computers and
Humans Apart
 A CAPTCHA is a type of challenge-response
test used in computing to determine whether the
user is human.
wikipedia
SOURCE: http://recaptcha.net/
30
CAPTCHA
[Figure: diagram connecting recognized text and unrecognized text with CAPTCHAs embedded in blogs.]
31
The ESP Game
 a two-player game
 The goal is to guess what your partner is typing for each image.
 Once you both type the same word(s), you score points.
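The agreement rule of the game can be sketched in a few lines. This is a simplified model, assuming both players submit one guess per round and that a label is accepted as soon as any word has been typed by both; the function name and the sample guesses are hypothetical.

```python
from typing import Optional


def first_agreement(guesses_a: list[str], guesses_b: list[str]) -> Optional[str]:
    """Return the first word that both players have typed, or None.

    Guesses are processed round by round; players never see each other's words.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip(guesses_a, guesses_b):
        seen_a.add(a.strip().lower())
        seen_b.add(b.strip().lower())
        common = seen_a & seen_b
        if common:
            # If several words match in the same round, pick one deterministically.
            return sorted(common)[0]
    return None


label = first_agreement(["dog", "grass", "puppy"], ["animal", "dog", "ball"])
```

Each agreed word becomes a label for the image, which is how the game turns play into image annotation.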
ESP
32
Source: http://www.espgame.org/
The Phetch Game
Play as a describer
33
The Phetch Game
Play as a seeker
Phetch
34
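How about a game for describing idioms?
壞事做太多 (has done too many bad deeds)
高抬貴手 (show mercy; literally "raise your noble hand")
不動如山 (immovable as a mountain)
罄竹難書 (misdeeds too numerous to record)
如沐春風 (like bathing in a spring breeze)
罄竹難書: 壞事做太多 (misdeeds too numerous to record: has done too many bad deeds)
虎頭蛇尾: 做事沒有毅力 (tiger's head, snake's tail: lacks perseverance in getting things done)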
………
35
Folksonomy (Social Tagging)
36
Folksonomy (Social Tagging)
 Also known as social tagging, collaborative
tagging, social classification, social indexing
 Folksonomy is the practice and method of
collaboratively creating and managing tags to
annotate and categorize content.
Wikipedia
37
38
del.icio.us
Tags: Descriptive words
applied by users to links.
Tags are searchable
My Tags: Words I’ve used
to describe links in a way
that makes sense to me
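The tagging model described on this slide (users apply tags to links, tags are searchable, and each user has a personal tag set) can be pictured with a small in-memory structure. The `TagStore` class and the sample data are hypothetical and are not how del.icio.us is implemented.

```python
from collections import defaultdict


class TagStore:
    """Minimal in-memory store of (user, link, tag) annotations."""

    def __init__(self) -> None:
        self._by_tag = defaultdict(set)    # tag -> links carrying that tag
        self._by_user = defaultdict(set)   # user -> tags that user has applied

    def add(self, user: str, link: str, tag: str) -> None:
        tag = tag.lower()
        self._by_tag[tag].add(link)
        self._by_user[user].add(tag)

    def search(self, tag: str) -> set:
        """All links that any user has tagged with `tag`."""
        return set(self._by_tag.get(tag.lower(), set()))

    def my_tags(self, user: str) -> set:
        """The tags this user has applied ("My Tags")."""
        return set(self._by_user.get(user, set()))


store = TagStore()
store.add("alice", "http://digg.com/", "news")
store.add("bob", "http://www.hemidemi.com/", "bookmarks")
store.add("bob", "http://digg.com/", "News")
```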
39
Semantic Web
Source: Tim Berners-Lee
40
Using Folksonomy to Help Semantic
Web
 Top-down Semantic Annotation
 Approach
 Define an ontology first
 Use the ontology to add semantic markups to web
resources.
 The semantics is provided by the ontology which is
shared among different web agents and applications.
 Problem
 Negotiation
 Evolution (hard to maintain)
 High Barrier (background)
Source: Xian Wu, Lei Zhang, Yong Yu. “Exploring Social Annotations for the Semantic Web”
41
Using Folksonomy to Help Semantic
Web
 Bottom-up approach with social tagging
 Advantage
 No common ontology or dictionary is needed
 Easy to access
 Sensitive to information drift
 Disadvantage
 Ambiguity Problem: For example, “XP” can refer to either
“Extreme Programming” or “Windows XP”.
 Group Synonymy Problem: two seemingly different
annotations may bear the same meaning.
Source: Xian Wu, Lei Zhang, Yong Yu. “Exploring Social Annotations for the Semantic Web”
42
Or Folksonomy is the Solution?
 Ontology is Overrated
 Classification of the web has failed
 Classification itself is filled with bias and error
 Tagging is the solution
Source: http://www.shirky.com/writings/ontology_overrated.html
43
Academic Data Analysis
44
Academic Data Analysis
 Users participate and interact with data and people
 e-Lib, Lib 2.0: the concept is being added to applications, so search platforms provide open APIs for collecting more data
 Add My Library, Tag (e.g. Citeulike, BibSonomy)
 Add Comments, Rating, Recommendation (e.g. TechLens)
 Domain Focus Groups (e.g. Botanicus)
 Arxiv, Google Scholar, Windows Live Academic Search, CiteSeer, PubMed (citation index)
 Papers, journal/conference, authors
45
An Example
 Let’s use TechLens as an example to imagine what research on IR/NLP can do.
Authors
Readers
Papers
46
The Terminology
Entities
References
Aho, A. V.
Alfred V Aho
Alfred Aho
AV Aho
Alfred Aho, John Hopcroft, Jeffrey Ullman
Links
AV Aho, BW Kernighan, PJ Weinberger
Entity Groups
G1
(Programming Languages)
G3
(Algorithms)
G2
(Databases)
47
Imagine how we can make use of them
Papers
Reference
Extraction
Entity
Resolution
Authors
Rating
Comments
Readers
48
New Research Topics
 Given those changes, a key emerging challenge for “Data Mining” is dealing with richly structured data, finding patterns behind heterogeneous datasets, etc.
 Several research areas focus on those problems, such as
 (Social) Network Analysis
 Link Mining
 PASCAL Ontology Learning Challenge
 …
49
Society
Nodes: individuals (Authors, Readers)
Links: social relationship
(family/work/friendship/belong to,…etc.)
S. Milgram (1967)
Six Degrees of Separation,
John Guare
Science
Social networks: Many individuals with
diverse social interactions between them.
50
source: www.cs.uiuc.edu/~hanj
Communication networks
The Earth is developing an electronic nervous system, a network with diverse nodes and links:
- nodes: computers, routers, satellites
- links: phone lines, TV cables, EM waves
Artifacts in TechLens:
- nodes: papers, user IP, comments, responses, …
- links: relations between artifacts
Communication networks: many nonidentical components with diverse connections between them.
51
source: www.cs.uiuc.edu/~hanj
Link-based Object Ranking
 Perhaps the most well known link mining task is that of link-based
object ranking (LBR), which is a primary focus of the link analysis
community. The objective of LBR is to exploit the link structure of
a graph to order or prioritize the set of objects within the graph.
 Example
 PageRank
 What paper is most important in this area?
 What journal/conference is most important in this area?
 What topic is important in this area?
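As a concrete instance of link-based ranking, here is a small power-iteration PageRank over a toy citation graph. The graph, the damping factor of 0.85, and the even redistribution of rank from papers with no outgoing citations are illustrative assumptions, not details taken from the slides.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank; `links` maps each node to the nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical citation graph: paper -> papers it cites.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(citations)
most_important = max(ranks, key=ranks.get)
```

In this toy graph paper D ends up ranked highest even though only one paper cites it, because that citation comes from the well-cited paper C.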
52
Link-based Object Classification / Link-based Classification (LBC)
 Predicting the category of an object based on its
attributes and its links and attributes of linked
objects
 Web: Predict the category of a web page, based on words that
occur on the page, links between pages, anchor text, html tags,
etc.
 Citation: Predict the topic of a paper, based on word
occurrence, citations, co-citations
 Epidemics: Predict disease type based on characteristics of the people; predict a person’s age based on the ages of people they have been in contact with and the disease type
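A very simple baseline for link-based classification is a majority vote over the labels of linked objects. The sketch below uses only the links and ignores the object's own attributes, so it covers just one part of the task described above; the function name and the toy citation data are hypothetical.

```python
from collections import Counter
from typing import Optional


def classify_by_links(node: str, links: dict[str, list[str]],
                      labels: dict[str, str]) -> Optional[str]:
    """Predict a node's category as the most common label among its linked nodes."""
    votes = Counter(labels[n] for n in links.get(node, []) if n in labels)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical data: a new paper cites three papers with known topics.
cites = {"new_paper": ["p1", "p2", "p3"]}
topics = {"p1": "IR", "p2": "IR", "p3": "Databases"}
predicted = classify_by_links("new_paper", cites, topics)
```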
53
Group Detection
 Cluster the nodes in the graph into groups that share common characteristics. That is, predict when a set of entities belongs to the same group by clustering on both object attribute values and link structure.
 Web: identifying communities
 Citation: identifying research communities
54
Entity Resolution
 Predicting when two objects are the same,
based on their attributes and their links
 Web: predict when two sites are mirrors of each
other
 Citation: predicting when two citations are
referring to the same paper
 Epidemics: predicting when two disease strains are
the same
 Biology: learning when two names refer to the
same protein
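For the citation case, an attribute-only first pass at entity resolution is to normalize author references to a common key and group them. The sketch below uses the name variants from the terminology slide; the `name_key` function and its "last name plus first initial" rule are assumptions for illustration, and a real system would also use links such as co-authors to avoid merging distinct people.

```python
import re
from collections import defaultdict


def name_key(reference: str) -> tuple[str, str]:
    """Reduce an author reference to (last name, first initial), lowercased."""
    ref = reference.strip()
    if "," in ref:                       # "Aho, A. V." puts the last name first
        last, rest = (part.strip() for part in ref.split(",", 1))
    else:                                # "Alfred V Aho", "AV Aho"
        parts = ref.split()
        last, rest = parts[-1], " ".join(parts[:-1])
    initial = re.sub(r"[^A-Za-z]", "", rest)[:1]
    return last.lower(), initial.lower()


references = ["Aho, A. V.", "Alfred V Aho", "Alfred Aho", "AV Aho", "John Hopcroft"]

groups = defaultdict(list)
for ref in references:
    groups[name_key(ref)].append(ref)
```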
55
Link Prediction
 Predict whether a link exists between two
entities, based on attributes and other observed
links
 Web: predict if there will be a link between two
pages
 Citation: predicting if a paper will cite another
paper, or predict the venue type of a publication
(conference, journal, workshop) based on properties
of the paper
 Epidemics: predicting who a patient’s contacts are
(in epidemiology, one needs to find the source of the disease or infection)
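One common baseline for link prediction scores each unlinked pair by the number of neighbors the two nodes share. The sketch below implements that common-neighbors heuristic on a toy undirected graph; it uses only observed links and no attributes, and the function name and edge list are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Score each currently unlinked pair by its number of shared neighbors."""
    neighbors: dict[str, set] = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b not in neighbors[a]:
            scores[(a, b)] = len(neighbors[a] & neighbors[b])
    return scores


# Hypothetical undirected graph, e.g. co-authorship between five authors.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
scores = common_neighbor_scores(edges)
```

Pairs with higher scores, here A–D and B–C with two shared neighbors each, would be the first candidates for a predicted link.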
56
Other Possible Research Directions
 Expert Finding
 e.g. suggesting paper reviewers or conference committee members
 Ecological Evolution of Some Research
 e.g. one topic with different solutions over a time period
 A domain’s topic distribution
57
GEO-Info (Geographic Information)
58
GEO-Info
 User participation: Google Earth Community, Google Earth Blog, Ogle Earth, …
 GML, photo sharing, user annotation
 Google Earth/Map: open for everyone
 GIS: limited users, limited usage
59
Some Research Topics
 A lot of information can now be combined into Google Earth/Map through KML.
 Because such information can be integrated by geocoding, some models become very interesting, such as
 Photo Annotation, Sharing, and Search
 Live information
 Planning
 3D, Flights Animation
 Travel experience, comments
 Transportation information, survival information
 Climate Change
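To make the KML point concrete, the sketch below builds a minimal KML document with a single Placemark using Python's standard library. The element names follow the KML 2.2 schema; the `placemark_kml` helper, the description text, and the approximate coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name: str, description: str, lon: float, lat: float) -> str:
    """Build a minimal KML document containing one Placemark."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    pm = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
    ET.SubElement(pm, f"{{{KML_NS}}}name").text = name
    ET.SubElement(pm, f"{{{KML_NS}}}description").text = description
    point = ET.SubElement(pm, f"{{{KML_NS}}}Point")
    # KML orders coordinates as longitude,latitude[,altitude].
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Coordinates are approximate and for illustration only.
kml_text = placemark_kml("Academia Sinica", "Photo tags: campus", 121.615, 25.043)
```

A file with this content can be opened in Google Earth, which is how geocoded photos, comments, or other annotations end up layered on the map.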
60
Some Information bundled with Google Earth/Map (Zhongshan Park, 中山公園)
Photo sharing (photo & tags)
Integrated with YouTube (video & tags)
61
Some Applications Integrate More Information on the Map
Personal Life Information Integration
GeoDDupe: A Novel Interface for Interactive Entity Resolution in Geospatial Data
62
Photo link with Map
Source: http://www.panoramio.com
63
Image-based Rendering (IBR)
 IBR relies on a set of two-dimensional images
of a scene to generate a three-dimensional
model and then render some novel views of this
scene.
 Web 2.0 enables sharing of photographs on a
truly massive scale
64
Microsoft PhotoSynth
 From SIFT to PhotoSynth
65
Conclusion
 Research results can be easily integrated on the Web 2.0 platform
 make restricted-domain research more useful for the public (such
as image-based rendering)
 Software agent
 Benefit human-based computation
 Certain research topics will be easier to tackle, such as
personalization in virtual world (more data available)
 Data becomes more task oriented (e.g. Wikipedia)
 More versatile data networks available
66
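Research Assistants Wanted (alternative military service applicants welcome)
1. Graduate degree in a computer science related field.
2. Able to read research papers in English.
3. Enthusiastic about research in Chinese natural language processing (the Natural Input Method, question answering systems) or bioinformatics (bioinformatics algorithms, biomedical literature retrieval and analysis).
4. Familiar with any of the following programming languages: C/C++/C#/Java, with problem-solving ability.
5. For input-method positions, experience with either of the following is a plus: WinCE/Win32 API.
6. Good at communication and teamwork.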
67
Acknowledgement
 I would also like to thank two of my Ph.D. students who helped organize the slides:
 李政緯，呂俊宏
68
Thank You
69
