name - big humanities


Chao-Lin Liu, Chih-Kai Huang,
Hongsu Wang, Peter K. Bol
National Chengchi University, Taiwan
Harvard University, USA
29 October 2015
Enhance the CBDB contents by mining and discovering more biographical information from multiple sources with computational methods
 Readily extensible to social network analysis
 Difangzhi (local gazetteers) for now

 China Biographical Database
 Local gazetteers of China
 Language models
 Conditional-random-field models
 Discussions



CBDB (中國歷代人物傳記資料庫, China Biographical Database)
URL: http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page76535
Short URL: http://goo.gl/hCUKpR
Open and free database of about 360,000 Chinese individuals, ranging from the 7th to the 19th century
地方志 (Difangzhi)
 Local gazetteers compiled by Chinese governments
 A big collection of biographical information (and other information), compiled since the 6th century

ifyouhadascreenplaytosellorbetteryetthenextbigtec
hstartuptopitchthewhitehouseseastroomwasthepla
cetobeonfridaynightgatekeepersofthesilverscreenan
dsiliconvalleywereoutinfullforceatthestatedinnerinh
onorofchinesepresidentxijinpinggiantsoftheindustry
includingfacebookceomarkzuckerbergandappleceoti
mcookrubbedelbowswithrobertigerceoofthewaltdis
neycompanyandjeffreykatzenbergceoofdreamworks
animationandwithallthecorporatetitansinattendanc
eyoudthinkthenightwouldbeallbusinessbutwhenask
edtopredictthebiggestitemontheeveningsagendakat
zenbergsaidfunihope
SOURCE: WASHINGTON POST:
https://www.washingtonpost.com/news/reliable-source/wp/2015/09/25/state-dinner-recap-heavy-on-silicon-valley-and-the-silver-screen/
If you had a screenplay to sell, or better yet the next big tech start-up
to pitch, the White House’s East Room was the place to be on Friday
night.
Gatekeepers of the silver screen and Silicon Valley were out in full
force at the state dinner in honor of Chinese President Xi Jinping.
Giants of the industry, including Facebook CEO Mark Zuckerberg and
Apple CEO Tim Cook, rubbed elbows with Robert Iger, CEO of the Walt
Disney Company, and Jeffrey Katzenberg, CEO of DreamWorks
Animation.
And with all the corporate titans in attendance you’d think the night
would be all business. But when asked to predict the biggest item on
the evening’s agenda Katzenberg said, “Fun. I hope.”
SOURCE: WASHINGTON POST:
https://www.washingtonpost.com/news/reliable-source/wp/2015/09/25/state-dinner-recap-heavy-on-silicon-valley-and-the-silver-screen/
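Gazetteer passage with its ○ marks, in reading order (the slide prints it in vertical columns, read top to bottom and right to left):
○不知勞洪武元年楊璟取廣西吉尼堅壁不下○城破執送京師不屈死郡人感其德立廟祀之○陳瑜○字仲庸雷州人廣西中書省都事城破以佩刀自刎○有劉永錫者潭州人與瑜同事率妻子溺於白龍池○死○焉○曾尚賓○江西人為義兵千户洪武元年明兵圍靜○江尚賓守西城城陷身中數鎗知不敵自○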
Same passage with the ○ marks removed, in reading order:
不知勞洪武元年楊璟取廣西吉尼堅壁不下城破執送京師不屈死郡人感其德立廟祀之陳瑜字仲庸雷州人廣西中書省都事城破以佩刀自刎有劉永錫者潭州人與瑜同事率妻子溺於白龍池死焉曾尚賓江西人為義兵千户洪武元年明兵圍靜江尚賓守西城城陷身中數鎗知不敵自
Difangzhi text files
 annotate texts with <NAME>, <ADDRESS>, <ENTRY>, <OFFICE>, and <NIANHAO>, using CBDB NAME, ADDRESS, ENTRY, OFFICE, and NIANHAO data
 analyze the label sequences; prefer frequent and consistent sequences that contain <NAME> and diversified labels to create filter patterns
 parse the text segments that are covered by the selected sequences to obtain the desired records
 extracted records
陳瑜字仲庸雷州人廣西中書省都事
陳瑜 (Yuan) 雷州 廣西 中書省都事 (Yuan)
陳瑜 (Ming) 雷州 廣西 中書 (Ming)
陳瑜 (Qing)
<NAME><ADDRESS><ADDRESS><OFFICE>
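The annotation step illustrated above can be sketched as a greedy longest-match dictionary lookup. The tiny dictionaries below are hypothetical stand-ins for the CBDB NAME, ADDRESS, and OFFICE data, and the matching strategy is an assumption for illustration, not necessarily the procedure used in this work. The dynasty check shown on the slide (Yuan, Ming, Qing candidates) is not modeled here.

```python
# Toy stand-ins for CBDB tables (hypothetical; the real tables are far larger).
CBDB = {
    "NAME": {"陳瑜"},
    "ADDRESS": {"雷州", "廣西"},
    "OFFICE": {"中書", "中書省都事"},
}

def annotate(text, tables=CBDB):
    """Greedy longest-match labeling; returns (label, string) pairs in text order."""
    # Flatten to (entry, label) pairs and try longer entries first.
    entries = sorted(
        ((entry, label) for label, table in tables.items() for entry in table),
        key=lambda pair: -len(pair[0]),
    )
    spans, i = [], 0
    while i < len(text):
        for entry, label in entries:
            if text.startswith(entry, i):
                spans.append((label, entry))
                i += len(entry)
                break
        else:
            i += 1  # no dictionary entry starts here; skip this character
    return spans

spans = annotate("陳瑜字仲庸雷州人廣西中書省都事")
label_sequence = "".join(f"<{label}>" for label, _ in spans)
print(spans)
print(label_sequence)
```

Trying longer entries first makes 中書省都事 win over its prefix 中書, which mirrors the Yuan reading on the slide.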
n-grams of labels
 Examples
 <NAME><ADDRESS><REIGN PERIOD><ENTRY>
 <NAME><ADDRESS><ENTRY><REIGN PERIOD>
 <NAME><NAME><ADDRESS><ADDRESS>
 <NAME><ADDRESS><ADDRESS><OFFICE>
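One way to find candidate filter patterns is to count label n-grams over the annotated texts and keep the frequent ones that start with NAME. The n-gram length, the frequency threshold, and the toy input below are assumptions for illustration only.

```python
from collections import Counter

def label_ngrams(labels, n):
    """All contiguous n-grams of a label sequence, as tuples."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]

def frequent_name_patterns(sequences, n=4, min_count=2):
    """Count n-grams over all sequences; keep frequent ones that start with NAME."""
    counts = Counter(g for seq in sequences for g in label_ngrams(seq, n))
    return {g: c for g, c in counts.items() if c >= min_count and g[0] == "NAME"}

# Hypothetical label sequences, as an annotation step might produce them.
sequences = [
    ["NAME", "ADDRESS", "ADDRESS", "OFFICE"],
    ["NAME", "ADDRESS", "ADDRESS", "OFFICE", "NIANHAO"],
    ["NAME", "ADDRESS", "NIANHAO", "ENTRY"],
]
patterns = frequent_name_patterns(sequences)
print(patterns)
```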

<name><address><address><office>
<name> 字 Z1 Z2 <address>
陳瑜字仲庸雷州人廣西中書省都事
<name>陳瑜</name>字仲庸<addr>雷州</addr>人<addr>廣西</addr><office>中書省都事</office>
陳瑜(Yuan) 雷州 廣西 中書省都事(Yuan)
Yuan,陳瑜,仲庸
陳瑜字仲庸雷州人廣西中書
<name>陳瑜</name>字仲庸<addr>雷州</addr>人<addr>廣西</addr><office>中書</office>
陳瑜(Yuan) 雷州 廣西 中書(Yuan)
Yuan,陳瑜,仲庸
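The pattern <name> 字 Z1 Z2 <address> can be written as a regular expression over the tagged text. This is a minimal sketch: the function name is hypothetical, the style name is assumed to be exactly two characters as in the pattern, and the dynasty is assumed to come from the annotation step.

```python
import re

# A tagged name, the character 字, exactly two characters of style name,
# then the start of a tagged address.
STYLE_NAME = re.compile(r"<name>(?P<name>[^<]+)</name>字(?P<style>[^<]{2})<addr>")

def extract_style_record(tagged, dynasty):
    """Return 'dynasty,name,style name', or None if the pattern does not match."""
    m = STYLE_NAME.search(tagged)
    if m is None:
        return None
    return f"{dynasty},{m.group('name')},{m.group('style')}"

tagged = ("<name>陳瑜</name>字仲庸<addr>雷州</addr>人"
          "<addr>廣西</addr><office>中書省都事</office>")
record = extract_style_record(tagged, "Yuan")
print(record)
```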








Extracted records (dynasty, name, style name); 清 = Qing, 唐 = Tang, 宋 = Song, 未詳 = unknown
清,吳贊誠,春帆
清,李滋然,命三
清,衛佐邦,楫臣
清,蔡雲,青岩
唐,高祐,小集
宋,高祐,小集
唐,賈耽,敦詩
唐,薛平,坦途
唐,蕭仿,思正
唐,劉沔,子汪
未詳,劉沔,子汪
宋,裴濟,莊時
唐,裴濟,莊時
宋,曹彬,國華
宋,燕度,唐卿
宋,張君平,士衡
Type  Dynasty  Name  Style Name  Quan.  Prop.
1     ○        ○     ○           609    44.6%
2     ○        ○     ×           665    43.2%
3     ×        ○     ○           117    3.17%
4     ○        ×     ○           262    2.46%
5     ×        ○     ×           220    2.30%
6     ×        ×     ○           45     1.59%
7     ○/×      ×     ×           234    10.87%
○ C1 C2 字 Z1 Z2
○陳瑜○字仲庸
○ C1 C2 C3 字 Z1 Z2
○曾尚賓○江西人
Type  Name  Style Name  Quan.  Prop.
1     ○     ○           1192   31.66%
2     ○     ×           885    23.51%
3     ×     ○           1104   29.32%
4     ×     ×           584    15.51%
A machine learning approach
 Goal: predicting the class for a character
 Given: the character itself and the labels (features) for surrounding characters
 Data
   Training data: 110,000 records extracted from the gazetteers with regular expressions (1.498 million characters)
   Test data: unlabeled raw gazetteer texts (900 thousand characters)
 Tool: MALLET CRF (Univ. of Massachusetts)
 Classes
   NB for name begin; NI for name interior; NE for name end
   AB for address begin; AI for address interior; AE for address end
   O for others
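The class scheme can be illustrated by turning tagged spans into one class per character. This is a minimal sketch with a hypothetical helper; the span offsets are supplied by hand, and the treatment of single-character spans (labeled "begin") is an assumption, since the slides do not say how they are handled.

```python
def char_classes(text, spans):
    """Map each character to one of O, NB/NI/NE, AB/AI/AE.

    spans: list of (start, end, kind), end exclusive,
    with kind "N" for a name and "A" for an address.
    """
    classes = ["O"] * len(text)
    for start, end, kind in spans:
        for i in range(start, end):
            classes[i] = kind + "I"      # interior by default
        classes[start] = kind + "B"      # first character of the span
        if end - start > 1:
            classes[end - 1] = kind + "E"  # last character of the span
    return classes

text = "陳瑜字仲庸雷州人廣西"
spans = [(0, 2, "N"), (5, 7, "A"), (8, 10, "A")]
classes = char_classes(text, spans)
print(list(zip(text, classes)))
```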

Group  Types                                           Description
1      Chinese characters                              self
2      Chinese characters                              surrounding k characters
3      relative positions of selected named entities   office, entry, reign period, and time
4      usage                                           used in person or location name
5      usage                                           family name?
6      named entities                                  office, entry, reign period, and time
Example: feature values for 州 in 陳瑜字仲庸雷州人廣西中書省都事
Group  Types                                           Description                             Feature values
1      Chinese characters                              self                                    州
2      Chinese characters                              surrounding k characters                瑜,字,仲,庸,雷,人,廣,西,中,書
3      relative positions of selected named entities   office, entry, reign period, and time   officeRight@3
4      usage                                           used in person or location name         A probability (discretized)
5      usage                                           family name?                            No
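The Group 1 and Group 2 features for a character can be computed as below. This is a sketch with a hypothetical function name; k = 5 is inferred from the example values for 州 (five characters on each side), not stated on the slides, and Groups 3 to 6 are not modeled.

```python
def char_features(text, i, k=5):
    """Group 1 and 2 features for text[i]: the character and up to k neighbors per side."""
    left = text[max(0, i - k):i]
    right = text[i + 1:i + 1 + k]
    return {"self": text[i], "surrounding": list(left + right)}

sentence = "陳瑜字仲庸雷州人廣西中書省都事"
features = char_features(sentence, sentence.index("州"))
print(features)
```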

5-fold cross validation
        Group 1+2                Group 1+2+4+5+6
Class   Prec.  Recall  F1        Prec.  Recall  F1
O       0.97   0.94    0.95      0.97   0.97    0.97
NB      0.85   0.94    0.89      0.93   0.95    0.94
NI      0.86   0.91    0.88      0.93   0.93    0.93
NE      0.82   0.92    0.87      0.91   0.93    0.92
AB      0.85   0.86    0.86      0.91   0.89    0.90
AI      0.71   0.84    0.77      0.83   0.89    0.86
AE      0.85   0.86    0.86      0.91   0.89    0.90
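For reference, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the printed values; because the printed precision and recall are already rounded, a recomputed F1 can differ from the table in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Spot checks against the Group 1+2 columns (values as printed in the table).
nb_f1 = round(f1_score(0.85, 0.94), 2)  # class NB
ai_f1 = round(f1_score(0.71, 0.84), 2)  # class AI
print(nb_f1, ai_f1)
```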
Instances divided into zones by scores; 1800 instances in all zones except the last
Checking the first 100 samples in each zone
zone  correct  expected
1     97       1746
2     88       1584
3     90       1620
4     81       1458
5     79       1422
6     70       1260
7     77       1386
8     69       1242
9     59       1062
10    59       1011
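The expected counts for zones 1 to 9 are consistent with scaling the sampled accuracy up to the zone size, as the sketch below shows. That reading is an inference from the numbers, and zone 10 is left out because the slides do not give its size.

```python
ZONE_SIZE = 1800   # instances per zone, all zones except the last
SAMPLED = 100      # samples checked per zone

# Correct counts among the first 100 samples, zones 1-9, from the table.
correct = {1: 97, 2: 88, 3: 90, 4: 81, 5: 79, 6: 70, 7: 77, 8: 69, 9: 59}

def expected_correct(n_correct, zone_size=ZONE_SIZE, sampled=SAMPLED):
    """Scale the sampled accuracy up to the whole zone."""
    return round(n_correct * zone_size / sampled)

expected = {zone: expected_correct(c) for zone, c in correct.items()}
print(expected)
```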
Many paragraphs start with person names and other “signals”
 Paragraph boundary identification possible
 Identifying paragraph boundaries is tantamount to finding “owners” of the paragraphs
 This in turn leads to building social networks among persons
○ C1 C2 字 Z1 Z2
○ C1 C2 C3 字 Z1 Z2
 56.3% using any consecutive markers
 73.0%, if the markers are correct
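A rough sketch of marker-based paragraph splitting: treat a two- or three-character string enclosed by consecutive ○ marks as a candidate owner and start a new paragraph there. The regular expression and the function name are assumptions made for illustration; this is not the method behind the percentages above, and it will produce false positives on real text.

```python
import re

# ○, then 2-3 non-marker characters, then another ○ (checked but not consumed).
MARKED_NAME = re.compile(r"○([^○]{2,3})(?=○)")

def split_by_owner(text):
    """Return (owner, paragraph) pairs; each paragraph runs to the next owner."""
    starts = [(m.start(), m.group(1)) for m in MARKED_NAME.finditer(text)]
    paragraphs = []
    for i, (pos, owner) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        paragraphs.append((owner, text[pos:end]))
    return paragraphs

text = ("祀之○陳瑜○字仲庸雷州人廣西中書省都事城破以佩刀自刎○有劉永錫者"
        "潭州人與瑜同事率妻子溺於白龍池○死○焉○曾尚賓○江西人為義兵千户")
paragraphs = split_by_owner(text)
for owner, paragraph in paragraphs:
    print(owner, paragraph)
```

Single characters between marks, such as ○死○, are skipped because the pattern requires at least two characters.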

Language-model-based methods and machine learning methods are useful for extracting biographical information from literary Chinese
 Yet, the results are not perfect
 Will extend the current work to mining social networks in historical documents

