
Buddha Ngram Viewer:
An N-gram Visualization Tool for
Chinese Buddhist Translations
Jen-Jou (Joey) Hung @ PNC Annual Conference and Joint
Meetings 2013, Kyoto University, DEC 10-12, 2013
Building the Digital Research Platform
for Chinese Buddhist Literature
Achievements of Digitized Chinese
Buddhist Texts

CBETA (Chinese Buddhist Electronic Text
Association) was founded in 1998.

In the last 15 years, CBETA has
converted a substantial
number of Chinese Buddhist
scriptures to digital format.

The CBETA 2011 DVD contains
more than 160 million
Chinese characters.
Statistics of CBETA Digitized Content
Time | Name of Collection | Works (部) | Fascicles | Characters
1998-2003 | Taishō Tripiṭaka (大正藏) | 2,373 | 8,982 | 78,770,000
2004-2007 | Shinsan Zokuzōkyō (卍續藏) | 1,229 | 5,066 | 71,220,000
2008-2011 | Passages concerning Buddhist activities from the Official History (正史佛教資料類編) | 1 | 10 | 333,000
2008-2011 | Buddhist texts not contained in the Tripiṭaka (藏外佛教文獻) | 77 | 136 | 1,663,000
2008-2011 | Selection of stone rubbings from Northern Dynasties (北朝佛教石刻拓片百品) | 100 | 100 | 74,000
2008-2011 | Supplement from other editions of Tripiṭaka (歷代藏經補輯) | 385 | 2,631 | 24,193,000
2012-2013 | Chinese Translations of Pali Canon (Based on Yuan Heng Temple Edition) | 36 | - | ca. 7,500,000
2012-2013 | Selections from the Taiwan National Central Library Buddhist Rare Book Collection | 64 | - | ca. 5,500,000
Total (總計) | | 4,265 | 16,925 | ca. 189,253,000
The Opportunities and Challenges of “Big
Data” (I)

The rapid growth of digital resources lets scholars
acquire more relevant materials in less
time.

However, most digital resources are not
integrated. Scholars have to find a more efficient
way to master the large amount of data so as
not to drown in the ocean of data.
The Opportunities and Challenges of “Big
Data” (II)

We also believe that this large amount of digital
resources will not only provide a convenient research
environment but also help scholars gain new insights.

One very promising solution is to perform text
analysis on the Buddhist electronic text corpus to find
hidden patterns behind the texts.

However, this sounds like a very difficult task for
Buddhist scholars.
Digital Research Platform for Chinese Buddhist Literature

Main Mission of the Digital Research Platform:
1. Data Providing: Provide complete, integrated reference data in an easily accessible way.
2. Data Organizing: Provide customization tools for users to organize materials into knowledge.
3. Data Analyzing: Provide digital analysis tools for discovering hidden patterns.
Project Information

A two-year project funded by the National Science Council
(Digital Humanities Project). It consists of three sub-projects:

Sub-project 1: responsible for digitizing new resources to
support this platform (directed by Aming TU).

Sub-project 2: responsible for developing new methodologies for
analyzing the digital corpus, especially focusing on phonology
materials (directed by Chien-Kang Huang).

Sub-project 3: responsible for integrating project results, developing
quantitative text analysis tools, and establishing the platform.
Plan for the First Year
Target 1: Build the platform for integrating resources
• Design a good way to integrate digital resources.
• Includes: CBETA full text, catalogue, dictionaries, phonology
materials, and other digital resources created by DDBC.

Target 2: Implement text analysis functions
• Build data sets for text analysis.
• Create tools. For example, Buddha Ngram Viewer visualizes the
occurrences over time of input phrases in Chinese Buddhist texts.
Target 1: Building the Digital Research
Platform
Idea of the Research Platform

Our experience: in the last decade, we have carried out
more than 20 digital archive projects.

Every database has its own archive content, design
principles, and media types.

The only overlap is perhaps the sutra text.

To integrate those resources, we decided to build a
feature-rich sutra reading interface and bind other
related information to the text.
Main Idea of Integration
[Diagram: the Sutra Reading interface connected to Text Analysis Tools, Tripitaka Catalogue, Phonology Materials, Word Segmentation Tools, and Dictionaries]
Basic Information
Catalogue Data
Information from the sutra catalogue. Clicking here leads to our
catalogue project.
Only the critical apparatus and gaiji information are embedded.
Other Related Sutra
• Other Parallel Translation.
• List of Commentary
• Related Research
N-gram Information
婆羅,727
如是,705
比丘,694
羅門,693
沙門,614
世尊,477
如來,469
云何,428
眾生,388
由旬,387
爾時,384
復有,358
是為,346
阿難,317
無有,313
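A list like the one above can be produced by counting overlapping character n-grams. The following is only a minimal sketch of that idea, not the platform's actual implementation: the function name `count_ngrams`, the choice to drop n-grams that cross punctuation, and the sample string (a made-up sentence, not a CBETA quotation) are all assumptions for illustration.

```python
from collections import Counter


def count_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams made only of CJK ideographs."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Assumption: skip n-grams containing punctuation or other
        # characters outside the basic CJK Unified Ideographs block.
        if all("\u4e00" <= ch <= "\u9fff" for ch in gram):
            counts[gram] += 1
    return counts


# Toy input for illustration only.
sample = "爾時世尊告諸比丘。婆羅門白佛言。如是世尊。"
bigrams = count_ngrams(sample)

# Print in the same "n-gram,count" layout as the list above.
for gram, freq in bigrams.most_common(3):
    print(f"{gram},{freq}")
```

Because the windows overlap, a three-character word such as 婆羅門 contributes both 婆羅 and 羅門, which may be why these two bigrams appear in the list with similar counts.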
Extra Information for Selected Terms
Catalogue Data
Dictionary Lookup
婆羅《丁福保佛學大辭典》
【職位】Vihārapāla,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」
Information from our glossaries project. Clicking here leads to the
glossaries project website.
Other Related Sutra
• Other Parallel Translation.
• List of Commentary
• Related Research
Occurrences of 婆羅 in different time periods
[Chart: 婆羅] This information is from Buddha Ngram Viewer.
N-gram Information
婆羅,727
如是,705
比丘,694
羅門,693
沙門,614
世尊,477
如來,469
云何,428
眾生,388
由旬,387
爾時,384
復有,358
是為,346
阿難,317
無有,313
Word Segmentation Tools
Place Name, Person Name, Calendar Look up
Target 2: Implement Text Analysis
Functions
What Is Text Analysis?

Text analysis: using computer software to analyze the textual
content of a large corpus, e.g. CBETA. The objective is to
discover hidden patterns and derive new insights from them.

The patterns could be:

Words that are frequently used in one place but never appear
anywhere else.

High-frequency collocations in a group of documents.

Special usage patterns of commonly used words.

Other possible and meaningful patterns ...
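As an illustration of the first pattern above (words frequent in one place but absent elsewhere), here is a small sketch. It assumes the texts have already been segmented into terms. The function name `exclusive_terms`, the `min_count` threshold, and the two toy documents are hypothetical and are not taken from the project.

```python
from collections import Counter


def exclusive_terms(docs: dict[str, list[str]], min_count: int = 2) -> dict[str, list[str]]:
    """For each document, list terms that occur at least min_count times
    there and do not occur in any other document."""
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    result = {}
    for name, counter in counts.items():
        # Collect every term seen in the other documents.
        others = set()
        for other_name, other_counter in counts.items():
            if other_name != name:
                others.update(other_counter)
        result[name] = sorted(
            term for term, c in counter.items()
            if c >= min_count and term not in others
        )
    return result


# Toy, pre-segmented documents (illustrative only, not real CBETA data).
docs = {
    "text_a": ["泥洹", "比丘", "泥洹", "世尊"],
    "text_b": ["涅槃", "比丘", "涅槃", "涅槃", "世尊"],
}
exclusive = exclusive_terms(docs)
print(exclusive)
```

On real data the threshold would need tuning, and the result depends heavily on how the text is segmented.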
Difficulties in Applying Text Analysis to the
CBETA Corpus

The data is too complex:

The textual content and structure of Buddhist works are highly
complex.

Analysis tools are very difficult to learn:
• Using general-purpose text analysis tools requires skills in
computer programming and advanced statistical knowledge.

How can we enable more humanities scholars to adopt text analysis
techniques in addressing their research questions?

We create easy-to-use tools.
Buddha Ngram Viewer:
(http://dev.ddbc.edu.tw/BuddhaNgramViewer/)

Buddha Ngram Viewer (under construction)

A tool that allows users to visualize the occurrences
over time of input phrases in Chinese Buddhist
texts.
Click any point in the
chart to start.
Idea of Buddha Ngram Viewer

Combine search results with the sutra translation dates from the
Tripiṭaka catalogue.
Search result in CBReader
+
Sutra No. | Sutra Name | Dynasty
T01n0001 | 長阿含經 | 後秦
T01n0005 | 佛般泥洹經 | 西晉
T01n0023 | 大樓炭經 | 西晉
+
後秦 (Later Qin) = C.E. 410
西晉 (Western Jin) = C.E. 314
=
Number of occurrences of the search term
in different time periods.
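The combination described on this slide can be sketched as a simple join followed by a sum. The catalogue rows and the dynasty-to-year values below are copied from the slide. The hit counts, the function name `occurrences_by_year`, and the data layout are assumptions for illustration, not the viewer's actual code or real CBETA figures.

```python
from collections import defaultdict

# Catalogue rows from the slide: sutra number -> (sutra name, dynasty).
CATALOGUE = {
    "T01n0001": ("長阿含經", "後秦"),
    "T01n0005": ("佛般泥洹經", "西晉"),
    "T01n0023": ("大樓炭經", "西晉"),
}

# Representative year for each dynasty, as given on the slide.
DYNASTY_YEAR = {"後秦": 410, "西晉": 314}


def occurrences_by_year(hits: dict[str, int]) -> dict[int, int]:
    """Sum per-sutra hit counts into per-year totals via the catalogue."""
    totals = defaultdict(int)
    for sutra_no, count in hits.items():
        _name, dynasty = CATALOGUE[sutra_no]
        totals[DYNASTY_YEAR[dynasty]] += count
    return dict(sorted(totals.items()))


# Hypothetical search result: sutra number -> number of matches.
hits = {"T01n0001": 12, "T01n0005": 30, "T01n0023": 4}
by_year = occurrences_by_year(hits)
print(by_year)
```

Each year's total is one point on the chart. Mapping a whole dynasty to a single year is a simplification, so the resolution of the timeline is limited by the dating information in the catalogue.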
[Chart: occurrences of 泥洹 and 涅槃 over time; axes: Chinese dynasties, Western years, number of occurrences]
Click this point to see the details of C.E. 401.
The occurrences of 泥洹 and 涅槃 in the sutras translated in C.E. 401
The occurrences in the 22 fascicles of T1 (長阿含經).
Click this point to see the details of the third fascicle of T1.
Scroll down for more sutras.
A quick way to understand the frequencies of selected terms in texts.
Shows where 泥洹 and 涅槃 are matched in the third fascicle of T1
Click here to display only the matches of 泥洹.
Displays only the matches of 泥洹 in the third fascicle of T1
Click to view this line in the CBETA text.
CBETA full text of the selected line
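The drill-down steps above (show all matches in a fascicle, then keep only one term) could be approximated as follows. This is a sketch under stated assumptions: the function name `find_matches`, the line-based input, and the three toy lines are hypothetical and do not reproduce the real text of T1 or the viewer's code.

```python
import re


def find_matches(lines: list[str], terms: list[str]) -> list[tuple[int, int, str]]:
    """Return (line number, column, term) for every occurrence of any term."""
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return [
        (line_no, m.start(), m.group())
        for line_no, line in enumerate(lines, start=1)
        for m in pattern.finditer(line)
    ]


# Toy lines standing in for one fascicle (illustrative only).
lines = ["佛般泥洹後", "入於涅槃", "泥洹涅槃"]

matches = find_matches(lines, ["泥洹", "涅槃"])

# Equivalent of "display only the matches of 泥洹".
only_nihuan = [m for m in matches if m[2] == "泥洹"]
print(only_nihuan)
```

The line number in each result is what a viewer would need in order to link back to the corresponding line of the full text.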
Integrating Buddha Ngram Viewer into the Research Platform
Dictionary Lookup
婆羅《丁福保佛學大辭典》
【職位】Vihārapāla,維那之別名,譯曰次第,司僧中之次第順序者。行事鈔下二曰:「維那出要律儀翻為寺護,又云悅眾。本正音婆邏,云次第。」
Occurrences of 婆羅 over time
婆羅
This information is from
Buddha Ngram Viewer.
Word Segmentation Tools
Place Name, Person Name, Calendar Look up
Future Work

Keep adding temporal and spatial information about sutras:
• Taishō Shinshū Daizōkyō, Shōwa Hōbō Sōmokuroku.
• The Korean Buddhist Canon: A Descriptive Catalogue by Dr.
Lewis R. Lancaster, 1979.

Complete the sutra reading interface and continue to
integrate more related information into the platform.

Keep bringing new ideas to the platform.
Ex
Thank you for listening.
Q & A !!