Vidyanidhi experiences


XML-Unicode environment
for creating and accessing
Indian language theses:
Vidyanidhi experiences
Shalini R. Urs
Vidyanidhi Digital Library
University of Mysore, Mysore, India
[email protected]
Indo-US Workshop, June 25, 2003
Vidyanidhi Digital Library
• Vidyanidhi began as a pilot project in
2000
• Supported by the NISSAT, DSIR, GOI
• The objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context
• It is now evolving into a national effort
• Supported by the Ford Foundation
Vidyanidhi: Vision
To evolve into an information infrastructure that strengthens the research capacities of Indian universities by:
• Developing accessible digital libraries of theses and dissertations
• Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs
• Developing appropriate policies
• Developing and making available the requisite tools and resources
Vidyanidhi: Strategies
• Policy Framework – through
meetings, liaison, participation
• Education and Training
• Content building: full text and metadata
• Resources and tools (software, interfaces…)
Indian Academic Research Output
• Large system of higher education
• More than 300 universities: a reservoir of extensive doctoral research work
• Doctoral research output: around 30,000 annually
• English is the predominant language
• Increasing vernacularisation: 20-25% of the output is in Indian languages
• This share is growing, resulting in more and more research output in Indian languages
Language Interoperability
• The Vidyanidhi approach has been guided by the language interoperability factor
• Our choice of technology and tools must be interoperable across languages
Indian Languages: Diversity
• The rich diversity in Indian Languages and
scripts is simply overwhelming.
• India is made up of a number of separate
linguistic communities, each of which
shares a common language and culture.
• The number of languages listed for India is 418
• 407 are living languages
• 11 are extinct
• Many languages have no script of their own
Eighteen Indian languages
• Assamese
• Gujarati
• Kashmiri
• Malayalam
• Marathi
• Oriya
• Punjabi
• Sindhi
• Telugu
• Bengali
• Hindi
• Kannada
• Konkani
• Manipuri
• Nepali
• Sanskrit
• Tamil
• Urdu
Language Families of Indian Languages
• Indo-European: North and Central India
• Dravidian: South India
• Mon-Khmer: Assam and some eastern parts of India
• Sino-Tibetan: northern Himalayan and Burmese border areas
Indian Scripts
• Interestingly, though the languages belong
to four different language groups, Indian
scripts have a common root/origin
• Scripts of all Indian languages are derived from Brahmi
• This gives greater uniformity in the arrangement of the alphabets
Indian Alphabet: Characteristics
• Consonants
– Five vargs (groups)
– Non-varg
– Have an implicit vowel
• Anuswar (a nasal consonant)
• Chandrabindu (a nasalisation sign)
• Visarg
• Vowels and vowel signs
• Vowel omission sign (Halant)
• Conjuncts
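The roles of the implicit vowel, the vowel signs and the halant can be seen at the code point level. The following is a minimal sketch only: it assumes Devanagari as the example script (the slide does not name one), uses Python's standard unicodedata module, and the helper name describe is made up for illustration.

```python
import unicodedata

# Assumed example characters from the Devanagari block.
KA = "\u0915"       # consonant KA, which carries the implicit vowel
SSA = "\u0937"      # consonant SSA
HALANT = "\u094D"   # vowel omission sign (named VIRAMA in Unicode)
MATRA_I = "\u093F"  # dependent vowel sign I (a matra)

def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

# A consonant followed by a vowel sign replaces the implicit vowel.
syllable = KA + MATRA_I

# A conjunct is stored as consonant + halant + consonant;
# the rendering engine and font decide the combined shape.
conjunct = KA + HALANT + SSA

print(describe(syllable))
print(describe(conjunct))
```

The point of the sketch is that a conjunct is not a separate code point: it is a sequence, and the joined form exists only at display time.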
Indian Languages and scripts
• Indic scripts are syllable-oriented and phonetic-based, with imprecise character sets
• The scripts look different (different shapes) but have very similar, yet subtly different, alphabet bases and script grammars
Indian Languages and scripts: Issues
• Indic characters consist of consonants, vowels, dependent vowels called 'matras', or a combination of any or all of them, called conjuncts.
• Collation (sorting) is a contentious issue as the script is phonetic-based and not alphabet-based
Handling Indian Languages: Possible approaches
• Transliteration approach (glyph-based)
– Indic characters are encoded in either ASCII or a proprietary encoding
– Glyph technologies are used to display and print Indic scripts
– Currently the most popular approach for desktop publishing
Handling Indian Languages: Possible approaches
• Develop an encoding system for all possible characters and combinations, running to nearly 13,000 characters in each language, with the possibility that a new combination leads to a new character. This approach was developed and adopted by the IIT Madras development team
• Adopt the ISCII/Unicode encoding
ISCII: Indian Script Code for Information Interchange
• ISCII-91 is a BIS standard, IS 13194:1991
• An outcome of the efforts of Govt. of
India, DOE, MIT, C-DAC and many
other institutions
• Is an 8 bit code
• Is an extension of the 7 bit ASCII code
• Top 128 characters cater to the 10 Indian
Scripts
Unicode
• The Unicode Consortium has encoded all of the world's major scripts
• Unicode represents a carefully thought out, technically impressive and full-featured attempt at encoding Indic scripts
• Unicode has unique code points for each of the major Indic scripts
Script       Unicode range       Major languages
Devanagari   U+0900 to U+097F    Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF    Bengali, Assamese
Gurumukhi    U+0A00 to U+0A7F    Punjabi
Gujarati     U+0A80 to U+0AFF    Gujarati
Oriya        U+0B00 to U+0B7F    Oriya
Tamil        U+0B80 to U+0BFF    Tamil
Telugu       U+0C00 to U+0C7F    Telugu
Kannada      U+0C80 to U+0CFF    Kannada
Malayalam    U+0D00 to U+0D7F    Malayalam
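Because each script occupies its own fixed range, the script of any character can be found by a simple range check. The sketch below is illustrative only: the ranges are copied from the table above, while the names INDIC_BLOCKS, script_of and scripts_in are hypothetical.

```python
# (script, first code point, last code point), as listed in the table.
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali",    0x0980, 0x09FF),
    ("Gurumukhi",  0x0A00, 0x0A7F),
    ("Gujarati",   0x0A80, 0x0AFF),
    ("Oriya",      0x0B00, 0x0B7F),
    ("Tamil",      0x0B80, 0x0BFF),
    ("Telugu",     0x0C00, 0x0C7F),
    ("Kannada",    0x0C80, 0x0CFF),
    ("Malayalam",  0x0D00, 0x0D7F),
]

def script_of(ch):
    """Return the script whose range contains ch, or None if no range matches."""
    cp = ord(ch)
    for name, first, last in INDIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None

def scripts_in(text):
    """Return the set of Indic scripts that occur anywhere in text."""
    return {name for name in map(script_of, text) if name is not None}

# U+0915 lies in the Devanagari range, U+0C95 in the Kannada range.
print(scripts_in("\u0915 and \u0C95"))
```

A check of this kind could be used, for example, to decide which per-script table a metadata record belongs in.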
Unicode implementation for Indic scripts
• Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost nonexistent
• Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction
• Vidyanidhi is an example of a successful implementation of Unicode for Indic scripts
Vidyanidhi approaches
• Taking Indian language theses to the Web
– Full text
– Metadata
MS Word to XML
• Template for thesis in MS Word
• Student submits thesis in Word
• Convert to XML using the RTF to XML converter
• Take the theses to the Web
Full Text
• Vidyanidhi provides tools for the creation of theses in Indian languages
• Our approach is to:
– Provide a style sheet/template online
– Convert the submitted thesis into XML encoded in Unicode
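The end product of this workflow, XML encoded in Unicode, can be illustrated with a short sketch. This does not reproduce the RTF to XML converter named on the slide; it only shows vernacular text being written as UTF-8 XML with Python's standard library. The element names (thesis, title, body, p), the lang attribute and the function name thesis_to_xml are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def thesis_to_xml(title, language, paragraphs):
    """Build a small thesis document and serialise it as UTF-8 bytes."""
    root = ET.Element("thesis", {"lang": language})  # hypothetical schema
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para
    # With encoding="utf-8" the Indic characters are written as UTF-8
    # byte sequences, so no font-specific encoding is involved.
    return ET.tostring(root, encoding="utf-8")

# Made-up sample text: the word "Kannada" written in Kannada script.
kannada_word = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"
data = thesis_to_xml(kannada_word, "kn", [kannada_word])

# Round trip: parsing the bytes gives back the same Unicode string.
parsed = ET.fromstring(data)
print(parsed.find("title").text == kannada_word)
```

Because the text is stored as code points rather than font glyph codes, the same file can be displayed with any Unicode font that covers the script.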
Vidyanidhi database approach…
• Each script/language has one table. Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada
• Theses in Indic languages have two records: one in the Roman script (transliterated) and the other in the vernacular. Theses in English have only one record (in English)
Vidyanidhi database approach…
• The two records are linked by the ThesisID number, a unique ID for the record
• The bibliographic description of
Vidyanidhi follows the ThesisMS
Dublin Core standard adopted by
the NDLTD and OCLC
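The one-table-per-script layout with records linked by ThesisID can be sketched as follows. This is an illustration under stated assumptions, not the Vidyanidhi schema: it uses SQLite from Python's standard library in place of MS SQL 2000 so that it is self-contained, and every table, column and function name, as well as the sample data, is hypothetical.

```python
import sqlite3

# One table per script, as described on the slides (names are assumptions).
SCRIPT_TABLES = {
    "roman": "theses_roman",
    "devanagari": "theses_devanagari",
    "kannada": "theses_kannada",
}

def create_schema(conn):
    for table in SCRIPT_TABLES.values():
        conn.execute(
            f"CREATE TABLE {table} (thesis_id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
        )

def add_thesis(conn, thesis_id, roman_title, vernacular=None):
    """Every thesis gets a Roman record; vernacular is an optional (script, title) pair."""
    conn.execute("INSERT INTO theses_roman VALUES (?, ?)", (thesis_id, roman_title))
    if vernacular is not None:
        script, title = vernacular
        conn.execute(
            f"INSERT INTO {SCRIPT_TABLES[script]} VALUES (?, ?)", (thesis_id, title)
        )

def records_for(conn, thesis_id):
    """Collect every record that shares the given ThesisID, keyed by script."""
    found = {}
    for script, table in SCRIPT_TABLES.items():
        row = conn.execute(
            f"SELECT title FROM {table} WHERE thesis_id = ?", (thesis_id,)
        ).fetchone()
        if row is not None:
            found[script] = row[0]
    return found

conn = sqlite3.connect(":memory:")
create_schema(conn)
# Made-up sample data: an English thesis has one record, a Kannada thesis has two.
add_thesis(conn, 1, "A study of digital libraries")
add_thesis(conn, 2, "Kannada sahitya",
           ("kannada", "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1 \u0CB8\u0CBE\u0CB9\u0CBF\u0CA4\u0CCD\u0CAF"))
print(records_for(conn, 1))
print(sorted(records_for(conn, 2)))
```

Keeping the Roman record for every thesis is what allows a single master search across languages, while the shared ID recovers the vernacular record when one exists.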
Vidyanidhi - Platform
• Microsoft
– Windows XP supports all 10 Indic scripts
– Uses Windows glyph processing:
  OpenType font format
  Uniscribe (Unicode Script Processor)
  OpenType Layout Services library
Vidyanidhi - Platform
– MS SQL 2000
• A truly multilingual-capable SQL database
• Achieves satisfactory collation
– Front end: ASP
– JavaScript
Vidyanidhi: Accessing and Searching
• One can search the Vidyanidhi database in either of two ways:
– In English (Roman script): the integrated (master) database has metadata records for theses in all languages
– In the vernacular: each vernacular database has records of that specific language only
Two approaches: differences
• One affords search in the English language and the other in the vernacular.
• The first approach returns records in Roman script for all theses that satisfy the conditions of the query, with an option to view the vernacular-script record for theses written in the vernacular
• The second approach searches only the vernacular database and is thus limited to records in that language.
• However, this approach allows the search itself to be in the vernacular language and script
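The difference between the two search approaches can be made concrete with a small in-memory sketch. It is an illustration only: plain Python lists stand in for the database, and the record fields, sample titles and function names are all hypothetical.

```python
# Integrated (master) collection: Roman-script records for theses in all languages.
MASTER = [
    {"thesis_id": 1, "title": "Digital libraries in India", "language": "English"},
    {"thesis_id": 2, "title": "Kannada sahitya", "language": "Kannada"},
]

# One vernacular collection per language, holding vernacular-script records only.
VERNACULAR = {
    "Kannada": [
        {"thesis_id": 2,
         "title": "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1 \u0CB8\u0CBE\u0CB9\u0CBF\u0CA4\u0CCD\u0CAF"},
    ],
}

def search_master(term):
    """First approach: Roman-script query over all theses.

    Each hit is paired with its vernacular record (matched by thesis_id),
    or with None when the thesis has no vernacular record.
    """
    hits = []
    for rec in MASTER:
        if term.lower() in rec["title"].lower():
            candidates = VERNACULAR.get(rec["language"], [])
            vern = next(
                (v for v in candidates if v["thesis_id"] == rec["thesis_id"]), None
            )
            hits.append((rec, vern))
    return hits

def search_vernacular(language, term):
    """Second approach: vernacular-script query, limited to one language."""
    return [rec for rec in VERNACULAR.get(language, []) if term in rec["title"]]

print(len(search_master("kannada")))
print(len(search_vernacular("Kannada", "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1")))
```

The first function reaches every thesis but only through Roman text; the second accepts vernacular input but never looks outside one language.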
Unicode and Indic Scripts
• The Vidyanidhi implementation dispels certain misconceptions and misconstructions about Unicode
• Supposed problems:
– Data input
– Display and printing
– Collation
Data input/Keyboard layout
Our test bed and comparison with other methods:
• The Unicode layout is as easy as the others in terms of speed
• In number of keystrokes there is no difference, and sometimes the Unicode method needs fewer keystrokes
• Data input was almost comparable to English records in terms of productivity
Display and Printing
• It is fairly satisfactory except for a few problem areas:
– Handling of certain conjuncts
– Inability to display a non-terminating pure consonant
– Limited choice of fonts
• Unicode can handle conjunct clusters of four consonants
Collation issues: some observations
• Consensus on collation for Indic scripts is hard to come by
• Differences of opinion are not uncommon, as Indic languages use a cross between syllabic and phonemic writing systems
• Collation according to phonetic order would differ from alphabetic order
Collation Issues
• A few of the ordering anomalies stem from the common script base and order used for all Indic scripts
• Despite their strong similarity, Indic scripts differ in the number and arrangement of consonants and vowels
Collation by Unicode
• Given the above collation problems, the collation achieved by Unicode is fairly satisfactory and compares very well with Nudi, a more popular font-based software package
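The slides do not say how collation was configured, so the following is only a sketch of the general idea: sorting by raw code point follows the order of the Unicode block, while a tailored sort follows whatever alphabet order a language community prefers. The function names are hypothetical, and the tailored order used in the example is arbitrary, chosen only to show the mechanism.

```python
def codepoint_sorted(words):
    """Sort by raw Unicode code point, which is Python's default for strings."""
    return sorted(words)

def tailored_sorted(words, alphabet):
    """Sort by a caller-supplied character order.

    Characters missing from alphabet sort after all listed ones,
    in code point order among themselves.
    """
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in word]

    return sorted(words, key=key)

KA, KHA, GA = "\u0915", "\u0916", "\u0917"  # three Devanagari consonants

words = [GA, KA, KHA]
by_codepoint = codepoint_sorted(words)
# Arbitrary illustrative order, not a claim about any real language's alphabet.
by_tailoring = tailored_sorted(words, alphabet=[GA, KA, KHA])

print(by_codepoint == [KA, KHA, GA])
print(by_tailoring == [GA, KA, KHA])
```

Where the code point order and the expected alphabet order agree, the default sort is already satisfactory; a tailoring table is needed only for the characters on which they differ.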
Conclusion
Unicode handles admirably the challenges of a multilingual, multi-script database implementation, despite the complexity and the minutiae of a family of Indian languages and scripts with strong commonalities and faint distinctions among themselves
Contact
[email protected]