NLP2RDF - SWRC


LOD2 KOREA: Towards Publishing Korean Linked Data on the Web
Key-Sun Choi
Joint work with
Martin Rezk
Jungyeul Park
Yoon Yongun
Kyungtae Lim
YoungGyun Hahm
Key-Sun Choi - Personal History
• NEC C&C Lab. – PIVOT Japanese-Korean Machine Translation
• Korean Part-of-Speech Tagset, Corpus, Dictionary
• CoreNet (Korean-Chinese-Japanese) Semantic Wordnet (2004)
• KORTERM: Korea Terminology Research Center for Language and Knowledge Engineering (1998-2007), Research Center of Ministry of Culture
• KAIST Research Grand Award (1998)
• ISO/TC37/SC4 Founding member (Language Resource Management Standards)
• ISWC 2007 PC Co-Chair (International Semantic Web Conference)
• AFNLP President (2009-2010)
• DBpedia Korea http://ko.dbpedia.org/
• http://lod2.eu/ partner (EU FP7)
NLP2RDF
• Subject
• Object
• Predicate
• Extract from Sentences
• 野生種의 장미는 主로 北半球의 溫帶와 寒帶 地方에 分布한다.
• Wild roses are distributed mainly in the temperate and frigid zones of the Northern Hemisphere.
1. Subject : 장미 (rose)
2. Object : 북반구의 온대 지방, 한대 지방 (temperate and frigid zones of the Northern Hemisphere)
3. Predicate : 分布 (isDistributedAt)
Key-Sun Choi - LOD2 Korea
• Triple in Natural Language
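As a minimal sketch of how a triple extracted from a sentence could be held in code, the snippet below stores the subject, predicate and object of the rose example and serializes them as one N-Triples line. The `http://example.org/` namespace, the `Triple` type and the `to_ntriples` helper are assumptions made for illustration; they are not taken from the slides.

```python
from typing import NamedTuple
from urllib.parse import quote

# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a triple as one N-Triples line with percent-encoded IRIs."""
    s, p, o = (f"<{EX}{quote(part)}>" for part in t)
    return f"{s} {p} {o} ."


# Subject, predicate and object from the rose sentence.
rose = Triple("장미", "isDistributedAt", "북반구의 온대 지방")
line = to_ntriples(rose)
print(line)
```

Percent-encoding keeps the Korean labels inside the IRIs valid; a real system would more likely mint opaque identifiers and attach the Korean strings as labels.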
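[Figure: example concept graph for the embedded-systems domain. Nodes: manufacturer (제조회사, 제조사), platform, system software, development environment, operating system, application program, middleware, embedded system, embedded software, browser, embedded operating system, media player, real-time embedded operating system (RTOS), communication middleware, home appliances, non-real-time embedded operating system, digital camera, DVD player, MP3 player, set-top box, VRTX, VxWorks, pSOS, WinCE, Microsoft, Wind River. Edge labels: consists_of, reside_on.]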
NLP2RDF
<Conceptual Layer>
<DBpedia> (based on DBpedia Ontology)
Barack Obama
URI = dbpedia12415 (conceptually unique)
<Career> President <Nationality> United States <Party> Democrats ...
LOD algorithm
Sentence: ‘Barack Obama is the President of the United States’
The Output of NLP tools
Barack Obama
URI = sen1word1 (unique within the document)
<POStag> NNG </POStag>
...
“KNIF” Wrapper
For this work
1. For RDF Mapping
• String Ontology
• Structured Sentence Ontology
• NIF and Korean language
2. For LOD Mapping
• URI for DBpedia entity
• Mapping: Word in Text → DBpedia
• Triples and URI
• Ontology
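The LOD mapping step (word in text to DBpedia entity) can be sketched as a simple dictionary lookup. This is only an illustration under stated assumptions: the `GAZETTEER` table, the `link_mentions` function and the `sen1mention1`-style identifiers are hypothetical, and a real linker would need disambiguation rather than exact string matching.

```python
# Hypothetical gazetteer: surface form -> DBpedia resource URI.
GAZETTEER = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "United States": "http://dbpedia.org/resource/United_States",
}


def link_mentions(sentence, sentence_id=1):
    """Return (document-local id, surface form, DBpedia URI) per gazetteer hit."""
    # Collect hits together with their position so they can be ordered.
    hits = [
        (sentence.find(surface), surface, uri)
        for surface, uri in GAZETTEER.items()
        if surface in sentence
    ]
    hits.sort()
    return [
        (f"sen{sentence_id}mention{n}", surface, uri)
        for n, (_, surface, uri) in enumerate(hits, start=1)
    ]


links = link_mentions("Barack Obama is the President of the United States")
for local_id, surface, uri in links:
    print(local_id, surface, uri)
```

The document-local identifier plays the role of the "unique within the document" URI, while the DBpedia URI is the conceptually unique one.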
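Parse tree to Summary
• 물체의 낙하 거리는 시간의 제곱에 비례한다 (The falling distance of an object is proportional to the square of the time)
<Triple>
1. Subject
• 물체의 낙하 거리 (falling distance of an object)
2. Predicate
• 비례한다 (is proportional to)
3. Contents
• 시간의 제곱 (square of the time)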
Why NLP? Why Syntax and Semantics?
• Advanced technology on the higher-level layers
NLP Layer Cake
Semantic Web vs. NLP layer cake
Example: 철수가 2층에 있는 세미나실을 예약한다. (John reserves the seminar room on the 2nd floor.)
Gloss: John-SUBJ 2-floor-LOC room-OBJ reserve-FIN
• Discourse: John: X1, room: L2
• Syntactic structure: subject, object, predicate
• Phrase: Room on 2nd floor
• Semantic tagging: [John: Human], [2-FL: Loc], [seminar-room: Room]
• Morph. Analysis: +가//2+층+에
• POS tagging: NPP/JOSA//Numeral/
• Tokenization: 철수가//2층에//
• String URI
• Encoding
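The tokenization layer for Korean works on eojeols, the whitespace-delimited units of the example sentence. The following sketch splits the sentence into eojeols and records character offsets, which the string-URI layer can build on. The function name `eojeols_with_offsets` is hypothetical, and splitting on whitespace alone is a simplification: sentence-final punctuation stays attached to the last eojeol.

```python
import re


def eojeols_with_offsets(sentence):
    """Return (eojeol, begin, end) for every whitespace-delimited unit."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


sentence = "철수가 2층에 있는 세미나실을 예약한다."
eojeols = eojeols_with_offsets(sentence)
for token, begin, end in eojeols:
    # Offsets count Unicode code points, not bytes.
    print(f"{begin:2d}-{end:2d}  {token}")
```

Morphological analysis (for example separating a noun from its josa) is a later layer and would need a morpheme analyzer rather than a regular expression.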
How to develop a parser and semantic classifier creatively?
• How to adapt Korean tools to the already developed tools
• Open Source NLP tools
• Rich English, Japanese open tools/resources
• A few Korean tools
• Already developed Korean language resources
• KAIST tools/resources
• KAIST open source on SourceForge and the web
• Cambridge University Press: NLP Textbook (in progress)
• Linked Data – http://lod2.eu/ partner
Background
• The idea of linking data from different sources is not new:
• Network Database Model: 1970s
• Linked Data: today
• The goal is to facilitate sharing and reusing information.
• Linked Data aims to extend the Web with a data commons by creating typed links between data from different sources.
Background
• Each piece of data is identified with a URI
• These links are usually modeled using the Resource Description Framework (RDF)
• The first task towards linking data is to identify which resources and which properties we want to describe
Introduction
• NLP2RDF is a LOD2 Community project that is developing the NLP Interchange Format (NIF)
• NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
• The output of NLP tools can be converted into RDF and used in the LOD2 Stack
• http://nlp2rdf.org
→ NIF…
• Is based on RDF/OWL
• Enables users to annotate for several languages in a uniform way
• Enables users to query text documents with SPARQL (e.g. http://semanticweb.kaist.ac.kr/nlp2rdf/ )
• Sentence : 다크나이트는 미국의 영화이다.
• The Dark Knight is an American film.
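To give a feel for what querying annotated text means, here is a toy triple-pattern matcher in plain Python. It is not SPARQL and not how the demo site works; it only mimics a single basic graph pattern over an in-memory list. The `ex:` identifiers, the sample triples and the `match` function are assumptions for illustration, with property names borrowed from the NIF string and sentence ontologies.

```python
# Toy in-memory triple list; identifiers are illustrative only.
TRIPLES = [
    ("ex:word1", "rdf:type", "sso:Word"),
    ("ex:word1", "str:anchorOf", "다크나이트는"),
    ("ex:word2", "rdf:type", "sso:Word"),
    ("ex:word2", "str:anchorOf", "미국의"),
    ("ex:sent1", "rdf:type", "sso:Sentence"),
]


def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern ('?x' marks a variable)."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Roughly: SELECT ?w WHERE { ?w rdf:type sso:Word }
words = [b["?w"] for b in match(("?w", "rdf:type", "sso:Word"))]
print(words)
```

A real deployment would load the NIF output into an RDF store and issue SPARQL against it; the matcher above only shows the shape of the question being asked.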
NIF Wrapping
• NLP Interchange Format (NIF) is an RDF/OWL-based format that allows several Natural Language Processing (NLP) tools to be combined and chained in a flexible, light-weight way.
Sebastian Hellmann, AKSW, Universität Leipzig, NLP Interchange Format (NIF)
Structure of NLP2RDF
[Figure: layer diagram with an Interchange Layer, a Data Layer and an NLP Layer]
Example of NLP Layer
[Figure: English NLP. An input sentence is processed by Tokenization, a CFG Parser and a Dependency Parser]
How to create RDF from NLP output
[Figure: process with examples. Raw Texts (e.g. "My dog also likes eating sausage.") → NLP Tools → output → NIF Wrapper (e.g. StanfordWrapper.Java) → RDF]
Example of NLP2RDF in ENG
• http://nlp2rdf.lod2.eu/demo.php
• Sentence: Obama is the president of USA.
<http://prefix.given.by/theClient#offset_0_5> sso:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    sso:posTag "NNP" ;
    sso:lemma "Obama" ;
    str:referenceContext <http://prefix.given.by/theClient#offset_0_30> ;
    str:anchorOf "Obama" ;
    rdf:type sso:Word ,
        str:String .
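The offset-based identifiers in the example (offset_0_5 for "Obama", offset_0_30 for the whole 30-character sentence) can be reproduced with a few lines of Python. This is a sketch only: the `offset_uri` and `word_triples` helpers are hypothetical, the `\w+` tokenizer is a stand-in for a real NLP tool, and the POS, lemma and OLiA links of the original output are left out because they need an actual tagger.

```python
import re

PREFIX = "http://prefix.given.by/theClient#"


def offset_uri(begin, end, prefix=PREFIX):
    """Build an offset-based URI from begin and end character positions."""
    return f"<{prefix}offset_{begin}_{end}>"


def word_triples(text):
    """Return one Turtle snippet per word, anchored to the whole-text context."""
    context = offset_uri(0, len(text))
    snippets = []
    for m in re.finditer(r"\w+", text):
        word = offset_uri(m.start(), m.end())
        snippets.append(
            f'{word} str:anchorOf "{m.group()}" ;\n'
            f"    str:referenceContext {context} ;\n"
            f"    rdf:type sso:Word , str:String ."
        )
    return snippets


turtle = word_triples("Obama is the president of USA.")
print(turtle[0])
```

Offset-based URIs are easy to compute but change whenever text is inserted before the annotated span.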
Korean NLP2RDF
• Resources: morphemes, words (eojeols) and sentences in Korean
• Properties: POS, grammatical roles, etc.
• Problems to solve:
• Linguistic Modeling (OLiA)
• Processing Korean Text (NLP)
• How to Produce and Query RDF
Linguistic Modeling (1)
• We use OLiA (Ontologies of Linguistic Annotation) to link the
Sejong tagset with language-independent reference concepts.
• OLiA consists of three different ontologies:
• the OLiA reference model (language-independent),
• the OLiA annotation model (depends on the tagset),
• the OLiA linking model (depends on the tagset).
• We developed a fragment of these last two ontologies for
Korean, that is, for the Sejong tagset.
• The Sejong tagset is the default standard tagset for Korean
Linguistic Modeling (2)
• We use NIF (NLP Interchange Format) to
• standardize the input/output of the different tools to ease the connection among them, and to
• uniquely identify (parts of) text, entities and relationships.
• NIF provides two URI schemes to identify resources
• Offset-based
• Hash-based
• In our application we opt for the hash-based scheme
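A hash-based identifier derives the URI from the string and a little surrounding context instead of from its position. The sketch below is loosely modelled on that idea; the field order, the context length of 4, the use of MD5 over "left(anchor)right" and the `http://example.org/doc#` prefix are assumptions for illustration and should be checked against the NIF specification before use.

```python
import hashlib
from urllib.parse import quote


def hash_uri(text, begin, end, prefix="http://example.org/doc#", context_len=4):
    """Build a context-hash URI for text[begin:end] (illustrative recipe)."""
    anchor = text[begin:end]
    left = text[max(0, begin - context_len):begin]
    right = text[end:end + context_len]
    # Hash the anchor together with its immediate left and right context.
    digest = hashlib.md5(f"{left}({anchor}){right}".encode("utf-8")).hexdigest()
    return f"{prefix}hash_{context_len}_{len(anchor)}_{digest}_{quote(anchor[:20])}"


text = "다크나이트는 미국의 영화이다."
begin = text.find("미국의")
uri = hash_uri(text, begin, begin + len("미국의"))
print(uri)
```

Because only the local context enters the hash, the URI of "미국의" stays the same if unrelated text is prepended to the document, whereas an offset-based URI would shift.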
Korean NLP2RDF Platform
Pipeline: RAW Text → Morpheme Analyzer → Parser → Wrapper → NIF output
Morpheme Analyzer: HanNanum
• Korean Open Source Morpheme Analyzer
• Developed by SWRC, KAIST
Parser: Korean Berkeley Parser
• Training set: Modified Sejong Treebank (DongHyun Choi, Jungyeul Park, Key-Sun Choi, Korean Treebank Transformation for Parser Training, ACL - SPMRL 2012)
• F1-score: 82.12%
Wrapper: Produce triples
• Use OLiA (Ontologies of Linguistic Annotation) to link the Korean tagsets with language-independent reference concepts
• The OLiA annotation model and the OLiA linking model produce triples using the Sejong tagset
[Figure: architecture diagram. Korean NLP components: Korean language information, Korean Grammar Framework, Morph. Analyzer, CFG Parser, Dependency Parser. Flow labels: Input Korean Sentence, Parsed result (URI, Tag), DataBase. OnTop Framework components: Mappings, Ontologies, RDF generator, SPARQL Query Handler. Other labels: SPARQL Query, RDF triples.]
NIF Output
• Each piece of data is identified with a URI (hash-based)
• Resources: morphemes, words (eojeols), sentences in Korean
• Properties: POS tag, grammatical roles, etc.
• DEMO site: http://semanticweb.kaist.ac.kr/nlp2rdf
[Figures: parsing results; some produced triples]
NIF Output
이탈리아에서 공부하고 온 마틴은 한국을 사랑합니다.
Martin, who came from Italy after studying there, loves Korea.
Specific Issues of Korean
Parser Output
1. String
2. Word, Sentence, Phrase, ...
3. Tag
4. ...
Ontology:
1. String Ontology
2. Structured Sentence Ontology (SSO)
3. OLiA
4. Penn, Sejong Tag Set
NLP2RDF: Produce Triples
RDF output
1. Korean Tagset
2. Linking with OLiA
Tag | Sejong | OLiA
superclass | LinguisticAnnotation/Tag/ | LinguisticConcept/MorphosyntacticCategory/
MA | Adverb | Adverb
MAJ | Adverb/ConjunctiveAdverb | Adverb and Conjunction/CoordinatingConjunction
MAG | Adverb/GeneralAdverb | Adverb
SN, XN | CardinalNumber | Quantifier/Numeral
MM | Determiner | PronounOrDeterminer/Determiner
SH, SL | ForeignWord | Residual/Foreign
IC | Interjection | Interjection
NA, NF, NN | Noun | Noun
XR | Noun/BaseMorpheme | Noun/CommonNoun
NNB, NNG | Noun/CommonNoun | Noun/CommonNoun
NNP | Noun/ProperNoun | Noun/ProperNoun
NP | Pronoun | PronounOrDeterminer/Pronoun
SE, SF, SO, SP, SS | Symbol | Punctuation
NV, V | Verb | Verb
VA | Verb/Adjective | Verb and Adjective/PredicativeAdjective
VX | Verb/AuxiliaryPredicate | Verb/AuxiliaryVerb
VC, VCN, VCP | Verb/Copula | Verb/FiniteVerb
VV | Verb/VerbalPredicate | Verb
E, JK, XP, XS | Particle | MorphologicalCategory/morpheme/
JC, JX | Particle/AuxiliaryPostposition | MorphologicalCategory/morpheme/MorphologicalParticle
JKB, JKC, JKG, JKO, JKQ, JKS, JKV | Particle/CaseMarker | MorphologicalCategory/morpheme/MorphologicalParticle
XPN | Particle/Prefix | MorphologicalCategory/morpheme/prefix
XSA, XSN, XSV | Particle/Suffix | MorphologicalCategory/morpheme/suffix
EC, EF, EP, ETM, ETN | Particle/VerbalEnding | MorphologicalCategory/morpheme/suffix
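The correspondence between Sejong tags and OLiA reference concepts can be used as a plain lookup when producing triples. The sketch below transcribes a small subset of the pairs listed on this slide as read here; the dictionary name, the `olia_category` function and the decision to return `None` for unknown tags are assumptions for illustration, not the authors' linking model.

```python
# Superclass path for OLiA categories, as given on the slide.
OLIA_BASE = "LinguisticConcept/MorphosyntacticCategory/"

# Subset of Sejong tag -> OLiA category pairs transcribed from the slide.
SEJONG_TO_OLIA = {
    "MAG": "Adverb",
    "MM": "PronounOrDeterminer/Determiner",
    "IC": "Interjection",
    "NNG": "Noun/CommonNoun",
    "NNB": "Noun/CommonNoun",
    "NNP": "Noun/ProperNoun",
    "NP": "PronounOrDeterminer/Pronoun",
    "VX": "Verb/AuxiliaryVerb",
    "VV": "Verb",
}


def olia_category(sejong_tag):
    """Return the full OLiA category path for a Sejong tag, or None if unmapped."""
    category = SEJONG_TO_OLIA.get(sejong_tag)
    return OLIA_BASE + category if category is not None else None


for tag in ("NNP", "VV", "ZZZ"):
    print(tag, "->", olia_category(tag))
```

Returning `None` for unmapped tags makes the gaps in the language-dependent part of the ontology visible instead of silently guessing a category.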
Conclusions:
• We presented a framework that allows
• processing Korean text,
• efficiently producing RDF triples, and
• querying the outcome of the NLP tools
• The RDF outcome of our framework is compliant with the NIF (NLP Interchange Format) and the OLiA ontologies to facilitate its combination with other NLP tools
• Future:
• complete the development of the language-dependent part of the OLiA ontologies,
• include the missing features required by NIF,
• allow richer SPARQL queries, and
• disambiguate the different entities in the text and link them with Wikipedia articles.
Issues
• Josa (postpositional case marker)
• Korean-specific grammatical feature
Sentence : 다크나이트는 미국의 영화이다.
• DBpedia
• How to link the produced triples with DBpedia triples
Sentence : The Dark Knight is an American movie.
Source
• Demo Site : for Korean
• http://semanticweb.kaist.ac.kr/nlp2rdf
• Demo site : for English
• http://nlp2rdf.lod2.eu/demo.php
• OnTop
• https://babbage.inf.unibz.it/trac/obdapublic/wiki/ObdalibPluginIntro
• NLP2RDF
• http://nlp2rdf.org
Key-Sun Choi, Mun-Yong Yi, In-Young Koh, Younghee Lee
(CS/WebST, Knowledge Service Eng., CS/WebST, CS)
Tony Veale (Invited Professor, Computational Creativity)
Yoon, Yong-Un (research professor, NLP+DB)
Martin Rezk (postdoctoral researcher, Logic)
Park, Jung-Yeol (researcher, parser)
Lee, Jae-Sung (Professor, morphology and word)
Graduate Students:
Soon-Gil Hong, Young-Gyun Hahm , Kyungtae Lim,
Se-Mi Jang, Youngho Jeong, …
http://ko.dbpedia.org/
http://semanticweb.kaist.ac.kr