Transcript LOCN.N

FEISGILTT 2012
October 16-17
Seattle, Washington
Making Data Mining of XLIFF Artefacts Relevant for the On-going
Development of the XLIFF Standard
Asanka Wasala, David Filip, Chris Exton and Reinhard Schäler
Outline
Introduction
Background
Related work
Methodology
Data collection
Preprocessing
Construction of the XLIFF corpus
XLIFF data mining and construction of the database
Analysis
Results
Discussion
How to Make Data Mining of XLIFF Artefacts Relevant for the Ongoing Development of the XLIFF Standard?
Introduction
“Interoperability is the ability of two or more systems or
components to exchange information and to use the
information that has been exchanged”
(IEEE, 1991)
Data exchange formats play a prominent role in facilitating
interoperability
Standardization of data exchange formats:
Essential for successful interoperability
Difficult due to the constantly evolving nature of technologies, businesses, processes and tools
Standards need to be constantly reviewed and updated
Introduction
Framework to evaluate data exchange standard usage
Provides empirical evidence and statistics related to the actual usage of different elements, attributes, element values and attribute values
Use the empirical evidence to inform the development and maintenance of standards
Main Experiment: “Addressing Interoperability Issues in Localisation Processes”
How to identify the limitations of data exchange standards
and implementations that are leading to interoperability
issues?
What elements, attributes and their values are leading to
interoperability issues?
What are the most important elements/attributes, and element/attribute values, for interoperability?
Introduction
XLIFF
Background
Localisation
XLIFF support in commonly used tools (Bly 2010)
Matrix containing tools and their level of support for individual XLIFF elements
Open source CAT tool named “Virtaal” (Morado-Vázquez and Wolff 2011)
Compare its level of XLIFF support with the matrix presented by Bly (2010)
Weakness in Bly’s (2010) analysis: does not take into account the relative
importance of different parts of the XLIFF specification
Simplification of XLIFF attributes: "approved" and "state"
XLIFF Support in CAT tools – XLIFF P&L Sub-Committee
(Morado-Vázquez and Filip 2012)
Tracks quarterly changes in XLIFF support in major CAT tools
Limitations of XLIFF (Imhof 2010; Anastasiou and Morado-Vázquez 2010)
XLIFF’s extensibility, segmentation and inline elements, complexity
Background
Localisation
XLIFF↔LCX Comparison (Wasala et al. 2010)
Improvements to XLIFF and LCX
Interoperability issues associated with the XLIFF standard
LocConnect – SOA L10n interoperability testing framework
(Wasala et al. 2011)
XLIFF as the messaging format
LIVE DEMO – 17th OCT. 10:15 – 11:10 (Federated Track)
CMSL10n ↔ SOLAS Integration as an ITS 2.0 ↔ XLIFF Test Bed
Background
Most previous studies attempted to identify gaps and deficits of implementations and standards using top-down approaches
Top-down approaches = tools analysis, conceptual frameworks
Most of the previous studies present issues; only a few present solutions.
Standard compliance, conformance and interoperability of tools and technologies claiming to support standards have not yet been adequately addressed.
We propose an analytical framework that provides empirical evidence and statistics related to the actual usage of different elements, attributes and attribute values of a standard.
Methodology
1) Construction of a corpus
2) Construction of a repository
3) Data profiling (designing usage-analysis metrics)
4) Analysis of the results
Methodology
Data Collection
Centre for Next Generation Localisation (CNGL) industrial partners
3 Companies
Crawling
Google + Python scripts
Crawled on two occasions: 26th and 29th August 2011
FileType:xlf, xliff, xml+xliff +”trans-unit” +body
XLIFF corpus is most likely not representative of
all the XLIFF files used in the real world
Methodology
Cleaning & Preprocessing (crawled files)
Cleaning of the file names
e.g. dialog.xlf-spec=svn8-r=8 (Python script)
Removal of non-XLIFF files
1st pass: Python script (regex matching)
2nd pass: Manual analysis of files
Encoding conversion
ASCII / UTF-32 / UTF-16 (BE/LE) → UTF-8 (without BOM) (Python script)
Removal of XML directive and DOCTYPE declarations
During the manual analysis
Extraction of embedded XLIFF content
During the manual analysis
Removal of duplicated files (by analysing content)
Using a freely available tool (Auslogics Duplicate File Finder)
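The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' scripts: the regular expression, the BOM-based encoding detection and the function names are all hypothetical, and content hashing stands in for the duplicate-finder tool named on the slide.

```python
import hashlib
import re

# Hypothetical pattern: treat a file as XLIFF if it contains an <xliff> tag
# followed by at least one <trans-unit>; the original regex is not given.
XLIFF_PATTERN = re.compile(rb"<xliff\b.*<trans-unit\b", re.DOTALL)

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def to_utf8_without_bom(raw):
    """Decode using the BOM if present (else assume UTF-8), re-encode as UTF-8."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding).encode("utf-8")
    return raw.decode("utf-8").encode("utf-8")


def looks_like_xliff(utf8_bytes):
    """First-pass filter only; a manual second pass is still assumed."""
    return XLIFF_PATTERN.search(utf8_bytes) is not None


def deduplicate(files):
    """Keep the first file seen for each distinct content hash."""
    seen, unique = set(), {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = content
    return unique


sample = '<xliff version="1.2"><file><body><trans-unit id="1"/></body></file></xliff>'
raw_files = {
    "a.xlf": sample.encode("utf-16"),  # UTF-16 with BOM
    "b.xlf": sample.encode("utf-8"),   # same content, different encoding
    "c.xml": b"<html></html>",         # not XLIFF
}
converted = {name: to_utf8_without_bom(data) for name, data in raw_files.items()}
xliff_only = {name: data for name, data in converted.items() if looks_like_xliff(data)}
corpus = deduplicate(xliff_only)
```

Converting the encoding before deduplication matters here: the two sample files only hash to the same value once both are UTF-8 without a BOM.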
Methodology
The 1st XLIFF Corpus
[Figure: corpus composition. Sources: Company A, Company B, Company C, Crawled, Eclipse. File counts: 38, 29, 1004, 444, 1664. Size labels: 8.26 MB, 16.8 MB.]
3179 XLIFF files
File sizes & content vary
Methodology
[Figure: Corpus, Validate, Tables (Tags, Attributes, Children), Database (~ 1 GB).]
Python scripts to validate, extract and populate tables with data
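A minimal sketch of how such a repository could be populated, assuming an SQLite database and three tables named after the labels on the slide. The schema, the column names and the use of a well-formedness check in place of full schema validation are all assumptions made for illustration; the slide does not specify them.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_repository(documents):
    """documents: iterable of (source_name, xliff_text) pairs."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per element, attribute and parent/child pair.
    db.executescript("""
        CREATE TABLE tags (source TEXT, tag TEXT);
        CREATE TABLE attributes (source TEXT, tag TEXT, attribute TEXT, value TEXT);
        CREATE TABLE children (source TEXT, tag TEXT, child TEXT);
    """)
    for source, text in documents:
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        for element in root.iter():
            db.execute("INSERT INTO tags VALUES (?, ?)", (source, element.tag))
            for name, value in element.attrib.items():
                db.execute(
                    "INSERT INTO attributes VALUES (?, ?, ?, ?)",
                    (source, element.tag, name, value),
                )
            for child in element:
                db.execute(
                    "INSERT INTO children VALUES (?, ?, ?)",
                    (source, element.tag, child.tag),
                )
    db.commit()
    return db


doc = (
    '<xliff version="1.2"><file original="a.txt"><body>'
    '<trans-unit id="1"><source>Hi</source></trans-unit>'
    "</body></file></xliff>"
)
repository = build_repository([("Company A", doc), ("Crawled", "<broken")])
tag_rows = repository.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
attribute_rows = repository.execute("SELECT COUNT(*) FROM attributes").fetchone()[0]
child_rows = repository.execute("SELECT COUNT(*) FROM children").fetchone()[0]
```

Keeping the source name on every row is what later allows per-organization queries.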
Methodology
XLIFF Data Mining
[Figure: Corpus, Database, SQL Queries.]
SQL Queries:
select tags from db order by frequency desc
select attributes from db where value != ""
select children, tag from db where source like "company A"
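The queries on this slide are schematic. A runnable variant of the first and third, assuming a hypothetical `tags` table with `source` and `tag` columns in SQLite, might look like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tags (source TEXT, tag TEXT)")  # hypothetical schema
db.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [
        ("Company A", "trans-unit"),
        ("Company A", "trans-unit"),
        ("Company A", "note"),
        ("Eclipse", "trans-unit"),
        ("Eclipse", "group"),
        ("Eclipse", "group"),
    ],
)


def tag_frequencies(connection, source=None):
    """Return (tag, count) rows, most frequent first, optionally for one source."""
    query = "SELECT tag, COUNT(*) AS frequency FROM tags"
    params = ()
    if source is not None:
        query += " WHERE source = ?"
        params = (source,)
    query += " GROUP BY tag ORDER BY frequency DESC, tag ASC"
    return connection.execute(query, params).fetchall()


overall = tag_frequencies(db)
company_a = tag_frequencies(db, "Company A")
```

The secondary sort on the tag name only makes the ordering of ties deterministic.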
Analysis
Research questions, with the usage metric for each and what the metric reveals
1) How to identify the syntactic conformance issues of the standard?
Usage metric: validation errors
Reveals: degree of syntactic conformance to the specification, common validation errors
2) How to simplify the standard?
Usage metric: least frequently used and never used features
Reveals: features that can be removed
3) How to identify the most influential features of the standard?
Usage metric: most commonly used features across organizations
Reveals: features that would have the widest effects in case of a change
4) How to identify the candidate features that can be introduced to the standard?
Usage metric: frequently added extensions and custom features
Reveals: frequently used custom features that might be standardized in future; different extensions in use from which new features can be adopted
5) How to identify the best usage practices of the standard?
Usage metric: feature usage patterns
Reveals: elegant and consistent solutions to recurring problems in the usage of the standard; promotes best-usage practices to improve tool interoperability
6) How to identify where organizations deviate from the norm w.r.t. usage of the standard?
Usage metric: frequently used features of individual organizations (not employed by others)
Reveals: allows assessment and evaluation of organizations' individual standard usage practices
7) How to resolve semantic ambiguities of features of the standard?
Usage metric: attribute values and element values
Reveals: features leading to semantic conflicts
How to identify the syntactic
conformance issues of the standard?
Identify validation errors
Degree of syntactic conformance to
the specification and common validation
errors
Results
Validation Errors
[Figure: Overall Validation Results. Number of files (axis 0 to 3000) per XLIFF version (xliff 1.0, xliff 1.1, xliff 1.2, undefined); series: not validated, invalid, valid, strict valid.]
[Figure: Validation Errors. Categories: Invalid, Transitional, Strict.]
Results
Validation Errors
Element '{urn:oasis:names:tc:xliff:document:1.2}file', attribute 'target-language': 'tbd' is not a valid value of the atomic type 'xs:language'.
Element '{urn:oasis:names:tc:xliff:document:1.2}trans-unit': Duplicate key-sequence ['Export'] in key identity-constraint '{urn:oasis:names:tc:xliff:document:1.2}K_unit_id'.
Element '{urn:oasis:names:tc:xliff:document:1.2}trans-unit': The attribute 'id' is required but missing.
Element '{urn:oasis:names:tc:xliff:document:1.2}file', attribute 'tool': The attribute 'tool' is not allowed.
Element '{urn:oasis:names:tc:xliff:document:1.2}file', attribute '{okapi-framework:xliff-extensions}inputEncoding': No matching global attribute declaration available, but demanded by the strict wildcard.
Element '{urn:oasis:names:tc:xliff:document:1.2}group', attribute '{http://www.gs4tr.org/schema/xliff-ext}segmented': No matching global attribute declaration available, but demanded by the strict wildcard.
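Two of the error classes above (a missing `id` and a duplicated `id` on `trans-unit`) can be detected without a schema validator. The sketch below is a hand-written check using only the Python standard library, not the validation used in the study; the function name is hypothetical, and scoping the uniqueness check to each `file` element is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def check_trans_unit_ids(xliff_text):
    """Return messages for trans-units with missing or repeated 'id' values."""
    root = ET.fromstring(xliff_text)
    problems = []
    for file_element in root.iter(f"{{{XLIFF_NS}}}file"):
        ids = []
        for unit in file_element.iter(f"{{{XLIFF_NS}}}trans-unit"):
            unit_id = unit.get("id")
            if unit_id is None:
                problems.append("trans-unit: required attribute 'id' is missing")
            else:
                ids.append(unit_id)
        # Assumption: ids must be unique within one <file> element.
        for unit_id, count in Counter(ids).items():
            if count > 1:
                problems.append(f"trans-unit: duplicate id '{unit_id}'")
    return problems


sample = f"""<xliff xmlns="{XLIFF_NS}" version="1.2">
  <file original="a" source-language="en" datatype="plaintext"><body>
    <trans-unit id="Export"><source>Export</source></trans-unit>
    <trans-unit id="Export"><source>Export all</source></trans-unit>
    <trans-unit><source>No id</source></trans-unit>
  </body></file>
</xliff>"""
problems = check_trans_unit_ids(sample)
```

A full conformance check would still need validation against the XLIFF schemas; this only illustrates how individual error classes can be counted.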
How to simplify the standard?
Identify least frequently used and never
used features
Features that can be removed
Results
Least Frequently Used and Never Used Features
Relative usage of element X = (Number of times X appeared in the corpus × 100) / (Average use of an element/attribute in the corpus)
Least frequently used elements/attributes = Relative usage < 1%
Least frequently used ≠ Not important
(e.g. XLIFF version attribute)
Weight of elements/attributes: Content, Structure, Presentation
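As a worked example of the relative-usage metric, the sketch below assumes that the "average use" in the denominator is the mean occurrence count over all distinct features in the corpus. That reading, the function names and the sample counts are assumptions made for illustration.

```python
def relative_usage(counts):
    """Map each feature to count * 100 / mean count over all features."""
    average = sum(counts.values()) / len(counts)
    return {name: count * 100 / average for name, count in counts.items()}


def least_frequently_used(counts, threshold=1.0):
    """Features whose relative usage falls below the threshold."""
    usage = relative_usage(counts)
    return sorted(name for name, value in usage.items() if value < threshold)


# Invented counts, chosen so that one feature falls under the 1% threshold.
counts = {"trans-unit": 500, "source": 500, "target": 497, "bin-target": 3}
usage = relative_usage(counts)
rare = least_frequently_used(counts)
```

With these counts the mean is 375, so `bin-target` scores 3 × 100 / 375 = 0.8 and is the only feature below the threshold.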
Results
Least Frequently Used and Never Used Features
sub (0.48), seg-source (0.70)
prop-group (0.25), prop (0.47)
reference (0.00), internal-file (0.11), external-file (0.18)
skl (0.16)
ex (0.07), bx (0.07), it (0.28), g (0.93)
bin-target (0.01), bin-source (0.09), bin-unit (0.09)
Used in 1 source/organization
Used in 2 or more sources/organizations
How to identify the most influential
features of the standard?
Identify most commonly used features
across organizations
Features that would have widest effects in
case of a change
Results
Commonly Used Features Across Organizations
[Figure: commonly used features, comparing Company 1 and Company 2. Labels: bin-unit, internal-file, glossary, x, ph, header, xliff, phase, skl, seg-source, bx.]
Results
Commonly Used Features Across Organizations
Elements: <xliff>, <file>, <body>, <header>, <trans-unit>, <source>, <target>, <external-file>, <group>, <ph>, <alt-trans>, <note>
Attributes:
- version
- original, source-language, target-language, tool, build-num, product-name, product-version
- id, approved, translate, resname, restype
- xml:space
- state, xml:lang
- href
- resname, restype
- id
- from
Legend: 5 sources/organizations; more than 3 sources/organizations
How to identify the candidate features that
can be introduced to the standard?
Identify frequently added extensions and
custom features
Frequently/most commonly used custom
features that might be standardized in
future, different extensions in use where
new features can be adopted
Results
Frequently Added Extensions
http://www.gs4tr.org/schema/xliff
http://www.idiominc.com/ws/asset
http://www.w3.org/1999/xhtml
http://www.sap.com
urn:xmarker
http://cmf.zope.org/namespaces/default/
http://www.gdf.com/xmlns/gdf-xstr.xsd
urn:ektron:xliff
http://sdl.com/FileTypes/SdlXliff/1.0
http://www.tektronix.com
http://www.ontram.de/XLIFF-Sup-V1
http://www.w3.org/2001/XMLSchema
http://www.crossmediasolutions.de/
rtwsk-extensions
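A list of extension namespaces like this one can be gathered by collecting every element and attribute name that falls outside the XLIFF namespace. The sketch below assumes XLIFF 1.2 input and uses a made-up extension namespace (`urn:example:extension`); it is an illustration, not the script used in the study.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def namespace_of(qualified_name):
    """Return the namespace URI of an ElementTree '{uri}local' name, or None."""
    if qualified_name.startswith("{"):
        return qualified_name[1:].split("}", 1)[0]
    return None


def extension_namespaces(xliff_text):
    """Count element and attribute names that belong to non-XLIFF namespaces."""
    counts = Counter()
    for element in ET.fromstring(xliff_text).iter():
        for name in [element.tag, *element.attrib]:
            uri = namespace_of(name)
            if uri not in (None, XLIFF_NS, XML_NS):
                counts[uri] += 1
    return counts


# 'urn:example:extension' is a hypothetical namespace used only for this sample.
sample = f"""<xliff xmlns="{XLIFF_NS}" xmlns:ext="urn:example:extension" version="1.2">
  <file original="a" source-language="en" datatype="plaintext" ext:inputEncoding="utf-8">
    <body><trans-unit id="1" xml:space="preserve"><source>Hi</source></trans-unit></body>
  </file>
</xliff>"""
found = extension_namespaces(sample)
```

Summing these counters over all files and sources gives a frequency ranking of extensions.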
How to identify the best usage practices of
the standard?
Feature usage patterns
To identify elegant and consistent solutions
to recurring problems in the usage of the
standard and promote best-usage practices
of the standard to improve tools
interoperability
Feature Usage Patterns
[Figures: feature usage patterns for Company B, Company C and Eclipse.]
How to identify where organizations
deviate from the norm w.r.t. usage of
the standard?
Identify frequently used features of
individual organizations
Allows assessment and evaluation of
organizations' individual standard usage
practices
Results
Frequently used Features of Individual Organizations
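Features used by one organization and by no other can be found with a set difference. The sketch below is an illustration with invented sample data and a hypothetical function name; it ignores how frequently each feature is used, which the metric on the slide also takes into account.

```python
def organization_specific_features(usage_by_source):
    """For each source, list the features that no other source uses."""
    specific = {}
    for source, features in usage_by_source.items():
        others = set()
        for other_source, other_features in usage_by_source.items():
            if other_source != source:
                others |= other_features
        specific[source] = sorted(features - others)
    return specific


# Invented sample data; real input would come from the corpus database.
feature_sets = {
    "Company A": {"xliff", "file", "trans-unit", "note"},
    "Company B": {"xliff", "file", "trans-unit", "group"},
    "Eclipse": {"xliff", "file", "trans-unit", "ph"},
}
specific = organization_specific_features(feature_sets)
```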
How to resolve semantic ambiguities of
features of the standard?
Attribute-values and element-values
Helps to identify features leading to
semantic conflicts
Results
Attribute-Values and Element-Values
source-language, target-language and xml:lang:
en, EN, en-US, en-us, EN-US, en_US, ENGLISH, x-dev, unknown, uz-UZ-Cyrl, tbd
date:
04/02/2009 23:24:18, 11/06/2008, 2001-04-01T05:30:02, 2006-11-24, 2007-01-01, 2010-03-16T21:58:27Z
match-quality:
100, 100%, 78.46, fuzzy, String, Guaranteed
state:
final, needs-review, needs-review-l10n, needs-review-translation, needs-translation, new, signed-off, translated, updated, user:translated, x-reviewed
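Value lists like these can be sorted into well-formed and irregular entries with simple checks. The sketch below is deliberately simplified: the language pattern only tests the general shape of a tag and is not a BCP 47 validator, and the date check accepts only the three ISO 8601 layouts listed in the code. Both function names are hypothetical.

```python
import re
from datetime import datetime

# Simplified shape check: a 2-3 letter primary subtag with optional hyphenated
# subtags, or a private-use tag starting with "x-". Not a full BCP 47 validator.
LANGUAGE_SHAPE = re.compile(
    r"^(?:[A-Za-z]{2,3}(?:-[A-Za-z0-9]{1,8})*|x(?:-[A-Za-z0-9]{1,8})+)$"
)

# Only these ISO 8601 layouts are recognised by this sketch.
ISO_LAYOUTS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%SZ")


def is_language_tag_shaped(value):
    return LANGUAGE_SHAPE.match(value) is not None


def is_iso_8601(value):
    for layout in ISO_LAYOUTS:
        try:
            datetime.strptime(value, layout)
            return True
        except ValueError:
            pass
    return False


dates = ["04/02/2009 23:24:18", "11/06/2008", "2006-11-24", "2010-03-16T21:58:27Z"]
irregular_dates = [value for value in dates if not is_iso_8601(value)]
```

Values that fail such checks are candidates for the semantic conflicts the metric is meant to surface, since a receiving tool cannot interpret them reliably.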
Results
Other Possibilities
Results
Tools Involved
Snap-On Ireland, CATFile_Translation_UtilityNAIL.LUI 1.6.0.21409
Idiom WorldServer 9.0.5
AgdaVS export tool
Benten
Ektron
IGD-2-XLIFF
ITS Translate Decorator
Maxprograms JavaPM
Okapi.Utilities.Set01
Pleiades
Swordfish III
blancoNLpackGenerator
genrb
Idiom WorldServer 9.2.0
LKR
PASSOLO 3.0
TM-ABC
Rainbow v2.00
Results
Use of different extensions (e.g. .xlf, .xliff, .xml)
Use of different encoding mechanisms (e.g. utf-8, utf-8 bom, utf-16)
Inconsistent use of DOCTYPE declarations and XML declarations
Never used attributes:
alttranstype, annotates, assoc, clone, comment, extype
Never used predefined attribute values: e.g. lisp for the datatype attribute
Use of improper syntax for user-defined attribute values (i.e. not using the 'x-' prefix)
e.g. 'text' for the 'datatype' attribute
Use of extreme values
e.g. extremely lengthy strings for IDs, spaces within IDs
Use of improper formats
Language (i.e. not as specified in BCP 47/RFC 5646)
Date (i.e. not as specified in ISO 8601)
Omission of required attributes
e.g. version attribute of the <xliff> element
Use of custom values instead of the predefined values
e.g. the use of the pofile value instead of the predefined po value for the datatype attribute
...
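The rule about user-defined values can be expressed as a three-way classification. The sketch below uses only a small illustrative subset of the predefined `datatype` values, so a real check would need the complete list from the specification; the function name is hypothetical.

```python
# Illustrative subset only; the XLIFF 1.2 specification defines the full list,
# so predefined values missing from this set would be misreported as improper.
PREDEFINED_DATATYPES = {"html", "lisp", "plaintext", "po", "xml"}


def classify_datatype(value):
    """Classify a datatype value as predefined, user-defined or improper."""
    if value in PREDEFINED_DATATYPES:
        return "predefined"
    if value.startswith("x-"):
        return "user-defined"
    return "improper"


observed = ["po", "pofile", "text", "x-text"]
classified = {value: classify_datatype(value) for value in observed}
```

Here `pofile` and `text` are flagged because they neither match a predefined value nor carry the `x-` prefix.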
Discussion
XLIFF Data Mining Framework
[Figure: framework stages: Crawling, Preprocessing, Data Mining, Data Analysis.]
Discussion
First large XLIFF corpus and a novel empirical framework
Can analyse the use of the specification
Focus on the aspects of the standard that are obviously more important or less important to the actual stakeholders of the standard
Quick reference method to check the status of actual implementation of dubious elements, attributes, attribute values and usage patterns
First framework that employs a systematic bottom-up approach for identifying important criteria for the standardization process
Can be applied to similar XML-based file formats in other domains to improve important aspects of interoperability
Discussion
Low External Validity
XLIFF corpus is most likely not
representative of all the XLIFF files used in
the real world
Discussion
How to Make Data Mining of XLIFF
Artifacts Relevant for the Ongoing
Development of the XLIFF Standard?
Discussion
Imagine that you could run the
previously shown queries on a
representative corpus!
Discussion
Change SotA Report methodology to include sample files for the corpus
XLIFF P&L SC is currently finalizing work on the 2nd edition of the SotA report (XLIFF SotA Report 2nd Ed DRAFT)
At the same time we’re kicking off preparations for the 3rd and 4th editions
3rd edition will add XLIFF 2.0 to the mix
4th edition depends on designing and approving the process of creation of the SC warranted corpus for empirical evaluation of XLIFF feature support
Discussion
We want to hear from you if you are
willing to contribute to the creation
of the corpus
Discussion
Considerations for making the corpus
Proportion
Confidentiality
Process stage
Well formedness
Etc.
Thank you!
This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation
Localisation (www.cngl.ie) at the Localisation Research Centre (Department of Computer Science and Information Systems),
University of Limerick, Limerick, Ireland. Thanks to Dr. Ian O'Keeffe and Dr. Jim Buckley for all the guidance, suggestions and
feedback. We would like to thank the CNGL industrial partner organisations that contributed to the XLIFF corpus.