Languages at Inxight
Ian Hersey
Co-Founder and SVP, Corporate Development and Strategy
Inxight at a Glance
Inxight provides the only complete solution for
organizing and accessing unstructured data to increase
the speed and accuracy of information discovery
20+ years of Xerox PARC research - 70 patents


Content & linguistic analysis (27 languages today)
Information visualization and discovery
Silicon Valley HQ; offices in US, Europe
250 major customers
Seasoned management team
Solid investor backing:
Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
Inxight Confidential
What we mean by language support
• Not pure statistics
  - “Language independence” is a fallacy when it comes to text
  - Whitespace parsing + algorithmic stemming is a cheap hack
    - Stem-internal changes
    - Compounding
    - Agglutination
    - Vocalization or lack thereof
    - Non-breaking languages
  - Phrases, terms and named entities can’t be extracted effectively by n-gram indexing or pure machine learning
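To make the "cheap hack" point concrete, here is a minimal sketch comparing a suffix-stripping stemmer with a lexicon lookup. The suffix list, the tiny lexicon, and the function names are all assumptions made for illustration; they are not Inxight's resources or API.

```python
# Toy comparison: algorithmic suffix stripping vs. lexicon lookup.
# SUFFIXES and LEXICON are hypothetical, hand-made illustration data.

SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix; no linguistic knowledge involved."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


LEXICON = {
    "mice": "mouse",    # stem-internal change (English)
    "sang": "sing",     # stem-internal change (English)
    "häuser": "haus",   # umlaut plural (German)
    "walked": "walk",   # regular suffixation
}


def lexicon_stem(token: str) -> str:
    """Look the token up; fall back to suffix stripping for unknown words."""
    return LEXICON.get(token, naive_stem(token))


if __name__ == "__main__":
    for word in ["mice", "sang", "häuser", "walked"]:
        print(word, "->", naive_stem(word), "vs", lexicon_stem(word))
    # Whitespace parsing yields a single "token" for unsegmented text.
    print("我喜欢猫".split())
```

Suffix stripping handles "walked" but leaves "mice", "sang" and "häuser" untouched, and whitespace splitting returns the whole unsegmented string as one token. Those are the cases the slide lists.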
Text analysis fundamentals
Base layer
• Language and character set identification
• Document analysis
• Tokenization
• Stemming/normalization
Contextual analysis
• Part-of-speech tagging
• “Grouping”
Find the interesting stuff
• Named entity extraction
• Syntactic analysis (clause boundary identification, subject/object identification, etc.)
Relate the interesting stuff; analyze meaning
• Semantic analysis (fact extraction, etc.)
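The layering above can be sketched as a chain of stages, each adding markup that the next stage consumes. This is an illustrative toy, not the actual product architecture: the stage names, the dictionary-based document, and the capitalization-based "tagger" are assumptions standing in for real components.

```python
import re
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]  # hypothetical annotated-document container


def tokenize(doc: Doc) -> Doc:
    # Base layer: split into word and punctuation tokens.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def normalize(doc: Doc) -> Doc:
    # Base layer: lowercase as a stand-in for stemming/normalization.
    doc["norms"] = [t.lower() for t in doc["tokens"]]
    return doc


def tag(doc: Doc) -> Doc:
    # Contextual analysis: placeholder tagger, capitalized tokens are PROPN.
    doc["tags"] = ["PROPN" if t[0].isupper() else "X" for t in doc["tokens"]]
    return doc


def group(doc: Doc) -> Doc:
    # Grouping: join runs of adjacent PROPN tokens into candidate names.
    groups, current = [], []
    for token, pos in zip(doc["tokens"], doc["tags"]):
        if pos == "PROPN":
            current.append(token)
        elif current:
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    doc["groups"] = groups
    return doc


PIPELINE: List[Callable[[Doc], Doc]] = [tokenize, normalize, tag, group]


def analyze(text: str) -> Doc:
    doc: Doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    print(analyze("Ian Hersey works at Inxight in Silicon Valley.")["groups"])
```

Each stage reads only what earlier stages wrote, so a better tagger or grouper can replace the placeholder without touching the rest of the chain.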
Don’t ignore statistics
• Feed linguistic markup into probabilistic processing
  - Categorization (choose your algorithm)
  - Search/relevance ranking
  - Summarization
  - Co-occurrence analysis/entity resolution
  - Link analysis
  - Predictive analysis/data mining
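As one example of feeding linguistic markup into statistics, the sketch below counts how often pairs of extracted entities appear in the same document. The input lists are assumed to come from an upstream entity extractor; the sample data and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence(docs: Iterable[List[str]]) -> Counter:
    """Count unordered entity pairs that appear in the same document."""
    counts: Counter = Counter()
    for entities in docs:
        # Deduplicate and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical per-document entity lists (output of an entity extractor).
DOCS_ENTITIES = [
    ["Inxight", "Xerox", "PARC"],
    ["Inxight", "In-Q-Tel"],
    ["Xerox", "PARC"],
]

if __name__ == "__main__":
    for pair, n in cooccurrence(DOCS_ENTITIES).most_common(3):
        print(pair, n)
```

The counts are only as good as the entities going in, which is the slide's point: the statistics operate on linguistic markup rather than on raw strings.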
Base layer (LinguistX Platform)
• Morphological analyzer
  - Lexicon + rules
  - Compiled as a finite-state machine
  - Resource efficient, very fast
    - French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
    - Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
• Corpora to produce statistical models
  - Language and character set detection
  - Tagged corpus to produce Hidden Markov Model for POS tagger
• Groupers
  - Finite-state “chunkers” – compiled regex
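A grouper of the "compiled regex" kind can be sketched by encoding each part-of-speech tag as one character and matching a pattern over the resulting string. The tag names, the single-letter codes, and the noun-phrase pattern are assumptions for illustration, not the LinguistX tagset or grammar.

```python
import re
from typing import List, Tuple

# Hypothetical one-letter codes; any tag not listed becomes "O".
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V"}

# Optional determiner, any number of adjectives, one or more nouns.
NOUN_PHRASE = re.compile(r"D?A*N+")


def chunk(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return noun-phrase chunks from (token, tag) pairs."""
    # One character per token, so match offsets are token indices.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in NOUN_PHRASE.finditer(codes)
    ]


TAGGED = [
    ("the", "DET"), ("French", "ADJ"), ("lexicon", "NOUN"),
    ("recognizes", "VERB"), ("five", "NUM"), ("million", "NUM"),
    ("words", "NOUN"),
]

if __name__ == "__main__":
    print(chunk(TAGGED))  # ['the French lexicon', 'words']
```

Because the pattern is compiled once and runs over a compact tag string, this style of chunking stays cheap even on large inputs.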
Named entity extraction
(ThingFinder)
• Builds on base platform
• Requires additional resources
  - Enhanced lexicon (POS tagset insufficient for high quality extraction)
  - Entity-specific groupers
  - Tagged corpus for accuracy testing
• Sometimes you need more
  - Genre-specific document analysis
  - Specialized tokenization, tagging
  - Knowledge base (“Name Catalog”)
  - Custom groupers
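The knowledge-base idea can be illustrated with a longest-match lookup against a small catalog of known names. This is only a sketch: the catalog contents, the entity type labels, and the function name are hypothetical, and a real name catalog would be combined with the groupers and lexicon described above.

```python
from typing import Dict, List, Tuple

# Hypothetical catalog: lowercased token sequences mapped to entity types.
NAME_CATALOG: Dict[Tuple[str, ...], str] = {
    ("acme", "corp"): "ORGANIZATION",
    ("acme",): "ORGANIZATION",
    ("jane", "doe"): "PERSON",
    ("new", "york"): "PLACE",
}
MAX_LEN = max(len(key) for key in NAME_CATALOG)


def find_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Scan left to right, preferring the longest catalog match."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in NAME_CATALOG:
                found.append((" ".join(tokens[i:i + n]), NAME_CATALOG[key]))
                i += n
                break
        else:
            i += 1  # no catalog entry starts at this token
    return found


if __name__ == "__main__":
    print(find_entities("Jane Doe joined Acme Corp in New York".split()))
```

Trying the longest candidate first means "Acme Corp" is reported as one entity rather than as "Acme" followed by an unmatched "Corp".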
Statistical models
• Summarization
  - Base layer + feature model (feature weights, stop words, cue phrases)
• Categorization
  - Labeled training data
• …and lots of interactive tools
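A feature model for extractive summarization might look roughly like the following: each sentence gets a weighted sum of a term-frequency feature, a cue-phrase feature, and a position feature. The weights, stop words, cue phrases, and sample text are all invented for illustration and do not reflect the product's actual model.

```python
import re
from collections import Counter
from typing import List

# Hypothetical feature model.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "are", "was"}
CUE_PHRASES = ("in summary", "in conclusion")
WEIGHTS = {"term": 1.0, "cue": 2.0, "position": 1.0}


def content_words(sentence: str) -> List[str]:
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text: str, n: int = 1) -> List[str]:
    """Return the n highest-scoring sentences in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(index: int) -> float:
        sentence = sentences[index]
        words = content_words(sentence)
        term = sum(freq[w] for w in words) / len(words) if words else 0.0
        cue = 1.0 if any(c in sentence.lower() for c in CUE_PHRASES) else 0.0
        position = 1.0 if index == 0 else 0.0
        return (WEIGHTS["term"] * term + WEIGHTS["cue"] * cue
                + WEIGHTS["position"] * position)

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


TEXT = ("Finite-state lexicons are compact. The weather was nice. "
        "In summary, finite-state lexicons are compact and fast.")

if __name__ == "__main__":
    print(summarize(TEXT, 1))
```

In a fuller system the word counts would be taken over stems from the base layer rather than raw lowercase strings, so that inflected forms of the same word reinforce each other.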
Fact extraction
• Builds on base of linguistic markup + named entities
• Modeled on specific templates
• Rules populate the templates
Additional linguistic resources
• Intra-document
  - Document analysis/genre identification
  - Subject/object identification
  - Anaphora resolution
• Inter-document
  - Entity resolution
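The template-and-rule idea can be sketched as follows: a template is a small record type, and a rule fills it when it sees a particular pattern over entity-marked text. The `Acquisition` template, the trigger verbs, the span format, and the company names are hypothetical examples, not the product's rule language.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acquisition:
    """Hypothetical fact template with two slots."""
    acquirer: str
    target: str


# A span is (text, entity type), with None for non-entity text.
Span = Tuple[str, Optional[str]]

TRIGGERS = {"acquired", "bought"}  # assumed trigger verbs


def extract_acquisitions(spans: List[Span]) -> List[Acquisition]:
    """Rule: ORG, trigger verb, ORG fills the Acquisition template."""
    facts = []
    for i in range(len(spans) - 2):
        (left, ltype), (verb, vtype), (right, rtype) = spans[i:i + 3]
        if (ltype == "ORG" and rtype == "ORG"
                and vtype is None and verb.lower() in TRIGGERS):
            facts.append(Acquisition(acquirer=left, target=right))
    return facts


SPANS: List[Span] = [
    ("Acme Corp", "ORG"), ("acquired", None), ("Widget Inc", "ORG"),
    ("in", None), ("2005", None),
]

if __name__ == "__main__":
    print(extract_acquisitions(SPANS))
```

The rule never inspects raw characters; it depends entirely on the named-entity markup produced earlier, which is why fact extraction sits on top of the other layers.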
Developing a new language
• Resource acquisition
  - Corpora
  - Lexicon
  - Team
    - Computational linguist familiar with tools
    - Native speaker
• Resource enhancement
  - Label tagged truth sets
  - Build out morphological classes
  - Fill lexical gaps
• Build, test and refine
Soup to nuts: $500K to $1M for V1.0
Challenge of low-density languages
• Commercial non-viability
• Lack of lexical resources and corpora
• Lack of native speakers, or even proficient speakers
• Greed
Future developments on the language frontier
• New languages
  - Added Arabic, Farsi and Chinese this year
• Increased depth in existing languages
  - Named entity extraction
    - Enhanced English for DoD and DOJ
  - Fact extraction
• Other challenges
  - Name transliteration
  - Translation/glossing
  - Question answering