Languages at Inxight
Ian Hersey
Co-Founder and SVP, Corporate Development and Strategy
Inxight at a Glance
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery
20+ years of Xerox PARC research - 70 patents
Content & linguistic analysis (27 languages today)
Information visualization and discovery
Silicon Valley HQ; offices in US, Europe
250 major customers
Seasoned management team
Solid investor backing:
Vantage Point, Reed Elsevier, Deutsche Bank,
Dresdner Bank, Xerox, In-Q-Tel
Inxight Confidential
What we mean by language support
Not pure statistics
“Language independence” is a fallacy when it comes to text
Whitespace parsing + algorithmic stemming is a cheap hack
Stem-internal changes
Compounding
Agglutination
Vocalization or lack thereof
Non-breaking languages
Phrases, terms and named entities can’t be extracted
effectively by n-gram indexing or pure machine
learning
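The failure modes listed above can be illustrated with a small sketch. The stemmer below is a hypothetical toy (the suffix list and function names are assumptions, not any Inxight component); it stands in for the "whitespace parsing + algorithmic stemming" approach and shows where it breaks.

```python
from typing import List

# Hypothetical toy stemmer: strips a few English suffixes and nothing more.
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def whitespace_tokens(text: str) -> List[str]:
    """Tokenize by lowercasing and splitting on whitespace only."""
    return text.lower().split()


# Regular suffixation works.
print(naive_stem("walking"))

# Stem-internal change: "mice" and "mouse" are never conflated.
print(naive_stem("mice"), naive_stem("mouse"))

# Compounding: a German compound stays a single opaque token.
print(whitespace_tokens("Donaudampfschifffahrt"))

# Non-breaking languages: without whitespace, the whole sentence is one "word".
print(whitespace_tokens("東京都に住む"))
```

A lexicon-backed morphological analyzer addresses these cases by knowing the word forms of the language rather than guessing from surface suffixes.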
Text analysis fundamentals
Base layer
Language and character set identification
Document analysis
Tokenization
Stemming/normalization
Contextual analysis
Part-of-speech tagging
“Grouping”
Find the interesting stuff
Named entity extraction
Syntactic analysis (clause boundary identification, subject/object identification, etc.)
Relate the interesting stuff; analyze meaning
Semantic analysis (fact extraction, etc.)
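The layering above can be pictured as a sequence of stages, each adding an annotation layer that later stages read. The following is a minimal sketch under loud assumptions: the `Document` class, the stage names, and the capitalization-based "tagger" are all hypothetical placeholders, not the real components.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    text: str
    layers: Dict[str, list] = field(default_factory=dict)


def tokenize(doc: Document) -> None:
    # Base layer: split into word and punctuation tokens.
    doc.layers["tokens"] = re.findall(r"\w+|[^\w\s]", doc.text)


def normalize(doc: Document) -> None:
    # Base layer: placeholder normalization (lowercasing only).
    doc.layers["norms"] = [t.lower() for t in doc.layers["tokens"]]


def tag(doc: Document) -> None:
    # Contextual analysis: placeholder tagger, capitalized means PROPN.
    doc.layers["tags"] = [
        "PROPN" if t[0].isupper() else "X" for t in doc.layers["tokens"]
    ]


def group(doc: Document) -> None:
    # Grouping: merge runs of adjacent PROPN tokens into candidate entities.
    groups: List[str] = []
    current: List[str] = []
    for token, pos in zip(doc.layers["tokens"], doc.layers["tags"]):
        if pos == "PROPN":
            current.append(token)
        elif current:
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    doc.layers["groups"] = groups


PIPELINE: List[Callable[[Document], None]] = [tokenize, normalize, tag, group]


def analyze(text: str) -> Document:
    doc = Document(text)
    for stage in PIPELINE:
        stage(doc)
    return doc


print(analyze("Ian Hersey founded Inxight.").layers["groups"])
```

The point of the sketch is the ordering: each higher layer depends on the markup produced below it, so errors in tokenization or tagging propagate upward.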
Don’t ignore statistics
Feed linguistic markup into probabilistic processing
Categorization (choose your algorithm)
Search/relevance ranking
Summarization
Co-occurrence analysis/entity resolution
Link analysis
Predictive analysis/data mining
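As one illustration of feeding linguistic markup into statistical processing, the sketch below counts document-level co-occurrence over entities that an upstream extractor is assumed to have produced already. The function name and the sample entity names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence(docs_entities: Iterable[List[str]]) -> Counter:
    """Count how many documents mention each unordered pair of entities."""
    counts: Counter = Counter()
    for entities in docs_entities:
        # Deduplicate within a document and sort so each pair has one key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Entities per document, as an upstream extractor might emit them.
docs = [
    ["Acme", "Globex"],
    ["Globex", "Acme", "Initech", "Acme"],
    ["Initech"],
]
counts = cooccurrence(docs)
print(counts[("Acme", "Globex")])
```

Because the input is normalized entities rather than raw strings, variants of the same name count toward the same pair, which is where the linguistic layer pays off.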
Base layer (LinguistX Platform)
Morphological analyzer
Lexicon + rules
Compiled as a finite-state machine
Resource efficient, very fast
French lexicon recognizes 5M words; takes up 300 KB on disk/RAM, and runs at over 2 GB/hr on a low-end machine
Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
Corpora to produce statistical models
Language and character set detection
Tagged corpus to produce Hidden Markov Model for POS tagger
Groupers
Finite-state “chunkers” – compiled regex
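A finite-state chunker of the kind described above can be approximated by running a compiled regular expression over the sequence of part-of-speech tags rather than over the raw text. This is a sketch only: the tag names, the one-letter encoding, and the noun-phrase pattern are assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical one-letter codes so a tag sequence becomes a plain string.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# Noun phrase: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")


def chunk_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return noun-phrase chunks from (token, tag) pairs."""
    # Any tag outside TAG_CODES maps to "x", which the pattern never matches.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(codes):
        # One code per token, so match offsets index straight into `tagged`.
        tokens = [tok for tok, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(tokens))
    return chunks


tagged = [
    ("the", "DET"), ("fast", "ADJ"), ("finite", "ADJ"),
    ("state", "NOUN"), ("machine", "NOUN"),
    ("runs", "VERB"), ("quickly", "ADV"),
]
print(chunk_noun_phrases(tagged))
```

Matching over tags instead of words keeps the pattern small and language-specific knowledge in the tagger, which is why the grouper depends on the layers beneath it.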
Named entity extraction
(ThingFinder)
Builds on base platform
Requires additional resources
Enhanced lexicon (POS tagset insufficient for high quality extraction)
Entity-specific groupers
Tagged corpus for accuracy testing
Sometimes you need more
Genre-specific document analysis
Specialized tokenization, tagging
Knowledge base (“Name Catalog”)
Custom groupers
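The knowledge-base idea above can be sketched as a longest-match lookup against a catalog of known names. Everything here is an illustrative assumption: the catalog entries, the entity type labels, and the function name are hypothetical and do not describe ThingFinder's actual data or interface.

```python
from typing import Dict, List, Tuple

# Hypothetical catalog: lowercase token sequences mapped to entity types.
NAME_CATALOG: Dict[Tuple[str, ...], str] = {
    ("new", "york"): "PLACE",
    ("new", "york", "times"): "ORGANIZATION",
    ("york",): "PLACE",
}
MAX_LEN = max(len(key) for key in NAME_CATALOG)


def find_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Scan left to right, preferring the longest catalog match at each position."""
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in NAME_CATALOG:
                found.append((" ".join(tokens[i:i + n]), NAME_CATALOG[key]))
                i += n
                break
        else:
            # No catalog entry starts here; move on by one token.
            i += 1
    return found


print(find_entities("The New York Times covers New York".split()))
```

Longest match matters because shorter entries are often embedded in longer names of a different type, as with the place inside the newspaper name here.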
Statistical models
Summarization
Base layer + feature model (feature weights, stop words, cue phrases)
Categorization
Labeled training data
…and lots of interactive tools
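The summarization feature model above (feature weights, stop words, cue phrases) can be sketched as a weighted sentence scorer for extractive summaries. The specific stop words, cue phrases, weights, and the position feature are assumptions made for illustration, not the values of any shipped model.

```python
import re
from collections import Counter
from typing import List

# Hypothetical feature model.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to", "on"}
CUE_PHRASES = ("in summary", "in conclusion")
WEIGHTS = {"frequency": 1.0, "cue": 5.0, "position": 2.0}


def summarize(sentences: List[str], k: int = 1) -> List[str]:
    """Return the k highest-scoring sentences, in document order."""
    words = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws if w not in STOP_WORDS)

    def score(i: int) -> float:
        content = [w for w in words[i] if w not in STOP_WORDS]
        # Average corpus frequency of the sentence's content words.
        frequency = sum(freq[w] for w in content) / max(len(content), 1)
        cue = 1.0 if any(c in sentences[i].lower() for c in CUE_PHRASES) else 0.0
        position = 1.0 if i == 0 else 0.0
        return (WEIGHTS["frequency"] * frequency
                + WEIGHTS["cue"] * cue
                + WEIGHTS["position"] * position)

    top = sorted(range(len(sentences)), key=score, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "Finite-state lexicons are compact.",
    "Lunch was served at noon.",
    "In summary, finite-state lexicons are compact and fast.",
]
print(summarize(sentences, k=1))
```

In a real system the word counts would come from stems produced by the base layer rather than from raw lowercase strings, so inflected variants contribute to the same feature.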
Fact extraction
Builds on base of linguistic markup + named entities
Modeled on specific templates
Rules populate the templates
Additional linguistic resources
Intra-document
Document analysis/genre identification
Subject/object identification
Anaphora resolution
Entity resolution
Inter-document
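The template-and-rule approach above can be sketched as a rule that matches over text already marked up with named entities and fills the slots of a template. The `AcquisitionFact` template, the `<ORG>` markup convention, and the trigger verbs are all hypothetical choices for this example.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class AcquisitionFact:
    """Hypothetical template with two slots."""
    buyer: str
    target: str


# Rule: ORG, a trigger verb, ORG. Entities are assumed to be marked up upstream.
RULE = re.compile(
    r"<ORG>(?P<buyer>[^<]+)</ORG> (?:acquired|bought) <ORG>(?P<target>[^<]+)</ORG>"
)


def extract_facts(marked_up: str) -> List[AcquisitionFact]:
    """Populate one template per rule match."""
    return [
        AcquisitionFact(m.group("buyer"), m.group("target"))
        for m in RULE.finditer(marked_up)
    ]


text = "<ORG>Acme</ORG> acquired <ORG>Globex</ORG> last year."
print(extract_facts(text))
```

A surface rule like this misses passives and pronouns ("it was bought by them"), which is why the slide lists subject/object identification and anaphora resolution as additional resources.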
Developing a new language
Resource acquisition
Corpora
Lexicon
Team
Computational linguist familiar with the tools
Native speaker
Resource enhancement
Label tagged truth sets
Build out morphological classes
Fill lexical gaps
Build, test and refine
Soup to nuts: $500K to $1M for V1.0
Challenge of low-density languages
Commercial non-viability
Lack of lexical resources and corpora
Lack of native speakers, or even proficient speakers
Greed
Future developments on the language frontier
New languages
Increased depth in existing languages
Named entity extraction
Added Arabic, Farsi and Chinese this year
Enhanced English for DoD and DOJ
Fact extraction
Other challenges
Name transliteration
Translation/glossing
Question answering