Slide 1

Transcript Slide 1

Why should I care about
Computational Linguistics &
Language Processing?
Hsiao-Wuen Hon
洪小文
Assistant Managing Director
Microsoft Research Asia
Should I care?

Medical school


Golden rice bowl (金饭碗): a secure, well-paid job
Electronics

Employee stock bonuses (配股)


Chip manufacture


TSMC, UMC
Hardware


Easy way to become millionaire
Acer, Quanta, 鸿海 (Hon Hai), BenQ, 英业达 (Inventec), MiTac
NLP? Speech? IR? HWR?
It is actually a good choice


People go on to have good careers
Many applications







IR, HWR
Investment banks
Bioinformatics
…..
With many smart people
Software Industry cares
Not overproducing students
Industry Cares


People you might know
Academics




Pillars of A.I.
Well funded
Taiwan professors
Overseas professors

V. Zue, B.H. Juang, F. Jelinek, M. Liberman, N.
Chomsky, Michael Collins, Fernando Pereira …
Industry Cares

Industrial R&D Labs

Executives


Microsoft



X.D. Huang, 洪小文 (Hsiao-Wuen Hon), 马维英 (Wei-Ying Ma), Eric Chang, 周明 (Ming Zhou), Eric
Brill, Ken Church, …
Continue hiring
Google




Kai-Fu Lee (MS), Qi Lu (Yahoo), …
Speech - Amit Singhal, Michael Riley, … etc.,
NL – Franz Och, Krishna Bharat, Dekang Lin, …
Aggressively hiring
Others…
Industry Cares

Other applications

Renaissance Technologies



Hedge fund management – 4 billion in assets
Time-series prediction based on S&L (speech & language)
technologies
a.k.a. the ex-IBM S&L group


P. Brown, R. Mercer, P. De Souza, L. Bahl, Della Pietra
brothers, …
Startups

Nuance, SpeechWorks, InfoTalk, iPhrase,
Lexicus, …
Microsoft Cares

Bill Gates’ vision





PC on everyone’s desktop (’75)
Information at your fingertips (’90)
Seamless Computing (’03)
S&L technologies are the key
Billions of $ investment in S&L technologies




Full-size S&L product & research groups
Multi-lingual & multi-products
Continue hiring
Expanded investment due to search/Google
Information Agent



“Do what I mean”
“Find what I want”
How to turn on Firewall in Windows?

Speech recognition


Natural language understanding




Signal to text
Syntax/semantics
Domain knowledge
Knowledge search
AI-Complete
A Long Long Journey

Speech

Ubiquitous interface
Automatic Speech Recognition

Text-to-Speech


Natural Language




Spelling/grammar/style checking
IME
Machine translation
Information Retrieval & Mining
Speech

SAPI 1.0 – 6.0




Office Dictation


Chinese, English
Microsoft Speech Server


Windows Sound System in ’92
Platform for building speech app. in
Windows
Accessibility support (Screen Reader)
Telephony speech & multimodal platform
Other – Encarta, WinCE/Smartphone…
Speech
[Chart: Human Error Rate, Machine Error Rate, and Log (Machine Error Rate); vertical axis 0% to 30%, horizontal axis 1993 to 2011]
MSRA Speech

TTS – multi-lingual natural TTS
Elan Speech


Mulan
Chinese LVCSR - dictation/telephony/embedded
Fundamental research
AIME: Audio Info. Management & Extraction



AT&T
ASR


Loquendo
Audio/video file indexing/retrieval
Offline transcription/extraction/summarization
More in Eric’s keynote tomorrow

From the Lab to Ubiquity: Speech Technology's Road to Mainstream
NLP Contributes to MS Products













IME (Chinese, Japanese, …)
Spelling/grammar checking
Spam filtering
English Writing Wizard (EWW)
Spoken language interface
IR and CLIR
Text mining
Machine translation
Search engine
QA (AskMSR)
SLM for Speech
Text analysis for TTS
…..
NLP “Rainbow”
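[Diagram: Source Text is analyzed through Word Breaking, Morphology, Dictionary, Syntax, Logical Form, and Discourse toward Understanding and a Knowledge base; Generation runs through Discourse, Logical Form, Syntax, Dictionary, and Morphology to Target Text; Transfer links Analysis and Generation. Applications marked on the diagram: Grammar Checking, Machine Translation]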
NLP at MSRA
Applications
Chinese IME
English writing wizard
Enterprise search
Japanese IME
Pocket translator
SQL Text Mining
Spelling check
Extended TM
Resume Routing
NLP
Machine Translation
Information Extraction
Information Retrieval
Meta data extraction
Skeleton parser
Research
Translation evaluation
paraphrasing
Term extraction
Named entity identification
Translation knowledge acquisition
Shallow MT
Annotation tool
EBMT & SMT
Machine learning
POS tagging
SLM
Linguistic Resources
Monolingual resources (C, J, E)
Bilingual resources (C, E)
MRD
MRD
Parsing lexicon
Web retrieval
Indexing
Special purpose
Bilingual corpus
Balanced corpus
Tagged corpus
Cross language IR
QMapping
Translation
Bilingual tagged
lexicon
corpus
Resume routing
NLP at MSRA


TIME
Email Routing





Spam filtering
Resume routing
Support routing
EWW
Translation
TIME Platform


Text Information Management & Extraction
Goal: extract information from text data




genres: email, newspaper, report, web pages
formats: Word document, PDF/PS, HTML/XML
languages: English, Chinese, Japanese, …
Applications: search, question answering, data
mining, machine translation
TIME
System
TIME Components

Linguistic processing -> TIME linguistic platform




Information extraction



Text normalization: sentence splitting, tokenization,
morphological analysis
Entity extraction: person name, company name, time
expression, phrases
Relation learning: syntactic/semantic dependencies between
entities
Document property extraction: title, author, key term,
summary
Domain knowledge extraction: concept, concept relation,
glossary, taxonomy, event
Cross-lingual information exchange


Translation at word, entity, term, skeleton, text levels
Reading, writing, cross language information retrieval
TIME Demo
Multi-lingual linguistic unit processing

Word




Sentence


Tokenization
Named entity recognition (NER)
POS
Chunking (VP/NP)
Source-channel models:
TIME (linguistic unit processing)
Chinese Tokenization & NEI
English Chunking and POS Tagging
Skeleton Parser

Skeleton == <subject V object>


Input: He is succeeded by Ivan Allen Jr.
Output: [He] (Sub) is succeeded by [Ivan Allen Jr.] (Obj)


More robust & faster than traditional parser
Adequate for most applications

Collocation checking, Spell checking, Grammar
checking, QA, Search
Skeleton Parser

Key Dependency Relations
A set of most important relations (e.g. subject, object…)
Definition based on application



Our Target: A Robust & Fast Dependency Extractor
Does not rely on high-quality (hand-annotated) training data.
Highly efficient on large-scale data (e.g. web
data)



Potential Applications
Information Extraction, Q/A, TDT

Who (Subject-Verb), Whom (Verb-object), What (Adj-Noun)


Machine translation


Skeleton translation
NL-based Information Retrieval


Cross-Language IR
Re-ranking by triple matching
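Re-ranking by triple matching can be sketched as follows. This is a minimal illustration, not the system described on the slide: the (head, relation, dependent) triple format, the function names, and the linear mix with weight 0.5 are all assumptions made for the example.

```python
from typing import List, Set, Tuple

# Hypothetical triple format: (head, relation, dependent)
Triple = Tuple[str, str, str]


def triple_overlap(query: Set[Triple], doc: Set[Triple]) -> float:
    """Fraction of the query's dependency triples that also occur in the document."""
    if not query:
        return 0.0
    return len(query & doc) / len(query)


def rerank(query_triples: Set[Triple],
           candidates: List[Tuple[str, float, Set[Triple]]],
           weight: float = 0.5) -> List[Tuple[str, float]]:
    """Mix the retrieval score with the triple overlap and sort descending.

    candidates holds (doc_id, base_score, doc_triples); weight is an assumed
    interpolation parameter, not a value from the slides.
    """
    scored = [
        (doc_id, (1 - weight) * base + weight * triple_overlap(query_triples, triples))
        for doc_id, base, triples in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


query = {("succeed", "subj", "he"), ("succeed", "obj", "allen")}
candidates = [
    ("d1", 0.60, {("succeed", "subj", "he")}),
    ("d2", 0.55, {("succeed", "subj", "he"), ("succeed", "obj", "allen")}),
]
ranked = rerank(query, candidates)
```

Here d2 starts with the lower retrieval score but matches both query triples, so it moves ahead of d1.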
Proposed approach
Raw corpus
NLPWin Parser
Input Sentence
PoS Tagging
Parsed corpus
Chunking
Noise Filtering
Shallow Parser
Training Data
Training
Key Dependency Triples
The proposed approach
Raw corpus
Input Sentence
NLPWin Parser
PoS Tagging
Parsed corpus
Chunking
Noise Filtering
Feature Extraction
Training Data
Training
Classification
Key Dependency Triples
Skeleton Parser
Term Extraction
Text
Candidate
Generation
Term List
Options:
Ranking
Options:
Boundary determination
Term frequency Terms
BaseNP
TF-IDF
Pattern filtering
Entropy reduction
ER-IDF
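The TF-IDF ranking option above can be illustrated with a short sketch. It assumes candidate terms have already been generated for each document; the function name, the toy documents, and the plain log(N/df) form of IDF are assumptions for the example, and ER-IDF is not covered.

```python
import math
from collections import Counter


def rank_terms(docs, target_index):
    """Rank the candidate terms of one document by TF-IDF.

    docs is a list of candidate-term lists, one list per document.
    """
    n_docs = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))          # document frequency: one count per document
    tf = Counter(docs[target_index])   # term frequency in the target document
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = [
    ["speech server", "speech server", "beta", "server"],
    ["search engine", "server", "beta"],
    ["search engine", "query log", "server"],
]
ranked = rank_terms(docs, 0)
```

A term that appears in every document ("server" here) gets an IDF of zero and drops to the bottom of the list, while a term concentrated in the target document rises to the top.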
Term Extraction
Text Mining Roadmap
Information Desk
Meta Data for
Sharepoint
SQL Text Mining
Text Miner
Key technologies




Metadata extraction
Ranking algorithm
Multi-languages support
Information Desk

http://msra-nlc-tm1

http://msra-nlc-tm1/
Machine Translation Roadmap
Direction




Template based
Linguistic data acquisition from
Web mining
TIME
Search Engine
Office EWW
Mobility
Key technologies






Skeleton parser
Collocation checker
Paraphrase
Knowledge acquisition
Adaptive to new language
pairs
EWW (English Writing Wizard)
Objectives
Make your English writing as good as a native speaker's
Features
Idiomatic usages
Synonymous collocation
Collocation translations
Bilingual example sentences
Technology Highlights
Idiomatic Usage
Input: question
question (Noun)
Verb+question: raise ~, ask ~, resolve ~, pose ~
Adj+question: unanswered ~, serious ~, big ~, real ~
question (Verb)
question+Noun: ~ motive, ~ value, ~ truth, ~ boy
question+Adv: ~ intensely, ~ orally, ~ closely, ~ at_all
Adv+question: privately ~, cautiously ~, hardly ~
Synonymous Collocation
attain~dobj~level -> achieve~dobj~level
attract~dobj~fan -> draw~dobj~fan
take~dobj~reins -> assume~dobj~reins|hold~dobj~reins
bad~Intnsifs~extremely -> risky~Intnsifs~extremely
unusual~Intnsifs~quite -> unusual~Intnsifs~rather
vision~Attrib~unusual -> sight~Attrib~unusual
Improve~Mod~greatly -> Improve~Mod~considerably
Auto extraction of idiomatic usage
Auto extraction of synonymous collocation
Auto extraction of collocation translations

Example sentence retrieval
Collocation Translation
克服~困难 (overcome~difficulty)
conquer~difficulty, overcome~difficulty, master~difficulty
overcome~adversity, surmount~difficulty
Web Search & Mining


Internet + Data + Information ->
Search, Mining, Sharing, & Intelligence
Lots of text




Text-based IR
Text Mining
Semantic/Structure Mining
Media Search



Surrounding text
Audio/video transcription
Make Billions of $ from trillions of words
Information Retrieval

Text Processing



Tokenization
Normalization – stemming, …
Precision/Recall

Beyond 1st order statistics (TF-IDF)


Better model of P(Doc|Query)


Classification vs. term frequency
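Precision and recall, the two evaluation measures named on this slide, are computed from the set of retrieved documents and the set of relevant documents: precision = |retrieved ∩ relevant| / |retrieved| and recall = |retrieved ∩ relevant| / |relevant|. A minimal sketch, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three documents actually relevant, two in common
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

In this example half of the returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).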
Result Summarization

Query sensitive


N-gram for adaptive indexing
U盘 (优盘, USB flash drive) vs. 大拇哥 (“thumb drive”)
Result clustering & classification
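One reading of n-gram indexing is that overlapping character n-grams let an index pick up new words that no dictionary-based tokenizer knows yet. The sketch below assumes that reading; the bigram size, function names, and sample documents are illustrative choices, not details from the slides.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs, n=2):
    """Inverted index from each character n-gram to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


docs = {"a": "新款U盘上市", "b": "大拇哥降价"}
index = build_index(docs)
```

A query for a newly coined word is split into the same n-grams and looked up directly, so no dictionary update is needed before the word becomes searchable.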
Search Long Result List



A user searches for information
about “jaguar”, a Mac OS release
However, the relevant results are
mixed with other pages
The user needs to go through a
long list to find the desired information
Clustering vs. Classification
Clustering Results for “jaguar”
Classification Results for “jaguar”
Document Clustering & Sub-topic Identification

Search Result Grouping



Overview of the returned documents
Locate useful information quickly
Word sense disambiguation
http://msra-idss-04:8080/prototype1
Text Mining




New research area
Highly statistically based
TIME on internet
Improving Precision/Recall

Title Extraction


10% improvement in ranking
XP Help & Support (support.microsoft.com)

Aggregate TF from


Newsgroup
Support emails
Text Mining
Location finder

Entity location


The physical address of the entity (e.g. organization,
corporation or person) owning the web resource
Crucial for geographical web retrieval and navigation


Content location



Yellow Pages, map services
The location that the content of the web resource is about.
Crucial for location based search & services
Context location


The geographical scope that the web resource reaches.
Crucial for B2C applications like local advertisement and e-commerce.
Three Types of Page Locations
[Diagram: the web site of the entity (a set of pages, page1 … pagen) and other pages (pages1 … pagesm) that link to or access it, annotated with Entity Location (the entity), Content Location (example: Nevada), and Context Location]
Distribution of Geographical Keywords
Feature            Occurrence   Page (1053111)   Site (4430)
Zip                919170       232344 (22%)     3143 (71%)
Telephone          1139677      236516 (22%)     3191 (72%)
Geographical       80652212     822219 (78%)     4116 (93%)
Zip or Telephone   2058847      323587 (31%)     3440 (78%)
Any of three       82711059     835969 (79%)     4133 (93%)
Demo
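The slides do not say how the zip and telephone features were detected. As a rough sketch only, the patterns below assume US-style formats (5-digit ZIP with optional +4 suffix, 10-digit phone numbers); the regular expressions, names, and sample text are all hypothetical.

```python
import re

# Assumed US formats; real pages would need broader, locale-aware patterns.
ZIP_RE = re.compile(r"(?<!\d)\d{5}(?:-\d{4})?(?!\d)")
PHONE_RE = re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}(?!\d)")


def geo_features(text):
    """Collect zip-code and telephone-number strings found in a page's text."""
    return {
        "zip": ZIP_RE.findall(text),
        "telephone": PHONE_RE.findall(text),
    }


page = "Visit us at 123 Main St, Las Vegas, NV 89101 or call (702) 555-0199."
features = geo_features(page)
```

Counting how many pages or sites yield a non-empty list for each feature would give per-feature coverage figures of the kind tabulated above.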
Text Mining

AskMSR


Providing Answers inline instead of links to answers
USPS, UPC, Vehicle #s, Product IDs, Addresses, Stock
& financial #s, etc…
AskMSR


Leverage redundant web information
N-gram locator in results pages
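The redundancy idea can be sketched as n-gram voting over result snippets: an answer that many pages repeat collects many votes. This is a simplified illustration, not the AskMSR implementation; the stopword list, the one-vote-per-snippet rule, and the length weighting are assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "by", "to", "and"}


def mine_ngrams(snippets, question_words, max_n=3):
    """Score candidate answer n-grams by how many snippets contain them."""
    votes = Counter()
    for snippet in snippets:
        tokens = re.findall(r"[a-z0-9]+", snippet.lower())
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Skip n-grams containing stopwords or words from the question
                if any(t in STOPWORDS or t in question_words for t in gram):
                    continue
                seen.add(gram)
        votes.update(seen)  # each snippet votes at most once per n-gram
    # Assumed heuristic: weight by length so longer repeated n-grams win ties
    scores = {" ".join(g): count * len(g) for g, count in votes.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


snippets = [
    "The telephone was invented by Alexander Graham Bell in 1876.",
    "Alexander Graham Bell invented the telephone.",
    "Bell patented the telephone in 1876; Alexander Graham Bell was born in Edinburgh.",
]
candidates = mine_ngrams(snippets, {"who", "invented", "telephone"})
best = candidates[0][0]
```

For the question "Who invented the telephone?" the trigram repeated in all three snippets comes out on top.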
Semantic Mining

Beyond document retrieval



Hierarchical clustering -> Mining
From non-structure to structure



Entity Identification
Relation Discovery
Mining on relation graph




Web mining & knowledge discovery
Clustering Multi-typed Interrelated Objects
Ranking
Graph Evolving
Relation visualization

Graph Matching/Morphing/embedding
http://msra-idss-04:8080/prototype1/(r0l5ivbnvijh4y45d5nyewee)/clustermain.aspx
Structure Paper Search
Relevant Term Mining

Search Term Suggestion (STS)




Document term may not match with real queries
Cluster the query terms into semantic topics
Classifying document terms into semantic topics
Rank the suggested terms by the popularity
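http://msra-mm650-06/demo
Query Thesaurus: queries $(\mathrm{query}_1, \mathrm{query}_2, \ldots, \mathrm{query}_m)^T$, indexed $q_1, q_2, \ldots, q_m$
Query Log: matrix with rows $q_1, \ldots, q_m$ and columns $p_1, p_2, \ldots, p_n$
$$\begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ f_{m1} & f_{m2} & \cdots & f_{mn} \end{pmatrix}$$
Web-page: words $(\mathrm{word}_1, \mathrm{word}_2, \ldots, \mathrm{word}_n)^T$, pages $p_1, p_2, \ldots, p_n$
Hyperlink: matrix with rows and columns $p_1, p_2, \ldots, p_n$
$$\begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nn} \end{pmatrix}$$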
Media Search

Rely mostly on Text!


Surrounding text mining/extraction
Transcription from ASR



Audio/Video
AIME
Result presentation



Clustering/classification
Rely on text again!
Image and keyword co-occurrence matrix
Image Clustering
1710 JPG images in 1287 pages are crawled
within the website
Six Categories
Fish
Mammal
Bird
Amphibian
Reptile
Insect
Web Image Thesaurus
Basic Idea: Use abundant annotated images on the Web as training data
Term
coyote(21.3)
preprocessing
wolf(17.0)
mammal (16.0)
key
term
key region
extraction
coyote
WordNet
context
region
low-level codeword
correlation matrix
semantic-level codeword
Media Search
Cross-lingual Information Access
Query Translation
Ontology
Search
Web Page
Search Engine
Reading
Assistant
Chs. Doc
Translation
Engine
Query
Processing
Eng Docs
English Query
Chinese Query
Query
Translation
Translation
Engine
Reading Assistant
Cross-lingual Information Access

Important for non-English surfer



Access to English content
Using English content for ranking
Web-based Data Acquisition



Vast
Noisy
Parallel text
Cross-Lingual Information Retrieval
微软研究院 (Microsoft Research)
Cross-Lingual Reading Assistant
量子计算 (quantum computing)
平板电脑 (tablet PC)
Cross-Lingual Summarization
Title: 说话的计算机临近现实 (Talking computers nearing reality)
Author: Michael Kanellos
Time: 2003.7.9 (July 9, 2003)
……
Summary:
Microsoft on Wednesday released the first public beta of its Speech Server, which will let servers better handle oral commands.
星期三微软发布了它的第一个说话服务器，将让服务器更好处理口头命令。
Summary

Industry will continue



Build products using speech, NL, IR,…
Hiring people in speech, NL, IR
Require more software to drive market


quoted by Barry Lam of Quanta
We should all care about these
technologies
