Transcript Slide 1
Why should I care about
Computational Linguistics &
Language Processing?
Hsiao-Wuen Hon
洪小文
Assistant Managing Director
Microsoft Research Asia
Should I care?
Medical school
Golden rice bowl (a secure, well-paid job)
Electronics
Stock allotments (employee stock bonuses)
Chip manufacture
TSMC, UMC
Hardware
Easy way to become a millionaire
Acer, Quanta, 鸿海, BenQ, 英业达, MiTac
NLP? Speech? IR? HWR?
It is actually a good choice
People go on to have good careers
Many applications
IR, HWR
Investment banks
Bioinformatics
…..
With many smart people
Software Industry cares
Not overproducing students
Industry Cares
People you might know
Academics
Pillars of A.I.
Well funded
Taiwan professors
Overseas professors
V. Zue, B.H. Juang, F. Jelinek, M. Liberman, N. Chomsky,
Michael Collins, Fernando Pereira …
Industry Cares
Industrial R&D Labs
Executives
Microsoft
X.D. Huang,洪小文,马维英, Eric Chang, 周明, Eric
Brill, Ken Church, …
Continue hiring
Google
Kai-Fu Lee (MS), Qi Lu (Yahoo), …
Speech - Amit Singhal, Michael Riley, … etc.,
NL – Franz Och, Krishna Bharat, Dekang Lin, …
Aggressively hiring
Others…
Industry Cares
Other applications
Renaissance Technologies
Hedge fund management – 4 billion in assets
Time-series prediction based on S&L technologies
a.k.a. the ex-IBM S&L group
P. Brown, R. Mercer, P. De Souza, L. Bahl, Della Pietra
brothers, …
Startups
Nuance, SpeechWorks, InfoTalk, iPhrase,
Lexicus, …
Microsoft Cares
Bill Gates’ vision
PC on everyone’s desktop (’75)
Information at your fingertips (’90)
Seamless Computing (’03)
S&L technologies are the key
Billions of $ investment in S&L technologies
Full-size S&L product & research groups
Multi-lingual & multi-products
Continue hiring
Expanded investment due to search/Google
Information Agent
“Do what I mean”
“Find what I want”
How to turn on Firewall in Windows?
Speech recognition
Natural language understanding
Signal to text
Syntax/semantics
Domain knowledge
Knowledge search
AI-Complete
A Long Long Journey
Speech
Ubiquitous interface
Automatic Speech Recognition
Text-to-Speech
Natural Language
Spelling/grammar/style checking
IME
Machine translation
Information Retrieval & Mining
Speech
SAPI 1.0 – 6.0
Office Dictation
Chinese, English
Microsoft Speech Server
Windows Sound System in ’92
Platform for building speech app. in
Windows
Accessibility support (Screen Reader)
Telephony speech & multimodal platform
Other – Encarta, WinCE/Smartphone…
Speech
[Chart: human error rate, machine error rate, and log(machine error rate) on a 0% to 30% axis, plotted for 1993 to 2011 in three-year steps]
MSRA Speech
TTS – multi-lingual natural TTS
Elan Speech
Mulan
Chinese LVCSR - dictation/telephony/embedded
Fundamental research
AIME: Audio Info. Management & Extraction
AT&T
ASR
Loquendo
Audio/video file indexing/retrieval
Offline transcription/extraction/summarization
More in Eric’s keynote tomorrow
From the Lab to Ubiquity: Speech Technology's Road to Mainstream
NLP Contributes to MS Products
IME (Chinese, Japanese, …)
Spelling/grammar checking
Spam filtering
English Writing Wizard (EWW)
Spoken language interface
IR and CLIR
Text mining
Machine translation
Search engine
QA (AskMSR)
SLM for Speech
Text analysis for TTS
…..
NLP “Rainbow”
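[Diagram: analysis climbs from the source text through word breaking, morphology, dictionary, syntax, logical form, and discourse to understanding (knowledge base); generation descends through discourse, logical form, syntax, dictionary, and morphology to the target text; transfer connects the two sides. Grammar checking and machine translation are marked as applications.]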
NLP at MSRA
Applications
Chinese IME
English writing wizard
Enterprise search
Japanese IME
Pocket translator
SQL Text Mining
Spelling check
Extended TM
Resume Routing
NLP
Machine Translation
Information Extraction
Information Retrieval
Meta data extraction
Skeleton parser
Research
Translation evaluation
paraphrasing
Term extraction
Named entity identification
Translation knowledge acquisition
Shallow MT
Annotation tool
EBMT & SMT
Machine learning
POS tagging
SLM
Linguistic Resources
Monolingual resources (C, J, E)
Bilingual resources (C, E)
MRD
MRD
Parsing lexicon
Web retrieval
Indexing
Special purpose
Bilingual corpus
Balanced corpus
Tagged corpus
Cross language IR
QMapping
Translation
Bilingual tagged
lexicon
corpus
Resume routing
NLP at MSRA
TIME
Email Routing
Spam filtering
Resume routing
Support routing
EWW
Translation
TIME Platform
Text Information Management & Extraction
Goal: extract information from text data
genres: email, newspaper, report, web pages
formats: Word document, PDF/PS, HTML/XML
languages: English, Chinese, Japanese, …
Applications: search, question answering, data
mining, machine translation
TIME
System
TIME Components
Linguistic processing TIME linguistic platform
Information extraction
Text normalization: sentence splitting, tokenization,
morphological analysis
Entity extraction: person name, company name, time
expression, phrases
Relation learning: syntactic/semantic dependencies between
entities
Document property extraction: title, author, key term,
summary
Domain knowledge extraction: concept, concept relation,
glossary, taxonomy, event
Cross-lingual information exchange
Translation at word, entity, term, skeleton, text levels
Reading, writing, cross language information retrieval
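The text normalization step listed above (sentence splitting and tokenization) can be sketched roughly as follows. This is a minimal illustration, not the TIME implementation: the function names and the regular expressions are assumptions, and the splitter is a naive heuristic that would mis-split abbreviations such as "Jr.".

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Microsoft released a beta. It handles speech!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Morphological analysis and entity extraction would run on the token lists produced here; languages without spaces between words, such as Chinese, need a real word breaker instead of this regular expression.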
TIME Demo
Multi-lingual linguistic unit processing
Word
Sentence
Tokenization
Named entity recognition (NER)
POS
Chunking (VP/NP)
Source-channel models:
TIME (linguistic unit processing)
Chinese Tokenization & NEI
English Chunking and POS Tagging
Skeleton Parser
Skeleton == <subject V object>
Input: He is succeeded by Ivan Allen Jr.
Output (brackets labeled Sub and Obj):
[He] is succeeded by [Ivan Allen Jr.]
More robust & faster than a traditional parser
Adequate for most applications
Collocation checking, Spell checking, Grammar
checking, QA, Search
Skeleton Parser
Key Dependency Relations
A set of most important relations (e.g. subject, object…)
Definition based on application
Our Target: A Robust & Fast Dependency Extractor
Does not rely on high-quality (hand-annotated) training data.
Highly efficient on large-scale data (e.g. web data)
Potential Applications
Information Extraction, Q/A, TDT
Who (Subject-Verb), Whom (Verb-object), What (Adj-Noun)
Machine translation
Skeleton translation
NL-based Information Retrieval
Cross-Language IR
Re-ranking by triple matching
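Re-ranking by triple matching, the last application above, might look like the sketch below. It assumes each retrieved document already carries a base retrieval score and a set of (head, relation, dependent) triples produced by some dependency extractor; the scoring rule (number of shared triples first, base score as tie-breaker) and all names are illustrative assumptions, not the method used in the talk.

```python
def rerank_by_triples(query_triples, docs):
    """Order documents by number of triples shared with the query, then base score.

    docs is a list of (doc_id, base_score, triples) tuples, where triples is a
    set of (head, relation, dependent) tuples.
    """
    def sort_key(doc):
        _doc_id, base_score, triples = doc
        return (len(query_triples & triples), base_score)

    return [doc[0] for doc in sorted(docs, key=sort_key, reverse=True)]

# Hypothetical data: d1 has the same words as the query but the roles reversed.
query = {("microsoft", "release", "server")}
docs = [
    ("d1", 0.9, {("server", "release", "microsoft")}),
    ("d2", 0.7, {("microsoft", "release", "server"), ("server", "handle", "command")}),
    ("d3", 0.8, set()),
]
ranking = rerank_by_triples(query, docs)
```

The example shows why triples can help over bag-of-words matching: d1 contains the same words as the query, but only d2 has them in the same relation.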
Proposed approach
Raw corpus
NLPWin Parser
Input Sentence
PoS Tagging
Parsed corpus
Chunking
Noise Filtering
Shallow Parser
Training Data
Training
Key Dependency Triples
The proposed approach
Raw corpus
Input Sentence
NLPWin Parser
PoS Tagging
Parsed corpus
Chunking
Noise Filtering
Feature Extraction
Training Data
Training
Classification
Key Dependency Triples
Skeleton Parser
Term Extraction
Text
Candidate
Generation
Term List
Options:
Ranking
Options:
Boundary determination
Term frequency Terms
BaseNP
TF-IDF
Pattern filtering
Entropy reduction
ER-IDF
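Of the ranking options named on this slide, TF-IDF is the most standard, and a minimal sketch follows. It assumes candidate terms have already been generated and that each document is a list of candidates; single words stand in for what would normally be multiword BaseNP candidates. The function name and the tie-breaking rule are assumptions.

```python
import math
from collections import Counter

def rank_terms(docs, target_index):
    """Rank the terms of docs[target_index] by tf * log(N / df), highest first."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    tf = Counter(docs[target_index])
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))

docs = [
    ["speech", "server", "speech", "the"],
    ["the", "parser"],
    ["the", "server"],
]
ranked = rank_terms(docs, 0)
```

A term that appears in every document ("the") gets a score of zero, which is the behavior that makes TF-IDF useful for separating domain terms from common words.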
Term Extraction
Text Mining Roadmap
Information Desk
Meta Data for
Sharepoint
SQL Text Mining
Text Miner
Key technologies
Metadata extraction
Ranking algorithm
Multi-languages support
Information Desk
http://msra-nlc-tm1
Machine Translation Roadmap
Direction
Template based
Linguistic data acquisition from
Web mining
TIME
Search Engine
Office EWW
Mobility
Key technologies
Skeleton parser
Collocation checker
Paraphrase
Knowledge acquisition
Adaptive to new language
pairs
EWW (English Writing Wizard)
Objectives
Make your English writing as good as a native speaker's
Features
Idiomatic usages
Synonymous collocation
Collocation translations
Bilingual example sentences
Technology Highlights
Idiomatic Usage
Input: question
question (Noun)
Verb+question: raise ~, ask ~, resolve ~, pose ~
Adj+question: unanswered ~, serious ~, big ~, real ~
question (Verb)
question+Noun: ~ motive, ~ value, ~ truth, ~ boy
question+Adv: ~ intensely, ~ orally, ~ closely, ~ at_all
Adv+question: privately ~, cautiously ~, hardly ~
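A lookup like "Verb+question" above can be approximated by counting dependency triples harvested from a parsed corpus. The sketch below is an assumption about how such a table could be built, not a description of EWW itself; the triple format (head, relation, dependent) and the toy data are hypothetical, with the `dobj` relation name borrowed from the collocation notation used on these slides.

```python
from collections import Counter

def top_heads(triples, dependent, relation, k=3):
    """Return the k most frequent heads that govern `dependent` via `relation`."""
    counts = Counter(
        head for head, rel, dep in triples if rel == relation and dep == dependent
    )
    return [head for head, _count in counts.most_common(k)]

# Toy corpus of (verb, relation, noun) triples.
triples = [
    ("raise", "dobj", "question"),
    ("raise", "dobj", "question"),
    ("raise", "dobj", "question"),
    ("ask", "dobj", "question"),
    ("ask", "dobj", "question"),
    ("pose", "dobj", "question"),
    ("raise", "dobj", "level"),
]
verbs = top_heads(triples, "question", "dobj")
```

Raw frequency is the simplest possible ranking; an association measure such as mutual information would be needed to keep very common verbs from dominating every list.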
Synonymous Collocation
attain~dobj~level achieve~dobj~level
attract~dobj~fan draw~dobj~fan
take~dobj~reins assume~dobj~reins|hold~dobj~reins
bad~Intnsifs~extremely risky~Intnsifs~extremely
unusual~Intnsifs~quite unusual~Intnsifs~rather
vision~Attrib~unusual sight~Attrib~unusual
Improve~Mod~greatly Improve~Mod~considerably
Auto extraction of idiomatic usage
Auto extraction of synonymous collocation
Auto extraction of collocation translations
Example sentence retrieval
Collocation Translation
克服~困难 (overcome~difficulty)
conquer~difficulty, overcome~difficulty, master~difficulty,
overcome~adversity, surmount~difficulty
Web Search & Mining
Internet + Data + Information ->
Search, Mining, Sharing, & Intelligence
Lots of text
Text-based IR
Text Mining
Semantic/Structure Mining
Media Search
Surrounding text
Audio/video transcription
Make Billions of $ from trillions of words
Information Retrieval
Text Processing
Tokenization
Normalization – stemming, …
Precision/Recall
Beyond 1st order statistics (TF-IDF)
Better model of P(Doc|Query)
Classification vs. term frequency
Result Summarization
Query sensitive
N-gram for adaptive indexing
U盘 (优盘) vs. 大拇哥 (two Chinese names for a USB flash drive)
Result clustering & classification
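Precision and recall, the evaluation measures named on this slide, reduce to two ratios over the retrieved and relevant document sets. The sketch below uses hypothetical document IDs and treats an empty set as a score of 0.0, which is a convention chosen here rather than anything stated in the talk.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents returned, three documents actually relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).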
Search Long Result List
A user searches for information about “jaguar”, a Mac OS release
However, the relevant results are mixed with other pages
The user needs to go through a long list to find the desired information
Clustering vs. Classification
Clustering Results for “jaguar”
Classification Results for “jaguar”
Document Clustering & Sub-topic Identification
Search Result Grouping
Overview of the returned documents
Locate useful information quickly
Word sense disambiguation
http://msra-idss-04:8080/prototype1
Text Mining
New research area
Heavily statistics-based
TIME on internet
Improving Precision/Recall
Title Extraction
10% improvement in ranking
XP Help & Support (support.microsoft.com)
Aggregate TF from
Newsgroup
Support emails
Text Mining
Location finder
Entity location
The physical address of the entity (e.g. organization,
corporation or person) that owns the web resource
Crucial for geographical web retrieval and navigation
Content location
Yellow Pages, map services
The location that the content of the web resource is about.
Crucial for location based search & services
Context location
The geographical scope that the web resource reaches.
Crucial for B2C applications like local advertising and e-commerce.
Three Types of Page Locations
[Diagram: the web site of the entity (page1 … pagen) shown with its three locations: the entity location (where the entity is), the content location (example: Nevada), and the context location (the pages that link to or access the site).]
Distribution of Geographical Keywords
Feature            Occurrence   Page (1053111)   Site (4430)
Zip                919170       232344 (22%)     3143 (71%)
Telephone          1139677      236516 (22%)     3191 (72%)
Geographical       80652212     822219 (78%)     4116 (93%)
Zip or Telephone   2058847      323587 (31%)     3440 (78%)
Any of three       82711059     835969 (79%)     4133 (93%)
Demo
Text Mining
AskMSR
Providing Answers inline instead of links to answers
USPS, UPC, Vehicle #s, Product IDs, Addresses, Stock
& financial #s, etc…
AskMSR
Leverage redundant web information
N-gram locator in results pages
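The idea of leveraging redundancy can be illustrated by counting how many result snippets contain each n-gram: strings that recur across many independent pages are likely answer candidates. This sketch covers only the counting step, under the assumption that snippets have already been fetched; the query rewriting, filtering, and answer assembly a full system needs are left out, and the function name and snippets are made up.

```python
from collections import Counter

def mine_ngrams(snippets, max_n=2):
    """Count, for each n-gram up to max_n words, how many snippets contain it."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)  # each n-gram counts once per snippet
    return counts

# Hypothetical snippets returned for "What is the capital of France?"
snippets = [
    "the capital of France is Paris",
    "Paris is the capital city",
    "visit Paris in spring",
]
counts = mine_ngrams(snippets)
```

Counting each n-gram once per snippet, rather than once per occurrence, keeps a single repetitive page from outvoting several independent ones.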
Semantic Mining
Beyond document retrieval
Hierarchical clustering -> Mining
From unstructured to structured
Entity Identification
Relation Discovery
Mining on relation graph
Web mining & knowledge discovery
Clustering Multi-typed Interrelated Objects
Ranking
Graph Evolving
Relation visualization
Graph Matching/Morphing/embedding
http://msra-idss-04:8080/prototype1/(r0l5ivbnvijh4y45d5nyewee)/clustermain.aspx
Structure Paper Search
Relevant Term Mining
Search Term Suggestion (STS)
Document terms may not match real queries
Cluster the query terms into semantic topics
Classify document terms into semantic topics
Rank the suggested terms by popularity
http://msra-mm650-06/demo
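[Diagram: matrices linking queries (query 1 … query m), web pages (p1 … pn), and words (word 1 … word n): a query-by-page matrix with entries f11 … fmn built from the query log, a page-by-page matrix with entries f11 … fnn built from hyperlinks, and a query thesaurus over q1 … qm.]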
Media Search
Rely mostly on Text!
Surrounding text mining/extraction
Transcription from ASR
Audio/Video
AIME
Result presentation
Clustering/classification
Rely on text again!
Image and keyword co-occurrence matrix
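An image and keyword co-occurrence matrix of the kind mentioned above can be built directly from the text surrounding each image. The sketch below assumes the surrounding-text keywords have already been extracted; the file names and keywords are hypothetical, and a real system would likely weight the counts instead of using them raw.

```python
def cooccurrence(image_keywords):
    """Build an image-by-keyword count matrix.

    image_keywords maps an image id to the list of keywords found in the
    text surrounding that image. Rows follow sorted image ids and columns
    follow the sorted vocabulary.
    """
    vocab = sorted({w for words in image_keywords.values() for w in words})
    images = sorted(image_keywords)
    matrix = [[image_keywords[img].count(w) for w in vocab] for img in images]
    return images, vocab, matrix

images, vocab, matrix = cooccurrence({
    "a.jpg": ["fish", "reef", "fish"],
    "b.jpg": ["bird", "reef"],
})
```

Each row is then a text-derived feature vector for one image, which is what lets image clustering rely on text again.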
Image Clustering
1710 JPG images in 1287 pages were crawled
from the website
Six Categories
Fish
Mammal
Bird
Amphibian
Reptile
Insect
Web Image Thesaurus
Basic Idea: Use abundant annotated images on the Web as training data
[Diagram labels: term preprocessing; key term and key region extraction; WordNet; context; region; low-level codeword; correlation matrix; semantic-level codeword. Example terms with scores: coyote (21.3), wolf (17.0), mammal (16.0).]
Media Search
Cross-lingual Information Access
[Diagram: a Chinese query passes through query processing and query translation (translation engine, ontology) to become an English query; the search engine retrieves English documents and web pages; the reading assistant, backed by the translation engine, presents them to the reader as Chinese documents.]
Cross-lingual Information Access
Important for non-English-speaking web surfers
Access to English content
Using English content for ranking
Web-based Data Acquisition
Vast
Noisy
Parallel text
Cross-Lingual Information Retrieval
微软研究院 (Microsoft Research)
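Query translation for a query like the one above can be sketched as a greedy longest-match lookup in a bilingual dictionary. Everything here is an assumption made for illustration: the tiny dictionary is hypothetical, and a real system would add word breaking, translation disambiguation, and handling of out-of-vocabulary terms.

```python
def translate_query(query, dictionary):
    """Translate by greedy longest match; unknown characters pass through unchanged."""
    out, i = [], 0
    max_len = max(map(len, dictionary), default=1)
    while i < len(query):
        # Try the longest dictionary entry that could start at position i.
        for n in range(min(max_len, len(query) - i), 0, -1):
            piece = query[i:i + n]
            if piece in dictionary:
                out.append(dictionary[piece])
                i += n
                break
        else:
            out.append(query[i])
            i += 1
    return " ".join(out)

# Hypothetical bilingual dictionary entries.
DICTIONARY = {
    "微软": "Microsoft",
    "研究院": "research institute",
    "微软研究院": "Microsoft Research",
}
english = translate_query("微软研究院", DICTIONARY)
```

Preferring the longest match is what keeps the organization name from being translated piece by piece as "Microsoft research institute".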
Cross-Lingual Reading Assistant
量子计算 (quantum computing)
平板电脑 (tablet PC)
Cross-Lingual Summarization
Title: Talking computers nearing reality (说话的计算机临近现实)
Author: Michael Kanellos
Time: July 9, 2003 (2003.7.9)
……
Summary: Microsoft on Wednesday released the first public beta of its
Speech Server, which will let servers better handle oral commands.
(星期三微软发布了它的第一个说话服务器,将让服务器更好处理口头命令。)
Summary
Industry will continue
Building products using speech, NL, IR, …
Hiring people in speech, NL, IR
More software is required to drive the market
quoted by Barry Lam of Quanta
We should all care about these
technologies