Natural Language Processing

Michel Bruley
June 2013
www.decideo.fr/bruley
Natural Language Processing (NLP)

NLP is the branch of computer science focused on developing systems
that allow computers to communicate with people using everyday
language.

NLP is considered a subfield of artificial intelligence and has
significant overlap with the field of computational linguistics. It is
concerned with the interactions between computers and human (natural)
languages.
– Natural language generation systems convert information from
computer databases into readable human language.
– Natural language understanding systems convert human language
into representations that are easier for computer programs to
manipulate.
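
The two directions described above can be illustrated with a deliberately tiny sketch. It assumes a hypothetical flight record and a single fixed sentence template; real generation and understanding systems are far more flexible than one template and one regular expression.

```python
import re

# Hypothetical flight record, standing in for a database row.
record = {
    "flight": "AF123",
    "origin": "Paris",
    "destination": "Boston",
    "status": "delayed",
}

def generate(rec):
    """Natural language generation: structured record -> readable sentence."""
    return (
        f"Flight {rec['flight']} from {rec['origin']} "
        f"to {rec['destination']} is {rec['status']}."
    )

# One fixed pattern that mirrors the template above (an assumption of this sketch).
PATTERN = re.compile(
    r"Flight (?P<flight>\w+) from (?P<origin>\w+) "
    r"to (?P<destination>\w+) is (?P<status>[\w ]+)\."
)

def understand(sentence):
    """Natural language understanding: sentence -> structured record, or None."""
    match = PATTERN.fullmatch(sentence)
    return match.groupdict() if match else None

sentence = generate(record)
print(sentence)
print(understand(sentence))
```

The round trip only works because both functions share the same template; any sentence phrased differently returns `None`, which hints at why understanding free text is the harder direction.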

NLP encompasses both text and speech, but work on speech processing
has evolved into a separate field.
Where does it fit in the CS* taxonomy?
[Diagram: tree of computer science fields. Node labels: Computers,
Databases, Artificial Intelligence, Robotics, Information Retrieval,
Algorithms, Search, Natural Language Processing, Machine Translation,
Networking, Language Analysis, Semantics, Parsing.]
* CS = Computer Science
Why Natural Language Processing?
Applications for processing large amounts of text require NLP expertise:
• Classify text into categories, index and search large texts: classify documents
by topic, language, or author; spam filtering; information retrieval (relevant, not
relevant); sentiment classification (positive, negative)
• Extracting data from text: converting unstructured text into structured data
• Information extraction: discover names of people and events they participate in,
from a document, …
• Automatic summarization: condense 1 book into 1 page, …
• Speech processing, artificial voice: get flight information or book a hotel over
the phone, …
• Question answering: find answers to natural language questions in a text
collection or database
• Spelling & grammar correction
• Plagiarism detection
• Automatic translation
• Etc.
The problem
• When people see text, they understand its meaning (by and large)
    According to research, it deosn’t mttaer in what oredr the ltteers in a
    wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in
    the rghit pclae. The rset can be a toatl mses and you can sitll raed it
    wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf
    but the wrod as a wlohe.
• When computers see text, they get only character strings (and perhaps
HTML tags)
• We'd like computer agents to see meanings and be able to intelligently
process text
• These desires have led to many proposals for structured, semantically
marked up formats
• But often human beings still resolutely make use of text in human
languages
• This problem isn’t likely to just go away
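
A scrambled passage like the one on this slide can be produced with a few lines of code. The sketch below is an illustration only (the function names and the sample sentence are made up here): it shuffles the interior letters of each word and leaves the first and last letters in place. To a program, the scrambled and original versions are simply different character strings.

```python
import random
import re

def scramble_word(word, rng):
    """Shuffle the interior letters of a word; first and last stay in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle in very short words
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text, seed=0):
    """Scramble every alphabetic run; spaces and punctuation are untouched."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)

original = "When computers see text, they get only character strings"
scrambled = scramble_text(original)
print(scrambled)
print(scrambled == original)  # plain string comparison, as a computer sees it
```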
Example: Natural language understanding
Natural language understanding process – Prof. Carolina Ruiz
Raw speech signal
• Speech recognition
Sequence of words spoken
• Syntactic analysis using knowledge of the grammar
Structure of the sentence
• Semantic analysis using info. about meaning of words
Partial representation of meaning of sentence
• Pragmatic analysis using info. about context
Final representation of meaning of sentence
Example detail: Syntactic Analysis
• Syntactic analysis involves organizing the words and phrases of a
sentence into a hierarchical structure, so that its constituents can be
studied.
• For example the sentence “the big cat is drinking milk” can be broken
up into the following constituents:

The big cat is drinking milk
    Noun Phrase
        Determiner: The
        Adjective Phrase: big
        Noun: cat
    Verb Phrase
        Auxiliary: is
        Verb: drinking
        Noun Phrase: milk
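
As a rough illustration of how such a structure can be computed, here is a minimal hand-written parser for this one sentence pattern. The lexicon and the three grammar rules are assumptions made for this sketch and cover only the example; it also wraps “milk” as a Noun inside a Noun Phrase, which is slightly more detailed than the slide. Practical parsers rely on much larger grammars or on statistical models.

```python
# Toy lexicon: word -> category (hypothetical, covers only the example sentence).
LEXICON = {
    "the": "Determiner",
    "big": "Adjective Phrase",
    "cat": "Noun",
    "milk": "Noun",
    "is": "Auxiliary",
    "drinking": "Verb",
}

def parse_noun_phrase(words, i):
    """NP -> [Determiner] [Adjective Phrase] Noun. Returns (tree, next index)."""
    children = []
    for optional in ("Determiner", "Adjective Phrase"):
        if i < len(words) and LEXICON.get(words[i]) == optional:
            children.append((optional, words[i]))
            i += 1
    if i >= len(words) or LEXICON.get(words[i]) != "Noun":
        raise ValueError(f"expected a noun at position {i}")
    children.append(("Noun", words[i]))
    return ("Noun Phrase", children), i + 1

def parse_verb_phrase(words, i):
    """VP -> Auxiliary Verb NP. Returns (tree, next index)."""
    children = []
    for required in ("Auxiliary", "Verb"):
        if i >= len(words) or LEXICON.get(words[i]) != required:
            raise ValueError(f"expected {required} at position {i}")
        children.append((required, words[i]))
        i += 1
    noun_phrase, i = parse_noun_phrase(words, i)
    children.append(noun_phrase)
    return ("Verb Phrase", children), i

def parse_sentence(text):
    """S -> NP VP. Raises ValueError if the words do not fit the toy grammar."""
    words = text.lower().split()
    noun_phrase, i = parse_noun_phrase(words, 0)
    verb_phrase, i = parse_verb_phrase(words, i)
    if i != len(words):
        raise ValueError("unexpected words after the verb phrase")
    return ("Sentence", [noun_phrase, verb_phrase])

def show(tree, depth=0):
    """Print the tree with one level of indentation per level of nesting."""
    label, body = tree
    if isinstance(body, str):
        print("    " * depth + f"{label}: {body}")
    else:
        print("    " * depth + label)
        for child in body:
            show(child, depth + 1)

show(parse_sentence("the big cat is drinking milk"))
```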
Why NLP is difficult

Language is flexible
– New words, new meanings
– Different meanings in different contexts

Language is subtle
– He arrived at the lecture
– He chuckled at the lecture
– He chuckled his way through the lecture
– **He arrived his way through the lecture

Language is complex!
Why NLP is difficult

MANY hidden variables
– Knowledge about the world
– Knowledge about the context
– Knowledge about human communication techniques
• Can you tell me the time?

Problem of scale
– Many (infinite?) possible words, meanings, context

Problem of sparsity
– Very difficult to do statistical analysis, most things (words,
concepts) are never seen before

Long range correlations
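
The sparsity problem can be seen even on a toy scale. In the sketch below (the two short texts are invented for illustration), half of the distinct training words occur only once, and several words in the new sentence were never seen in training at all, so counts alone say nothing about them.

```python
from collections import Counter

def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def rare_and_unseen(train_text, new_text):
    """Return (words seen exactly once in training, new words never seen)."""
    counts = Counter(tokenize(train_text))
    seen_once = [word for word, n in counts.items() if n == 1]
    unseen = [word for word in tokenize(new_text) if word not in counts]
    return seen_once, unseen

# Hypothetical miniature "corpus" and a new sentence to analyze.
train = "the cat is drinking milk . the dog is chasing the cat ."
new = "the big cat is chasing a mouse ."

seen_once, unseen = rare_and_unseen(train, new)
print("seen only once in training:", seen_once)
print("never seen in training:", unseen)
```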
Why NLP is difficult

Key problems:
– Representation of meaning
– Language presupposes knowledge about the world
– Language only reflects the surface of meaning
– Language presupposes communication between people
Patented Natural Language Processing (NLP) “Reads” Every Communication
• Each data feed is parsed through one or more of the 7 NLP engines
• …it is then deconstructed to provide context, subject, and other
information regarding the customer (gender, name, etc.)
• Finally, each identified customer is matched back to the Discovery
platform data to gain a full view

Natural language processing (NLP) is the study of the
interactions between computers and natural languages
(e.g., English, Polish). The crucial challenge that NLP
addresses is deriving meaning from human or natural
language input and allowing consumers to analyze
parsed meanings in large volumes.
For Example…
I bought an iPad2 for my mom last week. She loves the weight, but doesn’t like the color. She
wishes it came in blue. She says if it came in blue, then she’d buy one for all her friends.

• Entities (brands, people, locations, times, products…)
• Events and relationships (purchasing event, my mom…)
• Sentiment (product specifications)
• Suggestions (feature specifications)
• Intent (to purchase, to leave)
• Geo/Temporal

QUESTION: Why is this a big deal?
NLP takes a simple English statement, parses it into the categories above (and more
categories) and VOILA… we get STRUCTURED DATA
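
To make the idea of turning such a statement into structured data concrete, here is a toy sketch. It is not how the NLP engines described in this deck work: every pattern and the one-item product list below are hand-written assumptions that fit only this example (typed with straight apostrophes), whereas a real engine applies general linguistic analysis.

```python
import re

TEXT = (
    "I bought an iPad2 for my mom last week. She loves the weight, "
    "but doesn't like the color. She wishes it came in blue. "
    "She says if it came in blue, then she'd buy one for all her friends"
)

# Hypothetical, hand-written resources that cover only this example.
PRODUCTS = ["iPad2"]
PURCHASE = re.compile(r"\bbought an? (\w+) for (my \w+)")
TIME = re.compile(r"\b(last week|yesterday|today)\b")
POSITIVE = re.compile(r"\bloves the (\w+)")
NEGATIVE = re.compile(r"\bdoesn't like the (\w+)")
SUGGESTION = re.compile(r"\bwishes it (came in \w+)")
INTENT = re.compile(r"\bif (.+?), then she'd (\w+)")

def extract(text):
    """Map free text to a small structured record using the patterns above."""
    record = {
        "products": [p for p in PRODUCTS if p in text],
        "times": TIME.findall(text),
        "events": [],
        "sentiment": {},
        "suggestions": SUGGESTION.findall(text),
        "intent": [],
    }
    for m in PURCHASE.finditer(text):
        record["events"].append(
            {"type": "purchase", "product": m.group(1), "recipient": m.group(2)}
        )
    for feature in POSITIVE.findall(text):
        record["sentiment"][feature] = "positive"
    for feature in NEGATIVE.findall(text):
        record["sentiment"][feature] = "negative"
    for m in INTENT.finditer(text):
        record["intent"].append({"action": m.group(2), "condition": m.group(1)})
    return record

for field, value in extract(TEXT).items():
    print(f"{field}: {value}")
```

Each key of the resulting dictionary corresponds loosely to one of the categories listed above, and could be stored as a column or a related table.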
Architecture
[Diagram: data flow around the Aster Discovery Platform. Labels shown:
– Attensity Pipeline: real-time annotated social media data feed,
150+ million social and online sources
– Other Unstructured Data: Emails; Surveys; CRM Notes…
– Customers / Sales / Other data
– Aster Pipeline Connector; ETL
– ASAS Wrapper / SQL MR; Churn Score / SQL MR; NLP; Predictive
– “Nowstructured” data
– Visualization (e.g., Tableau, MSTR)]
Aster + Attensity = Competitive Advantage

• This integration provides types, subtypes, and super types (“Savings”, “Checking”,
“Investment”)
• Inclusion of anaphora: connecting references such as “He” or “Him” back to a
subject (George Harrison) without the full name being repeated
• Includes other languages besides English
• Attensity’s Semantic Annotation Server (ASAS) capabilities
– Entity Extraction: automatic detection and extraction of more than 35 entities such as Name,
Place
– Uses Attensity Triples to create context on entities and identify verbs, relationships, actions
– Auto Classification: uses custom classification rules to classify articles by content, sort by
relevance, and discover repeated information
– Exhaustive Extraction: application of linguistic principles to extract context, entities, and
relationships, similar to how the human mind would
– Voice Tags: identify types of statements and auto-classify them (Question, Intent,
Conditional)
• Creates a unique identifier for each entity for cross-reference
Structuring Unstructured Data: Process Flow
The flight was delayed and flight attendant would not give us
any new information.
How Triples are Extracted & Structured
Database Record from a Customer Survey

date      region  source     rec?  Why would you recommend/not recommend?
10-02-06  0006    telephone  4     The flight was delayed and flight attendant would
                                   not give us any new information.

Extract: extract relational facts & Triples from Notes field

Who/What  Behavior  Fact/Triple
flight    delay     flight : delay

Then Fuse: populate new table with attribute values and fuse with
structured data.

Same Record with Relational Facts Extracted from Notes Field
New Table: Customer Reactions

Original Structured Data            Newly Structured Data
date     region  source     rec?    who-what     Behavior     Fact/Triple
10-2-12  0006    telephone  4       flight       delay        flight : delay
10-2-12  0006    telephone  4       information  give [not]   information : give [not]
1-1-13   0007    e-mail     8       i            happy [not]  i : happy [not]
1-1-13   0007    e-mail     8       rep          rude         rep : rude
1-1-13   0007    e-mail     8       flight       cancel       flight : cancel
Provided by Attensity
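
The "fuse" step can be sketched in a few lines. This is only an illustration under stated assumptions: the class and function names are invented here, and the extraction of facts from the notes field is not implemented, so the two facts are simply typed in as they appear on the slide.

```python
from dataclasses import dataclass

@dataclass
class SurveyRecord:
    """One row of the original survey table (field names are assumptions)."""
    date: str
    region: str
    source: str
    rec: int
    notes: str

def fuse(record, facts):
    """Return one row per (who_what, behavior) fact, joined to the structured fields."""
    return [
        {
            "date": record.date,
            "region": record.region,
            "source": record.source,
            "rec": record.rec,
            "who_what": who_what,
            "behavior": behavior,
            "triple": f"{who_what} : {behavior}",
        }
        for who_what, behavior in facts
    ]

record = SurveyRecord(
    date="10-02-06",
    region="0006",
    source="telephone",
    rec=4,
    notes="The flight was delayed and flight attendant would "
          "not give us any new information.",
)

# Facts copied from the slide; a real pipeline would extract them from record.notes.
facts = [("flight", "delay"), ("information", "give [not]")]

for row in fuse(record, facts):
    print(row)
```

Because every output row repeats the original structured fields, the new table can be queried together with existing customer data, for example counting "flight : delay" facts per region.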
Team Power