Intelligent Interactions with Search Results
Getting Beyond Those Blue Results Lists
(or Smart Text)
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Agenda
Introduction: Search and Structure: Smart Text
– Smart Text – foundation of text analytics
Adding Structure to Unstructured Text
– Dynamic Sections and more, Better Relevancy Calculations
– Complex Document Summaries, Deeper Personalization
Case Study
– Publishing: Processing 700K Proposals
Beyond Search – Building on the Foundation
Conclusions
Introduction: KAPS Group
Knowledge Architecture Professional Services – Network of Consultants
Applied Theory – Faceted & emotion taxonomies, natural categories
Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics, Social Media development, consulting
– Text Analytics Quick Start – Audit, Evaluation, Pilot
Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST,
Concept Searching, Attensity, Clarabridge, Lexalytics
Clients: Genentech, Novartis, Northwestern Mutual Life, Financial
Times, Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, World Bank, Dept. of Transportation, etc.
Program Chair – Text Analytics World – March 29-April 1 - SF
Presentations, Articles, White Papers – www.kapsgroup.com
Current – Book – Text Analytics: How to Conquer Information Overload,
Get Real Value from Social Media, and Add Smart Text to Big Data
Introduction:
Elements of Smart Text - Text Analytics
Text Mining – NLP, statistical, predictive, machine learning
Extraction – entities – known and unknown, concepts, events
– Disambiguation - Ford
Fact Extraction - ontology, relationships of entities
Sentiment Analysis - Positive, Negative – products, companies
Auto-categorization
– Training sets, Terms
– Rules – simple – position in text (Title, body, URL)
– Boolean – Full search syntax – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
– Based on taxonomy/ontology
Enterprise Text Analytics
Search is still #1 = 30-50% of applications
New Standard Search – facets (more and more metadata), autocategorization built on taxonomies, clustering
Trend = Text Analytics/Search as Semantic Infrastructure
– Platform for Info Apps (Search-based applications)
SharePoint – Major focus of TA companies – fix problems with
taxonomy/folksonomy
– Hybrid workflow – Publish document -> TA analysis -> suggestions
for categorization, entities, metadata -> present to author
External information = more automation, extraction – precision
more important
Enterprise Text Analytics
Adding Structure to Unstructured Content
Beyond Documents – categorization by corpus, by page, sections
or even sentence or phrase
Documents are not unstructured – variety of structures
Text indicators to define sections of the document
– Objectives, Abstract, Purpose, Aim – all the “same” section
– Sections – Specific - “Abstract” to Function “Evidence”
– Start of section is easy – where does it end?
Experiment – clusters / vocabulary to define section
– Textual complexity, level of generality
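A minimal Python sketch of the text-indicator idea above. The heading variants and canonical labels in SECTION_INDICATORS are illustrative assumptions, not the vocabulary used in any project described here. A section is assumed to end where the next recognized heading begins, which is only the simplest answer to "where does it end?".

```python
import re

# Hypothetical mapping: canonical section label -> heading variants.
SECTION_INDICATORS = {
    "objective": ["objectives", "abstract", "purpose", "aim"],
    "methods": ["methods", "methodology", "approach"],
    "evidence": ["results", "findings", "evidence"],
}

# Reverse lookup: lowercase heading text -> canonical label.
HEADING_TO_SECTION = {
    variant: label
    for label, variants in SECTION_INDICATORS.items()
    for variant in variants
}

# A heading is a line holding only a known indicator, optionally ending in ":".
HEADING_RE = re.compile(
    r"^\s*(" + "|".join(map(re.escape, HEADING_TO_SECTION)) + r")\s*:?\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Return (canonical_label, body) pairs in document order.

    Assumption: a section runs until the next recognized heading.
    Text before the first heading is labelled "front_matter".
    """
    sections = []
    label, body = "front_matter", []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the section collected so far, then start a new one.
            if body or label != "front_matter":
                sections.append((label, "\n".join(body).strip()))
            label = HEADING_TO_SECTION[match.group(1).lower()]
            body = []
        else:
            body.append(line)
    sections.append((label, "\n".join(body).strip()))
    return sections
```

With this mapping, "Purpose", "Aim" and "Abstract" headings all produce the same "objective" section, so later rules can target the function of a section rather than its literal title.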
Enterprise Text Analytics
Categorization and Beyond
Need to develop flexible categorization and taxonomy – tweets to
200 page PDF
Rules or sample documents?
– Need more precision and granularity than documents can do
– Training sets – not as easy as thought
– Applications require sophisticated rules, not just
categorization by similarity
Separate logic of the rules from the text
– Stable rules, changing text
Scores – relevancy with thresholds
– Not just frequency of words
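One way to read "scores with thresholds, not just frequency" is sketched below in Python. The field weights, the threshold, and the category term lists are all assumptions made for illustration. A term counts once per field where it appears, so repeating a word does not inflate the score.

```python
import re

# Assumed weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "url": 2.0, "body": 1.0}


def field_words(text):
    """Set of lowercase word tokens in a field (presence, not frequency)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def category_score(doc, terms):
    """Sum the field weight once for each category term present in each field."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = field_words(doc.get(field, ""))
        score += weight * sum(1 for term in terms if term in words)
    return score


def categorize(doc, categories, threshold=3.0):
    """Return only the categories whose score reaches the threshold."""
    scores = {name: category_score(doc, terms) for name, terms in categories.items()}
    return {name: score for name, score in scores.items() if score >= threshold}
```

The rule logic (weights, threshold, term lists) stays separate from the documents it is applied to, which matches the "stable rules, changing text" point above.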
Enterprise Text Analytics
Document Type Rules
(START_2000, (AND, (OR, _/article:"[Abstract]",
_/article:"[Methods]"), (OR, _/article:"clinical trial*",
_/article:"humans",
(NOT, (DIST_5, (OR,_/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),
If the article has sections like Abstract or Methods
AND has phrases around “clinical trials / Humans” and not words
like “animals” within 5 words of “clinical trial” words – count it and
add up a relevancy score
Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
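The plain-language reading of the rule above can be sketched in Python. This is an interpretation, not the rule engine's actual semantics: START_2000 is assumed to mean the first 2000 word tokens, distance is measured between token start positions, and each mention is checked against the excluded words individually so that qualifying mentions add up to a score.

```python
import re

# Words that disqualify a nearby mention, taken from the NOT clause of the rule.
EXCLUDED = ["approved", "safe", "use", "animals"]


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase):
    """Token indexes where `phrase` starts; a trailing * means prefix match."""
    words = phrase.lower().split()
    hits = []
    for i in range(len(tokens) - len(words) + 1):
        window = tokens[i:i + len(words)]
        if all(
            tok.startswith(w[:-1]) if w.endswith("*") else tok == w
            for tok, w in zip(window, words)
        ):
            hits.append(i)
    return hits


def clinical_trial_score(section_names, text, start=2000, dist=5):
    """Count human clinical-trial mentions that have no excluded word nearby."""
    # (OR, "[Abstract]", "[Methods]"): the article needs one of these sections.
    if not {"abstract", "methods"} & {name.lower() for name in section_names}:
        return 0
    tokens = tokenize(text)[:start]  # assumed meaning of START_2000
    mentions = phrase_positions(tokens, "clinical trial*")
    mentions += phrase_positions(tokens, "humans")
    excluded = [p for word in EXCLUDED for p in phrase_positions(tokens, word)]
    # (NOT, (DIST_5, ...)): drop mentions within `dist` tokens of an excluded word.
    kept = [m for m in mentions if not any(abs(m - e) <= dist for e in excluded)]
    return len(kept)
```

A score of 0 means the rule did not fire; a caller could compare higher scores against a threshold to separate major mentions from passing ones.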
Case Study
Publishing Project: Reed Construction Data
700,000 Proposals – Wide Variation
Process Proposals – extract data – 30-50 types
Current Manual Process – Internal Teams
– Expensive and Slow
Structure Variety of Unstructured Documents
– Generate Table of Contents
– Generate Sections and Capture Text
Semi-automatically extract Key Information
Save Time & Money, Flexible Hiring, New Offerings
Publishing Project: Example Rules
Automated Table of Contents
Publishing Project: Example Rules
Key Data Extraction
Bid Dates/Times
Roles (Architect, Designer, etc.) – names and addresses, etc.
Project Attributes – Cost, Invitation Number, Parking, etc.
Some Easy, Some Hard – Address!
Example:
– ARCHITECT: MICHEAL KIM ARCHITECTURE
– 1 HOLDEN STREET BROOKLINE, MA 02445
– P: (617) 739-6925 F: (772) 325-2991
Technique – create broad and stable templates, variation in the
text
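A hedged Python sketch of a "broad and stable template" for the role block in the example above. The role list, the street-suffix list, and the three-line layout are assumptions for illustration. The suffix list is one simple way to split street from city when no comma separates them, which is part of why addresses are the hard case.

```python
import re

# Assumed vocabularies; a real template would need much broader lists.
ROLES = r"(?:ARCHITECT|DESIGNER|ENGINEER|OWNER)"
STREET_SUFFIX = r"(?:STREET|ST|AVENUE|AVE|ROAD|RD|BOULEVARD|BLVD|DRIVE|DR|LANE|LN)"
PHONE = r"\(\d{3}\)\s*\d{3}-\d{4}"

# Line 1: role and firm name. Line 2: street, city, state, ZIP.
# Line 3: phone, optionally followed by fax.
ROLE_BLOCK = re.compile(
    rf"(?P<role>{ROLES}):\s*(?P<name>[^\n]+?)\s*\n"
    rf"\s*(?P<street>\d+\s+[^\n]+?\b{STREET_SUFFIX})\.?\s+"
    rf"(?P<city>[A-Z][A-Z .'-]*?),\s*(?P<state>[A-Z]{{2}})\s+"
    rf"(?P<zip>\d{{5}}(?:-\d{{4}})?)\s*\n"
    rf"\s*P:\s*(?P<phone>{PHONE})(?:\s+F:\s*(?P<fax>{PHONE}))?"
)


def extract_roles(text):
    """Return one dict of named fields per role block found in the text."""
    return [match.groupdict() for match in ROLE_BLOCK.finditer(text)]
```

The template stays fixed while the firm names, addresses and numbers vary from proposal to proposal.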
Publishing Project: Example Rules
Key Project Data
Publishing Project: Process & Approach
Smart Search: Metadata, Metadata, Metadata
Basic Facets: Date, People, Organization, Content-Type
Advanced Facets: Materials, Methods, Project Attributes, etc.
– Context dependent
Deep personalization
– Selection of facets by role, community, task, content
Smart Summarization
– Better conceptual description
– Complex summaries – key data, document sections, etc.
Smart Search – beyond simple relevancy
Building on the Foundation: Applications
Pronoun Analysis: Fraud Detection; Enron Emails
Patterns of “Function” words reveal wide range of insights
Function words = pronouns, articles, prepositions, conjunctions,
etc.
– Used at a high rate, short and hard to detect, very social,
processed in the brain differently than content words
Areas: sex, age, power-status, personality – individuals and
groups
Lying / Fraud detection: Documents with lies have:
– Fewer, shorter words, fewer conjunctions, more positive
emotion words
– More use of “if, any, those, he, she, they, you”, less “I”
Current research – 76% accuracy in some contexts
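The signals listed above can be turned into simple per-document features, sketched here in Python. The word lists are small illustrative assumptions, and the function only computes rates; it is not a lie detector and makes no claim to reproduce the accuracy figure cited.

```python
import re
from collections import Counter

# Small assumed word lists; research lexicons are far larger.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
MARKERS = {"if", "any", "those", "he", "she", "they", "you"}
CONJUNCTIONS = {"and", "but", "or", "because", "so", "although", "while"}


def function_word_profile(text):
    """Return rates of selected function-word groups and basic length features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    counts = Counter(tokens)

    def rate(vocab):
        return sum(counts[word] for word in vocab) / total

    return {
        "tokens": len(tokens),
        "avg_word_length": sum(len(tok) for tok in tokens) / total,
        "first_person_rate": rate(FIRST_PERSON),
        "marker_rate": rate(MARKERS),
        "conjunction_rate": rate(CONJUNCTIONS),
    }
```

Profiles like this would normally be compared across many documents or authors, or fed to a classifier, rather than judged one document at a time.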
Building on the Foundation: Social Media
Beyond Simple Sentiment
Beyond Good and Evil (positive and negative)
– Degrees of intensity, complexity of emotions and documents
Importance of Context – around positive and negative words
Rhetorical reversals – “I was expecting to love it”
– Issues of sarcasm, (“Really Great Product”), slanguage
– Essential – need full categorization and concept extraction
New Taxonomies – Appraisal Groups – “not very good”
– Supports more subtle distinctions than positive or negative
Emotion taxonomies - Joy, Sadness, Fear, Anger, Surprise, Disgust
– New Complex – pride, shame, confusion, skepticism
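To illustrate the appraisal-group idea ("not very good"), here is a small Python sketch that scores a head word together with the modifiers in front of it. The lexicons, the multipliers, and the choice to flip and halve a negated score are arbitrary assumptions for illustration, not an established scoring scheme.

```python
# Assumed toy lexicons.
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "extremely": 2.0, "somewhat": 0.5}
POLARITY = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def appraisal_groups(text):
    """Return (phrase, score) for each polarity word plus its preceding modifiers."""
    tokens = [tok.strip(".,!?\"'") for tok in text.lower().split()]
    groups = []
    for i, word in enumerate(tokens):
        if word not in POLARITY:
            continue
        score = POLARITY[word]
        negated = False
        start = i
        # Walk left over the contiguous run of intensifiers and negators.
        j = i - 1
        while j >= 0 and (tokens[j] in INTENSIFIERS or tokens[j] in NEGATORS):
            if tokens[j] in INTENSIFIERS:
                score *= INTENSIFIERS[tokens[j]]
            else:
                negated = not negated
            start = j
            j -= 1
        if negated:
            # Assumption: negation flips the sign and weakens the strength.
            score = -score * 0.5
        groups.append((" ".join(tokens[start:i + 1]), score))
    return groups
```

Under these assumptions "good", "very good" and "not very good" get three different scores, where a word-level sentiment count would see one positive word in each.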
Building on the Foundation: Applications
Behavior Prediction – Telecom Customer Service
Problem – distinguish customers likely to cancel from mere threats
Basic Rule / Intention
– (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"),
(NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
Examples:
– customer called to say he will cancell his account if the does not stop receiving
a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to
cancel his act
More sophisticated analysis of text and context in text
Combine text analytics with Predictive Analytics and traditional behavior
monitoring for new applications
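The basic intention rule above can be approximated in Python as follows. The vocabularies behind each bracketed concept are hypothetical stand-ins, and two semantics are assumed: START_20 is read as "the cancel term starts within the first 20 word tokens", and the NOT clause is checked per cancel mention. Misspellings such as "cancell" are kept in the vocabulary because call-center notes contain them.

```python
import re

# Hypothetical vocabularies standing in for the bracketed concept lists.
VOCAB = {
    "cancel": {"cancel", "cancell", "cancels", "canceling", "cancelling"},
    "cancel-what-cust": {"account", "acct", "act", "service"},
    "one-line": {"line"},
    "restore": {"restore", "reinstate"},
    "if": {"if", "unless"},
}
HEDGES = ("one-line", "restore", "if")


def positions(tokens, concept):
    """Token indexes that match any term of the named concept."""
    return [i for i, tok in enumerate(tokens) if tok in VOCAB[concept]]


def near(index, others, max_gap):
    """True if any position in `others` lies within `max_gap` tokens of `index`."""
    return any(abs(index - j) <= max_gap for j in others)


def likely_to_cancel(note, start=20):
    """Fire when a cancel term has an object nearby and no hedging term nearby."""
    tokens = re.findall(r"[a-z]+", note.lower())
    objects = positions(tokens, "cancel-what-cust")
    hedges = [j for concept in HEDGES for j in positions(tokens, concept)]
    for i in positions(tokens, "cancel"):
        if i >= start:  # assumed meaning of START_20
            continue
        # DIST_7 to the thing being cancelled, NOT DIST_10 to a hedge.
        if near(i, objects, 7) and not near(i, hedges, 10):
            return True
    return False
```

Under these assumed vocabularies, the first example note does not fire because "if" sits within 10 tokens of "cancell", while the second note fires.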
Building on the Foundation:
Current Applications
Survey Analysis – Add analysis of free text
Automated Essay Scoring – Second Generation
– Beyond words (polysyllabic) to meaning
Story Telling – Data Heavy, Sports, Finance
– 90% of news machine written by 2025, books?
Legal Review / eDiscovery
– TA – categorize and filter to smaller, more relevant set
– Payoff is big – One firm with 1.6 M docs – saved $2M
Voice of the Customer / Employee / Voter
– Analysis of Blogs, Tweets, Social Networks
– Early identification of problems with products and services
– Customer Relationship & Brand Management, Fraud Detection
Smart Text : New Directions - Integration
Deep Integration – Text Analytics
New Forms of Rules – Combine Text Mining and Text Analytics
Incorporate clusters – CLUSTER Operator
– Like SENTENCE but more flexible, dynamic
– More Dynamic Sections
– Build up from “Categorization” of sentences – based on co-reference
– Smaller units – Appraisal Taxonomies for Subjects, Build Larger
Units
– Complex Units – Collections of Paragraphs based on meaning
Sentence Level Sentiment Techniques for Subjects
– Smarter Relevancy – not frequency – develop new scoring
Graph Database
– Enrich queries, improve relevancy
– Graph traversal – browse mechanism.
– Text Analytics to fill the database – “real” semantics
Conclusions
Text Analytics can feed/extend Big Data and Cognitive Science
applications
– Discover structure in (un)structured text
– Apply text analytics to sections of document – new kinds of relevancy
– Creating multiple views into data inside text – smart search results –
interactive (facets plus)
– Modular design – better search, new applications, Watson
Future: Cognitive Computing:
– Learns, discovers patterns based on context, highly integrated,
meaning-based, highly interactive
– Text Analytics adds depth of meaning
Future – Women, Fire, and Dangerous Things
– Text Analytics and Cognitive Science = Metaphor Analysis, deep
language understanding, common sense?
Questions?
Tom Reamy
[email protected]
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com