Intelligent Interactions with Search Results

Getting Beyond Those Blue Results Lists
(or Smart Text)
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Agenda
 Introduction: Search and Structure: Smart Text
– Smart Text – foundation of text analytics
 Adding Structure to Unstructured Text
– Dynamic Sections and more, Better Relevancy Calculations
– Complex Document Summaries, Deeper Personalization
 Case Study
– Publishing: Processing 700K Proposals
 Beyond Search – Building on the Foundation
 Conclusions
Introduction: KAPS Group
 Knowledge Architecture Professional Services – Network of Consultants
 Applied Theory – Faceted & emotion taxonomies, natural categories
 Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics, Social Media development, consulting
– Text Analytics Quick Start – Audit, Evaluation, Pilot
 Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST,
Concept Searching, Attensity, Clarabridge, Lexalytics
 Clients: Genentech, Novartis, Northwestern Mutual Life, Financial
Times, Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, World Bank, Dept. of Transportation, etc.
 Program Chair – Text Analytics World – March 29-April 1 - SF
 Presentations, Articles, White Papers – www.kapsgroup.com
 Current – Book – Text Analytics: How to Conquer Information Overload,
Get Real Value from Social Media, and Add Smart Text to Big Data
Introduction:
Elements of Smart Text - Text Analytics
 Text Mining – NLP, statistical, predictive, machine learning
 Extraction – entities – known and unknown, concepts, events
– Disambiguation - Ford
 Fact Extraction - ontology, relationships of entities
 Sentiment Analysis - Positive, Negative – products, companies
 Auto-categorization
– Training sets, Terms
– Rules – simple – position in text (Title, body, url)
– Boolean – Full search syntax – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
– Based on taxonomy/ontology
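
The proximity operators named above can be illustrated with a minimal Python sketch. It is not the syntax or implementation of any particular text analytics product: the function names `dist` and `orddist`, the naive tokenizer, and the choice to measure distance in word tokens are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately naive tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def positions(tokens, term):
    """Indexes at which a single-word term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def dist(tokens, n, a, b):
    """True if terms a and b occur within n tokens of each other, in any order."""
    return any(abs(i - j) <= n
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def orddist(tokens, n, a, b):
    """True if a occurs before b and b is at most n tokens after a."""
    return any(0 < j - i <= n
               for i in positions(tokens, a)
               for j in positions(tokens, b))

tokens = tokenize("The trial was approved after review by the agency.")
print(dist(tokens, 3, "approved", "trial"))     # unordered: the two terms are 2 tokens apart
print(orddist(tokens, 3, "approved", "trial"))  # ordered: "approved" does not precede "trial"
```

The only difference between the two operators in this sketch is whether word order matters, which is why an ordered variant is useful for phrases whose meaning depends on sequence.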
Enterprise Text Analytics
 Search is still #1 = 30-50% of applications
 New Standard Search – facets (more and more metadata), auto-categorization built on taxonomies, clustering
 Trend = Text Analytics/Search as Semantic Infrastructure
– Platform for Info Apps (Search-based applications)
 SharePoint – Major focus of TA companies – fix problems with
taxonomy/folksonomy
– Hybrid workflow – Publish document -> TA analysis -> suggestions
for categorization, entities, metadata -> present to author
 External information = more automation, extraction – precision
more important
Enterprise Text Analytics
Adding Structure to Unstructured Content
 Beyond Documents – categorization by corpus, by page, sections
or even sentence or phrase
 Documents are not unstructured – variety of structures
 Text indicators to define sections of the document
– Objectives, Abstract, Purpose, Aim – all the “same” section
– Sections range from specific (“Abstract”) to functional (“Evidence”)
– Start of section is easy – where does it end?
 Experiment – clusters / vocabulary to define section
– Textual complexity, level of generality
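
The idea of using text indicators to define sections can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used in the projects described here: the alias table, the canonical labels ("purpose", "methods", "evidence"), and the rule that a section ends where the next known heading begins are all hypothetical choices.

```python
import re

# Hypothetical mapping from heading variants to one canonical section label.
SECTION_ALIASES = {
    "objectives": "purpose",
    "abstract": "purpose",
    "purpose": "purpose",
    "aim": "purpose",
    "methods": "methods",
    "results": "evidence",
}

# A heading is assumed to be a line holding a single word, optionally followed by a colon.
HEADING = re.compile(r"^\s*([A-Za-z]+)\s*:?\s*$")

def split_sections(text):
    """Return {canonical_label: text}; a section ends where the next known heading starts."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        label = SECTION_ALIASES.get(match.group(1).lower()) if match else None
        if label:
            current = label
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            # Text before the first recognized heading is ignored in this sketch.
            sections[current].append(line.strip())
    return {label: " ".join(lines) for label, lines in sections.items()}

doc = """Aim
Measure response time.
Methods
We sampled 40 queries.
They were run twice.
"""
sections = split_sections(doc)
print(sections)
```

Ending a section at the next heading is the simplest possible answer to "where does it end?". It fails when headings are missing or ambiguous, which is where cluster or vocabulary evidence would have to take over.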
Enterprise Text Analytics
Categorization and Beyond
 Need to develop flexible categorization and taxonomy – from tweets to
200-page PDFs
 Rules or sample documents?
– Need more precision and granularity than documents can do
– Training sets – not as easy as thought
– Applications require sophisticated rules, not just
categorization by similarity
 Separate logic of the rules from the text
– Stable rules, changing text
 Scores – relevancy with thresholds
– Not just frequency of words
Enterprise Text Analytics
Document Type Rules
 (START_2000, (AND, (OR, _/article:"[Abstract]",
_/article:"[Methods]"), (OR, _/article:"clinical trial*",
_/article:"humans",
 (NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),
 If the article has sections like Abstract or Methods
 AND has phrases around “clinical trials / Humans” and not words
like “animals” within 5 words of “clinical trial” words – count it and
add up a relevancy score
 Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
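
The logic described in the bullets above can be approximated in Python. This is a rough sketch, not a translation of the rule language: treating START_2000 as "only the first 2000 tokens", matching sections by the bare words "abstract" and "methods", and the `THRESHOLD` value are assumptions made for illustration.

```python
import re

EXCLUDE = {"approved", "safe", "use", "animals"}  # terms from the NOT clause
THRESHOLD = 2  # hypothetical cut-off for calling a document a human clinical trial

def score_human_trial(text, window=5, start=2000):
    """Count trial mentions that have no excluded word within `window` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())[:start]
    # The rule first requires a section such as Abstract or Methods.
    if not any(tok in ("abstract", "methods") for tok in tokens):
        return 0
    score = 0
    for i, tok in enumerate(tokens):
        is_trial = tok == "humans" or (
            tok == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")
        )
        if not is_trial:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        if EXCLUDE.isdisjoint(nearby):
            score += 1  # each clean mention adds to the relevancy score
    return score

text = ("Abstract. A clinical trial in humans showed benefit. "
        "Methods. An earlier clinical trial in animals was excluded.")
score = score_human_trial(text)
print(score, score >= THRESHOLD)
```

The second "clinical trial" mention is not counted because "animals" falls inside its five-word window, which is the behavior the NOT/DIST_5 clause is meant to capture.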
Case Study
Publishing Project: Reed Construction Data
 700,000 Proposals – Wide Variation
 Process Proposals – extract data – 30-50 types
 Current Manual Process – Internal Teams
– Expensive and Slow
 Structure Variety of Unstructured Documents
– Generate Table of Contents
– Generate Sections and Capture Text
 Semi-automatically extract Key Information
 Save Time & Money, Flexible Hiring, New Offerings
Publishing Project: Example Rules
Automated Table of Contents
Publishing Project: Example Rules
Key Data Extraction
 Bid Dates/Times
 Roles (Architect, Designer, etc.) – names and addresses, etc.
 Project Attributes – Cost, Invitation Number, Parking, etc.
 Some Easy, Some Hard – Address!
 Example:
– ARCHITECT: MICHEAL KIM ARCHITECTURE
– 1 HOLDEN STREET BROOKLINE, MA 02445
– P: (617) 739-6925 F: (772) 325-2991
 Technique – create broad and stable templates, variation in the text
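
A stable template for the example record above might look like the following Python sketch. The regular expressions and field names are assumptions for illustration, not the rules used in the project; only the "easy" fields are handled, and the boundary between street and city is deliberately left unsolved.

```python
import re

RECORD = """ARCHITECT: MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET BROOKLINE, MA 02445
P: (617) 739-6925 F: (772) 325-2991"""

# Hypothetical templates: a role label, a state/ZIP pair, and labeled phone numbers.
ROLE = re.compile(r"^(ARCHITECT|DESIGNER):\s*(.+)$", re.MULTILINE)
STATE_ZIP = re.compile(r",\s*([A-Z]{2})\s+(\d{5})(?:-\d{4})?\b")
PHONE = re.compile(r"\b([PF]):\s*(\(\d{3}\)\s*\d{3}-\d{4})")

def extract(record):
    """Pull the fields that have reliable surface patterns out of one record."""
    data = {}
    role = ROLE.search(record)
    if role:
        data["role"] = role.group(1)
        data["name"] = role.group(2).strip()
    place = STATE_ZIP.search(record)
    if place:
        data["state"], data["zip"] = place.group(1), place.group(2)
    for kind, number in PHONE.findall(record):
        data["phone" if kind == "P" else "fax"] = number
    return data

fields = extract(RECORD)
print(fields)
```

The role, state, ZIP, and phone patterns are stable across documents, while "1 HOLDEN STREET BROOKLINE" has no delimiter between street and city. That is one reason the address is listed among the hard cases.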
Publishing Project: Example Rules
Key Project Data
Publishing Project: Process & Approach
Smart Search: Metadata, Metadata, Metadata
 Basic Facets: Date, People, Organization, Content-Type
 Advanced Facets: Materials, Methods, Project Attributes, etc.
– Context dependent
 Deep personalization
– Selection of facets by role, community, task, content
 Smart Summarization
– Better conceptual description
– Complex summaries – key data, document sections, etc.
 Smart Search – beyond simple relevancy
Building on the Foundation: Applications
Pronoun Analysis: Fraud Detection; Enron Emails
 Patterns of “Function” words reveal wide range of insights
 Function words = pronouns, articles, prepositions, conjunctions,
etc.
– Used at a high rate, short and hard to detect, very social,
processed in the brain differently than content words
 Areas: sex, age, power-status, personality – individuals and
groups
 Lying / Fraud detection: Documents with lies have:
– Fewer, shorter words, fewer conjunctions, more positive
emotion words
– More use of “if, any, those, he, she, they, you”, less “I”
 Current research – 76% accuracy in some contexts
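
The function-word features mentioned above can be computed with a few lines of Python. This sketch only produces raw rates; it does not detect lies, and nothing here reproduces the research behind the accuracy figure. The word lists come from the bullets above, and the feature names are hypothetical.

```python
import re
from collections import Counter

# Words the slide associates with deceptive text, plus first-person singular "I".
MARKERS = {"if", "any", "those", "he", "she", "they", "you"}
FIRST_PERSON = {"i"}

def function_word_profile(text):
    """Return per-token rates of marker words and 'I', plus mean word length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "marker_rate": sum(counts[w] for w in MARKERS) / total,
        "first_person_rate": sum(counts[w] for w in FIRST_PERSON) / total,
        "mean_word_length": sum(len(t) for t in tokens) / total,
    }

profile = function_word_profile(
    "I think they said you could go if any of those forms were signed"
)
print(profile)
```

Features like these would normally be fed to a classifier trained on labeled examples. The rates alone say nothing about any single document.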
Building on the Foundation: Social Media
Beyond Simple Sentiment
 Beyond Good and Evil (positive and negative)
– Degrees of intensity, complexity of emotions and documents
 Importance of Context – around positive and negative words
– Rhetorical reversals – “I was expecting to love it”
– Issues of sarcasm (“Really Great Product”), slanguage
 Essential – need full categorization and concept extraction
 New Taxonomies – Appraisal Groups – “not very good”
– Supports more subtle distinctions than positive or negative
 Emotion taxonomies - Joy, Sadness, Fear, Anger, Surprise, Disgust
– New Complex – pride, shame, confusion, skepticism
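
An appraisal group such as “not very good” can be scored as a unit rather than word by word. The Python sketch below is one possible treatment, with every number invented for illustration: the lexicon values, the intensifier weights, and the decision to have negation shift a score toward the opposite pole by a fixed amount instead of flipping its sign.

```python
# All values below are hypothetical; real appraisal lexicons are far larger.
POLARITY = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "somewhat": 0.5}
NEGATORS = {"not", "never"}
NEGATION_SHIFT = 2.0

def score_appraisal_group(phrase):
    """Score a short group ending in an attitude word; None if there is none."""
    words = phrase.lower().split()
    if not words or words[-1] not in POLARITY:
        return None
    score = POLARITY[words[-1]]
    negated = False
    for word in words[:-1]:
        if word in INTENSIFIERS:
            score *= INTENSIFIERS[word]
        elif word in NEGATORS:
            negated = not negated
    if negated:
        # Shift toward the opposite pole instead of flipping the sign.
        score += -NEGATION_SHIFT if score > 0 else NEGATION_SHIFT
    return score

for phrase in ("very good", "not good", "not very good"):
    print(phrase, score_appraisal_group(phrase))
```

With these assumed values, “not very good” comes out as mildly negative and “not good” as more negative, a distinction a plain positive/negative label cannot express.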
Building on the Foundation: Applications
Behavior Prediction – Telecom Customer Service
 Problem – distinguish customers likely to cancel from mere threats
 Basic Rule / Intention
– (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"),
  (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
 Examples:
– customer called to say he will cancell his account if the does not stop receiving
a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to
cancel his act
 More sophisticated analysis of text and context in text
 Combine text analytics with Predictive Analytics and traditional behavior
monitoring for new applications
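
The basic cancel-intention rule can be approximated in Python to show how the pieces interact. This is a sketch under several assumptions: the vocabularies standing in for the bracketed concepts are hypothetical, START_20 is read as "only the first 20 tokens", and distance is counted in word tokens.

```python
import re

# Hypothetical vocabularies standing in for the bracketed concepts in the rule.
CONCEPTS = {
    "cancel": {"cancel", "cancell", "cancelling", "disconnect"},
    "cancel-what-cust": {"account", "acct", "act", "service"},
    "one-line": {"line"},
    "restore": {"restore", "reinstate"},
    "if": {"if", "unless"},
}

def hits(tokens, concept):
    """Token positions that match any word in the named concept."""
    return [i for i, tok in enumerate(tokens) if tok in CONCEPTS[concept]]

def dist(tokens, n, concept, others):
    """True if `concept` occurs within n tokens of any concept in `others`."""
    return any(abs(i - j) <= n
               for i in hits(tokens, concept)
               for other in others
               for j in hits(tokens, other))

def likely_to_cancel(note, start=20):
    """Cancel word near an account word, with no conditional or restore word nearby."""
    tokens = re.findall(r"[a-z]+", note.lower())[:start]
    wants_cancel = dist(tokens, 7, "cancel", ["cancel-what-cust"])
    hedged = dist(tokens, 10, "cancel", ["one-line", "restore", "if"])
    return wants_cancel and not hedged

threat = ("customer called to say he will cancell his account if the does not "
          "stop receiving a call from the ad agency.")
plain = "customer called and wants to cancel his account today"  # invented example
print(likely_to_cancel(threat), likely_to_cancel(plain))
```

The first note is the conditional example from above: the "if" close to the cancel word marks it as a threat rather than a firm intention, so the rule does not fire. Misspellings such as "cancell" have to be listed in the concept vocabulary, since call-center notes are rarely clean text.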
Building on the Foundation:
Current Applications
 Survey Analysis – Add analysis of free text
 Automated Essay Scoring – Second Generation
– Beyond words (polysyllabic) to meaning
 Story Telling – Data Heavy, Sports, Finance
– 90% of news machine written by 2025, books?
 Legal Review / eDiscovery
– TA – categorize and filter to smaller, more relevant set
– Payoff is big – One firm with 1.6 M docs – saved $2M
 Voice of the Customer / Employee / Voter
– Analysis of Blogs, Tweets, Social Networks
– Identify problems with products and services early
– Customer Relationship & Brand Management, Fraud Detection
Smart Text : New Directions - Integration
Deep Integration – Text Analytics
 New Forms of Rules – Combine Text Mining and Text Analytics
– Incorporate clusters – CLUSTER Operator
– Like SENTENCE but more flexible, dynamic
 More Dynamic Sections
– Build up from “Categorization” of sentences – based on co-reference
– Smaller units – Appraisal Taxonomies for Subjects, Build Larger Units
– Complex Units – Collections of Paragraphs based on meaning
 Sentence Level Sentiment Techniques for Subjects
– Smarter Relevancy – not frequency – develop new scoring
 Graph Database
– Enrich queries, improve relevancy
– Graph traversal – browse mechanism
– Text Analytics to fill the database – “real” semantics
Conclusions
 Text Analytics can feed/extend Big Data and Cognitive Science applications
– Discover structure in (un)structured text
– Apply text analytics to sections of document – new kinds of relevancy
– Creating multiple views into data inside text – smart search results – interactive (facets plus)
– Modular design – better search, new applications, Watson
 Future: Cognitive Computing:
– Learns, discovers patterns based on context, highly integrated, meaning-based, highly interactive
– Text Analytics adds depth of meaning
 Future – Women, Fire, and Dangerous Things
– Text Analytics and Cognitive Science = Metaphor Analysis, deep language understanding, common sense?
Questions?
Tom Reamy
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
