TBC 2015 Text Analytics - Hlava1

Text Analytics in Action:
Using Text Analytics as a Toolset
TBC 4:15 p.m. - 5:00 p.m.
Marjorie Hlava
Semantic Enrichment / Semantic Fingerprinting
Abstract
• Big data inferences are increasingly used to mine huge
heaps of data.
• The applications are endless.
• However, those inferences do not work well when
many lines go to a single bubble. The lines and
relationships must be drawn between concepts, not
simply between words.
• Text analytics is a powerful tool, but it is a
means to an end, not the end itself.
• The important work is in the interpretation of the data.
• This session outlines a highly accurate and efficient
approach and provides a case study of the application.
Outline of the talk
• Using text analytics in term extraction
– 3 examples
– Pattern recognition
– String tagging
– Taxonomy control
• Achieving Synonymy
• Now what do I do with it?
Term clouds
• Good place to start
• Show concept landscape
• Basis =
– Levenshtein distances
– N-grams
• Redundant concepts, separately shown
• No disambiguation
• Not direct XML tagging
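The slide names Levenshtein distances and n-grams as the basis for term clouds. A minimal sketch of both follows, assuming word-level n-grams and plain whitespace tokenization (the talk specifies neither). The function names are illustrative and do not come from any tool shown in the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """All runs of n consecutive lowercase words (assumed tokenization: whitespace)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Both measures work on surface strings only, which is consistent with the slide's caveats: near-duplicate spellings look close, but nothing here merges redundant concepts or disambiguates them.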
Sample
article
Normal text extraction
Near conceptual synonyms
Nonsensical suggestions
Small Taxonomy
Near synonym, conceptual duplicate
Refined presentation
Dependent concepts
Ontological dependencies
Achieving Synonymy
• Find like concepts
• Merge the terms
• Choose a preferred form
• Build term record
– Hierarchy
– Equivalence
– Associative
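The steps above can be sketched as a small data structure. This is an illustration only: the class, the field names, and the choice of which variant becomes the preferred form are assumptions, not the record layout of any specific thesaurus tool.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One concept with the three relationship types named on the slide."""
    preferred: str                                      # chosen preferred form
    broader: list[str] = field(default_factory=list)    # hierarchy (parents)
    narrower: list[str] = field(default_factory=list)   # hierarchy (children)
    use_for: list[str] = field(default_factory=list)    # equivalence (synonyms)
    related: list[str] = field(default_factory=list)    # associative links


def merge_terms(variants: list[str], preferred: str) -> TermRecord:
    """Merge like concepts into one record under a chosen preferred form."""
    if preferred not in variants:
        raise ValueError("preferred form must be one of the variants")
    synonyms = sorted(v for v in set(variants) if v != preferred)
    return TermRecord(preferred=preferred, use_for=synonyms)
```

For example, `merge_terms(["FARC", "People's Armed Forces of Colombia"], "FARC")` yields a single record whose `use_for` list holds the long form, so either string can lead to the same concept.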
Overview: upload 7K documents, search for a text string, add a tag (“Columbia”)
“Colombian” – no stemming
Same document – different terms
Colombiana – record overlap
“FARC” – No Synonymy
“People’s Armed Forces of Colombia”, i.e.,
FARC, lacks synonymy, some doc overlap
Tag suite: no hierarchy, no equivalence, no combining tags for synonymy
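To make the limitation concrete, here is a rough sketch contrasting plain string tagging with tagging by a set of equivalent forms. The function names and the three sample documents are invented for illustration; they are not taken from the 7K-document collection in the screenshots.

```python
def tag_by_string(documents: dict[str, str], needle: str) -> set[str]:
    """IDs of documents whose text contains the literal string (case-insensitive)."""
    return {doc_id for doc_id, text in documents.items()
            if needle.lower() in text.lower()}


def tag_by_concept(documents: dict[str, str], variants: list[str]) -> set[str]:
    """IDs of documents containing any equivalent form of one concept."""
    tagged: set[str] = set()
    for variant in variants:
        tagged |= tag_by_string(documents, variant)
    return tagged


# Invented sample documents, for illustration only.
docs = {
    "d1": "FARC announced a ceasefire.",
    "d2": "The People's Armed Forces of Colombia released a statement.",
    "d3": "Colombian coffee exports rose.",
}
farc_only = tag_by_string(docs, "FARC")
farc_concept = tag_by_concept(docs, ["FARC", "People's Armed Forces of Colombia"])
```

Here `farc_only` contains just `d1`, while `farc_concept` also picks up `d2`. Without equivalence relationships, the two tags stay separate and each sees only part of the relevant material.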
Disambiguation
– Bridge (structure)
– Bridge (dentistry)
– Bridge (game)
– Bridge (concept)
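One simple way to picture disambiguation is to score each sense of "bridge" by the context words around it. The sketch below is an assumption about how such a step could work, not the method used in the talk. The cue words are invented, and the fourth sense (concept) is left out because no obvious cue words suggest themselves.

```python
import re

# Hypothetical cue words per sense; a real rulebase would be tuned on content.
SENSE_CUES = {
    "Bridge (structure)": {"span", "river", "suspension", "engineering", "traffic"},
    "Bridge (dentistry)": {"tooth", "teeth", "crown", "dental", "dentist"},
    "Bridge (game)": {"cards", "bid", "trick", "trump", "partner"},
}


def disambiguate(text: str, sense_cues: dict[str, set[str]] = SENSE_CUES):
    """Return the sense whose cue words overlap the text most, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if "bridge" not in words:
        return None
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A sentence mentioning a dentist and a missing tooth resolves to the dentistry sense; a sentence with no cue words returns `None` rather than guessing.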
Now what do I do with it?
• Tag documents
– Consistently
– Even depth of treatment
– Full breadth of conceptual area
• Insert concepts in full text or as linked data
• Implement in search
• Use for internal statistics and analysis
• Track industry trends
• Create semantic fingerprints
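The talk does not define a semantic fingerprint. One common reading, assumed here, is a weighted profile of the taxonomy concepts tagged on a document, author, or collection, which can then be compared with other profiles. The function names and the cosine comparison are illustrative choices.

```python
from collections import Counter
from math import sqrt


def fingerprint(concept_tags: list[str]) -> dict[str, float]:
    """Share of each concept among all tags applied to an item."""
    counts = Counter(concept_tags)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Similarity of two fingerprints: 1.0 for same profile, 0.0 for no overlap."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the fingerprint is built from controlled concepts rather than raw words, two items that use different wording for the same concept still compare as similar, provided they were tagged consistently.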
The AIP Thesaurus
Hierarchy
Term Record
The AIP Thesaurus: Rulebase
This article is about (among other things)
degenerate stars.
The text string “degenerate stars” occurs zero
times in the text of the article.
But because the rulebase is tuned to recognize
when certain other words appear near the text
“star” or “stars”, the article was correctly indexed.
The AIP Thesaurus: Rulebase
If the word “star” or “stars” appears in
the same sentence as “degenerate” or
“compact” MAI applies the term
“Degenerate stars” instead of
just using “Stars”
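The rule described above can be approximated in a few lines. This is a sketch of the logic only, not MAI's actual rule syntax: the function name, the sentence splitter, and the word tokenizer are all simplifying assumptions.

```python
import re


def index_star_terms(text: str) -> set[str]:
    """Apply 'Degenerate stars' or 'Stars' sentence by sentence."""
    terms: set[str] = set()
    # Naive sentence split on ., !, or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & {"star", "stars"}:
            if words & {"degenerate", "compact"}:
                terms.add("Degenerate stars")  # qualifier in the same sentence
            else:
                terms.add("Stars")             # fall back to the broader term
    return terms
```

Note that a sentence such as "Each compact star is supported by degeneracy pressure." is indexed as "Degenerate stars" even though that exact string never appears, which is the behavior the slide describes.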
The AIP Thesaurus: Applications
Listing of the AIP Thesaurus terms in
JATS. Includes the term, keyword-ID,
weight, code.
Inline tagged terms (denoted by the highlighting). The keyword ID
(kwd1.4) corresponds with the name in the previous screenshot.
HTML Header
Copyright © 2013 Access Innovations, Inc.
7. Content Recommender
Selected Article Search “thin film sputtering”
Grants available
Upcoming conferences on this topic
More Articles on the same topic
Authors working in this space
Taxonomy Driven Search Presentation
Auto-completion using the taxonomy
Guide the user
Navigate the full taxonomy “tree”
BROWSE
Thesaurus Term Record view
Taxonomy view
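Auto-completion from a taxonomy can be pictured as a prefix lookup over the controlled term list. The sketch below assumes a flat list of labels and case-insensitive matching; the function name is illustrative and the search interface in the screenshots may work differently.

```python
from bisect import bisect_left


def autocomplete(labels: list[str], prefix: str, limit: int = 5) -> list[str]:
    """Taxonomy labels starting with the typed prefix, in alphabetical order."""
    ordered = sorted(labels, key=str.lower)
    keys = [label.lower() for label in ordered]
    wanted = prefix.lower()
    start = bisect_left(keys, wanted)  # first label not sorting before the prefix
    matches = []
    for label, key in zip(ordered[start:], keys[start:]):
        if not key.startswith(wanted):
            break                      # sorted order: no later label can match
        matches.append(label)
    return matches[:limit]
```

Because suggestions come only from the taxonomy, the user is steered toward terms that were actually used to tag the content.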
Copyright © 2005 - Access Innovations, Inc.
Suggested taxonomy descriptors
Visualization Strategies
Visualization Software Matrix
Pattern Analysis: Domain Associations
Pattern Analysis: Gap Analyses
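A gap analysis can be approximated by counting how often each taxonomy term is applied across the collection. The sketch below assumes that a "gap" is a term with no tagged documents and that heavy coverage means a term appears on more than a chosen share of documents. Both definitions, the threshold, and the function name are assumptions for illustration.

```python
from collections import Counter


def coverage_report(taxonomy_terms: list[str],
                    document_tags: dict[str, list[str]],
                    heavy_share: float = 0.5) -> tuple[list[str], list[str]]:
    """Return (terms never applied, terms applied to more than heavy_share of docs)."""
    # Count each term at most once per document.
    counts = Counter(tag for tags in document_tags.values() for tag in set(tags))
    n_docs = len(document_tags)
    gaps = sorted(t for t in taxonomy_terms if counts[t] == 0)
    heavy = sorted(t for t in taxonomy_terms
                   if n_docs and counts[t] / n_docs > heavy_share)
    return gaps, heavy
```

Unused terms may point to missing content or to a branch of the taxonomy that needs pruning, while heavily used terms may need narrower terms beneath them.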
Summary
• Taxonomy tool box
• Text extraction / mining for terms
• Gather synonyms
• Disambiguate terms
• Look for gaps and over coverage
• Map all conceptual groupings
– Hierarchical, Associative, Equivalence
• Apply to content
• Leverage knowledge of the collection
Thank you
Marjorie M.K. Hlava, President
Access Innovations
505-998-0800
The Semantic Enrichment Company
About Access Innovations
Access Innovations are experts in content creation, enrichment, and
conversion services. We provide services to semantically enrich and tag raw
text into highly structured data. We deliver clean, well-formed,
metadata-enriched content so our clients can reuse, repurpose, store, and find their
knowledge assets. We go beyond the standards to build taxonomies and
other data control structures as a solid foundation for your information.
Our services and software allow organizations to use and present their
information to both internal and external constituents by leveraging search,
presentation, and e-commerce. We change search to found!
Quick Facts
• Founded in 1978
• Headquartered in Albuquerque, NM
• Privately held
• Delivered more than 2000 engagements