Taxonomy Development Workshop
Smart Text
How to Turn Big Text into Big Data
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp, KMWorld: Washington DC
Internet Librarian: Monterey, CA
KAPS Group: General
Knowledge Architecture Professional Services – Network of Consultants
Partners – Expert System, SAS, SAP, IBM, FAST, Smart Logic,
Concept Searching, Attensity, Clarabridge, Lexalytics
Strategy – IM & KM - Text Analytics, Social Media, Integration
Services:
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Fast Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, etc.
Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com
Agenda
Introduction: Big Text and Big Data
Pharma: Semantic Search Application
– Project Components & Approach
– Extraction Rules
Publishing: Processing 700K Proposals
– Adding Structure to Unstructured Text
– Text into Data
Conclusions
Big Text and Big Data
Big Text is Bigger than Big Data
– 80% -> 90% of business information (Social Media)
Big Data tells you WHAT
– Smart Text tells you WHY
Big Data – Data Munging = 50-80% of Data Scientist Time
– Variety of Formats // Ambiguity of Human Language
Ontology / Fact Extraction – Pulmonary ISA Disease
– Chronic obstructive pulmonary disease, obstructive pulmonary disease, Copd, copd, COPD, Asthma (Asthema), Emphysema, etc.
Semi-Automatic Hybrid Solutions
– AI not here yet (again)
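The Ontology / Fact Extraction bullet above can be sketched as a small normalization step, assuming a hand-built variant table. The names SURFACE_FORMS, IS_A, and extract_concepts are hypothetical, and grouping "obstructive pulmonary disease" under COPD is an assumption made only for illustration.

```python
import re

# Hypothetical variant table: canonical concept -> surface forms seen in text.
SURFACE_FORMS = {
    "COPD": [
        "chronic obstructive pulmonary disease",
        "obstructive pulmonary disease",
        "copd",
    ],
    "Asthma": ["asthma", "asthema"],  # includes a misspelling seen in text
    "Emphysema": ["emphysema"],
}

# Hypothetical ISA links: every concept here is a kind of pulmonary disease.
IS_A = {concept: "Pulmonary disease" for concept in SURFACE_FORMS}

# Map each variant to its concept. Longer variants are tried first so a
# full phrase wins over the shorter phrase it contains.
_VARIANT_TO_CONCEPT = {
    variant: concept
    for concept, variants in SURFACE_FORMS.items()
    for variant in variants
}
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(
        re.escape(v) for v in sorted(_VARIANT_TO_CONCEPT, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def extract_concepts(text):
    """Return sorted (concept, parent) pairs for every variant found in text."""
    found = {
        _VARIANT_TO_CONCEPT[m.group(0).lower()] for m in _PATTERN.finditer(text)
    }
    return sorted((concept, IS_A[concept]) for concept in found)
```

Matching is case-insensitive, so Copd, copd, and COPD all collapse to one concept; new variants only require a change to the table, not to the matching code.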
Pharma: Project
Agile Methodology
Goal – evaluate text analysis technologies' ability to:
– Replace manual annotation of scientific documents –
automated or semi-automated
– Discover new entities and relationships
– Provide users with self-service capabilities
Goal – feasibility and effort level
Components – Technology, Resources
Cambridge Semantics, Linguamatics, SAS Enterprise Content
Categorization
– Initial integration – passing results as XML
Content – scientific journal articles
Taxonomy – MeSH – select small subset
Access to a “customer” – critical for success
Three rounds - Iterations
Visualization – faceted search, sort by date, author, journal
– Cambridge Semantics
Round 1 – PDF from their database
– Needed to create additional structure and metadata
– No such thing as unstructured content
Round 2 & 3 – XML with full metadata from PubMed
Entity Recognition – Species, Document Type, Study Type, Drug
Names, Disease Names, Adverse Events
Components & Approach
Rules or sample documents?
– Need more precision and granularity than documents can do
– Training sets – not as easy as thought
First Rules – text indicators to define sections of the document
– Objectives, Abstract, Purpose, Aim – all the “same” section
– Experiment – clusters / vocabulary to define section
Separate logic of the rules from the text
– Stable rules, changing text
Scores – relevancy with thresholds
– Not just frequency of words
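One way to picture "separate logic of the rules from the text" is a labeling function that never changes, driven by a vocabulary table that does. This is a rough sketch, not the project's actual rules; SECTION_VOCABULARY, its "methods" entry, and label_sections are hypothetical names.

```python
# Hypothetical vocabulary: canonical section -> header text that signals it.
# This table is the part that changes; the function below stays stable.
SECTION_VOCABULARY = {
    "objective": ["Objectives", "Objective", "Abstract", "Purpose", "Aim"],
    "methods": ["Methods", "Materials and Methods"],
}


def label_sections(lines, vocabulary):
    """Attach a canonical section label to each non-header line.

    A line counts as a header when, after trimming whitespace and a
    trailing colon, it equals one of the vocabulary terms (ignoring case).
    Lines before the first recognized header get the label None.
    """
    lookup = {
        term.lower(): canonical
        for canonical, terms in vocabulary.items()
        for term in terms
    }
    labeled = []
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        header = stripped.rstrip(":").lower()
        if header in lookup:
            current = lookup[header]
            continue
        labeled.append((current, stripped))
    return labeled
```

With this split, treating "Purpose" and "Aim" as the same section is a one-line change to the table.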
Document Type Rules
(START_2000, (AND, (OR, _/article:"[Abstract]",
_/article:"[Methods]", _/article:"[Objective]",
_/article:"[Results]", _/article:"[Discussion]", (OR,
_/article:"clinical trial*", _/article:"humans",
(NOT, (DIST_5, (OR,_/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),
Clinical Trial Rule:
If the article has sections like Abstract or Methods
AND has phrases around “clinical trials / Humans” and not words
like “animals” within 5 words of “clinical trial” words – count it and
add up a relevancy score
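The plain-language rule above can be approximated in a few lines of Python. This is only an illustration of the logic, not the rule engine's behavior: it assumes START_2000 means "within the first 2000 words" and DIST_5 means "within 5 words on either side", and the names clinical_trial_score, SECTION_MARKERS, and EXCLUDED_NEARBY are hypothetical.

```python
import re

SECTION_MARKERS = {"abstract", "methods", "objective", "results", "discussion"}
EXCLUDED_NEARBY = {"approved", "safe", "use", "animals"}


def clinical_trial_score(text, start_words=2000, distance=5):
    """Count qualifying 'clinical trial' / 'humans' mentions.

    Returns 0 unless at least one section marker appears in the first
    `start_words` words. A mention is skipped when an excluded word
    occurs within `distance` words of it.
    """
    tokens = re.findall(r"[a-z]+", text.lower())[:start_words]
    if not any(tok in SECTION_MARKERS for tok in tokens):
        return 0
    score = 0
    for i, tok in enumerate(tokens):
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        is_trial_term = tok == "humans" or (
            tok == "clinical" and next_tok.startswith("trial")
        )
        if not is_trial_term:
            continue
        window = tokens[max(0, i - distance): i + distance + 1]
        if EXCLUDED_NEARBY.isdisjoint(window):
            score += 1
    return score
```

A document would then be tagged as a clinical trial when the score clears a chosen threshold, which is the "relevancy with thresholds" idea rather than raw word frequency.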
Rules for Drug Names and Diseases
Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
Taxonomy of drug names and diseases
Capture general diseases like thrombosis and specific types like
deep vein, cerebral, and cardiac
Combine text about arthritis and synonyms with text like “Journal
of Rheumatology”
Rules for Drug Names and Diseases
(OR, _/article/title:"[clonidine]",
(AND, _/article/mesh:"[clonidine]",_/article/abstract:"[clonidine]"),
(MINOC_2, _/article/abstract:"[clonidine]"),
(START_500, (MINOC_2,"[clonidine]")))
Means – any variation of drug name in title – high score
Any variation in MeSH keywords AND in abstract – high score
Any variation in Abstract at least 2x – good score
Any variation in first 500 words at least 2x – suspect
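The four conditions above can be read as a tiered check, evaluated from strongest to weakest. The sketch below assumes a document is a dict with hypothetical fields title, mesh, abstract, and body, and that the caller supplies the list of name variants; none of these names come from the project itself.

```python
import re


def count_mentions(text, variants):
    """Count case-insensitive whole-word occurrences of any variant."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return len(pattern.findall(text))


def drug_mention_level(doc, variants):
    """Classify how prominently a drug is mentioned in one document.

    Tiers mirror the slide: title -> high; MeSH keyword AND abstract ->
    high; abstract at least twice -> good; first 500 words at least
    twice -> suspect; otherwise none.
    """
    in_title = count_mentions(doc.get("title", ""), variants)
    in_mesh = count_mentions(" ".join(doc.get("mesh", [])), variants)
    in_abstract = count_mentions(doc.get("abstract", ""), variants)
    first_500 = " ".join(doc.get("body", "").split()[:500])
    in_start = count_mentions(first_500, variants)

    if in_title >= 1:
        return "high"
    if in_mesh >= 1 and in_abstract >= 1:
        return "high"
    if in_abstract >= 2:
        return "good"
    if in_start >= 2:
        return "suspect"
    return "none"
```

Ordering the checks this way is what separates a major mention from a passing one: a single mention buried in the body never qualifies.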
Rules for Drug Names and Diseases
Results:
– Wide range by type: 70-100% recall and precision
Focus mostly on precision – difficult to test recall
One deep dive area indicated that 90%+ scores for both precision
and recall could be built with moderate level of effort
Not linear effort – 30% accuracy does not mean 1/3 done
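For reference, precision and recall are computed as below. The function name and the toy sets in the usage note are illustrative only and are not project data. Recall needs a complete list of true entities per document, which is one reason it is harder to test than precision.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of extracted items.

    precision = correct extractions / all extractions
    recall    = correct extractions / all true items (needs a full gold set)
    """
    predicted = set(predicted)
    gold = set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, extracting four items of which three are correct, against five true items, gives precision 0.75 and recall 0.6.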
Conclusion
Project was a success!
Useful results – as defined by the customer
Reasonable and doable effort level – both for initial development
and maintenance
Essential Success Factors
– Rules not documents, training sets (starting point)
– Full platform for disambiguation of noun phrase extraction,
major-minor mention
– Separation of logic and text
“Semantic” Search works!
– If you do it smart!
Publishing Project: Reed Construction Data
700,000 Proposals – Wide Variation
Process Proposals – extract data – 30-50 types
Current Manual Process – Internal Teams
– Expensive and Slow
Structure Variety of Unstructured Documents
– Generate Table of Contents
– Generate Sections and Capture Text
Extract Key Information
Save Time & Money, Flexible Hiring, New Offerings
Publishing Project: Components – Technology, Resources
Initial Attempt – failed target, too expensive to complete
KAPS Group and SAS – Enterprise Content Categorization
– Team of 4 – mostly part time
Reed Data Resources – 3 part time +, Current team of
proposal processors – develop test documents
4 Months – majority of time/effort on Key Data Extraction
Sections – by Construction codes & text, Automated Table
of Contents
Publishing Project: Example Rules
Automated Table of Contents
(AND,
(OR,
(ORD,"[SectionHeaderTags]","[Division01B_RegEx]","[TechnicalSpecPhrases]",
(ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"
)),
(ORD,"[Division01B_RegEx]","[TechnicalSpecPhrases]",
(ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"
))))
__Division01BRegEx
00[0-9][0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9]
Abandonment, Abatement, Abbreviations, Above-Grade, Aboveground, Abrasion-Resistant,
Abrasive, Absorption, AC, Acceleration, etc. – ~2,000 terms
Section Header Tags – “Section, Division, Document”
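The three Division 01 patterns listed above can be tried out directly in Python. This sketch only covers the regular-expression part, not the ORD / ORDDIST_3 ordering logic; the word-boundary anchors and the name find_division_codes are additions for illustration and are not part of the original rule.

```python
import re

# The three patterns from the slide, unchanged.
DIVISION_PATTERNS = [
    r"00[0-9][0-9][0-9]",
    r"00[ _-]?[0-9][0-9][ _-]?[0-9][0-9]",
    r"00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9]",
]

# Try the most specific pattern first so "00 41 13.16" is not cut short.
# The \b anchors (an assumption) keep digits inside longer numbers from matching.
DIVISION_RE = re.compile(
    r"\b(?:" + "|".join(reversed(DIVISION_PATTERNS)) + r")\b"
)


def find_division_codes(text):
    """Return every construction-code match in reading order."""
    return DIVISION_RE.findall(text)
```

Allowing an optional space, underscore, or hyphen between digit pairs is what lets one rule cover the many ways proposal authors write the same code.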
Publishing Project: Example Rules
Key Data Extraction
Bid Dates/Times
Roles (Architect, Designer, etc.) – names and addresses, etc.
Project Attributes – Cost, Invitation Number, Parking, etc.
Some Easy, Some Hard – Address!
Example
ARCHITECT:
MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET
BROOKLINE, MA 02445
P: (617) 739-6925
F: (772) 325-2991
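A block as regular as the example above is one of the easy cases. The sketch below parses that layout only, assuming a role line, a name line, street lines, a "CITY, ST ZIP" line, and P:/F: numbers; the role list, regexes, and field names are hypothetical, and real proposals vary far more than this handles.

```python
import re

# Hypothetical role list; the slide names Architect and Designer.
ROLE_RE = re.compile(r"^(ARCHITECT|DESIGNER):$")
CITY_STATE_ZIP_RE = re.compile(
    r"^(?P<city>[A-Z .'-]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)
PHONE_RE = re.compile(r"^(?P<kind>[PF]):\s*(?P<number>\(\d{3}\)\s*\d{3}-\d{4})$")


def parse_role_block(block):
    """Parse one role block into a dict; return {} if no role header."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    role = ROLE_RE.match(lines[0])
    if not role:
        return {}
    record = {"role": role.group(1).title(), "name": None, "street": []}
    for line in lines[1:]:
        phone = PHONE_RE.match(line)
        place = CITY_STATE_ZIP_RE.match(line)
        if phone:
            key = "phone" if phone.group("kind") == "P" else "fax"
            record[key] = phone.group("number")
        elif place:
            record.update(place.groupdict())
        elif record["name"] is None:
            record["name"] = line  # first free-text line is the firm name
        else:
            record["street"].append(line)
    return record
```

Addresses become hard once lines are merged, reordered, or split across columns, which is where rule tuning against test documents comes in.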
Publishing Project: Process & Approach
Publishing Project: Example Rules
Key Project Data
Conclusion: Lessons Learned
Development requires lots of content, testers, regular meetings
Best Pattern Rule Development = develop a few rules to
production level, then adapt to other areas
Hybrid Solutions are best (AI not here yet)
Biggest Problem = Human Creativity
Best Solution = Human Creativity
But – successful project!
Foundation laid for Semi-automated text processing, new data
Next Steps – refine, add, refine, new, refine, refine
Summary
Text Analytics: Platform & Foundation for Applications
Semantic Search and (Semi)-Automated Business Processes
AND – Sentiment Analysis-Social Media, Fraud Detection,
eDiscovery, Expertise location & analysis, behavior prediction
Data/Fact Extraction can feed/extend Big Data and Semantic
Technology applications
Interested?
– Text Analytics World, San Francisco March 30-April 1
• (Call for Speakers Now)-textanalyticsworld.com
New Book coming: Text Analytics: Everything You Need to
Know to Conquer Information Overload, Mine Social Media for
Real Value, and Turn Big Text into Big Data
Questions?
Tom Reamy
[email protected]
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com March 30-April 1, San Francisco