Taxonomy Development Workshop

Smart Text
How to Turn Big Text into Big Data
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp, KMWorld: Washington DC
Internet Librarian: Monterey, CA
KAPS Group: General
 Knowledge Architecture Professional Services – Network of Consultants
 Partners – Expert System, SAS, SAP, IBM, FAST, Smart Logic,
Concept Searching, Attensity, Clarabridge, Lexalytics
 Strategy – IM & KM - Text Analytics, Social Media, Integration
 Services:
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Fast Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
 Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, etc.
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com
Agenda
 Introduction: Big Text and Big Data
 Pharma: Semantic Search Application
– Project Components & Approach
– Extraction Rules
 Publishing: Processing 700K Proposals
– Adding Structure to Unstructured Text
– Text into Data
 Conclusions
Big Text and Big Data
 Big Text is Bigger than Big Data
– 80% -> 90% of business information (Social Media)
 Big Data tells you WHAT
– Smart Text tells you WHY
 Big Data – Data Munging = 50-80% of Data Scientist Time
– Variety of Formats // Ambiguity of Human Language
 Ontology / Fact Extraction – Pulmonary ISA Disease
– Chronic obstructive pulmonary disease, obstructive pulmonary disease, Copd, copd,
COPD, Asthma (Asthema), Emphysema, etc.
 Semi-Automatic Hybrid Solutions
– AI not here yet (again)
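The variant problem on this slide (one disease, many surface forms) can be illustrated with a minimal Python sketch. It is not from the talk: the variant table, the preferred labels, and the function name `extract_concepts` are assumptions made for illustration, and a real project would load a far larger vocabulary from a taxonomy.

```python
import re

# Hypothetical variant table: surface form (lowercased) -> preferred concept label.
VARIANTS = {
    "chronic obstructive pulmonary disease": "COPD",
    "obstructive pulmonary disease": "COPD",
    "copd": "COPD",
    "asthma": "Asthma",
    "asthema": "Asthma",  # misspelling shown on the slide
    "emphysema": "Emphysema",
}

# Longest variants first, so a full phrase wins over a phrase it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_concepts(text):
    """Return the preferred label for every variant found, in order of appearance."""
    return [VARIANTS[m.group(1).lower()] for m in _PATTERN.finditer(text)]
```

Called on "Patients with Copd or chronic obstructive pulmonary disease, and Asthema.", the sketch maps all three mentions to two concepts, which is the kind of clean-up that lets text feed a structured data store.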
Pharma: Project
 Agile Methodology
 Goal – evaluate text analysis technologies' ability to:
– Replace manual annotation of scientific documents –
automated or semi-automated
– Discover new entities and relationships
– Provide users with self-service capabilities
 Goal – feasibility and effort level
Components – Technology, Resources
 Cambridge Semantics, Linguamatics, SAS Enterprise Content
Categorization
– Initial integration – passing results as XML
 Content – scientific journal articles
 Taxonomy – MeSH – select small subset
 Access to a “customer” – critical for success
Three rounds - Iterations
 Visualization – faceted search, sort by date, author, journal
– Cambridge Semantics
 Round 1 – PDF from their database
– Needed to create additional structure and metadata
– No such thing as unstructured content
 Round 2 & 3 – XML with full metadata from PubMed
 Entity Recognition – Species, Document Type, Study Type, Drug
Names, Disease Names, Adverse Events
Components & Approach
 Rules or sample documents?
– Need more precision and granularity than documents can do
– Training sets – not as easy as thought
 First Rules – text indicators to define sections of the document
– Objectives, Abstract, Purpose, Aim – all the “same” section
– Experiment – clusters / vocabulary to define section
 Separate logic of the rules from the text
– Stable rules, changing text
 Scores – relevancy with thresholds
– Not just frequency of words
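The approach on this slide (section indicators, logic kept apart from vocabulary, scores checked against a threshold) can be sketched in Python. This is an illustration under stated assumptions, not the project's rule code: the section names, the indicator lists beyond those on the slide, the term weights, and both function names are hypothetical.

```python
# Vocabulary (the "text"): expected to change often, so it lives apart from the logic.
# "Objectives, Abstract, Purpose, Aim" come from the slide; the rest is hypothetical.
SECTION_INDICATORS = {
    "objective": ["objectives", "objective", "abstract", "purpose", "aim"],
    "methods": ["methods", "materials and methods"],
    "results": ["results"],
}


def label_sections(lines, indicators=SECTION_INDICATORS):
    """Rule logic: a line that is exactly an indicator opens that section."""
    labelled, current = [], None
    for line in lines:
        head = line.strip().lower().rstrip(":")
        for section, words in indicators.items():
            if head in words:
                current = section
                break
        else:
            # Not a heading: attach the line to the section currently open.
            labelled.append((current, line))
    return labelled


def relevancy(text, weighted_terms, threshold):
    """Sum the weights of the terms present and compare against a threshold."""
    lowered = text.lower()
    score = sum(weight for term, weight in weighted_terms.items() if term in lowered)
    return score, score >= threshold
```

Because the indicator lists are plain data, new headings can be added without touching `label_sections`, which is one way to read "stable rules, changing text".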
Document Type Rules
 (START_2000, (AND, (OR, _/article:"[Abstract]",
_/article:"[Methods]", _/article:"[Objective]",
_/article:"[Results]", _/article:"[Discussion]", (OR,
_/article:"clinical trial*", _/article:"humans",
(NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),
 Clinical Trial Rule:
 If the article has sections like Abstract or Methods
 AND has phrases around “clinical trials / Humans” and not words
like “animals” within 5 words of “clinical trial” words – count it and
add up a relevancy score
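The plain-language reading of the Clinical Trial Rule can be approximated in Python. This is a rough sketch and not the rule engine's behavior: it assumes START_2000 means the first 2,000 characters, treats DIST_5 as a window of five words on either side, and the function name `clinical_trial_score` is invented for the example.

```python
import re

SECTION_WORDS = {"abstract", "methods", "objective", "results", "discussion"}
EXCLUDE_WORDS = {"approved", "safe", "use", "animals"}


def clinical_trial_score(text, start=2000, window=5):
    """Count trial phrases near the start that have no excluded word close by."""
    head = text[:start]  # assumption: START_2000 counts characters
    words = re.findall(r"[A-Za-z]+", head.lower())
    if not SECTION_WORDS & set(words):
        return 0  # no recognisable section marker, so the rule does not fire
    score = 0
    for i, word in enumerate(words):
        next_word = words[i + 1:i + 2]
        is_trial = word == "humans" or (
            word == "clinical" and next_word in (["trial"], ["trials"])
        )
        if not is_trial:
            continue
        nearby = words[max(0, i - window):i + window + 1]
        if not EXCLUDE_WORDS & set(nearby):
            score += 1  # each clean hit adds to the relevancy score
    return score
```

A text such as "Methods: The clinical trial was performed in animals." scores 0 here because "animals" falls inside the window, while a human trial abstract accumulates one point per clean phrase.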
Rules for Drug Names and Diseases
 Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
 Taxonomy of drug names and diseases
 Capture general diseases like thrombosis and specific types like
deep vein, cerebral, and cardiac
 Combine text about arthritis and synonyms with text like “Journal
of Rheumatology”
Rules for Drug Names and Diseases
(OR, _/article/title:"[clonidine]",
(AND, _/article/mesh:"[clonidine]", _/article/abstract:"[clonidine]"),
(MINOC_2, _/article/abstract:"[clonidine]"),
(START_500, (MINOC_2, "[clonidine]")))
 Means – any variation of drug name in title – high score
 Any variation in MeSH keywords AND in abstract – high score
 Any variation in abstract at least 2x – good score
 Any variation in first 500 words at least 2x – suspect
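The four tiers described above (title, MeSH plus abstract, abstract twice, first 500 words twice) can be expressed as a small Python function. This is a sketch of the logic only: the article is assumed to be a dict with hypothetical keys `title`, `mesh`, `abstract`, and `body`, and the variant list passed in is illustrative rather than taken from the project's taxonomy.

```python
import re


def count_mentions(text, variants):
    """Count whole-word, case-insensitive occurrences of any variant."""
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def mention_level(article, variants):
    """Return 'high', 'good', 'suspect', or None for one drug in one article."""
    title = count_mentions(article.get("title", ""), variants)
    mesh = count_mentions(" ".join(article.get("mesh", [])), variants)
    abstract = count_mentions(article.get("abstract", ""), variants)
    first_500 = " ".join(article.get("body", "").split()[:500])
    if title >= 1:
        return "high"     # any variation in the title
    if mesh >= 1 and abstract >= 1:
        return "high"     # MeSH keyword AND abstract
    if abstract >= 2:
        return "good"     # at least twice in the abstract
    if count_mentions(first_500, variants) >= 2:
        return "suspect"  # at least twice in the first 500 words
    return None
```

Checking the tiers from strongest to weakest means a drug named in the title is never downgraded by weaker evidence elsewhere, which matches the intent of separating major from minor mentions.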
Rules for Drug Names and Diseases
 Results:
– Wide Range by type -- 70-100% recall and precision
 Focus mostly on precision – difficult to test recall
 One deep dive area indicated that 90%+ scores for both precision
and recall could be built with moderate level of effort
 Not linear effort – 30% accuracy does not mean 1/3 done
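For reference, precision and recall for one entity type can be computed as below. This is a generic sketch, not the project's evaluation code; the function name and the document IDs in the usage note are hypothetical.

```python
def precision_recall(predicted, expected):
    """Compare the items a rule tagged against a hand-checked reference set."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall
```

With `predicted = {"d1", "d2", "d3", "d4"}` and `expected = {"d1", "d2", "d3", "d5", "d6"}` the result is `(0.75, 0.6)`. Precision needs only a review of what the rule returned, while recall needs the full `expected` set, which is why recall is the harder of the two to test.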
Conclusion
 Project was a success!
 Useful results – as defined by the customer
 Reasonable and doable effort level – both for initial development
and maintenance
 Essential Success Factors
– Rules not documents, training sets (starting point)
– Full platform for disambiguation of noun phrase extraction,
major-minor mention
– Separation of logic and text
 “Semantic” Search works!
– If you do it smart!
Publishing Project: Reed Construction Data
 700,000 Proposals – Wide Variation
 Process Proposals – extract data – 30-50 types
 Current Manual Process – Internal Teams
– Expensive and Slow
 Structure Variety of Unstructured Documents
– Generate Table of Contents
– Generate Sections and Capture Text
 Extract Key Information
 Save Time & Money, Flexible Hiring, New Offerings
Publishing Project: Components:
Technology, Resources
 Initial Attempt – failed target, too expensive to complete
 KAPS Group and SAS – Enterprise Content Categorization
– Team of 4 – mostly part time
 Reed Data Resources – 3 part time +, Current team of
proposal processors – develop test documents
 4 Months – majority of time/effort on Key Data Extraction
 Sections – by Construction codes & text, Automated Table
of Contents
Publishing Project: Example Rules
Automated Table of Contents
 (AND,
(OR,
(ORD,"[SectionHeaderTags]","[Division01B_RegEx]","[TechnicalSpecPhrases]",
(ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]")),
(ORD,"[Division01B_RegEx]","[TechnicalSpecPhrases]",
(ORDDIST_3,"[SectionBodyPart]","[SectionBodyDesc]"))))
 __Division01BRegEx:
00[0-9][0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9]
 Technical Spec Phrases – Abandonment, Abatement, Abbreviations, Above-Grade, Aboveground, Abrasion-Resistant,
Abrasive, Absorption, AC, Acceleration, etc. – ~2,000 terms
 Section Header Tags – “Section, Division, Document”
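The heading part of this rule (an optional header tag, then a division code, then a technical phrase, in that order) can be approximated with Python's `re` module. This is a sketch under assumptions: the three slide patterns are folded into one expression, only the first three of the roughly 2,000 phrases are listed, the ORDDIST_3 body test is left out, and the names `HEADING` and `find_headings` are invented for the example.

```python
import re

# "00" plus two digit pairs with optional space/underscore/hyphen separators and
# an optional ".NN" suffix, or "00" plus three digits (the three slide patterns).
DIVISION_CODE = r"\b00(?:[ _-]?[0-9]{2}[ _-]?[0-9]{2}(?:\.[0-9]{2})?|[0-9]{3})\b"

SECTION_HEADER_TAGS = ["section", "division", "document"]
TECHNICAL_SPEC_PHRASES = ["abandonment", "abatement", "abbreviations"]  # first 3 of ~2,000


def _any_of(terms):
    """Build a non-capturing alternation from a list of literal terms."""
    return "(?:" + "|".join(re.escape(t) for t in terms) + ")"


# Ordered match: optional header tag, then a code, then a spec phrase.
HEADING = re.compile(
    r"(?:\b" + _any_of(SECTION_HEADER_TAGS) + r"\b.*?)?"
    r"(?P<code>" + DIVISION_CODE + r")"
    r".*?\b(?P<phrase>" + _any_of(TECHNICAL_SPEC_PHRASES) + r")\b",
    re.IGNORECASE,
)


def find_headings(lines):
    """Return (code, phrase) pairs for lines that look like section headings."""
    found = []
    for line in lines:
        match = HEADING.search(line)
        if match:
            found.append((match.group("code"), match.group("phrase")))
    return found
```

The list of pairs returned by `find_headings` is the raw material for a generated table of contents; the text between two consecutive headings would become the section body.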
Publishing Project: Example Rules
Key Data Extraction
 Bid Dates/Times
 Roles (Architect, Designer, etc.) – names and addresses, etc.
 Project Attributes – Cost, Invitation Number, Parking, etc.
 Some Easy, Some Hard – Address!
 Example
ARCHITECT:
MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET
BROOKLINE, MA 02445
P: (617) 739-6925
F: (772) 325-2991
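Turning a block like the one above into fields can be sketched with a few regular expressions in Python. This is an illustration only, assuming a clean block in exactly this layout: the role list, the field names, and `parse_contact` are hypothetical, and real proposals vary far more, which is why addresses count as hard.

```python
import re

BLOCK = """ARCHITECT:
MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET
BROOKLINE, MA 02445
P: (617) 739-6925
F: (772) 325-2991"""

ROLE = re.compile(r"^(ARCHITECT|DESIGNER|ENGINEER|OWNER):\s*$")  # hypothetical role list
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>[A-Z .'-]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)
PHONE = re.compile(r"^(?P<kind>[PF]):\s*(?P<number>\(\d{3}\)\s*\d{3}-\d{4})$")


def parse_contact(block):
    """Parse one 'ROLE:' block into a dict; return {} if it does not start with a role."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    if len(lines) < 2 or not ROLE.match(lines[0]):
        return {}
    record = {
        "role": ROLE.match(lines[0]).group(1).title(),
        "name": lines[1],
        "street": [],
    }
    for line in lines[2:]:
        phone = PHONE.match(line)
        place = CITY_STATE_ZIP.match(line)
        if phone:
            key = "phone" if phone.group("kind") == "P" else "fax"
            record[key] = phone.group("number")
        elif place:
            record.update(place.groupdict())
        else:
            record["street"].append(line)  # anything unrecognised is kept as street text
    return record
```

`parse_contact(BLOCK)` yields one record with role, name, street, city, state, zip, phone, and fax, which is the "text into data" step: each field can then be written to a database column.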
Publishing Project: Process & Approach
Publishing Project: Example Rules
Key Project Data
Conclusion: Lessons Learned
 Development requires lots of content, testers, regular meetings
 Best Pattern Rule Development = develop a few rules to
production level, then adapt to other areas
 Hybrid Solutions are best (AI not here yet)
 Biggest Problem = Human Creativity
 Best Solution = Human Creativity
 But – successful project!
 Foundation laid for Semi-automated text processing, new data
 Next Steps – refine, add, refine, new, refine, refine
Summary
 Text Analytics: Platform & Foundation for Applications
 Semantic Search and (Semi)-Automated Business Processes
 AND – Sentiment Analysis-Social Media, Fraud Detection,
eDiscovery, Expertise location & analysis, behavior prediction
 Data/Fact Extraction can feed/extend Big Data and Semantic
Technology applications
 Interested?
– Text Analytics World, San Francisco, March 30-April 1
• (Call for Speakers Now) – textanalyticsworld.com
 New Book coming: Text Analytics: Everything You Need to
Know to Conquer Information Overload, Mine Social Media for
Real Value, and Turn Big Text into Big Data
Questions?
Tom Reamy
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com March 30-April 1, San Francisco