AnalyzingText_SQLServer2014_Rx

Download Report

Transcript AnalyzingText_SQLServer2014_Rx

Analyzing Text with SQL Server
2014, R, AND Azure ML
Dejan Sarka
Introduction
• Dejan Sarka
– [email protected],
[email protected], @DejanSarka
– Data Scientist
– MCT, SQL Server MVP
– 30 years of data modeling,
data mining and data
quality
• 13 books
• ~10 courses
2
Agenda
•
•
•
•
•
•
Text Mining
Using R
Azure ML NER
Full-Text Search
Statistical Semantic Search
Conclusion
Introduction to Text Mining
• Text mining = analysis of text
• Text mining cannot be done with a single
tool in SQL Server
• Use SSIS for preparation
• Use SSAS Data Mining for in-depth
analysis
– Can use SSAS BISM for manual analysis
– Can use SSRS for presentation
Text Mining in SSIS
• SSIS integrates text mining into the data flow
– This advanced feature can source data just like
any other component, and the results can be
handled just like the output from any other
component
– Routed, sorted, aggregated, cleansed,
transformed and loaded...
– Source can be a database table with a comments
field; it could be XML (RSS for example) or any
other text source
• The two components to explore are Term
Extraction and Term Lookup
Term Extraction (1)
• Term Extraction enables retrieving the key terms
from a Unicode string or text column (DT_WSTR or
DT_NTEXT)
– It uses its own dictionary and linguistic information
about English
– Can be used with other languages, but results are
worse
• It can extract nouns only, noun phrases only, or
both nouns and noun phrases
– Articles and pronouns are not extracted
• It breaks text into sentences, and sentences into
words
• It normalizes capitalization of words
• It also stems nouns to extract the singular form
Term Extraction (2)
• The Term Extraction transformation scores
each term
– It scores terms based on a number of factors,
including its English grammar and syntax
– The output includes only two columns - the
extracted terms and the score
• Score can be TF (term frequency) or TFIDF
(term frequency / inverse document
frequency)
– TFIDF = TF * LN(n of docs with term / TF)
– TFIDF lowers the score for terms that appear in
many documents
– Emphasizes terms that are frequent in lower
number of documents
Term Extraction (3)
• Can use exclusion terms
– If you analyze text from specific area only, some
terms (e.g. “SQL Server” in texts about SQL
Server) can become noisy
– The transformation skips extraction of exclusion
terms
– Must be stored in a SQL Server or Access table
• It is a blocking transformation
– Need to consume complete upstream data before
releasing any row downstream
• You should test it with different options to get
the result that suits your needs
Term Lookup (1)
• The Term Lookup transformation matches
terms extracted from text in a transformation
input column with terms in a reference table
– Counts how many times a term appears in a
document
– Before performing lookup, it extracts words using
the same method as the Term Extraction
transformation
– If a document contains terms that overlap in the
reference set, it returns only one lookup result
– E.g., term “Microsoft Windows Vista SP1” is in
document, “Microsoft Windows” and “Windows
Vista SP1” both in reference set, only “Microsoft
Windows” is returned
Term Lookup (2)
• Reference set is a set of terms in a lookup
table
– Usually result of Term Extraction
– Can edit it manually
– Must be stored in a SQL Server or Access
table
• It is a semi-blocking transformation
– Holds up records in the Data Flow for a period
of time before passes memory buffers
downstream
Further Analysis
• Mine Term Lookup results
– Clustering algorithm groups documents in clusters
based on similarity of term occurrences
– Association Rules detects cross-correlation
between key words and phrases
– Classification algorithms, like Decision Trees, can
use key words and phrases to predict the class of
a document
• UDM cubes and reports can present Term
Lookup results
• You can also present Term Extraction results
directly
Problems
• The only supported language for Term
Extraction and Term Lookup is English
– They have no clue about syntax and grammar
in other languages
• Both transformations do not support
custom delimiters
– Makes them even less useful for non-English
languages
Analyzing Text in R (1)
• Define “corpus” – set of documents to analyze
• Read the documents from several sources in
several supported formats
– R data frame, directory, URI, R vector, XML
– CSV, DOC, PDF, XML
• Use different transformations to prepare the text
for analysis
– Change special characters, lowercase, remove
punctuation, remove stopwords, strip space
– Use stemming
13
Analyzing Text in R (2)
• Create document term matrix
• Calculate term frequencies
– Remove sparse terms
– Find most frequent terms
• Find associations
• Use different plots
–
–
–
–
Term frequency
Word clouds
Term length frequency
Letter frequency
14
Analyzing Text in R (3)
15
Azure ML NER
• Named Entity Recognition
– Identifies proper names
– Classifies names to categories
• Categories can be universal or local
– Person, location,…
– DBMS, programming language,…
• Azure ML NER accepts two inputs
– Story – texts to analyze
– Custom Resources – local linguistic resources
16
With FTS/SSS, Search for:
•
•
•
•
•
•
•
•
Simple terms, i.e. one or more specific words or phrases
Prefix terms, terms the words or phrases begin with
Generation terms, meaning inflectional forms of words
Proximity terms, or words or phrases close to another
word or phrase
Thesaurus terms, or synonyms of a word
Weighted terms, words or phrases using values with your
custom weight
Statistical semantic search, or key phrases in a document
Similar documents, where similarity is defined by
semantic key phrases
17
FTS/SSS Components (1)
• Install FTS/SS with Setup
• Install document filters
– Download Office 2007 / 2010 filters an load them
– Columns of data type VARBINARY,
VARBINARY(MAX), IMAGE, or XML require additional
type column in which you store the file extension
• Word breakers and stemmers perform languagespecific linguistic analysis on all full-text data
– Can use English if a language is not supported
18
FTS/SS Components (2)
• Can prevent indexing noise words by creating
stoplists of stopwords
• FTS finds synonyms in thesaurus files
– Each language has an associated XML thesaurus file path SQL_Server_install_path\Microsoft SQL
Server\MSSQL12.MSSQLSERVER\MSSQL\FTDATA\
– You can manually edit each thesaurus file and load it
• Can search on document properties
– Searchable properties depend on the document filter
– Can create search property list
19
FTS/SS Catalogs and Indexes
• FT indexes are stored in FT catalogs
– A full-text catalog is a virtual object, a
container for full-text indexes
– As a virtual object, it does not belong to any
filegroup
• A FT Index is a physical object
• For semantic search, install the Semantic
Language Statistics Database
• Can test FTS parsing and stemming with
sys.dm_fts_parser
20
The CONTAINS Predicate (1)
• Search for exact or fuzzy matches
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘SearchWord1’) – simple term
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘SearchWord1 OR SearchWord2’) – simple term with
a logical operator
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘”SearchWord1 SearchWord2”’) – phrase term
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘”SearchWord1*”’) – prefix term
21
The CONTAINS Predicate (2)
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘NEAR(SearchWord1, SearchWord2)’) – simple
proximity term
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘NEAR((SearchWord1, SearchWord2), distance)’) –
proximity term with distance
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘NEAR((SearchWord1, SearchWord2), distance, flag)’)
– proximity term with distance and order of words (flag
= True | False)
22
The CONTAINS Predicate (3)
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘FORMSOF(INFLECTIONAL, SearchWord1)’) generation term
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘FORMSOF(THESAURUS, SearchWord1)’) generation term with synonyms
– SELECT…FROM…WHERE CONTAINS(FTcolumn,
‘ISABOUT(SearchWord1 weight(w1), SearchWord2
weight(w2))’) - weighted term (not useful for
CONTAINS)
– SELECT…FROM…WHERE
CONTAINS(PROPERTY(Ftcolumn, ‘PropertyName’),
‘SearchWord1’) - property search
23
The FREETEXT Predicate
• The FREETEXT predicate is less specific
and thus return more rows than the
CONTAINS predicate
– SELECT…FROM…WHERE
FREETEXT(FTcolumn, ‘SearchWord1
SearchWord2’) - you are searching for rows
where the FTcolumn includes any of the
inflectional forms and any of the defined
synonyms of the words SearchWord1 and
SearchWord2
24
FTS Functions
• The CONTAINSTABLE and FREETEXTTABLE
functions return two columns: KEY and RANK
– The KEY column is the unique key
– RANK - a value between 0 and 1000 telling you how
well a row matches your search criteria
• The number is always relative to a query
• The calculation takes into account term frequency, number of
words in a document, proximity terms, weight, number of
indexed rows, …
• Different calculation for the CONTAINSTABLE and for the
FREETEXTTABLE, as the later does not support majority of
the parameters
25
SSS Functions
• SEMANTICKEYPHRASETABLE
( table, { column | (column_list) | * } [ , source_key ] ) returns a table with key phrases associated with the fulltext indexed column from the column_list
• SEMANTICSIMILARITYDETAILSTABLE
( table, source_column, source_key, matched_column,
matched_key ) - returns a table with key phrases that are
common across two documents
• SEMANTICSIMILARITYTABLE
( table, { column | (column_list) | * }, source_key ) returns a table with documents scored by semantic
similarity to the searched document specified with the
source_key parameter
26
Problems
• FTS supports > 50 languages
– Including Slovak and Slovenian
• SS supports 15 languages
– Excluding Slovak and Slovenian
• A SS “term” consists of a single word
• So how to analyze texts in Slovak or
Slovenian?
a) Not with SQL Server tools
b) Custom application with sys.dm_fts_parser
c) Use a translator and then SQL Server tools
So What Does “kuraci” Mean?
Really?
Q&A
• Come to SQL Saturday Ljubljana,
December 12th, 2015!
• Thank you!
30