Quality 2014
Wien, June 2-5 2014
Use of web scraping and text mining
techniques in the Istat survey on
“Information and Communication
Technology in enterprises”
Giulio Barcaroli(*), Alessandra Nurra(*), Marco Scarnò(**),
Donato Summa(*)
(*) Italian National Institute of Statistics (Istat)
(**) Cineca
The “ICT in enterprises” survey
• In Italy, the survey covers a universe of 211,851 enterprises with at
least 10 employees, by means of a sample survey involving 19,186 of them
(2011).
• In the 2013 round of the survey, 8,687 enterprises indicated their
website (45% of the responding sample units).
• Accessing the indicated websites to gather information directly from
them opens up several opportunities.
The “ICT in enterprises” survey
Action | Target
1. Substitute the traditional questionnaire-based collection technique with a new Internet as Data Source (IaD) one, for all suitable questions | Reduction of respondent burden
2. Integrate the information collected via questionnaire with the information collected via IaD | Increase of accuracy of estimates
3. Collect additional information | Increase the offer of statistical information
Predictive approach vs Content Analysis
We assume that our target is to increase the accuracy of estimates by using
data originating from the Internet as auxiliary data.
In this particular case, the auxiliary data are textual. Texts are a “perfect”
example of unstructured data, which is one of the characteristics of most Big
Data.
First, the usual model-based approach will be followed, which requires the
prediction of values at unit level: under this approach, the target is to
maximise the correctness of classification for each unit in the reference
population.
Next, a different approach will be illustrated, in which the prediction of
values at unit level is no longer required and the target becomes to maximise
directly the accuracy at the aggregate level (accuracy of the estimates).
Predictive approach
In a predictive approach, the subset of data related to the sampled respondent
units can be considered as the labeled data, and supervised learning methods
can be applied.
In other words, the subset of 8,687 enterprises that reported having a
website or a home page, and that also answered questions [B8a : B8g], can be
used as the training and test set on which different models are estimated in
order to predict the answers to questions [B8a : B8g] for the whole reference
population.
[Diagram with the elements: Texts (website content); Survey microdata; Text and data mining; Model]
Predictive approach
In our case, we can apply one of the following supervised learning methods:
• Classification Trees;
• “ensembles” (Bootstrap Aggregating, Adaptive Boosting, Random Forests);
• Supervised Latent Dirichlet Allocation for classification (SLDA);
• Neural Networks;
• Logistic Regression;
• Support Vector Machines;
• Naïve Bayes.
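To make the last learner in the list concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with Laplace smoothing, written in plain Python. It is only an illustration under stated assumptions: the function names, the whitespace tokenisation and the four toy "website texts" are invented for the example and are not the authors' implementation or survey data.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model with Laplace (add-one) smoothing."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        model[c] = {
            "log_prior": math.log(class_counts[c] / len(labels)),
            "log_lik": {
                w: math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
                for w in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-probability."""
    def score(c):
        params = model[c]
        # Words never seen in training are simply ignored.
        return params["log_prior"] + sum(
            params["log_lik"][w] for w in doc.split() if w in params["log_lik"]
        )
    return max(model, key=score)


# Toy, invented texts (NOT survey data). Label 1 = the site offers online
# ordering or booking (question B8a), label 0 = it does not.
train_texts = [
    "add to cart checkout order online",
    "book now reservation payment",
    "company history mission contacts",
    "about us our team contacts",
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_texts, train_labels)
pred_shop = predict(model, "order online and checkout")
pred_info = predict(model, "our company mission")
print(pred_shop, pred_info)
```

In the setting described in the slides, the training texts would be the scraped content of the websites of the responding enterprises and the labels their answers to the B8 questions.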
Evaluation of predictive models
From the error matrix it is possible to compute the following indicators:
Indicator | Expression | Meaning
Accuracy (precision) | (TP + TN) / Total | Rate of correctly classified cases
Sensitivity (true positive rate) | TP / (TP + FN) | Rate of positive cases correctly classified
Specificity (true negative rate) | TN / (FP + TN) | Rate of negative cases correctly classified
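The three indicators can be computed directly from the four cells of the error matrix. The sketch below follows the expressions given in the slides; the function name and the counts are invented purely for illustration.

```python
def error_matrix_indicators(tp, fp, fn, tn):
    """Compute accuracy, sensitivity and specificity from error-matrix counts.

    tp/fp: true/false positives, fn/tn: false/true negatives.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,      # rate of correctly classified cases
        "sensitivity": tp / (tp + fn),      # rate of positives correctly classified
        "specificity": tn / (fp + tn),      # rate of negatives correctly classified
    }


# Invented counts, for illustration only (not taken from the survey).
indicators = error_matrix_indicators(tp=50, fp=20, fn=50, tn=380)
for name, value in indicators.items():
    print(f"{name}: {value:.2f}")
```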
Evaluation of predictive models
Application of different learners to predict question B8a “Online ordering or
reservation or booking (Yes/No)”
Evaluation of predictive models
In general, when false positives and false negatives are not balanced in
absolute terms, the distribution of predicted values can differ significantly
from the distribution of observed cases.
From these results, the Naïve Bayes predictor can be considered the most
convenient: although its precision (78%) is the lowest, its sensitivity is the
highest, its specificity is good, and the observed and predicted proportions
are perfectly aligned.
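The link between misclassification balance and the predicted distribution can be checked with a little arithmetic: the predicted share of positives equals the true positives plus the false positives, so it matches the observed share only when false negatives and false positives cancel out. The sketch below uses the rounded values reported in these slides for question B8a with Naïve Bayes (observed proportion 0.21, sensitivity 0.50, specificity 0.86); the function names are assumptions made for the example, and because the inputs are rounded the result only approximately reproduces the reported predicted proportion.

```python
def predicted_positive_share(p_observed, sensitivity, specificity):
    """Share of units predicted positive = true positives + false positives."""
    true_positives = p_observed * sensitivity
    false_positives = (1 - p_observed) * (1 - specificity)
    return true_positives + false_positives


def false_negative_share(p_observed, sensitivity):
    """Share of all units that are positive but predicted negative."""
    return p_observed * (1 - sensitivity)


def false_positive_share(p_observed, specificity):
    """Share of all units that are negative but predicted positive."""
    return (1 - p_observed) * (1 - specificity)


# Rounded values for question B8a as reported in the slides.
p_obs, sens, spec = 0.21, 0.50, 0.86

share = predicted_positive_share(p_obs, sens, spec)
fn_share = false_negative_share(p_obs, sens)
fp_share = false_positive_share(p_obs, spec)

print(f"predicted share of positives: {share:.4f}")
print(f"false negatives: {fn_share:.4f}, false positives: {fp_share:.4f}")
```

Here the two error shares are nearly equal, which is why the predicted proportion stays close to the observed one even though half of the positive cases are misclassified.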
Evaluation of predictive models
Application of Naïve Bayes to predict all questions in section B8
Question B8: "Indicate if the website has any of the following facilities"
Performance of Naïve Bayes:
Question | Precision | Sensitivity | Specificity | Observed proportion | Predicted proportion
a) Online ordering or reservation or booking (web sales functionality) | 0.78 | 0.50 | 0.86 | 0.21 | 0.21
b) Tracking or status of orders placed | 0.82 | 0.49 | 0.85 | 0.18 | 0.11
c) Description of goods or services, price lists | 0.62 | 0.44 | 0.79 | 0.48 | 0.32
d) Personalized content in the website for regular/repeated visitors | 0.74 | 0.41 | 0.78 | 0.09 | 0.23
e) Possibility for visitors to customize or design online goods or services | 0.86 | 0.53 | 0.87 | 0.05 | 0.14
f) A privacy policy statement, a privacy seal or a website safety certificate | 0.59 | 0.57 | 0.64 | 0.68 | 0.51
g) Advertisement of open job positions or online job application | 0.69 | 0.52 | 0.78 | 0.35 | 0.33
Content analysis
Content analysis performance …
To verify the robustness of the Content Analysis, we repeated the selection of
a training set from the survey data 40 times (each time producing an estimate
of the proportion of websites with web sales functionality), for different
shares of the training set on the total (from 10% to 90%).
The results show that the method is correct up to a training rate of 30%, but
that the estimates vary widely at every rate.
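The structure of this robustness exercise can be sketched as a simple resampling loop. The slides do not specify the Content Analysis estimator, so the code below plugs in a deliberately trivial placeholder (the share of positives in the training set) and uses synthetic labels; the function names, the seed and the data are all assumptions made for the illustration, and only the loop structure (40 draws per training rate, rates from 10% to 90%) comes from the text.

```python
import random
import statistics


def robustness_experiment(n_units, estimator, rates, n_iter=40, seed=0):
    """For each training rate, draw n_iter random training sets and
    summarise the resulting estimates by their mean and standard deviation."""
    rng = random.Random(seed)
    indices = list(range(n_units))
    results = {}
    for rate in rates:
        size = max(1, round(rate * n_units))
        estimates = [estimator(rng.sample(indices, size)) for _ in range(n_iter)]
        results[rate] = (statistics.mean(estimates), statistics.stdev(estimates))
    return results


# Synthetic labels (NOT survey data): 21% of 1,000 units are positive.
labels = [1] * 210 + [0] * 790


def placeholder_estimator(train_idx):
    """Hypothetical stand-in for the real estimator: share of positives
    among the units selected for training."""
    return sum(labels[i] for i in train_idx) / len(train_idx)


rates = [r / 10 for r in range(1, 10)]  # 10%, 20%, ..., 90%
results = robustness_experiment(len(labels), placeholder_estimator, rates)
for rate, (mean_est, sd_est) in results.items():
    print(f"training rate {rate:.0%}: mean={mean_est:.3f}, sd={sd_est:.3f}")
```

Replacing `placeholder_estimator` with a function that trains a model on the selected units and returns the estimated proportion for the whole set would reproduce the kind of comparison described here.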
… compared to Naïve Bayes
The same exercise was carried out for Naïve Bayes.
The results show a small bias (on the order of one or two percentage points),
but much lower variability.
Future work
The approach tested here will be improved and extended in two directions:
1. with reference to the population of interest: we will consider the URLs of
all the units belonging to the Business Register and perform a mass
scraping of the related websites (which will also allow us to experiment
more properly with the high-volume problems related to Big Data), using
the whole sample subset of websites as a training set, so as to obtain a
model that can be applied to the whole population. The aim is to produce
estimates under a full predictive approach, reducing sampling errors at
the cost of introducing additional bias (both components of the MSE
should be evaluated);
2. with reference to the content of the questionnaire: the results obtained
with the variables in section “B8” of the questionnaire will also be
evaluated for the other suitable variables in the questionnaire
(e-recruitment, e-procurement, use of social networks, etc.).
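The two components of the MSE mentioned in the first point can be separated using the standard decomposition MSE = bias² + variance. The sketch below applies it to a set of repeated estimates of a proportion; the function name and the numbers are invented for illustration and are not results from the survey.

```python
import statistics


def mse_components(estimates, true_value):
    """Return (bias, variance, mse) for repeated estimates of one quantity.

    Uses the population variance so that mse == bias**2 + variance holds exactly
    (up to floating-point rounding).
    """
    values = list(estimates)
    bias = statistics.fmean(values) - true_value
    variance = statistics.pvariance(values)
    mse = statistics.fmean([(v - true_value) ** 2 for v in values])
    return bias, variance, mse


# Invented estimates of a proportion whose reference value is 0.21.
bias, variance, mse = mse_components([0.22, 0.23, 0.21, 0.24, 0.20], true_value=0.21)
print(f"bias={bias:.4f}, variance={variance:.6f}, mse={mse:.6f}")
```

A full predictive approach would be worthwhile when the reduction in the variance term outweighs the squared bias it introduces.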
Thank you for your attention