WP2_Istat_experience

Essnet on Big Data
WP2 - Webscraping / Enterprise Characteristics
Sharing of previous experiences on scraping
Istat’s experience
ESSnet on Big Data – Wp2 “Webscraping / Enterprise Characteristics” – Kickoff meeting Rome 23-24 March
We can distinguish two different kinds of web scraping:
1. specific web scraping, when both structure and content of websites
to be scraped are perfectly known, and crawlers just have to
replicate the behaviour of a human being visiting the website and
collecting the information of interest. A typical area of application is
data collection for consumer price indices (ONS, CBS, Istat);
2. generic web scraping, when no a priori knowledge on the content
is available, and the whole website is scraped and subsequently
processed in order to infer information of interest: this is the case
of the “ICT usage in enterprises” pilot.
So far, Istat has had a number of web scraping
experiences, involving:
1. prices related to a set of goods and services;
2. agritourism farms’ portals;
3. scraping of enterprise websites in the Enterprises ICT survey.
In the first two cases, specific scraping has been employed, while
generic scraping has been used for the enterprise websites.
Web scraping data acquisition for the Harmonized Index
of Consumer Prices (European project “Multipurpose
price statistics”)
It is currently performed on two groups of products:
1. Consumer electronics
2. Airfares
These prices are collected by simulating online purchases.
The current applications make use of iMacros.
Next step: use of open source software such as “rvest” (an R package for web
scraping).
Use of web scraping for agritourism data collection
In Italy there are about 20,000 agritourism farms (AF).
An annual survey is carried out to collect information on them.
The same information, and more, can be obtained by directly
accessing and scraping their sites on the web.
Instead of accessing each single website, a limited set of «hubs»
would be scraped, each one containing information related to
many AFs. In this case, «specific» scraping would be used.
A relevant problem here would be the treatment of duplicated
and inconsistent information, as the same AF can be present in
more than one hub.
A critical step is therefore «record linkage», as the information
collected in different hubs must be correctly attributed to a
given AF.
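The record linkage step can be illustrated with a minimal Python sketch. It is not Istat's procedure: the record fields (`hub`, `name`, `municipality`), the similarity threshold and the sample records are all assumptions made for illustration. Two hub records are treated as the same AF when they share a municipality and their normalised names are sufficiently similar.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def same_farm(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Hypothetical matching rule: same municipality and similar name."""
    if normalise(a["municipality"]) != normalise(b["municipality"]):
        return False
    score = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    return score >= threshold


def link_records(records: list[dict]) -> list[list[dict]]:
    """Greedy grouping: a record joins the first group whose first member matches."""
    groups: list[list[dict]] = []
    for rec in records:
        for group in groups:
            if same_farm(group[0], rec):
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups


# Illustrative records only (invented names, not real hub data).
records = [
    {"hub": "hub_a", "name": "Agriturismo Il Poggio", "municipality": "Siena"},
    {"hub": "hub_b", "name": "Agriturismo il Poggio.", "municipality": "siena"},
    {"hub": "hub_b", "name": "Fattoria La Quercia", "municipality": "Siena"},
]
groups = link_records(records)
for group in groups:
    print([r["hub"] for r in group], group[0]["name"])
```

Each resulting group collects the records that refer to one AF, so conflicting values from different hubs can then be compared within a group. A production linkage would likely use more identifying fields (address, phone number, VAT code) and a probabilistic model rather than a fixed threshold.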
ICT usage in enterprises
The web questionnaire is used to collect information on the
characteristics of the websites owned or used by the enterprises.
In a first phase, the aim of the experiment was to predict the values of
questions B8a to B8g using machine learning techniques applied
to texts (text mining) scraped from the websites.
Particular effort was dedicated to question B8a (“Web sales facilities”
or “e-commerce”).
ICT usage: web scraping
Different solutions for web scraping will be investigated.
For instance, in Istat experiments we have already tested:
1. the Apache suite Nutch/Solr (https://nutch.apache.org) for
crawling, content extraction, indexing and searching of results;
Nutch is a highly extensible and scalable open source web crawler;
2. HTTrack (http://www.httrack.com/), a free and open source
software tool that makes it possible to “mirror” a website locally by
downloading each page that composes its structure. In technical
terms it is a web crawler and an offline browser;
3. JSOUP (http://jsoup.org), which makes it possible to parse an HTML
document and extract its structure. It has been integrated in a specific
step of the ADaMSoft system (http://adamsoft.sourceforge.net),
which was selected because it already includes facilities for
handling huge data sets and textual information.
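To give an idea of what the parsing step produces, here is a minimal Python sketch that extracts the visible text of a page. It uses the standard library `html.parser` module as a stand-in for JSOUP (which is a Java library); the class name `TextExtractor`, the function `extract_text` and the sample page are assumptions made for illustration and do not reproduce the ADaMSoft integration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Illustrative page, not taken from a real enterprise website.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Shop</h1><p>Add to <b>cart</b></p>"
    "<script>var x=1;</script></body></html>"
)
text = extract_text(page)
print(text)
```

The extracted text, one string per page or per website, is the kind of input that the text mining step works on.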
ICT usage: web scraping
These techniques will be evaluated by taking into account:
1. efficiency: number of websites actually scraped out of the total, and
execution performance;
2. effectiveness: completeness and richness of the collected text, which
can influence the quality of the predictions.
Solution | # websites reached   | Average number of webpages per site | Time spent | Type of storage                       | Storage dimensions
Nutch    | 7020 / 8550 = 82.1% | 15.2                                | 32.5 hours | Binary files on HDFS                  | 2.3 GB (data), 5.6 GB (index)
HTTrack  | 7710 / 8550 = 90.2% | 43.5                                | 6.7 days   | HTML files on file system             | 16.1 GB
JSOUP    | 7835 / 8550 = 91.6% | 68                                  | 11 hours   | HTML ADaMSoft compressed binary files | 500 MB
Under testing: an ad hoc solution based on Jsoup and Jcrawler.
ICT usage: text mining
The 2013 and 2014 rounds of the survey have both been used in the experiment.
For all respondents declaring that they own a website, the website has been
scraped, and the collected texts submitted to classical text mining procedures in
order to build a term/document matrix.
Different learners have been applied in order to predict the values of target
variables (for instance, “e-commerce (yes/no)”) on the basis of relevant terms
identified in the websites.
The relevance of the terms (and the consequent selection of 1,200 out of 50,000)
has been based on the importance of each term as measured in the correspondence
analysis.
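As an illustration of the term/document matrix, the following Python sketch counts how often each term occurs in the text scraped from each website. The function names, the sample texts and the `min_docs` document-frequency filter are assumptions made for illustration; in particular, the filter is only a simple stand-in and does not implement the correspondence-analysis-based selection described above.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lower-case alphabetic tokens (no stemming, no stop words)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def term_document_matrix(docs: dict[str, str], min_docs: int = 1):
    """Return (terms, matrix); matrix[doc_id][j] is the count of terms[j] in that document.

    Terms that occur in fewer than `min_docs` documents are dropped.
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    terms = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    matrix = {doc_id: [c[t] for t in terms] for doc_id, c in counts.items()}
    return terms, matrix


# Illustrative texts, one per scraped website.
docs = {
    "site_1": "Add to cart. Cart checkout and payment.",
    "site_2": "Contact us for a quote.",
}
terms, matrix = term_document_matrix(docs)
print(terms)
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Each row of the matrix is then the feature vector of one enterprise website, to which the learners are applied.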
2013 data have been used as the “train” dataset, while 2014 data have been
used as the “test” dataset.
The performance of each learner has been evaluated by means of the usual
quality indicators:
• accuracy: proportion of correctly classified cases out of the total;
• sensitivity: proportion of correctly classified positive cases out of all positive cases;
• specificity: proportion of correctly classified negative cases out of all negative
cases.
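The three indicators can be computed from observed and predicted labels as in the following Python sketch. The function name and the toy label vectors are assumptions made for illustration (1 = e-commerce, 0 = no e-commerce); the sketch also assumes that both classes occur among the observed labels, otherwise a division by zero would arise.

```python
def quality_indicators(observed: list[int], predicted: list[int]) -> dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels coded as 0/1."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true positives
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true negatives
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy data for illustration only.
observed = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
indicators = quality_indicators(observed, predicted)
print(indicators)
```

When positives are rare, as with e-commerce here, a learner can reach high accuracy mostly through high specificity, which is why the three indicators are read together.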
ICT usage: e-commerce prediction
Quality indicators for e-commerce
Learner        | Accuracy | Sensitivity | Specificity | Proportion of e-commerce (observed) | Proportion of e-commerce (predicted)
GLM (Logistic) | 0.69     | 0.68        | 0.69        | 0.19                                | 0.22
Random Forest  | 0.79     | 0.63        | 0.83        | 0.19                                | 0.25
Neural Network | 0.70     | 0.62        | 0.72        | 0.19                                | 0.20
Boosting       | 0.67     | 0.66        | 0.67        | 0.19                                | 0.22
Bagging        | 0.82     | 0.38        | 0.92        | 0.19                                | 0.19
Naïve Bayes    | 0.75     | 0.55        | 0.79        | 0.19                                | 0.21
LDA            | 0.66     | 0.71        | 0.65        | 0.19                                | 0.28
RPART (Tree)   | 0.82     | 0.25        | 0.95        | 0.19                                | 0.16
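A quick way to read the indicators together: accuracy is the average of sensitivity and specificity, weighted by the share of positive and negative cases. The Python sketch below checks this identity against the figures reported above, assuming that the observed e-commerce proportion (0.19) refers to the same test data as the indicators; the function name `implied_accuracy` is introduced here for illustration only.

```python
# (learner, accuracy, sensitivity, specificity) as reported in the table above.
rows = [
    ("GLM (Logistic)", 0.69, 0.68, 0.69),
    ("Random Forest", 0.79, 0.63, 0.83),
    ("Neural Network", 0.70, 0.62, 0.72),
    ("Boosting", 0.67, 0.66, 0.67),
    ("Bagging", 0.82, 0.38, 0.92),
    ("Naive Bayes", 0.75, 0.55, 0.79),
    ("LDA", 0.66, 0.71, 0.65),
    ("RPART (Tree)", 0.82, 0.25, 0.95),
]
prevalence = 0.19  # observed proportion of e-commerce


def implied_accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Accuracy implied by sensitivity, specificity and the share of positives."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


# Absolute gap between reported and implied accuracy, per learner.
gaps = {
    name: abs(acc - implied_accuracy(sens, spec, prevalence))
    for name, acc, sens, spec in rows
}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

Under that assumption, the reported accuracies agree with the implied ones up to rounding. The high accuracy of Bagging and RPART comes almost entirely from their specificity, at the cost of a low sensitivity.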
ICT usage: prediction of other variables
So far, the pilot has explored the possibility of replicating the
information collected by the questionnaire using the scraped
content of the website and applying the best predictor
(reduction of respondent burden).
A more relevant possibility is to combine survey data and Big
Data in order to improve the quality of the estimates.
The aim is to adopt a full predictive approach with a
combined use of data:
1. all the websites owned by the whole population of
enterprises are identified and their content collected
by web scraping (= Big Data);
2. survey data (the “ground truth”) are combined with Big
Data in order to establish relations (models) between the
values of target variables and the terms collected in
the corresponding scraped websites;
3. the models estimated in step 2 are applied to the
whole set of texts obtained in step 1 in order to produce
estimates related to the target variables.
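The three steps can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the pilot's implementation: the enterprise identifiers, texts and survey answers are invented, and a small multinomial Naive Bayes classifier written with the standard library stands in for whichever learner would actually be chosen. The sketch assumes that both classes occur among the surveyed enterprises.

```python
import math
from collections import Counter

# Step 1: texts scraped for the whole population (illustrative data).
scraped = {
    "ent_1": "cart checkout payment shipping",
    "ent_2": "contact company history",
    "ent_3": "cart payment order",
    "ent_4": "company contact address",
    "ent_5": "order shipping cart",
    "ent_6": "history company team",
}
# Survey answers (ground truth) exist only for the sampled enterprises.
survey = {"ent_1": 1, "ent_2": 0, "ent_3": 1, "ent_4": 0}


def train(texts: list[str], labels: list[int]):
    """Step 2: count terms per class on the surveyed enterprises."""
    vocab = {w for t in texts for w in t.split()}
    word_counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    return vocab, word_counts, Counter(labels)


def predict(model, text: str) -> int:
    """Return the class with the higher log-posterior (Laplace smoothing)."""
    vocab, word_counts, class_counts = model
    n = sum(class_counts.values())
    scores = {}
    for c in (0, 1):
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n)
        for w in text.split():
            if w in vocab:  # terms never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return 1 if scores[1] > scores[0] else 0


model = train([scraped[e] for e in survey], list(survey.values()))

# Step 3: apply the model to every scraped website and estimate the proportion.
predictions = {ent: predict(model, text) for ent, text in scraped.items()}
estimate = sum(predictions.values()) / len(predictions)
print(predictions, estimate)
```

In this sketch the estimate is a simple unweighted share of predicted positives over all scraped websites; in practice the estimation step would also need to account for model error and for enterprises whose websites cannot be found or scraped.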
Thank you for your attention