IT development overview
Big Data activities at SURS
Statistical Office of the Republic of Slovenia
DIME/ITDG meeting, February 2016
Aim of official statistics
Support data users:
- Government
- Politicians and legislators
- Markets
- The public
- The media
- International community
Data Sources
- Surveys
- Administrative sources
- Big Data
Big Data – Possible usage
- New statistics
- New (or combined) sources for existing statistics
- Validation („Benchmarking“) of data and statistics
- Different mode of data collection
- „Flash“ statistics
- Faster release of statistics
AAPOR Report on Big Data (2015): “Surveys and Big
Data are complementary data sources, not competing data
sources”.
Current activities at SURS
• Analysis of different types of Big Data and their potential use in
regular statistical production (mobile positioning data,
scanned price data, web scraping, etc.)
• IT infrastructure
• Partnership with stakeholders (data owners, academia, etc.)
• Active participation in different international task forces (Eurostat BD
Task Force, UNECE BD Task Force) and projects (ESSNET grant
pilots)
Statistical model and new sources
Web scraping system for
identifying job advertisements
Process of creating the
collection tool
Spider: takes a company website and finds all webpages (sub-links) on it that relate to employment.
Downloader: downloads the content of the saved URL links (PDF files and HTTPS pages caused problems).
Splitter: splits the content of a given URL into separate documents.
Determinator: detects job vacancy (JV) ads in the documents produced by Splitter.
Classifier: classifies the detected JV ads, for example by occupation, deadline, address, or region.
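The first stages of such a tool can be sketched roughly as follows. This is a minimal illustration, not the SURS implementation: the names `spider`, `splitter`, `LinkCollector` and the keyword list `EMPLOYMENT_HINTS` are hypothetical, the page is passed in as a string rather than downloaded, and splitting on blank lines is only one simple assumption about how a page might be cut into documents.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; the terms actually used are not given in the slides.
EMPLOYMENT_HINTS = ("job", "career", "vacanc", "employment", "zaposlitev", "kariera")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def spider(base_url, html):
    """Return same-site URLs whose address or anchor text hints at employment."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(base_url).netloc
    found = []
    for href, text in parser.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc != site:
            continue  # stay on the company's own website
        haystack = (url + " " + text).lower()
        if any(hint in haystack for hint in EMPLOYMENT_HINTS) and url not in found:
            found.append(url)
    return found


def splitter(text):
    """Split page text into candidate documents on blank lines (an assumption)."""
    parts = [p.strip() for p in text.split("\n\n")]
    return [p for p in parts if p]
```

In a full pipeline, each URL returned by `spider` would be downloaded, its text passed through `splitter`, and the resulting documents handed on to the detection and classification steps.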
Process of creating the
collection tool
Two approaches to detecting JV ads are currently being pursued:
• A "decision tree" applied to the content of the downloaded URLs
• A list of common keywords and phrases (whitelist and blacklist of words) used to detect JV ads in the content of the downloaded URLs
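The keyword-list approach might look something like the sketch below. The word lists, the scoring rule (whitelist hits minus blacklist hits) and the threshold are all assumptions made for illustration; the slides do not state which words or which decision rule are used.

```python
import re

# Hypothetical word lists for illustration; the actual lists are not in the slides.
WHITELIST = {"vacancy", "hiring", "apply", "deadline", "candidates", "position"}
BLACKLIST = {"cookie", "newsletter", "cart", "archive"}


def score(document):
    """Whitelist hits minus blacklist hits, counted over distinct words."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(words & WHITELIST) - len(words & BLACKLIST)


def is_job_ad(document, threshold=2):
    """Flag a document as a JV ad when its score reaches the (assumed) threshold."""
    return score(document) >= threshold
```

Counting distinct words rather than every occurrence keeps one repeated term from dominating the score; whether that suits real company pages would have to be checked against labelled examples.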
Job Ads Statistics - initial results
[Figure: Number of enterprises which advertise job vacancies, in percentage (%), by Slovenian regions]
[Figure: Job vacancies: accuracy of scraped data compared to survey data, by Slovenian regions (series: Scraped, Survey; vertical axis from 0% to 40%)]
Mobile positioning and
statistical derivatives
Mobile operators
- 4 mobile network operators
- 3 service providers
- 3 re-sellers
- the 4 network operators are the primary data providers
- all network operators and service providers could be (or already are) important
Mobile data
• For research purposes, SURS had access to data from the second-largest mobile operator in Slovenia
• Data from April to October 2014 (1 billion records)
• Three variables:
- Anonymized IMEI
- Time of event (outgoing call, outgoing SMS, or connecting to the internet from a mobile phone)
- Coordinates of antennas
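With only these three variables, one simple derivative is the number of distinct devices observed at each antenna during the day and during the night. The sketch below illustrates the idea on made-up records; the sample rows, the function names and the 07:00 to 19:00 definition of "day" are assumptions, not SURS data or methodology.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample records:
# (anonymized device ID, event time, antenna latitude, antenna longitude)
records = [
    ("a1", "2014-04-01T10:15:00", 46.0569, 14.5058),
    ("a1", "2014-04-01T11:40:00", 46.0569, 14.5058),
    ("b2", "2014-04-01T14:05:00", 46.0569, 14.5058),
    ("a1", "2014-04-01T23:30:00", 46.0900, 14.4700),
    ("b2", "2014-04-02T02:10:00", 46.0900, 14.4700),
    ("c3", "2014-04-02T03:00:00", 46.0569, 14.5058),
]


def period(timestamp, day_start=7, day_end=19):
    """Label an event as 'day' or 'night'; the hour boundaries are an assumption."""
    hour = datetime.fromisoformat(timestamp).hour
    return "day" if day_start <= hour < day_end else "night"


def devices_per_antenna(rows):
    """Count distinct devices seen at each antenna, separately for day and night."""
    seen = defaultdict(set)
    for device, timestamp, lat, lon in rows:
        seen[(period(timestamp), (lat, lon))].add(device)
    return {key: len(devices) for key, devices in seen.items()}


counts = devices_per_antenna(records)
```

Because a record exists only when a device makes a call, sends an SMS or connects to the internet, such counts are a proxy for where active devices are, not a direct population count.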
Density of people in Ljubljana during the day
[Figure, panel label: Daytime]
Density of people in Ljubljana during the night
BD activities in 2016 (1)
In February, a set of workshops will be organized with
subject-matter statisticians.
Goal: brainstorm ideas and prepare business cases for
the use of BD in different domains
of statistical production
BD activities in 2016 (2)
Deepen cooperation with Slovenian universities
Goals:
• Education of colleagues
• Use of data mining (and collection) tools developed by Slovenian faculties:
http://orange.biolab.si/ or http://newsfeed.ijs.si/
• Cooperation in projects
BD activities in 2016 (3)
• Active participation in the ESSNET BD project (as one of the WP leaders)
• Organization of the Eurostat Big Data Workshop in Slovenia and contribution to the ethical review and ethical guidelines, which are to be prepared this year
• Continuation of ongoing work in local projects (job vacancies data from enterprise websites, and CEMODE)
Open questions
• Access to data (legal issues, partnership, etc.)
• Big data are used for different purposes (different
definitions)
• There is no control of the collection process
• Data sources could change or even disappear
• Public perception
• IT and methodological skills
• IT infrastructure
• Quality of data