Machine Learning
Documentation Initiative
Workshop on the Modernisation of Statistical Production
Topic iii) Innovation in technology and methods driving opportunities for modernisation
Kenneth Chu and Claude Poirier
Geneva, Switzerland, 15-17 April 2015
What is Machine Learning (ML)?
Application of artificial intelligence in which
algorithms use available information to process
(or assist the processing of) statistical data
Coding
Editing
Linkage
Collection
• 20 applications were reported.
Statistics Canada • Statistique Canada
2016-04-06
Why should we consider ML?
Relatively new discipline of computer science
• No need for probabilistic models
• Less stringent requirements, suited to the Big Data era
All national statistical offices (NSOs) should explore the use of ML
Classes of ML
SUPERVISED ML
Ex.1: Logistic regression [statistics]
• Training data: Binary response (0:1) and predictors
• Maximum likelihood leads to model parameters
• Resulting model is used to predict responses
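A minimal sketch of Ex.1, assuming a single predictor and toy data invented for illustration (not taken from the workshop). The parameters are estimated by maximising the log-likelihood with plain gradient ascent, and the fitted model is then used to predict responses. The names `fit_logistic` and `predict` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Estimate (intercept, slope) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(b0 + b1 * x)  # observed minus fitted probability
            g0 += resid                       # score for the intercept
            g1 += resid * x                   # score for the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def predict(b0, b1, x, threshold=0.5):
    """Predict the binary response (0 or 1) for a new predictor value."""
    return 1 if sigmoid(b0 + b1 * x) >= threshold else 0


# Toy training data (assumption): binary response and one predictor.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print("intercept:", b0, "slope:", b1)
print("prediction at x=0.5:", predict(b0, b1, 0.5))
print("prediction at x=4.0:", predict(b0, b1, 4.0))
```

In practice a statistical package would use Newton-type iterations rather than fixed-step gradient ascent; the simpler loop is used here only to keep the likelihood maximisation visible.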
Ex.2: Support Vector Machines [non-statistics]
• Training data: Binary response (0:1) and predictors
• Hyperplanes in the space of predictors separate responses
• SVM optimisation problem comes from geometry
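A minimal sketch of Ex.2, assuming two predictors, a linear kernel and toy data invented for illustration. The separating hyperplane is found by sub-gradient descent on the regularised hinge loss, which is one common way to approach the soft-margin SVM problem (production solvers usually work on the dual instead). The names `fit_linear_svm` and `predict` are hypothetical.

```python
def fit_linear_svm(points, labels, lam=0.01, lr=0.01, n_epochs=2000):
    """Fit w and b for the hyperplane w.x + b = 0 using the hinge loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(n_epochs):
        for (x1, x2), y01 in zip(points, labels):
            y = 1 if y01 == 1 else -1            # recode 0/1 as -1/+1
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # L2 penalty step: shrinking w favours a wide margin
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]
            if margin < 1:                       # inside the margin or misclassified
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b


def predict(w, b, point):
    """Return 1 on the positive side of the hyperplane, otherwise 0."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else 0


# Toy training data (assumption): two linearly separable groups.
points = [(1, 1), (1, 2), (2, 1), (4, 4), (5, 4), (4, 5)]
labels = [0, 0, 0, 1, 1, 1]

w, b = fit_linear_svm(points, labels)
print("w:", w, "b:", b)
print("class of (1, 1):", predict(w, b, (1, 1)))
print("class of (5, 5):", predict(w, b, (5, 5)))
```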
Other methods: decision trees, neural networks, Bayesian networks
Classes of ML
UNSUPERVISED ML
Ex.1: Principal Component Analysis [statistics]
• PCA summarizes a set of data by finding orthogonal
sub-spaces that represent most of the variation
• There is no longer a response variable in the setting
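A minimal sketch of Ex.1, assuming two variables and toy data invented for illustration. For a 2x2 sample covariance matrix the eigenvalues and eigenvectors have a closed form, so the two orthogonal directions and the share of variation each one carries can be computed directly. The name `pca_2d` is hypothetical.

```python
import math


def pca_2d(points):
    """Return ((lam1, lam2), (pc1, pc2)) for two-dimensional data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]]
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector belonging to the largest eigenvalue
    if b != 0:
        v = (b, lam1 - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)
    pc2 = (-pc1[1], pc1[0])  # orthogonal to pc1 by construction
    return (lam1, lam2), (pc1, pc2)


# Toy data (assumption): two strongly correlated variables, no response.
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]

(lam1, lam2), (pc1, pc2) = pca_2d(data)
share = lam1 / (lam1 + lam2)
print("first component:", pc1)
print("share of variation explained:", share)
```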
Ex.2: Cluster Analysis [non-statistics]
• CA seeks to determine grouping in given data
• Again, there are no response variables in the setting
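A minimal sketch of Ex.2, assuming k-means as the clustering method (the slide does not name one) and toy data invented for illustration. The first k points serve as starting centroids so that the run is deterministic. The names `sq_dist` and `kmeans` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Group points into k clusters; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic start
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no change, so the grouping is stable
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data (assumption): two visibly separate groups, no response variable.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]

labels, centroids = kmeans(points, k=2)
print("labels:", labels)
print("centroids:", centroids)
```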
Applications
Automated Coding
• Bayesian classifier (Germany): Occupation coding
• CASCOT (United Kingdom): Occupation coding
• Indexing utility (Ireland): Individual consumption
• SVM (New Zealand): Occupation and Qualification
Applications
Data Editing
• Bayesian Networks (Eurostat): Voting intentions
• Classification Trees (Portugal): Foreign trade data
• Cluster Analysis (USA): Census of agriculture
• CART (New Zealand): Census of population
• Random Forests (New Zealand): Donor imputation
• Association Analysis (New Zealand): Edit rules
Applications
Record Linkage
• Resembles neither coding nor editing
• Linkage quality depends more on pre-processing than on
matching
• No applications of Machine Learning in official
statistics were listed
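To illustrate the point about pre-processing, here is a minimal sketch that standardises names before an exact comparison. The two records and the `standardise` function are hypothetical, and real linkage systems apply far richer standardisation rules than this.

```python
import re
import unicodedata


def standardise(name):
    """Strip accents, case, punctuation and extra spaces from a name."""
    text = unicodedata.normalize("NFKD", name)  # split accents from letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)    # punctuation becomes a space
    return " ".join(text.split())               # collapse repeated spaces


# Hypothetical records describing the same business in two files.
raw_a = "  Boulangerie Côté & Fils "
raw_b = "BOULANGERIE COTE FILS"

print("raw match:", raw_a == raw_b)
print("match after standardisation:", standardise(raw_a) == standardise(raw_b))
```

The comparison step is identical in both cases; only the pre-processing changes whether the pair is found.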
Applications
Other areas – Data collection
• Classification Tree (USA): Non-response prediction
• Classification Tree (USA): Reporting errors
• Naïve Bayes text mining (Italy): Web scraping
• K-nearest neighbours (Hungary): Tax audit
• Image Processing (Canada): Remote sensing
Concluding remarks
Several machine learning applications were reported
A gap remains in the area of record linkage
Methods outside the statistical paradigm require attention
Next: Applying Machine Learning to Big Data
• Will this be possible only on a case-by-case basis?
Thank you
For more information,
please contact:
[email protected]