Transcript SIC 2007

Implementing Coding Tools for
a New Classification
John Perry, UK Office for National Statistics
Operation 2007 - The players:
• In the UK:
The Standard Industrial Classification of
Economic Activities (SIC)
(current version SIC 2003)
• In Europe:
NACE, the Nomenclature générale des
activités économiques dans les
Communautés européennes
(current version NACE Rev 1.1)
• In the UN:
ISIC, the International Standard
Industrial Classification of all Economic
Activities
(current version ISIC Rev 3.1)
The UK SIC
• is a 5 digit classification system
• is required, by EU legislation, to be identical to
NACE down to and including the 4 digit Class level
• contains a national 5th digit level which does not
exist in NACE
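Because the first four digits of a UK SIC code are the NACE Class, the shared levels can be read off a 5 digit code by truncation. The following is a minimal Python sketch of that idea; the function name and dictionary keys are illustrative, and the level names (Division, Group, Class, Sub Class) follow the usual NACE/SIC hierarchy rather than anything stated on the slide.

```python
def sic_levels(sic_code: str) -> dict:
    """Split a 5 digit UK SIC code into its hierarchical levels."""
    if len(sic_code) != 5 or not sic_code.isdigit():
        raise ValueError(f"expected a 5 digit code, got {sic_code!r}")
    return {
        "division": sic_code[:2],     # 2 digit level
        "group": sic_code[:3],        # 3 digit level
        "nace_class": sic_code[:4],   # 4 digit Class, identical to NACE
        "uk_sub_class": sic_code,     # national 5th digit, not in NACE
    }


levels = sic_levels("51880")
print(levels["nace_class"], levels["uk_sub_class"])
```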
The Results – changes in structure
                          SIC 2003   SIC 2007
NACE Classes                   514        615
NACE Classes not split         414        537
UK Sub Class splits            285        191
Total Sub Classes              699        728
ACTR as an aid to coding
• ACTR – Automatic Coding by Text Recognition
• Developed by Statistics Canada
• ONS standard tool for coding, initially industry and
occupation
• Replaces Precision Data Coder for industry coding
• Determines a code from a text description
• Extent of automation of process is controlled by
parameters
Knowledge Bases – SIC2003
• ACTR relies heavily on indexes of standard
descriptions:
– Business descriptions from responses to the Business
Register Survey
– Published index for the SIC2003
– The short descriptions for each SIC2003 code
– Standard descriptions for construction industry statistics
– Trade code descriptions for PAYE (Pay As You Earn
Tax) employers
– Farm type descriptions
• With a total of > 30,000 standard descriptions
How ACTR works
• Each input description is converted to a standard form
• This is compared with the standard forms of descriptions
held in the knowledge base
• The closeness is presented as a score between 0 and 10
• The system has rules to determine whether the score is
sufficient to confirm a match:
– Requires a score of more than 7.5 to code automatically (our setting
which may differ for other data sets)
– Lower scores are passed through interactive coding
• Coding does not depend on the order in which the
knowledge bases are checked
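The standardise, score and threshold steps can be sketched in Python as below. This is not ACTR's algorithm: the stop-word list, the crude plural stripping and the Dice-overlap score are all assumptions chosen for illustration, so the score it gives for the "Horticultural services" example on the ACTR Process slide will differ from the 6.911 quoted there. Only the 0 to 10 scale and the 7.5 threshold come from the slides.

```python
import re

STOP_WORDS = {"AND", "OF", "THE", "FOR", "IN"}  # assumed list, not ACTR's
AUTO_CODE_THRESHOLD = 7.5                       # the setting quoted on the slide


def standardise(text: str) -> frozenset:
    """Upper-case, drop stop words, strip a trailing plural 'S', ignore word order."""
    words = re.findall(r"[A-Za-z]+", text.upper())
    kept = [w for w in words if w not in STOP_WORDS]
    # Crude singularisation, for illustration only (it would mangle e.g. GLASS).
    return frozenset(w[:-1] if w.endswith("S") and len(w) > 3 else w for w in kept)


def score(a: frozenset, b: frozenset) -> float:
    """Dice overlap of two standard forms, scaled to 0-10 (stand-in scoring)."""
    if not a or not b:
        return 0.0
    return 10.0 * 2 * len(a & b) / (len(a) + len(b))


def route(match_score: float) -> str:
    """Code automatically only when the score is more than the threshold."""
    return "automatic" if match_score > AUTO_CODE_THRESHOLD else "interactive"


supplied = standardise("Horticultural services")
index_entry = standardise("Sales and service of horticultural machinery")
print(sorted(supplied), sorted(index_entry), route(score(supplied, index_entry)))
```

Using sets makes the comparison independent of word order, which is one simple way to obtain a "standard form".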
Extract from Business Register Survey
Questionnaire
ACTR Process
• Supplied text: Horticultural services
• HORTICULTURAL SERVICE
• Best fit index entry: Sales and service of
horticultural machinery
• HORTICULTURAL MACHINERY SALE SERVICE
• Score is 6.911 (out of 10)
• ACTR prefers SIC 2003 code: 51880 (Wholesale of
agricultural machinery and accessories)
Interactive coding
• Scores below 7.5 are passed to clerical staff for
coding interactively
• The system presents options in descending order
of score
• If none of the choices appear good, staff modify the
description
• Once a decision is made, the person coding
confirms the choice
• The index description is then held on the IDBR.
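The ordering step, where the coder sees the best-scoring options first, might look like the Python sketch below. The `Candidate` type, the `limit` parameter and the two lower-scoring entries are hypothetical; only the first candidate (description, code and score) is taken from the slides.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    index_description: str
    sic2003: str
    score: float


def options_for_coder(candidates: List[Candidate], limit: int = 5) -> List[Candidate]:
    """Return candidates in descending order of score, best first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:limit]


candidates = [
    Candidate("Hypothetical index entry X", "00000", 4.2),   # made-up placeholder
    Candidate("Sales and service of horticultural machinery", "51880", 6.911),
    Candidate("Hypothetical index entry Y", "00000", 5.8),   # made-up placeholder
]
for option in options_for_coder(candidates):
    print(f"{option.score:6.3f}  {option.sic2003}  {option.index_description}")
```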
Introducing the SIC2007 (NACE Rev 2)
• New index files:
– SIC2007 headings
– SIC2007 index
• Initially code forward from the SIC2003 using
bridging codes – these are codes for each
knowledge base entry that link the SIC2003 and
SIC2007
• Later will change to code backwards from the
SIC2007
• Eventually dual coding will cease
Impact of ACTR on IDBR at Micro Level
• Existing SIC 2003 is 01120 (Growing of vegetables
etc)
• The preferred ACTR SIC 2003 is 51880
(Wholesale of agricultural machinery and
accessories)
• The SIC 2007 comes from the bridging code
– SIC 2003: 51880
– Bridging code: MTOLR
– SIC 2007: 46610
• SIC 2003 code will change but only when agreed
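One way to picture the bridging code is as a key that each knowledge base entry carries and that resolves to a SIC 2003 and SIC 2007 pair. The Python sketch below assumes that two-table layout; the dictionary names and function are illustrative, while the codes 51880, MTOLR and 46610 are the ones shown above.

```python
# Bridging table: bridging code -> (SIC 2003, SIC 2007). Layout is an assumption.
BRIDGING = {
    "MTOLR": ("51880", "46610"),
}

# Knowledge base: standard form of an index description -> bridging code.
KNOWLEDGE_BASE = {
    "HORTICULTURAL MACHINERY SALE SERVICE": "MTOLR",
}


def dual_code(standard_form: str) -> dict:
    """Look up both classification codes for a matched knowledge base entry."""
    bridging_code = KNOWLEDGE_BASE[standard_form]
    sic2003, sic2007 = BRIDGING[bridging_code]
    return {"bridging_code": bridging_code, "sic2003": sic2003, "sic2007": sic2007}


print(dual_code("HORTICULTURAL MACHINERY SALE SERVICE"))
```

Keeping the link in a single bridging table means one match yields both codes, which is what makes dual coding possible.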
Conversion to SIC2007
• ACTR will deal with units that have a suitable
business description
• Conversion tables will deal with:
– Units with descriptions that ACTR is unable to code
(vague descriptions)
– Units without a description
– Units supplied through administrative sources (existing
VAT traders, PAYE employers, Registered Companies)
Creation of Conversion Tables
• Tables have been created to convert units from
SIC2003 to SIC2007:
– Using ACTR bridging codes
– Coding existing data through ACTR
– Producing cross-tabulation of SIC2003 to SIC2007
– Allocating on a probability basis rounded to nearest 5%
– Validate relationships against the acceptable range of
industries
• Best fit tables also produced for users who cannot
accommodate probability based conversion
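The steps above (cross-tabulate, convert to shares rounded to the nearest 5%, allocate by probability, or take a best fit) can be sketched in Python as follows. The function names, the round-half-up rule and the sample codes OLD1, NEW1 and NEW2 are assumptions made for illustration; the slides do not say how ties or rounding residuals are handled.

```python
import math
import random
from collections import Counter


def cross_tab(pairs):
    """Count units for each (SIC 2003, SIC 2007) pair coded through ACTR."""
    table = {}
    for old, new in pairs:
        table.setdefault(old, Counter())[new] += 1
    return table


def rounded_shares(counts: Counter, step: int = 5) -> dict:
    """Percentage share per SIC 2007 code, rounded half-up to the nearest step."""
    total = sum(counts.values())
    shares = {new: step * math.floor(100 * n / total / step + 0.5)
              for new, n in counts.items()}
    return {new: pct for new, pct in shares.items() if pct > 0}


def best_fit(counts: Counter) -> str:
    """Single most common SIC 2007 code, for users who cannot use probabilities."""
    return counts.most_common(1)[0][0]


def allocate(shares: dict, rng: random.Random) -> str:
    """Draw one SIC 2007 code with probability proportional to its share."""
    return rng.choices(list(shares), weights=list(shares.values()), k=1)[0]


# Hypothetical data: three units with the same old code.
table = cross_tab([("OLD1", "NEW1"), ("OLD1", "NEW1"), ("OLD1", "NEW2")])
shares = rounded_shares(table["OLD1"])
print(shares, best_fit(table["OLD1"]), allocate(shares, random.Random(0)))
```

Rounded shares need not sum to exactly 100, so a production table would need a rule for distributing the residual.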
Coding process
Impact on the IDBR at the Macro Level
• Impact on SIC 2003 is only on those reporting units
that have business descriptions for local units,
where ACTR can code.
– ACTR codes                    620,000
– ACTR does not code            210,000
– No business description       340,000
– Administrative data only    1,660,000
– Total local units           2,830,000
• SIC 2007 comes from the bridging codes only
where ACTR codes – otherwise SIC 2007 comes
from conversion from SIC 2003
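The rule for where a local unit's SIC 2007 code comes from can be applied to the counts above with a few lines of Python. The category labels and counts are from the slide; the function name and the source labels are illustrative.

```python
LOCAL_UNITS = {
    "ACTR codes": 620_000,
    "ACTR does not code": 210_000,
    "No business description": 340_000,
    "Administrative data only": 1_660_000,
}


def sic2007_source(category: str) -> str:
    """Bridging codes apply only where ACTR codes; otherwise convert from SIC 2003."""
    return "bridging code" if category == "ACTR codes" else "conversion from SIC 2003"


total = sum(LOCAL_UNITS.values())
by_source = {}
for category, units in LOCAL_UNITS.items():
    source = sic2007_source(category)
    by_source[source] = by_source.get(source, 0) + units

for source, units in by_source.items():
    print(f"{source}: {units:,} local units ({100 * units / total:.1f}%)")
```

With the slide's figures, 620,000 local units (about 22%) take their SIC 2007 code from bridging codes and 2,210,000 (about 78%) from conversion.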
SIC 2003
A  AGRICULTURE, HUNTING AND FORESTRY
B  FISHING
C  MINING AND QUARRYING
D  MANUFACTURING
E  ELECTRICITY, GAS AND WATER SUPPLY
F  CONSTRUCTION
G  WHOLESALE AND RETAIL TRADE; REPAIR OF MOTOR VEHICLES
H  HOTELS AND RESTAURANTS
I  TRANSPORT, STORAGE AND COMMUNICATION
J  FINANCIAL INTERMEDIATION
K  REAL ESTATE, RENTING AND BUSINESS ACTIVITIES
L  PUBLIC ADMINISTRATION AND DEFENCE; COMPULSORY SOCIAL SECURITY
M  EDUCATION
N  HEALTH AND SOCIAL WORK
O  OTHER COMMUNITY, SOCIAL AND PERSONAL SERVICE ACTIVITIES
P  PRIVATE HOUSEHOLDS EMPLOYING STAFF AND UNDIFFERENTIATED
Q  EXTRA-TERRITORIAL ORGANISATION AND BODIES
Impact at SIC 2003 broad industry level
(provisional counts)
Section       Starting stock      In     Out   Net Change
A&B                  167,000    0.5%    0.6%        -0.1%
C, D and E           180,000    5.9%    5.2%        +0.7%
F                    260,000    1.4%    0.9%        +0.5%
G                    530,000    2.4%    2.5%        -0.1%
H                    188,000    2.3%    1.6%        +0.7%
I                    116,000    2.7%    2.4%        +0.3%
J                     58,000    6.5%    3.3%        +3.2%
K                    872,000    1.2%    1.3%        -0.1%
L                     29,000   10.4%   11.1%        -0.7%
M, N and O           432,000    2.9%    3.8%        -0.9%
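The Net Change column is the In percentage minus the Out percentage, which the Python sketch below recomputes from the published figures. Treating the percentages as shares of each section's starting stock, and so turning them into approximate unit counts, is an assumption; the slide does not state the base.

```python
# (section, starting stock, in %, out %) as published on the slide.
ROWS = [
    ("A&B", 167_000, 0.5, 0.6),
    ("C, D and E", 180_000, 5.9, 5.2),
    ("F", 260_000, 1.4, 0.9),
    ("G", 530_000, 2.4, 2.5),
    ("H", 188_000, 2.3, 1.6),
    ("I", 116_000, 2.7, 2.4),
    ("J", 58_000, 6.5, 3.3),
    ("K", 872_000, 1.2, 1.3),
    ("L", 29_000, 10.4, 11.1),
    ("M, N and O", 432_000, 2.9, 3.8),
]


def net_change(pct_in: float, pct_out: float) -> float:
    """Net change in percentage points, rounded to one decimal place."""
    return round(pct_in - pct_out, 1)


for section, stock, pct_in, pct_out in ROWS:
    net = net_change(pct_in, pct_out)
    # Assumption: percentages are relative to the section's starting stock.
    approx_units = round(stock * net / 100)
    print(f"{section:<12}{net:+.1f}%  approx. {approx_units:+,} local units")
```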
SIC 2007
A  Agriculture, Forestry And Fishing
B  Mining And Quarrying
C  Manufacturing
D  Electricity, Gas, Steam And Air Conditioning Supply
E  Water Supply; Sewerage, Waste Management And Remediation Activities
F  Construction
G  Wholesale And Retail Trade; Repair Of Motor Vehicles And Motorcycles
H  Transportation And Storage
I  Accommodation And Food Service Activities
J  Information And Communication
K  Financial And Insurance Activities
L  Real Estate Activities
M  Professional, Scientific And Technical Activities
N  Administrative And Support Service Activities
O  Public Administration And Defence; Compulsory Social Security
P  Education
Q  Human Health And Social Work Activities
R  Arts, Entertainment And Recreation
S  Other Service Activities
T  Activities Of Households
Correspondence between SIC 2003 and
SIC 2007 for local units coded by ACTR
(each row lists the SIC 2007 sections receiving local units from
that SIC 2003 section, with the number of local units; empty cells
are omitted)
SIC2003   SIC2007
A         A 2147, C 2, N 851
B         A 473
C         B 684, C 18
D         A 2, C 29525, E 203, F 48, J 1819, S 138
E         D 807, E 856
F         F 27719, N 6
G         C 2, G 157840, H 180, S 1376
H         I 55729
I         H 20437, J 2911, N 6142, P 1
J         K 28198
K         C 23, F 3046, J 8264, K 855, L 18285, M 34010, N 44819, P 23, S 320
L         O 19552
M         P 34575
N         M 2440, Q 62873
O         E 2557, J 1499, M 4, N 111, P 389, R 24662, S 25549
Q         U 8
Implementation timetable
December 2006    NACE published
January 2007     SIC 2007 is published on NS website
February 2007    Development and tuning of data coder (ACTR) – first release on 2007 basis, subject to revision
June 2007        Re-coding using ACTR
August 2007      New release of ACTR, using SIC 2007 index
November 2007    SIC 2007 Index published (consistent with ACTR August 2007)
January 2008     SIC 2007 fully implemented on the Register
2008 ????        ACTR SIC 2003 overwrites historic SIC 2003
Conclusions
• The ACTR tool delivers considerable savings in terms of
cost and burden on businesses compared to traditional
survey approaches.
• The knowledge base is portable (i.e. independent of the
coding engine), so it can be shared with any interested
parties, e.g. administrative data suppliers, to increase the
consistency of coding.
• The use of bridging codes permits simultaneous coding to
multiple classification systems, essential if periods of dual
coding are required.
• The knowledge base approach can help to inform the
development of future versions of a classification, by
providing a reference frame of business activity
descriptions.