The Weighting Strategy of the Canadian Community Health Survey

Cathlin Sarafin
Methodologist
Statistics Canada
March 25, 2008
Outline
Introduction
Methodology
The Canadian Community Health Survey (CCHS)
The Multiple Frames
The Weighting Strategy of the CCHS
Methodology Recruitment Process
Introduction
Methodology Structure:
You: recruits are called Junior Methodologists
Your Unit: 2 to 7 Methodologists supervised by one Senior Methodologist
Your Section: 3 to 6 units working on related projects, managed by a Chief
Your Division: roughly 100 people, usually all together on one floor of the building
Introduction
Every person has their own responsibilities
Senior Methodologist outlines tasks
Discuss options and approaches as a team
Introduction
Survey Methodology:
Frame creation
Sampling
Questionnaire design
Data collection methods
Data processing
Edit and imputation
Weighting and estimation
Variance estimation
Data quality indicators
Record linkage
Time series
Data analysis
Disclosure control
Research and development
The CCHS
Collects general health information on the
Canadian population
Estimates produced for more than 120
Health Regions (HRs) across Canada
Produces estimates on:
Health Risk Factors
 Health Status
 Health Care Services

The CCHS
The CCHS was introduced in 2000

Data was collected every second year for a total
sample size of 130,000 per year
It was redesigned in 2007
Data is now collected continuously for a total
sample size of ≈ 65,000 respondents per year
 Annual files are released
 Multi-year files will be produced starting in 2009

The CCHS
A cross-sectional survey
 Survey a specific population for
a given period of time
A longitudinal survey
 Survey a specific population
repeatedly over time
The CCHS
Target population:
Individuals aged 12 and over living in private dwellings
Exclusions: those living on Indian Reserves and Crown Lands, residents of institutions, full-time members of the Canadian Forces and residents of some remote areas
CCHS covers ~98% of the Canadian population
The CCHS
Has a complex, multi-stage, dual frame
design
Area frame (49%)
 Telephone list frame (50%)
 Random digit dialing (RDD) frame (1%)

The telephone frames complement the area
frame in most HRs
The Area Frame
Units are geographical areas
Target sampling units are not listed
Based on Labour Force Survey (LFS) design
6 rotation groups
Stratified probability proportional to size (PPS) sample of clusters
Systematic sample of dwellings
Random selection of a start
Probabilistic sample of one individual per household
The Area Frame
LFS Sample Selection
1. Each province is divided into geographic strata
2. Clusters selected within strata (PPS sampling), 1st stage
3. Dwellings selected within clusters (systematic sampling), 2nd stage
4. People selected within responding dwellings, 3rd stage
[Figure: Province XYZ divided into Stratum #1 and Stratum #2]
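The dwelling stage of this selection can be illustrated with a minimal Python sketch of systematic sampling with a random start. The function name `systematic_sample` and the list of dwelling identifiers are hypothetical. This is not the LFS production procedure, only an illustration of the technique, assuming the dwellings of a cluster are held in an ordered list.

```python
import random

def systematic_sample(units, n, rng=None):
    """Select n units from an ordered list using a random start and a fixed interval."""
    rng = rng or random.Random()
    N = len(units)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the number of units")
    step = N / n                    # sampling interval (may be fractional)
    start = rng.random() * step     # random start in [0, step)
    picks = []
    for k in range(n):
        # Guard against floating-point rounding at the upper end of the list
        index = min(int(start + k * step), N - 1)
        picks.append(units[index])
    return picks

# Hypothetical cluster of 100 dwellings, sample of 10
dwellings = list(range(100))
sample = systematic_sample(dwellings, 10, random.Random(1))
```

Each call with a different random start yields a different, evenly spread set of dwellings, which is the sense in which a "start" identifies one systematic sample.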
The Area Frame
Why use such a design?
Stratification:
Better coverage of the entire region of interest
Increases precision
Clustering:
Efficient for interviewing (less travel, less costly)
Decreases precision
The Area Frame
The CCHS selection process:
The LFS provides a list of available starts
(systematic samples) within each cluster
 The clusters are mapped to the CCHS HRs
 A random selection of starts is chosen within
a HR
 Probabilistic sample of one individual per
household

The Area Frame
2-phase sample
1st phase is the LFS sample of starts within
the LFS strata
 2nd phase is the CCHS sample of starts within
the HRs

The Area Frame
Why use the LFS?
No adequate list of addresses available
Costly to create and maintain such a frame
LFS has good coverage of target population
It is a monthly survey conducted at Statistics Canada
Continually updated
The Telephone Frame
List of telephone numbers from across Canada
Created using InfoDirect© files
Stratified by HR
Simple random sample without replacement (SRSWOR) of phone numbers
Probabilistic sample of one individual per
household
The RDD Frame
Phone numbers are grouped into banks
Banks are assigned to a HR
A computer randomly generates the last 2
digits
Probabilistic sample of one individual per
household
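As a rough illustration of the RDD mechanism, the sketch below generates numbers within one bank. It assumes a bank is identified by every digit of the phone number except the last two, so a bank holds 100 possible numbers. The function name `rdd_numbers` and the example bank are hypothetical.

```python
import random

def rdd_numbers(bank, count, rng=None):
    """Generate `count` distinct phone numbers in a bank by drawing the last 2 digits at random."""
    rng = rng or random.Random()
    if not 0 <= count <= 100:
        raise ValueError("a bank holds at most 100 numbers")
    suffixes = rng.sample(range(100), count)   # last two digits, without replacement
    return [f"{bank}{suffix:02d}" for suffix in suffixes]

# Hypothetical bank: first 8 digits of a 10-digit number
numbers = rdd_numbers("61355501", 5, random.Random(0))
```

Many generated numbers will not belong to a household, which is why this frame produces a large share of out-of-scope numbers.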
Dual Frame Design
Multiple frames are used to:
Improve the coverage of the target population
 Reduce costs

Area Frame
Covers target population
 Costly to implement

Listing costs
 Face-to-face interview costs

Dual Frame Design
Telephone Frame
Only covers population with listed phone numbers
Undercoverage may bias the estimates
Growing problem with the increasing popularity of cell phones
Less costly to implement
Calls made from regional offices
Dual Frame Design
RDD Frame
Inefficient
Results in a large number of out-of-scope numbers
Used alone for 2 northern regions
LFS is not adequate for these 2 regions
Used as a complement to the area frame in Whitehorse and Yellowknife
Quality of telephone frame is considered poor in these regions
The Weighting Strategy of the CCHS
Area Frame
A0 - Initial weight
A1 - Sub-cluster adjustment
A2 - Stabilization
A3 - Out-of-scope dwellings
A4 - Household nonresponse
Telephone Frame
T0 - Initial weight
T1 - Number of collection periods
T2 - Out-of-scope numbers
T3 - Household nonresponse
T4 - Multiple phone lines
Combined Frame
I1 - Integration
I2 - Person selection
I3 - Person nonresponse
I4 - Winsorization
I5 - Calibration
Final CCHS Weight
Sampling Weights
Number of people in the population represented by the interviewed person
Ex: $w_i = 500$
Can be broken down into 3 major steps:
Design weights
Nonresponse adjustment
Calibration
Design Weights
Weights determined by the design of the survey
They are the inverse of the inclusion probability

A person selected according to a sampling fraction of
1% will have a weight of 1/0.01 = 100
The design weights in the CCHS are calculated
separately for each frame
Sampling fractions differ between HRs, therefore
design weights are not uniform
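The inverse-probability rule can be written as a short Python sketch. The function names `design_weight` and `weighted_total` are illustrative, and the survey values in the example are made up.

```python
def design_weight(inclusion_probability):
    """Design weight = inverse of the inclusion probability."""
    if not 0 < inclusion_probability <= 1:
        raise ValueError("inclusion probability must be in (0, 1]")
    return 1.0 / inclusion_probability

def weighted_total(values, weights):
    """Estimate a population total as the weighted sum of the sampled values."""
    return sum(w * y for y, w in zip(values, weights))

# A sampling fraction of 1% gives a weight of about 100
w = design_weight(0.01)

# Three hypothetical respondents with a yes/no (1/0) characteristic
estimate = weighted_total([1, 0, 1], [100, 100, 50])
```

In the second example the respondents with the characteristic carry weights 100 and 50, so the estimated total is 150 people.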
List Frame Design Weights
The sample is stratified by HR, so weights
are calculated within HR
It is an SRSWOR of phone numbers
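Probability of selection within HR g is
$$\pi_i = \frac{n_g}{N_g}$$
where $n_g$ is the number of phone numbers sampled in HR g and $N_g$ is the number of phone numbers on the frame in HR g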
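A minimal sketch of this stratified selection in Python follows, assuming the frame is available as a mapping from HR to its list of phone numbers. The names `srswor_by_hr`, `frame` and `sample_sizes` are hypothetical. Each selected number receives the design weight N_g / n_g, the inverse of its selection probability.

```python
import random

def srswor_by_hr(frame, sample_sizes, rng=None):
    """Draw an SRSWOR of phone numbers in each HR and attach the design weight N_g / n_g.

    frame:        dict mapping HR -> list of phone numbers on the frame
    sample_sizes: dict mapping HR -> number of phone numbers to select (n_g)
    """
    rng = rng or random.Random()
    selected = []
    for hr, numbers in frame.items():
        n_g = sample_sizes[hr]
        N_g = len(numbers)
        weight = N_g / n_g                       # inverse of n_g / N_g
        for number in rng.sample(numbers, n_g):  # sampling without replacement
            selected.append((hr, number, weight))
    return selected

# Two hypothetical HRs with different sampling fractions
frame = {"HR1": [str(i) for i in range(1000)], "HR2": [str(i) for i in range(200)]}
sample = srswor_by_hr(frame, {"HR1": 10, "HR2": 20}, random.Random(0))
```

Within each HR the weights of the selected numbers add up to the frame count N_g, and the weights differ between HRs because the sampling fractions differ.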
Area Frame Design Weights
The LFS is redesigned every 10 years
A 20-year sample plan is created
The LFS provides a list of available starts
Typically consists of 40 columns and 6 rows per LFS stratum
Each row represents a rotation group
Each column represents one monthly LFS sample
Area Frame Design Weights
LFS Stratum  Rotation  Cluster  Start  Cluster  Start  Cluster  Start
50           1         1        1      1        2      1        3
50           2         2        4      2        5      3        6
50           3         7        8      7        9      7        10
50           4         6        1      6        2      4        3
50           5         9        4      9        5      9        6
50           6         5        16     5        12     5        13
Area Frame Design Weights
The LFS provides a weight for one LFS sample
A weight for every start in one column
This weight is used to assign a weight to all available starts
$$W_s = \frac{W_{lfs} \times R}{S}$$
The weights are then redistributed to the CCHS selected starts within each HR
Nonresponse Adjustments
The design weights are corrected for total nonresponse (NR)
All the variables for the respondent are missing
Complete refusal
Unable to contact the respondent
Respondent absent for the duration of the survey
Language barrier
Information obtained is unusable
Nonresponse Adjustments
There are 2 types of NR in the CCHS
Household level
 Person level

The weights of the nonrespondents have to
be redistributed to the respondents

Form groups based on auxiliary information
NR Adjustments
There are several methods available for the
creation of response homogeneity groups
(RHGs)
The CCHS uses the scoring method
Logistic regression is used to obtain a probability of response ($\hat{p}$) for every unit
Groups are formed based on the values of $\hat{p}$
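The scoring idea can be sketched in Python as follows. The first function turns already-fitted logistic regression coefficients into a predicted response probability. The second groups units by the rank of that probability. Grouping into equal-sized classes is a simplification chosen for this sketch, not the method used in production, and the names `response_probability` and `form_rhgs` are hypothetical.

```python
import math

def response_probability(intercept, coefficients, covariates):
    """Predicted response probability from a fitted logistic regression model."""
    score = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-score))

def form_rhgs(p_hat, n_groups):
    """Assign each unit a group label 0..n_groups-1 based on the rank of its p_hat."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    labels = [0] * len(p_hat)
    for rank, i in enumerate(order):
        labels[i] = rank * n_groups // len(p_hat)
    return labels

# Six hypothetical units split into three groups of similar response propensity
groups = form_rhgs([0.9, 0.1, 0.5, 0.3, 0.7, 0.2], 3)
```

Units with similar predicted probabilities end up in the same group, so that response can be treated as roughly homogeneous within a group.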

NR Adjustments
Logistic Regression Models
Variables include geographic information,
process data and socio-economic indicators
 Variables derived from process data include:
 Number of attempts
 Time/day of attempt
 Called on weekday/weekend

NR Adjustments
Initial groups are formed using a clustering
algorithm in SAS
These groups are then collapsed to ensure:


A response rate of at least 50%
At least 20 observations
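The adjustment within each RHG is
$$a_{NR} = \frac{\sum_{i=1}^{n} W_{D,i}}{\sum_{i=1}^{r} W_{D,i}}$$
where $W_{D,i}$ is the design weight of unit i, n is the number of units in the RHG and r is the number of respondents in the RHG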
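The adjustment factor and the collapsing criteria can be sketched in Python. The factor is the sum of design weights over all units in an RHG divided by the sum over its respondents. The sketch assumes the 50% response rate criterion is checked on unweighted counts, which the slides do not specify. The names `nr_adjustment` and `needs_collapsing` are hypothetical.

```python
def nr_adjustment(weights, responded):
    """Ratio of the total design weight in an RHG to the design weight of its respondents."""
    total = sum(weights)
    respondent_total = sum(w for w, r in zip(weights, responded) if r)
    if respondent_total == 0:
        raise ValueError("an RHG needs at least one respondent")
    return total / respondent_total

def needs_collapsing(responded, min_rate=0.5, min_obs=20):
    """True if the RHG fails the size or response rate criterion (unweighted, by assumption)."""
    n = len(responded)
    rate = sum(responded) / n if n else 0.0
    return n < min_obs or rate < min_rate

# Hypothetical RHG: four units with weight 100, one nonrespondent
weights = [100.0, 100.0, 100.0, 100.0]
responded = [True, True, False, True]
factor = nr_adjustment(weights, responded)
adjusted = [w * factor for w, r in zip(weights, responded) if r]
```

After the adjustment, the respondents' weights add up to the total design weight of the group, so the nonrespondents' share has been redistributed.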
Integration of Frames
Area Frame
No phone line
Unlisted phone number
Telephone Frame
Listed phone number
Integration of Frames
[Figure: Area Frame: Population = A, Sample = $S_A$; Telephone Frame: Population = B, Sample = $S_B$]
$$\hat{Y}_{int} = \alpha \hat{Y}_{AB}^{S_A} + (1 - \alpha) \hat{Y}_{AB}^{S_B}$$
Integration
Integration factor: $\alpha$
A number between 0 and 1
For CCHS it is based on sample size
$$\alpha = \frac{n_A}{n_A + n_B}$$
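A small Python sketch of the sample-size-based factor follows. It assumes the integrated estimate takes the composite form alpha times the area frame estimate plus (1 - alpha) times the telephone frame estimate. The function names and the numbers in the example are hypothetical.

```python
def integration_factor(n_area, n_tel):
    """Integration factor based on sample sizes: n_A / (n_A + n_B)."""
    if n_area < 0 or n_tel < 0 or n_area + n_tel == 0:
        raise ValueError("sample sizes must be non-negative and not both zero")
    return n_area / (n_area + n_tel)

def integrated_estimate(y_area, y_tel, alpha):
    """Composite of the two frame estimates, weighted by alpha and 1 - alpha."""
    return alpha * y_area + (1 - alpha) * y_tel

# Hypothetical HR: 490 area frame units and 510 telephone frame units
alpha = integration_factor(490, 510)
combined = integrated_estimate(1000.0, 1200.0, alpha)
```

With these inputs alpha is 0.49, so the telephone frame estimate receives slightly more weight because it rests on the larger sample.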
Integration
Parameter of interest: $Y_{AB}$
Unbiased estimates
$$E\left(\hat{Y}_{AB}^{S_A}\right) = E\left(\hat{Y}_{AB}^{S_B}\right) = Y_{AB}$$
Integration
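Composite estimation
$$E\left(\alpha \hat{Y}_{AB}^{S_A} + (1 - \alpha) \hat{Y}_{AB}^{S_B}\right) = \alpha E\left(\hat{Y}_{AB}^{S_A}\right) + (1 - \alpha) E\left(\hat{Y}_{AB}^{S_B}\right)$$
Because both estimates are unbiased, the composite estimate is unbiased for any value of $\alpha$ between 0 and 1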
Integration of Frames
Possible to integrate only the overlapping populations covered by the 2 frames
$$\hat{Y}_{int} = \hat{Y}_{A}^{S_A} + \alpha \hat{Y}_{AB}^{S_A} + (1 - \alpha) \hat{Y}_{AB}^{S_B}$$
Problem identifying the overlapping portion for the area frame due to nonresponse
Possible to impute these cases
Integration of Frames
[Figure: Area Frame and Telephone Frame samples, with portions labelled $S_A$, $S_{AU}$, $S_{AB}$ and $S_B$]
Integration of Frames
Logistic regression is used to assign a probability p of belonging to the non-common part $S_A$
The final integration method is
$$\hat{Y}_{int} = p \hat{Y}_{A}^{S_A} + \alpha (1 - p) \hat{Y}_{AB}^{S_A} + (1 - \alpha) \hat{Y}_{AB}^{S_B}$$
Calibration
Weights are adjusted to match population
projection counts
Based on the Census
 Adjusted to account for births, deaths, immigration
and emigration

The rounded average of the monthly projection
counts is used within each post-stratum
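The adjustment to projection counts can be illustrated with a simple post-stratification sketch in Python. Each weight is scaled so that the weights in a post-stratum add up to that post-stratum's projection count. The function name `calibrate`, the post-stratum labels and the counts are hypothetical, and the production calibration may be more elaborate than this ratio adjustment.

```python
def calibrate(weights, post_strata, projections):
    """Scale weights so that they sum to the projection count in each post-stratum."""
    totals = {}
    for w, s in zip(weights, post_strata):
        totals[s] = totals.get(s, 0.0) + w
    return [w * projections[s] / totals[s] for w, s in zip(weights, post_strata)]

# Hypothetical weights before calibration, in two post-strata
weights = [100.0, 200.0, 300.0, 400.0]
post_strata = ["F 12-19", "F 12-19", "M 12-19", "M 12-19"]
projections = {"F 12-19": 600, "M 12-19": 350}
final = calibrate(weights, post_strata, projections)
```

In the example the first post-stratum is scaled up by a factor of 2 and the second is scaled down by a factor of 0.5, so that each matches its projection count.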
Calibration
Why is calibration used?
Gives confidence when estimating totals
Improves precision of the estimates
If auxiliary variables are well correlated with the survey variables
Adjusts for coverage inadequacies when the survey population differs from the target population
Calibration
In the CCHS

All post-strata with at least 20 observations are
calibrated at the HR by age by sex level
 HR: 120 across Canada
 Age groups: 12-19, 20-29, 30-44, 45-64 and 65+
 Sex: Male and Female
Calibration
Post-strata = HR by age by sex
Post-strata = Prov by age2 by sex
Example for one HR:
Age Group  Females: Number of Observations  Males: Number of Observations
12-19      15                               25
20-29      40                               40
30-44      53                               53
45-64      18                               22
65+        31                               31
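The 20-observation rule can be checked against the example counts with a short Python sketch. The function name `small_post_strata` is hypothetical. The sketch only flags the post-strata that fall below the threshold and does not attempt to reproduce how they are handled afterwards.

```python
def small_post_strata(counts, minimum=20):
    """Return the post-strata whose number of observations is below the minimum."""
    return sorted(key for key, n in counts.items() if n < minimum)

# Observation counts from the example HR, keyed by (sex, age group)
counts = {
    ("Females", "12-19"): 15, ("Males", "12-19"): 25,
    ("Females", "20-29"): 40, ("Males", "20-29"): 40,
    ("Females", "30-44"): 53, ("Males", "30-44"): 53,
    ("Females", "45-64"): 18, ("Males", "45-64"): 22,
    ("Females", "65+"): 31,   ("Males", "65+"): 31,
}
too_small = small_post_strata(counts)
```

In this example the two post-strata for females aged 12-19 and 45-64 fall below 20 observations, so they would not be calibrated at the HR by age by sex level.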
Final Weights
Master: Contains all variables for all respondents
Share: Contains all variables for the subset of people
who agreed to share (subset of records)
PUMF: Contains a subset of variables for all
respondents (subset of variables)
Dummy: Contains a subset of records from the
master file. Scrambled data used for testing and
remote access purposes
Bootstrap: Created for variance estimation purposes
Special Requests: linkage, different geographies, etc.
Methodology
Typical tasks:
Write computer programs to solve problems or
explore data
 Attend meetings
 Write documentation
 Present our work at seminars
 Work on different committees

Methodology
Working Conditions
Permanent job
 Continuous learning:

Computer courses
 Statistics and methodology courses
 Language courses
 Seminars, conferences and publications

Methodology
All methodologists work at the Head Office in Ottawa
Recruitment
Our recruitment campaign takes place each fall
Detailed presentations at the Universities by early
October
It is a 3-step process:
On-line application
Starts in September
Deadline in mid-October
Written Exam
Early November
Interview
January
Recruitment
Who can apply?
Persons residing in Canada and Canadian
citizens residing abroad
 Preference will be given to Canadian citizens

Bilingualism

No preference is given to those who speak both
English and French
For more information please visit
www.statcan.ca
Under:
About Us
Employment opportunities
Mathematical statisticians (MA)
Email: [email protected]
Telephone: 1-888-321-3089
Thank you