The Weighting Strategy of the Canadian Community Health Survey
Cathlin Sarafin
Methodologist
Statistics Canada
March 25, 2008
Outline
Introduction
Methodology
The Canadian Community Health Survey (CCHS)
The Multiple Frames
The Weighting Strategy of the CCHS
Methodology Recruitment Process
Introduction
Methodology Structure:
You
Your Unit
Recruits are called Junior Methodologists
2 to 7 Methodologists supervised by one Senior Methodologist
Your Section
3 to 6 units working on related projects, managed by a Chief
Your Division
A division has roughly 100 people, usually all together on one
floor of the building
Introduction
Every person has their own responsibilities
Senior Methodologist outlines tasks
Discuss options and approaches as a team
Introduction
Survey Methodology:
Frame creation
Sampling
Questionnaire design
Data collection methods
Data processing
Edit and imputation
Weighting and estimation
Variance estimation
Data quality indicators
Record linkage
Time series
Data analysis
Disclosure control
Research and development
The CCHS
Collects general health information on the
Canadian population
Estimates produced for more than 120
Health Regions (HRs) across Canada
Produces estimates on:
Health Risk Factors
Health Status
Health Care Services
The CCHS
The CCHS was introduced in 2000
Data was collected every second year for a total
sample size of 130,000 per year
It was redesigned in 2007
Data is now collected continuously for a total
sample size of ≈ 65,000 respondents per year
Annual files are released
Multi-year files will be produced starting in 2009
The CCHS
A cross-sectional survey
Survey a specific population for
a given period of time
A longitudinal survey
Survey a specific population
repeatedly over time
The CCHS
Target population:
Individuals living in private dwellings aged 12
years old and over
Exclusions: those living on Indian Reserves
and Crown Lands, residents of institutions,
full-time members of the Canadian Forces and
residents of some remote areas
CCHS covers ~98% of the Canadian
population
The CCHS
Has a complex, multi-stage, dual frame
design
Area frame (49%)
Telephone list frame (50%)
Random digit dialing (RDD) frame (1%)
The telephone frames complement the area
frame in most HRs
The Area Frame
Units are geographical areas
Target sampling units are not listed
Based on Labour Force Survey (LFS) design
6 rotation groups
Stratified probability proportional to size sample of
clusters
Systematic sample of dwellings
Random selection of a start
Probabilistic sample of one individual per household
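The systematic selection of dwellings with a random start can be illustrated with a minimal sketch. The dwelling list, the sample size and the function name below are hypothetical, and the sketch is not the LFS selection program itself.

```python
import random

def systematic_sample(units, n, rng):
    """Select n units by systematic sampling with a random start."""
    k = len(units) / n              # sampling interval (may be fractional)
    start = rng.random() * k        # random start in [0, k)
    picks = []
    for j in range(n):
        index = min(int(start + j * k), len(units) - 1)  # guard against rounding
        picks.append(units[index])
    return picks

rng = random.Random(2008)
dwellings = [f"dwelling_{i:03d}" for i in range(1, 101)]  # hypothetical listing
sample = systematic_sample(dwellings, 10, rng)
```

Each distinct random start yields a different "start" in the sense used on the slides: one complete systematic sample of dwellings within a cluster.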
The Area Frame
LFS Sample Selection
Diagram: Province XYZ divided into Stratum #1 and Stratum #2
1. Each province is divided into
geographic strata
2. Clusters selected within strata (PPS
sampling) 1st stage
3. Dwellings selected within clusters
(systematic sampling) 2nd stage
4. People selected within responding
dwellings 3rd stage
The Area Frame
Why use such a design?
Stratification:
Better coverage of the entire region of interest
Increases precision
Clustering:
Efficient for interviewing (less travel, less costly)
Decreases precision
The Area Frame
The CCHS selection process:
The LFS provides a list of available starts
(systematic samples) within each cluster
The clusters are mapped to the CCHS HRs
A random selection of starts is chosen within
a HR
Probabilistic sample of one individual per
household
The Area Frame
2-phase sample
1st phase is the LFS sample of starts within
the LFS strata
2nd phase is the CCHS sample of starts within
the HRs
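In a two-phase sample, the overall inclusion probability is the product of the first-phase probability and the second-phase probability given first-phase selection, and the weight is its inverse. The sketch below uses hypothetical probabilities; it illustrates the general principle and not the exact CCHS computation.

```python
from fractions import Fraction

def two_phase_weight(p_phase1, p_phase2_given_phase1):
    """Weight = inverse of the overall inclusion probability."""
    return 1 / (p_phase1 * p_phase2_given_phase1)

# Hypothetical values: a start has a 1-in-50 chance of being in the LFS
# sample of starts, and 2 of the 6 available starts in the HR are kept.
p_lfs = Fraction(1, 50)
p_cchs = Fraction(2, 6)
weight = two_phase_weight(p_lfs, p_cchs)
```

Exact fractions are used only to keep the arithmetic transparent; floating-point numbers would work equally well.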
The Area Frame
Why use the LFS?
No adequate list of addresses available
Costly to create and maintain such a frame
LFS has good coverage of target population
It is a monthly sample conducted at Statistics
Canada
Continually updated
The Telephone Frame
List of telephone numbers from across Canada
Created using InfoDirect© files
Stratified by HR
SRSWOR sample of phone numbers
Probabilistic sample of one individual per
household
The RDD Frame
Phone numbers are grouped into banks
Banks are assigned to a HR
Computer randomly generates the last 2
numbers
Probabilistic sample of one individual per
household
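A minimal sketch of the RDD idea: within a bank assigned to an HR, the computer draws the last 2 digits at random. The 8-digit bank format, the bank values and the HR label are assumptions made for illustration only.

```python
import random

def generate_rdd_numbers(bank, count, rng):
    """Draw `count` distinct numbers from one bank by randomizing the last 2 digits."""
    suffixes = rng.sample(range(100), count)   # 00 to 99, without replacement
    return [f"{bank}{s:02d}" for s in suffixes]

rng = random.Random(1)
banks_by_hr = {"HR-A": ["61355501", "61355502"]}   # hypothetical banks
numbers = [
    number
    for bank in banks_by_hr["HR-A"]
    for number in generate_rdd_numbers(bank, 3, rng)
]
```

Many generated numbers will not belong to a household, which is why the RDD frame produces a large share of out-of-scope numbers.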
Dual Frame Design
Multiple frames are used to:
Improve the coverage of the target population
Reduce costs
Area Frame
Covers target population
Costly to implement
Listing costs
Face-to-face interview costs
Dual Frame Design
Telephone Frame
Only covers population with listed phone
numbers
Undercoverage may bias the estimates
Growing problem with the increasing popularity of
cell phones
Less costly to implement
Calls made from regional offices
Dual Frame Design
RDD Frame
Inefficient
Results in a large amount of out-of-scope numbers
Used alone for 2 northern regions
LFS is not adequate for these 2 regions
Used as a complement to the area frame in
Whitehorse and Yellowknife
Quality of telephone frame is considered poor
in these regions
The Weighting Strategy of the CCHS
Area Frame
A0 - Initial weight
A1 - Sub-cluster adjustment
A2 - Stabilization
A3 - Out-of-scope dwellings
A4 - Household nonresponse
Telephone Frame
T0 - Initial weight
T1 - Number of collection periods
T2 - Out-of-scope numbers
T3 - Household nonresponse
T4 - Multiple phone lines
Combined Frame
I1 - Integration
I2 - Person selection
I3 - Person nonresponse
I4 - Winsorization
I5 - Calibration
Final CCHS Weight
Sampling Weights
Number of people in the population
represented by the interviewed person
Ex: $w_i = 500$
Can be broken down into 3 major steps:
Design weights
Nonresponse adjustment
Calibration
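Because each weight is the number of people a respondent represents, a population total is estimated by summing weight times value over respondents. A minimal sketch with hypothetical records and a hypothetical `smoker` indicator:

```python
# Hypothetical respondent records: final weight and a 0/1 characteristic.
records = [
    {"weight": 500.0, "smoker": 1},
    {"weight": 320.0, "smoker": 0},
    {"weight": 410.0, "smoker": 1},
    {"weight": 275.0, "smoker": 0},
]

# Estimated number of smokers in the population: sum of w_i * y_i.
estimated_smokers = sum(r["weight"] * r["smoker"] for r in records)

# Estimated population size: sum of the weights.
estimated_population = sum(r["weight"] for r in records)

# Estimated proportion of smokers.
estimated_rate = estimated_smokers / estimated_population
```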
Design Weights
Weights determined by the design of the survey
They are the inverse of the inclusion probability
A person selected according to a sampling fraction of
1% will have a weight of 1/0.01 = 100
The design weights in the CCHS are calculated
separately for each frame
Sampling fractions differ between HRs, therefore
design weights are not uniform
List Frame Design Weights
The sample is stratified by HR, so weights
are calculated within HR
It is an SRSWOR of phone numbers
Probability of selection within HR g is
$\pi_i = \frac{n_g}{N_g}$
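As a sketch of the list frame design weight: under SRSWOR within an HR, the weight is the inverse of the selection probability, that is, the number of phone numbers on the frame divided by the number sampled. The HR labels and counts are hypothetical.

```python
# Hypothetical frame and sample counts by health region.
frame_counts = {"HR-1": 20000, "HR-2": 5000}    # N_g: phone numbers on the frame
sample_counts = {"HR-1": 200, "HR-2": 100}      # n_g: phone numbers sampled

# Design weight = 1 / (n_g / N_g) = N_g / n_g, computed separately per HR.
design_weights = {
    hr: frame_counts[hr] / sample_counts[hr] for hr in frame_counts
}
```

The two regions end up with different weights because their sampling fractions differ (1% and 2%).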
Area Frame Design Weights
The LFS is redesigned every 10 years
A 20-year sample plan is created
The LFS provides a list of available starts
Typically consists of 40 columns and 6 rows
per LFS stratum
Each row represents a rotation group
Each column represents a monthly LFS sample
One LFS sample
Area Frame Design Weights
LFS Stratum   Rotation   Cluster   Start   Cluster   Start   Cluster   Start
50            1          1         1       1         2       1         3
50            2          2         4       2         5       3         6
50            3          7         8       7         9       7         10
50            4          6         1       6         2       4         3
50            5          9         4       9         5       9         6
50            6          5         16      5         12      5         13
Area Frame Design Weights
The LFS provides a weight for one LFS sample
A weight for every start in one column
This weight is used to assign a weight to all
available starts
$W_s = \frac{W_{lfs} \times R}{S}$
The weights are then redistributed to the CCHS
selected starts within each HR
Nonresponse Adjustments
The design weights are corrected for total
nonresponse (NR)
All the variables for the respondent are missing
Complete refusal
Unable to contact the respondent
Respondent absent for the duration of the survey
Language barrier
Information obtained is unusable
Nonresponse Adjustments
There are 2 types of NR in the CCHS
Household level
Person level
The weights of the nonrespondents have to
be redistributed to the respondents
Form groups based on auxiliary information
NR Adjustments
There are several methods available for the
creation of response homogeneity groups
(RHGs)
The CCHS uses the scoring method
Logistic regression is used to obtain a
probability of response ($\hat{p}$) for every unit
Groups are formed based on the values of $\hat{p}$
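A minimal sketch of grouping units by their predicted response probability. It assumes the probabilities have already been estimated, and it forms equal-sized classes by rank, which is a simple stand-in chosen for illustration and not necessarily the grouping procedure used in the CCHS.

```python
def score_groups(p_hat, n_groups):
    """Assign each unit to a group by ranking its predicted response probability."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    groups = [0] * len(p_hat)
    for rank, i in enumerate(order):
        # Units with the lowest probabilities go to group 0, and so on.
        groups[i] = rank * n_groups // len(p_hat)
    return groups

# Hypothetical predicted response probabilities for 8 sampled units.
p_hat = [0.91, 0.42, 0.67, 0.88, 0.35, 0.73, 0.58, 0.95]
groups = score_groups(p_hat, 4)
```

Units in the same group have similar predicted response probabilities, which is the property a response homogeneity group is meant to have.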
NR Adjustments
Logistic Regression Models
Variables include geographic information,
process data and socio-economic indicators
Variables derived from process data include:
Number of attempts
Time/day of attempt
Called on weekday/weekend
NR Adjustments
Initial groups are formed using a clustering
algorithm in SAS
These groups are then collapsed to ensure:
A response rate of at least 50%
At least 20 observations
The adjustment within each RHG is
$a_{NR} = \frac{\sum_{i=1}^{n} W_{D,i}}{\sum_{i=1}^{r} W_{D,i}}$
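The collapsing rules and the adjustment can be sketched as follows. The sketch assumes the groups are ordered by predicted response probability, merges a failing group with its neighbour, and uses an unweighted response rate; these are simplifying assumptions, and the data are hypothetical.

```python
def collapse_groups(groups, min_obs=20, min_rate=0.5):
    """Merge each failing group into a neighbour until all groups pass both rules.

    `groups` is a list of non-empty lists of (design_weight, responded) tuples.
    """
    groups = [list(g) for g in groups]
    i = 0
    while i < len(groups) and len(groups) > 1:
        g = groups[i]
        rate = sum(1 for _, responded in g if responded) / len(g)
        if len(g) < min_obs or rate < min_rate:
            j = i + 1 if i + 1 < len(groups) else i - 1
            lo, hi = min(i, j), max(i, j)
            groups[lo:hi + 1] = [groups[lo] + groups[hi]]
            i = 0                      # restart the check from the first group
        else:
            i += 1
    return groups

def nr_adjustment(group):
    """Sum of design weights of all units over sum for respondents only."""
    total = sum(w for w, _ in group)
    respondents = sum(w for w, responded in group if responded)
    return total / respondents

# Hypothetical groups: the first has only 10 units and a 40% response rate.
low = [(100.0, True)] * 4 + [(100.0, False)] * 6
high = [(100.0, True)] * 24 + [(100.0, False)] * 6
collapsed = collapse_groups([low, high])
factor = nr_adjustment(collapsed[0])
adjusted_total = sum(w * factor for w, responded in collapsed[0] if responded)
```

After the adjustment, the respondents carry the full weight of the group, so the weights of the nonrespondents have been redistributed rather than lost.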
Integration of Frames
Area Frame
No phone line
Unlisted phone number
Telephone Frame
Listed phone number
Integration of Frames
Area Frame: Population = A, Sample = $S_A$
Telephone Frame: Population = B, Sample = $S_B$
$\hat{Y}_{int} = \alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}$
Integration
Integration factor:
A number between 0 and 1
For CCHS it is based on sample size
$\alpha = \frac{n_A}{n_A + n_B}$
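A minimal sketch of integration applied at the weight level: area frame weights are multiplied by the integration factor (written `alpha` here, a name assumed for illustration) and telephone frame weights by one minus that factor. The weights are hypothetical, and both samples are treated as covering the same population.

```python
def integrate_weights(area_weights, phone_weights):
    """Scale the weights of two samples so that together they estimate one population."""
    n_a, n_b = len(area_weights), len(phone_weights)
    alpha = n_a / (n_a + n_b)          # integration factor based on sample size
    area = [w * alpha for w in area_weights]
    phone = [w * (1 - alpha) for w in phone_weights]
    return area, phone, alpha

# Hypothetical weights: each sample on its own estimates a population of 600.
area_weights = [200.0, 200.0, 200.0]
phone_weights = [600.0]
area, phone, alpha = integrate_weights(area_weights, phone_weights)
combined_total = sum(area) + sum(phone)
```

Each sample alone estimates 600 people; after integration the combined file still estimates 600 rather than 1,200.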
Integration
Parameter of interest: $Y_{AB}$
Unbiased estimates:
$E(\hat{Y}_{AB}^{S_A}) = E(\hat{Y}_{AB}^{S_B}) = Y_{AB}$
Integration
Composite estimation
$E[\alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}]$
$= \alpha E(\hat{Y}_{AB}^{S_A}) + (1-\alpha) E(\hat{Y}_{AB}^{S_B})$
Integration of Frames
Possible to integrate only the overlapping
populations covered by the 2 frames
$\hat{Y}_{int} = \hat{Y}_{A}^{S_A} + \alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}$
Problem identifying the overlapping portion for
the area frame due to nonresponse
Possible to impute these cases
Integration of Frames
Area Frame
SA
SAU
Telephone Frame
SB
SAB
Integration of Frames
Logistic regression is used to assign a
probability $p$ of belonging to the noncommon part of $S_A$
The final integration method is
$\hat{Y}_{int} = p\hat{Y}_{A}^{S_A} + \alpha(1-p)\hat{Y}_{AB}^{S_A} + (1-\alpha)\hat{Y}_{AB}^{S_B}$
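The final integration method can be read as a per-record weight factor. In the sketch below, `p_noncommon` is the probability that an area frame unit belongs to the part not covered by the telephone frame (1 if known to be noncommon, 0 if known to be common, a modelled value otherwise) and `alpha` is the integration factor. The function names and values are hypothetical.

```python
def integrated_area_weight(weight, alpha, p_noncommon):
    """Area frame record: full weight for the noncommon share, alpha for the common share."""
    return weight * (p_noncommon + alpha * (1.0 - p_noncommon))

def integrated_phone_weight(weight, alpha):
    """Telephone frame record: always in the common part."""
    return weight * (1.0 - alpha)

alpha = 0.5
no_listed_phone = integrated_area_weight(200.0, alpha, 1.0)   # weight kept in full
listed_phone = integrated_area_weight(200.0, alpha, 0.0)      # weight scaled by alpha
unknown_status = integrated_area_weight(200.0, alpha, 0.4)    # blend of the two cases
phone_record = integrated_phone_weight(200.0, alpha)
```

A unit with unknown telephone status gets a factor between the two extremes, weighted by how likely it is to lie outside the telephone frame.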
Calibration
Weights are adjusted to match population
projection counts
Based on the Census
Adjusted to account for births, deaths, immigration
and emigration
The rounded average of the monthly projection
counts is used within each post-stratum
Calibration
Why is calibration used?
Gives confidence when estimating totals
Improves precision of the estimates
If auxiliary variables are well correlated to the survey
variables
Adjusts for coverage inadequacies when the survey
population differs from the target population
Calibration
In the CCHS
All post-strata with at least 20 observations are
calibrated at the HR by age by sex level
HR: 120 across Canada
Age groups: 12-19, 20-29, 30-44, 45-64 and 65+
Sex: Male and Female
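A minimal post-stratification sketch: within each post-stratum, weights are scaled so that they sum to the projection count. Post-strata with fewer than 20 observations are simply left unchanged here, which is a simplification of this sketch and not a description of how the CCHS handles them. All records and counts are hypothetical.

```python
def calibrate(records, projections, min_obs=20):
    """Scale weights within each post-stratum to match its projection count."""
    sums, counts = {}, {}
    for r in records:
        key = r["post_stratum"]
        sums[key] = sums.get(key, 0.0) + r["weight"]
        counts[key] = counts.get(key, 0) + 1
    calibrated = []
    for r in records:
        key = r["post_stratum"]
        if counts[key] >= min_obs:
            factor = projections[key] / sums[key]
        else:
            factor = 1.0   # too few observations: left uncalibrated in this sketch
        calibrated.append({**r, "cal_weight": r["weight"] * factor})
    return calibrated

# Hypothetical post-strata (HR, age group, sex) and projection counts.
females = [{"post_stratum": ("HR-1", "12-19", "F"), "weight": 90.0} for _ in range(25)]
males = [{"post_stratum": ("HR-1", "12-19", "M"), "weight": 100.0} for _ in range(15)]
projections = {("HR-1", "12-19", "F"): 2500.0, ("HR-1", "12-19", "M"): 1800.0}
calibrated = calibrate(females + males, projections)
```

The female post-stratum has 25 observations and is scaled to 2,500; the male post-stratum has only 15 and keeps its original weights.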
Calibration
Post-strata = HR by age by sex
Post-strata = Prov by age2 by sex
Example: number of observations by age group and sex in one HR
Age Group   Females   Males
12-19       15        25
20-29       40        40
30-44       53        53
45-64       18        22
65+         31        31
Final Weights
Master: Contains all variables for all respondents
Share: Contains all variables for the subset of people
who agreed to share (subset of records)
PUMF: Contains a subset of variables for all
respondents (subset of variables)
Dummy: Contains a subset of records from the
master file. Scrambled data used for testing and
remote access purposes
Bootstrap: Created for variance estimation purposes
Special Requests: linkage, different geographies, etc.
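One common way to use bootstrap weights for variance estimation is to recompute the estimate with each set of bootstrap weights and average the squared deviations from the full-sample estimate. The sketch below assumes the bootstrap file holds one set of replicate weights per bootstrap sample; the layout, the values and the function names are hypothetical.

```python
def weighted_total(weights, y):
    """Estimated total: sum of weight times value."""
    return sum(w * v for w, v in zip(weights, y))

def bootstrap_variance(final_weights, bootstrap_weights, y):
    """Average squared deviation of the replicate estimates from the full estimate."""
    theta = weighted_total(final_weights, y)
    replicates = [weighted_total(bw, y) for bw in bootstrap_weights]
    return sum((t - theta) ** 2 for t in replicates) / len(replicates)

# Hypothetical data: 3 respondents, 3 bootstrap replicates.
y = [1, 0, 1]
final_weights = [100.0, 100.0, 100.0]
bootstrap_weights = [
    [150.0, 0.0, 150.0],
    [0.0, 150.0, 150.0],
    [150.0, 150.0, 0.0],
]
variance = bootstrap_variance(final_weights, bootstrap_weights, y)
```

A real bootstrap file would carry far more replicates than three; the small number here only keeps the arithmetic easy to follow.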
Methodology
Typical tasks:
Write computer programs to solve problems or
explore data
Attend meetings
Write documentation
Present our work at seminars
Work on different committees
Methodology
Working Conditions
Permanent job
Continuous learning:
Computer courses
Statistics and methodology courses
Language courses
Seminars, conferences and publications
Methodology
All methodologists work at the Head Office in Ottawa
Recruitment
Our recruitment campaign takes place each fall
Detailed presentations at the Universities by early
October
It is a 3 step process:
On-line application: starts in September, deadline in mid-October
Written Exam: early November
Interview: January
Recruitment
Who can apply?
Persons residing in Canada and Canadian
citizens residing abroad
Preference will be given to Canadian citizens
Bilingualism
No preference is given to those who speak both
English and French
For more information please contact
www.statcan.ca
Under:
About Us
Employment opportunities
Mathematical statisticians (MA)
Email: [email protected]
Telephone: 1-888-321-3089
Thank you
[email protected]
Canadian Community Health Survey
[email protected]