Energy Statistics
Overview of the Federal Statistical
System
› Agencies
› Types of survey data collected
Challenges
› Statistical Disclosure and confidentiality
› Implications
Headed by a Chief Statistician
Decentralized System in the United States
› 13 Agencies with a statistics oriented mission
› Statistical Agencies are located throughout
various agencies in the Federal Government
Examples: Census (Commerce Department),
Energy Information Administration
(Department of Energy), Bureau of Labor
Statistics (Department of Labor)
Where do the numbers come from?
Survey data
Regulations by OMB
› Response rates
› Legal obligations
› Confidentiality
Confidential Information Protection and
Statistical Efficiency Act of 2002 (CIPSEA) places the onus on federal employees to limit
disclosure
› Took over 4 years to implement (Anderson and
Seltzer)
Three ways to reduce disclosure risk within agencies:
› 1) Limiting identifiability of survey materials
within the organization
› 2) Restricting access to data
› 3) Restricting the contents that may be
released
Statistical disclosure: “the identification of an
individual (or of an attribute) through the
matching of survey data with information
available outside of the survey” (Groves et al.)
The federal government identifies three different
types of disclosure, each an inappropriate attribution of information to a
data subject, whether an individual or an organization:
› Identity: a data subject is identified from a released file
› Attribute: sensitive information about a data subject is revealed
through the released file
› Inferential: the released data make it possible to
determine the value of some characteristic of an
individual more accurately than otherwise would have
been possible (FCSM)
Need to provide information
› FOIA requests, Subpoenas
Satisfy requests for multiple clients. Must
keep track of all withheld information
Maintain utility of data while preserving
confidentiality
“Programming nightmare” to keep track
of the relationship between variables,
tables, and hierarchy
Specific Strategies
Data Swapping
Noise
Combining Cells
Rounding
Cell Suppression
Exchange of reported data values
across data records (Fienberg, Steele,
Makov, 1996)
Before swapping:
Number | Child | County | HH Edu. | HH Income | Race | Sex
4 | Pete | Alpha | High | 61 | W | M
  | Alfonso | Beta | Very High | 61 | W | M
After swapping:
Number | Child | County | HH Edu. | HH Income | Race | Sex
4 | Alfonso | Alpha | Very High | 61 | W | M
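The exchange shown in the swapping example can be sketched in a few lines. This is a minimal illustration, not the procedure any agency actually uses: the function name `swap_fields` and the dictionary keys are hypothetical, and the choice to swap the child name and household education (while leaving county, income, race, and sex in place) is read off the example records.

```python
def swap_fields(rec_a, rec_b, fields):
    """Return copies of two records with the named fields exchanged."""
    a, b = dict(rec_a), dict(rec_b)
    for f in fields:
        a[f], b[f] = rec_b[f], rec_a[f]
    return a, b

# Two records that agree on income, race, and sex (values from the example).
pete = {"child": "Pete", "county": "Alpha", "hh_edu": "High",
        "hh_income": 61, "race": "W", "sex": "M"}
alfonso = {"child": "Alfonso", "county": "Beta", "hh_edu": "Very High",
           "hh_income": 61, "race": "W", "sex": "M"}

# Exchange only the chosen fields; everything else stays with its record.
released, partner = swap_fields(pete, alfonso, ["child", "hh_edu"])
print(released)
```

The released record keeps County Alpha but carries the swapped name and education level, so a match on county no longer pins the other attributes to the original respondent.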
Assign a multiplying factor, or noise factor
to all data
› For example: the value of a randomly
generated variable might be added to each
value in a dataset
“protect individual establishments without
compromising the quality of our estimates”
Pro: More data can be published, less
complicated, less time consuming
Problem: perturbing ALL data, non-sensitive
and sensitive alike
How is this done: Use Multipliers
› The standard is to perturb data by about 10%
› Use multipliers ranging from .9 to 1.1
› Must preserve trend in data- otherwise useless for
client’s analysis
› Use distributions to control variance (examples)
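The multiplier approach can be sketched as follows. This is a minimal illustration that assumes multipliers drawn uniformly from 0.9 to 1.1; the slides do not say which distribution is actually used, and the function name and sample values are hypothetical.

```python
import random

def add_multiplicative_noise(values, low=0.9, high=1.1, seed=None):
    """Multiply each value by a random factor drawn uniformly from [low, high]."""
    rng = random.Random(seed)  # seeded generator so a run can be reproduced
    return [v * rng.uniform(low, high) for v in values]

# Hypothetical establishment-level values.
reported = [1200.0, 850.0, 15000.0, 430.0]
perturbed = add_multiplicative_noise(reported, seed=42)
for before, after in zip(reported, perturbed):
    print(f"{before:>10.1f} -> {after:>10.1f}")
```

Because every factor stays within about 10% of 1, large values remain large and small values remain small, which is one simple way to keep the overall trend usable.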
Before Tabulation Strategies: Data Swapping;
Data Perturbation (Noise)
Tables of Frequencies
› Percent of population with certain characteristics
› With outside knowledge- respondents with unique
characteristics can be identified
› Sensitive information: identified by threshold
Tables of magnitude data
› Aggregate data, such as income of individuals,
revenues of companies
› Extreme values
› Sensitive information: identified by linear sensitivity
measure
Changing the values of outlier cases,
since outliers are more likely to be
sample or population uniques
Top coding: taking the largest values on
a variable and giving them the same
code value in the dataset
› For example: place all companies
producing more than 100,000 barrels of oil
per day in one category
Non-uniques are unperturbed
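Top coding can be sketched as below. This is a minimal illustration that assumes values above the cutoff are recoded to the cutoff itself; a real release might use a category label instead. The function name and sample figures are hypothetical.

```python
def top_code(values, cap=100_000):
    """Give every value above the cap the same coded value (the cap itself)."""
    return [min(v, cap) for v in values]

# Hypothetical barrels-per-day figures for five companies.
production = [20_000, 75_000, 150_000, 400_000, 99_000]
print(top_code(production))
```

Values at or below the cutoff pass through unchanged, which matches the point that non-uniques are unperturbed.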
Rounding is similar to noise: cells are rounded, and a
random decision is made whether to
round up or down
› Example: round values to a multiple of 5
Write x - r = 5q, where
x = cell value
q = non-negative integer
r = remainder (0 ≤ r < 5)
Round up to 5 × (q+1) with probability r/5
Round down to 5 × q with probability 1 - r/5
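The random rounding rule can be sketched as follows. This is a minimal illustration for non-negative integer cell values; the function name and the `rng` parameter are assumptions made for the example.

```python
import random

def random_round(x, base=5, rng=random):
    """Randomly round a non-negative integer x to a multiple of base.

    Rounds up with probability r/base and down with probability 1 - r/base,
    where r is the remainder of x divided by base.
    """
    q, r = divmod(x, base)
    if r == 0:
        return x  # already a multiple of base
    round_up = rng.random() < r / base
    return base * (q + 1) if round_up else base * q

rng = random.Random(7)  # seeded so a run can be reproduced
print([random_round(x, 5, rng) for x in [3, 12, 20, 48]])
```

With these probabilities the expected rounded value is 5q + 5(r/5) = x, so the rounding does not bias the cell on average.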
n-k rule
p% rule
If upper or lower estimates for the respondent’s
value are closer to the reported value than
some prespecified percentage (p) of the total
cell value, the cell is sensitive (Groves, 372).
Assumptions: Any respondent can
estimate the contribution of another
respondent within 100% of its value
The second-largest respondent can use
their reported value and attempt to
estimate the largest reported value, x1
A cell is sensitive if:
S > 0
where S = x1 - (100/p) × (T - x2 - x1) and T is the total cell value
For a given cell with N respondents,
arrange the data in order from large to
small: x1 ≥ x2 ≥ … ≥ xN ≥ 0
Consider the cell 18,177.
N = 3; x1 = 17,000; x2 = 1,000; x3 = 177; p = 15
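Working the 18,177 cell through the sensitivity formula, with the values given above and T taken as the cell total:

```latex
\begin{aligned}
S &= x_1 - \frac{100}{p}\,(T - x_2 - x_1) \\
  &= 17{,}000 - \frac{100}{15}\,(18{,}177 - 1{,}000 - 17{,}000) \\
  &= 17{,}000 - \frac{100}{15}\cdot 177 \\
  &= 17{,}000 - 1{,}180 = 15{,}820 > 0
\end{aligned}
```

Since S > 0, the cell is sensitive under the p% rule with p = 15.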
If a small number (n) of the respondents contribute a large
percentage (k) to the total cell value then the cell is sensitive
(Groves 372)
We are publishing production data of how
many barrels a day of crude oil each
refinery produces. This is secret information.
If our competitors found out, it could be
detrimental to our business.
There are 4 collectors in the state with
collections of 100, 50, 25, and 5 respectively
Find out whether this information should be
released using the n-k rule with (n, k) = (2,
85). What about the p% rule (p = 35%)?
Using the p% rule, this cell is sensitive.
However, it is not sensitive by the n-k rule
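Both checks can be reproduced with a short script. This is a minimal sketch of the two rules as stated in these slides; the function names are hypothetical, and production disclosure software handles many cases (ties, negative values, related tables) that this ignores.

```python
def p_percent_sensitive(contributions, p):
    """p% rule: sensitive if S = x1 - (100/p) * (T - x2 - x1) > 0."""
    xs = sorted(contributions, reverse=True)
    remainder = sum(xs[2:])  # T - x2 - x1: everyone except the two largest
    return xs[0] - (100 / p) * remainder > 0

def nk_sensitive(contributions, n, k):
    """n-k rule: sensitive if the n largest contribute more than k% of the total."""
    xs = sorted(contributions, reverse=True)
    return 100 * sum(xs[:n]) / sum(xs) > k

cell = [100, 50, 25, 5]
print(p_percent_sensitive(cell, p=35))  # True:  100 - (100/35) * 30 is about 14.3
print(nk_sensitive(cell, n=2, k=85))    # False: 150 / 180 is about 83.3%
```

The two largest respondents hold about 83.3% of the total, just under the 85% threshold, while the remaining 30 units are too small to hide the largest value from the second-largest respondent within 35%.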
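System of inequalities, where Z1 and Z2 are the percentages of the cell total
contributed by the largest and second-largest respondents:
p% rule (p = 35): sensitive if Z2 > 100 - 1.35Z1
(n,k) rule (2, 85): sensitive if Z2 > 85 - Z1
Variable constraints:
Z2 ≤ Z1
Z1 + Z2 ≤ 100
The example cell has (Z1, Z2) = (55.56, 27.78), which satisfies the p% inequality
but not the (n,k) inequality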
Primary Suppressions: The sensitive Cell
Complementary/Secondary Suppressions:
Additional withheld data to ensure that the
primary suppressions cannot be derived by
linear combination
Goal: Minimize information lost. This is
accomplished by selecting smallest possible
cell values for complementary cell
suppression
Problem: Often requires a substantial
amount of data to be withheld. Potential
for errors may lead to the release of
confidential data
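The need for complementary suppression can be seen in a small sketch: if only one cell in a row is withheld and the row total is published, the withheld value follows by subtraction. The function name and the row layout are hypothetical; the values reuse the 100, 50, 25, 5 example.

```python
def recover_single_suppression(row, row_total):
    """Return the hidden value if exactly one cell (None) is suppressed, else None."""
    hidden = [i for i, v in enumerate(row) if v is None]
    if len(hidden) != 1:
        return None  # zero or several suppressed cells: subtraction alone fails
    return row_total - sum(v for v in row if v is not None)

# Primary suppression only: the 50 cell is withheld, the total 180 is published.
print(recover_single_suppression([100, None, 25, 5], 180))     # 50

# Adding one complementary suppression blocks this simple subtraction attack.
print(recover_single_suppression([100, None, 25, None], 180))  # None
```

In a real table the same check has to hold across every row, column, and related table at once, which is why an audit step is needed after cells are chosen.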
Small Tables:
› Manual suppression
› Computerized audit procedures
Large Tables:
› Much more complex, especially with related
tables and hierarchical data
› Consistency
Let’s return to a previous example: Sales
Revenue
We determined that the cell
must be suppressed. How do we
accomplish this?
High levels of security and suppression
are necessary to protect data, because the data guide
real-life policy decisions.
The quality of the data depends not
only on a high response rate but also on accurate
responses
Producing data is a function of “public
trust”
However, the point of data collection is its
use and analysis. The tradeoff between
confidentiality and utilization must be
examined
Patriot Act 2001 (Anderson & Seltzer)
Section 508: Disclosure of information from
National Center for Education Statistics
surveys
Justice Department is able to obtain and
use for investigation and prosecution
reports, records, and information (including
individually identifiable information)
The Patriot Act overrides the National
Education Statistics Act of 1994, which
protects confidentiality
Second War Powers Act (1942-1947)
Repealed confidentiality protections of Title 13
governing the US Census Bureau (Anderson &
Seltzer)
Japanese Americans and Internment camps (USA
Today)
2004 data on Arab-Americans (NYT)
› Released number of Arab-Americans per zip
code
› Categorized by country of origin: Egyptian,
Iraqi, Jordanian, Lebanese, Moroccan,
Palestinian, Syrian and two general
categories, "Arab/Arabic" and "Other Arab."
› Data obtained from a sample (the long form
of the census)
…the next time you fill out a survey, think
about where your information may (or
may not) be used.
Clemetson, Lynette. “Homeland Security given data on Arab-Americans.” New York Times. July 30, 2004. http://www.nytimes.com/2004/07/30/politics/30census.html
El Nasser, Haya. “Papers show Census role in WWII Camps.” USA Today. March 30, 2007. http://www.usatoday.com/news/nation/2007-03-30census-role_N.htm
“DoD releases FY 2010 Budget Proposal.” US Department of Defense. May 7, 2009. http://www.defenselink.mil/releases/release.aspx?releaseid=12652
Seltzer, William and Margo Anderson. “NCES and the Patriot Act.” Paper prepared for the Joint Statistical Meetings. 2002. http://www.uwm.edu/~margo/govstat/jsm.pdf
Evans, Timothy, Laura Zayatz, and John Slanta. “Using Noise for Disclosure Limitation of Establishment Tabular Data.” US Census Bureau. 1996. http://www.census.gov/prod/2/gen/96arc/iiaevans.pdf
“Statistical Programs of the US Government.” Office of Management and Budget. 2009. http://www.whitehouse.gov/omb/assets/information_and_regulatory_affairs/09statprog.pdf
Sullivan, Colleen. “An Overview of Disclosure Principles.” US Census Bureau. 1992. http://www.2010census.biz/srd/papers/pdf/rr92-09.pdf
“Statistical Policy Working Paper: Report on Statistical Disclosure Methodology.” Federal Committee on Statistical Methodology. 2005. http://www.fcsm.gov/workingpapers/SPWP22_rev.pdf
Groves, Robert et al. Survey Methodology. Hoboken, NJ: John Wiley & Sons. 2004.
http://jpc.cylab.cmu.edu/journal/2009/vol01/issue01/issue01.pdf
http://www.census.gov/srd/sdc/papers.html
http://www.census.gov/srd/sdc/abowdwoodcock2001-appendix-only.pdf