Census Disclosure
Helper functions related to disclosure.
See the procedures handbook for information about the logistics/procedure of release.
See the methods handbook for resources on disclosure avoidance methods (e.g. rounding, noise, local-area estimates).
- oi_tools.drb.MINIMUM_CELL_SIZES: Mapping[Literal['national', 'state', 'substate', 'zip'], int] = {'national': 3, 'state': 10, 'substate': 20, 'zip': 100}
The minimum number of observations for a valid cell, taken from section V.A of the procedures handbook.
- oi_tools.drb.create_drb_folders(
- df: pl.DataFrame,
- output_folder: Path | str,
- output_filename: str,
- *,
- count_columns: IntoPolarsSelector = [],
- proportion_columns: IntoPolarsSelector = [],
- other_columns: IntoPolarsSelector = [],
- allow_nulls: bool = False,
- sample_size_column: IntoPolarsExpression = 'n',
- overwrite: bool = False,
Round the specified columns and save both raw and rounded CSVs for release.
Takes an input data frame and creates the following folder structure under output_folder:
{output_folder}/
    raw/{output_filename}.csv          # unmodified data
    to_disclose/{output_filename}.csv  # rounded/masked data
Columns are rounded according to the rules described in round_and_mask_columns().
- Parameters:
df (pl.DataFrame) – Input DataFrame. Should contain no null values unless allow_nulls=True.
output_folder (Path | str) – Root directory for output (e.g. drb/2026_01_01/data). Sub-directories raw/ and to_disclose/ are created automatically.
output_filename (str) – Base name (without extension) for the output CSV files.
count_columns (IntoPolarsSelector) – Columns containing unweighted counts to be rounded per section V.B.3 of the handbook (see round_count_column()).
proportion_columns (IntoPolarsSelector) – Columns containing proportions or ratios to be rounded per section V.B.4 of the handbook (see round_proportion_column()).
other_columns (IntoPolarsSelector) – All other estimate columns (e.g. weighted means, regression coefficients) to be rounded to 4 significant figures per section V.B.1 of the handbook.
sample_size_column (IntoPolarsExpression) – Column containing the sample size.
overwrite (bool) – If False (default), raise FileExistsError when either output file already exists.
allow_nulls (bool) – If False (default), assert that neither the input nor the rounded DataFrame contains null values.
- Returns:
Total number of non-null estimates in the output DataFrame.
- Return type:
int
- Raises:
AssertionError – If allow_nulls=False and nulls are found in the input or output DataFrame.
FileExistsError – If overwrite=False and an output file already exists.
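The folder layout and the overwrite check described above can be sketched with the standard library alone. This is only an illustration, not the implementation of create_drb_folders(): the helper name write_release_csvs, its arguments, and the sample rows are hypothetical, and the real function takes a Polars DataFrame and performs the rounding itself.

```python
import csv
import tempfile
from pathlib import Path


def write_release_csvs(raw_rows, rounded_rows, output_folder, output_filename, overwrite=False):
    """Hypothetical helper: write raw rows to raw/ and rounded rows to to_disclose/."""
    root = Path(output_folder)
    targets = {"raw": raw_rows, "to_disclose": rounded_rows}
    paths = {name: root / name / f"{output_filename}.csv" for name in targets}

    # Check both targets before writing so a failure leaves nothing half-written.
    if not overwrite:
        for path in paths.values():
            if path.exists():
                raise FileExistsError(path)

    for name, rows in targets.items():
        # Sub-directories are created on demand.
        paths[name].parent.mkdir(parents=True, exist_ok=True)
        with paths[name].open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return paths


# Made-up sample rows; the rounded values are written by hand for illustration.
raw = [{"n": 123, "mean": 0.123456}]
rounded = [{"n": 100, "mean": 0.1235}]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_release_csvs(raw, rounded, tmp, "estimates")
    created = sorted(p.relative_to(tmp).as_posix() for p in paths.values())
    try:
        # A second write with overwrite=False should be refused.
        write_release_csvs(raw, rounded, tmp, "estimates")
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True
```

Checking for existing files before writing either one is a design choice of this sketch; the source only states that FileExistsError is raised when either output file already exists.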
- oi_tools.drb.round_and_mask_columns(
- df: DataFrame,
- *,
- count_columns: Collection[str] | Selector = [],
- proportion_columns: Collection[str] | Selector = [],
- other_columns: Collection[str] | Selector = [],
- n: str | Expr | int | float = 'n',
- geographic_level: Literal['national', 'state', 'substate', 'zip'] = 'national',
Mask small cells and round the columns of a DataFrame for disclosure.
Adheres to the rules defined in sections V.A and V.B of the disclosure methods handbook:
Small cells are censored according to Section V.A (see mask_small_cells()).
Counts are rounded according to section V.B.3 (see round_count_column()).
Ratio/proportions are rounded according to section V.B.4 (see round_proportion_column()).
Other estimates (including regression output and weighted means/variances) are rounded to four significant figures (see section V.B.1).
- Parameters:
df (DataFrame) – Input DataFrame to round and mask.
count_columns (Collection[str] | Selector) – Columns containing unweighted counts (rounded per section V.B.3).
proportion_columns (Collection[str] | Selector) – Columns containing proportions or ratios (rounded per section V.B.4).
other_columns (Collection[str] | Selector) – All other estimate columns (rounded to 4 significant figures).
n (str | Expr | int | float) – The sample size used to determine the rounding rule.
geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level of the estimates, used to determine the minimum cell size threshold (see MINIMUM_CELL_SIZES).
- Returns:
A copy of df with the specified columns masked and rounded.
- Return type:
pl.DataFrame
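For the "other estimates" rule, rounding to four significant figures can be sketched on plain Python floats as follows. The function name round_to_significant is hypothetical, and this scalar version only illustrates the arithmetic; the library itself operates on Polars columns and may handle ties and edge cases differently.

```python
import math


def round_to_significant(value, digits=4):
    """Round a float to `digits` significant figures (illustrative sketch)."""
    # Zero and non-finite values have no meaningful leading digit; return them unchanged.
    if value == 0 or not math.isfinite(value):
        return value
    # Position of the leading digit, e.g. 5 for 123456.789 and -3 for 0.00123.
    exponent = math.floor(math.log10(abs(value)))
    # Keep `digits` figures counted from the leading digit.
    return round(value, digits - 1 - exponent)


large = round_to_significant(123456.789)
small = round_to_significant(0.00123456)
negative = round_to_significant(-2.718281828)
```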
- oi_tools.drb.mask_small_cells(
- x: Expr,
- n: Expr,
- geographic_level: Literal['national', 'state', 'substate', 'zip'],
Mask cells that fail the cell-size cutoffs.
From section V.A of the disclosure methods handbook:
For Title 26 counts and estimates from Internal Revenue Service (IRS) data and commingled data (from the Census Bureau and the IRS), we enforce the following thresholds based on IRS requirements:
At least 3 entities (unique firms, persons, or households) for national estimates
At least 10 entities for state-level estimates
At least 20 entities for substate-level estimates, except for zip codes
At least 100 entities for ZIP code-level estimates
- Parameters:
x (Expr) – The expression to mask.
n (Expr) – The sample size used to determine the suppression rule.
geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level used to look up the minimum cell size in MINIMUM_CELL_SIZES.
- Returns:
x where n meets the threshold, null otherwise.
- Return type:
pl.Expr
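The threshold lookup can be illustrated on scalar values. This is a sketch rather than the library's code: mask_value is a hypothetical name, None stands in for a Polars null, and the dictionary simply copies the documented value of MINIMUM_CELL_SIZES.

```python
# Copied from the documented value of oi_tools.drb.MINIMUM_CELL_SIZES.
MINIMUM_CELL_SIZES = {"national": 3, "state": 10, "substate": 20, "zip": 100}


def mask_value(x, n, geographic_level):
    """Return x when n meets the cell-size threshold, otherwise None (sketch)."""
    threshold = MINIMUM_CELL_SIZES[geographic_level]
    # "At least k entities" means the cell passes when n >= k.
    return x if n >= threshold else None


kept = mask_value(0.42, 10, "state")      # exactly at the state threshold
masked = mask_value(0.42, 9, "state")     # one below the state threshold
zip_masked = mask_value(0.42, 99, "zip")  # below the ZIP code threshold
```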
- oi_tools.drb.round_count_column(
- col: Expr,
- n: Expr,
Round a count column to the appropriate precision.
From section V.B.3 of the disclosure methods handbook:
The rounding rule for unweighted counts is as follows:
If N is less than 15, report N < 15
If N is between 15 and 99, round to the nearest 10
If N is between 100-999, round to the nearest 50
If N is between 1,000-9,999, round to the nearest 100
If N is between 10,000-99,999, round to the nearest 500
If N is between 100,000-999,999, round to the nearest 1,000
If N is 1,000,000 or more, round to four significant digits as described earlier.
- Parameters:
col (Expr) – The count expression to round.
n (Expr) – The sample size used to determine the rounding rule.
- Returns:
A Polars expression with rounded values. Returns null when n < 15.
- Return type:
pl.Expr
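The count-rounding tiers can be sketched for a single integer as below. The function name round_count is hypothetical and None stands in for null. Two points are assumptions that the handbook excerpt does not settle: ties are rounded up (125 becomes 150), and the four-significant-digit tier is applied to the value being rounded.

```python
def round_count(value, n):
    """Round an unweighted integer count using the tier selected by n (sketch)."""
    if n < 15:
        return None  # reported as "N < 15"; None stands in for null
    if n < 100:
        base = 10
    elif n < 1_000:
        base = 50
    elif n < 10_000:
        base = 100
    elif n < 100_000:
        base = 500
    elif n < 1_000_000:
        base = 1_000
    else:
        # Four significant digits: keep the leading four digits of the value.
        base = 10 ** max(len(str(abs(value))) - 4, 0)
    # Integer round-half-up to the nearest multiple of base (tie rule is an assumption).
    return (value + base // 2) // base * base


examples = {n: round_count(n, n) for n in (14, 47, 125, 1234, 12740, 123456, 1234567)}
```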
- oi_tools.drb.round_proportion_column(
- col: Expr,
- n: Expr,
Round a non-count column to the appropriate number of significant figures.
From section V.B.4 of the disclosure methods handbook:
For thresholds based on an unweighted denominator (D) and an unrounded unweighted proportion (P):
If 15 <= D < 100 then P should be rounded to 1 significant digit
Else if D < 1,000 then P should be rounded to no more than 2 significant digits
Else if D < 10,000 then P should be rounded to no more than 3 significant digits
Else if D >= 10,000 then P should be rounded to no more than 4 significant digits.
- Parameters:
col (Expr) – The proportion/ratio expression to round.
n (Expr) – Denominator used to determine the number of significant digits.
- Returns:
A Polars expression with rounded values. Returns null when n < 15.
- Return type:
pl.Expr
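The denominator thresholds above map to a number of significant digits, which can be sketched for a single value as follows. Both function names are hypothetical, None stands in for null, and the rounding goes through scientific-notation formatting purely for brevity; the library's Polars expression may round differently at exact ties.

```python
def significant_digits_for(denominator):
    """Number of significant digits allowed for an unweighted denominator D (sketch)."""
    if denominator < 15:
        return None  # too few observations; the value is not reported
    if denominator < 100:
        return 1
    if denominator < 1_000:
        return 2
    if denominator < 10_000:
        return 3
    return 4


def round_proportion(p, denominator):
    """Round p to the number of significant digits allowed by the denominator."""
    digits = significant_digits_for(denominator)
    if digits is None:
        return None
    # Scientific notation with digits - 1 decimals keeps exactly `digits` significant figures.
    return float(f"{p:.{digits - 1}e}")


rounded = [round_proportion(0.123456, d) for d in (14, 50, 500, 5_000, 50_000)]
```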