Census Disclosure

Helper functions for preparing estimates for disclosure: masking small cells, rounding, and writing raw and rounded CSV files.

oi_tools.drb.MINIMUM_CELL_SIZES: Mapping[Literal['national', 'state', 'substate', 'zip'], int] = {'national': 3, 'state': 10, 'substate': 20, 'zip': 100}

The minimum number of observations for a valid cell at each geographic level, taken from section V.A of the disclosure methods handbook.

oi_tools.drb.create_drb_folders(
df: pl.DataFrame,
output_folder: Path | str,
output_filename: str,
*,
count_columns: IntoPolarsSelector = [],
proportion_columns: IntoPolarsSelector = [],
other_columns: IntoPolarsSelector = [],
allow_nulls: bool = False,
sample_size_column: IntoPolarsExpression = 'n',
overwrite: bool = False,
) → int

Round the specified columns and save both raw and rounded CSVs for release.

Takes an input data frame and creates the following folder structure under output_folder:

{output_folder}/
  raw/{output_filename}.csv          # unmodified data
  to_disclose/{output_filename}.csv  # rounded/masked data

Columns are rounded according to the rules described in round_and_mask_columns().

Parameters:
  • df (pl.DataFrame) – Input DataFrame. Must contain no null values unless allow_nulls=True.

  • output_folder (Path | str) – Root directory for output (e.g. drb/2026_01_01/data). Sub-directories raw/ and to_disclose/ are created automatically.

  • output_filename (str) – Base name (without extension) for the output CSV files.

  • count_columns (IntoPolarsSelector) – Columns containing unweighted counts to be rounded per section V.B.3 of the handbook (see round_count_column()).

  • proportion_columns (IntoPolarsSelector) – Columns containing proportions or ratios to be rounded per section V.B.4 of the handbook (see round_proportion_column()).

  • other_columns (IntoPolarsSelector) – All other estimate columns (e.g. weighted means, regression coefficients) to be rounded to 4 significant figures per section V.B.1 of the handbook.

  • sample_size_column (IntoPolarsExpression) – Column containing the sample size, used to select the rounding rules described in round_and_mask_columns().

  • overwrite (bool) – If False (default), raise FileExistsError when either output file already exists.

  • allow_nulls (bool) – If False (default), assert that neither the input nor the rounded DataFrame contains null values.

Returns:

Total number of non-null estimates in the output DataFrame.

Return type:

int

Raises:
  • AssertionError – If allow_nulls=False and nulls are found in the input or output DataFrame.

  • FileExistsError – If overwrite=False and an output file already exists.
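The folder layout and the overwrite check described above can be illustrated with a small standard-library sketch. The helper name write_release_files and its arguments are hypothetical, and the choice to check both targets before writing anything is an assumption. It is not the oi_tools implementation, which works on Polars DataFrames.

```python
import csv
from pathlib import Path


def write_release_files(raw_rows, rounded_rows, output_folder, output_filename, overwrite=False):
    """Hypothetical stand-in for the file-writing step; not the oi_tools code."""
    root = Path(output_folder)
    targets = [
        (root / "raw" / f"{output_filename}.csv", raw_rows),              # unmodified data
        (root / "to_disclose" / f"{output_filename}.csv", rounded_rows),  # rounded/masked data
    ]
    # Assumption: check both targets before writing, so a failure leaves no partial output.
    if not overwrite:
        for path, _ in targets:
            if path.exists():
                raise FileExistsError(path)
    for path, rows in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return [path for path, _ in targets]
```

Calling the helper twice with the same output_filename and overwrite=False raises FileExistsError on the second call, which mirrors the documented default.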

oi_tools.drb.round_and_mask_columns(
df: DataFrame,
*,
count_columns: Collection[str] | Selector = [],
proportion_columns: Collection[str] | Selector = [],
other_columns: Collection[str] | Selector = [],
n: str | Expr | int | float = 'n',
geographic_level: Literal['national', 'state', 'substate', 'zip'] = 'national',
) → DataFrame

Mask small cells and round the columns of a DataFrame for disclosure.

Adheres to the rules defined in sections V.A and V.B of the disclosure methods handbook:

  • Small cells are censored according to Section V.A (see mask_small_cells()).

  • Counts are rounded according to section V.B.3 (see round_count_column()).

  • Ratios and proportions are rounded according to section V.B.4 (see round_proportion_column()).

  • Other estimates (including regression output and weighted means/variances) are rounded to four significant figures (see section V.B.1).

Parameters:
  • df (DataFrame) – Input DataFrame to round and mask.

  • count_columns (Collection[str] | Selector) – Columns containing unweighted counts (rounded per section V.B.3).

  • proportion_columns (Collection[str] | Selector) – Columns containing proportions or ratios (rounded per section V.B.4).

  • other_columns (Collection[str] | Selector) – All other estimate columns (rounded to 4 significant figures).

  • n (str | Expr | int | float) – The sample size used to determine the masking and rounding rules.

  • geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level of the estimates, used to determine the minimum cell size threshold (see MINIMUM_CELL_SIZES).

Returns:

A copy of df with the specified columns masked and rounded.

Return type:

pl.DataFrame
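As an illustration of the "four significant figures" rule applied to other estimates, here is a plain-Python sketch. The function name round_to_significant_figures is hypothetical, and tie handling follows Python's built-in round(), which may differ from what the library does.

```python
import math


def round_to_significant_figures(value: float, figures: int = 4) -> float:
    """Round value to the given number of significant figures (illustrative only)."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 4 for 51234.567 and -2 for 0.0123456.
    exponent = math.floor(math.log10(abs(value)))
    # A negative second argument to round() rounds to tens, hundreds, and so on.
    return round(value, figures - 1 - exponent)


print(round_to_significant_figures(51234.567))   # 51230.0
print(round_to_significant_figures(0.0123456))   # 0.01235
print(round_to_significant_figures(-987654.0))   # -987700.0
```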

oi_tools.drb.mask_small_cells(
x: Expr,
n: Expr,
geographic_level: Literal['national', 'state', 'substate', 'zip'],
) → Expr

Mask cells that fail the cell-size cutoffs.

From section V.A of the disclosure methods handbook:

For Title 26 counts and estimates from Internal Revenue Service (IRS) data and commingled data (from the Census Bureau and the IRS), we enforce the following thresholds based on IRS requirements:

  • At least 3 entities (unique firms, persons, or households) for national estimates

  • At least 10 entities for state-level estimates

  • At least 20 entities for substate-level estimates, except for zip codes

  • At least 100 entities for ZIP code-level estimates

Parameters:
  • x (Expr) – The expression to mask.

  • n (Expr) – The sample size used to determine the suppression rule.

  • geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level used to look up the minimum cell size in MINIMUM_CELL_SIZES.

Returns:

x where n meets the threshold, null otherwise.

Return type:

pl.Expr
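The suppression rule can be sketched on scalar values in plain Python. The function mask_small_cell is a hypothetical scalar analogue; the documented function operates on Polars expressions and returns null rather than None. The thresholds are copied from MINIMUM_CELL_SIZES above.

```python
from typing import Optional

# Copied from the documented MINIMUM_CELL_SIZES mapping.
MINIMUM_CELL_SIZES = {"national": 3, "state": 10, "substate": 20, "zip": 100}


def mask_small_cell(value: float, n: int, geographic_level: str) -> Optional[float]:
    """Return value when n meets the threshold for the level, otherwise None."""
    threshold = MINIMUM_CELL_SIZES[geographic_level]
    # "At least" in the handbook text means the threshold itself passes.
    return value if n >= threshold else None


print(mask_small_cell(0.42, 10, "state"))  # 0.42
print(mask_small_cell(0.42, 9, "state"))   # None
print(mask_small_cell(0.42, 99, "zip"))    # None
```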

oi_tools.drb.round_count_column(
col: Expr,
n: Expr,
) → Expr

Round a count column to the appropriate precision.

From section V.B.3 of the disclosure methods handbook:

The rounding rule for unweighted counts is as follows:

  • If N is less than 15, report N < 15

  • If N is between 15 and 99, round to the nearest 10

  • If N is between 100-999, round to the nearest 50

  • If N is between 1,000-9,999, round to the nearest 100

  • If N is between 10,000-99,999, round to the nearest 500

  • If N is between 100,000-999,999, round to the nearest 1,000

  • If N is 1,000,000 or more, round to four significant digits as described earlier.

Parameters:
  • col (Expr) – The count expression to round.

  • n (Expr) – The sample size used to determine the rounding rule.

Returns:

A Polars expression with rounded values. Returns null when n < 15.

Return type:

pl.Expr
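The bracket rule quoted above can be sketched for a single integer count in plain Python. Two assumptions are made that the handbook text does not settle: ties round up, and the count itself selects the bracket (the documented function takes a separate n for that). The names round_count and round_half_up_to_multiple are hypothetical.

```python
from typing import Optional


def round_half_up_to_multiple(n: int, base: int) -> int:
    """Round a non-negative integer to the nearest multiple of base (ties go up)."""
    return (n + base // 2) // base * base


def round_count(n: int) -> Optional[int]:
    """Apply the section V.B.3 brackets to one count; None stands for 'N < 15'."""
    if n < 15:
        return None
    if n < 100:
        return round_half_up_to_multiple(n, 10)
    if n < 1_000:
        return round_half_up_to_multiple(n, 50)
    if n < 10_000:
        return round_half_up_to_multiple(n, 100)
    if n < 100_000:
        return round_half_up_to_multiple(n, 500)
    if n < 1_000_000:
        return round_half_up_to_multiple(n, 1_000)
    # Four significant digits: keep the leading four digits of the count.
    return round_half_up_to_multiple(n, 10 ** (len(str(n)) - 4))


print(round_count(123))        # 100
print(round_count(1_234))      # 1200
print(round_count(1_234_567))  # 1235000
```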

oi_tools.drb.round_proportion_column(
col: Expr,
n: Expr,
) → Expr

Round a proportion or ratio column to the appropriate number of significant figures.

From section V.B.4 of the disclosure methods handbook:

For thresholds based on an unweighted denominator (D) and an unrounded unweighted proportion (P):

  • If 15 <= D < 100 then P should be rounded to 1 significant digit

  • Else if D < 1,000 then P should be rounded to no more than 2 significant digits

  • Else if D < 10,000 then P should be rounded to no more than 3 significant digits

  • Else if D >= 10,000 then P should be rounded to no more than 4 significant digits.

Parameters:
  • col (Expr) – The proportion/ratio expression to round.

  • n (Expr) – Denominator used to determine the number of significant digits.

Returns:

A Polars expression with rounded values. Returns null when n < 15.

Return type:

pl.Expr
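The denominator thresholds above can be sketched in plain Python as follows. The names significant_digits_for and round_proportion are hypothetical. The sketch uses the maximum number of digits each bracket allows (the handbook says "no more than"), and tie handling follows Python's built-in round(). The documented function works on Polars expressions and returns null where this sketch returns None.

```python
import math
from typing import Optional


def significant_digits_for(d: int) -> Optional[int]:
    """Map an unweighted denominator D to the allowed number of significant digits."""
    if d < 15:
        return None  # too small to report
    if d < 100:
        return 1
    if d < 1_000:
        return 2
    if d < 10_000:
        return 3
    return 4


def round_proportion(p: float, d: int) -> Optional[float]:
    """Round proportion p using the digit count implied by denominator d."""
    digits = significant_digits_for(d)
    if digits is None:
        return None
    if p == 0:
        return 0.0
    # Position of the leading digit, e.g. -1 for 0.123456.
    exponent = math.floor(math.log10(abs(p)))
    return round(p, digits - 1 - exponent)


for d in (10, 50, 500, 5_000, 50_000):
    print(d, round_proportion(0.123456, d))
```

With p = 0.123456, the loop prints None, 0.1, 0.12, 0.123, and 0.1235 for the five denominators.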