Tutorial 1: ACS PUMS Microdata Analysis
This tutorial demonstrates analyzing person-level microdata from the American Community Survey (ACS) Public Use Microdata Sample (PUMS).
Goal: Get age and sex data for California and Texas, then create weighted frequency tables for adults, stratified by state.
Setup
import os
from cendat import CenDatHelper
from dotenv import load_dotenv
# Load your API key from environment
load_dotenv()
cdh = CenDatHelper(years=[2022], key=os.getenv("CENSUS_API_KEY"))
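os.getenv returns None when the variable is unset, so it can help to check for the key before constructing the helper. A minimal sketch of that check follows; `require_env` and the variable name `EXAMPLE_API_KEY` are hypothetical and not part of cendat.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Demo with a placeholder variable so the sketch runs on its own
os.environ["EXAMPLE_API_KEY"] = "demo-key"
api_key = require_env("EXAMPLE_API_KEY")
print(api_key)
```

In the setup cell above, the same check could wrap the `CENSUS_API_KEY` lookup before the value is passed as `key=`.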
Step 1: Find and Select the PUMS Product
# Search for the ACS 1-year PUMS product
# The \b word boundary stops the pattern from matching longer dataset
# names that merely start with "pums" (such as acs/acs1/pumspr)
cdh.list_products(patterns=r"acs/acs1/pums\b")
cdh.set_products()
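To see what the `\b` in the pattern does, the sketch below runs the same regular expression against a few illustrative strings with Python's `re` module. It assumes the pattern is applied as a regex search; the third candidate string is made up purely to show how a trailing slash behaves.

```python
import re

# Same pattern as in the search above
pattern = re.compile(r"acs/acs1/pums\b")

# Illustrative strings only; the last one is a made-up subpath
candidates = [
    "acs/acs1/pums",
    "acs/acs1/pumspr",
    "acs/acs1/pums/variables",
]

# Record whether the pattern is found anywhere in each string
matches = {name: bool(pattern.search(name)) for name in candidates}
for name, hit in matches.items():
    print(name, hit)
```

The word boundary rejects `pumspr`, because `s` followed by `p` is not a boundary. A `/` after `pums` does count as a boundary, so `\b` alone filters out longer names rather than subpaths.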
Step 2: Select Geography and Variables
# For PUMS, geography is simpler: we just need "state"
cdh.set_geos(values="state", by="desc")
# Select the variables we need:
# - SEX: Person's sex
# - AGEP: Person's age
# - ST: State code
# - PWGTP: Person weight (crucial for microdata!)
cdh.set_variables(names=["SEX", "AGEP", "ST", "PWGTP"])
Step 3: Get Data
# Fetch data for California (06) and Texas (48)
response = cdh.get_data(
    within={"state": ["06", "48"]}
)
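The state codes above are two-character FIPS strings, and California's leading zero disappears if the code is ever handled as an integer. A small sketch of normalizing codes back to that form follows; `to_state_fips` is a hypothetical helper, not a cendat function.

```python
def to_state_fips(code) -> str:
    """Format a state FIPS code as a two-character, zero-padded string."""
    return str(int(code)).zfill(2)


# California stored as an integer, Texas already a string
state_codes = [to_state_fips(6), to_state_fips("48")]
print(state_codes)
```

The resulting list has the same shape as the one passed for `"state"` in the `within` argument above.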
Step 4: Analyze with Tabulate
The tabulate() method creates Stata-style frequency tables with proper weighting:
# Age distribution by sex, stratified by state
# Only adults (AGEP > 17), using person weights
response.tabulate(
    "SEX", "AGEP",
    strat_by="ST",
    weight_var="PWGTP",
    where="AGEP > 17"
)
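Conceptually, a weighted frequency table sums the person weights in each cell instead of counting rows, so each record stands for PWGTP people. The sketch below illustrates that idea in plain Python on made-up records; it is not how tabulate() is implemented, and the numbers are not real ACS values.

```python
from collections import defaultdict

# Made-up records for illustration: (ST, SEX, AGEP, PWGTP)
records = [
    ("06", 1, 34, 120),
    ("06", 2, 34, 95),
    ("06", 1, 12, 80),   # child, excluded by the adult filter
    ("48", 2, 51, 110),
    ("48", 2, 51, 40),
    ("48", 1, 70, 60),
]

# Sum person weights per (state, sex, age) cell, adults only
weighted = defaultdict(int)
for st, sex, agep, pwgtp in records:
    if agep > 17:
        weighted[(st, sex, agep)] += pwgtp

for cell in sorted(weighted):
    print(cell, weighted[cell])
```

An unweighted count would report two rows for the Texas cell with SEX 2 and AGEP 51, while the weighted total is the sum of both rows' weights.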
Step 5: Convert to DataFrame
# For further analysis, convert to a DataFrame
df = response.to_polars(concat=True, destring=True)
print(df.head())