Detailed code run

Here is a deep dive into xagg's functionality.

[1]:
import xagg as xa
import xarray as xr
import numpy as np
import geopandas as gpd

Intro

We’ll be aggregating a gridded dataset onto a set of shapefiles, using an extra set of weights. Specifically, we’ll use:

  • gridded data: month-of-year average temperature projections for the end of the century from a climate model (CCSM4)

  • shapefiles: US counties

  • additional weights: global gridded population density (GPW, 30-min resolution)

This is a setup you might use, for example, when projecting the impact of temperature on some human variable (temperature vs. mortality, say) for which you have data at the US county level. Since your mortality data is likely at the county level, you need to aggregate the gridded climate model output to counties - i.e., to answer: what is the average temperature over each county? This code calculates which pixels overlap each county - and by how much - allowing an area-averaged value of monthly temperature for each county.

However, you also care about where people live - so you’d like to additionally weight your temperature estimate by a population density dataset. This code easily allows such additional weights. The resulting output is a temperature value for each month in each county, weighted both by the overlap of individual pixels with the county and by the population density in those pixels. (NB: GPWv4 just averages a political unit’s population over a pixel grid, so it might not be the best product for this particular use case, but it is used here as a sample.)

Let’s get started.

[2]:
# Load some climate data as an xarray dataset
ds = xr.open_dataset('../../data/climate_data/tas_Amon_CCSM4_rcp85_monthavg_20700101-20991231.nc')
[5]:
# Load US counties shapefile as a geopandas GeoDataFrame
gdf = gpd.read_file('../../data/geo_data/UScounties.shp')
[7]:
# Load global gridded population data from GPW
ds_pop = xr.open_dataset('../../data/pop_data/pop2000.nc')

NB: the GPW file above has been pre-processed by subsampling to ``raster=0`` (the 2000 population) and renaming the primary variable to ``pop`` for ease of use.

Calculating area weights between a raster grid and polygons

First, xagg has to figure out how much each pixel overlaps each polygon. This process requires a few steps:

  1. Get everything in the right format.

    • Gridded data comes in all shapes and sizes. xagg is ready to deal with most common grid naming conventions - so whether your lat and lon variables are called ‘Latitude’ and ‘Longitude’, ‘y’ and ‘x’, or one of many options in between, as long as they’re in xarray Datasets or DataArrays, they’ll work.

    • Behind the scenes, longitude values are also forced to -180:180 (from 0:360, if applicable), just to make sure everything is operating in the same coordinate system.

  2. Build polygons for each pixel

    • To figure out how much each pixel overlaps each polygon, pixel polygons have to be constructed. If your gridded variable already has “lat_bnds” and “lon_bnds” (giving the vertices of each pixel) explicitly included in the xr.Dataset, those are used. If none are found, “lat_bnds” and “lon_bnds” are constructed by assuming the vertices lie halfway between the coordinates in degrees (see the sketch after this list).

    • If an additional weighting is used, the weighting dataset and your gridded data have to be homogenized at this stage. By default, the weighting dataset is regridded to your gridded data using xesmf. Future versions will also allow regridding the gridded data to the weighting dataset here (it’s already accounted for in some of the functions, but not all).

    • To avoid creating gigantic geodataframes full of pixel polygons, the dataset is by default first subset to a bounding box around the shapefiles. The aggregating code below takes this subsetting into account, and the ds passed into xa.aggregate is matched to the original source grid on which the overlaps were calculated.

  3. Calculate area overlaps between each pixel and each polygon

    • Now the overlap between each pixel and each polygon is calculated. Using geopandas’ excellent polygon boolean operations and area calculations, the intersection between the raster grid and each polygon is computed. For each polygon, the coordinates of every pixel that intersects it are saved, along with the relative area of each overlap (for example, if you had a county the size and shape of one pixel, but located half in one pixel and half in a neighboring pixel, those two pixels would be saved, each with a relative area of 0.5). Areas are calculated using the WGS84 geoid.
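To make the bounds construction in step 2 concrete, here is a minimal illustrative sketch (not xagg’s internal code) of building “lat_bnds” by placing pixel vertices halfway between successive grid coordinates; the coordinate name lat is an assumption about this dataset:

[ ]:
# Illustrative sketch only - construct latitude bounds by assuming pixel
# vertices lie halfway between successive grid coordinates
# (coordinate name 'lat' is assumed; xagg's internals may differ)
lat = ds['lat'].values
lat_mid = (lat[:-1] + lat[1:]) / 2
lower = np.concatenate([[lat[0] - (lat_mid[0] - lat[0])], lat_mid])
upper = np.concatenate([lat_mid, [lat[-1] + (lat[-1] - lat_mid[-1])]])
lat_bnds = np.stack([lower, upper], axis=1)  # shape (nlat, 2): one [lower, upper] pair per pixel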

[8]:
# Calculate overlaps
weightmap = xa.pixel_overlaps(ds,gdf,weights=ds_pop.pop)
creating polygons for each pixel...
regridding weights to data grid...
/Users/kevinschwarzwald/opt/anaconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataarray.py:746: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return key in self.data
/Users/kevinschwarzwald/opt/anaconda3/envs/test/lib/python3.9/site-packages/xesmf/frontend.py:466: FutureWarning: ``output_sizes`` should be given in the ``dask_gufunc_kwargs`` parameter. It will be removed as direct parameter in a future version.
  dr_out = xr.apply_ufunc(
calculating overlaps between pixels and output polygons...
success!

Aggregating gridded data to the polygons using the area weights (and other weights) calculated above

Now that we know which pixels overlap which polygons and by how much (and what the value of the population weight is for each pixel), it’s time to aggregate the data to the polygon level. xagg assumes that all variables in the original ds that have lat and lon coordinates should be aggregated. These variables may have extra dimensions: 3-D variables (e.g., lon x lat x time) are supported; 4-D and higher should work but haven’t been tested yet - the biggest issue may be in exporting.

Since we included an additional weighting grid, this dataset is included in weightmap from above and is seamlessly integrated into the weighting scheme.
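Conceptually, each polygon’s value is the average of its overlapping pixels, weighted by overlap area times the population weight. A minimal numeric sketch of that formula (all values here are made up for illustration):

[ ]:
# Conceptual sketch of the per-polygon weighted average (illustrative values):
# tas_poly = sum_i(area_i * pop_i * tas_i) / sum_i(area_i * pop_i)
area = np.array([0.5, 0.5])    # relative overlap areas of two pixels with one polygon
pop = np.array([1000., 250.])  # population-density weights of those pixels
tas = np.array([290., 294.])   # pixel temperature values
tas_poly = (area * pop * tas).sum() / (area * pop).sum()  # -> 290.8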

[9]:
# Aggregate
aggregated = xa.aggregate(ds,weightmap)
adjusting grid... (this may happen because only a subset of pixels were used for aggregation for efficiency - i.e. [subset_bbox=True] in xa.pixel_overlaps())
grid adjustment successful
aggregating tas...
all variables aggregated to polygons!

Converting aggregated data

Now that the data is aggregated, we want it in a usable format.

Supported formats for converting include:

  • xarray Dataset (using .to_dataset())

    • Grid dimensions from the original dataset are replaced with a single dimension for polygons - by default called “poly_idx” (change this with the loc_dim=... option). Aggregated variables keep their non-grid dimensions unchanged, with their grid dimensions replaced as above.

    • All original fields from the geodataframe are kept as poly_idx x 1 variables.

  • pandas DataFrame (using .to_dataframe())

    • All original fields from the geodataframe are kept; the aggregated variables are added as separate columns. If the aggregated variables have a third dimension, they are reshaped wide, with procedurally generated column names (just [var]0, [var]1, … for now).

(The “raw” form of the geodataframe used to create these can also be accessed directly through aggregated.agg.)
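For example, a quick preview of that raw form (a sketch; this assumes aggregated.agg supports the usual pandas .head() method):

[ ]:
# Peek at the raw geodataframe behind the converters
# (.head() is assumed to work as for any pandas/geopandas dataframe)
aggregated.agg.head()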

[10]:
# Example as a dataset
ds_out = aggregated.to_dataset()
ds_out
[10]:
<xarray.Dataset>
Dimensions:     (month: 12, pix_idx: 3141)
Coordinates:
  * pix_idx     (pix_idx) int64 0 1 2 3 4 5 6 ... 3135 3136 3137 3138 3139 3140
  * month       (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    NAME        (pix_idx) object 'Lake of the Woods' 'Ferry' ... 'Broomfield'
    STATE_NAME  (pix_idx) object 'Minnesota' 'Washington' ... 'Colorado'
    STATE_FIPS  (pix_idx) object '27' '53' '53' '53' ... '02' '02' '02' '08'
    CNTY_FIPS   (pix_idx) object '077' '019' '065' '047' ... '240' '068' '014'
    FIPS        (pix_idx) object '27077' '53019' '53065' ... '02068' '08014'
    tas         (pix_idx, month) float64 264.0 268.9 274.0 ... 283.5 276.4 270.4
[11]:
# Example as a dataframe
df_out = aggregated.to_dataframe()
df_out
[11]:
NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS tas0 tas1 tas2 tas3 tas4 tas5 tas6 tas7 tas8 tas9 tas10 tas11
0 Lake of the Woods Minnesota 27 077 27077 263.978006 268.887868 274.012152 283.158717 290.630598 297.850779 302.038199 300.327744 293.465816 283.815233 275.141634 266.054430
1 Ferry Washington 53 019 53019 271.780440 275.618485 276.934183 279.826777 286.621100 293.757010 299.056368 297.131708 289.844308 281.633456 276.714475 272.242004
2 Stevens Washington 53 065 53065 273.217250 276.940380 278.414225 281.319652 287.817911 294.926457 300.903109 299.304529 292.245363 283.273956 278.063277 273.666181
3 Okanogan Washington 53 047 53047 271.831071 275.586124 276.689357 279.324166 285.771338 292.635899 297.756402 295.956748 289.177685 281.440422 276.654779 272.275232
4 Pend Oreille Washington 53 051 53051 272.092353 275.888818 277.346070 280.446389 287.268406 294.357705 299.851527 297.965815 290.622763 282.058301 276.996473 272.484589
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3136 Skagway-Hoonah-Angoon Alaska 02 232 02232 273.605147 275.477240 276.792992 279.194054 283.764519 288.635583 290.038044 289.689574 286.058608 280.958010 276.904109 274.048888
3137 Yukon-Koyukuk Alaska 02 290 02290 264.534558 264.088869 267.621423 273.426228 281.649435 289.319370 288.936030 286.209616 280.807791 273.683875 266.722855 265.538685
3138 Southeast Fairbanks Alaska 02 240 02240 263.919168 263.899079 266.771514 272.144709 279.739283 287.625174 287.933732 285.436821 279.768225 272.377963 265.702026 264.414302
3139 Denali Alaska 02 068 02068 265.049599 264.794849 268.193156 273.612534 281.097223 288.917064 288.898311 286.233612 280.504391 273.142605 266.534916 265.575153
3140 Broomfield Colorado 08 014 08014 270.803864 273.430206 275.955505 280.790070 287.303619 292.830048 297.615662 297.646820 292.368988 283.544708 276.383606 270.444855

3141 rows × 17 columns

Exporting aggregated data

For reproducibility and code simplicity, you will likely want to save your aggregated data. In addition, many researchers use multiple languages or software packages as part of their workflow - for example, STATA or R for regression analysis, or QGIS for spatial analysis - and need to be able to transfer their work to these other environments.

xagg has built-in export functions that allow the export of aggregated data to:

  • netCDF

  • csv (for use in STATA, R)

  • shapefile (for use in GIS applications)

Export to netCDF

The netCDF export functionality saves all aggregated variables by replacing the grid dimensions (lat, lon) with a single location dimension (called poly_idx, but this can be changed with the loc_dim= argument).

Other dimensions (e.g. time) are kept as they were originally in the grid variable.

Fields in the inputted polygons (e.g., FIPS codes for the US Counties shapefile used here) are saved as additional variables. Attributes from the original xarray structure are kept.

[ ]:
# Export to netcdf
aggregated.to_netcdf('file_out.nc')
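If you’d rather name the location dimension yourself, the loc_dim= argument mentioned above can be passed along (‘county’ below is an arbitrary example name):

[ ]:
# Same export, but renaming the location dimension via loc_dim
# ('county' is an arbitrary example name)
aggregated.to_netcdf('file_out.nc', loc_dim='county')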

Export to .csv

The .csv output functionality saves files in a polygon (rows) vs. variables (columns) format. Each aggregated variable and each field in the original inputted polygons are saved as columns. Named attributes in the inputted netcdf file are not included.

Currently .csvs are only saved “wide” - i.e., a lat x lon x time variable tas, aggregated to location x time, would be reshaped wide so that each timestep is saved in its own column, named tas0, tas1, and so forth.

[ ]:
# Export to csv
aggregated.to_csv('file_out.csv')
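If you later need the data long rather than wide (e.g., one row per county-month), you can reshape it after reading it back in. A sketch using pandas, relying on the [var][idx] column names described above (pandas is an assumed extra import):

[ ]:
import pandas as pd

# Read the exported csv back in and reshape the wide tas0...tas11 columns long
# (column names follow the [var][idx] pattern described above)
df = pd.read_csv('file_out.csv')
df_long = df.melt(id_vars=['NAME', 'STATE_NAME', 'STATE_FIPS', 'CNTY_FIPS', 'FIPS'],
                  var_name='month', value_name='tas')
df_long['month'] = df_long['month'].str.replace('tas', '', regex=False).astype(int)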

Export to shapefile

The shapefile export functionality keeps the geometry of the originally input polygons, and adds the aggregated variables as fields.

Similar to .csv export above, if aggregated variables have a dimension beyond their location dimensions (e.g., time), each step in that dimension is saved in a separate field, named after the variable and the integer of the index along that dimension (e.g., tas0, tas1, etc. for a variable tas).

Named attributes in the inputted netcdf file are not included.

[ ]:
# Export to shapefile
aggregated.to_shp('file_out.shp')