Metadata-Version: 2.4
Name: geocif
Version: 0.4.776
Summary: Models to visualize and forecast crop conditions and yields
Author-email: Ritvik Sahajpal <ritvik@umd.edu>
License: MIT
Project-URL: Homepage, https://ritviksahajpal.github.io/yield_forecasting/
Keywords: geocif
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boruta>=0.4.3
Requires-Dist: catboost>=1.2.8
Requires-Dist: fiona
Requires-Dist: gdal==3.10.2; sys_platform == "win32"
Requires-Dist: gdal==3.11.0; sys_platform != "win32"
Requires-Dist: pyeogpr>=2.4.7
Requires-Dist: pyproj
Requires-Dist: rasterio
Requires-Dist: rtree
Requires-Dist: shap>=0.48.0
Requires-Dist: numba>=0.59
Requires-Dist: shapiq>=1.0
Requires-Dist: shapely
Requires-Dist: optuna
Requires-Dist: xarray>=2026.2.0
Requires-Dist: pooch>=1.8.0
Requires-Dist: arrow>=1.4.0
Requires-Dist: icclim>=7.0.4
Requires-Dist: geoprepare>=0.6.275
Requires-Dist: logzero>=1.7.0
Requires-Dist: geopandas>=1.1.2
Requires-Dist: tabpfn>=6.4.1
Requires-Dist: tabicl>=2.0.2
Requires-Dist: statsmodels>=0.14.6
Requires-Dist: palettable>=3.3.3
Requires-Dist: seaborn>=0.13.2
Requires-Dist: scikit-misc>=0.5.2
Requires-Dist: setuptools<81
Requires-Dist: choix>=0.3.4
Requires-Dist: scienceplots>=2.0.0
Requires-Dist: cartopy>=0.22
Requires-Dist: Rbeast>=0.1.20
Requires-Dist: scikit-learn>=1.4
Requires-Dist: bottleneck>=1.3
Requires-Dist: mapclassify>=2.5
Requires-Dist: pymannkendall>=1.4
Requires-Dist: pangres>=4.0
Requires-Dist: kneed>=0.8
Requires-Dist: lifelines>=0.27
Requires-Dist: esda>=2.5
Requires-Dist: libpysal>=4.10
Requires-Dist: crepes>=0.6
Requires-Dist: cubist>=1.0
Requires-Dist: mapie>=0.8
Requires-Dist: merf>=1.0
Requires-Dist: ngboost>=0.5
Requires-Dist: tabpfn-extensions>=0.4
Requires-Dist: treeple>=0.10
Requires-Dist: ydf>=0.6
Requires-Dist: BorutaShap>=1.0
Requires-Dist: arfs>=2.0
Requires-Dist: feature-engine>=1.6
Requires-Dist: mrmr-selection>=0.2
Requires-Dist: sklearn-genetic-opt>=0.10
Requires-Dist: stabl>=0.0.1
Requires-Dist: cachetools>=5.0
Requires-Dist: geopy>=2.0
Requires-Dist: scikit-image>=0.21
Requires-Dist: matplotlib<3.11
Provides-Extra: dashboard
Requires-Dist: panel>=1.4.0; extra == "dashboard"
Requires-Dist: hvplot>=0.10.0; extra == "dashboard"
Requires-Dist: holoviews>=1.18; extra == "dashboard"
Provides-Extra: aquacrop
Requires-Dist: aquacrop>=3.0; extra == "aquacrop"
Provides-Extra: aquacrop-calibration
Requires-Dist: aquacrop>=3.0; extra == "aquacrop-calibration"
Requires-Dist: pygmo>=2.19; extra == "aquacrop-calibration"
Requires-Dist: SALib>=1.5; extra == "aquacrop-calibration"
Provides-Extra: spatial
Requires-Dist: pysal>=2.6; extra == "spatial"
Provides-Extra: gee
Requires-Dist: earthengine-api>=1.0; extra == "gee"
Requires-Dist: geemap>=0.30; extra == "gee"
Provides-Extra: narrative
Requires-Dist: pymupdf>=1.23; extra == "narrative"
Requires-Dist: pdfplumber>=0.10; extra == "narrative"
Requires-Dist: reportlab>=4.0; extra == "narrative"
Requires-Dist: anthropic>=0.30; extra == "narrative"
Provides-Extra: shap-fast
Requires-Dist: fasttreeshap>=0.1; extra == "shap-fast"
Provides-Extra: powershap
Requires-Dist: powershap>=0.0.10; extra == "powershap"
Provides-Extra: geospann
Requires-Dist: geospaNN>=0.1; extra == "geospann"
Provides-Extra: desreg
Requires-Dist: desReg>=0.1; extra == "desreg"
Dynamic: license-file

# geocif

[![image](https://img.shields.io/pypi/v/geocif.svg)](https://pypi.python.org/pypi/geocif)

**Models to visualize and forecast crop conditions and yields**

Generate Climatic Impact-Drivers (CIDs) from Earth Observation (EO) data, build ML yield forecasting models, and produce agmet condition monitoring plots.

[Climatic Impact-Drivers for Crop Yield Assessment at NASA Harvest](https://www.loom.com/share/5c2dc62356c6406193cd9d9725c2a6a9)

-   Free software: MIT license
-   Documentation: https://ritviksahajpal.github.io/yield_forecasting/


## Setup

### Requirements

- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

### Install

```bash
cd geocif                   # project root (where pyproject.toml lives)
uv sync                     # creates .venv and installs all dependencies
```

On **Windows**, uv automatically pulls pre-built geospatial wheels (GDAL, rasterio, fiona, shapely, pyproj, rtree) from the URLs in `[tool.uv.sources]`. On **Linux/macOS**, those entries are skipped (platform marker) and packages are installed from PyPI.

To activate the environment:

```bash
# Windows
.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate
```

### Fresh reinstall

```bash
rm -rf .venv && uv sync
```

## Config files

| File | Purpose | Used by |
|------|---------|---------|
| [`geobase.txt`](#geobasetxt) | Paths, shapefile column mappings | both |
| [`countries.txt`](#countriestxt) | Per-country config (boundary files, admin levels, seasons, crops) | both |
| [`crops.txt`](#cropstxt) | Crop masks, calendar categories (EWCM, AMIS) | both |
| [`geoextract.txt`](#geoextracttxt) | Extraction-only settings (method, threshold, parallelism) | geoprepare |
| [`geocif.txt`](#geociftxt) | Indices/ML/agmet settings, country overrides, runtime selections | geocif |

## Usage

**Order matters:** Config files are loaded left-to-right. When the same key appears in multiple files, the last file wins. The tool-specific file (`geoextract.txt` or `geocif.txt`) must be last so its `[DEFAULT]` values (countries, method, etc.) override the shared defaults in `countries.txt`.

```python
config_dir = "/path/to/config"  # full path to your config directory

cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]
cfg_geocif = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geocif.txt"]
```
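
The last-file-wins rule can be illustrated with a minimal sketch, assuming the files are parsed with Python's standard `configparser` (the `${...}` syntax used in these files matches its `ExtendedInterpolation`, but geocif's actual loader is not shown here). The inline strings are stand-ins for `countries.txt` and `geocif.txt`.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Stand-ins for two config files, in the order they would be passed.
countries_txt = """
[DEFAULT]
crops = ['maize']
admin_level = admin_1
"""
geocif_txt = """
[DEFAULT]
crops = ['rice']
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
for text in (countries_txt, geocif_txt):  # left-to-right, like the file list
    parser.read_string(text)

crops = parser["DEFAULT"]["crops"]              # set in both: last file wins
admin_level = parser["DEFAULT"]["admin_level"]  # set only in the first file
print(crops, admin_level)
```

Keys that appear only in an earlier file survive; only repeated keys are overridden, which is why the tool-specific file belongs at the end of the list.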

### geoprepare (download, extract, merge)

```python
from geoprepare import geodownload
geodownload.run([f"{config_dir}/geobase.txt"])

from geoprepare import geoextract
geoextract.run(cfg_geoprepare)

from geoprepare import geomerge
geomerge.run(cfg_geoprepare)
```

### geocif (indices, ML, agmet, analysis, experiments)

```python
from geocif import indices_runner
indices_runner.run(cfg_geocif)

from geocif import geocif_runner
geocif_runner.run(cfg_geocif)

from geocif.agmet import geoagmet
geoagmet.run(cfg_geocif)

from geocif import analysis
analysis.run(cfg_geocif)

from geocif import experiments
experiments.run(cfg_geocif, n_trials=30)

from geocif import yield_outlook
yield_outlook.run(cfg_geocif)  # uses config defaults (10 years, mean)
# yield_outlook.run(cfg_geocif, current_year=2026, n_years=10, aggregation="median")
```

### Cropmask optimizers

geocif includes two optimizers that consume `geoprepare` extraction outputs and tune the cropland mask used downstream. Run each one *after* the corresponding geoprepare extractor has written its outputs.

```python
# Uniform threshold T over the region (single absolute or rank-based knob).
# Reads geoprepare.extract_sweep output:
#   ${PATHS:dir_output}/threshold_sweep/{country}/{crop}/{country}_{crop}_s{season}_sweep.csv
from geocif import threshold_optimizer
threshold_optimizer.run(cfg_geocif)

# Per-cell binary mask — independent in/out decision per cropland cell.
# Reads geoprepare.extract_cells output:
#   ${PATHS:dir_output}/cell_optimizer/{country}/{crop}/{country}_{crop}_s{season}_cells.parquet
# Writes a production-mask parquet at the same location that geoextract picks up.
from geocif import cell_optimizer
cell_optimizer.run(cfg_geocif)
```

Configure under `[THRESHOLD_OPTIMIZER]` and `[CELL_OPTIMIZER]` in `geocif.txt`. Outputs land under `${PATHS:dir_output}/ml/analysis/{date}/{threshold_sweep_summary|cell_optimizer}/`.

#### Using the optimized cell mask in production extraction

`geoprepare 0.6.273+` can apply the per-cell mask produced by `cell_optimizer` during EO extraction. Opt in per country (or in `[DEFAULT]`) in `geoextract.txt`:

```ini
[DEFAULT]
use_optimized_mask = True
```

When the flag is on, `geoprepare.extract_EO` reads
`${PATHS:dir_output}/cell_optimizer/{country}/{crop}/{country}_{crop}_s{season}_optimized_mask.parquet`
for every configured (country, crop, season) and combines it with the existing floor/ceiling AFI mask using a logical AND. Cells the optimizer marked `included=False` are dropped from the per-region aggregate even if they pass the floor/ceiling rule. Multi-season countries get the **union** across seasons: a cell is kept if any season's optimizer selected it.
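
The combination rule can be sketched in plain Python. This is an illustration only, not geoprepare's implementation: the function name, the list-based masks, and the inclusive floor/ceiling comparison are assumptions.

```python
def final_mask(afi, floor, ceil, season_masks):
    """Combine the floor/ceiling rule with per-season optimizer masks.

    afi: per-cell AFI values (hypothetical sample data).
    season_masks: one list of `included` flags per season.
    """
    # Union across seasons: keep a cell if any season's optimizer selected it.
    optimized = [any(flags) for flags in zip(*season_masks)]
    # Floor/ceiling rule on the AFI value (inclusive bounds assumed).
    in_range = [floor <= value <= ceil for value in afi]
    # AND the two: a cell must pass both to enter the regional aggregate.
    return [a and b for a, b in zip(in_range, optimized)]


afi = [10, 35, 60, 95, 50]
season_1 = [True, True, False, True, False]
season_2 = [True, False, True, True, False]
mask = final_mask(afi, floor=20, ceil=90, season_masks=[season_1, season_2])
print(mask)  # [False, True, True, False, False]
```

The last cell passes the floor/ceiling rule but is dropped because no season's optimizer selected it.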

**Pipeline order with the optimized mask:**
```python
from geoprepare import extract_cells, geoextract
from geocif import cell_optimizer

extract_cells.run(cfg_geoprepare)   # writes per-cell parquets
cell_optimizer.run(cfg_geocif)      # writes optimized_mask.parquet
geoextract.run(cfg_geoprepare)      # reads optimized_mask.parquet
```

`extract_EO` aborts at startup and lists the missing parquet files if `use_optimized_mask = True` for any country whose mask has not been produced yet. It deliberately does not fall back to the floor/ceiling rule: a silent fallback when the operator asked for the overlay would be confusing.
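
A fail-fast check of this kind might look like the sketch below. The function names and the error wording are hypothetical; only the path pattern is taken from the text above.

```python
from pathlib import Path


def expected_mask_path(dir_output, country, crop, season):
    """Path of the optimized mask, following the pattern quoted above."""
    name = f"{country}_{crop}_s{season}_optimized_mask.parquet"
    return Path(dir_output) / "cell_optimizer" / country / crop / name


def check_masks(dir_output, combos):
    """Raise before any extraction starts if a required mask is missing."""
    missing = [
        path
        for path in (expected_mask_path(dir_output, *combo) for combo in combos)
        if not path.exists()
    ]
    if missing:
        listing = "\n".join(f"  {path}" for path in missing)
        raise FileNotFoundError(f"use_optimized_mask = True but missing:\n{listing}")


try:
    check_masks("/nonexistent/outputs", [("kenya", "maize", 1), ("kenya", "maize", 2)])
except FileNotFoundError as err:
    print(err)
```

Collecting every missing file before raising lets the operator fix all gaps in one pass instead of discovering them one run at a time.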

The mask is currently wired into `process_aef`, `process_fldas`, `process_chirps_mfc`, and `process_soilgrids` (the static and monthly-forecast EO paths). The daily-EO path through `geom_extract` (NDVI, daily CHIRPS, ESI, etc.) is not yet wired; that support is expected in a future geoprepare change.

#### Annual (leave-one-out) masks

Enable `annual_mask = True` under `[CELL_OPTIMIZER]` in `geocif.txt` to produce **one mask per historical year** instead of a single pooled mask. For each year Y, the GA trains on every other year, so the cell selection never sees year Y's yield, and the resulting mask is written to a `_y{year}_optimized_mask.parquet` file alongside the pooled one. `geoprepare.extract_EO` prefers the year-specific file when extracting year Y (FLDAS and CHIRPS-MFC, which are per-year datasets) and falls back to the pooled file for forecast and current years. AEF and SoilGrids (static) always use the pooled mask.
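
The leave-one-out split and the file preference can be sketched as follows. Both helpers are hypothetical, and the full annual filename (`{stem}_y{year}_optimized_mask.parquet`, with `stem` standing for `{country}_{crop}_s{season}`) is an assumption: the text above gives only the suffix.

```python
from pathlib import Path


def training_years(all_years, held_out):
    """Years the optimizer may train on when building the mask for held_out."""
    return [year for year in all_years if year != held_out]


def pick_mask(mask_dir, stem, year):
    """Prefer the year-specific mask; fall back to the pooled one."""
    annual = Path(mask_dir) / f"{stem}_y{year}_optimized_mask.parquet"
    pooled = Path(mask_dir) / f"{stem}_optimized_mask.parquet"
    return annual if annual.exists() else pooled


print(training_years([2019, 2020, 2021, 2022], held_out=2021))  # [2019, 2020, 2022]
```

A forecast year has no year-specific file, so `pick_mask` returns the pooled path for it without any special casing.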

This closes the overfitting failure mode where the pooled mask was selected with year Y's yield as part of the training data — visible in pre-0.4.747 runs as regions whose Pearson r between yield and NDVI flipped sign after selection (the GA found anti-correlated cells because R² is sign-blind).

**Cost.** Roughly `(n_years + 1) ×` the pooled-only default per region. For a country with 25 yield years, that is about 26 times as many GA runs, and runtime scales accordingly. Opt in only when the data span justifies it.

Off by default. Existing configs without `annual_mask` continue to write the single pooled parquet.

### ML models

geocif supports the following model types (configured via `models` in `[DEFAULT]`):

| Model | Key | Type |
|-------|-----|------|
| CatBoost | `catboost` | Gradient boosting |
| XGBoost | `xgboost` | Gradient boosting |
| TabPFN | `tabpfn` | Prior-fitted network |
| TabICL | `tabicl` | In-context learning |
| NGBoost | `ngboost` | Natural gradient boosting |
| YDF | `ydf` | Yggdrasil decision forests |
| Oblique RF | `oblique` | Oblique random forest |
| Cubist | `cubist` | Rule-based regression |
| MERF | `merf` | Mixed effects random forest |
| Linear | `linear` | LassoCV / LogisticRegressionCV |
| GAM | `gam` | Generalized additive model |
| GeoSpaNN | `geospaNN` | Geospatial neural network |
| Median | `median` | Median baseline |
| Analog | `analog` | Analogous year baseline |

### Feature selection methods

Configured via `feature_selection` in `[ML]`:

`none`, `SelectKBest`, `BorutaPy`, `Leshy`, `gOMP`, `RFECV`, `RFE`, `lasso`, `mrmr`, `SHAP`, `stabl`, `PowerShap`, `BorutaShap`, `Genetic`, `feature_engine`, `multi`

### Cluster analysis

Optional analysis that clusters regions by their CID profiles and identifies which CIDs discriminate each cluster. Works with or without yield data — falls back to a proxy CID (e.g., AUC_NDVI) when yield is unavailable. Enabled via `[ML]`:

```ini
run_cluster_analysis = True
cluster_analysis_proxy = AUC_NDVI   ; proxy CID when yield is unavailable
cluster_analysis_max_k = 8          ; maximum clusters for silhouette selection
cluster_analysis_top_n = 20         ; top N CIDs in discrimination heatmap
cluster_analysis_variance = 0.85    ; cumulative PCA variance to retain
```

Pipeline: PCA dimensionality reduction → Ward's hierarchical clustering (silhouette-selected k) → Kruskal-Wallis + Cohen's d for CID discrimination → mutual information for CID-target association. Outputs: cluster map (choropleth), dendrogram, PCA biplot, discrimination heatmap with significance stars, target boxplot, and per-CID maps for top discriminating indices.
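
The first three stages of that pipeline can be sketched with scikit-learn and SciPy on synthetic data. This is an illustration under stated assumptions, not geocif's code: the data, the standardization step, and all variable names are made up, and Cohen's d, mutual information, and the plots are omitted.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 60 regions x 6 CIDs, drawn around three different centres.
centres = rng.normal(scale=5.0, size=(3, 6))
X = np.vstack([centre + rng.normal(size=(20, 6)) for centre in centres])

# 1. PCA, keeping enough components for 85% cumulative variance.
scores = PCA(n_components=0.85, svd_solver="full").fit_transform(
    StandardScaler().fit_transform(X)
)

# 2. Ward clustering for k = 2..max_k; keep the k with the best silhouette.
max_k = 8
best_k, best_labels, best_sil = None, None, -1.0
for k in range(2, max_k + 1):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(scores)
    sil = silhouette_score(scores, labels)
    if sil > best_sil:
        best_k, best_labels, best_sil = k, labels, sil

# 3. Kruskal-Wallis per CID: does its distribution differ across clusters?
p_values = [
    kruskal(*(X[best_labels == c, j] for c in range(best_k))).pvalue
    for j in range(X.shape[1])
]
print(best_k, np.round(p_values, 4))
```

Here `0.85` and `8` mirror `cluster_analysis_variance` and `cluster_analysis_max_k` from the config block above.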

### Spatial neighbor features

Optional GraphSAGE-style preprocessing that computes yield-correlation-weighted averages of neighboring regions' features. Enabled via `[ML]`:

```ini
use_spatial_neighbors = True
spatial_neighbor_method = knn   ; knn or full
spatial_neighbor_k = 5          ; number of nearest neighbors
```

For each admin region, the neighbor graph is built from training data using haversine distances and Pearson yield correlations as edge weights. Neighbor-aggregated features are added as `nbr_*` columns and flow through standard feature selection.

### Experiments

The experiments runner (`geocif.experiments`) provides 6 experiments for model selection, feature importance, and hyperparameter tuning:

| # | Config name | Internal name | What it does |
|---|-------------|---------------|--------------|
| 0 | `model_comparison` | `models` | Runs each model in `comparison_models` head-to-head. Produces Bradley-Terry ranking, scatter plots, MAPE bars. Identifies best model per country (required by experiments 1 & 2). |
| 1 | `cid_ablation` | `cids` | Runs the best model once per CID Type in isolation (Cold alone, FLDAS alone, etc.). Shows which climate driver category contributes most. Produces MAPE-by-CID bar chart, region×CID heatmap, year×CID chart, CID rank over time. |
| 2 | `region_filter` | `region_filter` | Drops low-production regions and re-runs the best model to test if excluding noisy regions improves national accuracy. |
| 3 | `optuna` | `optuna` | Bayesian (TPE) search over ML hyperparameters (learning rate, depth, regularization, etc.). Produces convergence, parameter importance, and parallel coordinate plots. |
| 4 | `optuna_cid_types` | `optuna_cid_types` | Bayesian search for the best combination of CID Type categories (e.g. Rain+VI+ESI may beat using all 8 types). |
| 5 | `optuna_cid_indices` | `optuna_cid_indices` | Bayesian search for the best subset of individual CID indices (e.g. PRCPTOT + AUC_NDVI + TG90p). Capped at `max_cid_indices` per trial. |

**Dependencies:** Experiments 1 and 2 require experiment 0 first. Experiments 3–5 are independent.
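
A hypothetical pre-flight check for that dependency rule is sketched below. It only inspects the current `run_experiments` list; whether a result from an earlier run would also satisfy the dependency is not stated here, so treat the check as conservative.

```python
# Prerequisites as stated above: experiments 1 and 2 need experiment 0.
REQUIRES = {
    "cid_ablation": ["model_comparison"],
    "region_filter": ["model_comparison"],
}


def missing_prerequisites(run_experiments):
    """Return {experiment: [prerequisites not in the requested list]}."""
    requested = set(run_experiments)
    return {
        name: [dep for dep in deps if dep not in requested]
        for name, deps in REQUIRES.items()
        if name in requested and any(dep not in requested for dep in deps)
    }


print(missing_prerequisites(["cid_ablation", "optuna"]))
# {'cid_ablation': ['model_comparison']}
```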

Configure in `geocif.txt`:

```ini
[experiments]
run_experiments = ["model_comparison", "cid_ablation"]
comparison_models = ["catboost", "tabpfn", "tabicl"]
n_trials = 30
n_trials_cid_types = 30
n_trials_cid_indices = 60
max_cid_indices = 25
```

Run:

```python
from geocif import experiments
experiments.run(cfg_geocif)
```

### Experiments output

The experiments runner writes to a dedicated DB and analysis folder under `dir_output`:

```
{dir_output}/
└── ml/
    ├── db/
    │   └── experiments_{MMMM_DD_YYYY_HH}H.db
    │
    └── analysis/
        └── {MMMM_DD_YYYY}/
            ├── experiments/                            # Experiment 0 (model comparison)
            │   ├── experiment_metrics.csv
            │   ├── heatmap_models.png
            │   ├── boxplot_models.png
            │   ├── regional_mape_models_{country}.png
            │   ├── error_distribution_models.png
            │   └── metric_comparison.png
            │
            └── optimization/                           # Optuna hyperparameter search
                ├── optuna_trials.csv
                ├── best_params.csv
                ├── convergence.png
                ├── optimization_history.png
                ├── param_importances.png
                └── parallel_coordinate.png
```

### Outlook output

The yield outlook runner produces a diverging choropleth map showing current forecast yield as a percentage of the historical mean/median prediction per region, plus a combined CSV.

```
{dir_output}/
└── ml/
    └── analysis/
        └── {MMMM_DD_YYYY}/
            └── outlook/
                ├── yield_outlook_{country}_{crop}_{model}_{stage}_{year}.png
                └── yield_outlook_{year}.csv
```
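
The percentage shown on the outlook map can be sketched as below. The function and its inputs are hypothetical; `n_years` and `aggregation` mirror the `outlook_n_years` and `outlook_aggregation` settings, and the sample yields are invented.

```python
from statistics import mean, median


def outlook_percent(current, history, n_years=10, aggregation="mean"):
    """Current forecast as a percentage of the historical baseline.

    history: {year: predicted yield} for past years.
    """
    recent = [history[year] for year in sorted(history)[-n_years:]]
    baseline = mean(recent) if aggregation == "mean" else median(recent)
    return 100.0 * current / baseline


history = {2021: 2.0, 2022: 2.5, 2023: 1.5, 2024: 2.0}
pct = outlook_percent(2.2, history)
print(round(pct, 1))  # 110.0
```

Values above 100 mean the current forecast exceeds the historical baseline, which is what a diverging color scale centered on 100 conveys.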

## Config file documentation

### geobase.txt

Shared paths and dataset settings. All directory paths are derived from `dir_base`.

```ini
[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO

dir_inputs = ${dir_base}/inputs
dir_logs = ${dir_base}/logs
dir_download = ${dir_inputs}/download
dir_intermed = ${dir_inputs}/intermed
dir_metadata = ${dir_inputs}/metadata
dir_condition = ${dir_inputs}/crop_condition
dir_crop_inputs = ${dir_condition}/crop_t20

dir_boundary_files = ${dir_metadata}/boundary_files
dir_crop_calendars = ${dir_metadata}/crop_calendars
dir_crop_masks = ${dir_metadata}/crop_masks
dir_images = ${dir_metadata}/images
dir_production_statistics = ${dir_metadata}/production_statistics

dir_output = ${dir_base}/outputs

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
```
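
To see how the derived paths resolve, here is a short sketch assuming the file is read with `configparser` and `ExtendedInterpolation`, and that list values are parsed with `ast.literal_eval`. Both are assumptions about the loader; the excerpt is a trimmed copy of the config above.

```python
import ast
from configparser import ConfigParser, ExtendedInterpolation

geobase_txt = """
[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO
dir_inputs = ${dir_base}/inputs
dir_metadata = ${dir_inputs}/metadata
dir_crop_masks = ${dir_metadata}/crop_masks

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(geobase_txt)

# Each ${...} reference is expanded when the value is read.
crop_masks = parser["PATHS"]["dir_crop_masks"]
# Values are stored as strings; literal_eval turns the list text into a list.
datasets = ast.literal_eval(parser["DATASETS"]["datasets"])
print(crop_masks)  # /gpfs/data1/cmongp1/GEO/inputs/metadata/crop_masks
print(datasets[0])  # CHIRPS
```

Changing `dir_base` alone therefore relocates every derived directory.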

### countries.txt

Single source of truth for per-country config. Shared by both geoprepare and geocif.

```ini
[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS
use_cropland_mask = False
calendar_file = crop_calendar.csv

; AMIS countries (inherit from DEFAULT, override crops if needed)
[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

; EWCM countries (full per-country config)
[kenya]
category = EWCM
admin_level = admin_1
seasons = [1, 2]
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']

[malawi]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']
```

### crops.txt

Crop mask filenames and calendar category definitions.

```ini
; Crop masks
[maize]
mask = Percent_Maize.tif

[winter_wheat]
mask = Percent_Winter_Wheat.tif

[sorghum]
mask = cropland_v9.tif

; Calendar categories
[EWCM]
use_cropland_mask = True
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize', 'sorghum', 'millet', 'rice', 'winter_wheat', 'teff']
eo_model = ['aef', 'nsidc_surface', 'nsidc_rootzone', 'ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'chirps_gefs', 'esi_4wk']

[AMIS]
calendar_file = AMISCM_2026-01-05.xlsx
```

### geoextract.txt

Extraction-only settings for geoprepare. Loaded last so its `[DEFAULT]` overrides shared defaults.

```ini
[DEFAULT]
method = JRC
redo = False
threshold = True
floor = 20
ceil = 90
countries = ["malawi"]
forecast_seasons = [2022]

[PROJECT]
parallel_extract = True
parallel_merge = False
```

### geocif.txt

Indices, ML, and agmet settings for geocif. Country overrides go here when geocif needs different values than countries.txt (e.g., a subset of crops).

```ini
[AGMET]
eo_plot = ['ndvi', 'chirts_era5_tmax', 'chirts_era5_tmin', 'chirps', 'esi_4wk', 'nsidc_surface', 'nsidc_rootzone']
logo_harvest = harvest.png
logo_geoglam = geoglam.png

; Country overrides (only where geocif differs from countries.txt)
[ethiopia]
crops = ['winter_wheat']

[bangladesh]
crops = ['rice']
admin_level = admin_2
boundary_file = bangladesh.shp

; ML model definitions
[catboost]
ML_model = True

[analog]
ML_model = False

[ML]
model_type = REGRESSION
target = Yield (tn per ha)
feature_selection = gOMP
cluster_strategy = single
check_yield_trend = False
use_spatial_neighbors = True
spatial_neighbor_method = knn
spatial_neighbor_k = 5
lag_yield_as_feature = True
lag_years = 3
median_yield_as_feature = False
median_years = 5
include_lat_lon_as_feature = False
panel_model = True
cat_features = ["Harvest Year", "Region_ID", "Region"]
outlook_n_years = 10        ; Number of historical years for yield outlook comparison
outlook_aggregation = mean  ; mean or median
run_time_steps = latest         ; latest, current, all, or N (every Nth time period)
run_cluster_analysis = False
cluster_analysis_proxy = AUC_NDVI
cluster_analysis_max_k = 8
cluster_analysis_top_n = 20
cluster_analysis_variance = 0.85

[LOGGING]
log_level = INFO

[DEFAULT]
data_source = harvest
method = monthly_r
project_name = geocif
countries = ["kenya"]
crops = ['maize']
admin_level = admin_1
models = ['catboost']
seasons = [1]
threshold = True
floor = 20
```

### FLDAS forecast overlay

When FLDAS columns are present in the merged data (e.g. `fldas_tair_tavg_lead0` through `_lead5`), agmet plots automatically overlay forecast dots on matching panels:

| FLDAS variable | Target panel |
|---|---|
| `fldas_tair_tavg` | Temperature |
| `fldas_totalprecip_tavg` | Daily precipitation |
| `fldas_soilmoist_tavg` | Soil moisture (surface) |

Each lead time (0–5) appears as a diamond marker with decreasing opacity (lead 0 = most opaque). Dots beyond the harvest date are suppressed. No config changes are needed — detection is automatic.
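
The automatic detection could work along the lines of the sketch below, which matches column names against the `_lead{N}` pattern and the table above. The function name and the linear opacity ramp (1.0 down to 0.3) are hypothetical, and the harvest-date cutoff is left out.

```python
import re

# Mapping taken from the table above.
PANELS = {
    "fldas_tair_tavg": "Temperature",
    "fldas_totalprecip_tavg": "Daily precipitation",
    "fldas_soilmoist_tavg": "Soil moisture (surface)",
}
LEAD_COLUMN = re.compile(r"^(fldas_\w+?)_lead(\d+)$")


def fldas_overlays(columns, max_lead=5):
    """Return (column, panel, lead, alpha) for every recognised FLDAS column."""
    overlays = []
    for column in columns:
        match = LEAD_COLUMN.match(column)
        if not match or match.group(1) not in PANELS:
            continue
        lead = int(match.group(2))
        # Hypothetical opacity ramp: 1.0 at lead 0, fading linearly to 0.3 at max_lead.
        alpha = 1.0 - 0.7 * lead / max_lead
        overlays.append((column, PANELS[match.group(1)], lead, round(alpha, 2)))
    return overlays


columns = ["ndvi", "fldas_tair_tavg_lead0", "fldas_soilmoist_tavg_lead5"]
for overlay in fldas_overlays(columns):
    print(overlay)
```

Columns that do not match the pattern, such as `ndvi`, are ignored, so merged data without FLDAS columns yields an empty overlay list.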

## Release

To publish a new version to PyPI:

1. Bump `__version__` in `geocif/__init__.py` and `version` in `pyproject.toml`
2. Build and upload:
   ```bash
   uv build
   uvx twine upload dist/geocif-<version>*
   ```
3. Commit:
   ```bash
   git add geocif/__init__.py pyproject.toml
   git commit -m "Bump to <version>"
   ```

## Credits

This project was supported by NASA Applied Sciences Grant No. 80NSSC17K0625 through the NASA Harvest Consortium, and the NASA Acres Consortium under NASA Grant #80NSSC23M0034.
