Metadata-Version: 2.4
Name: dl4eo
Version: 0.5.0
Summary: Deep Learning for Earth Observation — automated training-dataset builder for EO segmentation tasks
Home-page: https://github.com/Sk-2103/dl4eo
Author: Saurabh Kaushik
Author-email: saurabh21.kaushik@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.22
Requires-Dist: rasterio>=1.3
Requires-Dist: geopandas>=0.12
Requires-Dist: shapely>=1.8
Requires-Dist: matplotlib>=3.5
Requires-Dist: joblib>=1.1
Requires-Dist: pystac-client>=0.6
Requires-Dist: planetary-computer>=0.4
Requires-Dist: fiona>=1.8
Requires-Dist: requests>=2.28
Requires-Dist: scipy>=1.8
Provides-Extra: train
Requires-Dist: torch>=2.0; extra == "train"
Requires-Dist: lightning>=2.0; extra == "train"
Requires-Dist: segmentation-models-pytorch>=0.3; extra == "train"
Requires-Dist: timm>=0.9; extra == "train"
Requires-Dist: torchmetrics>=1.0; extra == "train"
Provides-Extra: torchgeo
Requires-Dist: torchgeo>=0.5; extra == "torchgeo"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# dl4eo

**dl4eo** is a Python package for building multi-source Earth Observation training datasets and training segmentation models end-to-end. It automates the full pipeline from raw satellite data to model checkpoint:

- **Sentinel-2** (L2A, cloud-filtered, spectral indices)
- **Sentinel-1 RTC** (VV + VH, batched by date)
- **Copernicus DEM** (elevation + slope, per-scene mosaic)
- **Segmentation masks** from any vector label file
- **Train-ready PyTorch dataset** with global normalization
- **Model training** with UNet, DeepLabV3+, SegFormer, ViT, and more
- **Evaluation** with per-class IoU / F1 / Precision / Recall / Kappa + GeoTIFF prediction export

---

## Installation

```bash
# Pipeline only (no PyTorch required)
pip install dl4eo

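# Pipeline + training stack (quotes stop shells such as zsh from expanding the brackets)
pip install "dl4eo[train]"

# Optional: torchgeo base class for PatchDataset
pip install "dl4eo[torchgeo]"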
```

Requires Python ≥ 3.8.

---

## Quick Start

### 1 — Build a dataset

```python
import dl4eo

dl4eo.generate_dataset(
    base_dir="/data/glacial_lakes",
    aoi_shapefile_dir="/data/aoi/",           # folder with AOI.shp (study area polygon)
    feature_shapefile="/data/lake_boundaries.shp",  # label polygons
    date_range="2021-06-01/2021-08-31",
    cloud_cover=20,
    patch_size=256,           # pixels
    overlap=0.0,
    spectral_index="NDWI",    # NDWI | NDSI | NDVI | NDRE | EVI | None
    skip_sentinel1=False,
    skip_dem=False,
    normalize=False,          # recommended: normalize at load time via PatchDataset
    n_jobs=8,
)
```
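
As a rough illustration of how `patch_size` and `overlap` interact, the sketch below tiles a raster into square windows. It assumes `overlap` is a fraction of the patch size, so the stride is `patch_size * (1 - overlap)`, and it drops partial patches at the edges. The helper `patch_origins` is hypothetical and not part of dl4eo, whose actual tiling may differ.

```python
def patch_origins(width, height, patch_size=256, overlap=0.0):
    """Return (col, row) upper-left corners of full patches inside a raster.

    Assumption: overlap is a fraction in [0, 1) of the patch size.
    Partial patches at the right/bottom edges are dropped in this sketch.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(round(patch_size * (1.0 - overlap))))
    cols = range(0, width - patch_size + 1, stride)
    rows = range(0, height - patch_size + 1, stride)
    return [(c, r) for r in rows for c in cols]


origins = patch_origins(1024, 512, patch_size=256, overlap=0.0)
print(len(origins))   # 8 patches: 4 columns x 2 rows
half = patch_origins(1024, 512, patch_size=256, overlap=0.5)
print(len(half))      # 21 patches: 7 columns x 3 rows
```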

### 2 — Quality control, splits, statistics

```python
import dl4eo

# Filter bad patches (nodata, no foreground, constant bands)
valid = dl4eo.qc.validate("/data/glacial_lakes", min_positive_fraction=0.001)

# Create train / val / test splits
splits = dl4eo.splits.make_splits(
    "/data/glacial_lakes",
    ratios=(0.7, 0.15, 0.15),
    strategy="temporal",   # "random" | "temporal" | "spatial"
    valid_file="/data/glacial_lakes/valid_patches.txt",
)

# Global per-band statistics (training split only — no leakage)
stats = dl4eo.stats.compute("/data/glacial_lakes", split="train")
# → {"band_1": {"mean": 6032.7, "std": 3471.1, "p2": 540.0, "p98": 11752.0},
#    "band_2": {...}, ..., "_meta": {"n_files": 25, "split": "train"}}
```
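
The `"temporal"` strategy is not spelled out in this README. One plausible reading is that patches are ordered by acquisition date and cut into contiguous blocks, so that no date is shared between splits. The sketch below follows that reading; `temporal_split` and the `(patch_id, date)` records are hypothetical and may not match what `make_splits` does internally.

```python
from collections import defaultdict


def temporal_split(patches, ratios=(0.7, 0.15, 0.15)):
    """Split (patch_id, date) pairs into train/val/test by acquisition date.

    All patches sharing a date land in the same split, and the earliest
    dates go to train. Hypothetical helper, not dl4eo's implementation.
    """
    by_date = defaultdict(list)
    for patch_id, date in patches:
        by_date[date].append(patch_id)
    dates = sorted(by_date)              # ISO dates sort chronologically
    n_train = int(round(ratios[0] * len(dates)))
    n_val = int(round(ratios[1] * len(dates)))
    groups = {
        "train": dates[:n_train],
        "val": dates[n_train:n_train + n_val],
        "test": dates[n_train + n_val:],
    }
    return {
        name: [pid for day in days for pid in by_date[day]]
        for name, days in groups.items()
    }


patches = [
    ("p1", "2021-06-01"), ("p2", "2021-06-01"), ("p3", "2021-06-11"),
    ("p4", "2021-06-21"), ("p5", "2021-07-01"), ("p6", "2021-07-11"),
]
print(temporal_split(patches, ratios=(0.6, 0.2, 0.2)))
# {'train': ['p1', 'p2', 'p3', 'p4'], 'val': ['p5'], 'test': ['p6']}
```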

### 3 — PyTorch dataset

```python
from dl4eo.io import PatchDataset
from torch.utils.data import DataLoader

ds = PatchDataset(
    "/data/glacial_lakes",
    split="train",
    split_file="/data/glacial_lakes/splits.json",
    stats_file="/data/glacial_lakes/stats.json",
    norm="zscore",    # "zscore" | "minmax" | "percentile" | None
    bands=None,       # None = all bands; or e.g. [0, 1, 2, 6, 7]
)

sample = ds[0]
# sample["image"]  →  FloatTensor [C, H, W]
# sample["mask"]   →  LongTensor  [H, W]

loader = DataLoader(ds, batch_size=16, shuffle=True, num_workers=4)
```

`PatchDataset` inherits from `torchgeo.datasets.NonGeoDataset` when torchgeo is installed, and falls back to `torch.utils.data.Dataset` otherwise.

### 4 — Train a model (one-liner)

```python
import dl4eo

module = dl4eo.train(
    data_dir="/data/glacial_lakes",
    model="unet",            # see SUPPORTED_MODELS below
    backbone="resnet34",
    num_classes=2,
    split_strategy="temporal",
    norm="zscore",
    loss="dice_ce",          # "dice_ce" | "dice" | "ce" | "focal"
    batch_size=16,
    max_epochs=50,
    accelerator="gpu",
    devices=1,
)
# → auto-generates splits.json + stats.json if missing
# → saves best checkpoint (monitored on val/iou)
# → returns loaded SegmentationModule
```

### 5 — Evaluate and export predictions

```python
import dl4eo

# Option A — use the module returned directly from dl4eo.train()
report = dl4eo.eval.evaluate(
    module,
    data_dir        = "/data/glacial_lakes",
    splits          = ("val", "test"),
    class_names     = ["background", "lake"],
    output_dir      = "/data/glacial_lakes/eval",
    save_predictions= True,   # writes GeoTIFFs in original CRS
)

# Option B — reload a checkpoint later
module = dl4eo.eval.load_module(
    "checkpoints/unet/best-epoch=10.ckpt",
    model       = "unet",
    backbone    = "resnet34",
    in_channels = 10,
)
report = dl4eo.eval.evaluate(module, "/data/glacial_lakes")
```

**`evaluate()` prints a formatted table and saves two report files, plus prediction GeoTIFFs when `save_predictions=True`:**

```
eval/
├── predictions/
│   ├── val/   *.tif   ← single-band uint8 GeoTIFF, original CRS + transform
│   └── test/  *.tif
├── eval_report.json   ← full metrics + confusion matrix
└── eval_report.txt    ← plain-text table for logging
```

**Metrics reported per class and as mean:**
IoU · F1 · Precision · Recall · Overall Accuracy · Cohen's Kappa
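
All of these metrics can be derived from a confusion matrix. The sketch below uses the standard definitions, with overall accuracy and Cohen's Kappa computed once over all classes; `segmentation_metrics` is a hypothetical helper and is not claimed to match how `evaluate()` computes its report.

```python
def segmentation_metrics(cm):
    """Compute per-class and overall metrics from a square confusion matrix.

    cm[i][j] = number of pixels with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(n)]                        # true counts
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted counts

    def safe_div(a, b):
        return a / b if b else 0.0

    per_class = []
    for k in range(n):
        tp = cm[k][k]
        fp = col_sums[k] - tp
        fn = row_sums[k] - tp
        per_class.append({
            "iou": safe_div(tp, tp + fp + fn),
            "f1": safe_div(2 * tp, 2 * tp + fp + fn),
            "precision": safe_div(tp, tp + fp),
            "recall": safe_div(tp, tp + fn),
        })

    observed = safe_div(sum(cm[k][k] for k in range(n)), total)   # overall accuracy
    expected = safe_div(sum(r * c for r, c in zip(row_sums, col_sums)), total * total)
    kappa = safe_div(observed - expected, 1.0 - expected)
    return {"per_class": per_class, "overall_accuracy": observed, "kappa": kappa}


# Rows: true background / lake; columns: predicted background / lake.
report = segmentation_metrics([[90, 10], [5, 95]])
print(round(report["overall_accuracy"], 3))   # 0.925
print(round(report["kappa"], 3))              # 0.85
```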

### 6 — Build a model manually

```python
from dl4eo.train import build_model, SegmentationModule, SegDataModule, SUPPORTED_MODELS
import lightning as L

print(SUPPORTED_MODELS)
# ['unet', 'unet++', 'deeplabv3+', 'fpn', 'pspnet', 'linknet', 'pan', 'manet',
#  'segformer', 'vit-tiny', 'vit-small', 'vit-base']

net    = build_model("segformer", in_channels=10, num_classes=2)
module = SegmentationModule(net, num_classes=2, lr=5e-4, loss="dice_ce")

dm = SegDataModule(
    data_dir   = "/data/glacial_lakes",
    split_file = "/data/glacial_lakes/splits.json",
    stats_file = "/data/glacial_lakes/stats.json",
    batch_size = 8,
)

trainer = L.Trainer(max_epochs=100, accelerator="gpu", devices=1)
trainer.fit(module, dm)
```

---

## Pipeline stages

| Stage | Description |
|-------|-------------|
| 1 | Download Sentinel-2 L2A (STAC / Planetary Computer, cloud-filtered) |
| 2 | Preprocess S2: single-pass resample to 10 m + spectral index + stack |
| 3 | Generate patch AOIs: windowed reads, intersects user AOI polygon |
| 4 | Prepare DEM: one mosaic per scene, windowed reproject per patch |
| 5 | Prepare Sentinel-1 RTC: batched STAC search by date, VV+VH stack |
| 6 | Generate segmentation masks from label shapefile |
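
Most of the spectral indices in stage 2 are normalized differences of two bands. Using the standard definitions, NDVI is (NIR − Red) / (NIR + Red), NDWI in the McFeeters form is (Green − NIR) / (Green + NIR), and NDSI is (Green − SWIR) / (Green + SWIR). This README does not state which NDWI variant or band mapping dl4eo uses, so the helper below is a hypothetical sketch; EVI follows a different formula and is not covered.

```python
def normalized_difference(a, b, fill=0.0):
    """Element-wise (a - b) / (a + b) for two equally sized band rows.

    Pixels where a + b == 0 get `fill` instead of raising ZeroDivisionError.
    """
    return [(x - y) / (x + y) if (x + y) != 0 else fill for x, y in zip(a, b)]


green = [1200.0, 800.0, 0.0]
nir = [400.0, 2400.0, 0.0]
ndwi = normalized_difference(green, nir)   # McFeeters-style NDWI
print(ndwi)   # [0.5, -0.5, 0.0]
```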

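Normalization is intentionally kept out of the pipeline (`normalize=False` is the recommended setting for `generate_dataset`). Instead, run `dl4eo.stats.compute()` on the training split and apply `PatchDataset(norm="zscore")` at load time. Using global training-split statistics avoids per-patch scale inconsistency and data leakage.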
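
To make the load-time step concrete, the sketch below applies global statistics in the `stats.json` layout shown in the Quick Start to one band. It assumes `"zscore"` means `(x - mean) / std` and that `"percentile"` clips to the `[p2, p98]` range before rescaling to `[0, 1]`; `normalize_band` is a hypothetical helper, and `PatchDataset` may implement these modes differently.

```python
def normalize_band(values, band_stats, norm="zscore"):
    """Normalize one band's pixel values with global (training-split) statistics."""
    if norm == "zscore":
        std = band_stats["std"] or 1.0           # guard against constant bands
        return [(v - band_stats["mean"]) / std for v in values]
    if norm == "percentile":
        lo, hi = band_stats["p2"], band_stats["p98"]
        span = (hi - lo) or 1.0
        return [min(1.0, max(0.0, (v - lo) / span)) for v in values]
    raise ValueError(f"unsupported norm: {norm!r}")


stats = {"band_1": {"mean": 100.0, "std": 50.0, "p2": 0.0, "p98": 200.0}}
pixels = [50.0, 100.0, 300.0]
print(normalize_band(pixels, stats["band_1"], "zscore"))       # [-1.0, 0.0, 4.0]
print(normalize_band(pixels, stats["band_1"], "percentile"))   # [0.25, 0.5, 1.0]
```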

---

## Supported models

All models are trained from scratch on arbitrary input channels (no dataset-specific pretrained weights).

| Model | Family | Default backbone | Constraints |
|-------|--------|-----------------|-------------|
| `unet` | SMP | `resnet34` | — |
| `unet++` | SMP | `resnet34` | — |
| `deeplabv3+` | SMP | `resnet34` | `batch_size ≥ 2` per GPU (BatchNorm) |
| `fpn` | SMP | `resnet34` | — |
| `pspnet` | SMP | `resnet34` | `batch_size ≥ 2` per GPU (BatchNorm) |
| `linknet` | SMP | `resnet34` | — |
| `pan` | SMP | `resnet34` | input ≥ 128 px (pyramid pooling) |
| `manet` | SMP | `resnet34` | — |
| `segformer` | SegFormer | `swin_tiny_patch4_window7_224` | — |
| `vit-tiny` | ViT | `vit_tiny_patch16_224` | — |
| `vit-small` | ViT | `vit_small_patch16_224` | — |
| `vit-base` | ViT | `vit_base_patch16_224` | — |

SMP models also support ImageNet-pretrained encoders for 3-channel input: `weights="imagenet"`.

> **BatchNorm note:** `deeplabv3+` and `pspnet` will raise an error if a mini-batch
> contains only 1 sample. Ensure `len(train_set) % batch_size != 1`, or choose a
> `batch_size` that divides your training set evenly.
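
The condition in the note can be checked before training starts. The sketch below computes the size of the final mini-batch and filters candidate batch sizes; both helpers are hypothetical. With a hand-built `torch.utils.data.DataLoader`, passing `drop_last=True` is another way to avoid a trailing single-sample batch, though this README does not say whether `dl4eo.train()` exposes that option.

```python
def last_batch_size(n_samples, batch_size):
    """Size of the final mini-batch when the last partial batch is kept."""
    remainder = n_samples % batch_size
    return remainder if remainder else min(batch_size, n_samples)


def safe_batch_sizes(n_samples, candidates=(4, 8, 16, 32)):
    """Candidate batch sizes whose final mini-batch has at least 2 samples."""
    return [b for b in candidates if last_batch_size(n_samples, b) >= 2]


print(last_batch_size(65, 16))   # 1 -> would trip BatchNorm in training mode
print(safe_batch_sizes(65))      # []
print(safe_batch_sizes(66))      # [4, 8, 16, 32]
```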

---

## Output structure

```
base_dir/
├── stack/               # Scene-level S2 stacks (bands + spectral index)
├── images/              # Clipped S2 patches
├── DEM/                 # Per-scene DEM mosaics + per-patch stacks
├── GRD/                 # Downloaded SAR granules (VV, VH)
├── Clipped_SAR/         # SAR reprojected to patch grid
├── stacked/             # S2 + DEM patches  (10 bands)
├── stacked_with_sar/    # S2 + DEM + SAR patches  (primary output)
├── mask/                # Binary (or multi-class) segmentation masks
├── AOI_boxes/           # Per-scene patch grid shapefiles
├── splits.json          # Train / val / test split (after dl4eo.splits)
├── stats.json           # Per-band statistics   (after dl4eo.stats)
└── valid_patches.txt    # QC-passing patch list  (after dl4eo.qc)
```

---

## Input requirements

| Parameter | Description |
|-----------|-------------|
| `aoi_shapefile_dir` | Folder containing one or more AOI `.shp` files (study area polygon) |
| `feature_shapefile` | Label vector file (e.g. lake outlines) — used for mask generation and patch filtering |
| `date_range` | `"YYYY-MM-DD/YYYY-MM-DD"` |

The AOI polygon controls which patches are generated. Only patches that intersect both the AOI and at least one label feature are kept.
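
The filtering rule can be illustrated with axis-aligned bounding boxes. This is a deliberate simplification: the package works with polygons from vector files, whereas the sketch below only compares `(minx, miny, maxx, maxy)` tuples, and both helpers are hypothetical.

```python
def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def keep_patches(patches, aoi, labels):
    """Keep patches that intersect the AOI and at least one label feature."""
    return [
        p for p in patches
        if boxes_intersect(p, aoi) and any(boxes_intersect(p, lab) for lab in labels)
    ]


aoi = (0, 0, 100, 100)
labels = [(10, 10, 20, 20)]
patches = [(0, 0, 50, 50), (50, 50, 100, 100), (200, 200, 250, 250)]
print(keep_patches(patches, aoi, labels))   # [(0, 0, 50, 50)]
```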

---

## Dependencies

**Core** (installed automatically):
`numpy`, `rasterio`, `geopandas`, `shapely`, `fiona`, `matplotlib`, `joblib`, `pystac-client`, `planetary-computer`, `requests`, `scipy`

**Training** (`pip install "dl4eo[train]"`):
`torch>=2.0`, `lightning>=2.0`, `segmentation-models-pytorch>=0.3`, `timm>=0.9`, `torchmetrics>=1.0`

**Optional** (`pip install "dl4eo[torchgeo]"`):
`torchgeo>=0.5` — enables `NonGeoDataset` base class for `PatchDataset`

---

## Example use cases

- Glacial lake mapping and segmentation
- Flood extent extraction
- Multimodal image fusion (S2 + S1 + DEM)
- Patch-based dataset generation for semantic segmentation

---

## Author

Developed by [Saurabh Kaushik](https://scholar.google.com/citations?user=UBGlaXIAAAAJ)
Postdoctoral Researcher · University of Wisconsin–Madison
Earth Observation · Deep Learning · Geo-Foundational Models · Cryosphere

---

## License

MIT License

---

## Citation

If you use `dl4eo` in your research, please cite:

```bibtex
@misc{kaushik2026dl4eo,
  author       = {Saurabh Kaushik},
  title        = {{dl4eo: A Python package for multi-source Earth Observation dataset building and segmentation model training}},
  year         = {2026},
  howpublished = {\url{https://pypi.org/project/dl4eo/}},
}
```
