Metadata-Version: 2.4
Name: dl4eo
Version: 0.4.0
Summary: Deep Learning for Earth Observation — automated training-dataset builder for EO segmentation tasks
Home-page: https://github.com/Sk-2103/dl4eo
Author: Saurabh Kaushik
Author-email: saurabh21.kaushik@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.22
Requires-Dist: rasterio>=1.3
Requires-Dist: geopandas>=0.12
Requires-Dist: shapely>=1.8
Requires-Dist: matplotlib>=3.5
Requires-Dist: joblib>=1.1
Requires-Dist: pystac-client>=0.6
Requires-Dist: planetary-computer>=0.4
Requires-Dist: fiona>=1.8
Requires-Dist: requests>=2.28
Requires-Dist: scipy>=1.8
Provides-Extra: train
Requires-Dist: torch>=2.0; extra == "train"
Requires-Dist: lightning>=2.0; extra == "train"
Requires-Dist: segmentation-models-pytorch>=0.3; extra == "train"
Requires-Dist: timm>=0.9; extra == "train"
Requires-Dist: torchmetrics>=1.0; extra == "train"
Provides-Extra: torchgeo
Requires-Dist: torchgeo>=0.5; extra == "torchgeo"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# dl4eo

**dl4eo** is a Python package for building multi-source Earth Observation training datasets and training segmentation models end-to-end. It automates the full pipeline from raw satellite data to model checkpoint:

- **Sentinel-2** (L2A, cloud-filtered, spectral indices)
- **Sentinel-1 RTC** (VV + VH, batched by date)
- **Copernicus DEM** (elevation + slope, per-scene mosaic)
- **Segmentation masks** from any vector label file
- **Train-ready PyTorch dataset** with global normalization
- **Model training** with UNet, DeepLabV3+, SegFormer, ViT, and more

---

## Installation

```bash
# Pipeline only (no PyTorch required)
pip install dl4eo

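# Pipeline + training stack (quotes stop shells such as zsh from expanding the brackets)
pip install "dl4eo[train]"

# Optional torchgeo integration
pip install "dl4eo[torchgeo]"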
```

Requires Python ≥ 3.8.

---

## Quick Start

### 1 — Build a dataset

```python
import dl4eo

dl4eo.generate_dataset(
    base_dir="/data/glacial_lakes",
    aoi_shapefile_dir="/data/aoi/",           # folder with AOI.shp (study area polygon)
    feature_shapefile="/data/lake_boundaries.shp",  # label polygons
    date_range="2021-06-01/2021-08-31",
    cloud_cover=20,
    patch_size=256,           # pixels
    overlap=0.0,
    spectral_index="NDWI",    # NDWI | NDSI | NDVI | NDRE | EVI | None
    skip_sentinel1=False,
    skip_dem=False,
    normalize=False,          # recommended: normalize at load time via PatchDataset
    n_jobs=8,
)
```
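
For orientation, the `spectral_index` options name standard normalized band ratios. The sketch below computes NDWI in its common McFeeters form, (Green - NIR) / (Green + NIR), which for Sentinel-2 usually means bands B03 and B08. The function name, the epsilon guard, and the toy reflectance values are illustrative assumptions; the formula and band choice that dl4eo uses internally are not stated here and may differ.

```python
import numpy as np


def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """McFeeters NDWI = (Green - NIR) / (Green + NIR); eps guards against 0/0."""
    green = green.astype("float32")
    nir = nir.astype("float32")
    return (green - nir) / (green + nir + eps)


# Toy 2x2 reflectance arrays (hypothetical values):
# the top row behaves like water (brighter in green than in NIR),
# the bottom row like vegetation (much brighter in NIR).
green = np.array([[0.10, 0.08], [0.05, 0.04]])
nir = np.array([[0.02, 0.02], [0.30, 0.25]])

index = ndwi(green, nir)
water = index > 0  # a common starting threshold; tune it per scene
```

Positive values tend to indicate open water, which is why NDWI is a natural extra band for lake or flood labels.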

### 2 — Quality control, splits, statistics

```python
import dl4eo

# Filter bad patches (nodata, no foreground, constant bands)
valid = dl4eo.qc.validate("/data/glacial_lakes", min_positive_fraction=0.001)

# Create train / val / test splits
splits = dl4eo.splits.make_splits(
    "/data/glacial_lakes",
    ratios=(0.7, 0.15, 0.15),
    strategy="temporal",   # "random" | "temporal" | "spatial"
    valid_file="/data/glacial_lakes/valid_patches.txt",
)

# Global per-band statistics (training split only — no leakage)
stats = dl4eo.stats.compute("/data/glacial_lakes", split="train")
```
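
To make the `"temporal"` strategy concrete, here is a minimal sketch of one way such a split can work: sort patches by acquisition date and cut the ordered list by the requested ratios, so that validation and test patches are always later than training patches. The `temporal_split` helper and the patch IDs are hypothetical; this is not necessarily how `dl4eo.splits.make_splits` is implemented.

```python
from datetime import date


def temporal_split(patch_dates, ratios=(0.7, 0.15, 0.15)):
    """Sort patch IDs by acquisition date and cut them into train/val/test."""
    ordered = sorted(patch_dates, key=patch_dates.get)
    n = len(ordered)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],  # remainder goes to test
    }


# Hypothetical patch IDs mapped to acquisition dates (one per day in June 2021).
patch_dates = {f"patch_{i:02d}": date(2021, 6, 1 + i) for i in range(20)}
splits_sketch = temporal_split(patch_dates)
```

A date-ordered split is a stricter test of generalization than a random one, because scenes from the same acquisition never land on both sides of the split.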

### 3 — PyTorch dataset

```python
from dl4eo.io import PatchDataset
from torch.utils.data import DataLoader

ds = PatchDataset(
    "/data/glacial_lakes",
    split="train",
    split_file="/data/glacial_lakes/splits.json",
    stats_file="/data/glacial_lakes/stats.json",
    norm="zscore",    # "zscore" | "minmax" | "percentile" | None
    bands=None,       # None = all bands; or e.g. [0, 1, 2, 6, 7]
)

sample = ds[0]
# sample["image"]  →  FloatTensor [C, H, W]
# sample["mask"]   →  LongTensor  [H, W]

loader = DataLoader(ds, batch_size=16, shuffle=True, num_workers=4)
```

`PatchDataset` inherits from `torchgeo.datasets.NonGeoDataset` when torchgeo is installed, and falls back to `torch.utils.data.Dataset` otherwise.

### 4 — Train a model (one-liner)

```python
import dl4eo

module = dl4eo.train(
    data_dir="/data/glacial_lakes",
    model="unet",            # see SUPPORTED_MODELS below
    backbone="resnet34",
    num_classes=2,
    split_strategy="temporal",
    norm="zscore",
    loss="dice_ce",          # "dice_ce" | "dice" | "ce" | "focal"
    batch_size=16,
    max_epochs=50,
    accelerator="gpu",
    devices=1,
)
# → auto-generates splits.json + stats.json if missing
# → saves best checkpoint (monitored on val/iou)
# → returns loaded SegmentationModule
```
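
The checkpoint is selected on `val/iou`. As a reminder of what that metric measures, the sketch below computes intersection over union for a single class from two integer label maps. The `iou` helper and the toy arrays are illustrative assumptions; how dl4eo averages IoU across classes and batches is not specified here.

```python
import numpy as np


def iou(pred: np.ndarray, target: np.ndarray, cls: int = 1) -> float:
    """Intersection over union for one class of two integer label maps."""
    p = pred == cls
    t = target == cls
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both maps
    return float(np.logical_and(p, t).sum() / union)


# Toy prediction and ground truth (hypothetical values).
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = iou(pred, target)  # 2 shared pixels out of 4 in the union
```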

### 5 — Build a model manually

```python
from dl4eo.train import build_model, SegmentationModule, SegDataModule, SUPPORTED_MODELS
import lightning as L

print(SUPPORTED_MODELS)
# ['unet', 'unet++', 'deeplabv3+', 'fpn', 'pspnet', 'linknet', 'pan', 'manet',
#  'segformer', 'vit-tiny', 'vit-small', 'vit-base']

net    = build_model("segformer", in_channels=10, num_classes=2)
module = SegmentationModule(net, num_classes=2, lr=5e-4, loss="dice_ce")

dm = SegDataModule(
    data_dir   = "/data/glacial_lakes",
    split_file = "/data/glacial_lakes/splits.json",
    stats_file = "/data/glacial_lakes/stats.json",
    batch_size = 8,
)

trainer = L.Trainer(max_epochs=100, accelerator="gpu", devices=1)
trainer.fit(module, dm)
```

---

## Pipeline stages

| Stage | Description |
|-------|-------------|
| 1 | Download Sentinel-2 L2A (STAC / Planetary Computer, cloud-filtered) |
| 2 | Preprocess S2: single-pass resample to 10 m + spectral index + stack |
| 3 | Generate patch AOIs: windowed reads, intersects user AOI polygon |
| 4 | Prepare DEM: one mosaic per scene, windowed reproject per patch |
| 5 | Prepare Sentinel-1 RTC: batched STAC search by date, VV+VH stack |
| 6 | Generate segmentation masks from label shapefile |
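
As a rough illustration of how `patch_size` and `overlap` shape the patch grid in stage 3, the sketch below lists the top-left pixel offsets of the patches that tile a raster. It assumes that `overlap` is a fraction of the patch size and that partial patches at the raster edge are dropped; both are assumptions, and the `patch_offsets` helper is hypothetical rather than part of dl4eo.

```python
def patch_offsets(height, width, patch_size=256, overlap=0.0):
    """Top-left (row, col) offsets of full patches tiling a raster."""
    # Stride shrinks as overlap grows; clamp to 1 so the range step stays valid.
    stride = max(1, int(patch_size * (1.0 - overlap)))
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(r, c) for r in rows for c in cols]


# A hypothetical 1024 x 768 raster.
no_overlap = patch_offsets(1024, 768, patch_size=256, overlap=0.0)
half_overlap = patch_offsets(1024, 768, patch_size=256, overlap=0.5)
```

Under these assumptions, raising `overlap` from 0.0 to 0.5 on this raster grows the grid from 4 x 3 to 7 x 5 patches, which is worth keeping in mind for storage and for leakage between spatially adjacent patches.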

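Normalization is intentionally left out of the pipeline stages (keep `normalize=False` in `generate_dataset`). Instead, run `dl4eo.stats.compute()` on the training split and apply the result at load time with `PatchDataset(norm="zscore")`. Statistics computed once over the training split put every patch on the same scale and keep validation and test data out of the normalization, which avoids both per-patch scale inconsistency and data leakage.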
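
The idea can be sketched in a few lines of NumPy: compute one mean and standard deviation per band over all training patches, then reuse those numbers for every patch. The array shapes, the synthetic data, and the `zscore` helper are illustrative assumptions and do not reflect the layout of `stats.json`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 8 patches, 3 bands, 16x16 pixels,
# with each band on a very different scale (e.g. reflectance vs. elevation).
scales = np.array([20.0, 1.0, 0.05]).reshape(1, 3, 1, 1)
offsets = np.array([100.0, 5.0, 0.2]).reshape(1, 3, 1, 1)
train = rng.standard_normal((8, 3, 16, 16)) * scales + offsets

# Global per-band statistics over patches and pixels (training split only).
mean = train.mean(axis=(0, 2, 3))
std = train.std(axis=(0, 2, 3))


def zscore(patch, mean, std, eps=1e-8):
    """Normalize one [C, H, W] patch with the shared per-band statistics."""
    return (patch - mean[:, None, None]) / (std[:, None, None] + eps)


normalized = np.stack([zscore(p, mean, std) for p in train])
```

The same `mean` and `std` would then be applied unchanged to validation and test patches, so those splits never influence the statistics.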

---

## Supported models

By default, all models are trained from scratch, so they accept an arbitrary number of input channels (no dataset-specific pretrained weights are used).

| Family | Models | Default backbone |
|--------|--------|-----------------|
| SMP | `unet`, `unet++`, `deeplabv3+`, `fpn`, `pspnet`, `linknet`, `pan`, `manet` | `resnet34` |
| SegFormer | `segformer` | `swin_tiny_patch4_window7_224` |
| ViT | `vit-tiny`, `vit-small`, `vit-base` | timm ViT + patch-shuffle decoder |

For 3-channel input, SMP models can instead use an ImageNet-pretrained encoder by setting `weights="imagenet"`.

---

## Output structure

```
base_dir/
├── stack/               # Scene-level S2 stacks (bands + spectral index)
├── images/              # Clipped S2 patches
├── DEM/                 # Per-scene DEM mosaics + per-patch stacks
├── GRD/                 # Downloaded SAR granules (VV, VH)
├── Clipped_SAR/         # SAR reprojected to patch grid
├── stacked/             # S2 + DEM patches  (10 bands)
├── stacked_with_sar/    # S2 + DEM + SAR patches  (primary output)
├── mask/                # Binary (or multi-class) segmentation masks
├── AOI_boxes/           # Per-scene patch grid shapefiles
├── splits.json          # Train / val / test split (after dl4eo.splits)
├── stats.json           # Per-band statistics   (after dl4eo.stats)
└── valid_patches.txt    # QC-passing patch list  (after dl4eo.qc)
```

---

## Input requirements

| Parameter | Description |
|-----------|-------------|
| `aoi_shapefile_dir` | Folder containing one or more AOI `.shp` files (study area polygon) |
| `feature_shapefile` | Label vector file (e.g. lake outlines) — used for mask generation and patch filtering |
| `date_range` | `"YYYY-MM-DD/YYYY-MM-DD"` |

The AOI polygon controls which patches are generated. Only patches that intersect both the AOI and at least one label feature are kept.
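
The keep/drop rule can be sketched as follows. To stay dependency-free, the example uses axis-aligned bounding boxes in place of real polygons; with actual geometries the same test would normally be written with shapely's `intersects`. The `BBox` class, the `keep_patch` helper, and all coordinates are hypothetical.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned box standing in for a polygon (simplification)."""
    minx: float
    miny: float
    maxx: float
    maxy: float

    def intersects(self, other: "BBox") -> bool:
        return (self.minx <= other.maxx and other.minx <= self.maxx
                and self.miny <= other.maxy and other.miny <= self.maxy)


def keep_patch(patch: BBox, aoi: BBox, labels: list) -> bool:
    """Keep a patch only if it touches the AOI and at least one label feature."""
    return patch.intersects(aoi) and any(patch.intersects(lb) for lb in labels)


aoi = BBox(0, 0, 100, 100)
labels = [BBox(10, 10, 20, 20), BBox(150, 150, 160, 160)]
patches = {
    "a": BBox(0, 0, 32, 32),        # inside AOI, overlaps the first label
    "b": BBox(64, 64, 96, 96),      # inside AOI, but no label nearby
    "c": BBox(140, 140, 172, 172),  # overlaps a label, but outside the AOI
}
kept = [name for name, p in patches.items() if keep_patch(p, aoi, labels)]
```

One consequence of this rule is that background-only patches never reach the dataset, so the class balance seen in training is not the class balance of the full scene.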

---

## Dependencies

**Core** (installed automatically):
`numpy`, `rasterio`, `geopandas`, `shapely`, `fiona`, `matplotlib`, `joblib`, `pystac-client`, `planetary-computer`, `requests`, `scipy`

**Training** (`pip install "dl4eo[train]"`):
`torch>=2.0`, `lightning>=2.0`, `segmentation-models-pytorch>=0.3`, `timm>=0.9`, `torchmetrics>=1.0`

**Optional** (`pip install "dl4eo[torchgeo]"`):
`torchgeo>=0.5`, which enables the `NonGeoDataset` base class for `PatchDataset`

---

## Example use cases

- Glacial lake mapping and segmentation
- Flood extent extraction
- Multimodal image fusion (S2 + S1 + DEM)
- Patch-based dataset generation for semantic segmentation

---

## Author

Developed by [Saurabh Kaushik](https://scholar.google.com/citations?user=UBGlaXIAAAAJ)
Postdoctoral Researcher · University of Arizona
Earth Observation · Deep Learning · Geo-Foundational Models · Cryosphere

---

## License

MIT License

---

## Citation

If you use `dl4eo` in your research, please cite:

```bibtex
@misc{kaushik2026dl4eo,
  author       = {Saurabh Kaushik},
  title        = {{dl4eo: A Python package for multi-source Earth Observation dataset building and segmentation model training}},
  year         = {2026},
  howpublished = {\url{https://pypi.org/project/dl4eo/}},
}
```
