Metadata-Version: 2.4
Name: rasteret
Version: 0.3.9
Summary: Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.
Project-URL: Repository, https://github.com/terrafloww/rasteret
Project-URL: Documentation, https://terrafloww.github.io/rasteret
Project-URL: Issues, https://github.com/terrafloww/rasteret/issues
Project-URL: Changelog, https://terrafloww.github.io/rasteret/changelog/
Author-email: Sidharth Subramaniam <sid@terrafloww.com>
License: Apache-2.0
License-File: LICENSE
Keywords: cloud-optimized,cog,geospatial,geotiff,imagery,raster,satellite,torchgeo
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.12
Requires-Dist: affine>=2.4.0
Requires-Dist: cachetools>=5.3.2
Requires-Dist: duckdb>=1.4.4
Requires-Dist: geoarrow-pandas>=0.1.0
Requires-Dist: geoarrow-pyarrow>=0.1.0
Requires-Dist: geopandas>=0.13
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: imagecodecs>=2023.9.18
Requires-Dist: ipykernel>=6.31.0
Requires-Dist: nbconvert>=7.17.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: obstore>=0.8.0
Requires-Dist: pyarrow>=14.0.1
Requires-Dist: pyproj>=3.6.1
Requires-Dist: pystac-client>=0.7.5
Requires-Dist: rasterio<1.5.0,>=1.4.3
Requires-Dist: tqdm>=4.60
Requires-Dist: xarray>=2025.1.0
Requires-Dist: zstandard>=0.22.0
Provides-Extra: all
Requires-Dist: boto3>=1.34.0; extra == 'all'
Requires-Dist: datasets>=2.20.0; extra == 'all'
Requires-Dist: duckdb>=1.1.0; extra == 'all'
Requires-Dist: folium>=0.18.0; extra == 'all'
Requires-Dist: huggingface-hub>=0.23.0; extra == 'all'
Requires-Dist: mkdocs-jupyter>=0.25; extra == 'all'
Requires-Dist: mkdocs-llmstxt>=0.2; extra == 'all'
Requires-Dist: mkdocs-material>=9.5; extra == 'all'
Requires-Dist: mkdocs-section-index>=0.3; extra == 'all'
Requires-Dist: mkdocs>=1.6; extra == 'all'
Requires-Dist: mkdocstrings[python]>=0.27; extra == 'all'
Requires-Dist: planetary-computer>=1.0.0; extra == 'all'
Requires-Dist: pre-commit>=3.7.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.23.2; extra == 'all'
Requires-Dist: pytest-cov>=7.0.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.3.0; extra == 'all'
Requires-Dist: pytest>=8.4.2; extra == 'all'
Requires-Dist: requests>=2.31.0; extra == 'all'
Requires-Dist: ruff==0.8.6; extra == 'all'
Requires-Dist: scikit-learn>=1.5.0; extra == 'all'
Requires-Dist: sedonadb>=0.2.0; extra == 'all'
Requires-Dist: stac-geoparquet>=0.6.0; extra == 'all'
Requires-Dist: tifffile>=2023.9.18; extra == 'all'
Requires-Dist: torchgeo>=0.9.0; (python_version >= '3.12') and extra == 'all'
Requires-Dist: xarray<2027,>=2024.1.0; extra == 'all'
Provides-Extra: aws
Requires-Dist: boto3>=1.34.0; extra == 'aws'
Provides-Extra: azure
Requires-Dist: planetary-computer>=1.0.0; extra == 'azure'
Requires-Dist: requests>=2.31.0; extra == 'azure'
Provides-Extra: dev
Requires-Dist: pre-commit>=3.7.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.2; extra == 'dev'
Requires-Dist: pytest-cov>=7.0.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.3.0; extra == 'dev'
Requires-Dist: pytest>=8.4.2; extra == 'dev'
Requires-Dist: ruff==0.8.6; extra == 'dev'
Requires-Dist: tifffile>=2023.9.18; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-jupyter>=0.25; extra == 'docs'
Requires-Dist: mkdocs-llmstxt>=0.2; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs-section-index>=0.3; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.27; extra == 'docs'
Provides-Extra: earthdata
Requires-Dist: requests>=2.31.0; extra == 'earthdata'
Provides-Extra: examples
Requires-Dist: datasets>=2.20.0; extra == 'examples'
Requires-Dist: duckdb>=1.1.0; extra == 'examples'
Requires-Dist: folium>=0.18.0; extra == 'examples'
Requires-Dist: huggingface-hub>=0.23.0; extra == 'examples'
Requires-Dist: scikit-learn>=1.5.0; extra == 'examples'
Requires-Dist: sedonadb>=0.2.0; extra == 'examples'
Requires-Dist: stac-geoparquet>=0.6.0; extra == 'examples'
Provides-Extra: torchgeo
Requires-Dist: torchgeo>=0.9.0; (python_version >= '3.12') and extra == 'torchgeo'
Provides-Extra: xarray
Requires-Dist: xarray<2027,>=2024.1.0; extra == 'xarray'
Description-Content-Type: text/markdown

<h1 align="center">🛰️ Rasteret</h1>

<p align="center">
  <strong>The AI practitioner's multiplier for cloud-native satellite data.</strong><br>
  <em>A high-performance rasterio/GDAL alternative for scalable ML workflows.</em>
</p>
<p align="center">
Rasteret helps you manage and read massive satellite imagery collections with zero friction. <br>
It provides a high-performance "drop-in" backend for <strong>TorchGeo</strong>, <strong>xarray</strong>, and <strong>NumPy</strong> that is up to 20x faster than traditional GDAL-based workflows.
</p>

<p align="center">
  <a href="https://terrafloww.github.io/rasteret"><img src="https://img.shields.io/badge/docs-terrafloww.github.io%2Frasteret-009DD1" alt="Documentation"></a>
  <a href="https://discord.gg/86NgTB3Xa"><img src="https://img.shields.io/badge/Discord-chat-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
  <a href="https://pypi.org/project/rasteret/"><img src="https://img.shields.io/pypi/v/rasteret?color=blue" alt="PyPI"></a>
  <a href="https://pypi.org/project/rasteret/"><img src="https://img.shields.io/pypi/pyversions/rasteret" alt="Python"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache--2.0-blue" alt="License"></a>
</p>

---

## Why Rasteret?

Geospatial data science is often 80% "plumbing." You spend hours writing `pystac-client` loops, manual `ThreadPoolExecutor` code, and fragile CRS-alignment logic just to get a batch of pixels for your model.

**Rasteret turns that 80% into a few lines of code.**

It separates the **Control Plane** (managing your scenes, labels, and splits in a local Parquet index) from the **Data Plane** (streaming pixels directly from cloud COGs).
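
To make the split concrete, here is a minimal, dependency-free sketch of the idea. It is not Rasteret's implementation: `SceneRecord`, `select_scenes`, `plan_reads`, and the `s3://example-bucket/...` URLs are hypothetical names chosen for illustration, and the real index is a Parquet table rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneRecord:
    """One row of a hypothetical local index (not Rasteret's real schema)."""
    scene_id: str
    split: str          # "train", "val", or "test"
    cloud_cover: float  # percent
    url: str            # where the COG lives in cloud storage


def select_scenes(index, split, max_cloud_cover):
    """Control plane: filter rows locally; no network access is needed."""
    return [
        row for row in index
        if row.split == split and row.cloud_cover < max_cloud_cover
    ]


def plan_reads(scenes, bands):
    """Data plane input: one (url, band) request per selected scene and band."""
    return [(scene.url, band) for scene in scenes for band in bands]


index = [
    SceneRecord("S2A_001", "train", 4.2, "s3://example-bucket/S2A_001.tif"),
    SceneRecord("S2A_002", "train", 37.0, "s3://example-bucket/S2A_002.tif"),
    SceneRecord("S2B_003", "val", 1.5, "s3://example-bucket/S2B_003.tif"),
]

selected = select_scenes(index, split="train", max_cloud_cover=10)
read_plan = plan_reads(selected, bands=["B04", "B08"])
print([scene.scene_id for scene in selected])
print(read_plan)
```

Only the scenes that survive the local filter ever produce read requests, so selection costs no network round trips.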

### The "Friction" vs. "Flow" Comparison

**The Old Way (25+ lines of fragile plumbing)**:
1. Search STAC catalog ✅
2. Loop over items ✅
3. Handle pagination ✅
4. Filter by cloud cover ✅
5. **Wait 500ms per file** to parse remote TIFF headers (GDAL cold start) ❌
6. Manage `ThreadPoolExecutor` manually ❌
7. Manually stack results and align CRS ❌

**The Rasteret Way (3 lines of robust code)**:
```python
import rasteret

# 1. Load or Build your collection (Index is local, metadata is relational)
collection = rasteret.load("my_s2_experiment")

# 2. Query like a Table: "Give me the training scenes with <10% clouds"
filtered = collection.subset(split="train", cloud_cover_lt=10)

# 3. Batch Read: "Fetch aligned pixels for these 1000 polygons"
data = filtered.get_numpy(geometries=my_polygons, bands=["B04", "B08"])
```

---

## Key Features

- **🚀 Up to 20x Faster Cold Starts**: Rasteret caches tile-layout metadata locally and jumps straight to the pixels, skipping the expensive remote header parsing that otherwise repeats in every new environment.
- **📦 Seamless "Drop-in" Backends**: Boost **TorchGeo** or **xarray** performance by simply swapping the reader. No need to rewrite your training code.
- **🧬 Relational Imagery**: Store your labels, `train/val/test` splits, and custom metadata directly in the imagery index. No more separate CSVs.
- **🛠️ Zero-Config Throughput**: Automatic cloud-storage presigning via `obstore` and custom async I/O handle the networking so you don't have to.
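
As a rough illustration of why cached tile-layout metadata helps, the sketch below computes the byte range of a single tile from values a tiled TIFF stores in its `TileOffsets` and `TileByteCounts` tags. The `TileLayout` class, its field names, and the sample offsets are assumptions made for this example; Rasteret's own reader and index schema may look different.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TileLayout:
    """Cached TIFF tile layout for one band (illustrative field names)."""
    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: list      # TIFF TileOffsets tag: byte offset of each tile
    tile_byte_counts: list  # TIFF TileByteCounts tag: stored size of each tile


def tile_byte_range(layout, row, col):
    """Return the inclusive (start, end) byte range of the tile holding pixel (row, col)."""
    if not (0 <= row < layout.image_height and 0 <= col < layout.image_width):
        raise ValueError("pixel is outside the image")
    tiles_across = math.ceil(layout.image_width / layout.tile_width)
    # TIFF orders tiles left-to-right, then top-to-bottom.
    tile_index = (row // layout.tile_height) * tiles_across + (col // layout.tile_width)
    start = layout.tile_offsets[tile_index]
    return start, start + layout.tile_byte_counts[tile_index] - 1


# Made-up layout: a 1024x1024 image split into four 512x512 tiles.
layout = TileLayout(
    image_width=1024, image_height=1024, tile_width=512, tile_height=512,
    tile_offsets=[1000, 5000, 9000, 13000],
    tile_byte_counts=[4000, 4000, 4000, 3500],
)
start, end = tile_byte_range(layout, row=600, col=100)
range_header = f"bytes={start}-{end}"  # value for an HTTP Range request header
print(range_header)
```

With the layout already on disk, the reader can issue one ranged request per tile instead of first fetching and parsing the remote file header.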

## Performance

Rasteret's claims are backed by reproducible benchmarks. We measure three things: cold-start latency against TorchGeo, time-series extraction against Google Earth Engine and thread-pooled Rasterio, and throughput against "images-inside-Parquet" datasets.

### 1. Cold-start comparison with TorchGeo
Same AOIs, same scenes, same sampler, same DataLoader. Rasteret eliminates the "cold start tax" by caching IFD headers in the local Parquet index.

| Scenario | rasterio/GDAL (Standard) | Rasteret (Index-First) | Speedup |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | **8x** |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | **19x** |
| Cross-CRS boundary | 12.47 s | 0.59 s | **21x** |

![Processing time comparison](./assets/benchmark_results.png)
![Speedup breakdown](./assets/benchmark_breakdown.png)
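
The speedup column is simply the baseline time divided by the Rasteret time. A short check of that arithmetic, using the timings copied from the table above (the variable names are only for this example):

```python
# (scenario, rasterio/GDAL seconds, Rasteret seconds) from the cold-start table
timings = [
    ("Single AOI, 15 scenes", 9.08, 1.14),
    ("Multi-AOI, 30 scenes", 42.05, 2.25),
    ("Cross-CRS boundary", 12.47, 0.59),
]

# Speedup = baseline time / Rasteret time
speedups = {name: baseline / rasteret for name, baseline, rasteret in timings}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Rounded to whole numbers, these ratios give the 8x, 19x, and 21x figures reported in the table.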

### 2. The Cloud vs. Edge Comparison
How does Rasteret stack up against **Google Earth Engine (GEE)** or a highly parallelized Rasterio setup for time-series extraction?

| Library | First Run (Cold) | Subsequent Runs (Hot) |
|---------|-----------------|-----------------------|
| **Rasterio** + ThreadPool | 32 s | 24 s |
| **Google Earth Engine** | 10–30 s | 3–5 s |
| **Rasteret** | **3 s** | **3 s** |

![Single request performance](./assets/single_timeseries_request.png)

### 3. HuggingFace `MajorTOM` vs. Rasteret
Recent "images-inside-Parquet" approaches (like MajorTOM) store image bytes in Parquet files. Rasteret keeps imagery in cloud COGs and uses Parquet as a high-performance index, delivering better throughput without the data movement overhead.

| Patches | HF `datasets` (streaming) | Rasteret index+COGs | Speedup |
|---:|---:|---:|---:|
| 120 | 46.83 s | 12.09 s | **3.88x** |
| 1000 | 771.59 s | 118.69 s | **6.50x** |

![HF vs Rasteret speedup](./assets/benchmark_hf_speedup.png)

*All numbers were measured on a 4-CPU AWS machine in us-west-2 (the same region as the data), against cold-start GDAL.*

---

## Technical Deep Dives

For the full architectural rationale, methodology, and reproducibility scripts, see:

- [**Full Benchmarks Guide**](https://terrafloww.github.io/rasteret/explanation/benchmark/): Methodology and results.
- [**Design Decisions**](https://terrafloww.github.io/rasteret/explanation/design-decisions.md): Why we chose Parquet + COGs.
- [**Schema Contract**](https://terrafloww.github.io/rasteret/explanation/schema-contract/): The internal anatomy of a Collection.

```text
STAC API / GeoParquet  -->  Parquet Collection  -->  Tile-level byte reads
       (once)                  (queryable)             (no GDAL hot path)
```

## Quick Start

### 1. Build a Collection
```python
import rasteret

# Build from any STAC API or Parquet Metadata table
collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_training",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30")
)
```

### 2. Turbocharge your ML (TorchGeo)
Rasteret provides a high-performance backend that honors the `GeoDataset` contract.

```python
from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler

# Same API as TorchGeo, much faster pixel pipe
dataset = collection.to_torchgeo_dataset(bands=["B04", "B08"], chip_size=256)

sampler = RandomGeoSampler(dataset, size=256, length=100)
loader  = DataLoader(dataset, sampler=sampler, batch_size=4)
```

### 3. Fast Xarray creation
```python
ds = collection.get_xarray(geometries=my_aoi, bands=["B04", "B08"])
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
```
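
The NDVI expression above is the normalized difference `(NIR - Red) / (NIR + Red)`, applied element-wise. A dependency-free sketch of the same formula for single pixels is shown below; the `ndvi` helper and the toy reflectance values are illustrative only and are not part of Rasteret.

```python
def ndvi(red, nir):
    """Normalized difference of two reflectance values; None when undefined."""
    denominator = nir + red
    if denominator == 0:
        return None  # e.g. both bands are zero or nodata
    return (nir - red) / denominator


# Toy reflectance values for three pixels (illustrative, not real data).
red_band = [0.10, 0.30, 0.0]
nir_band = [0.50, 0.30, 0.0]

values = [ndvi(red, nir) for red, nir in zip(red_band, nir_band)]
print(values)
```

On arrays, a zero denominator typically yields NaN rather than `None`. If the bands arrive as unsigned integers, casting them to a float dtype before subtracting avoids wrap-around where `B04` exceeds `B08`.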

## Key Entry Points

Rasteret is built for flexibility. Choose the output format that fits your existing workflow:

| Method | Output | Purpose |
|---|---|---|
| [**`to_torchgeo_dataset()`**](https://terrafloww.github.io/rasteret/reference/integrations/torchgeo/) | `RasteretGeoDataset` | Drop-in high-performance backend for **TorchGeo** training. |
| [**`get_xarray()`**](https://terrafloww.github.io/rasteret/reference/core/collection/#rasteret.core.collection.Collection.get_xarray) | `xarray.Dataset` | Quickly create an xarray Dataset for analysis. |
| [**`get_numpy()`**](https://terrafloww.github.io/rasteret/reference/core/collection/#rasteret.core.collection.Collection.get_numpy) | `numpy.ndarray` | Raw pixel arrays (`[N, C, H, W]`) directly. |
| [**`get_gdf()`**](https://terrafloww.github.io/rasteret/reference/core/collection/#rasteret.core.collection.Collection.get_gdf) | `GeoDataFrame` | Metadata and pixel arrays as a standard geopandas dataframe. |
| [**`sample_points()`**](https://terrafloww.github.io/rasteret/reference/core/collection/#rasteret.core.collection.Collection.sample_points) | `DataFrame` | Exact pixel values at point geometries, with a configurable fallback for nodata pixels. |

---

Full documentation at **[terrafloww.github.io/rasteret](https://terrafloww.github.io/rasteret)**:

- [**Conceptual Roadmap**](https://terrafloww.github.io/rasteret/explanation/conceptual-roadmap/): Why Rasteret?
- [**Transitioning from Rasterio**](https://terrafloww.github.io/rasteret/how-to/transitioning-from-rasterio/): Side-by-side patterns.
- [**Turbocharging TorchGeo**](https://terrafloww.github.io/rasteret/how-to/turbocharging-torchgeo/): Scaling your DL loaders.
- [**Tutorials**](https://terrafloww.github.io/rasteret/tutorials/): Hands-on examples.

## License

Code: [Apache-2.0](LICENSE)
