Metadata-Version: 2.4
Name: geoxerl
Version: 0.2.0
Summary: Geospatial Species Distribution Modeling with Ensemble Learning and Reinforcement Learning-based Threshold Optimization
Author-email: Wenshun Zhang <zhangwenshun24@mails.ucas.ac.cn>
Project-URL: Homepage, https://github.com/wenshunzhang/GeoXERL
Project-URL: Repository, https://github.com/wenshunzhang/GeoXERL
Project-URL: Bug Tracker, https://github.com/wenshunzhang/GeoXERL/issues
Project-URL: Documentation, https://geoxerl.readthedocs.io
Keywords: geospatial,species distribution model,ensemble learning,reinforcement learning,remote sensing,SDM,GWRF,threshold optimization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.21
Requires-Dist: pandas>=1.3
Requires-Dist: scikit-learn>=1.0
Requires-Dist: scipy>=1.7
Requires-Dist: joblib>=1.0
Requires-Dist: tqdm>=4.60
Requires-Dist: matplotlib>=3.4
Requires-Dist: seaborn>=0.11
Requires-Dist: rasterio>=1.2
Requires-Dist: geopandas>=0.10
Requires-Dist: shapely>=1.8
Requires-Dist: xgboost>=1.5
Requires-Dist: lightgbm>=3.3
Requires-Dist: shap>=0.40
Provides-Extra: dl
Requires-Dist: tensorflow>=2.8; extra == "dl"
Provides-Extra: rl
Requires-Dist: stable-baselines3>=1.8; extra == "rl"
Requires-Dist: gym<0.27,>=0.21; extra == "rl"
Requires-Dist: torch>=1.11; extra == "rl"
Provides-Extra: gdal
Requires-Dist: GDAL>=3.0; extra == "gdal"
Provides-Extra: viz
Requires-Dist: pygam>=0.8; extra == "viz"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: isort>=5.10; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: pre-commit>=3.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2; extra == "docs"
Requires-Dist: myst-parser>=1.0; extra == "docs"
Provides-Extra: all
Requires-Dist: geoxerl[dev,dl,rl,viz]; extra == "all"

# GeoXERL

> **Geo**spatial species distribution modeling with e**X**treme **E**nsemble methods and **R**einforcement **L**earning-based threshold optimization.

[![Tests](https://github.com/wenshunzhang/GeoXERL/actions/workflows/tests.yml/badge.svg)](https://github.com/wenshunzhang/GeoXERL/actions/workflows/tests.yml)
[![PyPI version](https://img.shields.io/pypi/v/geoxerl.svg)](https://pypi.org/project/geoxerl/)
[![Python](https://img.shields.io/pypi/pyversions/geoxerl.svg)](https://pypi.org/project/geoxerl/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

---

## Overview

GeoXERL is a modular Python toolkit for **species distribution modeling (SDM)** and geospatial prediction tasks. It combines:

1. **Multi-step data preprocessing** — environment variable extraction, presence-point processing, background-point generation, dataset splitting, and feature-stack preparation.
2. **Base model training & evaluation** — unified interface for training and batch inference across multiple algorithms.
3. **Ensemble methods** — Bagging, Boosting, Stacking, Geographically Weighted Random Forest (GWRF), and SHAP-based RL feature selection.
4. **Reinforcement-learning threshold optimization** — Q-Learning and PPO agents that search for the optimal prediction threshold instead of using the default 0.5.

---

## Installation

```bash
pip install geoxerl
```

Or install from source for the latest development version:

```bash
git clone https://github.com/wenshunzhang/GeoXERL.git
cd GeoXERL
pip install -e ".[dev]"
```

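**Requirements:** Python >= 3.8. The core dependencies (numpy, pandas, scikit-learn, scipy, rasterio, geopandas, shapely, xgboost, lightgbm, shap, and a few plotting and utility packages) are installed automatically by `pip`.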

To install optional extras:

```bash
pip install "geoxerl[rl]"    # adds stable-baselines3, gym, and torch for PPO
pip install "geoxerl[docs]"  # adds Sphinx, sphinx-rtd-theme, and myst-parser for building documentation
```
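
To confirm that an extra's packages are importable after installation, a small check along these lines may help. This is a sketch, not part of GeoXERL: the `EXTRA_MODULES` mapping and the `missing_modules` helper are hypothetical, and the import names are assumed to match the distributions listed for each extra.

```python
from importlib.util import find_spec

# Hypothetical mapping from extra name to the import names it should provide.
EXTRA_MODULES = {
    "rl": ["stable_baselines3", "gym", "torch"],
    "dl": ["tensorflow"],
    "docs": ["sphinx"],
}


def missing_modules(extra):
    """Return the import names for `extra` that cannot be found."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


for extra in EXTRA_MODULES:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"[{extra}] {status}")
```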

---

## Quick start

### Command line

```bash
# Run each step individually
geoxerl preprocess
geoxerl train
geoxerl ensemble --method stacking
geoxerl optimize

# Or run the full pipeline in one command
geoxerl run-all

# Check version
geoxerl --version
```

### Python API

```python
from geoxerl.data_preprocessing.main import main as preprocess
from geoxerl.base_models.train import main as train_models
from geoxerl.ensemble.stacking import main as run_ensemble
from geoxerl.threshold_optimization.q_main import main as optimize_threshold

# Step 1: preprocess raw environmental rasters
preprocess()

# Step 2: train base models
train_models()

# Step 3: build the ensemble
run_ensemble()

# Step 4: find the optimal prediction threshold via Q-Learning
optimize_threshold()
```

See the [`examples/`](examples/) directory for ready-to-run scripts covering each stage.

---

## Project structure

```
GeoXERL/
├── geoxerl/                          # Main package
│   ├── __init__.py
│   ├── __version__.py
│   ├── __main__.py                   # Enables python -m geoxerl
│   ├── cli.py                        # Command-line interface
│   ├── data_preprocessing/           # Steps 00-05: env vars -> feature stack
│   │   ├── 00_env_variables_preprocessing.py
│   │   ├── 01_env_variables_preprocessing.py
│   │   ├── 02_presence_points_processing.py
│   │   ├── 03_background_points_generation.py
│   │   ├── 04_dataset_splitting.py
│   │   ├── 05_prepare_feature_stack.py
│   │   ├── config.py
│   │   ├── main.py
│   │   └── utils.py
│   ├── base_models/                  # Model training, evaluation, batch inference
│   │   ├── models.py
│   │   ├── train.py
│   │   ├── evaluate.py
│   │   ├── batch_models.py
│   │   └── config.json
│   ├── ensemble/                     # Bagging, Boosting, Stacking, GWRF, PPO
│   │   ├── bagging.py
│   │   ├── boosting.py
│   │   ├── stacking.py
│   │   ├── gwrf.py
│   │   ├── gwrf_shap_analysis.py
│   │   ├── gwrf_shap_tif.py
│   │   ├── feature_selector_rl2.py
│   │   ├── ppo_main.py
│   │   ├── predict_gwrf.py
│   │   └── metrics.py
│   └── threshold_optimization/       # Q-Learning / PPO threshold search
│       ├── q_learning_optimizer.py
│       ├── q_main.py
│       ├── threshold_analyzer.py
│       ├── data_processor.py
│       ├── visualizer.py
│       └── config.py
├── tests/                            # Unit tests
├── examples/                         # Ready-to-run example scripts
├── docs/                             # Documentation
├── .github/workflows/                # CI/CD (tests + PyPI publish)
├── pyproject.toml
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
└── LICENSE
```

---

## Module descriptions

### `data_preprocessing`

Processes raw environmental raster layers and species occurrence records into a clean, analysis-ready dataset. Scripts are numbered `00`-`05` to indicate execution order; `main.py` runs them all in sequence.

| Script | Purpose |
|--------|---------|
| `00` / `01` | Clip, reproject, and derive environmental variables from raw rasters |
| `02` | Filter and spatially thin species occurrence records |
| `03` | Generate background / pseudo-absence points |
| `04` | Split dataset into train / validation / test sets |
| `05` | Stack selected features into a single analysis-ready array |
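
The ideas behind steps `03` and `04` can be sketched with NumPy alone. The code below is an illustration under simplifying assumptions, not the toolkit's implementation: `sample_background` and `split_indices` are hypothetical helpers, background points are drawn uniformly inside a bounding box (a real workflow would typically also mask out invalid raster cells), and the 70/15/15 split is an arbitrary choice.

```python
import numpy as np


def sample_background(bounds, n_points, seed=0):
    """Draw uniform random (x, y) points inside (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bounds
    rng = np.random.default_rng(seed)
    x = rng.uniform(xmin, xmax, n_points)
    y = rng.uniform(ymin, ymax, n_points)
    return np.column_stack([x, y])


def split_indices(n_samples, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle sample indices and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    # Whatever remains after train and validation goes to the test set.
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


background = sample_background((100.0, 20.0, 110.0, 30.0), n_points=1000)
train_idx, val_idx, test_idx = split_indices(len(background))
print(background.shape, len(train_idx), len(val_idx), len(test_idx))
```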

### `base_models`

Provides a unified interface for fitting individual classifiers (`train.py`), computing the standard SDM metrics AUC, TSS, and Kappa (`evaluate.py`), and running inference over large raster stacks (`batch_models.py`).
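
For readers unfamiliar with these metrics, TSS and Cohen's Kappa can both be derived from a confusion matrix at a fixed threshold. The sketch below uses only NumPy and hypothetical helper names; it is not the code in `evaluate.py`. AUC is threshold-independent and could be obtained separately, for example with `sklearn.metrics.roc_auc_score`.

```python
import numpy as np


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, tn, fp, fn) after thresholding predicted probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return tp, tn, fp, fn


def tss(tp, tn, fp, fn):
    """True skill statistic: sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def kappa(tp, tn, fp, fn):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (observed - expected) / (1.0 - expected)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
counts = confusion_counts(y_true, y_prob)
print(counts, round(tss(*counts), 3), round(kappa(*counts), 3))
```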

### `ensemble`

Implements three classical ensemble strategies (Bagging, Boosting, Stacking), a geographically weighted random forest, and a PPO-based feature selector:

| Method | File | Notes |
|--------|------|-------|
| Bagging | `bagging.py` | Bootstrap aggregation |
| Boosting | `boosting.py` | Gradient boosting |
| Stacking | `stacking.py` | Meta-learner on base model outputs |
| GWRF | `gwrf.py` | Geographically Weighted Random Forest with SHAP explainability |
| PPO feature selector | `feature_selector_rl2.py` / `ppo_main.py` | RL agent that learns which features to include |
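
As an illustration of the stacking row above, the following sketch trains a meta-learner on out-of-fold predictions of two base models using scikit-learn's `StackingClassifier`. It runs on synthetic data and is only an assumption about how such an ensemble might be set up; the base models, meta-learner, and settings used in `stacking.py` may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for presence / background samples with 8 predictors.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base model outputs
    cv=5,                                  # out-of-fold predictions feed the meta-learner
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

# Predicted probability of the positive (presence) class.
proba = stack.predict_proba(X_test)[:, 1]
print(proba.shape, round(stack.score(X_test, y_test), 3))
```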

### `threshold_optimization`

Casts threshold selection as a reinforcement learning problem. The Q-Learning optimizer discretizes the threshold space into states and learns a policy through reward signals based on TSS / F1. `threshold_analyzer.py` and `visualizer.py` provide post-hoc analysis and plotting tools.
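
One way to picture this formulation is the minimal tabular Q-Learning sketch below. It is not the code in `q_learning_optimizer.py`: the 19-point threshold grid, the three actions (lower, keep, raise), the TSS-only reward, and every hyperparameter are assumptions chosen for illustration.

```python
import numpy as np


def tss(y_true, y_prob, threshold):
    """True skill statistic at a given threshold."""
    y_pred = y_prob >= threshold
    pos = y_true == 1
    return np.mean(y_pred[pos]) + np.mean(~y_pred[~pos]) - 1.0


def q_learning_threshold(y_true, y_prob, n_states=19, episodes=200, steps=20,
                         alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.05, 0.95, n_states)  # one state per threshold
    rewards = np.array([tss(y_true, y_prob, t) for t in thresholds])
    moves = np.array([-1, 0, 1])                    # lower, keep, raise
    q = np.zeros((n_states, len(moves)))

    for _ in range(episodes):
        state = int(rng.integers(n_states))
        for _ in range(steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = int(rng.integers(len(moves)))
            else:
                action = int(np.argmax(q[state]))
            next_state = int(np.clip(state + moves[action], 0, n_states - 1))
            # Standard Q-Learning update; the reward is the TSS of the new threshold.
            target = rewards[next_state] + gamma * q[next_state].max()
            q[state, action] += alpha * (target - q[state, action])
            state = next_state

    # Roll out the greedy policy from the middle of the grid (about 0.5)
    # and keep the best threshold visited along the way.
    state = n_states // 2
    visited = [state]
    for _ in range(n_states):
        state = int(np.clip(state + moves[int(np.argmax(q[state]))], 0, n_states - 1))
        visited.append(state)
    return thresholds[max(visited, key=lambda s: rewards[s])]


# Synthetic, imbalanced scores: 200 presences, 800 background points.
rng = np.random.default_rng(1)
y_true = np.r_[np.ones(200, dtype=int), np.zeros(800, dtype=int)]
y_prob = np.clip(np.r_[rng.normal(0.45, 0.15, 200), rng.normal(0.20, 0.10, 800)], 0.0, 1.0)
best = q_learning_threshold(y_true, y_prob)
print(f"threshold={best:.2f}, TSS={tss(y_true, y_prob, best):.3f}")
```

On a one-dimensional grid this small, an exhaustive scan over thresholds would find the same optimum; the sketch only shows how the state, action, and reward pieces fit together.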

---

## Configuration

Each module has its own config file. Edit these before running to set your data paths and hyperparameters:

| Module | Config file |
|--------|-------------|
| `data_preprocessing` | `geoxerl/data_preprocessing/config.py` |
| `base_models` | `geoxerl/base_models/config.json` |
| `ensemble` | `geoxerl/ensemble/config.py` |
| `threshold_optimization` | `geoxerl/threshold_optimization/config.py` |
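
For the JSON config, edits can also be scripted rather than made by hand. The sketch below assumes the file holds a single top-level JSON object; `update_json_config` is a hypothetical helper, and the keys `data_dir` and `n_estimators` are placeholders, not settings known to exist in `config.json`. It operates on a temporary file so the real config is untouched.

```python
import json
import tempfile
from pathlib import Path


def update_json_config(path, overrides):
    """Load a JSON object from `path`, apply `overrides`, and write it back."""
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demonstration on a throwaway file with placeholder keys.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"data_dir": "data/", "n_estimators": 100}))
    updated = update_json_config(demo_path, {"data_dir": "/path/to/my/data"})
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))

print(updated)
```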

---

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions, code style guidelines, and the pull request checklist.

```bash
# Set up development environment
git clone https://github.com/wenshunzhang/GeoXERL.git
cd GeoXERL
pip install -e ".[dev]"
pre-commit install

# Run tests
pytest tests/
```

---

## Citation

If you use GeoXERL in your research, please cite:

```bibtex
@software{geoxerl2024,
  author  = {Zhang, Wenshun},
  title   = {GeoXERL: Geospatial Ensemble and Reinforcement Learning Toolkit for Species Distribution Modeling},
  year    = {2024},
  url     = {https://github.com/wenshunzhang/GeoXERL},
  version = {0.2.0}
}
```

---

## License

MIT — see [LICENSE](LICENSE) for details.

---

## Contact

Wenshun Zhang — [zhangwenshun24@mails.ucas.ac.cn](mailto:zhangwenshun24@mails.ucas.ac.cn)

University of Chinese Academy of Sciences
