Metadata-Version: 2.4
Name: soil-etl
Version: 0.1.0a2
Summary: YAML-driven ETL mappers for ForSITE soil database imports.
License: MIT
License-File: LICENSE
Keywords: soil,etl,yaml,postgis,forestry,forsite
Author: Olivier Forgues-Nhan
Author-email: olivier.forgues-nhan@agr.gc.ca
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: certifi (>=2025.8.3,<2026.0.0)
Requires-Dist: geoalchemy2 (>=0.17.1)
Requires-Dist: geopandas (>=1.0.1)
Requires-Dist: hydra-colorlog (>=1.2.0,<2.0.0)
Requires-Dist: hydra-core (>=1.3.2,<2.0.0)
Requires-Dist: nrcan-etl-toolbox (>=0.2.1)
Requires-Dist: openpyxl (>=3.1.0)
Requires-Dist: pandas (>=2.2.3)
Requires-Dist: psycopg2-binary (>=2.9.10)
Requires-Dist: pydantic (>=2.0)
Requires-Dist: pyproj (>=3.0)
Requires-Dist: pyyaml (>=6.0)
Requires-Dist: shapely (>=2.1.0)
Requires-Dist: sqlalchemy (>=2.0.40)
Requires-Dist: sqlmodel (>=0.0.25)
Requires-Dist: xlrd (>=2.0.2)
Project-URL: Homepage, https://gitlab.com/nrcan-rncan-cfs-scf/lfc-cfl/soil-etl
Project-URL: Issues, https://gitlab.com/nrcan-rncan-cfs-scf/lfc-cfl/soil-etl/issues
Project-URL: Repository, https://gitlab.com/nrcan-rncan-cfs-scf/lfc-cfl/soil-etl
Description-Content-Type: text/markdown

# SOIL - Structured Observation Ingestion Library

YAML-driven ETL mapper package for importing ForSITE soil datasets into the ForSITE
PostGIS database model.

This package is being split out of
[`forsite-soil-db-interface`](https://gitlab.com/nrcan-rncan-cfs-scf/lfc-cfl/forsite-soil-db-interface)
so the mapper layer can be versioned, tested, and published independently on PyPI.

## Acknowledgements

| Project Partners                                                                                               |
|:---------------------------------------------------------------------------------------------------------------|
| <img src="docs/gc-logo.svg" width="310" style="background-color: white ; padding: 10px; padding-right: 55px" > |
| <img src="docs/nrcan-logo.svg" width="310" style="background-color: white ; padding: 10px">                    |
| <img src="docs/aafc_logo.svg" width="310" style="background-color: white ; padding: 10px">                     |

This project was developed through joint funding provided by Agriculture and Agri-Food Canada (`AAFC`) under
the [CSBO](https://sis.agr.gc.ca/cansis/biome/index.html)
program and Natural Resources Canada (`NRCan`) for the ForSITE-Soil degradation project.

The code developed as part of this project includes contributions from:

- [Olivier Forgues-Nhan](mailto:olivier.forgues-nhan@agr.gc.ca)
- [Xavier Malet](mailto:xavier.malet@nrcan-rncan.gc.ca)
- [Catlan Dallaire](mailto:catlan.dallaire@nrcan-rncan.gc.ca)
- [David Gagné](mailto:david.gagne@agr.gc.ca)

## Target Public API

### Basic Usage of the YAML Mapper

```python
from pathlib import Path

from soil_etl.yaml_mapper import YamlDatasetMapper

mapper = YamlDatasetMapper(
    config_path=Path("examples/configs/example_config.yaml"),
    data_file_path=Path("data.xlsx"),
)
mapper.import_data()
```

For more details on the YAML format, refer to the [YAML Mapper documentation](docs/README_yaml_format.md).

### Database Connection Configuration

#### Database Connection String

Database connection details can be provided when running the import. By default,
the mapper uses the existing environment/config lookup, but arguments passed to
`import_data` override it explicitly:

```python
mapper.import_data(
    database_connection_string="postgresql+psycopg2://user:password@host:5432/database",
    database_engine_kwargs={"pool_size": 10, "pool_pre_ping": True},
)
```

#### Hydra Database Configuration Files

When using Hydra database configuration files instead of a direct connection
string, pass the config directory and config name:

```python
mapper.import_data(
    database_config_path="configs",
    database_config_name="configs.yaml",
)
```

The config directory can also be provided through the `CONFIG_PATH` environment
variable:

```bash
export CONFIG_PATH=configs
export DB_USER=user
export DB_PASSWORD=password
```

The import call then only needs the config name:

```python
mapper.import_data(database_config_name="configs.yaml")
```

or entirely in Python:

```python
import os

os.environ['CONFIG_PATH'] = "./path_to_config_dir"
os.environ['DB_USER'] = "user"
os.environ['DB_PASSWORD'] = "password"
mapper.import_data(database_config_name="configs.yaml")
```
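
Assigning to `os.environ` changes the environment for the rest of the process. Where that is undesirable (for example in tests), the variables can be set temporarily and restored afterwards. The following is a minimal sketch using only the standard library. `temporary_env` is a hypothetical helper, not part of `soil_etl`, and it assumes the mapper reads these variables at the time `import_data` is called.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(**overrides: str) -> Iterator[None]:
    """Hypothetical helper (not part of soil_etl): set variables, then restore."""
    # Remember the previous values; None means the variable was unset.
    previous = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        # Put every variable back the way it was before entering the block.
        for name, value in previous.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


# Intended usage (assumes `mapper` was created as in the basic usage example):
# with temporary_env(CONFIG_PATH="./path_to_config_dir",
#                    DB_USER="user", DB_PASSWORD="password"):
#     mapper.import_data(database_config_name="configs.yaml")
```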

### YAML Mapper Configuration Examples

The YAML mapper configuration is a YAML file that defines the mapping between
the data file columns and the database table columns. Detailed documentation
of the YAML format is available in
[README_yaml_format_en.md](docs/README_yaml_format_en.md).

YAML mapper configuration examples are available in
[`examples/configs/example_config.yaml`](examples/configs/example_config.yaml). Packaged
reference files are also kept in
[`src/soil_etl/yaml_mapper/template.yaml`](src/soil_etl/yaml_mapper/template.yaml) and
[`src/soil_etl/yaml_mapper/examples/on_master.yaml`](src/soil_etl/yaml_mapper/examples/on_master.yaml).

## Package Status

The package is in alpha and the extractor implementation has been copied into
`src/soil_etl`. The current tree includes the YAML mapper, binding helpers,
dataclasses, enums, staging models, database interface, and ORM models needed by
the migrated tests.

The remaining packaging work is to finish aligning the public import surface,
tests, and documentation with the final package namespace. Some older project
notes may still refer to the previous `forsite_soil_extractors` migration target.

## Development

```bash
poetry install --with dev,test
poetry run pytest
```

The migrated data extractor tests are documented in `docs/testing.md`.

## Canonical Imports

```python
from soil_etl.yaml_mapper import YamlDatasetMapper
from soil_etl.bindings import Binding, FromColumn
from soil_etl.db import ImportationInterface
```

The canonical package namespace is `soil_etl`. Older migration notes may still refer to
`forsite_soil_extractors`, but new code should use only `soil_etl` imports.

## Build

```bash
python -m build
twine check dist/*
```

