Metadata-Version: 2.4
Name: spectral-datamaker
Version: 0.6.0
Summary: CLI tool for creating hyperspectral image datasets for machine learning.
Author: Daniel Pérez Rodríguez
License-Expression: MIT
Keywords: hyperspectral,imaging,segmentation,classification,napari
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: click<9,>=8.2
Requires-Dist: hypertool<0.4,>=0.3.1
Requires-Dist: PyYAML<7,>=6
Dynamic: license-file

# SpectralDatamaker
A Python CLI tool for building machine learning datasets from hyperspectral images.

The dataset structure is organized as follows:

```
dataset_root/
├── images
│   ├── DATASET-01_image-name_0
│   ├── DATASET-01_image-name_1
│   ├── DATASET-01_image-name_2
│   └── DATASET-01_image-name_3
├── masks
│   ├── RoiMASK_image-name.csv
│   ├── PxMASK_image-name.npy
│   ├── DATASET-01_image-name_0
│   ├── DATASET-01_image-name_1
│   ├── DATASET-01_image-name_2
│   └── DATASET-01_image-name_3
├── source
│   ├── image-name.hdr
│   └── image-name.raw
└── metadata.json
```

The tool processes the source images, generates region of interest (ROI) masks, pixel masks, and labels, and crops the images based on the generated masks.

## CLI Usage

After installing the package, you can use the console command:

```bash
spectral-datamaker --help
```

You can also invoke the package module directly:

```bash
python -m spectral_datamaker --help
```

The CLI provides the following commands:

**Create a complete dataset:**
```bash
spectral-datamaker create <config.yaml> <output_directory>
```
Options:
- `--dry-run`: Validate configuration without executing
- `--skip-validation`: Skip final dataset validation
- `--no-interactive`: Skip interactive mask adjustment (not yet implemented)

The pipeline executes these steps sequentially: `structure` → `roi-mask` → `pixel-mask` → `crop` → `metadata` → `splits`. The `splits` step is skipped automatically unless `dev_split` is set in the segmentation configuration.

**Validate an existing dataset:**
```bash
spectral-datamaker validate <dataset_directory>
```
Options:
- `--config <file>`: Validate against a specific configuration file

**Inspect dataset metadata:**
```bash
spectral-datamaker inspect <dataset_directory>
```
Options:
- `--format [json|yaml|table]`: Output format (default: table)
- `--show-images`: List all processed images

**Execute individual pipeline steps:**
```bash
spectral-datamaker step <step_name> <config.yaml> <dataset_directory>
```
Available steps: `structure`, `roi-mask`, `pixel-mask`, `crop`, `metadata`, `splits`

Options:
- `--force`: Overwrite existing files (passed to the pipeline context for steps to use)

**Compose a new dataset from existing ones:**
```bash
spectral-datamaker compose <compose.yaml> <output_directory>
```
Options:
- `--dry-run`: Validate configuration without copying files

**Generate splits for an existing dataset:**
```bash
spectral-datamaker splits <dataset_directory> --dev-split 0.2
```
Generates a `splits.csv` file with balanced dev/test assignments without requiring a config file. Useful for re-splitting already-built datasets.
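
To illustrate what a balanced dev/test assignment means, the sketch below splits crop identifiers class by class so that every class contributes roughly the same dev ratio. It is not the algorithm behind `generate_balanced_splits`: the helper names `split_per_class` and `rows_to_csv`, the rounding rule, and the CSV column names are all assumptions made for this example.

```python
import csv
import io
import random


def split_per_class(assignments, dev_ratio, seed=0):
    """Assign each crop to 'dev' or 'test', one class at a time.

    `assignments` maps a class name to a list of crop identifiers.
    Hypothetical helper: not part of the spectral_datamaker API.
    """
    if not 0.0 < dev_ratio < 1.0:
        raise ValueError("dev_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rows = []
    for class_name in sorted(assignments):
        crops = list(assignments[class_name])
        rng.shuffle(crops)
        n_dev = round(len(crops) * dev_ratio)  # assumed rounding rule
        for position, crop in enumerate(crops):
            split = "dev" if position < n_dev else "test"
            rows.append((crop, class_name, split))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text (column names are assumptions)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["crop", "class", "split"])
    writer.writerows(rows)
    return buffer.getvalue()


example = {"type_A": [0, 2, 3, 6, 7], "type_B": [1, 4, 5, 8, 9]}
rows = split_per_class(example, dev_ratio=0.2)
print(rows_to_csv(rows))
```

Splitting inside each class, rather than over the pooled crops, is what keeps a rare class from ending up entirely in one partition.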

## Library usage (Python API)

Besides the CLI, SpectralDatamaker can be used as a Python library. The most useful exports are:

- `load_dataset_config(path)`: Load a dataset configuration YAML file. Returns `DatasetConfig`.
- `load_compose_config(path)`: Load a compose configuration YAML file. Returns `ComposeConfig`.
- `DatasetStructure`: Infers canonical dataset locations (`images/`, `masks/`, `source/`, `metadata.json`) from a root directory.
- `Filenames`: Derives expected filenames and absolute paths for masks, labels, cropped outputs, and metadata.
- `DatasetValidator`: Validates an existing dataset either from a config file or from `metadata.json`.
- `DatasetManager`: Provides methods for retrieving dataset information, listing processed images, and accessing metadata details.
- `ComposeConfig` / `SourceSelection`: Dataclasses for compose configuration.
- `ComposeProcessor`: Builds a composed dataset programmatically from a `ComposeConfig`.
- `SplitsStep`: Pipeline step that generates balanced dev/test splits (`splits.csv`).
- `generate_balanced_splits` / `write_splits_csv`: Low-level split utilities.
- `resolve_from_config_dir`: Resolves relative paths against a config file directory.

```python
from spectral_datamaker.config import load_dataset_config, DatasetStructure, Filenames
from spectral_datamaker.dataset import DatasetValidator, DatasetManager

# 1) Load configuration
config = load_dataset_config("/path/to/dataset.yaml")
print(config.name, config.segmentation_config.classes)

# 2) Infer dataset structure from root directory
structure = DatasetStructure("/path/to/dataset_root")
print(structure.images_dir)
print(structure.metadata_file)

# 3) Derive expected file paths and names
names = Filenames(structure)
print(names.get_roi_mask("image_1.hdr", abs=True))
print(names.get_px_mask("image_1.hdr", abs=True))
print(names.get_dataset_metadata(abs=True))

# 4) Validate dataset contents
validator = DatasetValidator(structure)
validator.validate_dataset_from_config("/path/to/dataset.yaml")
# Or, if metadata already exists:
# validator.validate_dataset_from_metadata()

# 5) Work with dataset metadata
manager = DatasetManager(structure)
assignments = manager.get_dataset_assignments(abs=True)
for class_name, image_paths in assignments.items():
    print(f"{class_name}: {len(image_paths)} images")
```

## Dataset config file
The dataset configuration file (e.g., `dataset.yaml`) contains the information needed to create a dataset from ENVI images.

Path resolution rules:
- Absolute paths are used as-is.
- Relative paths in `source-images[].path` are resolved relative to the directory that contains the YAML file.

The YAML file should have the following structure:

```yaml
dataset:
  name: dataset-example
  description: An example dataset created with SpectralDatamaker.

  source-images:
    - path: ../images/source/image_1.hdr
      masking:
        shape: circle
        size: 35
        num: 6

    - path: ../images/source/image_2.hdr
      masking:
        shape: square
        size: 20
        num: 4

    - path: ../images/source/image_n.hdr
      masking:
        shape: triangle
        size: 50
        num: 2

  segmentation:
    enabled: true
    classes:
      - type_A
      - type_B
    dev_split: 0.2   # optional — ratio for dev/test splits (e.g. 0.2 = 20% dev, 80% test)

  classification:
    enabled: false
```
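
The path resolution rules for `source-images[].path` can be sketched with `pathlib` alone. The helper name `resolve_against_config` and the example paths are hypothetical; the package exports its own `resolve_from_config_dir`, whose signature may differ from this sketch.

```python
from pathlib import Path


def resolve_against_config(config_file, entry):
    """Resolve a path entry relative to the config file's directory.

    Hypothetical helper for illustration only.
    """
    path = Path(entry)
    if path.is_absolute():
        return path  # absolute paths are used as-is
    # Relative paths are anchored at the directory holding the YAML file,
    # not at the current working directory.
    return (Path(config_file).parent / path).resolve()


# Example paths are made up; they do not need to exist.
print(resolve_against_config("/data/configs/dataset.yaml", "../images/source/image_1.hdr"))
print(resolve_against_config("/data/configs/dataset.yaml", "/mnt/raw/image_2.hdr"))
```

Anchoring at the config file's directory means the same YAML file works no matter which directory the command is launched from.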

## Segmentation mode
When segmentation mode is enabled, SpectralDatamaker will generate a dataset with segmentation masks for each source image. The pipeline runs through the following steps:

1. **structure**: Creates the dataset directory layout (`images/`, `masks/`, `source/`).
2. **roi-mask**: Creates ROI masks based on the specified shape, size, and number of regions. A napari viewer is launched to allow interactive adjustment.
3. **pixel-mask**: Generates pixel masks from the ROI masks, labeling each region with the corresponding class from the configuration.
4. **crop**: Crops the source images based on the generated masks and saves the cropped images and masks.
5. **metadata**: Generates `metadata.json` with dataset information and processing details.
6. **splits** *(optional)*: If `dev_split` is set in the segmentation config, generates a `splits.csv` file with balanced dev/test assignments per class.
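
The `pixel-mask` and `crop` steps can be pictured with a small pure-Python sketch: circular ROIs are painted into a label grid, then one ROI's bounding box is cut out. This is only an illustration under assumptions. The helper names are hypothetical, `size` is treated here as a radius in pixels, and the real steps operate on ENVI images and on the ROI and mask files rather than on nested lists.

```python
def rasterize_circles(height, width, rois, classes):
    """Build a label mask from circular ROIs.

    `rois` is a list of (center_row, center_col, radius, class_name).
    Pixels outside every ROI keep label 0 (background); pixels inside an
    ROI get the 1-based position of the ROI's class in `classes`.
    Hypothetical helper, not part of the spectral_datamaker API.
    """
    mask = [[0] * width for _ in range(height)]
    for center_row, center_col, radius, class_name in rois:
        label = classes.index(class_name) + 1  # 0 stays background
        row_range = range(max(0, center_row - radius),
                          min(height, center_row + radius + 1))
        col_range = range(max(0, center_col - radius),
                          min(width, center_col + radius + 1))
        for r in row_range:
            for c in col_range:
                inside = (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
                if inside:
                    mask[r][c] = label
    return mask


def crop_bounding_box(grid, center_row, center_col, radius):
    """Cut the square bounding box of one ROI out of a 2D grid."""
    top = max(0, center_row - radius)
    left = max(0, center_col - radius)
    rows = grid[top:center_row + radius + 1]
    return [row[left:center_col + radius + 1] for row in rows]


classes = ["type_A", "type_B"]
rois = [(2, 2, 1, "type_A"), (5, 5, 2, "type_B")]
mask = rasterize_circles(8, 8, rois, classes)
for row in mask:
    print(row)
print(crop_bounding_box(mask, 2, 2, 1))
```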

## Classification mode
> [!NOTE]
> The classification mode is currently in development and is not yet available for use. The following description is based on the intended functionality.

When classification mode is enabled, SpectralDatamaker will generate a dataset with class labels for each source image. The steps are as follows:
1. Creates ROI masks based on the specified shape, size, and number of regions in the configuration file. A napari viewer is launched to allow the user to adjust the generated masks if necessary. Masks are saved when the user closes the viewer.
2. Asks the user to label each ROI with the corresponding class from the configuration file. Saves the class labels in a CSV file.
3. Crops the source images based on the generated masks and saves the cropped images in the appropriate directories.

## Compose mode

The `compose` command builds a new dataset by selecting ROI crops from one or more already-processed datasets, without re-annotating anything. It reads the metadata of the source datasets to locate the crops, copies them to the new dataset, remaps the class labels according to the new class list, and generates the `metadata.json` of the composed dataset.

### Compose config file

Path resolution rules:
- Absolute paths are used as-is.
- Relative paths in `sources[].dataset` are resolved relative to the directory that contains the compose YAML file.

```yaml
compose:
  name: composed-dataset
  description: Dataset composed from multiple source datasets.
  classes:
    - type_A
    - type_B
    - type_C

  sources:
    - dataset: ../datasets/source_dataset_1
      class: type_A

    - dataset: ../datasets/source_dataset_1
      class: type_B
      num: 4        # optional — limit to 4 crops; omit to use all available

    - dataset: ../datasets/source_dataset_2
      class: type_C
```

- `classes`: defines the label mapping of the output dataset (`classes[0]` → label `1`, `classes[1]` → label `2`, etc.).
- `sources`: each entry selects the ROI crops of a given `class` (variety) from a source `dataset`: all of them by default, or at most `num` when that key is set. The source dataset must have been created with `spectral-datamaker create` and must contain `metadata.json`.
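
The label mapping defined by `classes` can be written out in a few lines. The sketch below builds the mapping and translates labels from a source dataset's class list to the composed one. Both helper names are hypothetical. Reserving label `0` for the background follows the `label_map` in the metadata examples; how the tool performs the remapping internally is not shown here.

```python
def build_label_map(classes):
    """Map integer labels to class names; 0 is reserved for background."""
    label_map = {0: "background"}
    for index, name in enumerate(classes, start=1):
        label_map[index] = name
    return label_map


def remap_labels(values, source_classes, target_classes):
    """Translate labels from a source class list to a target class list.

    Hypothetical helper: each label is looked up by name in the source
    mapping, then replaced by that name's label in the target mapping.
    """
    source_map = build_label_map(source_classes)
    target_index = {
        name: label for label, name in build_label_map(target_classes).items()
    }
    return [target_index[source_map[value]] for value in values]


composed_classes = ["type_A", "type_B", "type_C"]
print(build_label_map(composed_classes))

# A source dataset that only knew "type_C" stored it as label 1;
# in the composed dataset the same class becomes label 3.
print(remap_labels([0, 1, 1, 0], ["type_C"], composed_classes))
```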

### Composed dataset structure

The output directory follows the same structure as a regular dataset:

```
output_dir/
├── images/
│   ├── COMPOSED_imageA_type_A_0.npy
│   ├── COMPOSED_imageA_type_A_1.npy
│   └── COMPOSED_imageB_type_C_0.npy
├── masks/
│   ├── COMPOSED_imageA_type_A_0.npy
│   ├── COMPOSED_imageA_type_A_1.npy
│   └── COMPOSED_imageB_type_C_0.npy
├── source/
└── metadata.json
```

Crops from different source images and varieties are grouped into virtual source image keys of the form `<source_image>_<variety>`. Within each group, crops are indexed sequentially from `0`.
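
The grouping and sequential indexing can be sketched as follows. The helper `name_composed_crops` is hypothetical, and the `COMPOSED` prefix and `.npy` suffix are taken from the example tree above rather than from the tool's actual naming rule.

```python
from collections import defaultdict


def name_composed_crops(crops, prefix="COMPOSED"):
    """Return output filenames for (source_image, variety) crop pairs.

    Crops sharing a `<source_image>_<variety>` key are numbered from 0 in
    the order they arrive. The prefix and suffix are assumptions.
    """
    counters = defaultdict(int)  # next free index per group key
    names = []
    for source_image, variety in crops:
        key = f"{source_image}_{variety}"
        names.append(f"{prefix}_{key}_{counters[key]}.npy")
        counters[key] += 1
    return names


selected = [("imageA", "type_A"), ("imageA", "type_A"), ("imageB", "type_C")]
for name in name_composed_crops(selected):
    print(name)
```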

## Dataset metadata
SpectralDatamaker generates a `metadata.json` file containing information about the dataset, including the dataset name, description, source images, and the processing steps applied to each image. SpectralDatamaker recognizes this file and can use it to validate the dataset structure and contents. An example of the `metadata.json` structure is as follows:

```json
{
    "name": "dataset-03",
    "description": "Dataset created with one hyperspectral image.",
    "last_update": "2026-04-08 13:52:01",
    "source_images": ["/path/to/image_1.hdr"],
    "types": ["segmentation"],
    "segmentation_masking": {
        "image_1": {
            "label_map": {"0": "background", "1": "type_A", "2": "type_B"},
            "num_classes": 3,
            "classes": ["type_A", "type_B"],
            "assignments": {
                "type_A": [0,2,3],
                "type_B": [1,5,4]
            },
            "source_image": "image_1.hdr",
            "source_dataset": "",
            "rois_file": "RoiMASK_image_1.csv",
            "mask_file": "PxMASK_image_1.npy",
            "created": "2026-04-08T13:51:33.931524",
            "format": "npy"
        }
    }
}
```

For composed datasets, `source_images` contains virtual group keys (one per source image × variety combination) and each `segmentation_masking` entry includes a `source_dataset` field pointing to the origin dataset:

```json
{
    "name": "composed-dataset",
    "description": "Dataset composed from multiple source datasets.",
    "last_update": "2026-05-18 10:00:00",
    "source_images": ["image_1_type_A", "image_2_type_C"],
    "types": ["segmentation"],
    "segmentation_masking": {
        "image_1_type_A": {
            "label_map": {"0": "background", "1": "type_A", "2": "type_B", "3": "type_C"},
            "num_classes": 4,
            "classes": ["type_A", "type_B", "type_C"],
            "assignments": {
                "type_A": [0, 1, 2],
                "type_B": [],
                "type_C": []
            },
            "source_image": "image_1_type_A",
            "source_dataset": "/path/to/source_dataset_1",
            "rois_file": "",
            "mask_file": "",
            "created": "2026-05-18T10:00:00.000000",
            "format": "npy"
        }
    }
}
```
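
Because `metadata.json` is plain JSON, it can also be read without the package. The sketch below parses a trimmed copy of the first example and counts the entries assigned to each class. The helper `count_crops_per_class` is hypothetical, and it assumes that every list under `assignments` holds one index per ROI crop.

```python
import json

# Trimmed copy of the first metadata example above.
METADATA_TEXT = """
{
    "name": "dataset-03",
    "segmentation_masking": {
        "image_1": {
            "label_map": {"0": "background", "1": "type_A", "2": "type_B"},
            "assignments": {"type_A": [0, 2, 3], "type_B": [1, 5, 4]}
        }
    }
}
"""


def count_crops_per_class(metadata):
    """Sum the assigned indices per class across all masking entries."""
    totals = {}
    for entry in metadata.get("segmentation_masking", {}).values():
        for class_name, indices in entry.get("assignments", {}).items():
            totals[class_name] = totals.get(class_name, 0) + len(indices)
    return totals


metadata = json.loads(METADATA_TEXT)
print(metadata["name"], count_crops_per_class(metadata))
```

For a real dataset, replace the embedded string with the contents of the dataset's own `metadata.json`.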

## Validations
SpectralDatamaker can validate both a freshly generated dataset and an existing one. The checks cover the presence of required directories and expected files.
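
A minimal stand-in for the structural part of such a check is sketched below. It only tests for the top-level entries shown in the dataset layout and is far less thorough than `DatasetValidator`; the helper `find_missing_entries` is hypothetical.

```python
import tempfile
from pathlib import Path

# Top-level entries taken from the dataset layout shown in this README.
REQUIRED_DIRS = ("images", "masks", "source")
REQUIRED_FILES = ("metadata.json",)


def find_missing_entries(root):
    """Return the required directories and files absent under `root`."""
    root = Path(root)
    missing = [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing


# Demonstration on a throwaway directory that lacks some entries.
with tempfile.TemporaryDirectory() as tmp:
    dataset_root = Path(tmp)
    (dataset_root / "images").mkdir()
    (dataset_root / "masks").mkdir()
    print(find_missing_entries(dataset_root))
```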

## Extending
See [EXTENDING.md](./docs/EXTENDING.md) for a guide on adding new pipeline steps, CLI commands, and configuration options.
