Metadata-Version: 2.4
Name: etiket_sync_agent_folderbase
Version: 0.3.0b1
Summary: Folder-based backend for eTiKeT sync agent
Author: QHarbor team
License-Expression: LicenseRef-Proprietary
Project-URL: Homepage, https://qharbor.nl
Project-URL: Documentation, https://docs.qharbor.nl
Keywords: etiket,sync,backend,folderbase,filesystem
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: etiket_sync_agent>=0.3.0b1
Requires-Dist: xarray
Requires-Dist: pyyaml
Dynamic: license-file

# eTiKeT Sync Agent - FolderBase Backend

Backend for synchronizing folder-based datasets with the eTiKeT platform. This backend scans directories for datasets marked with a `_QH_dataset_info.yaml` file and syncs their contents to the cloud.

## How It Works

The FolderBase backend continuously watches a specified folder, automatically detects new and existing datasets, and uploads them to QHarbor. Synchronization is one-way: data goes from the local folder to the server, never from the server back to the folder.

A folder is recognized as a **dataset** when it contains a `_QH_dataset_info.yaml` file. This file holds the minimum information needed to create a dataset. Every other file in the folder and its subdirectories is treated as a data file and added to the dataset, unless it matches a `skip` pattern (see the field reference below).

### Example Folder Structure

```
main_folder/
├── 20240101/
│   └── 20240101-211245-165-731d85-experiment_1/
│       ├── _QH_dataset_info.yaml
│       ├── 01-01-2024_01-01-01.json
│       └── 01-01-2024_01-01-01.hdf5
├── 20240102/
│   └── 20240102-220655-268-455d85-experiment_2/
│       ├── _QH_dataset_info.yaml
│       ├── 02-01-2024_02-02-02.json
│       ├── 02-01-2024_02-02-02.hdf5
│       └── analysis/
│           ├── 02-01-2024_02-02-02_analysis.json
│           └── 02-01-2024_02-02-02_analysis.hdf5
└── some_other_folder/
    ├── _QH_dataset_info.yaml
    └── 01-01-2024_01-01-01.json
```

If a file is added to any of these folders or a new dataset folder is created, the sync agent will automatically detect and upload it.
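
The discovery rule described above can be sketched with the standard library alone. This is an illustration of the rule, not the backend's actual implementation; the function names `find_dataset_folders` and `list_data_files` are hypothetical, and the sketch does not treat dataset folders nested inside other dataset folders specially.

```python
from pathlib import Path

MARKER = "_QH_dataset_info.yaml"


def find_dataset_folders(root: Path) -> list[Path]:
    """Return every folder at or below `root` that contains the marker file."""
    return sorted(marker.parent for marker in root.rglob(MARKER) if marker.is_file())


def list_data_files(dataset_folder: Path) -> list[Path]:
    """Return all files in the folder and its subdirectories, except marker files."""
    return sorted(
        path
        for path in dataset_folder.rglob("*")
        if path.is_file() and path.name != MARKER
    )
```

Calling `find_dataset_folders(Path("main_folder"))` on the tree above would return the two dated experiment folders and `some_other_folder`, because each of them holds a `_QH_dataset_info.yaml` file.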

---

## Installation

```bash
pip install etiket_sync_agent_folderbase
```

The package is automatically discovered by `etiket_sync_agent` through the entry-point system.

---

## Configuration

The FolderBase backend requires a `FolderBaseConfigData` configuration:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `root_directory` | `Path` or `str` | Yes | Root directory to watch for datasets. Supports `~` expansion. |
| `is_server_folder` | `bool` | Yes | Whether this is a network/server folder (e.g., on a university network drive) |

To add this sync source, use the QHarbor Flutter GUI or the `etiket_sdk` package.

---

## The `_QH_dataset_info.yaml` File

When performing measurements, we recommend programmatically creating the `_QH_dataset_info.yaml` file in the dataset folder.

### Minimal Example

```yaml
version: 0.1
```

### Full Field Reference

| Field | Required | Type | Description |
|-------|----------|------|-------------|
| `version` | Yes | `str` | File format version (currently `0.1`) |
| `dataset_name` | No | `str` | Name of the dataset. Default: folder name |
| `created` | No | `str` | Creation date in format `YYYY-MM-DDTHH:MM:SS`. Default: earliest file modification time |
| `collected` | No | `str` | Collection date (alternative to `created`) |
| `description` | No | `str` | Description of the dataset |
| `attributes` | No | `dict` | Key-value pairs (values must be `str` or `number`) |
| `tags` | No | `list` | Tags for the dataset |
| `skip` | No | `list` | Glob patterns for files/folders to exclude (e.g., `["*.json", "raw_data/*"]`) |
| `converters` | No | `dict` | File converters to apply (see below) |
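
To get a feel for what the `skip` patterns select, the sketch below matches them with the standard library's `fnmatch` module. It assumes that patterns are compared against the POSIX-style path relative to the dataset folder; the backend's actual matching rules may differ, and `is_skipped` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_skipped(relative_path: str, skip_patterns: list[str]) -> bool:
    """Return True if the relative path matches any glob pattern in `skip_patterns`."""
    # Assumption: paths are relative to the dataset folder and use "/" separators.
    # Note that in fnmatch, "*" also matches "/", so "*.json" matches nested files.
    return any(fnmatchcase(relative_path, pattern) for pattern in skip_patterns)


skip = ["*.json", "raw_data/*"]
for candidate in ["settings.json", "raw_data/trace_0.bin", "measurement.hdf5"]:
    print(candidate, "->", "skipped" if is_skipped(candidate, skip) else "uploaded")
```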

### Complete Example

```yaml
version: 0.1
dataset_name: 'my_dataset_name'
description: "Description of the experiment I want to do."
attributes:
  initials: 'QH'
  set_up: 'XLD001'
  sample: 'my_sample'
tags: ['rabi', 'test']
skip: ['*.json', 'raw_data/*']
converters:
  csv_to_hdf5_converter:
    module: etiket_sync_agent_qh_converters
    class: CSVToHDF5Converter
```

> ⚠️ **Note**: The YAML file must use **spaces** for indentation, not tabs. Using tabs will cause parsing errors and synchronization will fail.
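
Before handing a file to the sync agent, it can help to check it for the two constraints mentioned above: no tabs in the indentation, and attribute values limited to strings and numbers. The following is a minimal sketch with hypothetical helper names; `check_dataset_info` expects a mapping that has already been parsed, for example with `yaml.safe_load`, and it does not reproduce the backend's own validation.

```python
def find_tab_indented_lines(yaml_text: str) -> list[int]:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


def check_dataset_info(info: dict) -> list[str]:
    """Return a list of problems found in an already-parsed dataset info mapping."""
    problems = []
    if "version" not in info:
        problems.append("missing required field: version")
    for key, value in (info.get("attributes") or {}).items():
        # Assumption: booleans do not count as numbers (bool is a subclass of int).
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            problems.append(f"attribute {key!r} must be a str or number")
    return problems
```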

---

## File Converters

You can specify converters to automatically transform files during sync. The naming convention is `{input}_to_{output}_converter`.

### Converter Syntax

```yaml
converters:
  txt_to_csv_converter:
    module: my_library.location.to.module
    class: MyConverterClass
```
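
The `{input}_to_{output}_converter` naming convention can be checked with a small regular expression. This sketch only illustrates the convention; whether the backend parses the key or merely uses it as a label is not stated here, and `parse_converter_key` is a hypothetical helper.

```python
import re

CONVERTER_KEY = re.compile(r"^(?P<input>.+?)_to_(?P<output>.+)_converter$")


def parse_converter_key(key: str) -> tuple[str, str]:
    """Split a key such as 'csv_to_hdf5_converter' into ('csv', 'hdf5')."""
    match = CONVERTER_KEY.match(key)
    if match is None:
        raise ValueError(f"{key!r} does not follow '{{input}}_to_{{output}}_converter'")
    return match["input"], match["output"]
```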

### Available Converters

The `etiket_sync_agent_qh_converters` package provides built-in converters:
- `zarr` → HDF5
- CSV → HDF5
- And more...

To create custom converters, implement a class that inherits from `FileConverter` and provides the `convert` method. The converter can be installed with the `etiket_sdk` package. For more information on creating converters, see the `etiket_sync_agent` package documentation.

---

## Programmatic Dataset Creation

You can programmatically create the `_QH_dataset_info.yaml` file using the `generate_dataset_info` function:

```python
from datetime import datetime
from etiket_sync_agent_folderbase import generate_dataset_info
from etiket_sync_agent_qh_converters import CSVToHDF5Converter

path = "my_path/test/"
generate_dataset_info(
    path,
    dataset_name="my_dataset_name",
    creation=datetime.now(),
    description="Description of the experiment I want to do.",
    attributes={"sample": "my_sample"},
    tags=["rabi", "test"],
    converters=[CSVToHDF5Converter],
    skip=["*.json", "raw_data/*"]
)
```

> **Note**: This function is also re-exported by the `qdrive` package as `qdrive.dataset.generate_dataset_info`.

See [`dataset_info.py`](etiket_sync_agent_folderbase/dataset_info.py) for the full function signature and documentation.

---

## What Gets Synchronized

| Source | eTiKeT Field | Description |
|--------|--------------|-------------|
| `dataset_name` or folder name | `name` | Name of the dataset |
| `description` | `description` | Dataset description (appended with source path) |
| `created`/`collected` or earliest file mtime | `collected` | Dataset creation time |
| `tags` | `tags` | Searchable tags |
| `attributes` | `attributes` | Key-value metadata |
| All files (except skipped) | Data files | Uploaded with detected file type |
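
The fallbacks in the first and third rows of the table can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: `created` is given precedence over `collected`, and every file in the folder (including the marker file) counts towards the earliest modification time.

```python
from datetime import datetime
from pathlib import Path


def resolve_name(folder: Path, info: dict) -> str:
    """Use `dataset_name` when given, otherwise fall back to the folder name."""
    return info.get("dataset_name") or folder.name


def resolve_collected(folder: Path, info: dict) -> datetime:
    """Use `created`/`collected` when given, otherwise the earliest file mtime."""
    # Assumption: `created` wins when both fields are present.
    stamp = info.get("created") or info.get("collected")
    if stamp:
        return datetime.fromisoformat(str(stamp))
    # A dataset folder always holds at least the marker file, so this list is non-empty.
    mtimes = [path.stat().st_mtime for path in folder.rglob("*") if path.is_file()]
    return datetime.fromtimestamp(min(mtimes))
```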

### Supported File Types

Any file type is supported. For zarr files (which are actually folders), use a converter from `etiket_sync_agent_qh_converters` to convert them to HDF5.

---

## Features

- **Directory-based dataset discovery**: Automatically finds datasets by `_QH_dataset_info.yaml` presence
- **YAML-based configuration**: Simple declarative dataset metadata
- **File converter support**: Transform files during sync (e.g., zarr → HDF5, CSV → HDF5)
- **Skip patterns**: Exclude files/folders using glob patterns
- **Automatic file type detection**: Detects JSON, text, and HDF5/NetCDF files
- **Subdirectory support**: Syncs all files recursively within dataset folders
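
As an illustration of content-based file type detection, the sketch below classifies a file by its leading bytes and by whether it parses as JSON. It is not the backend's detection logic, and `detect_file_type` is a hypothetical helper. It relies on the HDF5 format signature, which NetCDF-4 files share because they are stored as HDF5, and it only checks for that signature at offset 0.

```python
import json
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def detect_file_type(path: Path) -> str:
    """Classify a file as 'hdf5', 'json', 'text', or 'unknown' from its content."""
    with path.open("rb") as handle:
        if handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
            return "hdf5"
    try:
        # Reads the whole file; acceptable for a sketch, costly for large files.
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "text"
    return "json"
```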

## Requirements

- Python >= 3.10
- xarray
- h5netcdf
- PyYAML

## License

Copyright © 2025 QHarbor. All Rights Reserved. See [LICENCE](LICENCE) for details.
