Metadata-Version: 2.3
Name: fw-dataset
Version: 0.3.3
Summary: A library for working with Flywheel datasets
Author: joshicola
Author-email: joshicola <joshuajacobs@flywheel.io>
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: pandas>=3
Requires-Dist: adlfs>=2024.7.0,<2025
Requires-Dist: pyarrow>=20.0,<21
Requires-Dist: duckdb>=1.1.1,<2
Requires-Dist: s3fs>=2024.9.0,<2025
Requires-Dist: gcsfs>=2024.9.0.post1,<2025
Requires-Dist: httpcore>=1.0.0,<2
Requires-Dist: httpx>=0.28.1,<0.29
Requires-Dist: h11>=0.16.0,<0.17
Requires-Dist: fw-client>=0.8.6,<0.9
Requires-Dist: flywheel-sdk>=20,<21
Requires-Dist: pydantic>=2.9.2,<3
Requires-Dist: orjson>=3.10.7,<4
Requires-Dist: deepdiff>=8.4,<9
Requires-Dist: fsspec>=2024.9.0,<2025
Requires-Dist: pyopenssl>=24.2.1,<25
Requires-Dist: jinja2>=3.1.4,<4
Requires-Python: >=3.12, <4.0
Description-Content-Type: text/markdown

# fw-dataset

This repository contains classes and functions for creating, managing, and serving
Flywheel Datasets. Flywheel Datasets are a way to organize, share, and query data from
the Flywheel Data Model.

[[_TOC_]]

> [!IMPORTANT]
> This Python package is under active development and should be considered
> unstable. It is provided as-is, without any guarantee of support or maintenance at
> this stage. Features may be incomplete, change without notice, or be removed in future
> versions. Use at your own risk for experimental or development purposes only.

## Getting started

### Installation

The `fw-dataset` package is built for use with Python 3.12 and above. It can be
installed with pip:

```bash
pip install fw-dataset
```

or poetry:

```bash
poetry add fw-dataset
```

### Usage

#### Rendering Datasets

See
[notebooks/quickstart_dataset_creation.ipynb](notebooks/quickstart_dataset_creation.ipynb)
for a walkthrough of using the `DatasetBuilder` to render a Flywheel dataset.

#### Accessing and Managing Datasets

See
[notebooks/quickstart_dataset_management.ipynb](notebooks/quickstart_dataset_management.ipynb)
for a walkthrough of using the `FWDatasetClient` to access and query a Flywheel dataset.

#### Unassociated Datasets

If you have a valid dataset that is not associated with a Flywheel project, you can
still use the `FWDatasetClient` to access it. You will need to provide the `type`,
`bucket`, `prefix`, and `credentials` of the cloud or local filesystem to instantiate
and query the dataset.

```python
from fw_dataset import FWDatasetClient

# No API key or dataset client instantiation is required

fs_type = "s3" # or "gcs", "azure", "fs", "local"
bucket = "your-bucket"
prefix = "your-prefix"
credentials = {"url": "{bucket-specific-credential-string}"}

dataset = FWDatasetClient.get_dataset_from_filesystem(fs_type, bucket, prefix, credentials)
```

#### Merging Related Datasets

If you have multiple datasets that have related tables you want to query together, you
can merge the datasets into a single dataset.

NOTE: Federated Querying is not yet enabled across datasets. This is a work in progress.

##### Requirements

1. The `source` dataset must have a valid `tables` directory structure.
2. The `source` dataset must have a valid `schemas` directory structure.
    - Every table in the `tables` directory must have a valid corresponding schema file
      in the `schemas` directory.
    - The schema file must be named `{table_name}.schema.json` where `{table_name}` is
      the name of the table that the schema describes.
    - The schema file must be a valid JSON file with the minimum structure:

        ```json
        {
            "schema": "http://json-schema.org/draft-07/schema#",
            "id": "{table_name}",
            "description": "",
            "properties": {},
            "required": [],
            "type": "object"
        }
        ```

3. The `destination` dataset must have the same requirements as the `source` dataset.
4. Tables and schemas selected from the `source` MUST NOT have the same names as
   existing ones in the `destination`.

Once the above requirements have been met, you may merge the datasets by copying or
moving the selected tables and schemas from the `source` dataset to the `destination`
dataset.

## Flywheel Project Requirements

For the Flywheel Dataset Client and the Dataset objects to function, the following
requirements must be met:

### Flywheel Project Structure

The Flywheel Project must have the following valid custom information metadata:

```json
{
    "dataset": {
        "type": "s3",
        "bucket": "{bucket-name}",
        "prefix": "{path/to/dataset}",
        "storage_id": "storage-id-of-fw-storage-object"
    }
}
```

#### type

The `type` field must be one of the following:

- `s3`: The dataset is stored in an S3 bucket.
- `gcs`: The dataset is stored in a Google Cloud Storage bucket.
- `azure`: The dataset is stored in an Azure Blob Storage container.
- `fs`, `local`: The dataset is stored on a local filesystem.

#### bucket

The `bucket` field is the name of the bucket or container where the dataset is stored.

#### prefix

The `prefix` field is the path to the dataset within the bucket or container.

The directory structure beneath the `prefix` should be as described in the [Dataset
Structure](#dataset-structure) section.

#### storage_id

The `storage_id` field is the Flywheel ID of the cloud storage record that describes the
filesystem or cloud storage bucket that the dataset is stored in. This should be a valid
storage object in the Flywheel database.
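The four fields above can be assembled and sanity-checked before being written to the project's custom information. A minimal sketch; the `build_dataset_info` helper is hypothetical, not part of `fw-dataset`:

```python
# Allowed values for the `type` field, per the list above.
VALID_TYPES = {"s3", "gcs", "azure", "fs", "local"}


def build_dataset_info(fs_type: str, bucket: str, prefix: str, storage_id: str) -> dict:
    """Return the `dataset` custom-information block described above.

    Hypothetical helper for illustration; validates `type` before use.
    """
    if fs_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}, got {fs_type!r}")
    return {
        "dataset": {
            "type": fs_type,
            "bucket": bucket,
            "prefix": prefix,
            "storage_id": storage_id,
        }
    }
```

The resulting dictionary matches the JSON block shown at the top of this section and can be applied to the project with whatever custom-information update mechanism your Flywheel client provides.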

### Dataset Structure

The dataset should be stored in the bucket or container with the following structure:

```bash
{bucket}/{prefix}/
└── versions/
    └── {version}/
        ├── provenance/
        │   ├── dataset_description.json
        │   ├── snapshot.db.gz
        │   ├── snapshot_info.json
        │   └── project.json
        ├── tables/
        │   └── {table_name}/
        │       └── {hash}.parquet
        └── schemas/
            └── {table_name}.schema.json
```

Each version is stored in a separate subdirectory named with its version identifier
(typically a BSON ID like "66cf6701af1c6f3855f1ee61"). The "latest" version is
determined dynamically by comparing the creation dates in each version's
`dataset_description.json` file.

The structure above is described in more detail in the [Dataset
Definition](docs/dataset_definition.md#dataset-components) document in the `docs`
directory.

#### Schema Files

Schema files are JSON files that describe the schema of each table in the dataset.
They are stored in the `schemas` directory and named `{table_name}.schema.json`,
where `{table_name}` is the name of the table that the schema describes.

Ideally, the schema files should be fully descriptive. However, if a minimal schema is
desired merely to allow the dataset to be queried, the schema file can be as simple as:

```json
{
    "schema": "http://json-schema.org/draft-07/schema#",
    "id": "{table_name}",
    "description": "Table derived from Tabular Data File: conditions.csv",
    "properties": {},
    "required": [],
    "type": "object"
}
```

## Appendix

### Flywheel Data Model

The Flywheel Data Model is a hierarchical structure that organizes data in a Flywheel
Project. The Flywheel Data Model is organized as follows:

- `Project` (has `files` and `analyses`)
  - `Subject` (has `files` and `analyses`)
    - `Session` (has `files` and `analyses`)
      - `Acquisition` (has `files` and `analyses`)
- `File`
- `Analysis`

The SQLite snapshot of the Flywheel Data Model has each of the above entities as tables.
The tables consist of an `id` column and a `data` column. The `data` column is a binary
string containing the JSON representation of each entity.
