Metadata-Version: 2.4
Name: iden
Version: 0.3.0
Dynamic: Maintainer
Dynamic: Maintainer-email
Summary: simple library to manage a dataset of shards to train machine learning models
Project-URL: homepage, https://github.com/durandtibo/iden
Project-URL: repository, https://github.com/durandtibo/iden
Project-URL: documentation, https://durandtibo.github.io/iden/
Author-email: Thibaut Durand <durand.tibo+gh@gmail.com>
License-Expression: BSD-3-Clause
License-File: LICENSE
Keywords: dataset,shard
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Requires-Dist: coola<1.0,>=0.11.0
Requires-Dist: objectory<1.0,>=0.3.0
Provides-Extra: all
Requires-Dist: cloudpickle<4.0,>=3.0; extra == 'all'
Requires-Dist: joblib<2.0,>=1.3; extra == 'all'
Requires-Dist: numpy<3.0,>=1.24; extra == 'all'
Requires-Dist: pyyaml<7.0,>=6.0; extra == 'all'
Requires-Dist: safetensors<1.0,>=0.6; extra == 'all'
Requires-Dist: torch<2.3,>=2.0; (sys_platform == 'darwin' and platform_machine != 'arm64' and python_version < '3.13') and extra == 'all'
Requires-Dist: torch<3.0,>=2.0; (sys_platform == 'linux' or (sys_platform == 'darwin' and platform_machine == 'arm64')) and extra == 'all'
Provides-Extra: cloudpickle
Requires-Dist: cloudpickle<4.0,>=3.0; extra == 'cloudpickle'
Provides-Extra: joblib
Requires-Dist: joblib<2.0,>=1.3; extra == 'joblib'
Provides-Extra: numpy
Requires-Dist: numpy<3.0,>=1.24; extra == 'numpy'
Provides-Extra: pyyaml
Requires-Dist: pyyaml<7.0,>=6.0; extra == 'pyyaml'
Provides-Extra: safetensors
Requires-Dist: safetensors<1.0,>=0.6; extra == 'safetensors'
Provides-Extra: torch
Requires-Dist: torch<2.3,>=2.0; (sys_platform == 'darwin' and platform_machine != 'arm64' and python_version < '3.13') and extra == 'torch'
Requires-Dist: torch<3.0,>=2.0; (sys_platform == 'linux' or (sys_platform == 'darwin' and platform_machine == 'arm64')) and extra == 'torch'
Description-Content-Type: text/markdown

# iden

<p align="center">
    <a href="https://github.com/durandtibo/iden/actions/workflows/ci.yaml">
        <img alt="CI" src="https://github.com/durandtibo/iden/actions/workflows/ci.yaml/badge.svg">
    </a>
    <a href="https://github.com/durandtibo/iden/actions/workflows/nightly-tests.yaml">
        <img alt="Nightly Tests" src="https://github.com/durandtibo/iden/actions/workflows/nightly-tests.yaml/badge.svg">
    </a>
    <a href="https://github.com/durandtibo/iden/actions/workflows/nightly-package.yaml">
        <img alt="Nightly Package Tests" src="https://github.com/durandtibo/iden/actions/workflows/nightly-package.yaml/badge.svg">
    </a>
    <a href="https://codecov.io/gh/durandtibo/iden">
        <img alt="Codecov" src="https://codecov.io/gh/durandtibo/iden/branch/main/graph/badge.svg">
    </a>
    <br/>
    <a href="https://durandtibo.github.io/iden/">
        <img alt="Documentation" src="https://github.com/durandtibo/iden/actions/workflows/docs.yaml/badge.svg">
    </a>
    <a href="https://durandtibo.github.io/iden/dev/">
        <img alt="Documentation" src="https://github.com/durandtibo/iden/actions/workflows/docs-dev.yaml/badge.svg">
    </a>
    <br/>
    <a href="https://github.com/psf/black">
        <img  alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg">
    </a>
    <a href="https://google.github.io/styleguide/pyguide.html#s3.8-comments-and-docstrings">
        <img  alt="Doc style: google" src="https://img.shields.io/badge/%20style-google-3666d6.svg">
    </a>
    <a href="https://github.com/astral-sh/ruff">
        <img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" style="max-width:100%;">
    </a>
    <a href="https://github.com/guilatrova/tryceratops">
        <img  alt="Doc style: google" src="https://img.shields.io/badge/try%2Fexcept%20style-tryceratops%20%F0%9F%A6%96%E2%9C%A8-black">
    </a>
    <br/>
    <a href="https://pypi.org/project/iden/">
        <img alt="PYPI version" src="https://img.shields.io/pypi/v/iden">
    </a>
    <a href="https://pypi.org/project/iden/">
        <img alt="Python" src="https://img.shields.io/pypi/pyversions/iden.svg">
    </a>
    <a href="https://opensource.org/licenses/BSD-3-Clause">
        <img alt="BSD-3-Clause" src="https://img.shields.io/pypi/l/iden">
    </a>
    <br/>
    <a href="https://pepy.tech/project/iden">
        <img  alt="Downloads" src="https://static.pepy.tech/badge/iden">
    </a>
    <a href="https://pepy.tech/project/iden">
        <img  alt="Monthly downloads" src="https://static.pepy.tech/badge/iden/month">
    </a>
    <br/>
</p>

## Overview

`iden` is a simple Python library to manage a dataset of shards when training a machine learning
model.
`iden` uses a lazy loading approach to load the shard's data, so it is easy to manage shards without
loading their data.
`iden` supports different formats to store shards on disk.

### Key Features

- **Lazy Loading**: Shards are loaded only when needed, enabling efficient memory management
- **Multiple Formats**: Support for JSON, YAML, Pickle, PyTorch, safetensors, and more
- **Flexible Dataset Management**: Organize data into splits (train/val/test) with associated assets
- **URI-based Identification**: Each shard has a unique URI for easy persistence and loading
- **Caching Support**: Optional in-memory caching for frequently accessed shards
- **Extensible**: Easy to add custom shard types and loaders

### Quick Example

```python
import tempfile
from pathlib import Path
from iden.dataset import create_vanilla_dataset
from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple

# Create a simple dataset
with tempfile.TemporaryDirectory() as tmpdir:
    # Create shards
    train_tuple = create_shard_tuple(
        [
            create_json_shard(
                [1, 2, 3], uri=Path(tmpdir).joinpath("train1.json").as_uri()
            ),
            create_json_shard(
                [4, 5, 6], uri=Path(tmpdir).joinpath("train2.json").as_uri()
            ),
        ],
        uri=Path(tmpdir).joinpath("train_tuple").as_uri(),
    )
    val_tuple = create_shard_tuple(
        [create_json_shard([7, 8, 9], uri=Path(tmpdir).joinpath("val1.json").as_uri())],
        uri=Path(tmpdir).joinpath("val_tuple").as_uri(),
    )

    # Organize shards into splits
    shards = create_shard_dict(
        shards={"train": train_tuple, "val": val_tuple},
        uri=Path(tmpdir).joinpath("shards").as_uri(),
    )
    assets = create_shard_dict(shards={}, uri=Path(tmpdir).joinpath("assets").as_uri())

    # Create dataset
    dataset = create_vanilla_dataset(
        shards=shards,
        assets=assets,
        uri=Path(tmpdir).joinpath("my_dataset").as_uri(),
    )

    # Access data
    train_shards = dataset.get_shards("train")
    print(train_shards[0].get_data())  # Output: [1, 2, 3]
```

## Installation

We highly recommend installing
a [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
`iden` can be installed from pip using the following command:

```shell
uv pip install iden
```

To make the package as slim as possible, only the minimal packages required to use `iden` are
installed.
To include all the dependencies, the following command can be used:

```shell
uv pip install iden[all]
```

Please check the [get started page](https://durandtibo.github.io/iden/get_started) to see how to
install only some specific dependencies or other alternatives to install the library.

## Documentation

- **[Get Started](https://durandtibo.github.io/iden/get_started)**: Installation instructions
- **[User Guide](https://durandtibo.github.io/iden/guide/shard)**: Learn about shards and datasets
- **[How-to Guides](https://durandtibo.github.io/iden/howto/shard)**: Step-by-step guides for common
  tasks
- **[API Reference](https://durandtibo.github.io/iden/refs/shard)**: Complete API documentation
- **[Examples](examples/)**: Practical code examples

## Basic Usage

### Working with Shards

```python
from iden.shard import create_json_shard

# Create a shard
shard = create_json_shard(data={"key": "value"}, uri="file:///path/to/data.json")

# Get data from shard
data = shard.get_data()

# Cache data for faster access
data = shard.get_data(cache=True)
```

### Managing Datasets

```python
from iden.dataset import create_vanilla_dataset
from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple

# Create a dataset with train/val splits
train_tuple = create_shard_tuple([shard1, shard2, shard3], uri="file:///train_tuple")
val_tuple = create_shard_tuple([shard4, shard5], uri="file:///val_tuple")

shards = create_shard_dict(
    shards={"train": train_tuple, "val": val_tuple},
    uri="file:///shards",
)
assets = create_shard_dict(shards={}, uri="file:///assets")

dataset = create_vanilla_dataset(
    shards=shards,
    assets=assets,
    uri="file:///path/to/dataset",
)

# Access shards
train_shards = dataset.get_shards("train")
first_shard_data = train_shards[0].get_data()
```

The following is the corresponding `iden` versions and tested dependencies.

| `iden`  | `coola`         | `objectory`  | `numpy`<sup>*</sup> | `pyyaml`<sup>*</sup> | `safetensors`<sup>*</sup> | `torch`<sup>*</sup> | `python`      |
|---------|-----------------|--------------|---------------------|----------------------|---------------------------|---------------------|---------------|
| `main`  | `>=0.11.0,<1.0` | `>=0.3,<1.0` | `>=1.24,<2.0`       | `>=6.0,<7.0`         | `>=0.6,<1.0`              | `>=2.0,<3.0`        | `>=3.10`      |
| `0.3.0` | `>=0.11.0,<1.0` | `>=0.3,<1.0` | `>=1.24,<2.0`       | `>=6.0,<7.0`         | `>=0.6,<1.0`              | `>=2.0,<3.0`        | `>=3.10`      |
| `0.2.0` | `>=0.8.4,<1.0`  | `>=0.2,<1.0` | `>=1.22,<2.0`       | `>=6.0,<7.0`         | `>=0.4,<1.0`              | `>=2.0,<3.0`        | `>=3.9,<3.14` |
| `0.1.0` | `>=0.8.4,<1.0`  | `>=0.2,<1.0` | `>=1.22,<2.0`       | `>=6.0,<7.0`         | `>=0.4,<1.0`              | `>=2.0,<3.0`        | `>=3.9,<3.14` |

| `iden`  | `cloudpickle`<sup>*</sup> | `joblib`<sup>*</sup> |
|---------|---------------------------|----------------------|
| `main`  | `>=3.0,<4.0`              | `>=1.3,<2.0`         |
| `0.3.0` | `>=3.0,<4.0`              | `>=1.3,<2.0`         |

<sup>*</sup> indicates an optional dependency

<details>
    <summary>older versions</summary>

| `iden`  | `coola`      | `objectory`  | `numpy`<sup>*</sup> | `pyyaml`<sup>*</sup> | `safetensors`<sup>*</sup> | `torch`<sup>*</sup> | `python`      |
|---------|--------------|--------------|---------------------|----------------------|---------------------------|---------------------|---------------|
| `0.0.4` | `>=0.3,<1.0` | `>=0.1,<1.0` | `>=1.22,<2.0`       | `>=6.0,<7.0`         | `>=0.4,<1.0`              | `>=2.0,<3.0`        | `>=3.9,<3.13` |
| `0.0.3` | `>=0.3,<1.0` | `>=0.1,<1.0` | `>=1.22,<2.0`       | `>=6.0,<7.0`         | `>=0.4,<1.0`              | `>=2.0,<3.0`        | `>=3.9,<3.12` |
| `0.0.2` | `>=0.4,<1.0` | `>=0.1,<1.0` | `>=1.22,<2.0`       | `>=6.0,<7.0`         | `>=0.4,<1.0`              | `>=2.0,<2.1`        | `>=3.9,<3.12` |
| `0.0.1` | `>=0.4,<1.0` | `>=0.1,<1.0` | `>=1.22,<2.0`       | `>=6.0,<7.0`         | `>=0.4,<1.0`              | `>=2.0,<2.1`        | `>=3.9,<3.12` |

</details>

## Contributing

Please check the instructions in [CONTRIBUTING.md](CONTRIBUTING.md).

## Suggestions and Communication

Everyone is welcome to contribute to the community.
For any questions or suggestions,
[Github Issues](https://github.com/durandtibo/iden/issues) can be submitted.
All issues will be addressed as soon as possible.

## API stability

:warning: While `iden` is in development stage, no API is guaranteed to be stable from one
release to the next.
In fact, it is very likely that the API will change multiple times before a stable 1.0.0 release.
In practice, this means that upgrading `iden` to a new version will possibly break any code that
was using the old version of `iden`.

## License

`iden` is licensed under BSD 3-Clause "New" or "Revised" license available in [LICENSE](LICENSE)
file.
