Metadata-Version: 2.4
Name: datumaro
Version: 1.13.0
Summary: Dataset Management Framework (Datumaro)
Maintainer: Intel Open Edge Platform
Project-URL: Homepage, https://github.com/open-edge-platform/datumaro
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Requires-Python: <3.15,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: attrs>=25.3
Requires-Dist: cachetools~=7.0
Requires-Dist: defusedxml~=0.7
Requires-Dist: imagesize~=2.0
Requires-Dist: json-stream~=2.4
Requires-Dist: lxml~=6.0
Requires-Dist: numpy~=2.2
Requires-Dist: opencv-python-headless~=4.11
Requires-Dist: orjson~=3.10
Requires-Dist: pandas~=2.3
Requires-Dist: pillow~=12.0
Requires-Dist: pyarrow~=24.0
Requires-Dist: pycocotools~=2.0
Requires-Dist: PyYAML~=6.0
Requires-Dist: ruamel.yaml~=0.19.1
Requires-Dist: shapely~=2.1
Requires-Dist: tqdm~=4.67
Requires-Dist: typing_extensions~=4.15
Provides-Extra: dev
Requires-Dist: ruff; extra == "dev"
Requires-Dist: prek; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=5.3.5; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-stress; extra == "test"
Requires-Dist: pytest-html; extra == "test"
Requires-Dist: coverage; extra == "test"
Requires-Dist: dill; extra == "test"
Requires-Dist: tifffile; extra == "test"
Provides-Extra: docs
Requires-Dist: markupsafe==3.0.3; extra == "docs"
Requires-Dist: nbconvert>=7.2.3; extra == "docs"
Requires-Dist: ipython==8.39.0; extra == "docs"
Requires-Dist: sphinx==7.4.7; extra == "docs"
Requires-Dist: pydata-sphinx-theme==0.17.1; extra == "docs"
Requires-Dist: sphinx-copybutton; extra == "docs"
Requires-Dist: sphinx-autoapi; extra == "docs"
Requires-Dist: myst-parser; extra == "docs"
Requires-Dist: nbsphinx; extra == "docs"
Requires-Dist: jupyter; extra == "docs"
Requires-Dist: jupyterlab>=4.5.7; extra == "docs"
Requires-Dist: notebook>=7.5.6; extra == "docs"
Requires-Dist: jupytext; extra == "docs"
Requires-Dist: pandoc; extra == "docs"
Requires-Dist: sphinx-design; extra == "docs"
Requires-Dist: sphinx-toolbox; extra == "docs"
Provides-Extra: tf
Requires-Dist: tensorflow; (python_version >= "3.11" and python_version < "3.13" and sys_platform != "darwin") and extra == "tf"
Requires-Dist: keras>=3.13.2; (python_version >= "3.11" and python_version < "3.13" and sys_platform != "darwin") and extra == "tf"
Provides-Extra: torch
Requires-Dist: torch>=2.9; extra == "torch"
Requires-Dist: torchvision>=0.24; extra == "torch"
Provides-Extra: kaggle
Requires-Dist: kaggle; extra == "kaggle"
Provides-Extra: scipy
Requires-Dist: scipy; extra == "scipy"
Provides-Extra: nlp
Requires-Dist: nltk; extra == "nlp"
Requires-Dist: tokenizers; extra == "nlp"
Requires-Dist: portalocker; extra == "nlp"
Provides-Extra: tabulate
Requires-Dist: tabulate; extra == "tabulate"
Provides-Extra: cli
Requires-Dist: tensorboardX!=2.3,>=1.8; extra == "cli"
Requires-Dist: tabulate; extra == "cli"
Requires-Dist: scipy; extra == "cli"
Requires-Dist: matplotlib>=3.3.1; extra == "cli"
Provides-Extra: visualizer
Requires-Dist: matplotlib>=3.3.1; extra == "visualizer"
Provides-Extra: h5py
Requires-Dist: h5py>=3.15.0; extra == "h5py"
Provides-Extra: nibabel
Requires-Dist: nibabel>=3.2.1; extra == "nibabel"
Provides-Extra: protobuf
Requires-Dist: protobuf; extra == "protobuf"
Provides-Extra: experimental
Requires-Dist: polars~=1.35; extra == "experimental"
Dynamic: license-file

# Dataset Management Framework (Datumaro)

[![Build status](https://github.com/open-edge-platform/datumaro/actions/workflows/health_check.yml/badge.svg)](https://github.com/open-edge-platform/datumaro/actions/workflows/health_check.yml)
[![codecov](https://codecov.io/gh/open-edge-platform/datumaro/branch/develop/graph/badge.svg?token=FG25VU096Q)](https://codecov.io/gh/open-edge-platform/datumaro)
[![Downloads](https://static.pepy.tech/badge/datumaro)](https://pepy.tech/project/datumaro)
[![OpenSSF Scorecard](https://api.scorecard.dev/projects/github.com/open-edge-platform/datumaro/badge)](https://scorecard.dev/viewer/?uri=github.com/open-edge-platform/datumaro)

A framework and CLI tool to build, transform, and analyze datasets.

<!--lint disable fenced-code-flag-->

```
VOC dataset                                  ---> Annotation tool
     +                                     /
COCO dataset -----> Datumaro ---> dataset ------> Model training
     +                                     \
CVAT annotations                             ---> Publication, statistics etc.
```

<!--lint enable fenced-code-flag-->

- [Getting started](https://open-edge-platform.github.io/datumaro/latest/docs/get-started/quick-start-guide)
- [Level Up](https://open-edge-platform.github.io/datumaro/latest/docs/level-up/basic_skills)
- [Features](#features)
- [User manual](https://open-edge-platform.github.io/datumaro/latest/docs/user-manual/how_to_use_datumaro)
- [Developer manual](https://open-edge-platform.github.io/datumaro/latest/docs/reference/datumaro_module)
- [Contributing](#contributing)

## Features

[(Back to top)](#dataset-management-framework-datumaro)

- Dataset reading, writing, and conversion in any direction. Supported formats include:

  - [CIFAR-10/100](https://www.cs.toronto.edu/~kriz/cifar.html) (`classification`)
  - [Cityscapes](https://www.cityscapes-dataset.com/)
  - [COCO](http://cocodataset.org/#format-data) (`image_info`, `instances`, `person_keypoints`,
    `captions`, `labels`, `panoptic`, `stuff`)
  - [CVAT](https://opencv.github.io/cvat/docs/manual/advanced/xml_format/)
  - [ImageNet](http://image-net.org/)
  - [Kitti](http://www.cvlibs.net/datasets/kitti/index.php) (`segmentation`, `detection`,
    `3D raw` / `velodyne points`)
  - [LabelMe](http://labelme.csail.mit.edu/Release3.0)
  - [LFW](http://vis-www.cs.umass.edu/lfw/) (`classification`, `person re-identification`,
    `landmarks`)
  - [MNIST](http://yann.lecun.com/exdb/mnist/) (`classification`)
  - [Open Images](https://storage.googleapis.com/openimages/web/download.html)
  - [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/htmldoc/index.html)
    (`classification`, `detection`, `segmentation`, `action_classification`, `person_layout`)
  - [TF Detection API](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md)
    (`bboxes`, `masks`)
  - [YOLO](https://github.com/AlexeyAB/darknet#how-to-train-pascal-voc-data) (`bboxes`)

  Other formats, and the documentation for each, are listed in the [data formats reference](https://open-edge-platform.github.io/datumaro/latest/docs/data-formats/formats).

- Dataset building
  - Merging multiple datasets into one
  - Dataset filtering by custom criteria:
    - remove polygons of a certain class
    - remove images without annotations of a specific class
    - remove `occluded` annotations from images
    - keep only vertically-oriented images
    - remove small area bounding boxes from annotations
  - Annotation conversions, for instance:
    - polygons to instance masks and vice versa
    - apply a custom colormap for mask annotations
    - rename or remove dataset labels
  - Splitting a dataset into multiple subsets like `train`, `val`, and `test`:
    - random split
    - task-specific splits based on annotations,
      which keep initial label and attribute distributions
      - for the classification task, based on labels
      - for the detection task, based on bboxes
      - for the re-identification task, based on labels,
        so that the same IDs never appear in both training and test splits
- Dataset quality checking
  - Simple checking for errors
  - Comparison with model inference
  - Merging and comparison of multiple datasets
  - Annotation validation based on the task type (classification, etc.)
- Dataset comparison
- Dataset statistics (image mean and std, annotation statistics)
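
The task-specific splits listed above can be illustrated with a small, library-independent sketch. It is not Datumaro's implementation and does not use its API: the function names `stratified_split` and `identity_disjoint_split`, the `(item_id, label)` tuples, and the sample data are all assumptions made for illustration. The first function splits each label separately so every subset keeps roughly the original label distribution; the second assigns whole identities to a subset, so no ID appears in both training and test data.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios, seed=0):
    """Split (item_id, label) pairs so each subset keeps the label distribution.

    `ratios` maps a subset name to a fraction; the fractions should sum to 1.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in items:
        by_label[label].append(item_id)

    names = list(ratios)
    subsets = {name: [] for name in names}
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        start = 0
        for i, name in enumerate(names):
            # The last subset takes the remainder so that no item is dropped.
            if i == len(names) - 1:
                end = len(ids)
            else:
                end = start + round(len(ids) * ratios[name])
            subsets[name].extend((item_id, label) for item_id in ids[start:end])
            start = end
    return subsets


def identity_disjoint_split(items, test_ratio, seed=0):
    """Re-identification style split: each identity lands in exactly one subset."""
    rng = random.Random(seed)
    identities = sorted({identity for _, identity in items})
    rng.shuffle(identities)
    test_ids = set(identities[: round(len(identities) * test_ratio)])
    return {
        "train": [item for item in items if item[1] not in test_ids],
        "test": [item for item in items if item[1] in test_ids],
    }


# Hypothetical data: 100 images, 25% labeled "dog" and 75% labeled "cat".
items = [(f"img_{i:03d}", "cat" if i % 4 else "dog") for i in range(100)]
splits = stratified_split(items, {"train": 0.6, "val": 0.2, "test": 0.2})
print({name: len(rows) for name, rows in splits.items()})
# prints {'train': 60, 'val': 20, 'test': 20}

# Hypothetical re-identification data: 50 crops of 10 identities.
people = [(f"crop_{i:02d}", i % 10) for i in range(50)]
reid = identity_disjoint_split(people, test_ratio=0.3)
```

In this sketch every subset returned by `stratified_split` contains 25% `dog` items, matching the input, and the two subsets returned by `identity_disjoint_split` share no identity. For real datasets, refer to the user manual for the split transforms Datumaro actually provides.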

> Check
> [the design document](https://open-edge-platform.github.io/datumaro/latest/docs/explanation/architecture)
> for a full list of features.
> Check
> [the user manual](https://open-edge-platform.github.io/datumaro/latest/docs/user-manual/how_to_use_datumaro)
> for usage instructions.

## Contributing

[(Back to top)](#dataset-management-framework-datumaro)

Feel free to
[open an issue](https://github.com/open-edge-platform/datumaro/issues/new) if you
think something needs to be changed. You are welcome to participate in
development; instructions are available in our
[contribution guide](https://github.com/open-edge-platform/datumaro/blob/develop/contributing.md).
