Metadata-Version: 2.2
Name: origami-ml
Version: 0.1.3
Summary: An ML classifier model to make predictions from semi-structured data.
Home-page: https://github.com/mongodb-labs/origami
Author: Thomas Rueckstiess
Author-email: thomas.rueckstiess@mongodb.com
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1.7
Requires-Dist: click-option-group>=0.5.6
Requires-Dist: guildai>=0.9.0
Requires-Dist: lightgbm>=4.5.0
Requires-Dist: matplotlib>=3.9.2
Requires-Dist: mdbrtools>=0.1.1
Requires-Dist: numpy>=1.26.4
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: openml>=0.15.1
Requires-Dist: pandas>=2.2.3
Requires-Dist: pymongo>=4.8.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: scikit_learn>=1.5.2
Requires-Dist: torch>=2.4.1
Requires-Dist: tqdm>=4.66.4
Requires-Dist: xgboost>=2.1.3
Provides-Extra: dev
Requires-Dist: jupyter>=1.1.1; extra == "dev"
Requires-Dist: jupyter_contrib_nbextensions>=0.7.0; extra == "dev"
Requires-Dist: pytest>=8.3.3; extra == "dev"
Requires-Dist: ruff>=0.9.3; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# ORiGAMi - Object Representation through Generative Autoregressive Modelling

## Overview

ORiGAMi is a transformer-based Machine Learning model for supervised classification from semi-structured data such as MongoDB documents or JSON files.

Typically, when working with semi-structured data in a machine learning context, the data must first be flattened into a tabular format. This flattening can be lossy, especially in the presence of arrays and nested objects, and often requires domain expertise to extract meaningful higher-order features from the raw data. This feature extraction step is manual, slow, and expensive, and it does not scale well.
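
To make the flattening problem concrete, here is a minimal sketch in plain Python. The `flatten` helper and the two example documents are hypothetical and written only for illustration; they are not part of ORiGAMi or its API.

```python
def flatten(doc, prefix=""):
    """Naively flatten nested dicts and lists into a single-level dict.

    Hypothetical helper for illustration only; not part of ORiGAMi.
    """
    flat = {}
    if isinstance(doc, dict):
        for key, value in doc.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(doc, list):
        # Array elements become positional columns: tags.0, tags.1, ...
        for index, value in enumerate(doc):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Scalar leaf: drop the trailing dot from the accumulated path.
        flat[prefix.rstrip(".")] = doc
    return flat


# Two made-up documents with the same schema but different shapes.
doc_a = {"name": "Ada", "tags": ["x", "y"], "address": {"city": "Berlin"}}
doc_b = {"name": "Bob", "tags": ["y"], "address": {"city": "Oslo", "zip": "0150"}}

flat_a = flatten(doc_a)
flat_b = flatten(doc_b)

# The union of keys is the column set a tabular model would have to handle.
columns = sorted(set(flat_a) | set(flat_b))
print(columns)
print(flat_a)
print(flat_b)
```

In this sketch the tag `"y"` ends up in column `tags.1` for the first document and in `tags.0` for the second, and each row has a missing value in a column the other one fills. A tabular model therefore sees positional columns rather than "this document has tag `y`", which is the kind of information loss that manual feature engineering usually has to repair.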

ORiGAMi circumvents this by directly operating on JSON data. Once a model is trained, it can be used to make predictions on any field in the dataset.

## Installation

ORiGAMi requires Python version 3.10 or higher. We recommend using a virtual environment, such as
Python's native [`venv`](https://docs.python.org/3/library/venv.html).

To install ORiGAMi with `pip`, run:

```shell
pip install origami-ml
```

You can also clone the repository to your local machine and install the dependencies manually:

```shell
git clone https://github.com/mongodb-labs/origami.git
cd origami
pip install -r requirements.txt
pip install -e .
```

## Usage

ORiGAMi comes with a command line interface (CLI) and a Python SDK.

### Usage from the Command Line

The CLI lets you train a model and make predictions with a trained model. After installation, run `origami` from your shell to see an overview of available commands.

Help for specific commands is available with `origami <command> --help`, where `<command>` is currently one of `train` or `predict`.

Detailed documentation for the CLI and available options can be found in [`CLI.md`](CLI.md).

### Usage with Python

For an example of how to use ORiGAMi from Python, see the [./notebooks](./notebooks/) folder, e.g. the [`example_origami_dungeons.ipynb`](./notebooks/example_origami_dungeons.ipynb) notebook.

## Experiment Reproduction

This code is released alongside our paper, which is available on arXiv: [ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348). To reproduce the experiments in the paper, see the instructions in the [`./experiments/`](./experiments/) directory.
