Metadata-Version: 2.2
Name: origami-ml
Version: 0.1.0
Summary: An ML classifier model to make predictions from semi-structured data.
Home-page: https://github.com/yourusername/origami-ml
Author: Thomas Rueckstiess
Author-email: thomas.rueckstiess@mongodb.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1.7
Requires-Dist: click-option-group>=0.5.6
Requires-Dist: guildai>=0.9.0
Requires-Dist: lightgbm>=4.5.0
Requires-Dist: matplotlib>=3.9.2
Requires-Dist: mdbrtools>=0.1.1
Requires-Dist: numpy>=1.26.4
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: openml>=0.15.1
Requires-Dist: pandas>=2.2.3
Requires-Dist: pymongo>=4.8.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: scikit_learn>=1.5.2
Requires-Dist: torch>=2.4.1
Requires-Dist: tqdm>=4.66.4
Requires-Dist: xgboost>=2.1.3
Provides-Extra: dev
Requires-Dist: jupyter>=1.1.1; extra == "dev"
Requires-Dist: jupyter_contrib_nbextensions>=0.7.0; extra == "dev"
Requires-Dist: pytest>=8.3.3; extra == "dev"
Requires-Dist: ruff>=0.9.3; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

<p align="center">
  <img src="assets/origami_logo.jpg" style="width: 100%; height: auto;">
</p>

# ORiGAMi - Object Representation through Generative Autoregressive Modelling

<p align="center">
| <a href="https://arxiv.org/abs/2412.17348"><b>ORiGAMi Paper on arXiv</b></a> |
</p>

## Disclaimer

Please note: This tool is not officially supported or endorsed by MongoDB, Inc. The code is released for use "AS IS" without warranties of any kind, including, but not limited to, its installation, use, or performance. Do not run this tool against critical production systems.

## Overview

ORiGAMi is a transformer-based machine learning model that directly processes semi-structured data, such as MongoDB documents or JSON files, and makes predictions from it.

Typically, when working with semi-structured data in a machine learning context, the data needs to be flattened
into a tabular form first. This flattening can be lossy, especially in the presence of arrays and nested objects, and often requires domain expertise to extract meaningful higher-order features from the raw data. This feature extraction step is manual, slow, and expensive, and it does not scale well.
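
To make the "lossy" point concrete, here is a minimal sketch of a naive flattening step. The `flatten` helper and both example documents are hypothetical and not part of ORiGAMi. Reducing each array to its length is just one illustrative way to force variable-length data into fixed columns.

```python
from typing import Any


def flatten(doc: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Naively flatten a document into a single table row (hypothetical helper).

    Nested objects become dotted column names. Arrays have no fixed width,
    so this sketch reduces each one to its length.
    """
    row: dict[str, Any] = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted column name.
            row.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # Arrays are collapsed to a count; their contents are discarded.
            row[f"{column}.count"] = len(value)
        else:
            row[column] = value
    return row


# Two made-up documents that differ only inside their "orders" arrays.
doc_a = {
    "user": {"name": "ada", "plan": "pro"},
    "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
doc_b = {
    "user": {"name": "ada", "plan": "pro"},
    "orders": [{"sku": "C3", "qty": 9}, {"sku": "C3", "qty": 9}],
}

row_a = flatten(doc_a)
row_b = flatten(doc_b)

print(row_a)  # {'user.name': 'ada', 'user.plan': 'pro', 'orders.count': 2}
print(doc_a != doc_b and row_a == row_b)  # True
```

Both documents collapse to the same row, so any signal carried by the array contents is gone before a model ever sees it. Recovering that signal in a tabular pipeline would take hand-written features per array field.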

Like many other deep learning models, ORiGAMi operates directly on the raw data and discovers meaningful features itself. Preprocessing is fully automated, apart from a few hyperparameters that can be tuned to improve model performance.


## Installation

ORiGAMi requires Python version 3.10 or higher. We recommend using a virtual environment, such as
Python's native [`venv`](https://docs.python.org/3/library/venv.html).

To install ORiGAMi with `pip`, run:

```shell
pip install origami-ml
```

You can also clone the repository to your local machine and install the dependencies manually:

```shell
git clone https://github.com/mongodb-labs/origami.git
cd origami
pip install -r requirements.txt
pip install -e .
```

## Usage

ORiGAMi comes with a command line interface (CLI) and a Python SDK.

### Usage from the Command Line

The CLI lets you train a model and make predictions with a trained model. After installation, run `origami` from your shell to see an overview of the available commands.

Help for a specific command is available with `origami <command> --help`, where `<command>` is currently either `train` or `predict`.

Detailed documentation for the CLI and available options can be found in [`CLI.md`](CLI.md).

### Usage with Python

For an example of how to use ORiGAMi from Python, take a look at the provided [./notebooks](./notebooks/) folder, e.g. the [`example_origami_dungeons.ipynb`](./notebooks/example_origami_dungeons.ipynb) notebook.

## Experiment Reproduction

This code is released alongside our paper, which is available on arXiv: [ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348). To reproduce the experiments in the paper, see the instructions in the [`./experiments/`](./experiments/) directory.
