Metadata-Version: 2.4
Name: osma-ai
Version: 0.1.0
Summary: Machine learning framework for building specialist models.
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets>=4.4.1
Requires-Dist: fsspec>=2025.10.0
Requires-Dist: gcsfs>=2025.10.0
Requires-Dist: jinja2>=3.1.6
Requires-Dist: litellm>=1.80.10
Requires-Dist: outlines>=1.2.9
Requires-Dist: pandas>=2.3.3
Requires-Dist: pandas-profiling>=3.6.6
Requires-Dist: peft>=0.18.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: tenacity>=9.1.2
Requires-Dist: transformers>=4.57.3
Requires-Dist: trl>=0.26.1
Requires-Dist: widgetsnbextension>=4.0.15
Dynamic: license-file

![Osma Header](docs/images/header.png)

Osma is a framework that streamlines fine-tuning language models on data curated by larger, more capable teacher models, in an effort to outperform those teachers on the target task. It provides a structured approach to defining the curation process using signatures, generating high-quality training datasets, and fine-tuning local models. Osma is inspired by research from Stanford University's [Natural Language Processing (NLP) Group](https://nlp.stanford.edu/) and is in alpha development.

## Features

- **Dataset Management**: Easy loading, manipulation, and saving of datasets.
- **Structured Signatures**: Define strict input/output schemas that keep generated data consistent.
- **Teacher-Student Workflow**: Use a managed model to curate training examples from raw data.
- **Trainset Curation**: Automatically generate reasoning and labels for your dataset.
- **Filtering**: Mechanisms to validate and filter generated data against ground truth or custom logic.
- **Local Fine-Tuning**: Seamlessly fine-tune local models using curated datasets.
- **Evaluation**: Tools to evaluate model performance against test sets.

## Simple Example

```python
import osma
from typing import Literal

# Load and shuffle data
ds = osma.Dataset("data.jsonl").shuffle()

# Define the task signature with inputs and outputs
classes = Literal["positive", "negative"]
sg = osma.Signature(
    osma.InputFields("text"),
    osma.OutputField("sentiment", classes),
    reasoning=True
)

# Initialize the Teacher Model
teacher = osma.LanguageModel("gemini/gemini-1.5-flash")

# Curate a training set
trainset = osma.Trainset(ds.range(0, 500), sg, teacher)
trainset.save("train.jsonl")

# Fine-tune a local Student Model
student = osma.LanguageModel("google/gemma-2-2b-it", provider=osma.ModelProvider.LOCAL)
student.train(trainset)

# Run Inference
print(student(sg, text="I love this framework!"))
```

## Installation

Using `uv`:
```bash
uv add osma
```

Using `pip`:
```bash
pip install osma
```

## Environment Variables

To use Osma, export the API keys required by the models you intend to use.

- **`HF_TOKEN`**: Required for downloading gated open-source models from Hugging Face (student models).

When using Osma to curate a trainset, you will need to specify the appropriate API key for the managed model's provider:
- **`GEMINI_API_KEY`**: Required if using Google Gemini as a teacher.
- **`OPENAI_API_KEY`**: Required if using OpenAI models as a teacher.

Note: this list is non-exhaustive; consult your model provider's documentation for the environment variable it expects.
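
For example, a session using a Gemini teacher and a gated Hugging Face student model might export (the values below are placeholders, not real keys):

```shell
# Teacher: API key for the managed model's provider (Gemini here)
export GEMINI_API_KEY="your-gemini-api-key"

# Student: Hugging Face token for downloading gated models
export HF_TOKEN="your-hf-token"
```

Add these lines to your shell profile (e.g., `~/.bashrc`) to make them persistent across sessions.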

## Key Methods

### Dataset

Initialize a dataset from a JSONL file.
```python
ds = osma.Dataset("path/to/data.jsonl")
```

Randomly shuffle the dataset rows.
```python
ds = ds.shuffle()
```

Select a subset of rows based on index range.
```python
ds = ds.range(0, 100)
```

Return the first n rows of the dataset.
```python
ds = ds.head(5)
```

Save the dataset to a file.
```python
ds.save("output.jsonl")
```

### Signature

Define a task signature with input fields, output fields, and optional reasoning.
```python
sg = osma.Signature(
    osma.InputFields("input_col"),
    osma.OutputField("output_name", str),
    reasoning=True
)
```

### Trainset

Curate a new trainset by processing a dataset with a teacher model.
```python
ts = osma.Trainset(ds, sg, teacher_model)
```

Load an existing trainset from a file.
```python
ts = osma.Trainset("train.jsonl")
```

Filter rows based on a comparison function between generated and source data.
```python
ts = ts.filter(ds, lambda x, y: x['field'] == y['field'])
```

Save the trainset to a file.
```python
ts.save("curated.jsonl")
```

Randomly shuffle the trainset rows.
```python
ts = ts.shuffle()
```

Select a subset of the trainset based on index range.
```python
ts = ts.range(0, 100)
```

Return the first n rows of the trainset.
```python
ts = ts.head(5)
```

### Model

Initialize a managed teacher model using the provider/model string.
```python
teacher = osma.LanguageModel("gemini/gemini-1.5-flash")
```

Initialize a local student model.
```python
student = osma.LanguageModel("google/gemma-2b", provider=osma.ModelProvider.LOCAL)
```

Generate output for a given signature and specific input arguments.
```python
result = model(sg, text="example input")
```

Fine-tune the local model on the provided trainset.
```python
student.train(trainset)
```

Evaluate the model on a test dataset using a scoring function.
```python
results = student.evaluate(sg, eval_ds, lambda res, row: res['val'] == row['val'])
```
