Metadata-Version: 2.3
Name: nobs-canonicalize
Version: 0.5.0
Summary: 
Author: Boris Dev
Author-email: boris.dev@gmail.com
Requires-Python: >=3.11,<3.15
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: bertopic (>=0.16.4,<0.17.0)
Requires-Dist: diskcache (>=5.6.3,<6.0.0)
Requires-Dist: instructor (>=1.7.0,<2.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: more-itertools (>=10.6.0,<11.0.0)
Requires-Dist: numpy (==1.26.4)
Requires-Dist: openai (>=1.58.1,<2.0.0)
Requires-Dist: pydantic (>=2.7.1,<3.0.0)
Requires-Dist: pytest-asyncio (>=0.25.3,<0.26.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: update (>=0.0.1,<0.0.2)
Description-Content-Type: text/markdown

# nobs-canonicalize

This library reduces the development time needed to cluster documents
into topics, at least for prototyping.

> [!CAUTION]
> This library is in early development. It is not ready for production use.

The library has been tested on 2,500 sentences. A smell test of 10,000 sentences
appears to pass, but of course the topic quality at that scale is unknown, so be cautious and evaluate carefully.

The approach here is to use the HDBSCAN clustering algorithm from [BERTopic](https://maartengr.github.io/BERTopic/index.html)
along with OpenAI's `o3-mini` LLM to name the clusters and classify outliers.

## Motivations

-   Topic modeling is a time-consuming development task. I did not find any
    tools to help me quickly make quality topics for my prototype. The BERTopic
    library is a great tool, but its many configuration options make it hard
    to use quickly.
-   **OpenAI's cutting-edge `o3-mini`** names clusters well, and reduces outliers better than [BERTopic](https://maartengr.github.io/BERTopic/index.html)'s default method.

## Example usage

### OpenAI

```python
import os

from dotenv import load_dotenv
from rich import print

from nobs_canonicalize import nobs_canonicalize

load_dotenv()
openai_api_key = os.environ["OPENAI_API_KEY"]

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]

clusters = nobs_canonicalize(
    texts=texts,
    openai_api_key=openai_api_key,
    reasoning_effort="low",  # low, medium, high ... slow, slower, slowest
    subject="personal diet intervention outcomes",
)
print(clusters)
```

### Azure OpenAI

```python
import os

from dotenv import load_dotenv

from nobs_canonicalize import nobs_canonicalize_azure, AzureConfig

load_dotenv()

azure_config = AzureConfig(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",
    azure_endpoint="https://your-resource.openai.azure.com/",
    embedding_deployment="text-embedding-3-large",  # default
    llm_deployment="o3-mini",                        # default
)

clusters = nobs_canonicalize_azure(
    texts=texts,  # same list of sentences as in the OpenAI example above
    reasoning_effort="low",
    subject="personal diet intervention outcomes",
    azure_config=azure_config,
)
print(clusters)
```

## Example output

![polished clusters](images/polished_clusters.png)

## What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines
embeddings with LLM completions: the embeddings drive clustering, and the LLM
completions handle naming and outlier classification.

```mermaid
graph TD;
    A[Start] -->|sentences| B{1. Run BERTopic};
    B -->|clusters| C[2. Name clusters];
    C -->|target classifications| D;
    B -->|outliers| D[3. Classify and merge outliers];
```

### Step 1 - Cluster sentences

The BERTopic library clusters sentences using embeddings from OpenAI's `text-embedding-3-large` model.
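Internally, BERTopic hands the embeddings to HDBSCAN for density-based clustering. As a rough intuition for what that step produces (this is a toy stand-in, not the library's actual code), a greedy similarity grouping over cosine distance looks like this:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_cluster(embeddings, threshold=0.9):
    """Greedy similarity grouping -- a toy stand-in for HDBSCAN.

    Returns one label per embedding; -1 marks an outlier
    (BERTopic's convention for unassigned documents).
    """
    labels = []
    for i, emb in enumerate(embeddings):
        label = next(
            (labels[j] for j in range(i) if cosine(emb, embeddings[j]) >= threshold),
            i,  # no close neighbor yet: provisional new cluster id
        )
        labels.append(label)
    counts = Counter(labels)
    # clusters that attracted no neighbor are treated as outliers
    return [lbl if counts[lbl] > 1 else -1 for lbl in labels]


# two near-duplicate vectors group together; the orthogonal one is an outlier
print(toy_cluster([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]))  # → [0, 0, -1]
```

The real HDBSCAN is density-based and far more robust; this sketch only illustrates the input/output shape: a flat list of cluster labels with `-1` for outliers.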

### Step 2 - Name clusters

Names are generated by the `o3-mini` model for the clusters produced in **Step 1**.
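The library's actual prompt wording is internal and may differ, but the naming step amounts to showing the model a cluster's member sentences and the user-supplied `subject`, then asking for a short canonical label. A hypothetical sketch:

```python
def naming_prompt(subject: str, members: list[str]) -> str:
    """Build a hypothetical cluster-naming prompt.

    This is an illustrative sketch only; the library's real prompt
    is internal and likely worded differently.
    """
    joined = "\n".join(f"- {text}" for text in members)
    return (
        f"You are labeling topics about {subject}.\n"
        f"Give one short, canonical name for the cluster containing:\n{joined}"
    )


prompt = naming_prompt(
    subject="personal diet intervention outcomes",
    members=["16/8 fasting", "16:8 fasting", "4:3 fasting"],
)
print(prompt)
```

In the real pipeline this prompt would go to `o3-mini` (via the `instructor` and `openai` dependencies) with structured output so the name comes back as a validated Pydantic field.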

### Step 3 - Re-group outliers

Outlier sentences, those that did not fit into any of the BERTopic clusters
from **Step 1**, are classified by the `o3-mini` model using the cluster
names from **Step 2**, then merged into those clusters.

## Install

### Prerequisites

-   `python = ">=3.11,<3.15"`

```shell
pip install nobs-canonicalize
```

## Some BERTopic FAQs

[Why does it take so long to import BERTopic?](https://maartengr.github.io/BERTopic/faq.html#why-does-it-take-so-long-to-import-bertopic)

## Pointers for contributing developers

Run a smoke test:

```shell
git clone git@github.com:borisdev/nobs-canonicalize.git
cd nobs-canonicalize
poetry install
# set OPENAI_API_KEY as an environment variable (only needed for the integration test)
poetry run pytest tests/test_models.py -v  # unit tests, no API key needed
poetry run pytest tests/test_main.py::test_nobs_canonicalize -v  # integration test
# remember: importing the bertopic library takes a while
```

-   Make a tiny PR so I can see how I can help you get started.

