Metadata-Version: 2.3
Name: bertopic-easy
Version: 0.1.0
Summary: 
Author: Boris Dev
Author-email: boris.dev@gmail.com
Requires-Python: >=3.12,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: bertopic (>=0.16.4,<0.17.0)
Requires-Dist: diskcache (>=5.6.3,<6.0.0)
Requires-Dist: instructor (>=1.7.0,<2.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: numpy (==1.26.4)
Requires-Dist: openai (>=1.58.1,<2.0.0)
Requires-Dist: pydantic (>=2.7.1,<3.0.0)
Requires-Dist: pytest-asyncio (>=0.25.3,<0.26.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: update (>=0.0.1,<0.0.2)
Description-Content-Type: text/markdown

# BERTopic Easy

Polishing [BERTopic](https://maartengr.github.io/BERTopic/index.html) output using OpenAI's `o3-mini`.

## Motivations

-   **OpenAI's `o3-mini`** names clusters well.
-   **OpenAI's `o3-mini`** reduces outliers better than [BERTopic](https://maartengr.github.io/BERTopic/index.html)'s default method.

## Example usage

```python
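import os

from openai import AsyncOpenAI, OpenAI

from bertopic_easy.main import bertopic_easy

# Sync and async clients; both read the key from OPENAI_API_KEY.
openai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
async_openai = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))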

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]


clusters = bertopic_easy(
    texts=texts,
    openai=openai,
    async_openai=async_openai,
    reasoning_effort="low",
    subject="personal diet intervention outcomes",
)
print(clusters)
```

## Example output

![pytest output](images/polished_clusters.png)

## What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines
embeddings with LLM completions. The embeddings drive the clustering; the LLM
completions name the clusters and classify the outliers.

```mermaid
graph TD;
    A[Start] -->|sentences| B{1.Run BERTopic};
    B -->|clusters| C[2.Name clusters];
    C -->|target classifications| D[3.Classify and merge outliers];
    B -->|outliers| D;
```

### Step 1 - Cluster sentences

The BERTopic library clusters the sentences using embeddings from OpenAI's `text-embedding-3-large` embedding model.
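
As a rough sketch of what comes out of this step: BERTopic assigns one topic id per input text and uses `-1` for texts it could not place in any cluster. The helper below splits such a result into clusters and outliers. The name `split_clusters_and_outliers` and the sample topic ids are hypothetical and are not taken from this package's code.

```python
from collections import defaultdict

OUTLIER_TOPIC = -1  # BERTopic's topic id for texts that fit no cluster


def split_clusters_and_outliers(texts, topic_ids):
    """Group texts by topic id and return (clusters, outliers)."""
    if len(texts) != len(topic_ids):
        raise ValueError("texts and topic_ids must have the same length")
    clusters = defaultdict(list)
    outliers = []
    for text, topic_id in zip(texts, topic_ids):
        if topic_id == OUTLIER_TOPIC:
            outliers.append(text)
        else:
            clusters[topic_id].append(text)
    return dict(clusters), outliers


# Hypothetical topic ids, standing in for the topics returned by
# BERTopic().fit_transform(texts, embeddings=embeddings).
texts = ["16/8 fasting", "7 day fast", "Atkins diet", "American cheese"]
topic_ids = [0, 0, 1, -1]

clusters, outliers = split_clusters_and_outliers(texts, topic_ids)
print(clusters)
print(outliers)
```

Keeping the outliers in a separate list makes it easy to hand them to a later classification step.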

### Step 2 - Name clusters

The `o3-mini` model generates a name for each cluster produced in **Step 1**.
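
One way to picture the naming step is a prompt built per cluster and sent to the model. The sketch below is an assumption about how that could look, not this package's implementation: the prompt wording, `build_naming_prompt`, `name_clusters`, and the `complete` callable are all hypothetical. A fake completion function stands in for the real `o3-mini` call so the example runs offline.

```python
def build_naming_prompt(cluster_texts, subject):
    """Build a prompt asking for a short name for one cluster (hypothetical wording)."""
    bullet_list = "\n".join(f"- {text}" for text in cluster_texts)
    return (
        f"The following items are about {subject}.\n"
        f"{bullet_list}\n"
        "Reply with a short name (at most five words) that describes all of them."
    )


def name_clusters(clusters, subject, complete):
    """Map each cluster id to the name returned by `complete` (prompt -> text)."""
    return {
        cluster_id: complete(build_naming_prompt(cluster_texts, subject)).strip()
        for cluster_id, cluster_texts in clusters.items()
    }


def fake_complete(prompt):
    """Offline stand-in for an LLM call; a real version would query the model."""
    return "Fasting" if "fast" in prompt.lower() else "Named diets"


clusters = {0: ["16/8 fasting", "7 day fast"], 1: ["Atkins diet", "90/10 diet"]}
names = name_clusters(clusters, "personal diet intervention outcomes", fake_complete)
print(names)
```

Passing the completion function in as an argument keeps the prompt logic testable without network access.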

### Step 3 - Re-group outliers (not implemented yet)

Outlier sentences, meaning those that did not fit into any of the BERTopic
clusters from **Step 1**, are classified by the `o3-mini` model against the
cluster names produced in **Step 2**.
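
Since this step is marked as not implemented yet, the following is only a sketch of how the merge could work, assuming a classifier that picks one of the cluster names for each outlier. `merge_outliers`, the `classify` callable, and the keyword-based `fake_classify` stand-in are hypothetical; a real version would ask the LLM instead of matching keywords.

```python
def merge_outliers(named_clusters, outliers, classify):
    """Assign each outlier to the cluster name chosen by `classify`.

    `classify(text, names)` should return one of `names`, or None to
    leave the text unassigned.
    """
    merged = {name: list(texts) for name, texts in named_clusters.items()}
    names = list(merged)
    still_outliers = []
    for text in outliers:
        choice = classify(text, names)
        if choice in merged:
            merged[choice].append(text)
        else:
            still_outliers.append(text)  # None or an unexpected label
    return merged, still_outliers


def fake_classify(text, names):
    """Offline keyword matcher standing in for an LLM classification call."""
    lowered = text.lower()
    if "fast" in lowered and "Fasting" in names:
        return "Fasting"
    if "seed oil" in lowered and "Seed oils" in names:
        return "Seed oils"
    return None


named_clusters = {
    "Fasting": ["16/8 fasting", "7 day fast"],
    "Seed oils": ["Avoid seed oils"],
}
outliers = ["36-hour fast", "Limiting seed oils", "American cheese"]

merged, remaining = merge_outliers(named_clusters, outliers, fake_classify)
print(merged)
print(remaining)
```

Texts the classifier cannot place stay in `remaining` rather than being forced into a cluster.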

## Install

-   `git clone` this repo
-   `cd` to the root of the repo
-   set `OPENAI_API_KEY` as an environment variable or in a `.env` local file
-   `poetry install`
-   `poetry shell` (activates the virtual environment, if needed)
-   `poetry run python demo.py`
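
To fail early when the key is missing, a small check along these lines can run before any OpenAI client is created. This is a sketch, not code from this repo: `get_openai_api_key` is a hypothetical helper, and `"sk-example"` is a placeholder value. If the key lives in a `.env` file, it is assumed that `load_dotenv()` from `python-dotenv` has been called first so the value is present in `os.environ`.

```python
import os


def get_openai_api_key(environ=None):
    """Return OPENAI_API_KEY from the given mapping, or raise a clear error."""
    environ = os.environ if environ is None else environ
    key = (environ.get("OPENAI_API_KEY") or "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it or add it to a .env file"
        )
    return key


# Placeholder mapping for illustration; omit the argument to read os.environ.
print(get_openai_api_key({"OPENAI_API_KEY": "sk-example"}))
```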

## Run smoke test

```shell
poetry run pytest tests/test_main.py::test_bertopic_easy
```

