Metadata-Version: 2.4
Name: openaivec
Version: 2.3.1
Summary: Generative mutation for tabular calculation
Project-URL: Homepage, https://microsoft.github.io/openaivec/
Project-URL: Repository, https://github.com/microsoft/openaivec
Author-email: Hiroki Mizukami <hmizukami@microsoft.com>
License: MIT
License-File: LICENSE
Keywords: llm,openai,openai-api,openai-python,pandas,pyspark
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: azure-identity>=1.25.1
Requires-Dist: duckdb>=1.0.0
Requires-Dist: ipywidgets>=8.1.7
Requires-Dist: openai>=2.0.0
Requires-Dist: pandas>=2.3.3; python_full_version < '3.11'
Requires-Dist: pandas>=3.0.0; python_full_version >= '3.11'
Requires-Dist: pyarrow>=19.0.0
Requires-Dist: tiktoken>=0.9.0
Requires-Dist: tqdm>=4.67.1
Provides-Extra: spark
Requires-Dist: pyspark>=4.0.0; extra == 'spark'
Description-Content-Type: text/markdown

# openaivec

AI text processing for pandas and Spark. Apply one prompt to many rows with automatic batching and caching.

[Contributor guidelines](AGENTS.md)

## Quick start

```bash
pip install openaivec
```

Apply one prompt to many values:

```python
import os
import pandas as pd
from openaivec import pandas_ext

os.environ["OPENAI_API_KEY"] = "your-api-key"

fruits = pd.Series(["apple", "banana", "cherry"])
french_names = fruits.ai.responses("Translate this fruit name to French.")
print(french_names.tolist())
# ['pomme', 'banane', 'cerise']
```

For Azure OpenAI and custom client setup, see [pandas authentication options](#pandas-authentication-options).

**Pandas tutorial (GitHub Pages):** https://microsoft.github.io/openaivec/examples/pandas/

## Benchmarks

Simple task benchmark from [benchmark.ipynb](https://github.com/microsoft/openaivec/blob/main/docs/examples/benchmark.ipynb) (100 numeric strings → integer literals, `Series.aio.responses`, model `gpt-5.1`):

| Mode                | Settings                                        | Time (s) |
| ------------------- | ----------------------------------------------- | -------- |
| Serial              | `batch_size=1`, `max_concurrency=1`             | ~141     |
| Batching            | default `batch_size`, `max_concurrency=1`       | ~15      |
| Concurrent batching | default `batch_size`, default `max_concurrency` | ~6       |

Batching alone removes most of the per-request HTTP overhead, and combining batching with concurrency cuts total runtime to a few seconds while still yielding one output per input.
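The relative speedups implied by the table can be checked with a few lines of arithmetic. This is only a sketch based on the rounded timings above; the variable names are illustrative and not part of `openaivec`.

```python
# Approximate timings in seconds, copied from the benchmark table above.
timings = {"serial": 141, "batching": 15, "concurrent_batching": 6}

# Speedup of each mode relative to the serial baseline.
speedups = {mode: timings["serial"] / seconds for mode, seconds in timings.items()}

for mode, factor in speedups.items():
    print(f"{mode}: {factor:.1f}x")
# serial: 1.0x
# batching: 9.4x
# concurrent_batching: 23.5x
```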

<img alt="image" src="https://github.com/user-attachments/assets/8ace9bcd-bcae-4023-a37e-13082cd645e5" />

## Contents

- [Why openaivec?](#why-openaivec)
- [Overview](#overview)
- [Core Workflows](#core-workflows)
- [Pandas authentication options](#pandas-authentication-options)
- [Using with Apache Spark UDFs](#using-with-apache-spark-udfs)
- [Spark authentication options](#spark-authentication-options)
- [Using with DuckDB](#using-with-duckdb)
- [Building Prompts](#building-prompts)
- [Using with Microsoft Fabric](#using-with-microsoft-fabric)
- [Contributing](#contributing)
- [Additional Resources](#additional-resources)
- [Community](#community)

## Why openaivec?

- Drop-in `.ai` and `.aio` accessors keep pandas analysts in familiar tooling.
- OpenAI batch-optimized: `BatchCache`/`AsyncBatchCache` coalesce requests, dedupe prompts, preserve order, and release waiters on failure.
- Reasoning support mirrors the OpenAI SDK; structured outputs accept Pydantic `response_format`.
- Built-in caches and retries remove boilerplate; pandas and async helpers can share caches explicitly, while Spark UDFs dedupe repeated inputs within each partition.
- Spark UDFs, DuckDB integration, and Microsoft Fabric guides move notebooks into production-scale ETL.
- Prompt tooling (`FewShotPromptBuilder`, `improve`) and the task library ship curated prompts with validated outputs.

## Overview

`openaivec` vectorizes OpenAI requests so that you process many inputs per call instead of one at a time. Batching proxies dedupe inputs, keep outputs in input order, and unblock waiters even when an upstream call fails. Shared-cache helpers reuse expensive prompts across pandas and async flows, while Spark UDF builders dedupe repeated inputs within each partition. Reasoning models honor SDK semantics. Requires Python 3.10+.
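To make the dedupe, batch, and reorder idea concrete, here is a minimal standard-library sketch. It is not the `BatchCache` implementation: `batched_unique_map` and `fake_model` are hypothetical names, and the "model" simply upper-cases its input so the example runs without an API key.

```python
from typing import Callable


def batched_unique_map(
    inputs: list[str],
    call_model: Callable[[list[str]], list[str]],
    batch_size: int = 2,
) -> list[str]:
    """Send each distinct input once, in batches, and return outputs in input order."""
    unique = list(dict.fromkeys(inputs))  # dedupe while keeping first-seen order
    cache: dict[str, str] = {}
    for start in range(0, len(unique), batch_size):
        batch = unique[start : start + batch_size]
        # One call handles the whole batch; outputs are assumed to align with inputs.
        for item, output in zip(batch, call_model(batch)):
            cache[item] = output
    # Expand back to the original length and order, duplicates included.
    return [cache[item] for item in inputs]


calls: list[list[str]] = []


def fake_model(batch: list[str]) -> list[str]:
    """Hypothetical stand-in for an API call; records each batch it receives."""
    calls.append(batch)
    return [text.upper() for text in batch]


result = batched_unique_map(["a", "b", "a", "c", "b"], fake_model)
print(result)  # ['A', 'B', 'A', 'C', 'B']
print(calls)   # [['a', 'b'], ['c']]
```

Five inputs produce five outputs, but only three distinct values reach the model, in two batches.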

## Core Workflows

### Direct API usage

For maximum control over batch processing:

```python
from openai import OpenAI
from openaivec import BatchResponses

# Initialize the batch client
client = BatchResponses.of(
    client=OpenAI(),
    model_name="gpt-5.1",
    system_message="Please answer only with 'xx family' and do not output anything else.",
    # batch_size defaults to None (automatic optimization)
)

result = client.parse(
    ["panda", "rabbit", "koala"],
    reasoning={"effort": "none"},
)
print(result)  # Expected output: ['bear family', 'rabbit family', 'koala family']
```

📓 **[Complete tutorial →](https://microsoft.github.io/openaivec/examples/pandas/)**

### pandas authentication options

Configure authentication once before using `.ai` or `.aio`.

#### OpenAI (API key)

```python
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
```

#### Azure OpenAI (API key)

```python
import os

os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-openai-api-key"
os.environ["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
```

The base URL must end with `/openai/v1/`.
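A quick check before setting the environment variable can catch a missing trailing slash. The helper below is a hypothetical sketch, not an `openaivec` function; it only tests the suffix rule stated above.

```python
def has_v1_suffix(base_url: str) -> bool:
    """Return True when the URL ends with the required '/openai/v1/' suffix."""
    return base_url.endswith("/openai/v1/")


print(has_v1_suffix("https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"))  # True
print(has_v1_suffix("https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1"))   # False
```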

#### Azure OpenAI with Entra ID (no API key)

```python
import os

os.environ["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
os.environ.pop("AZURE_OPENAI_API_KEY", None)
```

`openaivec` uses `DefaultAzureCredential` when `AZURE_OPENAI_API_KEY` is not set.

#### Custom clients (optional)

```python
import openaivec
from openai import AsyncOpenAI, OpenAI
from openaivec import pandas_ext

openaivec.set_client(OpenAI())
openaivec.set_async_client(AsyncOpenAI())
```

### pandas integration (recommended)

The easiest way to get started with your DataFrames (after authentication):

```python
import openaivec
import pandas as pd
from openaivec import pandas_ext

openaivec.set_responses_model("gpt-5.1")

df = pd.DataFrame({"name": ["panda", "rabbit", "koala"]})

result = df.assign(
    family=lambda df: df.name.ai.responses(
        "What animal family? Answer with 'X family'",
        reasoning={"effort": "none"},
    )
)
```

| name   | family           |
| ------ | ---------------- |
| panda  | bear family      |
| rabbit | rabbit family    |
| koala  | marsupial family |

📓 **[Interactive pandas examples →](https://microsoft.github.io/openaivec/examples/pandas/)**

### Using with reasoning models

Reasoning models (o1-preview, o1-mini, o3-mini, etc.) follow OpenAI SDK semantics. Pass `reasoning` when you want to override model defaults.

```python
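import openaivec
import pandas as pd
from openaivec import pandas_ext

openaivec.set_responses_model("o1-mini")  # Set your reasoning model

# Illustrative input; replace with your own DataFrame
df = pd.DataFrame({"text": ["The launch slipped twice, yet customers stayed loyal."]})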

result = df.assign(
    analysis=lambda df: df.text.ai.responses(
        "Analyze this text step by step",
        reasoning={"effort": "none"},  # Optional: mirrors the OpenAI SDK argument
    )
)
```

Omit `reasoning` to use the model defaults, or tune it per request with the same shape as the OpenAI SDK (a `dict` with an `effort` key).

### Using pre-configured tasks

For common text processing operations, openaivec provides ready-to-use tasks that eliminate the need to write custom prompts:

```python
import pandas as pd
from openaivec import pandas_ext
from openaivec.task import nlp, customer_support

text_df = pd.DataFrame({
    "text": [
        "Great product, fast delivery!",
        "Need help with billing issue",
        "How do I reset my password?"
    ]
})

results = text_df.assign(
    sentiment=lambda df: df.text.ai.task(
        nlp.sentiment_analysis(),
        reasoning={"effort": "none"},
    ),
    intent=lambda df: df.text.ai.task(
        customer_support.intent_analysis(),
        reasoning={"effort": "none"},
    ),
)

# Extract structured results into separate columns
extracted_results = results.ai.extract("sentiment")
```

### Asynchronous processing with `.aio`

For high-throughput workloads, the `.aio` accessor provides async versions of all operations:

```python
import asyncio
import openaivec
import pandas as pd
from openaivec import pandas_ext

openaivec.set_responses_model("gpt-5.1")

df = pd.DataFrame({"text": [
    "This product is amazing!",
    "Terrible customer service",
    "Good value for money",
    "Not what I expected"
] * 250})  # 1000 rows for demonstration

async def process_data():
    return await df["text"].aio.responses(
        "Analyze sentiment and classify as positive/negative/neutral",
        reasoning={"effort": "none"},  # Recommended for reasoning models
        max_concurrency=12,  # Allow up to 12 concurrent requests
    )

sentiments = asyncio.run(process_data())
```

**Performance benefits:** Parallel processing with automatic batching/deduplication, built-in rate limiting and error handling, and memory-efficient streaming for large datasets.
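The effect of a concurrency cap such as `max_concurrency` can be illustrated with plain `asyncio`. This is a sketch of the general technique (a semaphore bounding in-flight tasks), not the library's internals; `run_bounded` and the sleeping `worker` are hypothetical stand-ins for real requests.

```python
import asyncio


async def run_bounded(items: list[str], max_concurrency: int) -> tuple[list[str], int]:
    """Process items with at most max_concurrency tasks in flight; report the peak."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def worker(item: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # hypothetical stand-in for a network call
            in_flight -= 1
            return item.upper()

    # gather returns results in the order the tasks were passed in.
    results = await asyncio.gather(*(worker(item) for item in items))
    return list(results), peak


outputs, peak = asyncio.run(
    run_bounded([f"row{i}" for i in range(20)], max_concurrency=4)
)
print(outputs[:3], peak)  # ['ROW0', 'ROW1', 'ROW2'] 4
```

All 20 items complete in input order, yet no more than four are ever in flight at once.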

## Using with Apache Spark UDFs

Scale to enterprise datasets with distributed processing.

📓 **[Spark tutorial →](https://microsoft.github.io/openaivec/examples/spark/)**

### Spark authentication options

Choose one setup path before registering UDFs.

#### OpenAI (API key)

```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup

spark = SparkSession.builder.getOrCreate()
setup(
    spark,
    api_key="your-openai-api-key",
    responses_model_name="gpt-5.1",
    embeddings_model_name="text-embedding-3-small",
)
```

#### Azure OpenAI (API key)

```python
from pyspark.sql import SparkSession
from openaivec.spark_ext import setup_azure

spark = SparkSession.builder.getOrCreate()
setup_azure(
    spark,
    api_key="your-azure-openai-api-key",
    base_url="https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/",
    responses_model_name="my-gpt-deployment",
    embeddings_model_name="my-embedding-deployment",
)
```

The base URL must end with `/openai/v1/`.

#### Azure OpenAI with Entra ID (no API key)

```python
import os

os.environ["AZURE_OPENAI_BASE_URL"] = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/openai/v1/"
os.environ.pop("AZURE_OPENAI_API_KEY", None)
```

`openaivec` uses `DefaultAzureCredential` when `AZURE_OPENAI_API_KEY` is not set.

Create and register UDFs using the provided helpers:

```python
from openaivec.spark_ext import responses_udf

spark.udf.register(
    "extract_brand",
    responses_udf(
        instructions="Extract the brand name from the product. Return only the brand name.",
        reasoning={"effort": "none"},
    )
)

products = spark.createDataFrame(
    [("Nike Air Max",), ("Apple iPhone 15",)],
    ["product_name"],
)
products.selectExpr("product_name", "extract_brand(product_name) AS brand").show()
```

Other helper UDFs are available: `task_udf`, `embeddings_udf`, `count_tokens_udf`, `similarity_udf`, and `parse_udf`.

### Spark performance tips

- UDFs automatically dedupe and cache repeated inputs within each partition.
- `batch_size=None` auto-optimizes; set 32–128 for fixed sizes if needed.
- `max_concurrency` is per executor; total concurrency = executors × max_concurrency. Start with 4–12.
- Monitor rate limits and adjust concurrency to your OpenAI tier.
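The concurrency rule above is simple multiplication, and it can be inverted to pick a per-executor setting from a request budget. A small sketch with hypothetical helper names and illustrative numbers:

```python
def total_concurrency(executors: int, max_concurrency: int) -> int:
    """Upper bound on simultaneous requests across the cluster."""
    return executors * max_concurrency


def max_concurrency_for(budget: int, executors: int) -> int:
    """Largest per-executor setting that keeps the cluster within a request budget."""
    return max(1, budget // executors)


print(total_concurrency(executors=8, max_concurrency=12))  # 96
print(max_concurrency_for(budget=40, executors=8))         # 5
```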

## Using with DuckDB

Register AI-powered functions as DuckDB UDFs and use them in pure SQL. Structured outputs are returned as native `STRUCT` types with direct field access.

```python
import openaivec
import duckdb
from pydantic import BaseModel
from typing import Literal
from openaivec import duckdb_ext

openaivec.set_responses_model("gpt-5.4")


class Sentiment(BaseModel):
    label: Literal["positive", "negative", "neutral"]
    confidence: float
    summary: str


conn = duckdb.connect()

duckdb_ext.responses_udf(
    conn,
    "analyze_sentiment",
    instructions="Analyze customer sentiment. Return label, confidence (0-1), and a one-sentence summary.",
    response_format=Sentiment,
)

# Query CSV directly — structured fields, no JSON parsing
conn.sql("""
    SELECT
        customer,
        analyze_sentiment(response).label      AS sentiment,
        analyze_sentiment(response).confidence AS confidence,
        analyze_sentiment(response).summary    AS summary
    FROM 'survey.csv'
""")

# Aggregate with standard SQL
conn.sql("""
    WITH results AS (
        SELECT analyze_sentiment(response).label AS sentiment
        FROM 'survey.csv'
    )
    SELECT sentiment, COUNT(*) AS count
    FROM results
    GROUP BY sentiment
""")
```

Embedding UDFs work the same way:

```python
duckdb_ext.embeddings_udf(conn, "embed")

conn.sql("""
    SELECT a.text, list_cosine_similarity(embed(a.text), embed(b.text)) AS similarity
    FROM docs a, queries b
""")
```

All UDFs use Arrow vectorized execution — DuckDB sends batches of rows that are processed with async concurrency and automatic deduplication.

📓 **[DuckDB tutorial →](https://microsoft.github.io/openaivec/examples/duckdb/)**

## Building Prompts

Few-shot prompts improve LLM quality. `FewShotPromptBuilder` structures purpose, cautions, and examples; `improve()` iterates with OpenAI to remove contradictions.

```python
from openaivec import FewShotPromptBuilder

prompt = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    .example("Apple", "Fruit")
    .example("Car", "Vehicle")
    .improve(max_iter=1)  # optional
    .build()
)
```
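For readers new to few-shot prompting, the sketch below shows how a purpose, cautions, and examples might be laid out as plain text. `PromptSketch` is a hypothetical class for illustration only; it is not `FewShotPromptBuilder`, and the format the library actually emits may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSketch:
    """Hypothetical container; not the library's FewShotPromptBuilder."""

    purpose: str
    cautions: list[str] = field(default_factory=list)
    examples: list[tuple[str, str]] = field(default_factory=list)

    def render(self) -> str:
        """Lay out the purpose, then cautions, then input/output examples."""
        lines = [f"Purpose: {self.purpose}"]
        lines += [f"Caution: {caution}" for caution in self.cautions]
        lines += [f"Input: {source} -> Output: {target}" for source, target in self.examples]
        return "\n".join(lines)


sketch = PromptSketch(
    purpose="Return the smallest category that includes the given word",
    cautions=["Never use proper nouns as categories"],
    examples=[("Apple", "Fruit"), ("Car", "Vehicle")],
)
print(sketch.render())
```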

📓 **[Advanced prompting techniques →](https://microsoft.github.io/openaivec/examples/prompt/)**

## Using with Microsoft Fabric

[Microsoft Fabric](https://www.microsoft.com/en-us/microsoft-fabric/) is a unified, cloud-based analytics platform. Add `openaivec` from PyPI in your Fabric environment, select it in your notebook, and use `openaivec.spark_ext` like standard Spark.

### Recommended authentication: Service Principal + Key Vault

Inside Fabric notebooks, the recommended way to authenticate against Azure OpenAI / Azure AI Foundry is to keep a Service Principal client secret in Azure Key Vault and retrieve it through [`notebookutils.credentials.getSecret`](https://learn.microsoft.com/fabric/data-engineering/notebookutils/notebookutils-credentials#get-secret). Never hard-code secrets in notebooks.

**One-time setup**

1. Create a Service Principal (App Registration) in Microsoft Entra ID and generate a client secret.
2. Assign the Service Principal a data-plane role on the AI resource so it can call inference (see role table below).
3. Store the client secret in an Azure Key Vault.
4. Grant the **Fabric Workspace identity** the `Key Vault Secrets User` role on that Key Vault. The workspace identity — not the user — is what authenticates from the notebook to Key Vault.

**Required Azure roles**

| Identity | Role | Scope | Purpose |
|---|---|---|---|
| Service Principal | [`Cognitive Services OpenAI User`](https://learn.microsoft.com/azure/ai-foundry/openai/how-to/role-based-access-control#cognitive-services-openai-user) | Azure OpenAI resource (or its resource group / subscription) | Call `responses` / `embeddings` against an Azure OpenAI endpoint. |
| Service Principal | [`Azure AI User`](https://learn.microsoft.com/azure/ai-foundry/concepts/rbac-azure-ai-foundry#azure-ai-user) | Azure AI Foundry project (Cognitive Services / AI Services account) | Call inference through a Foundry project endpoint (`/api/projects/<name>/openai/v1/`). |
| Fabric Workspace identity | [`Key Vault Secrets User`](https://learn.microsoft.com/azure/key-vault/general/rbac-guide#azure-built-in-roles-for-key-vault-data-plane-operations) | The Key Vault holding the SP secret | Allow `notebookutils.credentials.getSecret` to read the secret at runtime. |

Notes:

- Use **`Cognitive Services OpenAI User`** when you talk directly to an Azure OpenAI resource endpoint (`https://<resource>.openai.azure.com/` or `https://<resource>.services.ai.azure.com/openai/v1/`). It grants the minimum needed to invoke deployments; do **not** assign `Cognitive Services OpenAI Contributor` unless the SP must also manage deployments.
- Use **`Azure AI User`** when you call a Foundry project endpoint (Option B below). Foundry data-plane RBAC is documented at [Role-based access control for Azure AI Foundry](https://learn.microsoft.com/azure/ai-foundry/concepts/rbac-azure-ai-foundry).
- The Key Vault must use [Azure RBAC permission model](https://learn.microsoft.com/azure/key-vault/general/rbac-guide) (not legacy access policies) for `Key Vault Secrets User` to take effect.

References: [NotebookUtils credentials](https://learn.microsoft.com/fabric/data-engineering/notebookutils/notebookutils-credentials), [Fabric Spark security: accessing Key Vault](https://learn.microsoft.com/fabric/data-engineering/spark-best-practices-security#accessing-azure-key-vault-akv-from-notebook), [Azure OpenAI with Microsoft Entra ID](https://learn.microsoft.com/azure/ai-foundry/openai/how-to/managed-identity).

#### Option A — env-var driven (recommended)

`openaivec` detects Fabric and pulls the client secret from Key Vault automatically when these four env vars are set, then builds the bearer-token provider for you:

```python
import os

os.environ["AZURE_TENANT_ID"]          = "<your-tenant-id>"          # Service Principal tenant
os.environ["AZURE_CLIENT_ID"]          = "<your-client-id>"          # Service Principal client ID
os.environ["KEY_VAULT_URL"]            = "https://<your-keyvault>.vault.azure.net/"
os.environ["KEY_VAULT_SECRET_NAME"]    = "<your-secret-name>"        # SP client secret in KV

os.environ["AZURE_OPENAI_BASE_URL"]    = "https://<your-resource>.services.ai.azure.com/openai/v1/"

import pandas as pd
from openaivec import pandas_ext  # noqa: F401  registers the .ai accessor

pd.Series(["apple", "banana"]).ai.responses("Translate to French.")
```

Do **not** set `AZURE_OPENAI_API_KEY`; leaving it unset is what triggers the Entra ID code path.
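A preflight check can confirm the settings before the first request. The sketch below is a hypothetical helper, not part of `openaivec`; it only encodes the rules stated in this README (four required variables, a base URL ending in `/openai/v1/`, and no API key).

```python
from collections.abc import Mapping

REQUIRED = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "KEY_VAULT_URL", "KEY_VAULT_SECRET_NAME")


def fabric_env_problems(env: Mapping[str, str]) -> list[str]:
    """List configuration problems for Option A; an empty list means none were found."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    if not env.get("AZURE_OPENAI_BASE_URL", "").endswith("/openai/v1/"):
        problems.append("AZURE_OPENAI_BASE_URL must end with /openai/v1/")
    if env.get("AZURE_OPENAI_API_KEY"):
        problems.append("AZURE_OPENAI_API_KEY is set, so the Entra ID path is not used")
    return problems


# Illustrative values only; the tenant and secret name are deliberately left out.
example_env = {
    "AZURE_CLIENT_ID": "<your-client-id>",
    "KEY_VAULT_URL": "https://<your-keyvault>.vault.azure.net/",
    "AZURE_OPENAI_BASE_URL": "https://<your-resource>.services.ai.azure.com/openai/v1/",
}
print(fabric_env_problems(example_env))
# ['AZURE_TENANT_ID is not set', 'KEY_VAULT_SECRET_NAME is not set']
```

In a notebook, pass `os.environ` instead of `example_env`.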

#### Option B — bring your own `OpenAI` client (Foundry project endpoint)

For Azure AI Foundry **project endpoints** (`/api/projects/<name>/openai/v1/`) you can build the `OpenAI` client manually and hand it to `openaivec`:

```python
import notebookutils
from azure.identity import ClientSecretCredential, get_bearer_token_provider
from openai import OpenAI

import openaivec

TENANT_ID = "<your-tenant-id>"           # Service Principal tenant
CLIENT_ID = "<your-client-id>"           # Service Principal client ID
KV_URI    = "https://<your-keyvault>.vault.azure.net/"
SECRET    = "<your-secret-name>"         # SP client secret in KV

client_secret = notebookutils.credentials.getSecret(KV_URI, SECRET)

credential = ClientSecretCredential(
    tenant_id=TENANT_ID,
    client_id=CLIENT_ID,
    client_secret=client_secret,
)
token_provider = get_bearer_token_provider(
    credential,
    "https://ai.azure.com/.default",
)

openaivec.set_client(OpenAI(
    base_url="https://<your-resource>.services.ai.azure.com/api/projects/<your-project>/openai/v1/",
    api_key=token_provider,
))
openaivec.set_responses_model("<your-deployment-or-model>")
```

The `https://ai.azure.com/.default` scope is the [documented audience for Microsoft Foundry Models endpoints](https://learn.microsoft.com/azure/foundry/foundry-models/concepts/endpoints#keyless-authentication) (`*.services.ai.azure.com`). Older classic Azure OpenAI references may show `https://cognitiveservices.azure.com/.default`; for the Foundry endpoint shape recommended above, use `ai.azure.com`.

## Contributing

We welcome contributions! Please:

1. Fork and branch from `main`.
2. Add or update tests when you change code.
3. Run formatting and tests before opening a PR.

Install dev deps:

```bash
uv sync --all-extras --dev
```

Lint and format:

```bash
uv run ruff check . --fix
```

Quick test pass:

```bash
uv run pytest -m "not slow and not requires_api"
```

## Additional Resources

📓 **[Customer feedback analysis →](https://microsoft.github.io/openaivec/examples/customer_analysis/)** - Sentiment analysis & prioritization  
📓 **[Survey data transformation →](https://microsoft.github.io/openaivec/examples/survey_transformation/)** - Unstructured to structured data  
📓 **[Asynchronous processing examples →](https://microsoft.github.io/openaivec/examples/aio/)** - High-performance async workflows  
📓 **[Auto-generate FAQs from documents →](https://microsoft.github.io/openaivec/examples/generate_faq/)** - Create FAQs using AI  
📓 **[All examples →](https://microsoft.github.io/openaivec/examples/pandas/)** - Complete collection of tutorials and use cases

## Community

Join our Discord community for support and announcements: https://discord.gg/hXCS9J6Qek
