Metadata-Version: 2.4
Name: dfcontext
Version: 0.1.0
Summary: Generate optimal LLM context from pandas DataFrames within a token budget
Project-URL: Homepage, https://github.com/sserada/dfcontext
Project-URL: Repository, https://github.com/sserada/dfcontext
Project-URL: Issues, https://github.com/sserada/dfcontext/issues
Project-URL: Documentation, https://github.com/sserada/dfcontext#readme
Author: So
License-Expression: MIT
License-File: LICENSE
Keywords: ai,context,data-science,dataframe,llm,pandas,prompt-engineering,token
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy>=1.23.0
Requires-Dist: pandas>=1.5.0
Provides-Extra: all
Requires-Dist: pyyaml>=6.0; extra == 'all'
Requires-Dist: tiktoken>=0.5.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: build; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pandas-stubs; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.5.0; extra == 'tiktoken'
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == 'yaml'
Description-Content-Type: text/markdown

# dfcontext

> Generate optimal LLM context from pandas DataFrames within a token budget.

[![PyPI version](https://badge.fury.io/py/dfcontext.svg)](https://pypi.org/project/dfcontext/)
[![Python](https://img.shields.io/pypi/pyversions/dfcontext)](https://pypi.org/project/dfcontext/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://github.com/sserada/dfcontext/actions/workflows/ci.yml/badge.svg)](https://github.com/sserada/dfcontext/actions)

## Why?

You have a 100K-row DataFrame. Your LLM has a context window.

- `df.to_string()` gives you **millions of tokens**
- `df.head()` gives you **5 rows** with no statistical context

**dfcontext** gives you the sweet spot — intelligent, column-type-aware summarization that fits within your token budget. No LLM calls required.

## Install

```bash
pip install dfcontext
```

Optional dependencies for accurate token counting and YAML output:

```bash
pip install dfcontext[all]       # tiktoken + pyyaml
pip install dfcontext[tiktoken]  # accurate token counting only
pip install dfcontext[yaml]      # YAML format output only
```

## Quick Start

```python
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)
```

**Output:**

```markdown
## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
```

## Features

- **Column-type-aware analysis** — different strategies for numeric, categorical, text, datetime, and boolean columns
- **Token budget management** — output always fits within your specified token limit
- **Query hints** — tell it what you're analyzing, and it prioritizes relevant columns
- **Multiple formats** — Markdown, plain text, or YAML output
- **Zero LLM dependency** — pure data processing, works with any LLM provider
- **Fast** — handles 100K rows in under a second

## Advanced Usage

### Query Hints

Provide a hint to allocate more token budget to relevant columns:

```python
ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
```

### Output Formats

```python
ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml
```

### Configuration Object

For full control, use `ContextConfig`:

```python
from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)
```

### Column Analysis

Get structured analysis results as Python objects:

```python
from dfcontext import analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
```

### Token Counting

```python
from dfcontext import count_tokens

tokens = count_tokens("some text")
```

### Use with Claude

```python
import anthropic
from dfcontext import to_context

df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
```

## API Reference

| Function | Description |
|----------|-------------|
| `to_context(df, ...)` | Generate LLM context string from a DataFrame |
| `analyze_columns(df)` | Get structured column analysis results |
| `count_tokens(text)` | Count tokens in text |

| Class | Description |
|-------|-------------|
| `ContextConfig` | Configuration dataclass for `to_context()` |
| `ColumnSummary` | Structured result from column analysis |
| `BudgetPlan` | Token budget allocation plan |

## License

MIT
