Metadata-Version: 2.1
Name: safedata-guard
Version: 1.0.4
Summary: A lightweight framework for safely enabling LLMs to analyze pandas/Polars data without exposing raw data or blindly executing generated code.
Author: Aravind Chakravarthy
Project-URL: Homepage, https://pypi.org/project/safedata-guard/
Keywords: ai,agent,llm,pandas,data,safety,sandbox
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas >=1.3
Requires-Dist: numpy >=1.20
Provides-Extra: dev
Requires-Dist: pytest >=7.0 ; extra == 'dev'
Requires-Dist: openpyxl >=3.0 ; extra == 'dev'
Provides-Extra: excel
Requires-Dist: openpyxl >=3.0 ; extra == 'excel'
Provides-Extra: polars
Requires-Dist: polars >=0.20 ; extra == 'polars'

# safedata-guard

A lightweight framework for safely letting LLMs analyze pandas/Polars data
without exposing raw data or blindly running the code they generate.

Most "chat with your data" tools send your whole table to the model and run
whatever code it writes, unchecked. safedata-guard changes both halves: it
sends a compact, quality-aware **summary** instead of raw rows, and it runs the
model's code behind guardrails on a **copy** of your data.

## What it does

**1. Summarises your data before it reaches the model.** Instead of pushing
100,000 rows into a prompt, it sends the columns, their types, a few sample
values, basic stats, and warnings about common data traps:

- numbers stored as text (`"$500"`, `"1,000"`)
- the same category written several ways (`"North"`, `"north "`, `"NORTH"`)
- dates stored as text, or Excel serial dates stored as plain numbers (`45292`)
- ID columns that are not actually unique
- columns that are completely empty, or mostly empty
- columns that hold the same value in every row
- duplicated column names
- negative values in columns whose names imply they should not have any
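
As a rough illustration of how two of these checks could work, the sketch below flags numbers stored as text and categories written several ways. It is not safedata-guard's implementation: the helper names `looks_numeric_text` and `category_variants`, the regex, and the 0.8 threshold are all assumptions, and it works on plain lists using only the standard library.

```python
import re
from collections import defaultdict

# Hypothetical pattern: optional sign, optional "$", digits with commas,
# optional decimals. Matches text like "$500", "1,000" or "-12.5".
_NUMERIC_TEXT = re.compile(r"^\s*-?\$?\s*\d[\d,]*(\.\d+)?\s*$")


def looks_numeric_text(values, threshold=0.8):
    """Return True if most non-empty string values look like numbers."""
    strings = [v for v in values if isinstance(v, str) and v.strip()]
    if not strings:
        return False
    hits = sum(1 for v in strings if _NUMERIC_TEXT.match(v))
    return hits / len(strings) >= threshold


def category_variants(values):
    """Group values that differ only by case or surrounding whitespace."""
    groups = defaultdict(set)
    for v in values:
        if isinstance(v, str):
            groups[v.strip().lower()].add(v)
    # Keep only the categories that were written in more than one way.
    return {key: sorted(spellings)
            for key, spellings in groups.items() if len(spellings) > 1}


print(looks_numeric_text(["$500", "1,000", "42"]))
print(category_variants(["North", "north ", "NORTH", "South"]))
```

A real column check would also need to handle missing values, locale-specific number formats, and non-string dtypes, which this sketch ignores.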

**2. Runs the model's code with guardrails.** Before running, an AST-based
screen refuses imports outside a small analysis set (pandas, numpy, math,
statistics, datetime, re), introspection/dunder tricks, dangerous builtins, and
file/data readers and writers. The code then runs on a copy of your data in a
separate process with a timeout. The model may add or transform columns freely
(it only touches the copy), but afterwards the guardrail checks that it did not
silently drop rows or return an empty result, and feeds any error back so the
model can fix its own code.
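
To make the idea of an AST-based screen concrete, here is a minimal sketch that uses Python's standard `ast` module. It is an illustration under stated assumptions, not the library's actual screen: the function name `screen` and the contents of `BLOCKED_NAMES` are hypothetical, and it does not cover file/data readers and writers such as `pd.read_csv`.

```python
import ast

# The import allow-list mirrors the analysis set named above.
ALLOWED_IMPORTS = {"pandas", "numpy", "math", "statistics", "datetime", "re"}
# Hypothetical deny-list of builtins; the real list may differ.
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "__import__",
                 "getattr", "setattr"}


def screen(code):
    """Return (safe, reason) by inspecting the syntax tree, without running it."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"

    for node in ast.walk(tree):
        # Collect the module names this node imports, if any.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]  # relative imports have no module
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                return False, f"import of '{module}' is not allowed"

        # Refuse dunder attribute access such as obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False, f"dunder attribute '{node.attr}' is not allowed"

        # Refuse references to dangerous builtins.
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            return False, f"builtin '{node.id}' is not allowed"

    return True, "ok"


print(screen("import os"))
print(screen("import numpy as np\nresult = np.mean([1, 2, 3])"))
```

Because the check walks the parsed tree rather than matching strings, it is not fooled by spacing or comments, but like any static screen it can be bypassed by sufficiently inventive code.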

### Scope: an honest note, please read it

This is **defense in depth** for cooperative or semi-trusted model output: it
stops the destructive accidents an honest model makes and the obvious escape
attempts. It is **not** a security sandbox for deliberately malicious code.
In-process Python sandboxes have a long history of clever escapes, and a child
process still runs with your user's filesystem permissions, so the
subprocess gives you timeout and crash isolation, not a filesystem jail. To run
code from an untrusted source, put safedata-guard inside OS-level isolation (a
container, a locked-down user, or a VM). It also cannot prove the model's maths
is correct, which no tool can do in general.
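
The difference between timeout/crash isolation and a filesystem jail can be seen in a minimal sketch that uses only the standard library. This is an assumption about the general technique, not the library's code; `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys


def run_with_timeout(code, timeout=10.0):
    """Run code in a child interpreter; return (ok, output_or_reason)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        # A crash in the child does not take the parent down with it.
        return False, f"exited with code {proc.returncode}"
    return True, proc.stdout.strip()


print(run_with_timeout("print(1 + 1)"))
print(run_with_timeout("while True: pass", timeout=1.0))
```

Nothing here restricts what the child can read or write: it runs as the same user as the parent, which is why OS-level isolation is still needed for untrusted code.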

## Install

```bash
pip install safedata-guard
pip install "safedata-guard[polars]"   # optional, for Polars support
```

You can pass either a pandas or a Polars DataFrame wherever a DataFrame is
expected; the library detects the type and applies the same summary and safety
checks to both.

## Quick example

```python
import safedata
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})

def my_model(prompt):
    # Replace this with a call to your own model: take the prompt text,
    # return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"

agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")

print(out.answer)     # 300.0
print(out.blocked)    # False if the code passed the safety checks
print(out.attempts)   # list of code attempts that were made
```

## Connecting a real model

Real models return messy text: code wrapped in Markdown fences, chatter like
"Here is the code:", and occasional failures. `safedata.wrap()` takes any
function that sends text to a model and returns text, pulls the bare code out of
the reply, and turns failures into a clear `ModelError` instead of a crash. Any
text-in/text-out function works, hosted or local, so you are not tied to one
provider.

```python
import safedata

def my_call(prompt):
    return some_model_that_takes_text_and_returns_text(prompt)

agent = safedata.Agent(model=safedata.wrap(my_call))
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer)
```

The same guardrails apply whichever model you use; a stronger model just
produces working code on the first try more often, so fewer retries are needed.
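
The reply clean-up described in this section can be sketched in a few lines. This is an illustration only, not the library's `wrap` or `extract_code`: the names `extract_fenced_code`, `wrap_call`, and `ModelCallError` are hypothetical, and the sketch only handles the first fenced block in a reply.

```python
import re

# Build the fence marker from single backticks so this example stays
# readable inside a Markdown document.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


class ModelCallError(Exception):
    """Hypothetical error raised when the underlying model call fails."""


def extract_fenced_code(reply):
    """Return the first fenced block's contents, or the stripped reply."""
    match = _FENCED.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


def wrap_call(call):
    """Turn a text-in/text-out function into one that returns bare code."""
    def model(prompt):
        try:
            reply = call(prompt)
        except Exception as exc:
            raise ModelCallError(f"model call failed: {exc}") from exc
        return extract_fenced_code(reply)
    return model


reply = "Here is the code:\n" + FENCE + "python\nresult = 1 + 1\n" + FENCE
print(extract_fenced_code(reply))
```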

## Command line

After installing you get a `safedata` command that summarises a file (quality
warnings and token estimate) without writing any Python. It only reads and
summarises; it never executes code.

```bash
safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5
```

Supported: `.csv`, `.tsv`, `.xlsx`, `.xls`, `.parquet`, `.json`. `--report`
writes the HTML report; `--no-redact` shows raw samples instead of masking PII.

You can also run it as `python -m safedata check sales.csv` if the command is
not on your PATH.

## PII masking

The summary sends a few real sample values to the model, and those can contain
personal data. By default safedata-guard masks obvious PII (emails, card-like
numbers, phones, SSNs, IPs) before the summary leaves your machine and notes
which columns were masked.

```python
safedata.summarize(df)                    # PII masked by default
safedata.summarize(df, redact_pii=False)  # raw samples, if you are sure
```

This is **best-effort, regex-based redaction, not a compliance guarantee.** It
catches common patterns and will miss unusual formats, names, addresses, and
free text. For regulated data, keep it out of third-party LLMs by policy rather
than relying on a regex. Treat masking as a seatbelt, not a vault.
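
To show why regex-based masking is best-effort, here is a minimal sketch. The patterns and the `mask_pii` helper are assumptions for illustration and are not the patterns safedata-guard uses.

```python
import re

# Hypothetical patterns for three common PII shapes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text):
    """Replace each match with a placeholder; return (masked, kinds_found)."""
    found = []
    for kind, pattern in PATTERNS.items():
        text, count = pattern.subn(f"<{kind.upper()}>", text)
        if count:
            found.append(kind)
    return text, found


print(mask_pii("Contact jane@example.com from 10.0.0.1, SSN 123-45-6789"))
print(mask_pii("Jane Doe, 12 Elm Street"))
```

The second call returns its input unchanged: a name and a street address match none of the patterns, which is exactly the kind of miss described above.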

## Use the parts on their own

```python
print(safedata.summarize(df))                  # quality-aware summary
result = safedata.run_safely(code_string, df)  # run code through the guardrails
verdict = safedata.check_code(code_string)     # is this code safe? (does NOT run it)
safedata.report(df, "report.html")             # HTML quality report
print(safedata.token_savings(df))              # estimated token (cost) saving
```

`check_code(code)` returns a result with `.safe` (bool) and `.reason`, using the
same screen as `run_safely` but without executing anything, so you can use it as
a guardrail inside your own agent loop.

## Token saving

Sending a whole table costs tokens for every row; the summary is far smaller.

```python
print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).
```

Every `agent.ask(...)` result also carries a `.tokens` estimate. All token
figures are estimates (each provider counts differently) but show the scale of
the saving.
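
A character-count heuristic is enough to see where such figures come from. The sketch below assumes roughly 4 characters per token; `rough_token_count` and `rough_savings` are hypothetical helpers, and only the dictionary keys mirror the fields listed for `token_stats` in the function reference.

```python
def rough_token_count(text, chars_per_token=4):
    """Estimate tokens as character count divided by chars_per_token."""
    return len(text) // chars_per_token


def rough_savings(summary_text, raw_text):
    """Compare the estimated token cost of a summary against the raw data."""
    summary_tokens = rough_token_count(summary_text)
    raw_tokens = rough_token_count(raw_text)
    saved_tokens = raw_tokens - summary_tokens
    # Guard against division by zero for empty input.
    saved_percent = round(100 * saved_tokens / raw_tokens, 2) if raw_tokens else 0.0
    return {
        "summary_tokens": summary_tokens,
        "raw_tokens": raw_tokens,
        "saved_tokens": saved_tokens,
        "saved_percent": saved_percent,
    }


print(rough_savings("a" * 40, "b" * 4000))
```

Real tokenizers split text differently depending on language and content, so a heuristic like this is useful for scale, not for billing.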

## Function reference

**Asking questions**

- `safedata.Agent(model, max_retries=3)` builds an agent. `model` is a callable
  that takes a prompt string and returns code as a string (usually made with
  `wrap`); `max_retries` is how many times the model may correct itself after
  its code is blocked or fails.
- `agent.ask(df, question, verbose=False)` runs the full loop and returns a
  result with `.answer`, `.blocked`, `.reason`, `.attempts`, and `.tokens`.

**Connecting a model**

- `safedata.wrap(call, clean=...)` turns any text-in/text-out function into a
  model the agent can use, stripping messy replies and raising `ModelError` on
  failure.
- `safedata.extract_code(text)` pulls bare Python code out of a reply (handles
  Markdown fences and chatter). `wrap` uses it by default.
- `safedata.ModelError` is raised when a wrapped model call fails.

**Looking at the data**

- `safedata.summarize(df, redact_pii=True)` returns the text summary with trap
  warnings. This is what gets sent to the model.
- `safedata.report(df, path=None)` writes an HTML quality report to `path`, or
  returns the HTML string if no path is given.

**Running code safely**

- `safedata.run_safely(code, df, result_var="result", isolate=True, timeout=10.0)`
  runs code against a copy of `df`, blocks unsafe operations, checks that rows
  were not silently dropped and the result is not empty, and returns the result
  variable. Raises `SafetyError` if the code is blocked.
- `safedata.check_code(code)` screens code without running it; returns a
  `CodeCheck` (`.safe`, `.reason`).
- `safedata.SafetyError` is raised when code is blocked.

**Token estimates**

- `safedata.token_savings(df)` returns a readable sentence.
- `safedata.token_stats(df)` returns `summary_tokens`, `raw_tokens`,
  `saved_tokens`, `saved_percent`.
- `safedata.estimate_tokens(text)` estimates tokens for any text (roughly 4
  characters per token).

## License

MIT
