Metadata-Version: 2.4
Name: safedata-guard
Version: 1.0.1
Summary: A lightweight framework for safely enabling LLMs to analyze pandas/Polars data without exposing raw data or blindly executing generated code.
Author: Aravind Chakravarthy
License: MIT
Project-URL: Homepage, https://pypi.org/project/safedata-guard/
Keywords: ai,agent,llm,pandas,data,safety,sandbox
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3
Requires-Dist: numpy>=1.20
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: openpyxl>=3.0; extra == "dev"
Provides-Extra: polars
Requires-Dist: polars>=0.20; extra == "polars"
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0; extra == "excel"
Dynamic: license-file

# safedata-guard

A lightweight framework for safely enabling LLMs to analyze pandas/Polars data
without exposing raw data or blindly executing generated code.

Most "chat with your data" tools send your whole dataset to the model and then
run whatever code it writes, unchecked. safedata-guard sits between the AI and
your data and changes both halves of that: it sends a compact, quality-aware
*summary* instead of raw rows (cheaper, and it keeps sensitive values out of the
prompt), and it runs the AI's code behind guardrails on a copy of your data.

At a glance, it gives you:

- a quality-aware summary of any DataFrame, with warnings about common data
  traps, used as a cheap prompt instead of raw rows
- best-effort PII masking of sample values before they reach the model
- a guardrail layer that screens and runs AI-written code on a copy, with a
  timeout, and feeds errors back so the model can correct itself
- `check_code()` to screen code without running it, for use in your own agent
- support for both pandas and Polars, plus a `safedata` command-line tool
- an honest statement of its limits: defense in depth, not a security sandbox

## What it does

**1. It summarises your data before sending it to an AI.**

Instead of pushing 100,000 rows into a prompt (slow and expensive), it sends a
short summary: the columns, their types, a few sample values, and basic stats.
The summary also flags common data problems that trip up analysis, such as:

- numbers stored as text (like "$500" or "1,000")
- the same category written several ways (like "California" and "CA")
- ID columns that are not actually unique
- columns that are completely empty
- columns that are mostly empty
- Excel dates stored as plain numbers (like 45292)
- negative values in columns that should not have them
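
As a rough illustration of two of these checks, the sketch below flags text that is really a formatted number and converts an Excel serial date. It is not safedata-guard's implementation: the names `looks_like_numeric_text`, `share_numeric_text`, and `excel_serial_to_date` are hypothetical, the pattern is an assumption, and it works on plain Python values to stay dependency-free.

```python
import re
from datetime import date, timedelta

# Assumed pattern: optional currency symbol, then either digits with
# thousands separators ("1,000") or plain digits, then optional decimals.
_NUMERIC_TEXT = re.compile(
    r"^\s*[$€£]?\s*-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?\s*$"
)

def looks_like_numeric_text(value):
    """True if value is a string holding a formatted number, e.g. "$500"."""
    return isinstance(value, str) and _NUMERIC_TEXT.match(value) is not None

def share_numeric_text(values):
    """Fraction of non-missing values that are numbers stored as text."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(looks_like_numeric_text(v) for v in present) / len(present)

# The usual base date for Excel's 1900 date system.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel serial day number to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

print(share_numeric_text(["$500", "1,000", "n/a", None]))
print(excel_serial_to_date(45292))  # 2024-01-01
```

A column where most values pass `looks_like_numeric_text` is a candidate for the "numbers stored as text" warning; the threshold to use is a judgment call.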

**2. It runs the AI's code with guardrails.**

When the AI writes code to answer a question, safedata-guard runs it on a copy
of your data, in a separate process with a timeout. Before running, an
AST-based screen refuses unsafe imports, introspection/dunder tricks, dangerous
builtins, and data/file readers and writers (a small set of analysis imports
like pandas, numpy, and datetime is allowed). The AI may add or transform
columns freely, because it works on the copy. After the code runs,
safedata-guard checks that it did not silently drop rows from the data and that
the result is not silently empty. If something looks wrong, it sends the error
back so the AI can fix its own code.
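
To make the idea of an AST-based screen concrete, here is a minimal sketch built on Python's standard `ast` module. It is an illustration only, not the library's actual rule set: the `screen` function, the allowlist, and the blocked names are all assumptions, and a real screen needs many more rules (for example, for file readers and writers).

```python
import ast

# Assumed allowlist and denylist, for illustration only.
ALLOWED_IMPORTS = {"pandas", "numpy", "datetime"}
BLOCKED_NAMES = {"eval", "exec", "open", "compile", "__import__", "getattr"}

def screen(code):
    """Return a list of problems found in code; an empty list means none were flagged."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import from '{module}' is not allowed")
        elif isinstance(node, ast.Attribute):
            # Dunder attributes such as __class__ are a common escape route.
            if node.attr.startswith("__") and node.attr.endswith("__"):
                problems.append(f"dunder attribute '{node.attr}' is not allowed")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            problems.append(f"use of '{node.id}' is not allowed")
    return problems

print(screen("result = df['amount'].sum()"))  # []
print(screen("import os"))
print(screen("x = ().__class__"))
```

Because the code is only parsed, never executed, a screen like this can reject a snippet before anything runs. It catches the obvious cases and nothing more, which is why the scope note below matters.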

**Scope: please read this honestly.** This is *defense in depth* for
cooperative / semi-trusted model output: it stops the destructive accidents an
honest model makes, and the obvious escape attempts. It is **not** a security
sandbox for deliberately malicious untrusted code. In-process Python
"sandboxes" have a long history of clever escapes, and on Windows a child
process still shares your filesystem permissions, so the subprocess gives you
*timeout and crash isolation*, not a filesystem jail. If you need to run code
from an untrusted source, run safedata-guard inside OS-level isolation (a
container, a locked-down user account, or a VM). It also does not prove the
AI's maths is correct, which no tool can do in general.
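
The sketch below shows what "timeout and crash isolation" means in practice, using only the standard library. It is an assumption-laden illustration, not how safedata-guard launches its worker: `run_with_timeout` is a hypothetical helper, and it passes code as a string rather than handing over a DataFrame.

```python
import subprocess
import sys

def run_with_timeout(code, timeout_seconds=5.0):
    """Run code in a fresh Python process; return (ok, output_or_reason)."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_seconds} seconds"
    if done.returncode != 0:
        # A crash in the child does not take the parent down with it.
        return False, f"child exited with code {done.returncode}"
    return True, done.stdout.strip()

print(run_with_timeout("print(1 + 1)"))
print(run_with_timeout("while True: pass", timeout_seconds=1.0))
```

Note what this does not do: the child runs as the same user, so it can still read and write any file the parent can. That gap is what OS-level isolation closes.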

## Install

```bash
pip install safedata-guard
```

### Using Polars instead of pandas

safedata-guard works with either pandas or Polars DataFrames. The safety
screen, the copy-and-isolate execution, and the data-trap summary all handle
both. To use Polars, install the extra:

```bash
pip install "safedata-guard[polars]"
```

Then pass a Polars frame anywhere you would pass a pandas one; the library
detects the type. The safety screen blocks Polars' file writers and readers
(`write_csv`, `write_parquet`, lazy `sink_*`, `read_*`, `scan_*`) the same way
it blocks the pandas equivalents. The scope note above applies identically: it
is defense in depth for cooperative model output, not a sandbox for malicious
code.

### PII masking in the summary

The summary sends a few real sample values to the LLM so it can write correct
code. Those samples can contain personal data. By default, safedata-guard masks
obvious PII (emails, card-like numbers, phones, SSNs, IPs) before the summary
leaves your machine, and notes which columns were masked.

```python
safedata.summarize(df)                    # PII masked by default
safedata.summarize(df, redact_pii=False)  # raw samples, if you are sure
```

This is **best-effort, regex-based redaction, not a compliance guarantee.** It
catches common well-formed patterns and will miss unusual formats, names,
addresses, and free-text. If you handle regulated data, keep it out of
third-party LLMs by policy; do not rely on a regex. Masking is on because
leaking less is better than leaking more, but treat it as a seatbelt, not a
vault.
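
A small sketch shows both what regex-based masking catches and what it misses. The patterns and the `mask_pii` helper are assumptions made for illustration; they are not the patterns safedata-guard ships.

```python
import re

# Assumed patterns; real-world formats vary far more than this.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]

def mask_pii(text):
    """Replace each match with a [LABEL] placeholder.

    Returns (masked_text, labels_found).
    """
    found = []
    for label, pattern in PII_PATTERNS:
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found

print(mask_pii("jane.doe@example.com from 10.0.0.7"))
# ('[EMAIL] from [IP]', ['EMAIL', 'IP'])
print(mask_pii("Jane Doe, 12 High Street"))
# ('Jane Doe, 12 High Street', [])
```

The second call passes through untouched: a name and a street address have no fixed shape for a regex to match.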

## Command line: check a file in one line

After installing, you get a `safedata` command. Point it at a data file to see
the quality summary, data-trap warnings, and token-saving estimate without
writing any Python:

```bash
safedata check sales.csv
safedata check data.xlsx --report quality.html
safedata check sales.csv --no-redact --samples 5
```

Supported file types: `.csv`, `.tsv`, `.xlsx`, `.xls`, `.parquet`, `.json`.
The command only reads and summarises the file; it never executes code. `--report`
also writes the HTML quality report, and `--no-redact` shows raw sample values
instead of masking detected PII.

## Quick example

This example runs end to end as written. The `my_model` function below returns
the code as a string; in a real project you would replace its body with a call
to a model of your choice.

```python
import safedata
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-01-01", "2024-05-01", "2025-08-01"],
    "amount": [100.0, 50.0, 200.0],
})

def my_model(prompt):
    # Replace this with a call to your own model.
    # It should take the prompt text and return Python code as a string.
    return "result = df[df['date'].str.startswith('2025')]['amount'].sum()"

agent = safedata.Agent(model=my_model)
out = agent.ask(df, "What were total sales in 2025?")

print(out.answer)     # 300.0
print(out.blocked)    # False if the code passed the safety checks
print(out.attempts)   # the list of code attempts that were made
```

## Connecting a real model with wrap()

Real models are messy. They wrap code in Markdown code fences, add
sentences like "Here is the code:", and sometimes fail because of a bad key or
no internet. `safedata.wrap()` takes any function that sends text to a model and
returns the model's text, and it handles the messy parts for you: pulling the
bare code out of the reply and turning failures into a clear message instead of
a crash.

You write a small function that calls your model. It does not matter which model
it is: a hosted one like Claude or GPT-4, a model running on your own machine, or
your own custom function. As long as it takes text and returns text, it works.

```python
import safedata

# Example shape of your own model call. Replace the body with your model.
def my_call(prompt):
    reply = some_model_that_takes_text_and_returns_text(prompt)
    return reply

agent = safedata.Agent(model=safedata.wrap(my_call))
# df is the DataFrame from the quick example above.
out = agent.ask(df, "What were total sales in 2025?")
print(out.answer)
```

Because `wrap` works with any text-in, text-out function, the library is not
tied to a single provider. You can point it at whatever model you already have.
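
For a sense of what "pulling the bare code out of the reply" involves, here is a minimal sketch of fence stripping. It is not the library's `extract_code`; the helper name `extract_fenced_code` and the pattern are assumptions, and it only handles the first fenced block in a reply.

```python
import re

# Built from single backticks so this example can itself sit inside a fence.
FENCE = "`" * 3

# Opening fence, optional language tag, newline, then the shortest body
# that ends at the next fence.
_FENCED = re.compile(FENCE + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_fenced_code(reply):
    """Return the first fenced block's contents, or the whole reply if unfenced."""
    match = _FENCED.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()

reply = (
    "Here is the code:\n"
    + FENCE + "python\n"
    + "result = 1 + 1\n"
    + FENCE + "\n"
    + "Hope that helps!"
)
print(extract_fenced_code(reply))  # result = 1 + 1
```

Falling back to the whole reply when no fence is found keeps well-behaved models working, at the cost of passing chatter through when a model adds prose without fences.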

A note on quality: the library connects to any model, but the quality of the
answers depends on the model. A strong model writes good code on the first try.
A weaker model may write code that gets blocked by the safety checks and then
retried. The library stays safe either way, but a better model means fewer
retries.

## Use the parts on their own

```python
print(safedata.summarize(df))                  # the data summary with warnings
result = safedata.run_safely(code_string, df)  # run code through the safety layer
verdict = safedata.check_code(code_string)     # is this code safe? (does NOT run it)
safedata.report(df, "report.html")             # write an HTML quality report
print(safedata.token_savings(df))              # estimated token (cost) saving
```

## Token and cost saving

Sending a whole table to a model can cost a lot, because every row becomes
tokens that you pay for. safedata sends a short summary instead, which is far
smaller. `safedata.token_savings(df)` shows the estimated saving in plain words:

```python
print(safedata.token_savings(df))
# Sending the summary uses about 620 tokens instead of about 13,007,168 for the
# raw data. Estimated saving: 99.99% (about 13,006,548 tokens).
```

Every `agent.ask(...)` result also carries a `tokens` estimate you can inspect:

```python
out = agent.ask(df, "What were total sales in 2025?")
print(out.tokens)   # {'summary_tokens': ..., 'raw_tokens': ..., 'saved_percent': ...}
```

These numbers are estimates. Each model provider counts tokens with its own
method, so exact figures vary, but the estimate shows the scale of the saving.
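
The arithmetic behind such an estimate is simple. The sketch below uses the rough rule of about four characters per token; the helper names are hypothetical, and how the library serialises the raw data before counting is not shown here.

```python
def rough_token_count(text):
    """Estimate tokens using a rough rule of about four characters per token."""
    return len(text) // 4

def token_stats_sketch(summary_text, raw_text):
    """Compare the estimated cost of sending a summary versus the raw data."""
    summary_tokens = rough_token_count(summary_text)
    raw_tokens = rough_token_count(raw_text)
    saved_tokens = raw_tokens - summary_tokens
    # Guard against an empty raw text to avoid dividing by zero.
    saved_percent = round(100 * saved_tokens / raw_tokens, 2) if raw_tokens else 0.0
    return {
        "summary_tokens": summary_tokens,
        "raw_tokens": raw_tokens,
        "saved_tokens": saved_tokens,
        "saved_percent": saved_percent,
    }

# A 40-character summary against 4,000 characters of raw data.
print(token_stats_sketch("a" * 40, "b" * 4000))
# {'summary_tokens': 10, 'raw_tokens': 1000, 'saved_tokens': 990, 'saved_percent': 99.0}
```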

## HTML report

`safedata.report(df, "report.html")` writes a simple web page that lists each
column with a red, amber or green status, the problems it found, and suggested
fixes. It is meant to be readable by someone who does not write code. Call it
without a path to get the HTML back as a string instead.

## How the question-answering loop works

1. The data is summarised, including the trap warnings.
2. Your model writes code based on the summary and the question.
3. The code is screened for safety, then run on a copy of the data.
4. If it is blocked or fails, the error is sent back and the model tries again,
   up to `max_retries` times.
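
The retry part of this loop can be sketched in a few lines. Everything here is a stand-in: `ask_with_retries`, `make_toy_model`, and `toy_run` are hypothetical names, the summary step is left out, and the toy runner is not a safety layer. The sketch also assumes that `max_retries` counts corrections after the first attempt.

```python
def ask_with_retries(model, run, question, max_retries=3):
    """Ask model for code, run it, and feed any error back for another try."""
    prompt = question
    attempts = []
    reason = None
    for _ in range(1 + max_retries):  # first try plus corrections (assumed)
        code = model(prompt)
        attempts.append(code)
        try:
            answer = run(code)
        except Exception as exc:
            reason = str(exc)
            prompt = f"{question}\n\nYour previous code failed: {reason}\nPlease fix it."
            continue
        return {"answer": answer, "blocked": False, "reason": None, "attempts": attempts}
    return {"answer": None, "blocked": True, "reason": reason, "attempts": attempts}

def make_toy_model(replies):
    """Build a fake model that returns the given replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)

def toy_run(code):
    """Toy stand-in for the safety layer: refuse imports, then evaluate."""
    if "import" in code:
        raise ValueError("imports are not allowed")
    scope = {}
    exec(code, {"__builtins__": {}}, scope)  # demo only; this is not a sandbox
    return scope["result"]

model = make_toy_model(["import os\nresult = os.getcwd()", "result = 1 + 2"])
out = ask_with_retries(model, toy_run, "What is 1 + 2?")
print(out["answer"], len(out["attempts"]))  # 3 2
```

The first reply is refused, the refusal text is appended to the prompt, and the second reply succeeds, so the result records two attempts.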

## About the model

The library does not include a model. You supply one through a single function,
so you can use a local model or a hosted one without changing anything else in
your code.

## Full function reference

Everything the library makes available:

**Asking questions**

- `safedata.Agent(model, max_retries=3)` builds an agent. `model` is a function
  that takes a prompt and returns code (usually made with `wrap`). `max_retries`
  is how many times the model may correct itself after a block.
- `agent.ask(df, question, verbose=False)` runs the full loop and returns a
  result object. Set `verbose=True` to print each code attempt as it happens.
- The result object has: `.answer` (the result), `.blocked` (True if it could
  not be completed safely), `.reason` (why it was blocked, if so), `.attempts`
  (the list of code attempts), and `.tokens` (the token estimate for the call).

**Connecting a model**

- `safedata.wrap(call, clean=...)` turns any text-in, text-out function into a
  model the agent can use. It strips code out of messy replies and turns
  failures into a clear `ModelError`.
- `safedata.extract_code(text)` is the helper that pulls bare Python code out of
  a reply (handling Markdown code fences and chatter). `wrap` uses it by default;
  you can call it yourself or pass your own version to `wrap`.
- `safedata.ModelError` is the error raised when a wrapped model call fails
  (bad key, no internet, unusable output).

**Looking at the data**

- `safedata.summarize(df)` returns the short text summary, including the data
  trap warnings. This is what gets sent to the model.
- `safedata.report(df, path=None)` writes an HTML quality report to `path`, or
  returns the HTML as a string if no path is given.

**Running code safely on its own**

- `safedata.run_safely(code, df, result_var="result")` runs a piece of code
  against a copy of `df`, blocks unsafe operations, checks that nothing was
  damaged, and returns the value of the result variable. Raises `SafetyError`
  if the code is unsafe.
- `safedata.SafetyError` is the error raised when code is blocked.

**Token and cost estimates**

- `safedata.token_savings(df)` returns a readable sentence describing the
  estimated token saving.
- `safedata.token_stats(df)` returns the raw numbers as a dictionary:
  `summary_tokens`, `raw_tokens`, `saved_tokens`, `saved_percent`.
- `safedata.estimate_tokens(text)` estimates the number of tokens in any piece
  of text, using a rough rule of about four characters per token.

## License

MIT
