Metadata-Version: 2.4
Name: openlayer-guardrails
Version: 0.3.0
Summary: Guardrails that can be used to check inputs and outputs of functions and works well with Openlayer tracing.
Requires-Python: >=3.10
Requires-Dist: openlayer>=0.2.0a89
Provides-Extra: pii
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'pii'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'pii'
Provides-Extra: prompt-injection
Requires-Dist: torch>=2.0.0; extra == 'prompt-injection'
Requires-Dist: transformers>=4.40.0; extra == 'prompt-injection'
Provides-Extra: toxicity
Requires-Dist: torch>=2.0.0; extra == 'toxicity'
Requires-Dist: transformers>=4.30.0; extra == 'toxicity'
Description-Content-Type: text/markdown

# Openlayer Guardrails

Open source guardrail implementations that work with Openlayer tracing.

## Installation

```bash
pip install openlayer-guardrails
```

## Usage

### Standalone Usage

```python
from openlayer_guardrails import PIIGuardrail

# Create guardrail
pii_guard = PIIGuardrail(
    block_entities={"CREDIT_CARD", "US_SSN"},
    redact_entities={"EMAIL_ADDRESS", "PHONE_NUMBER"}
)

# Check inputs manually
data = {"message": "My email is john@example.com and SSN is 123-45-6789"}
result = pii_guard.check_input(data)

if result.action.value == "block":
    print(f"Blocked: {result.reason}")
elif result.action.value == "modify":
    print(f"Modified data: {result.modified_data}")
```

### With Openlayer Tracing

```python
from openlayer_guardrails import PIIGuardrail
from openlayer.lib.tracing import trace

# Create guardrail
pii_guard = PIIGuardrail()

# Apply to traced functions
@trace(guardrails=[pii_guard])
def process_user_data(user_input: str):
    return f"Processed: {user_input}"

# PII is automatically handled
result = process_user_data("My email is john@example.com")
# Output: "Processed: My email is [EMAIL-REDACTED]"
```

### Toxicity Guardrail (Brazilian Portuguese)

Detects toxic content in Brazilian Portuguese using the [ToxiGuardrailPT](https://huggingface.co/nicholasKluge/ToxiGuardrailPT) model.

```bash
pip install 'openlayer-guardrails[toxicity]'
```

```python
from openlayer_guardrails import ToxicityPTGuardrail

# Create guardrail (default threshold=0.0; positive scores = safe, negative = toxic)
toxicity_guard = ToxicityPTGuardrail()

# Check inputs
result = toxicity_guard.check_input({"message": "Você é um idiota!"})
print(result.action)  # GuardrailAction.BLOCK

# Check outputs with contextual scoring (sentence-pair encoding)
result = toxicity_guard.check_output(
    output="Claro, aqui está a informação solicitada.",
    inputs={"prompt": "Me ajude com meu trabalho."},
)
print(result.action)  # GuardrailAction.ALLOW
```

### Toxicity Guardrail (English)

Detects toxic content in English across six categories using [unitary/toxic-bert](https://huggingface.co/unitary/toxic-bert).

```bash
pip install 'openlayer-guardrails[toxicity]'
```

```python
from openlayer_guardrails import ToxicityENGuardrail

# Create guardrail (default threshold=0.5)
toxicity_guard = ToxicityENGuardrail()

# Check inputs
result = toxicity_guard.check_input({"message": "You are terrible and should die"})
print(result.action)  # GuardrailAction.BLOCK
print(result.metadata["triggered_categories"])
# e.g. {'toxic': 0.98, 'severe_toxic': 0.72, 'insult': 0.89, 'threat': 0.81}

# Monitor only specific categories
guard = ToxicityENGuardrail(categories={"threat", "severe_toxic"})
```

### Handling long texts

By default, all guardrails truncate inputs to 512 tokens for fast inference.
To evaluate the full text, enable chunking mode by setting `max_length=None`:

```python
guard = ToxicityPTGuardrail(max_length=None)   # or ToxicityENGuardrail(max_length=None)
```

In chunking mode, long texts are split into overlapping 512-token windows and
each window is scored independently. The most toxic score across all windows is
used. Latency scales linearly with the number of chunks.

## Model Limitations

### Prompt Injection Guardrail

| Property | Value |
|---|---|
| **Model** | [meta-llama/Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) |
| **Max tokens** | 512 |
| **Language** | English |
| **Parameters** | 86M |
| **Scope** | Input-only (outputs are not checked) |

Texts longer than 512 tokens are truncated. Only the first 512 tokens are evaluated.

### Toxicity Guardrail (PT-BR)

| Property | Value |
|---|---|
| **Model** | [nicholasKluge/ToxiGuardrailPT](https://huggingface.co/nicholasKluge/ToxiGuardrailPT) |
| **Max tokens** | 512 |
| **Language** | Brazilian Portuguese |
| **Parameters** | 109M |
| **Architecture** | BERTimbau (bert-base-portuguese-cased) |
| **Output type** | Single scalar reward score (positive = safe, negative = toxic) |
| **Reported accuracy** | 70.36% (hatecheck-portuguese), 74.04% (told-br) |
| **Scope** | Input and output (output uses sentence-pair encoding for context) |

By default, texts longer than 512 tokens are truncated. Set `max_length=None` to enable chunking for full-text coverage. The model was trained on Brazilian Portuguese data and may not generalize well to European Portuguese or other languages.

### Toxicity Guardrail (EN)

| Property | Value |
|---|---|
| **Model** | [unitary/toxic-bert](https://huggingface.co/unitary/toxic-bert) |
| **Max tokens** | 512 (chunking available via `max_length=None`) |
| **Language** | English |
| **Parameters** | 110M |
| **Architecture** | BERT (bert-base-uncased) |
| **Output type** | Multi-label probabilities across 6 categories |
| **Categories** | `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate` |
| **Reported AUC** | 0.98636 (Jigsaw Toxic Comment Challenge) |
| **Scope** | Input and output |

By default, texts longer than 512 tokens are truncated. Set `max_length=None` to enable chunking for full-text coverage. The model was trained on English data.

