Metadata-Version: 2.4
Name: tacit-citadel
Version: 0.1.0
Summary: GPU-powered Structured Data De-identification Engine
Requires-Python: <3.14,>=3.11
Requires-Dist: click>=8.4.1
Requires-Dist: numpy>=2; python_full_version >= '3.13'
Requires-Dist: openai>=2.41.1
Requires-Dist: presidio-analyzer>=2.2.362
Requires-Dist: presidio-anonymizer>=2.2.362
Requires-Dist: pydantic>=2.13.4
Requires-Dist: pydash>=8.0.6
Requires-Dist: pyjq>=2.6.0
Requires-Dist: pyyaml>=6.0.3
Provides-Extra: cuda
Requires-Dist: spacy[cuda12x]>=3.8.14; (python_full_version == '3.12.*' and sys_platform == 'linux') and extra == 'cuda'
Description-Content-Type: text/markdown

# Citadel

Citadel is a policy-driven de-identification tool for JSONL training and
evaluation data. It reads one JSON object per line, applies a versioned YAML
policy, and writes a compact de-identified JSONL file beside the input.

The normal command is:

```bash
uv run tacit.citadel policy.yaml input.jsonl
```

For `input.jsonl`, Citadel writes:

```text
input.citadel.jsonl
```

The CLI currently takes exactly two positional arguments: the policy file and
the input JSONL file. The output path is always derived from the input path.

## Setup

Citadel is packaged as `tacit-citadel` and exposes the `tacit.citadel` console
script.

The project supports Python `>=3.11,<3.14`; the checked-in `.python-version`
currently selects Python 3.11.

```bash
uv sync
```

`pyjq` is a runtime dependency and builds native code when no compatible wheel
is available. On macOS, make sure Xcode command line tools and the autotools
chain are installed. If setup fails with `No such file or directory:
'autoreconf'`, install the missing tools and rerun `uv sync`.

```bash
brew install autoconf automake libtool
```

If the `pyjq` build finds the command line tools but fails with `stdlib.h` not
found, pass the active macOS SDK path into the build:

```bash
SDKROOT=$(xcrun --show-sdk-path) uv sync
```

CUDA-enabled spaCy is available as an optional extra for Linux CPython 3.12:

```bash
uv sync --extra cuda
```

The default install path does not install CUDA spaCy packages.

## Usage

Run the sample policy against the sample record:

```bash
uv run tacit.citadel policy.yaml sample.jsonl
```

This creates:

```text
sample.citadel.jsonl
```

On success, Citadel prints a short report:

```text
output: sample.citadel.jsonl
records processed: 1
fields changed: 7
llm calls: 1
epoch seed: 1787680000
```

The `epoch seed` is generated from the current Unix time unless `process_file`
is called directly with an explicit `epoch_seed`.

## Input

Citadel expects JSONL. Each line must be a complete JSON object.

```json
{"client_id":"007","intake_details":{"date":"2026-01-05","weight":102.4}}
```

Non-object JSONL lines fail the run. Citadel processes records in chunks of 50
and writes one compact JSON object per output line.

## Policy

Policy files are YAML mappings validated with Pydantic. Extra fields are
rejected.

Required top-level fields:

```yaml
version: 1
name: nourish-intake-and-trajectory
description: De-identification policy for Nourish-style records.

llm:
  base_url: http://127.0.0.1:8000/v1
  model: google/gemma-4-12B-it-qat-w4a16-ct
  temperature: 1.0
  top_p: 0.95
  top_k: 64

rules:
  - path: .client_id
    action: drop
```

Each rule has:

```yaml
- path: .jq.selector
  action: drop
  required: true
  params: {}
```

`path` is a jq selector. Citadel resolves selectors through `pyjq` and applies
actions to the concrete JSON locations returned by `path(...)`.

`required` defaults to `true`. If a required rule matches nothing, the run
fails. Use `required: false` for sparse paths that are absent from some records.

## Actions

Citadel currently supports four actions.

### `drop`

Removes the matched object field.

```yaml
- path: .client_id
  action: drop
```

`drop` only deletes object fields. It does not remove array elements.

### `fuzz_number`

Shifts numeric values while preserving approximate modelling signal. Boolean
and non-numeric values are rejected.

Percent mode:

```yaml
- path: .intake_details.weight
  action: fuzz_number
  params:
    mode: percent
    max_percent: 5
    precision: 1
```

Range mode:

```yaml
- path: .intake_details.age
  action: fuzz_number
  params:
    mode: range
    min_delta: -2
    max_delta: 2
    step: 1
```

The random generator is seeded once per run. Integer inputs stay integers when
the fuzzed value is integral.

### `date_offset`

Replaces an absolute date with a human-readable offset from an anchor date in
the same record.

```yaml
- path: .trajectories[] | select(.type == "set_target").date
  action: date_offset
  required: false
  params:
    anchor_path: .intake_details.date
    output: human_relative
```

Supported output strings are:

```text
same day
N day after
N days after
N day before
N days before
```

Date values must be strings accepted by Python's ISO date/datetime parser.

### `llm_rewrite`

Queues selected string fields for rewriting through an OpenAI-compatible chat
completion endpoint.

```yaml
- path: .trajectories[] | select(.type == "messages").thread[].content
  action: llm_rewrite
  required: false
  params:
    system_prompt: You are a high-recall sensitive-data anonymizer.
    user_prompt: |
      Rewrite the INPUT text by replacing sensitive values with typed
      placeholders. Return only the rewritten text.

      INPUT
      {{content}}
```

Only the matched field value is sent to the model. `{{content}}` in the system
or user prompt is replaced with that selected text.

The LLM client uses the policy's `llm.base_url`, `llm.model`, `temperature`,
`top_p`, and `top_k`. The API key is set to `not-needed`, which matches local
OpenAI-compatible servers such as vLLM.

Within a run, duplicate source text is rewritten once and reused from an
in-memory cache. Cache misses in the same chunk are submitted concurrently.

If a rewrite request fails or is cancelled, Citadel writes
`<LLM_REWRITE_FAILED>` into that field and continues the run.

To smoke-test a local rewrite server directly:

```bash
uv run python -m tacit_citadel.llm \
  --base-url http://127.0.0.1:8000/v1 \
  --model google/gemma-4-12B-it-qat-w4a16-ct \
  --text "Hi Jamie, your appointment is on January 12."
```

## Processing Model

For each run, Citadel:

1. Validates the policy YAML.
2. Opens the input JSONL file.
3. Parses each line as a JSON object.
4. Applies policy rules in order.
5. Resolves jq selectors to concrete JSON locations.
6. Queues and runs LLM rewrites for each 50-record chunk.
7. Writes compact JSONL to a temporary output file.
8. Atomically replaces the derived output path after the full run succeeds.
9. Prints a short report.

If a fatal error occurs before replacement, Citadel deletes the temporary file.
An existing output file is preserved.

## Failure Behavior

Citadel fails the run for:

* missing policy or input files
* invalid policy YAML or unsupported policy fields
* invalid JSONL
* JSONL lines that are not objects
* invalid jq selectors
* unmatched required rule paths
* action type errors, such as applying `fuzz_number` to a string
* invalid or missing `date_offset` anchors

LLM rewrite request failures are nonfatal. The failed field is replaced with
`<LLM_REWRITE_FAILED>` and processing continues.

## Development

Run the test suite:

```bash
uv run pytest
```

Run the lightweight checks:

```bash
uv run ruff check .
uv run ty check
```
