Metadata-Version: 2.4
Name: anonim-video-text-library
Version: 0.1.7
Summary: Library and CLI for text anonymization plus audio/video transcription with diarization
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.2.0
Requires-Dist: transformers>=4.38.0
Requires-Dist: faster-whisper>=1.0.0
Requires-Dist: pyannote.audio>=3.3.0
Requires-Dist: imageio-ffmpeg>=0.5.1

# Anonim Video Text Library

Standalone project for two separate workflows:
- text anonymization for `JSON/JSONL/CSV/Markdown/TXT` with a persistent `people.json` dictionary
- audio/video transcription with diarization

## Project layout

- `src/anonim_video_text_library/` - importable Python package
- `text_anonim/` - default runtime workspace for the anonymizer
- `examples/Anonimizez_example/` - self-contained anonymizer example
- `examples/Transcibator_example/` - self-contained transcription example
- `main.py` - local wrapper for the transcription CLI
- `gpu_backends/` - helper scripts for GPU transcription backends
- `whisper.cpp/` - local checkout of `whisper.cpp`

## Installation

```bash
cd Anonim_video_text_Library
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

This installs the dependencies for both workflows, including `torch`,
`transformers`, `faster-whisper`, `pyannote.audio`, and `imageio-ffmpeg`.
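
To confirm that the editable install pulled in everything, a quick check along these lines may help. It is a minimal sketch: the distribution names are the ones listed above, while the helper `installed_versions` is hypothetical and not part of this package.

```python
from importlib import metadata

# Distribution names taken from the dependency list above.
REQUIRED = [
    "torch",
    "transformers",
    "faster-whisper",
    "pyannote.audio",
    "imageio-ffmpeg",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

This only reports what is installed in the active environment; it does not compare versions against the minimums required by the package.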

## Run the anonymizer separately

CLI entrypoints:

```bash
python3 -m anonim_video_text_library --help
python3 text_anonim/anonimizer.py --help
```

Self-contained example:

```bash
cd examples/Anonimizez_example
python3 run_anonymizer_example.py
```

What is inside `examples/Anonimizez_example/`:
- `.env` and `.env.example` for settings
- `input/` for source `json/jsonl/csv/md/txt` files
- `output/` for anonymized copies
- `runtime_root/files/pii/` for `people.json` and blocklists
- `run_anonymizer_example.py` as the runner

The example writes the anonymized files to `output/`; it does not just print
the anonymized content to the terminal.

## Run the transcriber separately

CLI entrypoints:

```bash
anonim-video-text-transcribe --help
python3 main.py --help
```

Self-contained example:

```bash
cd examples/Transcibator_example
python3 run_transcriber_example.py
```

What is inside `examples/Transcibator_example/`:
- `.env` and `.env.example` for settings
- `input/` for media files
- `output/` for generated transcripts
- `run_transcriber_example.py` as the runner

If you need diarization, set `HF_TOKEN` in `.env` or in your shell.
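
Before a long transcription run, it can be useful to check that a token is actually visible. The sketch below is an illustration only: `parse_env_text` and `resolve_hf_token` are hypothetical helpers, the `.env` parsing is deliberately simplistic, and the precedence (shell first, then `.env`) is an assumption rather than the library's documented behavior.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_hf_token(env_file=".env", environ=None):
    """Return HF_TOKEN from the environment, else from env_file, else None."""
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN")
    if token:
        return token
    path = Path(env_file)
    if path.is_file():
        return parse_env_text(path.read_text(encoding="utf-8")).get("HF_TOKEN") or None
    return None


if resolve_hf_token() is None:
    print("HF_TOKEN is not set in the shell or in .env")
```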

## Generate runtime examples

To generate the same two example folders inside any runtime workspace:

```bash
python3 -m anonim_video_text_library \
  --runtime-root /path/to/runtime \
  --example
```

To rebuild the generated README and example folders:

```bash
python3 -m anonim_video_text_library \
  --runtime-root /path/to/runtime \
  --example \
  --force-example
```

The generated runtime examples live under:
- `examples/Anonimizez_example/`
- `examples/Transcibator_example/`

Each example is isolated. The demo scripts no longer create another nested
`examples/` tree inside their own runtime data.
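
If you want to trigger example generation from a script, the commands above can be assembled programmatically. This is a minimal sketch: the flags are the ones shown in this section, and `example_command` is a hypothetical helper, not part of the package.

```python
import shlex
import sys


def example_command(runtime_root, force=False, python=sys.executable):
    """Build the argv list that generates the example folders."""
    cmd = [
        python, "-m", "anonim_video_text_library",
        "--runtime-root", str(runtime_root),
        "--example",
    ]
    if force:
        # Rebuild the generated README and example folders.
        cmd.append("--force-example")
    return cmd


cmd = example_command("/path/to/runtime", force=True)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```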

## Default runtime workspace

By default, the anonymizer uses this runtime workspace:

```text
text_anonim
```

That workspace contains:
- `files/pii/` for input files, `people.json`, and blocklists
- `files/pii_anonymized/` for anonymized output
- `README.md` with generated workspace instructions
- `examples/` with the two generated example folders
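
The layout above can be expressed as paths and checked on disk. The following is a sketch under the assumption that the workspace matches the list exactly; `workspace_paths` and `missing_entries` are hypothetical helpers, not library functions.

```python
from pathlib import Path


def workspace_paths(runtime_root="text_anonim"):
    """Return the expected workspace entries for a given runtime root."""
    root = Path(runtime_root)
    return {
        "pii": root / "files" / "pii",                        # inputs, people.json, blocklists
        "pii_anonymized": root / "files" / "pii_anonymized",  # anonymized output
        "readme": root / "README.md",                         # generated instructions
        "examples": root / "examples",                        # generated example folders
    }


def missing_entries(runtime_root="text_anonim"):
    """List the names of expected entries that do not exist on disk."""
    return [
        name
        for name, path in workspace_paths(runtime_root).items()
        if not path.exists()
    ]


print(missing_entries())
```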

## Python API

The main public API is `TextAnonymizationSession`.

```python
from pathlib import Path
from anonim_video_text_library import TextAnonymizationSession

# Create a session bound to a runtime workspace.
session = TextAnonymizationSession.from_defaults(
    runtime_root=Path("/path/to/runtime"),
    device="auto",
    ner_batch_size=16,
)

# Anonymize a single string; returns the anonymized text and stats.
text, stats = session.anonymize_text(
    "Jordan Miller from Northwind Labs wrote to contact@example.com",
    file_id="demo.txt",
)
print(text)
print(stats)

# Anonymize a JSON-like value; returns the anonymized payload and stats.
payload, stats = session.anonymize_value(
    {
        "title": "Jordan Miller",
        "body": "Northwind Labs contact: contact@example.com",
    },
    file_id="demo.json",
)
print(payload)
print(stats)

# Anonymize files found under input_root and write the results to output_root.
directory_stats = session.anonymize_directory(
    input_root=Path("/path/to/input"),
    output_root=Path("/path/to/output"),
    skip_existing=True,
)
print(directory_stats)
print(session.people_file)
```
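
For record-oriented data, the same `anonymize_value` call can be applied one record at a time. The sketch below assumes only what the snippet above shows, namely that `anonymize_value(value, file_id=...)` returns a `(payload, stats)` pair. The helper `anonymize_jsonl_lines` and the stand-in `fake_anonymize_value` are hypothetical and exist so the example runs without loading any models.

```python
import json


def anonymize_jsonl_lines(lines, anonymize_value, file_id):
    """Yield one anonymized JSON string per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        payload, _stats = anonymize_value(json.loads(line), file_id=file_id)
        yield json.dumps(payload, ensure_ascii=False)


def fake_anonymize_value(value, file_id):
    """Stand-in for session.anonymize_value: redacts every string field."""
    redacted = {
        key: "[REDACTED]" if isinstance(item, str) else item
        for key, item in value.items()
    }
    return redacted, {"file_id": file_id}


sample = [
    '{"title": "Jordan Miller", "n": 1}',
    "",
    '{"title": "Northwind Labs", "n": 2}',
]
for out in anonymize_jsonl_lines(sample, fake_anonymize_value, "demo.jsonl"):
    print(out)
```

With a real session, pass `session.anonymize_value` in place of the stand-in. For whole folders, `anonymize_directory` is likely the simpler route.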

## Related docs

- [text_anonim/README.md](text_anonim/README.md)
- [gpu_backends/README.md](gpu_backends/README.md)
- [whisper.cpp/README.md](whisper.cpp/README.md)

## Notes

- `text_anonim/files` may contain large working datasets
- `whisper.cpp/models/*.bin` are not copied automatically with the project
- `fairseq_env` was intentionally not moved with the standalone package; recreate a local environment if you still need it
