Metadata-Version: 2.4
Name: pleno-dlp
Version: 0.12.0
Summary: Unified DLP scanner for SaaS sources — secret detection (trufflehog, gitleaks, native regex) plus PII detection (pleno-anonymize). API-driven content collection from GitHub, GitLab, Bitbucket, Slack, Notion, Confluence, Jira.
Project-URL: Homepage, https://github.com/plenoai/pleno-dlp
Project-URL: Repository, https://github.com/plenoai/pleno-dlp
Project-URL: Issues, https://github.com/plenoai/pleno-dlp/issues
Author-email: pleno <ai@egahika.dev>
License-Expression: AGPL-3.0-or-later
Keywords: anonymize,dlp,gitleaks,pii,saas,scanner,secrets,trufflehog
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Requires-Python: >=3.12
Requires-Dist: httpx>=0.27
Requires-Dist: rich>=13.9
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.3; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Provides-Extra: pii
Description-Content-Type: text/markdown

# pleno-dlp (Python)

Unified DLP scanner for SaaS content — **secrets** (trufflehog /
gitleaks / native regex) and **PII** (delegating to
[pleno-anonymize](https://github.com/plenoai/pleno-anonymize)).

A *connector* models a SaaS provider (github, gitlab, bitbucket,
slack, notion, confluence, jira) and owns the full lifecycle: it walks
content through the provider's API, **detects** leaks in that content,
and optionally verifies or revokes credentials. Detection happens
*inside the connector*; the engine choice (`native`, `trufflehog`,
`gitleaks`, `pii`) is a per-connector option, not a separate plugin.
Every connector self-describes via `ConnectorSpec.capabilities`
(`SOURCE` + `DETECT` baseline, optional `VERIFY` / `REVOKE`).

`pip install pleno-dlp` pulls one wheel exposing one console script
(`pleno-dlp`). The Go binary in this repo (`cmd/pleno-dlp`) remains
for filesystem-only scans; the Python package is the path forward
for SaaS.

## Install

```sh
uv tool install pleno-dlp
# or
pipx install pleno-dlp

# Add the PII backend (pulls pleno-anonymize):
uv tool install 'pleno-dlp[pii]'
```

## Usage

The CLI is connector-agnostic: knobs flow through the generic
`--option key=value` flag, and the detection engine is picked with
`--engine`. Run `pleno-dlp describe <connector>` for the accepted
keys, types, defaults, and which ones are secrets.

```sh
# Discover what's registered
pleno-dlp list                              # connectors + engines
pleno-dlp list --capability verify          # connectors with VERIFY
pleno-dlp describe github

# Secret scan over an entire GitHub org with the default native engine
GITHUB_TOKEN=ghp_... pleno-dlp scan github --option owner=plenoai

# Scan a single repo, only code, with trufflehog verification
pleno-dlp scan github \
    --option owner=plenoai --option repo=pleno-dlp \
    --option resources=code --engine trufflehog

# Issue + PR conversations only, PII detection (requires pleno-anonymize)
pleno-dlp scan github --option owner=plenoai \
    --option resources=issues,prs --engine pii

# SARIF output for GitHub code-scanning ingestion
pleno-dlp scan github --option owner=plenoai \
    --format sarif > findings.sarif

# Slack workspace — same shape, different source connector
pleno-dlp scan slack --token xoxb-... --option include_threads=false

# Confirm a leaked github PAT is still live
pleno-dlp verify github --token ghp_…
```
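
The repeated `--option key=value` flags above can be read as folding into a plain mapping of connector settings. The sketch below illustrates that idea only; `parse_options` is a hypothetical helper, not the package's actual parser, and it leaves every value as a string (the per-key types reported by `describe` are not modeled).

```python
def parse_options(pairs: list[str]) -> dict[str, str]:
    """Fold repeated key=value strings into a dict; a later key overrides an earlier one."""
    options: dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        options[key] = value
    return options


opts = parse_options(["owner=plenoai", "repo=pleno-dlp", "resources=issues,prs"])
print(opts)
# {'owner': 'plenoai', 'repo': 'pleno-dlp', 'resources': 'issues,prs'}
```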

Auth resolution for github: `--token` → `GITHUB_TOKEN` env var →
`gh auth token`. Anonymous access works for public content but is
rate-limited to 60 req/h. Other source connectors take their token
via `--token`
(shorthand for `--option token=…`) or via `--option api_token=…` /
`--option access_token=…` depending on the auth mode (see
`describe`).

## Detection engines

Engines are the internal scanners that connectors compose with. They are
stateless utilities that turn a `Document.text` into `Finding`s.
Operators do not address them directly; instead, pick one with
`--engine` (or `--option engine=…`), and the connector hands its own
Documents to the chosen engine. Default for every connector: `native`.
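
As a rough picture of what "stateless utility" means here, the sketch below turns a document's text into findings with a couple of regexes. It rests on assumptions: `Document` and `Finding` are simplified stand-ins for the package's types, `scan` is a hypothetical name, the function is synchronous for brevity, and the two patterns are commonly cited token shapes rather than the rules the `native` engine actually ships.

```python
import re
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:  # hypothetical stand-in for the package's Document
    uri: str
    text: str


@dataclass(frozen=True)
class Finding:  # hypothetical stand-in for the package's Finding
    rule: str
    uri: str
    line: int
    match: str


# Illustrative patterns only; the bundled rules may differ.
RULES: dict[str, re.Pattern[str]] = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def scan(doc: Document) -> Iterator[Finding]:
    """Yield one Finding per regex match; keeps no state between calls."""
    for lineno, line in enumerate(doc.text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            for m in pattern.finditer(line):
                yield Finding(rule, doc.uri, lineno, m.group(0))


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
doc = Document("repo/config.env", "region=us-east-1\nkey=AKIAIOSFODNN7EXAMPLE\n")
for finding in scan(doc):
    print(finding.rule, finding.line)
# aws-access-key-id 2
```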

| Engine | Class | Verifies | System dep |
|---|---|---|---|
| trufflehog | secret | yes (per-detector) | `trufflehog` CLI on PATH |
| gitleaks | secret | no | `gitleaks` CLI on PATH |
| native | secret | no | none — bundled regex (AWS, GitHub PAT, Slack bot, OpenAI, Anthropic) |
| pii | PII | n/a | `pleno-anonymize` HTTP API (installed via `pleno-dlp[pii]` extra) |

## Source connectors

Each connector self-describes via a `ConnectorSpec` (auth modes,
resources, options, runtime capabilities). Today: **github**, **gitlab**,
**bitbucket** (cloud + server), **slack** (xoxb / xoxp), **notion**,
**confluence** (cloud + datacenter), **jira** (cloud + datacenter).
Run `pleno-dlp list` for the live list and `pleno-dlp describe <name>`
for the option sheet.

### Capabilities

A connector advertises one or more capabilities:

* `Capability.SOURCE` — implements the `Connector` Protocol
  (`discover` / `fetch` / `capabilities`). Every shipped connector has
  this.
* `Capability.DETECT` — implements the `Detector` Protocol
  (`detect(doc) -> AsyncIterator[Finding]`). Every shipped connector
  has this; the engine choice is configured via
  `--option engine=…`.
* `Capability.VERIFY` — implements the `Verifier` Protocol
  (`verify(secret) -> VerifyResult`). Today: **github** (probes
  `GET /user`).
* `Capability.REVOKE` — implements the `Revoker` Protocol
  (`revoke(secret) -> RevokeResult`). Reserved; no built-in connector
  has this yet — providers without a programmatic revoke endpoint
  should leave it unset and document the manual rotation flow.

`pleno-dlp verify <connector> --token …` exercises `VERIFY`. Exit
codes: `0` = LIVE, `1` = REVOKED, `2` = UNKNOWN/unsupported.
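
The exit-code contract can be sketched as a small mapping from a verification outcome to a process status. The names below (`VerifyStatus`, `classify`, `verify_exit_code`) are hypothetical, and the HTTP mapping (200 is live, 401 is revoked, anything else is unknown) is an assumption about how a `GET /user` probe might be interpreted, not the connector's confirmed logic.

```python
from collections.abc import Callable
from enum import Enum


class VerifyStatus(Enum):  # hypothetical names mirroring the states above
    LIVE = "live"
    REVOKED = "revoked"
    UNKNOWN = "unknown"


# Exit codes as documented: 0 = LIVE, 1 = REVOKED, 2 = UNKNOWN/unsupported.
EXIT_CODES = {VerifyStatus.LIVE: 0, VerifyStatus.REVOKED: 1, VerifyStatus.UNKNOWN: 2}


def classify(http_status: int) -> VerifyStatus:
    """Assumed reading of a probe response: 200 live, 401 revoked, else unknown."""
    if http_status == 200:
        return VerifyStatus.LIVE
    if http_status == 401:
        return VerifyStatus.REVOKED
    return VerifyStatus.UNKNOWN  # e.g. rate limiting or a server error


def verify_exit_code(secret: str, probe: Callable[[str], int]) -> int:
    """Run the probe with the candidate secret and translate the result to an exit code."""
    return EXIT_CODES[classify(probe(secret))]


def fake_probe(secret: str) -> int:
    """Stand-in for the real HTTP call, so the sketch runs offline."""
    return 200 if secret == "live-token" else 401


print(verify_exit_code("live-token", fake_probe))   # 0
print(verify_exit_code("stale-token", fake_probe))  # 1
```

Treating every ambiguous response as UNKNOWN keeps a transient failure from being reported as a revoked credential.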

### Adding a new connector

1. Create `python/src/pleno_dlp/connectors/<name>.py`. Subclass
   `DetectViaEngineMixin` from `pleno_dlp.connectors._detect` so
   `detect()` and the `engine` kwarg come for free.
2. Implement the `Connector` Protocol (`discover`, `fetch`,
   `discover_and_fetch`, `capabilities`, `close`). Keep one
   `httpx.AsyncClient` per instance. Call `self._init_engine(engine)`
   from your `__init__` and `await self._close_engine()` from your
   `close()`. Optionally add `verify(secret)` / `revoke(secret)`
   for lifecycle support.
3. Declare a `spec: ClassVar[ConnectorSpec] = ConnectorSpec(...)`
   with `name`, `kind`, `summary`, `capabilities` (defaults to
   `{SOURCE, DETECT}`; extend with `VERIFY` / `REVOKE` as
   you implement them), `auth_modes`, `resources`, `options` (every
   `__init__` kwarg, including `DETECT_ENGINE_OPTION` from
   `_detect`), and `runtime` (a `Capabilities` describing
   incremental / streaming / concurrency).
4. End the module with `registry.register("<name>", <Class>)`.
5. Wire the import in `pleno_dlp/connectors/__init__.py`.
6. Add fixtures + tests under `python/tests/connectors/test_<name>.py`
   using `httpx.MockTransport`.

Once the spec lands, `pleno-dlp scan <name> --engine <engine>`,
`pleno-dlp verify <name>`, `pleno-dlp list`, and
`pleno-dlp describe` all work without touching the CLI.
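
To make the overall shape concrete, here is a toy, self-contained model of the steps above. Everything in it is a simplified stand-in: `Capability`, `ConnectorSpec`, `Document`, `REGISTRY`, and `WikiConnector` are hypothetical and do not reproduce the real `pleno_dlp` classes or signatures. A real connector would subclass `DetectViaEngineMixin`, talk to its provider through `httpx.AsyncClient`, and call `registry.register`.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass
from enum import Enum
from typing import ClassVar


class Capability(Enum):  # stand-in for the package's capability flags
    SOURCE = "source"
    DETECT = "detect"
    VERIFY = "verify"
    REVOKE = "revoke"


@dataclass(frozen=True)
class ConnectorSpec:  # heavily reduced stand-in; the real spec has more fields
    name: str
    summary: str
    capabilities: frozenset[Capability] = frozenset({Capability.SOURCE, Capability.DETECT})


@dataclass(frozen=True)
class Document:
    uri: str
    text: str


REGISTRY: dict[str, type] = {}  # stand-in for the package registry


class WikiConnector:
    """Toy connector over an in-memory 'wiki' instead of a real HTTP API."""

    spec: ClassVar[ConnectorSpec] = ConnectorSpec(name="wiki", summary="In-memory demo source")

    def __init__(self, pages: dict[str, str], engine: str = "native") -> None:
        self._pages = pages
        self._engine = engine  # recorded only; no real engine is wired in here

    async def discover(self) -> AsyncIterator[str]:
        for uri in self._pages:
            yield uri

    async def fetch(self, uri: str) -> Document:
        return Document(uri, self._pages[uri])

    async def detect(self, doc: Document) -> AsyncIterator[str]:
        # Stand-in for an engine: flag any line containing "password=".
        for lineno, line in enumerate(doc.text.splitlines(), start=1):
            if "password=" in line:
                yield f"{doc.uri}:{lineno}"

    async def close(self) -> None:
        self._pages = {}


REGISTRY["wiki"] = WikiConnector


async def scan(name: str, **options) -> list[str]:
    """Look the connector up by name, then run discover, fetch, and detect over it."""
    connector = REGISTRY[name](**options)
    findings: list[str] = []
    try:
        async for uri in connector.discover():
            doc = await connector.fetch(uri)
            async for finding in connector.detect(doc):
                findings.append(finding)
    finally:
        await connector.close()
    return findings


print(asyncio.run(scan("wiki", pages={"home": "hello\npassword=hunter2"})))
# ['home:2']
```

Because the generic `scan` driver only looks the class up by name and passes options through, registering a new class is enough to make it reachable, which mirrors why the CLI needs no changes.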

## Release

Pushing a `py-vX.Y.Z` tag triggers PyPI trusted publishing via GitHub Actions.
