Metadata-Version: 2.4
Name: evadex
Version: 3.13.1
Summary: Comprehensive DLP evasion test suite — scanner-agnostic, file-aware
License-Expression: MIT
Project-URL: Homepage, https://github.com/tbustenk/evadex
Project-URL: Repository, https://github.com/tbustenk/evadex
Project-URL: Bug Tracker, https://github.com/tbustenk/evadex/issues
Project-URL: Changelog, https://github.com/tbustenk/evadex/blob/main/CHANGELOG.md
Keywords: dlp,security,evasion,testing,compliance,pci-dss,scanner
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Classifier: Topic :: System :: Systems Administration
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click<9,>=8.1
Requires-Dist: httpx<1,>=0.27
Requires-Dist: python-docx<2,>=1.1
Requires-Dist: fpdf2<3,>=2.7.9
Requires-Dist: openpyxl<4,>=3.1
Requires-Dist: jinja2<4,>=3.1
Requires-Dist: rich<14,>=13.0
Requires-Dist: pyyaml<7,>=6.0
Provides-Extra: dev
Requires-Dist: pytest<10,>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio<2,>=0.23; extra == "dev"
Requires-Dist: respx<1,>=0.21; extra == "dev"
Provides-Extra: barcodes
Requires-Dist: qrcode[pil]<9,>=7.4; extra == "barcodes"
Requires-Dist: python-barcode[images]<1,>=0.15; extra == "barcodes"
Requires-Dist: Pillow<13,>=10.0; extra == "barcodes"
Provides-Extra: data-formats
Requires-Dist: pyarrow<24,>=14.0; extra == "data-formats"
Requires-Dist: pandas<4,>=2.0; extra == "data-formats"
Provides-Extra: archives
Requires-Dist: py7zr<2,>=0.20; extra == "archives"
Dynamic: license-file

# evadex

A scanner-agnostic DLP evasion test suite. evadex generates hundreds of obfuscated variants of known-sensitive values and submits them to your DLP scanner to find what slips through — including through file extraction pipelines (DOCX, PDF, XLSX), not just plain-text API calls.

Built and tested with [dlpscan](https://github.com/oxide11/dlpscan); works with any scanner via its adapter interface. Detection rates vary by scanner, configuration, and ruleset — run evadex against your own deployment to see your results.

---

## What it does

evadex takes a sensitive value (a credit card number, SSN, AWS key, etc.), runs it through every evasion technique it knows (Unicode tricks, delimiter manipulation, encoding variants, regional digit scripts, homoglyphs, and more), and records which variants your scanner catches and which it misses.

**Evasion categories:**

| Generator | Techniques |
|---|---|
| `unicode_encoding` | Zero-width chars, fullwidth digits, homoglyphs, NFD/NFC/NFKC/NFKD normalization, HTML entities (decimal + hex), URL encoding (full, digits-only, mixed) |
| `delimiter` | Space, hyphen, dot, slash, tab, newline, mixed, doubled, none |
| `splitting` | Mid-value line break, HTML/CSS comment injection, prefix/suffix noise, JSON field split, whitespace padding, XML wrapping |
| `leetspeak` | Minimal, moderate, and aggressive substitution tiers |
| `regional_digits` | Arabic-Indic, Extended Arabic-Indic, Devanagari, Bengali, Thai, Myanmar, Khmer, Mongolian, NKo, Tibetan — plus mixed-script variants |
| `structural` | Left/right padding (spaces + zeros), noise embedding, partial values, case variation, repeated value |
| `encoding` | Base64 (standard, URL-safe, no-padding, MIME line-breaks, partial, double), ROT13, full/group reversal, double URL encoding, mixed NFD/NFC/NFKD normalization |
| `context_injection` | Value wrapped in email body, JSON record, XML element, CSV row, SQL snippet, and more |
| `unicode_whitespace` | Spaces replaced with NBSP, en-space, em-space, or a mixed pattern |
| `bidirectional` | Unicode bidirectional control characters (RLO, LRO, RLE, RLI, ALM) injected around or within the value |
| `soft_hyphen` | Soft hyphen (U+00AD) and word joiner (U+2060) inserted at group boundaries or between every character |
| `morse_code` | Digits encoded as International Morse Code — space-separated, slash-separated, concatenated, or newline-separated; applies to `credit_card`, `ssn`, `sin`, `iban`, `phone`, and related numeric categories |
| `encoding_chains` | Chained multi-step encodings: `base64(rot13)`, `base64(hex)`, `hex(base64)`, `rot13(base64)`, `url(base64)`, `base64(base64)`, and the triple chain `base64(rot13(hex))` — defeats scanners that only decode one layer |
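
As a rough illustration of how a few of these techniques behave, the sketch below builds several variants of a card number and runs them past a toy regex detector. The function names, the `toy_scanner` regex, and the choice of zero-width space are assumptions made for this example; they are not evadex's internal API and do not represent any real scanner.

```python
import base64
import codecs
import re

ZWSP = "\u200b"  # zero-width space (assumed choice; other zero-width characters exist)


def zero_width(value: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZWSP.join(value)


def fullwidth_digits(value: str) -> str:
    """Map ASCII digits 0-9 to fullwidth digits U+FF10..U+FF19."""
    return "".join(
        chr(0xFF10 + int(ch)) if ch in "0123456789" else ch for ch in value
    )


def hyphen_groups(value: str, size: int = 4) -> str:
    """Split the value into fixed-size groups joined by hyphens."""
    return "-".join(value[i:i + size] for i in range(0, len(value), size))


def base64_rot13_hex(value: str) -> str:
    """Triple chain: hex-encode, then ROT13, then Base64."""
    hexed = value.encode("utf-8").hex()
    rotated = codecs.encode(hexed, "rot13")  # shifts hex letters a-f; digits pass through
    return base64.b64encode(rotated.encode("ascii")).decode("ascii")


def toy_scanner(text: str) -> bool:
    """Hypothetical detector: 16 ASCII digits, optionally grouped by hyphen or space."""
    return re.search(r"\b(?:\d{4}[- ]?){3}\d{4}\b", text, flags=re.ASCII) is not None


card = "4532015112830366"
variants = {
    "original": card,
    "hyphen_groups": hyphen_groups(card),
    "zero_width": zero_width(card),
    "fullwidth_digits": fullwidth_digits(card),
    "base64(rot13(hex))": base64_rot13_hex(card),
}
for name, variant in variants.items():
    status = "detected" if toy_scanner(variant) else "missed"
    print(f"{name:20} {status}")
```

A scanner that normalizes Unicode or decodes nested layers before matching would behave differently from this toy regex, which is exactly the gap evadex is meant to measure.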

**Submission strategies** (for the dlpscan-cli adapter):

Each variant is tested four ways by default: as plain text, embedded in a DOCX, embedded in a PDF, and embedded in an XLSX. This exercises your scanner's file extraction pipeline, not just its regex layer.

**Built-in test payloads:**

Payloads are classified as **structured** or **heuristic** — see [Structured vs heuristic categories](#structured-vs-heuristic-categories) below.

554 payloads across 489 categories covering **489/557 sub-patterns** (88%) of the dlpscan-rs pattern library, with 421 structured categories confirmed detected by seed scan. See [Coverage](#coverage) for a breakdown by sub-pattern.

#### North America

| Label | Value | Category | Type |
|---|---|---|---|
| Visa 16-digit | `4532015112830366` | `credit_card` | structured |
| Amex 15-digit | `378282246310005` | `credit_card` | structured |
| Mastercard 16-digit | `5105105105105100` | `credit_card` | structured |
| Discover 16-digit | `6011111111111117` | `credit_card` | structured |
| JCB 16-digit | `3530111333300000` | `credit_card` | structured |
| UnionPay 16-digit | `6250941006528599` | `credit_card` | structured |
| Diners Club 14-digit | `30569309025904` | `credit_card` | structured |
| US SSN | `123-45-6789` | `ssn` | structured |
| US ITIN | `912-34-5678` | `us_itin` | structured |
| US EIN | `12-3456789` | `us_ein` | structured |
| US Medicare Beneficiary ID | `1EG4-TE5-MK72` | `us_mbi` | structured |
| US Passport | `340000136` | `us_passport` | structured |
| US state driver's licences (51) | one per state + DC | `us_dl` | structured |
| Canada SIN | `046 454 286` | `sin` | structured |
| Canadian passport | `AB123456` | `ca_passport` | structured |
| Quebec RAMQ health card | `BOUD 1234 5678` | `ca_ramq` | structured |
| Ontario health card | `1234-567-890-AB` | `ca_ontario_health` | structured |
| BC CareCard | `9123456789` | `ca_bc_carecard` | structured |
| Alberta health card | `123456789` | `ca_ab_health` | structured |
| Manitoba health card | `987654321` | `ca_mb_health` | structured |
| Saskatchewan health card | `234567890` | `ca_sk_health` | structured |
| Nova Scotia health card | `1234 567 890` | `ca_ns_health` | structured |
| New Brunswick health card | `1234567890` | `ca_nb_health` | structured |
| PEI health card | `123456789012` | `ca_pei_health` | structured |
| Newfoundland health card | `9876543210` | `ca_nl_health` | structured |
| Quebec driver's licence | `B123456789012` | `ca_qc_drivers` | structured |
| Ontario driver's licence | `A1234-56789-01234` | `ca_on_drivers` | structured |
| BC driver's licence | `1234567` | `ca_bc_drivers` | structured |
| Manitoba driver's licence | `AB-123-456-789` | `ca_mb_drivers` | structured |
| Saskatchewan driver's licence | `12345678` | `ca_sk_drivers` | structured |
| Nova Scotia driver's licence | `AB1234567` | `ca_ns_drivers` | structured |
| New Brunswick driver's licence | `1234567` | `ca_nb_drivers` | structured |
| PEI driver's licence | `123456` | `ca_pei_drivers` | structured |
| Newfoundland driver's licence | `A123456789` | `ca_nl_drivers` | structured |
| Canadian Business Number | `111222333` | `ca_business_number` | structured |
| Canadian GST/HST registration | `111222333RT0001` | `ca_gst_hst` | structured |
| Canadian transit/routing number | `12345-678` | `ca_transit_number` | structured |
| Canadian bank account | `12345678` | `ca_bank_account` | structured |
| Mexico CURP | `BADD110313HCMLNS09` | `mx_curp` | structured |

#### Europe

| Label | Value | Category | Type |
|---|---|---|---|
| UK IBAN | `GB82WEST12345698765432` | `iban` | structured |
| Germany IBAN | `DE89370400440532013000` | `iban` | structured |
| France IBAN | `FR7630006000011234567890189` | `iban` | structured |
| Spain IBAN | `ES9121000418450200051332` | `iban` | structured |
| SWIFT/BIC code | `DEUTDEDB` | `swift_bic` | structured |
| ABA routing number | `021000021` | `aba_routing` | structured |
| UK National Insurance Number | `AB123456C` | `uk_nin` | structured |
| UK driving licence | `MORGA753116SM9IJ` | `uk_dl` | structured |
| German Personalausweis | `L01X00T47` | `de_id` | structured |
| Germany Steuer-IdNr | `86095742719` | `de_tax_id` | structured |
| French CNI | `880692310285` | `fr_cni` | structured |
| France INSEE (NIR) | `282097505604213` | `fr_insee` | structured |
| Spanish DNI | `12345678Z` | `es_dni` | structured |
| Italian Codice Fiscale | `RSSMRA85T10A562S` | `it_cf` | structured |
| Dutch BSN | `111222333` | `nl_bsn` | structured |
| Swedish Personnummer | `811228-9874` | `se_pin` | structured |
| Norwegian Fødselsnummer | `01010112345` | `no_fnr` | structured |
| Finnish Henkilötunnus | `131052-308T` | `fi_hetu` | structured |
| Polish PESEL | `44051401458` | `pl_pesel` | structured |
| Swiss AHV | `756.1234.5678.97` | `ch_ahv` | structured |
| Austria social insurance | `1234-010150` | `at_svn` | structured |
| Belgium National Register Number | `85.01.01-234.56` | `be_nrn` | structured |
| Bulgaria EGN | `8501010001` | `bg_egn` | structured |
| Croatia OIB | `12345678901` | `hr_oib` | structured |
| Cyprus tax ID | `12345678A` | `cy_tin` | structured |
| Czech birth number | `850101/1234` | `cz_rc` | structured |
| Denmark CPR | `010185-1234` | `dk_cpr` | structured |
| Estonia personal code | `38501010002` | `ee_ik` | structured |
| EU VAT number | `DE123456789` | `eu_vat` | structured |
| Greece AMKA | `01018512345` | `gr_amka` | structured |
| Hungary TAJ | `123 456 789` | `hu_taj` | structured |
| Iceland kennitala | `010185-1234` | `is_kt` | structured |
| Ireland PPS number | `1234567A` | `ie_pps` | structured |
| Latvia personal code | `010185-12345` | `lv_pk` | structured |
| Liechtenstein passport | `A12345` | `li_pp` | structured |
| Lithuania personal code | `38501010002` | `lt_ak` | structured |
| Luxembourg national ID | `1985012312345` | `lu_nin` | structured |
| Malta identity card | `12345A` | `mt_id` | structured |
| Portugal NIF | `123456789` | `pt_nif` | structured |
| Romania CNP | `1850101123456` | `ro_cnp` | structured |
| Slovakia birth number | `850101/1234` | `sk_bn` | structured |
| Slovenia EMSO | `0101850500003` | `si_emso` | structured |
| Turkey TC identity | `12345678901` | `tr_tc` | structured |

#### Asia-Pacific

| Label | Value | Category | Type |
|---|---|---|---|
| Australia TFN | `123 456 78` | `au_tfn` | structured |
| Australian Medicare card | `2123456701` | `au_medicare` | structured |
| Australian passport | `PA1234567` | `au_passport` | structured |
| New Zealand IRD | `123456789` | `nz_ird` | structured |
| Singapore NRIC | `S1234567D` | `sg_nric` | structured |
| Hong Kong HKID | `A123456(3)` | `hk_hkid` | structured |
| Japanese My Number | `123456789012` | `jp_my_number` | structured |
| Indian Aadhaar | `2345 6789 0123` | `in_aadhaar` | structured |
| Indian PAN | `ABCDE1234F` | `in_pan` | structured |
| Bangladesh National ID | `1234567890` | `bd_nid` | structured |
| Indonesia NIK | `3201234567890001` | `id_nik` | structured |
| Malaysia MyKad | `850101-01-1234` | `my_mykad` | structured |
| Pakistan CNIC | `12345-1234567-1` | `pk_cnic` | structured |
| Philippines PhilSys | `1234-5678-9012` | `ph_philsys` | structured |
| South Korea RRN | `880101-1234567` | `kr_rrn` | structured |
| Sri Lanka NIC | `123456789V` | `lk_nic` | structured |
| Thailand national ID | `1-1001-00001-85-1` | `th_nid` | structured |
| Vietnam CCCD | `001012345678` | `vn_cccd` | structured |

#### Latin America

| Label | Value | Category | Type |
|---|---|---|---|
| Brazilian CPF | `123.456.789-09` | `br_cpf` | structured |
| Brazilian CNPJ | `11.222.333/0001-81` | `br_cnpj` | structured |
| Argentine DNI | `12345678` | `ar_dni` | structured |
| Chilean RUT | `12.345.678-9` | `cl_rut` | structured |
| Colombia cédula | `123.456.789-0` | `co_cedula` | structured |
| Costa Rica cédula | `1-0123-0456` | `cr_cedula` | structured |
| Ecuador cédula | `1234567890` | `ec_cedula` | structured |
| Paraguay RUC | `12345678-9` | `py_ruc` | structured |
| Peru DNI | `12345678` | `pe_dni` | structured |
| Uruguay cédula | `1.234.567-8` | `uy_ci` | structured |
| Venezuela cédula | `V-12345678` | `ve_cedula` | structured |

#### Middle East & Africa

| Label | Value | Category | Type |
|---|---|---|---|
| UAE Emirates ID | `784-1234-1234567-1` | `uae_eid` | structured |
| Saudi National ID | `1234567890` | `sa_nid` | structured |
| South African ID | `9202204720082` | `za_id` | structured |
| Israeli Teudat Zehut | `123456782` | `il_id` | structured |
| Bahrain CPR | `850101234` | `bh_cpr` | structured |
| Iran Melli code | `1234567890` | `ir_melli` | structured |
| Iraq national ID | `123456789012` | `iq_nid` | structured |
| Jordan national ID | `9001012345` | `jo_nid` | structured |
| Kuwait civil ID | `285010112345` | `kw_civil` | structured |
| Lebanon passport | `RL123456` | `lb_pp` | structured |
| Qatar QID | `28501011234` | `qa_qid` | structured |

#### Africa

| Label | Value | Category | Type |
|---|---|---|---|
| Egypt National ID | `28503251234567` | `eg_nid` | structured |
| Ethiopia passport | `EP1234567` | `et_passport` | structured |
| Ghana card | `GHA-123456789-1` | `gh_card` | structured |
| Kenya KRA PIN | `A123456789B` | `ke_kra` | structured |
| Morocco CIN | `AB12345` | `ma_cin` | structured |
| Nigeria BVN | `12345678901` | `ng_bvn` | structured |
| Tanzania NIDA | `12345678901234567890` | `tz_nida` | structured |
| Tunisia CIN | `12345678` | `tn_cin` | structured |
| Uganda NIN | `CM12345678ABCD` | `ug_nin` | structured |

#### Functional

| Label | Value | Category | Type |
|---|---|---|---|
| Session token (32-char hex) | `abc123def456abc123def456abc123de` | `session_id` | structured |
| PIN block (ISO format 0) | `0123456789ABCDEF` | `pin_block` | structured |
| Biometric ID (UUID-style) | `12345678-ABCD-1234-EFGH-123456789ABC` | `biometric_id` | structured |
| Card expiry | `12/26` | `card_expiry` | structured |
| Card track 1 | `%B4532015112830366^SMITH/JOHN^2512101000000000?` | `card_track` | structured |
| MICR check line | `⑈021000021⑈ 123456789012 1234` | `micr` | structured |
| Financial amount | `USD 12,345.67` | `financial_amount` | structured |
| ISO 8601 date | `2024-01-15` | `date_iso` | structured |
| SIM ICCID | `89014103211118510720` | `iccid` | structured |
| Educational email | `john.smith@mit.edu` | `edu_email` | structured |
| Employee ID | `EMP1234567` | `employee_id` | structured |
| GPS coordinates | `40.7128,-74.0060` | `gps_coords` | structured |
| Insurance policy number | `POL123456789` | `insurance_policy` | structured |
| Bank reference | `ACCT12345678` | `bank_ref` | structured |
| Legal case number | `1:24-cv-12345` | `legal_case` | structured |
| Loan/mortgage number | `ABCD00123456789012345678` | `loan_number` | structured |
| National Drug Code | `0069-3190-03` | `ndc_code` | structured |
| Date of birth | `01/15/1985` | `dob` | structured |
| Postal code | `SW1A 1AA` | `postal_code` | structured |
| Masked PAN | `4532 XXXX XXXX 0366` | `masked_pan` | structured |
| Property parcel number | `123-456-789` | `parcel_number` | structured |
| AML case ID | `AML-123456789` | `aml_case_id` | structured |
| ISIN | `US0378331005` | `isin` | structured |
| Twitter/X handle | `@johnsmith` | `twitter_handle` | structured |
| URL with embedded credentials | `https://admin:password123@example.com/api` | `url_with_creds` | structured |
| Vehicle Identification Number | `1HGBH41JXMN109186` | `vin` | structured |
| Fedwire IMAD | `20240101AAAA12345678001234` | `fedwire_imad` | structured |

#### Global

| Label | Value | Category | Type |
|---|---|---|---|
| Bitcoin legacy address | `1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2` | `bitcoin` | structured |
| Ethereum address | `0x742d35Cc6634C0532925a3b844Bc454e4438f44e` | `ethereum` | structured |
| Email address | `test.user@example.com` | `email` | structured |
| US phone number | `+1-555-867-5309` | `phone` | structured |
| AWS Access Key ID | `AKIAIOSFODNN7EXAMPLE` | `aws_key` | heuristic |
| GitHub classic token | `ghp_16C7e42F292c6912E7710c838347Ae178B4a` | `github_token` | heuristic |
| Stripe test secret key | `sk_test_4eC39HqLyjWDarjtT7en6bh8Xy9mPqZ` | `stripe_key` | heuristic |
| Slack bot token | `xoxb-EXAMPLE-BOTTOKEN-abc123def` | `slack_token` | heuristic |
| Sample JWT | *(compact JWT string)* | `jwt` | heuristic |
| Top Secret classification label | `TOP SECRET` | `classification` | heuristic |
| HIPAA privacy label | `HIPAA` | `classification` | heuristic |
| Corporate confidential label | `Company Confidential` | `corp_classification` | heuristic |
| MNPI label | `MNPI` | `mnpi` | heuristic |
| Cardholder name (PCI) | `John Smith` | `cardholder_name` | heuristic |
| Privacy/compliance label | `PCI-DSS` | `privacy_label` | heuristic |
| Attorney-client privilege marker | `Attorney-Client Privileged` | `attorney_client` | heuristic |
| Confidential supervisory info | `Confidential Supervisory Information` | `supervisory_info` | heuristic |
| Random 32-char API key | `xK9mP2nL4qR7vT1w…` | `random_api_key` | heuristic (entropy) |
| Random 48-char base64url token | `eyJhbGciOiJIUzI1NiJ9.dGVzdHBheWxvYWQ…` | `random_token` | heuristic (entropy) |
| Random 64-char hex secret | `a3f8c2e1d4b7a9f0…` | `random_secret` | heuristic (entropy) |
| Base64-encoded credential | `dXNlcm5hbWU6…` | `encoded_credential` | heuristic (entropy) |
| Assignment-form secret | `DATABASE_PASSWORD=xK9mP2nL4qR7vT1w…` | `assignment_secret` | heuristic (entropy) |
| Gated secret | `api_key: xK9mP2nL4qR7vT1w…` | `gated_secret` | heuristic (entropy) |

Heuristic payloads are excluded from the default scan. Use `--include-heuristic` to include them. The `entropy`-labeled categories also have their own dedicated test harness: see [`evadex entropy`](#entropy-mode-testing).

---

## Canadian French support

evadex generates test content in Canadian French (`fr-CA`) so you can verify that your DLP scanner catches sensitive data when surrounded by French-language business text — a common real-world condition in Canadian financial institutions.

### French keyword context

The following Canadian French keywords are used as surrounding context in generated documents and evasion variants:

| Category | Keywords |
|---|---|
| `credit_card` | *carte de crédit*, *numéro de carte*, *mon numéro de carte est*, *carte bancaire*, *numéro de carte bancaire*, *paiement par carte* |
| `sin` | *numéro d'assurance sociale*, *NAS*, *mon NAS est*, *assurance sociale* |
| `iban` | *numéro de compte*, *virement bancaire*, *coordonnées bancaires*, *relevé bancaire* |
| `email` | *courriel*, *adresse courriel*, *mon courriel est* |
| `phone` | *numéro de téléphone*, *composez le*, *téléphone*, *cellulaire* |
| all categories | *renseignements personnels*, *données confidentielles*, *informations personnelles*, *vie privée* |

French keywords are active in two places:
1. **`context_injection` variants** — 10 additional French CA sentence templates are generated alongside the standard English ones during `evadex scan`.
2. **`splitting` variants** — French noise text is prepended/appended in `fr_ca_prefix_noise` and `fr_ca_suffix_noise` variants.
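
The two mechanisms above can be sketched as follows. The template sentences here are hypothetical, assembled from the keyword table in this section; evadex's actual template text and internal function names may differ. Only the variant names `fr_ca_prefix_noise` and `fr_ca_suffix_noise` come from the description above.

```python
# Hypothetical templates built from the keyword table; not evadex's real templates.
FR_CA_TEMPLATES = {
    "credit_card": [
        "Mon numéro de carte est {value}.",
        "Paiement par carte : {value}",
    ],
    "sin": [
        "Mon NAS est {value}.",
        "Numéro d'assurance sociale : {value}",
    ],
    "email": ["Mon courriel est {value}."],
}
# Generic context that applies to every category.
FR_CA_GENERIC = "Renseignements personnels : {value}"
# Assumed noise text for the splitting variants.
FR_CA_NOISE = "Données confidentielles, vie privée."


def fr_ca_context(value: str, category: str) -> list[str]:
    """Wrap the value in each category-specific sentence plus the generic one."""
    templates = FR_CA_TEMPLATES.get(category, []) + [FR_CA_GENERIC]
    return [template.format(value=value) for template in templates]


def fr_ca_prefix_noise(value: str) -> str:
    """Prepend French noise text to the value."""
    return f"{FR_CA_NOISE} {value}"


def fr_ca_suffix_noise(value: str) -> str:
    """Append French noise text to the value."""
    return f"{value} {FR_CA_NOISE}"


for sentence in fr_ca_context("046 454 286", "sin"):
    print(sentence)
print(fr_ca_prefix_noise("046 454 286"))
```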

### `--language fr-CA`

Pass `--language fr-CA` to the `generate` command to produce test documents with French keyword context sentences:

```bash
evadex generate --format docx --category credit_card --category sin \
  --count 200 --language fr-CA --output test_fr_ca.docx

evadex generate --format csv --category ca_ramq --count 500 \
  --language fr-CA --output ramq_fr.csv
```

Without `--language`, the default is English (`en`).

---

## False positive rate and the `--require-context` tradeoff

The `evadex falsepos` command generates structurally plausible but provably invalid values (Luhn-failing credit card numbers, SSNs with reserved area codes, IBANs with wrong check digits, etc.) and submits them to your scanner. Any match is a false positive.
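
To make "provably invalid" concrete for the credit card case, here is a minimal sketch that builds a card-shaped number and then deliberately sets a wrong Luhn check digit. The helper names and the seeding approach are assumptions for illustration; this is not necessarily how `evadex falsepos` generates its values.

```python
import random


def luhn_valid(number: str) -> bool:
    """Return True if the digits of `number` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def luhn_invalid_card(rng: random.Random, prefix: str = "4", length: int = 16) -> str:
    """Build a card-shaped number whose final digit is guaranteed to fail Luhn."""
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(body_len))
    # Exactly one check digit makes the number valid; pick any of the other nine.
    correct = next(d for d in range(10) if luhn_valid(body + str(d)))
    wrong = (correct + rng.randrange(1, 10)) % 10
    return body + str(wrong)


rng = random.Random(0)  # fixed seed so a run can be repeated
for _ in range(3):
    print(luhn_invalid_card(rng))
```

Because the Luhn check digit is unique for a given body, shifting it by any non-zero amount always yields an invalid number, so every generated value is a guaranteed negative.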

### What we measured

Three conditions were tested against dlpscan-rs with 100 values per category (7 categories, 700 total):

| Condition | What the scanner receives | What the scanner does |
|---|---|---|
| **Baseline** | Bare invalid value — `4123456789012341` | Matches on structure alone |
| **+`--require-context`** | Bare invalid value — `4123456789012341` | Requires surrounding keywords |
| **+`--wrap-context` + `--require-context`** | Invalid value inside a keyword sentence — `"Please charge my credit card number 4123456789012341 for..."` | Has both pattern match and keyword context |

### Results — false positive rates

| Category | Baseline | `--require-context` | `--wrap-context` + `--require-context` |
|---|---|---|---|
| `credit_card` | 100.0% | 100.0% | 100.0% |
| `ssn` | 100.0% | 100.0% | 100.0% |
| `sin` | 100.0% | 100.0% | 100.0% |
| `iban` | 100.0% | 100.0% | 100.0% |
| `phone` | 100.0% | 100.0% | 100.0% |
| `email` | 95.0% | 98.0% | 100.0% |
| `ca_ramq` | 99.0% | 99.0% | 100.0% |
| **Overall** | **99.1%** | **99.6%** | **100.0%** |

*100 values per category, seed=default, dlpscan-rs rust adapter, text strategy.*
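
The Overall row can be re-derived from the per-category rows. Because every category contributes the same number of values, the overall rate is the plain mean of the category rates. The snippet below is only a sanity check on the table; the variable and function names are made up for this example.

```python
# False positive rates (%) copied from the table above, in column order:
# (baseline, --require-context, --wrap-context + --require-context)
FP_RATES = {
    "credit_card": (100.0, 100.0, 100.0),
    "ssn": (100.0, 100.0, 100.0),
    "sin": (100.0, 100.0, 100.0),
    "iban": (100.0, 100.0, 100.0),
    "phone": (100.0, 100.0, 100.0),
    "email": (95.0, 98.0, 100.0),
    "ca_ramq": (99.0, 99.0, 100.0),
}
VALUES_PER_CATEGORY = 100


def overall_rate(column: int) -> float:
    """Mean rate across categories, rounded to one decimal place."""
    rates = [row[column] for row in FP_RATES.values()]
    return round(sum(rates) / len(rates), 1)


def flagged_count(column: int) -> int:
    """Number of invalid values the scanner flagged in the given condition."""
    return round(
        sum(row[column] * VALUES_PER_CATEGORY / 100 for row in FP_RATES.values())
    )


for index, name in enumerate(("baseline", "require-context", "wrap + require")):
    print(f"{name:16} {overall_rate(index):5.1f}%  ({flagged_count(index)} of 700)")
```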

### Key findings

**`--require-context` does not reduce false positives for structurally-similar invalid values.**
The FP rate is effectively unchanged between the baseline (99.1%) and require-context (99.6%) runs: the gap is 3 values out of 700, which is within normal statistical noise. dlpscan-rs matches on value *structure* (digit count, prefix, format), not on semantic validity (Luhn check, reserved area codes, mod-97 checksum), and the context requirement does not gate out a value when the pattern match itself is highly confident.

**Adding keyword context makes it worse, not better.**
When invalid values are embedded in realistic keyword sentences (`--wrap-context`), the FP rate rises to 100.0%. This is the most realistic production scenario — real documents that contain a string resembling a credit card number will almost always have surrounding financial language — and it confirms the scanner flags all structurally-plausible values regardless of validity.

**The FP problem is in the pattern layer, not the context layer.**
Reducing false positives against dlpscan-rs requires the scanner to perform checksum validation (Luhn for credit cards and SINs, mod-97 for IBANs, reserved-code filtering for SSNs), not keyword-context gating. `--require-context` is an effective tool for reducing noisy matches in free-form text, but it cannot help when the pattern match itself is the source of the false positive.
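
For readers unfamiliar with these checks, the sketch below shows what checksum validation looks like for two of the categories: the mod-97 test for IBANs and reserved-code filtering for SSNs (Luhn follows the same idea). This is not dlpscan-rs code. The function names and the `keep_finding` dispatcher are hypothetical, and the block only illustrates the kind of validation step the paragraph above refers to.

```python
import re


def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test the remainder."""
    compact = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}", compact):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def ssn_plausible(ssn: str) -> bool:
    """Reject SSNs whose area, group, or serial falls in a never-issued range."""
    match = re.fullmatch(r"([0-9]{3})-?([0-9]{2})-?([0-9]{4})", ssn)
    if match is None:
        return False
    area, group, serial = match.groups()
    if area in {"000", "666"} or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


# Hypothetical dispatcher: categories without a validator are passed through.
VALIDATORS = {"iban": iban_valid, "ssn": ssn_plausible}


def keep_finding(category: str, value: str) -> bool:
    """Decide whether a pattern match should survive validation."""
    validator = VALIDATORS.get(category)
    return True if validator is None else validator(value)


print(keep_finding("iban", "GB82WEST12345698765432"))
print(keep_finding("iban", "GB83WEST12345698765432"))
print(keep_finding("ssn", "666-12-3456"))
```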

### Detection rate tradeoff

To quantify the cost of enabling `--require-context` on real evasion testing, we ran the evadex evasion suite (credit card, SSN, SIN, IBAN — text strategy) under both conditions:

| | Baseline | `--require-context` | Delta |
|---|---|---|---|
| **Overall detection rate** | **94.1%** | **94.0%** | **−0.1 pp** |

Per-technique breakdown:

| Technique | Baseline DR | `--require-context` DR | Delta |
|---|---|---|---|
| `bidirectional` | 100.0% | 100.0% | 0.0 pp |
| `context_injection` | 100.0% | 100.0% | 0.0 pp |
| `delimiter` | 100.0% | 99.1% | −0.9 pp |
| `encoding` | 85.0% | 90.3% | **+5.3 pp** |
| `encoding_chains` | 72.5% | 65.9% | **−6.6 pp** |
| `morse_code` | 65.4% | 55.8% | **−9.6 pp** |
| `regional_digits` | 100.0% | 100.0% | 0.0 pp |
| `soft_hyphen` | 100.0% | 100.0% | 0.0 pp |
| `splitting` | 100.0% | 100.0% | 0.0 pp |
| `structural` | 94.2% | 92.8% | −1.4 pp |
| `unicode_encoding` | 94.6% | 95.4% | +0.8 pp |
| `unicode_whitespace` | 100.0% | 100.0% | 0.0 pp |

**`--require-context` reduces detection of obfuscated forms the most.** Morse code (−9.6 pp) and encoding chains (−6.6 pp) suffer the largest drops. These techniques produce output that contains no recognizable keyword context, so the scanner's context requirement causes it to skip matches it would otherwise make. Conversely, single-layer encoding improves by 5.3 pp, possibly because the decoded context now satisfies the keyword requirement.
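
To see why a Morse variant offers nothing for a keyword gate to latch onto, consider this small sketch. The digit table is standard International Morse Code; the function names and the keyword list are assumptions for illustration, not evadex or dlpscan-rs internals.

```python
MORSE_DIGITS = {
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}
FROM_MORSE = {code: digit for digit, code in MORSE_DIGITS.items()}


def to_morse(value: str, sep: str = " ") -> str:
    """Encode the digits of `value`; non-digit characters are dropped."""
    return sep.join(MORSE_DIGITS[ch] for ch in value if ch in MORSE_DIGITS)


def from_morse(encoded: str, sep: str = " ") -> str:
    """Decode a separator-delimited Morse string back to digits."""
    return "".join(FROM_MORSE[token] for token in encoded.split(sep) if token)


def has_keyword_context(text: str, keywords: tuple[str, ...]) -> bool:
    """Hypothetical context gate: is any keyword present in the text?"""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


encoded = to_morse("123-45-6789")
print(encoded)
print(from_morse(encoded))
print(has_keyword_context(encoded, ("ssn", "social security")))
```

The encoded string consists only of dots, dashes, and separators, so a scanner that decodes Morse but also demands nearby keywords would find the value and then discard it.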

### Recommendation for compliance teams

> dlpscan-rs's ~99% false positive rate on structurally-plausible invalid values is a fundamental property of its **pattern-first** detection model. It is intentional: the scanner is tuned for high recall (catch everything) rather than high precision (avoid flagging invalid data).

For production deployments:

- **Do not rely on `--require-context` to reduce false positives from structurally plausible values.** It has a negligible effect on FP rates for such values, and it costs real detection rate on obfuscated variants (especially Morse code and multi-layer encoding).
- **If false positive rate is a concern**, the appropriate mitigation is downstream triage (review queue, confidence thresholding) rather than scanner-level context gating.
- **For evasion testing specifically**, run `evadex scan` without `--require-context`. The baseline detection rate (94.1% on these categories) represents the scanner's real-world behavior for the majority of documents.
- **`--require-context` is most useful** when scanning large repositories of generic text where you want to reduce noise from coincidental pattern matches — not when testing against structured financial data.

### Reproducing the results

```bash
# Baseline FP test
evadex falsepos --tool dlpscan-cli \
  --exe /path/to/dlpscan --cmd-style rust \
  --count 100 --format json -o falsepos_baseline.json

# With require-context (scanner-side flag)
evadex falsepos --tool dlpscan-cli \
  --exe /path/to/dlpscan --cmd-style rust \
  --count 100 --require-context --format json -o falsepos_require_context.json

# Most realistic: invalid values embedded in keyword context, with require-context
evadex falsepos --tool dlpscan-cli \
  --exe /path/to/dlpscan --cmd-style rust \
  --count 100 --wrap-context --require-context --format json -o falsepos_full_context.json

# Evasion scan detection rate without require-context
evadex scan --tool dlpscan-cli \
  --exe /path/to/dlpscan --cmd-style rust \
  --strategy text --category credit_card --category ssn \
  --category sin --category iban --format json -o evasion_baseline.json

# Evasion scan with require-context (detection rate tradeoff)
evadex scan --tool dlpscan-cli \
  --exe /path/to/dlpscan --cmd-style rust \
  --strategy text --category credit_card --category ssn \
  --category sin --category iban \
  --require-context --format json -o evasion_require_context.json
```

---

## Structured vs heuristic categories

evadex classifies its built-in payload categories into two groups:

**Structured** — formats with well-defined, mathematically or syntactically validatable patterns. DLP scanners typically enforce these patterns precisely (e.g., Luhn check on credit cards, fixed-length digit groups for SSN/SIN, checksum-verified IBAN). Evasion results in this group reflect meaningful signal: a variant that evades detection is a real gap in coverage.

Categories: `credit_card`, `ssn`, `sin`, `us_itin`, `us_ein`, `us_mbi`, `us_dl`, `us_passport`, `iban`, `swift_bic`, `aba_routing`, `bitcoin`, `ethereum`, `au_tfn`, `au_medicare`, `au_passport`, `de_tax_id`, `de_id`, `fr_insee`, `fr_cni`, `uk_nin`, `uk_dl`, `es_dni`, `it_cf`, `nl_bsn`, `se_pin`, `no_fnr`, `fi_hetu`, `pl_pesel`, `ch_ahv`, `at_svn`, `be_nrn`, `bg_egn`, `hr_oib`, `cy_tin`, `cz_rc`, `dk_cpr`, `ee_ik`, `eu_vat`, `gr_amka`, `hu_taj`, `is_kt`, `ie_pps`, `lv_pk`, `li_pp`, `lt_ak`, `lu_nin`, `mt_id`, `pt_nif`, `ro_cnp`, `sk_bn`, `si_emso`, `tr_tc`, `nz_ird`, `sg_nric`, `hk_hkid`, `jp_my_number`, `in_aadhaar`, `in_pan`, `bd_nid`, `id_nik`, `my_mykad`, `pk_cnic`, `ph_philsys`, `kr_rrn`, `lk_nic`, `th_nid`, `vn_cccd`, `br_cpf`, `br_cnpj`, `mx_curp`, `ar_dni`, `cl_rut`, `co_cedula`, `cr_cedula`, `ec_cedula`, `py_ruc`, `pe_dni`, `uy_ci`, `ve_cedula`, `uae_eid`, `sa_nid`, `za_id`, `il_id`, `bh_cpr`, `ir_melli`, `iq_nid`, `jo_nid`, `kw_civil`, `lb_pp`, `qa_qid`, `eg_nid`, `et_passport`, `gh_card`, `ke_kra`, `ma_cin`, `ng_bvn`, `tz_nida`, `tn_cin`, `ug_nin`, `email`, `phone`, `ca_ramq`, `ca_ontario_health`, `ca_bc_carecard`, `ca_ab_health`, `ca_qc_drivers`, `ca_on_drivers`, `ca_bc_drivers`, `ca_passport`, `ca_mb_health`, `ca_sk_health`, `ca_ns_health`, `ca_nb_health`, `ca_pei_health`, `ca_nl_health`, `ca_mb_drivers`, `ca_sk_drivers`, `ca_ns_drivers`, `ca_nb_drivers`, `ca_pei_drivers`, `ca_nl_drivers`, `ca_business_number`, `ca_gst_hst`, `ca_transit_number`, `ca_bank_account`, `session_id`, `pin_block`, `biometric_id`, `card_expiry`, `card_track`, `micr`, `financial_amount`, `date_iso`, `iccid`, `edu_email`, `employee_id`, `gps_coords`, `insurance_policy`, `bank_ref`, `legal_case`, `loan_number`, `ndc_code`, `dob`, `postal_code`, `masked_pan`, `parcel_number`, `aml_case_id`, `isin`, `twitter_handle`, `url_with_creds`, `vin`, `fedwire_imad`

**Heuristic** — formats where detection relies on fixed prefixes, high-entropy pattern matching, or loosely defined structure. DLP rules for these categories vary widely between scanners and configurations, and a "fail" result may simply reflect that the scanner never had a strong rule for that specific format variant — not that a real exfiltration path was found.

Categories: `aws_key`, `jwt`, `github_token`, `stripe_key`, `slack_token`, `classification`, `corp_classification`, `mnpi`, `cardholder_name`, `privacy_label`, `attorney_client`, `supervisory_info`

Heuristic categories are excluded from the default scan to avoid misleading results. Include them with:

```bash
evadex scan --tool dlpscan-cli --include-heuristic
```

A warning is printed to stderr whenever `--include-heuristic` is active, reminding you to interpret those results with caution.
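
The default filtering described above amounts to a set-membership check on the category. The sketch below is a hypothetical model of that behaviour: the `Payload` class, `select_payloads` function, and warning text are invented for this example, while the category names come from the heuristic list in this section.

```python
import sys
from dataclasses import dataclass

# Heuristic category names as listed in this section.
HEURISTIC_CATEGORIES = frozenset({
    "aws_key", "jwt", "github_token", "stripe_key", "slack_token",
    "classification", "corp_classification", "mnpi", "cardholder_name",
    "privacy_label", "attorney_client", "supervisory_info",
})


@dataclass(frozen=True)
class Payload:
    """Hypothetical payload record: label, raw value, and category name."""
    label: str
    value: str
    category: str

    @property
    def is_heuristic(self) -> bool:
        return self.category in HEURISTIC_CATEGORIES


def select_payloads(payloads: list[Payload], include_heuristic: bool = False) -> list[Payload]:
    """Drop heuristic payloads unless explicitly requested; warn when they are kept."""
    if include_heuristic:
        print("warning: heuristic results should be interpreted with caution",
              file=sys.stderr)
        return list(payloads)
    return [p for p in payloads if not p.is_heuristic]


payloads = [
    Payload("Visa 16-digit", "4532015112830366", "credit_card"),
    Payload("AWS Access Key ID", "AKIAIOSFODNN7EXAMPLE", "aws_key"),
]
print([p.label for p in select_payloads(payloads)])
```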

---

## Installation

Requires Python 3.10+.

```bash
pip install evadex
```

Or install from source:

```bash
git clone https://github.com/tbustenk/evadex
cd evadex
pip install -e ".[dev]"
```

Optional extras:

```bash
# Barcode / QR image generation for scanners that decode images (Siphon, etc.)
pip install evadex[barcodes]

# Parquet generation (pyarrow). SQLite is stdlib — no extra needed.
# Pair with a scanner built with data-format extractors, e.g. Siphon
# compiled with `--features data-formats`.
pip install evadex[data-formats]

# 7-Zip archive generation (py7zr). ZIP / nested ZIP / mbox / ics / warc
# all use stdlib only — no extra needed for those. Pair 7z with a scanner
# built with archive extractors, e.g. Siphon compiled with `--features archives`.
pip install evadex[archives]
```

For reproducible installs with pinned, hash-verified dependencies (recommended for regulated environments):

```bash
pip install -r requirements.txt        # runtime only
pip install -r requirements-dev.txt    # runtime + test dependencies
```

These lockfiles are generated with `pip-compile --generate-hashes` and updated with each release.

---

## Quick start

By default evadex runs the **banking tier** — ~80 payloads optimised for Canadian banking and RBC's compliance surface. No flags required:

```bash
evadex scan --tool dlpscan-cli --strategy text
```

### Tiers

| Tier | Payloads | Est. runtime (text strategy) | When to use |
|---|---|---|---|
| **`banking`** *(default)* | ~80 Canadian banking focused | ~5 min | Daily checks, quick patches, RBC production testing |
| `core` | ~150 broader PII and financial | ~10 min | Weekly benchmarks, broader compliance checks |
| `regional` | ~350 international coverage | ~20 min | Pre-release validation, international deployments |
| `full` | All 554 payloads | ~30–40 min | Major releases, compliance audits, onboarding new scanners |

```bash
# default — banking tier
evadex scan --tool dlpscan-cli --strategy text

# explicit tiers
evadex scan --tool dlpscan-cli --strategy text --tier core
evadex scan --tool dlpscan-cli --strategy text --tier regional
evadex scan --tool dlpscan-cli --strategy text --tier full
```

Test a single value:

```bash
evadex scan --tool dlpscan-cli --input "4532015112830366" --strategy text
```

Test with all four strategies (slower, because omitting `--strategy` also exercises DOCX/PDF/XLSX extraction):

```bash
evadex scan --tool dlpscan-cli --input "4532015112830366"
```

Generate an HTML report:

```bash
evadex scan --tool dlpscan-cli --strategy text --format html -o report.html
```

---

## Configuration

evadex supports an optional `evadex.yaml` config file. Config file values are defaults — any CLI flag you pass overrides the corresponding config value.

### Generating a starter config

```bash
evadex init
```

Creates `evadex.yaml` in the current directory:

```yaml
# evadex configuration file
# Run 'evadex scan --config evadex.yaml' to use this file.
# CLI flags take precedence over values in this file.

tool: dlpscan-cli
strategy: text
min_detection_rate: 85
scanner_label: production
exe: null
cmd_style: python
# tier: banking   # banking (default) | core | regional | full
# categories:     # explicit category list — overrides tier when set
#   - credit_card
#   - ssn
#   - iban
include_heuristic: false
concurrency: 20
timeout: 30.0
output: results.json
format: json
```

### Using a config file

Pass it explicitly:

```bash
evadex scan --config evadex.yaml
```

Or drop `evadex.yaml` in the current directory and evadex will pick it up automatically — no flag needed.

CLI flags always win. To override a config value for one run:

```bash
# Config says scanner_label: production — this run uses "staging" instead
evadex scan --config evadex.yaml --scanner-label staging
```
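
The precedence rule can be pictured as a simple dictionary merge. The sketch below is illustrative only: `resolve_options` is a hypothetical helper, and using `None` to mean "flag not passed" is an assumption rather than a description of evadex's internals.

```python
from typing import Any


def resolve_options(config: dict[str, Any], cli: dict[str, Any]) -> dict[str, Any]:
    """Merge config-file defaults with CLI flags; CLI values win.

    Hypothetical helper: a CLI value of None means "flag not passed".
    """
    merged = dict(config)
    for key, value in cli.items():
        if value is not None:
            merged[key] = value
    return merged


config = {"tool": "dlpscan-cli", "scanner_label": "production", "strategy": "text"}
cli = {"scanner_label": "staging", "strategy": None}
options = resolve_options(config, cli)
print(options["scanner_label"])  # staging
print(options["strategy"])       # text
```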

### Config keys

| Key | Type | CLI equivalent | Description |
|---|---|---|---|
| `tool` | string | `--tool` | Adapter name (`dlpscan-cli`, `dlpscan`, `siphon`, `presidio`) |
| `strategy` | string or list | `--strategy` | Submission strategy: `text`, `docx`, `pdf`, `xlsx`. Use a list for multiple. |
| `min_detection_rate` | number | `--min-detection-rate` | CI/CD gate threshold (0–100) |
| `scanner_label` | string | `--scanner-label` | Label recorded in JSON `meta.scanner` |
| `exe` | string or null | `--exe` | Path to scanner executable |
| `cmd_style` | `python` or `rust` | `--cmd-style` | Command format for dlpscan-cli |
| `tier` | string | `--tier` | Payload tier: `banking` (default), `core`, `regional`, `full`. Ignored when `categories` is set. |
| `categories` | list of strings | `--category` | Payload categories to test (overrides `tier`) |
| `include_heuristic` | boolean | `--include-heuristic` | Include heuristic categories |
| `concurrency` | integer | `--concurrency` | Max concurrent requests (default: 20) |
| `timeout` | number | `--timeout` | Request timeout in seconds |
| `output` | string or null | `--output` | Output file path (null = stdout) |
| `format` | `json` or `html` | `--format` | Output format |
| `audit_log` | string or null | `--audit-log` | Append-only audit log file (see [Audit log](#audit-log)) |
| `c2_url` | string or null | `--c2-url` | Siphon-C2 admin-dashboard URL to push results to. See [Siphon-C2 integration](#siphon-c2-integration). |
| `c2_key` | string or null | `--c2-key` | API key sent as `x-api-key` to Siphon-C2. Same format as Siphon's core API key. |

### Validation

evadex validates the config file on load and exits with a clear error for invalid values:

```
Error: Config 'min_detection_rate' must be between 0 and 100, got: 150.0
Error: Invalid strategy value(s): foobar. Valid: docx, pdf, text, xlsx
Error: Unknown config key(s): bad_key. Valid keys: categories, cmd_style, ...
```
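
As a rough illustration of the checks behind those messages, a validator could look like the sketch below. It is not evadex's implementation: `validate_config` is a hypothetical function, and `VALID_KEYS` is copied from the config keys table above, so it may be incomplete.

```python
VALID_STRATEGIES = {"text", "docx", "pdf", "xlsx"}
# Assumption: key set taken from the "Config keys" table; the real list may differ.
VALID_KEYS = {
    "tool", "strategy", "min_detection_rate", "scanner_label", "exe",
    "cmd_style", "tier", "categories", "include_heuristic", "concurrency",
    "timeout", "output", "format", "audit_log", "c2_url", "c2_key",
}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    unknown = sorted(set(config) - VALID_KEYS)
    if unknown:
        errors.append(f"Unknown config key(s): {', '.join(unknown)}")
    rate = config.get("min_detection_rate")
    if rate is not None and not 0 <= rate <= 100:
        errors.append(
            f"Config 'min_detection_rate' must be between 0 and 100, got: {rate}"
        )
    strategy = config.get("strategy")
    if strategy is not None:
        # The key accepts either a single string or a list of strings.
        values = [strategy] if isinstance(strategy, str) else list(strategy)
        bad = sorted(set(values) - VALID_STRATEGIES)
        if bad:
            errors.append(
                f"Invalid strategy value(s): {', '.join(bad)}. "
                f"Valid: {', '.join(sorted(VALID_STRATEGIES))}"
            )
    return errors


problems = validate_config(
    {"min_detection_rate": 150.0, "strategy": "foobar", "bad_key": 1}
)
for problem in problems:
    print("Error:", problem)
```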

---

## Example output

### Terminal summary

```
Running evadex scan against dlpscan-cli at http://localhost:8080...
Done. 590 tests — N detected, N evaded
```

Detection rates depend on your scanner, its version, and how it's configured.

### JSON output (`--format json`, default)

```json
{
  "meta": {
    "timestamp": "2026-04-01T22:01:36.172424+00:00",
    "scanner": "rust-2.0.0",
    "total": 590,
    "pass": 514,
    "fail": 76,
    "error": 0,
    "pass_rate": 87.1,
    "summary_by_category": {
      "credit_card": { "pass": 109, "fail": 15, "error": 0 },
      "ssn":         { "pass": 43,  "fail": 10, "error": 0 },
      "iban":        { "pass": 36,  "fail": 8,  "error": 0 }
    },
    "summary_by_generator": {
      "delimiter":        { "pass": 72, "fail": 10, "error": 0 },
      "unicode_encoding": { "pass": 54, "fail": 13, "error": 0 }
    }
  },
  "results": [
    {
      "payload": {
        "value": "5105105105105100",
        "category": "credit_card",
        "category_type": "structured",
        "label": "Mastercard 16-digit"
      },
      "variant": {
        "value": "5105105105105100",
        "generator": "delimiter",
        "technique": "no_delimiter",
        "transform_name": "All delimiters removed",
        "strategy": "text"
      },
      "detected": true,
      "severity": "pass",
      "duration_ms": 371.01,
      "error": null,
      "raw_response": { "matches": [{ "type": "credit_card", "value": "5105105105105100" }] }
    },
    {
      "payload": {
        "value": "046 454 286",
        "category": "sin",
        "category_type": "structured",
        "label": "Canada SIN"
      },
      "variant": {
        "value": "Ο4б 4Ƽ4 ΚȢб",
        "generator": "unicode_encoding",
        "technique": "homoglyph_substitution",
        "transform_name": "Visually similar Cyrillic/Greek characters substituted",
        "strategy": "text"
      },
      "detected": false,
      "severity": "fail",
      "duration_ms": 378.57,
      "error": null,
      "raw_response": { "matches": [] }
    }
  ]
}
```

**Severity values:**

| Value | Meaning |
|---|---|
| `pass` | Scanner detected the variant (good) |
| `fail` | Scanner missed the variant — evasion succeeded |
| `error` | Adapter error (network, timeout, malformed scanner response, etc.) |
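
The JSON report is straightforward to post-process. The following sketch recomputes a detection rate from the `results` array, counts evasions per technique, and mimics a threshold gate. The function names are hypothetical, and counting `error` results in the denominator is an assumption; the report's own `pass_rate` may be computed differently.

```python
from collections import Counter


def detection_rate(report: dict) -> float:
    """Percentage of results whose severity is "pass" (errors count against it)."""
    results = report["results"]
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["severity"] == "pass")
    return round(100 * passed / len(results), 1)


def evasions_by_technique(report: dict) -> Counter:
    """Count results with severity "fail", grouped by variant technique."""
    return Counter(
        r["variant"]["technique"]
        for r in report["results"]
        if r["severity"] == "fail"
    )


def gate(report: dict, threshold: float) -> int:
    """Return a process exit code: 1 when the rate is below the threshold."""
    return 1 if detection_rate(report) < threshold else 0


# Trimmed-down stand-in for a real report; only the fields used above are kept.
report = {
    "results": [
        {"variant": {"technique": "no_delimiter"}, "severity": "pass"},
        {"variant": {"technique": "homoglyph_substitution"}, "severity": "fail"},
    ]
}
print(detection_rate(report))                       # 50.0
print(evasions_by_technique(report).most_common())
print(gate(report, threshold=85))                   # 1
```

In practice `report` would come from `json.load` on a file written with `--output`.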

---

## CLI reference

### `evadex scan`

Run DLP evasion tests against a scanner.

```
evadex scan [OPTIONS]
```

| Flag | Default | Description |
|---|---|---|
| `--config` | *(auto-discovered)* | Path to `evadex.yaml` config file. Auto-discovered from current directory if present. CLI flags always override config values. |
| `--tool`, `-t` | `dlpscan-cli` | Adapter to use. Built-in adapters: `dlpscan-cli`, `dlpscan`, `siphon`, `presidio`. |
| `--input`, `-i` | *(banking tier)* | Single value to test; its category is auto-detected (Luhn check, regex patterns for SSN/IBAN/AWS/JWT/email/phone). If omitted, evadex runs the built-in payloads for the selected tier (the banking tier by default, ~80 payloads; use `--tier` to change). |
| `--format`, `-f` | `json` | Output format: `json` or `html` |
| `--output`, `-o` | stdout | Write report to file instead of stdout |
| `--strategy` | all four | Submission strategy: `text`, `docx`, `pdf`, `xlsx`. Repeat the flag for multiple. Omit to run all four. |
| `--tier` | `banking` | Payload tier: `banking` (default), `core`, `regional`, `full`. Ignored when `--category` is specified. |
| `--concurrency` | `20` | Max concurrent requests |
| `--timeout` | `30.0` | Request timeout in seconds |
| `--url` | `http://localhost:8080` | Base URL (for HTTP-based adapters: `dlpscan`, `siphon`, `presidio`) |
| `--api-key` | *(env: `EVADEX_API_KEY`)* | API key passed as `Authorization: Bearer`. Use the environment variable in preference to the CLI flag to avoid exposure in shell history and process listings. |
| `--category` | *(overrides --tier)* | Filter built-in payloads by category. Repeat for multiple. When set, `--tier` is ignored. |
| `--variant-group` | *(all)* | Limit to specific generator(s). Repeat for multiple. Values: `unicode_encoding`, `delimiter`, `splitting`, `leetspeak`, `regional_digits`, `structural`, `encoding`, `context_injection`, `unicode_whitespace`, `bidirectional`, `soft_hyphen`, `morse_code` |
| `--include-heuristic` | off | Also run heuristic categories (for example `aws_key`, `jwt`, `github_token`, `stripe_key`, `slack_token`, `classification`). A warning is printed when enabled; see [Structured vs heuristic categories](#structured-vs-heuristic-categories). |
| `--scanner-label` | *(empty)* | Label recorded in the JSON `meta.scanner` field. Use to tag a specific scanner version, e.g. `python-1.3.0` or `rust-2.0.0`. Useful when comparing results across scanner builds. |
| `--exe` | `dlpscan` | Path to the scanner executable (dlpscan-cli adapter only). Use when `dlpscan` is not on `PATH` or you need to target a specific build. |
| `--cmd-style` | `python` | Command format for dlpscan-cli: `python` (invokes `dlpscan -f json <file>`) or `rust` (invokes `dlpscan --format json scan <file>`). |
| `--min-detection-rate` | *(off)* | Exit with code 1 if the detection rate falls below this threshold (0–100). Intended for CI/CD pipeline gating. Report is always written before the exit. |
| `--baseline` | *(off)* | Save this run's JSON results to a file for future comparison. |
| `--compare-baseline` | *(off)* | Compare this run against a previously saved baseline and print a regression summary to stderr. |
| `--audit-log` | *(off)* | Append a one-line JSON audit record for this run to a file. Parent directories are created if they do not exist. Can also be set via `audit_log` in `evadex.yaml`. |
| `--feedback-report` | *(off)* | Save a structured JSON feedback report to PATH. Contains per-technique evasion counts with example variant values, actionable fix suggestions, and the generated regression test code as a string field. Always written when specified, even if there are no evasions. |
| `--require-context` | off | Pass `--require-context` to dlpscan-rs: only flag matches when surrounding keywords are present. Reduces false positives but may reduce detection rate for obfuscated variants lacking keyword context. Requires `--cmd-style rust`. Can also be set via `require_context` in `evadex.yaml`. |
| `--wrap-context` | **auto (rust)** | Embed every variant value in a realistic keyword sentence before submission. **Automatically enabled when `--cmd-style rust` is used** — dlpscan-rs requires surrounding context keywords to flag most matches; submitting a bare value produces artificially low detection rates. Pass `--no-wrap-context` to suppress. Can also be set via `wrap_context` in `evadex.yaml`. |
| `--no-wrap-context` | off | Explicitly disable context wrapping even when `--cmd-style rust` is active. |

> **Note for dlpscan-rs users:** dlpscan-rs requires surrounding context keywords (e.g. "credit card", "SSN", "IBAN") to be present near a matched value before it will flag it. Submitting a bare value like `4532015112830366` without context will produce a false *no-match* — the scanner can see the number but will not fire without contextual evidence. `evadex scan` auto-enables `--wrap-context` when `--cmd-style rust` is used so every variant is embedded in a realistic business sentence. Use `--no-wrap-context` only if you are deliberately testing bare-value behaviour or your dlpscan-rs build is configured with context matching disabled.
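
To make the idea of context wrapping concrete, here is a minimal sketch. The sentence templates and the `wrap_context` function are invented for illustration; they are not the sentences evadex actually emits.

```python
# Hypothetical templates: each puts a category keyword next to the value.
CONTEXT_TEMPLATES = {
    "credit_card": "Please charge the credit card {value} for the outstanding balance.",
    "ssn": "The employee's SSN on file is {value}.",
    "iban": "Wire the funds to IBAN {value} by Friday.",
}
DEFAULT_TEMPLATE = "The reference value on record is {value}."


def wrap_context(value: str, category: str) -> str:
    """Embed a (possibly obfuscated) value in a keyword-bearing sentence."""
    template = CONTEXT_TEMPLATES.get(category, DEFAULT_TEMPLATE)
    return template.format(value=value)


wrapped = wrap_context("4532015112830366", "credit_card")
print(wrapped)
```

A scanner that requires contextual evidence sees the keyword "credit card" next to the number, whereas the bare value alone would be reported as a miss.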

### `evadex generate`

Generate test documents filled with synthetic sensitive data for DLP scanner testing. Values are embedded in realistic business sentences, tables, and paragraphs. Evasion variants use the same obfuscation techniques as `evadex scan`.

```
evadex generate (--format FORMAT | --formats FMT,FMT,...) --output PATH [OPTIONS]
```

| Flag | Default | Description |
|---|---|---|
| `--format` | *(one of format/formats required)* | Single output format: `xlsx`, `docx`, `pdf`, `csv`, `txt`, `eml`, `msg`, `json`, `xml`, `sql`, `log`, `png`, `jpg`, `multi_barcode_png`, `edm_json`, `parquet`, `sqlite`, `zip`, `zip_nested`, `7z`, `mbox`, `ics`, `warc` |
| `--formats` | *(one of format/formats required)* | Comma-separated list of formats. Output is a path stem; extensions are appended. `--formats xlsx,docx,pdf --output dir/test` → `test.xlsx`, `test.docx`, `test.pdf` |
| `--barcode-type` | `qr` | Barcode encoding for `png`/`jpg`/`multi_barcode_png`: `qr` (unicode, up to 4296 chars), `code128` (ASCII 1D), `ean13` (13 digits, zero-padded), `pdf417` (2D, requires optional `pdf417gen`), `datamatrix` (2D, requires optional `pylibdmtx`), or `random`. |
| `--output` | *(required)* | Output file path (with `--format`) or path stem (with `--formats`) |
| `--tier` | `banking` | Payload tier when `--category` is not set: `banking` (default), `core`, `regional`, `full` |
| `--category` | *(overrides --tier)* | Payload category to include. Repeat for multiple. |
| `--count` | `100` | Number of test values to generate **per category** |
| `--evasion-rate` | `0.5` | Fraction of values that are evasion variants (0.0–1.0) |
| `--keyword-rate` | `0.5` | Fraction of values wrapped in keyword context sentences (0.0–1.0) |
| `--technique` | *(all)* | Limit evasion variants to specific technique names. Repeat for multiple. |
| `--language` | `en` | Language for keyword context sentences: `en` (English) or `fr-CA` (Canadian French) |
| `--random` | off | Randomise categories, evasion rate, and keyword rate |
| `--seed` | *(none)* | Integer seed for reproducible output |
| `--include-heuristic` | off | Also include heuristic categories (AWS keys, tokens, JWT, etc.) |
| `--count-per-category` | *(uses --count)* | Override count for a specific category. Repeat for multiple. Example: `--count-per-category credit_card:200 --count-per-category sin:50` |
| `--total` | *(off)* | Generate exactly N records distributed evenly across selected categories. Example: `--total 1000` |
| `--density` | `medium` | How frequently sensitive values appear in filler text: `low` (one per paragraph), `medium` (one per 2-3 sentences), `high` (almost every sentence) |
| `--technique-group` | *(all)* | Limit evasion variants to a specific generator family. Repeat for multiple. Example: `--technique-group unicode_encoding` |
| `--technique-mix` | *(off)* | Exact proportion per technique group, comma-separated. Proportions must sum to 1.0. Example: `--technique-mix unicode_encoding:0.4,encoding:0.3,splitting:0.3` |
| `--evasion-per-category` | *(uses --evasion-rate)* | Override evasion rate for a specific category. Repeat for multiple. Example: `--evasion-per-category credit_card:0.7 --evasion-per-category sin:0.2` |
| `--template` | `generic` | Document template controlling structure and tone: `generic`, `invoice`, `statement`, `hr_record`, `audit_report`, `source_code`, `config_file`, `chat_log`, `medical_record`, `env_file`, `secrets_file`, `code_with_secrets` (entropy-focused: `.env` / YAML secrets / bare-value source code — pair with entropy categories) |
| `--noise-level` | `medium` | Ratio of filler text to sensitive values: `low` (mostly values), `medium` (balanced), `high` (lots of business text) |

**Format details:**

- **`xlsx`** — Multiple sheets: one `Summary` sheet plus one sheet per category. Columns include embedded text, plain value, variant value, technique, and generator. Evasion rows are highlighted yellow.
- **`docx`** — Title page with disclaimer; one heading per category; two-thirds prose paragraphs, one-third tabular layout. Supports `--template` for alternate document structures.
- **`pdf`** — Sections per category with header/footer; evasion rows highlighted.
- **`csv`** — Flat CSV with columns: `category`, `plain_value`, `variant_value`, `technique`, `generator`, `transform_name`, `has_keywords`, `embedded_text`.
- **`txt`** — Plain-text document with section headings and numbered entry list. Supports `--template` for alternate document structures.
- **`eml`** — RFC 2822 email file with From/To/Subject headers, realistic names, and sensitive values in the body. Example: `"Please find attached the statement for card 4532015112830366"`.
- **`msg`** — Outlook message format. Currently generates EML-format content with `.msg` extension (DLP scanners extract text identically).
- **`json`** — Structured JSON data export. Array of records with realistic field names: `customer_id`, `card_number`, `name`, `email`, plus filler fields. Pretty-printed.
- **`xml`** — Financial messaging format resembling ISO 20022 (pain.001) payment messages. Sensitive values in appropriate XML elements (`<IBAN>`, `<BIC>`, `<Ustrd>`).
- **`sql`** — Database dump format with `CREATE TABLE` and `INSERT INTO` statements. Example: `INSERT INTO customers (id, name, sin, card_number) VALUES (1, 'John Smith', '046 454 286', '4532015112830366');`.
- **`log`** — Application log format with timestamps, log levels, and services. Mixes plaintext, structured, and JSON log formats.
- **`png` / `jpg`** — Image grid of barcodes/QR codes, one per entry. Targets scanners that extract text from images via barcode decoding (e.g. Siphon's `extract_barcode` pipeline, which decodes QR, Data Matrix, PDF417, Code 128, EAN-13, etc.). Capped at 60 barcodes per image for decompression-bomb safety and to stay under Siphon's 100-codes-per-image decode cap. Value is rendered with a quiet zone plus a human-readable label. *Requires `pip install evadex[barcodes]`.*
- **`multi_barcode_png`** — PNG styled like a scanned form with a header bar, body text, and a mixed grid of QR, Code 128, and EAN-13 codes carrying different sensitive values. Exercises multi-format decoding in one pass. *Requires `pip install evadex[barcodes]`.*
- **`edm_json`** — flat JSON file (`{"values": [{"value", "category", "label"}, …]}`) matching the shape of Siphon's `POST /v1/edm/register` request body. Use for bulk EDM registration — see [EDM testing](#edm-testing).
- **`parquet`** — Apache Parquet with a flat customer/banking schema (`customer_id`, `name`, `email`, `phone`, `sin`, `card_number`, `iban`, `swift_bic`, …) and snappy compression. Each sensitive payload lands in its category-appropriate column; remaining columns are filled with realistic fake data. Written in 1000-row row groups so large files exercise multi-group Parquet readers. Targets scanners with Parquet extractors (e.g. Siphon built with `--features data-formats`, which reads the first 10,000 rows). *Requires `pip install evadex[data-formats]`.*
- **`sqlite`** — SQLite database with three realistic banking tables (`customers`, `transactions`, `accounts`). Payloads route to whichever table owns their category. Uses Python's stdlib `sqlite3` so no extra install is needed on evadex's side — the scanner still needs its own SQLite support (Siphon's `extract_sqlite` requires the `data-formats` feature and reads up to 5,000 rows per table).
- **`zip`** — ZIP archive containing 4–12 inner files (`customer_data.csv`, `transactions_q1.csv`, `audit_log.txt`, `config.json`, …) with sensitive payloads spread across them, plus a `manifest.xml` index. Stdlib `zipfile`. **Note:** Siphon's plain-ZIP extractor in `crates/siphon-core/src/extractors.rs` only walks `*.xml` entries — text inside non-OOXML ZIPs is currently not extracted, so detection on this format mostly serves to document the gap. Use `7z` (below) when you need detection to actually fire end to end.
- **`zip_nested`** — ZIP-inside-ZIP-inside-ZIP, three levels deep, with sensitive data only in the innermost archive. Tests recursive-archive extraction (which Siphon does not currently perform). Stdlib `zipfile`.
- **`7z`** — 7-Zip / LZMA2 archive with the same banking-filename inner structure as `zip`. Siphon's `extract_7z` *does* read txt/csv/json content (1 MB per file, 100 KB content cap), so this is the right choice when you want detection to fire on the archive contents. *Requires `pip install evadex[archives]`.*
- **`mbox`** — Unix mailbox file with one realistic email per entry (sensible From/To/Subject/Date headers, banking-domain prose). Roughly one in three messages uses `Content-Transfer-Encoding: base64` so Siphon's `extract_mbox` decode path gets exercised. Stdlib `mailbox` / `email`.
- **`ics`** — iCalendar (RFC 5545) file with one VEVENT per entry. Sensitive payloads land in `SUMMARY`, `DESCRIPTION`, and `ATTENDEE` properties — exactly what Siphon's `extract_ics` walks. CRLF-terminated and 75-octet line-folded so any conformant calendar parser will read it.
- **`warc`** — Web ARChive (ISO 28500 / WARC 1.1) with one `warcinfo` record plus one HTTP-`response` record per entry. Sensitive values are embedded in synthetic HTML banking-portal bodies inside the captured responses — exercises Siphon's `extract_warc`.
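
The 75-octet line folding mentioned for `ics` follows RFC 5545: a long content line is split by inserting CRLF followed by a single space. The sketch below shows one way to do that without cutting a multi-byte UTF-8 character in half. `fold_ics_line` is an illustrative function, not evadex's writer.

```python
def fold_ics_line(line: str, limit: int = 75) -> str:
    """Fold one iCalendar content line so no physical line exceeds `limit` octets."""
    data = line.encode("utf-8")
    chunks = []
    start = 0
    room = limit  # the first physical line may use the full limit
    while len(data) - start > room:
        end = start + room
        # Back up while pointing at a UTF-8 continuation byte (0b10xxxxxx).
        while (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
        room = limit - 1  # continuation lines spend one octet on the leading space
    chunks.append(data[start:])
    return b"\r\n ".join(chunks).decode("utf-8")


line = "DESCRIPTION:" + "x" * 100
folded = fold_ics_line(line)
print(folded)
```

Unfolding is the reverse operation: removing every CRLF-plus-space sequence restores the original line.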

**Template details:**

- **`generic`** (default) — Mixed prose and table format (existing behaviour).
- **`invoice`** — Payment invoice layout with line items, amounts, HST, and totals.
- **`statement`** — Bank statement with account details, transaction history, and balance.
- **`hr_record`** — HR employee records with personal information fields grouped per employee.
- **`audit_report`** — Internal audit report with executive summary, detailed findings (severity-rated), and recommendations.
- **`source_code`** — Realistic source code with sensitive values as hardcoded strings, variable assignments, and comments. Mixes Python, JavaScript, and generic syntax.
- **`config_file`** — Application config (randomly INI, YAML, or ENV format) with sensitive values as configuration parameters.
- **`chat_log`** — Messaging/chat export with timestamps, participant names, and sensitive values shared in conversation.
- **`medical_record`** — Clinical notes and patient records with MRN, DOB, diagnoses, medications, and sensitive identifiers.

**Examples:**

```bash
# Banking tier (default) — all three formats in one pass
evadex generate --formats xlsx,docx,pdf --tier banking --count 100 \
  --evasion-rate 0.3 --output reports/banking_en

# Canadian French — banking tier
evadex generate --formats xlsx,docx,pdf --tier banking --count 100 \
  --evasion-rate 0.3 --language fr-CA --output reports/banking_frca

# Large CSV for bulk testing
evadex generate --format csv --tier banking --count 500 \
  --evasion-rate 0.5 --output reports/banking_large.csv

# 100 credit cards, 40% evasion variants → XLSX
evadex generate --format xlsx --category credit_card --count 100 \
  --evasion-rate 0.4 --output test_cards.xlsx

# Mixed categories → DOCX
evadex generate --format docx \
  --category credit_card --category ssn --category iban \
  --count 50 --evasion-rate 0.5 --output test_mixed.docx

# Reproducible random document
evadex generate --format xlsx --random --count 500 --seed 42 --output random.xlsx

# CSV for programmatic inspection
evadex generate --format csv --category ssn --count 1000 \
  --evasion-rate 0.3 --output ssn_variants.csv

# New formats — email, JSON, XML, SQL, log
evadex generate --format eml --tier banking --count 50 --output test_email.eml
evadex generate --format json --tier banking --count 200 --output export.json
evadex generate --format xml --category iban --category credit_card --count 100 --output payments.xml
evadex generate --format sql --tier banking --count 500 --output dump.sql
evadex generate --format log --tier banking --count 1000 --output app.log
evadex generate --formats eml,json,xml,sql,log --tier banking --output reports/multi

# Parquet / SQLite (Parquet requires: pip install evadex[data-formats]; SQLite is stdlib)
evadex generate --format parquet --tier banking --count 1000 --evasion-rate 0.3 --output test.parquet
evadex generate --format sqlite  --tier banking --count 1000 --evasion-rate 0.3 --output test.db
# French-Canadian column/table names
evadex generate --format parquet --tier banking --count 100 --language fr-CA --output test_frca.parquet
evadex generate --format sqlite  --tier banking --count 100 --language fr-CA --output test_frca.db

# Archive and message formats (zip / mbox / ics / warc are stdlib;
# 7z requires: pip install evadex[archives])
evadex generate --format zip        --tier banking --count 100 --output test_output/test.zip
evadex generate --format zip_nested --tier banking --count 50  --output test_output/test_nested.zip
evadex generate --format 7z         --tier banking --count 100 --output test_output/test.7z
evadex generate --format mbox       --tier banking --count 50  --output test_output/test.mbox
evadex generate --format ics        --tier banking --count 30  --output test_output/test.ics
evadex generate --format warc       --tier banking --count 20  --output test_output/test.warc
# Archive evasion variants (password, double-extension, deep nesting, mixed formats)
evadex generate --format zip --category credit_card --count 20 --evasion-rate 1.0 \
  --technique-group archive_evasion --output archive_evasion.zip

# Barcode / QR code images (requires: pip install evadex[barcodes])
evadex generate --format png --category credit_card --count 10 --barcode-type qr       --output qr.png
evadex generate --format png --category credit_card --count 10 --barcode-type code128  --output code128.png
evadex generate --format png --category credit_card --count 12 --barcode-type ean13    --output ean13.png
evadex generate --format jpg --category credit_card --count 10 --evasion-rate 0.3      --output cards.jpg
evadex generate --format multi_barcode_png --category credit_card --category ssn --count 10 --output form.png
# Barcode image evasions (split across two codes, noise overlay, rotation, embed-in-document)
evadex generate --format png --category credit_card --count 4 --evasion-rate 1.0 \
  --technique-group barcode_evasion --output evasion.png

# Per-category count overrides
evadex generate --format xlsx --tier banking --count 100 \
  --count-per-category credit_card:500 --count-per-category sin:50 --output overrides.xlsx

# Total record distribution
evadex generate --format json --tier banking --total 1000 --output distributed.json

# Evasion technique control
evadex generate --format xlsx --tier banking --evasion-rate 0.5 \
  --technique-group unicode_encoding --output unicode_only.xlsx
evadex generate --format xlsx --tier banking --evasion-rate 0.5 \
  --technique-mix unicode_encoding:0.4,encoding:0.3,splitting:0.3 --output mixed.xlsx

# Per-category evasion rates
evadex generate --format csv --tier banking --evasion-rate 0.3 \
  --evasion-per-category credit_card:0.9 --evasion-per-category sin:0.1 --output targeted.csv

# Document templates
evadex generate --format docx --tier banking --template statement --count 100 --output statement.docx
evadex generate --format txt --tier banking --template invoice --output invoice.txt
evadex generate --format txt --category credit_card --template source_code --count 50 --output leaked_code.txt
evadex generate --format txt --category credit_card --template chat_log --count 20 --output chat_export.txt
evadex generate --format txt --tier banking --template audit_report --noise-level high --output audit.txt

# Density and noise control
evadex generate --format docx --tier banking --density high --count 100 --output dense.docx
evadex generate --format pdf --tier banking --noise-level high --count 100 --output noisy.pdf
```
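
For `--total`, "distributed evenly" can be read as integer division with the remainder spread over the first few categories. That reading is an assumption; the sketch below (with the hypothetical helper `distribute_total`) only shows the arithmetic, not how evadex actually assigns leftover records.

```python
def distribute_total(total: int, categories: list[str]) -> dict[str, int]:
    """Split `total` records across categories as evenly as possible."""
    if not categories:
        raise ValueError("at least one category is required")
    base, extra = divmod(total, len(categories))
    # The first `extra` categories each receive one additional record.
    return {
        category: base + (1 if index < extra else 0)
        for index, category in enumerate(categories)
    }


counts = distribute_total(1000, ["credit_card", "sin", "iban"])
print(counts)  # {'credit_card': 334, 'sin': 333, 'iban': 333}
```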

**Value generation:**

evadex produces base values in one of two ways (synthetic generators or seed rotation) and then applies evasion variants on top:

- **Synthetic generators** (preferred, unlimited) — Produce structurally valid values algorithmically, so `--count 1000` always returns 1000 distinct values. Registered for:
  - `credit_card` — Luhn-valid numbers for Visa, Mastercard, Amex, Discover
  - `sin` — Valid Canadian SINs (Luhn checksum, NNN NNN NNN format)
  - `iban` — Valid IBANs for GB, DE, and FR (ISO 13616 mod-97 checksum)
  - `phone` — Canadian E.164 numbers (`+1-NPA-NXX-XXXX`) from real area codes
  - `email` — Realistic addresses with common Canadian and international domains
  - `ca_ramq` — Quebec RAMQ health card numbers (XXXX YYMM DDSS format)
  - `ca_mb_health`, `ca_sk_health` — 9-digit Manitoba/Saskatchewan health cards
  - `ca_ns_health` — Nova Scotia 10-digit health card (NNNN NNN NNN format)
  - `ca_nb_health`, `ca_nl_health` — 10-digit NB/NL health cards
  - `ca_pei_health` — 12-digit PEI health card
  - `ca_mb_drivers` — Manitoba licence (LL-NNN-NNN-NNN format)
  - `ca_sk_drivers` — Saskatchewan 8-digit licence
  - `ca_ns_drivers` — Nova Scotia licence (2 letters + 7 digits)
  - `ca_nb_drivers` — New Brunswick 7-digit licence
  - `ca_pei_drivers` — PEI 6-digit licence
  - `ca_nl_drivers` — Newfoundland licence (1 letter + 9 digits)
  - `ca_business_number` — Canadian Business Number (9 digits, CRA)
  - `ca_gst_hst` — GST/HST registration (9-digit BN + RT + 4 digits)
  - `ca_transit_number` — Transit/routing number (NNNNN-NNN format)
  - `ca_bank_account` — Bank account (7–12 random digits)
  - `ssn` — Valid US Social Security Numbers (`AAA-BB-CCCC`, no reserved area / group / serial blocks). *(v3.13.0)*
  - `uk_nin` — Valid UK National Insurance Numbers (`XX NNNNNN X`, HMRC-compliant prefix and suffix rules). *(v3.13.0)*
  - `br_cpf` — Valid Brazilian CPFs (`NNN.NNN.NNN-DD`, two-pass Receita Federal checksum, all-same-digit base rejected). *(v3.13.0)*
  - `au_medicare` — Valid Australian Medicare cards (`NNNN NNNNN N`, weighted check digit per Services Australia). *(v3.13.0)*
  - `de_tax_id` — Valid German Steuer-IdNr (11 digits, ISO 7064 MOD 11,10 check digit, exactly-twice duplicate-digit rule). *(v3.13.0)*
  - `us_dl` — US driver-licence numbers cycling through all 50 state + DC formats (shape only — most state DLs have no public checksum). *(v3.13.0)*
- **Seed rotation fallback** — Categories without a synthetic generator rotate through the built-in seed values.
- **Evasion variants** — Drawn from all 12 evadex generators (same techniques as `evadex scan`). Use `--technique` to restrict to specific techniques.
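
As an example of what "structurally valid" means for `credit_card`, the sketch below builds a 16-digit number whose last digit satisfies the Luhn checksum. The function names and the default `4` prefix are illustrative assumptions; evadex's own generators may choose prefixes and lengths differently.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number that lacks its final digit."""
    total = 0
    # Walk right to left; the digit adjacent to the check digit is doubled first.
    for index, char in enumerate(reversed(partial)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """True when the final digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def synthetic_card(rng: random.Random, prefix: str = "4", length: int = 16) -> str:
    """Random digits after the prefix, then a computed check digit."""
    body_length = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(body_length))
    return body + luhn_check_digit(body)


card = synthetic_card(random.Random(42))
print(card, luhn_valid(card))
```

Because the check digit is computed rather than looked up, any number of distinct values can be produced from a seeded random source.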

---

### `evadex compare`

Diff two evadex scan result JSON files and report what changed between them.

```
evadex compare [OPTIONS] FILE_A FILE_B
```

| Flag | Default | Description |
|---|---|---|
| `--format`, `-f` | `json` | Output format: `json` or `html` |
| `--output`, `-o` | stdout | Write report to file instead of stdout |
| `--label-a` | *(from JSON meta.scanner)* | Override the label for the first file |
| `--label-b` | *(from JSON meta.scanner)* | Override the label for the second file |

The compare report includes:
- Overall delta in detection rate (percentage points)
- Per-category detection rate changes
- Per-technique detection rate changes (only techniques where the rate changed)
- Per-variant diff list (variants where severity changed between the two runs)
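
The per-category part of such a diff can be approximated from the `meta.summary_by_category` blocks of two result files. This is a sketch under assumptions: `category_deltas` is a hypothetical function, errors are counted in the denominator, and only categories present in both files are compared.

```python
def category_rate(counts: dict) -> float:
    """Detection rate (0 to 100) from a {"pass", "fail", "error"} count dict."""
    total = counts["pass"] + counts["fail"] + counts["error"]
    return 100 * counts["pass"] / total if total else 0.0


def category_deltas(meta_a: dict, meta_b: dict) -> dict[str, float]:
    """Change in detection rate (percentage points, B minus A) per shared category."""
    cats_a = meta_a["summary_by_category"]
    cats_b = meta_b["summary_by_category"]
    return {
        name: round(category_rate(cats_b[name]) - category_rate(cats_a[name]), 1)
        for name in sorted(set(cats_a) & set(cats_b))
    }


meta_a = {"summary_by_category": {
    "credit_card": {"pass": 109, "fail": 15, "error": 0},
    "ssn": {"pass": 43, "fail": 10, "error": 0},
}}
meta_b = {"summary_by_category": {
    "credit_card": {"pass": 120, "fail": 4, "error": 0},
    "ssn": {"pass": 43, "fail": 10, "error": 0},
}}
deltas = category_deltas(meta_a, meta_b)
print(deltas)
```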

### `evadex init`

Generate a default `evadex.yaml` config file in the current directory.

```
evadex init
```

Creates `evadex.yaml` with sensible defaults. Edit the file and run `evadex scan --config evadex.yaml`, or drop it in the working directory for auto-discovery.

### `evadex falsepos`

Measure the scanner's false positive rate by submitting values that look like sensitive data but are provably invalid.

Generates structurally plausible but mathematically invalid values (Luhn-failing credit card numbers, SSNs with reserved area codes, SINs with wrong checksums, IBAN-shaped strings with invalid mod-97 checks, etc.) and submits them to the scanner. Any value the scanner flags is a false positive.
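
To illustrate one of these strategies, the sketch below takes a valid IBAN and shifts its two check digits by one so that the ISO 13616 mod-97 test fails. The function names are hypothetical, and the shift-by-one approach is an assumption about how an invalid value could be built, not necessarily what `evadex falsepos` does.

```python
def iban_remainder(iban: str) -> int:
    """Mod-97 remainder of an IBAN; a valid IBAN yields exactly 1."""
    compact = iban.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    # Letters map to 10..35 (A=10, ..., Z=35); digits map to themselves.
    digits = "".join(str(int(char, 36)) for char in rearranged)
    return int(digits) % 97


def iban_valid(iban: str) -> bool:
    return iban_remainder(iban) == 1


def break_iban(iban: str) -> str:
    """Return an IBAN-shaped string whose check digits are off by one."""
    compact = iban.replace(" ", "").upper()
    check = int(compact[2:4])
    # Valid check digits range from 02 to 98, so stay inside two digits.
    wrong = check + 1 if check < 98 else check - 1
    return f"{compact[:2]}{wrong:02d}{compact[4:]}"


valid = "GB82 WEST 1234 5698 7654 32"
broken = break_iban(valid)
print(broken, iban_valid(broken))
```

Shifting the check digits by one changes the remainder from 1 to 0 or 2, so the result keeps the shape of an IBAN but can never pass validation.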

```
evadex falsepos [OPTIONS]
```

| Flag | Default | Description |
|---|---|---|
| `--tool`, `-t` | `dlpscan-cli` | Adapter to use |
| `--category` | *(all)* | Category to test. Repeat for multiple. Supported: `credit_card`, `ssn`, `sin`, `iban`, `email`, `phone`, `ca_ramq` |
| `--count` | `100` | Number of false positive values per category |
| `--format`, `-f` | `table` | Output format: `table` (summary to stderr) or `json` (full report) |
| `--output`, `-o` | stdout | Write JSON report to file |
| `--exe` | `dlpscan` | Path to scanner executable (dlpscan-cli only) |
| `--cmd-style` | `python` | Command format for dlpscan-cli: `python` or `rust` |
| `--timeout` | `30.0` | Request timeout in seconds |
| `--concurrency` | `5` | Max concurrent scanner requests |
| `--seed` | *(random)* | Integer seed for reproducible false positive values |
| `--require-context` | off | Pass `--require-context` to dlpscan-rs: only flag matches when surrounding keywords are present. Requires `--cmd-style rust`. See [False positive rate and the `--require-context` tradeoff](#false-positive-rate-and-the---require-context-tradeoff) for measured impact. |
| `--wrap-context` | off | Embed each invalid value in a realistic category-specific sentence before submitting. Simulates how sensitive data appears in real documents. Use with `--require-context` for the most realistic false positive measurement. |

**Examples:**

```bash
# Test false positive rate for credit cards
evadex falsepos --tool dlpscan-cli --category credit_card --count 100

# All categories
evadex falsepos --tool dlpscan-cli --count 100

# Save JSON report
evadex falsepos --tool dlpscan-cli --count 100 --format json -o falsepos_report.json
```

**Output:**

```
  credit_card            0/100 flagged  (0.0%)
  ssn                    2/100 flagged  (2.0%)
  sin                    0/100 flagged  (0.0%)
  ...

Overall false positive rate: 0.3%  (2/700)
```

The JSON report includes per-category rates, overall rate, and the list of specific values that were incorrectly flagged:

```json
{
  "tool": "dlpscan-cli",
  "count_per_category": 100,
  "total_tested": 700,
  "total_flagged": 2,
  "overall_false_positive_rate": 0.3,
  "by_category": {
    "credit_card": {
      "total": 100,
      "flagged": 0,
      "false_positive_rate": 0.0,
      "flagged_values": []
    },
    "ssn": {
      "total": 100,
      "flagged": 2,
      "false_positive_rate": 2.0,
      "flagged_values": ["000-12-3456", "666-99-0001"]
    }
  }
}
```
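
Because the report is plain JSON, it is easy to post-process. The sketch below uses only the field names shown in the sample report and lists the categories whose false positive rate exceeds a limit you choose. The helper name `categories_over_threshold` and the 1.0 % limit are assumptions made for the example, not part of evadex.

```python
import json


def categories_over_threshold(report: dict, max_rate: float) -> list[str]:
    """Return categories whose false positive rate (in percent) exceeds max_rate."""
    return sorted(
        name
        for name, stats in report["by_category"].items()
        if stats["false_positive_rate"] > max_rate
    )


# Trimmed copy of the sample report; in practice, json.load() the file written with -o.
report = json.loads("""
{
  "tool": "dlpscan-cli",
  "overall_false_positive_rate": 0.3,
  "by_category": {
    "credit_card": {"total": 100, "flagged": 0, "false_positive_rate": 0.0, "flagged_values": []},
    "ssn": {"total": 100, "flagged": 2, "false_positive_rate": 2.0,
            "flagged_values": ["000-12-3456", "666-99-0001"]}
  }
}
""")

offenders = categories_over_threshold(report, max_rate=1.0)
print(offenders)  # ['ssn']
```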

**False positive generators by category:**

| Category | Generation strategy |
|---|---|
| `credit_card` | 16-digit numbers with card-like prefixes (4, 51, 37, 6011) that fail the Luhn check |
| `ssn` | `NNN-NN-NNNN` with reserved area codes: 000, 666, 900–999 |
| `sin` | `NNN NNN NNN` with valid first digit (1–7) but wrong Luhn check digit |
| `iban` | IBAN-shaped strings (GB/DE/FR) with a deliberately wrong mod-97 check digit |
| `email` | `user@domain.invalid` — uses IANA-reserved TLDs (`.invalid`, `.test`, `.example`, `.localhost`) |
| `phone` | `+1-NPA-NXX-XXXX` with invalid NANP area codes (000, 555, 911, etc.) |
| `ca_ramq` | RAMQ-shaped `XXXX YYMM DDSS` with invalid birth month codes (00, 13–50, 63–99) |
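
To make the `credit_card` strategy concrete, here is a minimal sketch of how a Luhn-failing number can be produced. It is not evadex's actual generator: the function names are hypothetical, and the approach (compute the one valid check digit, then pick any other digit) is simply one straightforward way to guarantee the Luhn check fails.

```python
import random

PREFIXES = ("4", "51", "37", "6011")  # card-like prefixes from the table


def luhn_valid(number: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def invalid_card(rng: random.Random, prefix: str, length: int = 16) -> str:
    """Build a card-shaped number whose final digit is deliberately wrong."""
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(body_len))
    # Exactly one check digit makes the number valid; choose any of the other nine.
    valid_check = next(c for c in "0123456789" if luhn_valid(body + c))
    wrong_check = rng.choice([c for c in "0123456789" if c != valid_check])
    return body + wrong_check


rng = random.Random(42)  # seeded, in the spirit of --seed
cards = [invalid_card(rng, prefix) for prefix in PREFIXES for _ in range(25)]
print(cards[:3])
```

Every value in `cards` has a plausible prefix and length but fails the Luhn check, so a scanner that validates checksums should flag none of them.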

---

### `evadex list-payloads`

List all built-in test payloads with their categories and types.

```
evadex list-payloads [--type structured|heuristic]
```

| Flag | Default | Description |
|---|---|---|
| `--type` | *(all)* | Filter to `structured` or `heuristic` payloads only |

### `evadex list-techniques`

List all registered evasion generators and the techniques each one applies. Generator names shown here can be used with `evadex generate --technique-group` and `--technique-mix`.

```
evadex list-techniques [--generator NAME]
```

| Flag | Default | Description |
|---|---|---|
| `--generator`, `-g` | *(all)* | Show techniques for a specific generator only |

### `evadex techniques`

Show per-technique scanner-detection rates from the audit log. Powers the `--evasion-mode weighted` and `--evasion-mode adversarial` selections — and useful on its own for spotting which techniques the scanner has been letting through. *(v3.13.0)*

```
evadex techniques [OPTIONS]
```

| Flag | Default | Description |
|---|---|---|
| `--audit-log` | `results/audit.jsonl` | Source audit log to aggregate. |
| `--last` | `10` | Aggregate only the most recent N audit entries. |
| `--top` | *(all)* | Show only the top N techniques by latest scanner-detection rate. |
| `--category` | *(all)* | Substring filter against technique name (e.g. `unicode`, `encoding`). |
| `--min-runs` | `1` | Require at least N data points before showing a technique. |

Sample output:

```
Technique scanner-detection rates  (last 10 runs, 3 techniques)
┌──────────────────────┬────────┬────────┬──────┬──────────┐
│ Technique            │ Latest │   Avg  │ Runs │ Trend    │
├──────────────────────┼────────┼────────┼──────┼──────────┤
│ unicode_zwsp         │  9.1%  │ 12.3%  │  4   │ ↓ -3.2%  │
│ homoglyph_substitute │ 18.4%  │ 17.6%  │  4   │ → +0.4%  │
│ base64_of_rot13      │ 23.5%  │ 26.7%  │  4   │ ↓ -2.1%  │
└──────────────────────┴────────┴────────┴──────┴──────────┘
```

"Latest" / "Avg" are scanner-detection rates — **lower is better evasion**. Cold-start (no history yet) prints a hint and exits cleanly.

### `--evasion-mode` (in `evadex scan` and `evadex generate`)

Control how techniques are chosen for evasion variants based on what's worked historically. *(v3.13.0)*

| Mode | Behaviour |
|---|---|
| `random` *(generate default)* | Uniform random across applicable techniques. |
| `exhaustive` *(scan default)* | Every applicable variant is run / generated. |
| `weighted` | Bias selection by `1 − historical_detection`. Techniques that have evaded best are picked more often. Falls back to random if no audit history exists. |
| `adversarial` | Restrict to techniques whose historical detection is ≤ 50 %. In `evadex scan`, the variant-group filter narrows accordingly. Falls back to the full pool if the filter leaves no candidates. |

Both `weighted` and `adversarial` read history from `--audit-log` (defaults to `results/audit.jsonl`). Run a few normal scans with `--audit-log` set first to build the history. Until then, `evadex techniques` shows a cold-start hint and `--evasion-mode weighted/adversarial` falls back to random with a warning.

```bash
# Build history with a few baseline runs
evadex scan --tool dlpscan-cli --strategy text --tier banking \
  --audit-log results/audit.jsonl

# Now bias toward techniques that have evaded
evadex scan --tool dlpscan-cli --strategy text --tier banking \
  --evasion-mode adversarial --audit-log results/audit.jsonl

# In generate, focus regression fixtures on the hardest evasions
evadex generate --format xlsx --tier banking \
  --evasion-mode weighted --count 100 --output test_weighted.xlsx
```
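
The two history-aware modes can be pictured with a short sketch. It is an illustration of the selection rules in the table, not evadex's code: the function names are hypothetical, detection rates are expressed as fractions between 0 and 1, and the handling of techniques with no recorded history (treated as never detected in `weighted`, excluded in `adversarial`) is an assumption.

```python
import random


def pick_weighted(techniques, history, rng):
    """Pick one technique, biased by 1 - historical detection rate."""
    if not history:
        return rng.choice(techniques)  # no audit history: uniform random
    # Assumption: techniques with no recorded rate count as never detected.
    weights = [1.0 - history.get(name, 0.0) for name in techniques]
    if sum(weights) <= 0:
        return rng.choice(techniques)  # everything is always detected
    return rng.choices(techniques, weights=weights, k=1)[0]


def adversarial_pool(techniques, history, max_detection=0.5):
    """Keep techniques detected at most 50 % of the time, else the full pool."""
    # Assumption: techniques without history are left out of the filtered pool.
    pool = [t for t in techniques if history.get(t, 1.0) <= max_detection]
    return pool or list(techniques)


# Illustrative rates only; real values come from the audit log.
history = {"unicode_zwsp": 0.091, "homoglyph_substitute": 0.184, "base64_standard": 0.93}
techniques = list(history)

print(adversarial_pool(techniques, history))
print(pick_weighted(techniques, history, random.Random(7)))
```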

### Examples

```bash
# Only test credit card payloads
evadex scan --tool dlpscan-cli --strategy text --category credit_card

# Only run unicode evasion techniques
evadex scan --tool dlpscan-cli --strategy text --variant-group unicode_encoding

# Only run unicode + delimiter techniques on SSN and IBAN
evadex scan --tool dlpscan-cli --strategy text \
  --category ssn --category iban \
  --variant-group unicode_encoding --variant-group delimiter

# Test a custom value (category auto-detected)
evadex scan --tool dlpscan-cli --input "AKIAIOSFODNN7EXAMPLE" --strategy text

# File strategy only — test DOCX extraction pipeline
evadex scan --tool dlpscan-cli --input "4532015112830366" --strategy docx

# Save HTML report
evadex scan --tool dlpscan-cli --strategy text --format html -o report.html

# Target a specific scanner binary, tag the output
evadex scan --tool dlpscan-cli --exe /opt/dlpscan/dlpscan --cmd-style rust \
  --scanner-label "rust-2.0.0" --format json -o rust_results.json

# Compare two scanner builds
evadex scan --tool dlpscan-cli --scanner-label "python-1.3.0" -o python.json
evadex scan --tool dlpscan-cli --exe /opt/rust-dlpscan --cmd-style rust \
  --scanner-label "rust-2.0.0" -o rust.json
evadex compare python.json rust.json --format html -o comparison.html
```

---

## Performance and recommended limits

Benchmarks captured on a Windows / Python 3.13 / 32 GB host running the banking tier with `--evasion-rate 0.5`. Times include the full evadex pipeline — payload selection, evasion variant generation, and writer I/O.

| Format | count=100 | count=1 000 | count=10 000 | Peak RSS (1 k) | Notes |
|---|---|---|---|---|---|
| `csv` | ~1.5 s | ~3 s | ~20 s, 92 MB output | 103 MB | Linear scaling — recommended for large fixtures. |
| `xlsx` | ~3 s | ~13 s | **not recommended** | 259 MB | openpyxl materialises every cell in memory. Linear extrapolation puts 10 k at ~2.5 GB peak. Use `csv` or `sqlite` for larger volumes. |
| `sqlite` | ~1.6 s | ~4 s | ~24 s, 114 MB output | 143 MB / **309 MB at 10 k** | Prior to v3.13.0, the customer table was built in Python before insert and 10 k pushed RSS over 500 MB. Now uses 1000-row chunked `executemany`. |
| `parquet` | n/a | n/a | n/a | n/a | Generation works, but Siphon's extractor hangs on every Parquet file ≥ 1 KB — see `results/format_detection_matrix.md`. Skipped from perf testing. |

**Concurrency tuning** (`evadex scan --concurrency N`) on Windows against the dlpscan binary: `--concurrency 10` → 17.8 variants/s, `--concurrency 20` → 18.9, `--concurrency 50` → 20.6. Process-spawn overhead dominates at high concurrency on Windows. Sweet spot is 20–50; going higher rarely pays for itself.

**Recommended `--count` ceilings per format** to stay under 500 MB peak RSS without further optimisation:

| Format | Safe ceiling |
|---|---|
| `csv`, `txt`, `json`, `xml`, `sql`, `log`, `mbox`, `ics`, `warc` | 50 000 + |
| `sqlite`, `7z` | 25 000 |
| `xlsx`, `docx`, `pdf` | 2 000 (memory-heavy formats; chunk via `--formats` + multiple runs for larger volumes) |
| `parquet` | unlimited generation, but skip if you intend to scan with Siphon ≤ 22f7971 |

---

## CI/CD integration

evadex supports a `--min-detection-rate` flag that exits with code 1 if the scanner's detection rate falls below a threshold. Use it as a pipeline gate to prevent deploying a scanner configuration that regresses detection coverage.

```bash
evadex scan --tool dlpscan-cli \
  --strategy text \
  --scanner-label "$(dlpscan --version)" \
  --format json -o results.json \
  --min-detection-rate 90
```

Exit code 0 means the threshold was met; exit code 1 means it was not. The report is always written before the exit check.

To track regressions against a known-good baseline:

```bash
# Save a baseline from the current production scanner
evadex scan --tool dlpscan-cli --scanner-label "prod-baseline" \
  --baseline baseline.json

# In CI: compare the candidate scanner against the baseline
evadex scan --tool dlpscan-cli --scanner-label "candidate" \
  --compare-baseline baseline.json \
  --min-detection-rate 90
```

The `--compare-baseline` flag prints a regression summary to stderr listing any variants that were previously detected and are now missed, and any improvements.

### GitHub Actions workflows

evadex ships ready-to-drop-in GitHub Actions workflows for the Siphon repo at [`docs/github-actions/`](docs/github-actions/). Both build Siphon with `--features full`, start its API server, and run the evadex banking-tier suite against the binary.

| File | Trigger | What it does |
|---|---|---|
| [`evadex-regression.yml`](docs/github-actions/evadex-regression.yml) | every push to `main` and every PR | banking-tier scan + false-positive suite, baseline diff if `evadex_baseline.json` is committed, posts a per-category breakdown back to the PR |
| [`evadex-daily.yml`](docs/github-actions/evadex-daily.yml) | cron `0 6 * * *` (06:00 UTC) | full banking-tier scan + false-positive suite, posts a one-line summary to Slack if the `SLACK_WEBHOOK` secret is set, fails when detection drops below 85 % |

**To install:**

```bash
# In Siphon's repo
mkdir -p .github/workflows
curl -O https://raw.githubusercontent.com/tbustenk/evadex/main/docs/github-actions/evadex-regression.yml
curl -O https://raw.githubusercontent.com/tbustenk/evadex/main/docs/github-actions/evadex-daily.yml
mv evadex-regression.yml evadex-daily.yml .github/workflows/
git add .github/workflows/
git commit -m "ci: add evadex DLP evasion regression workflows"
```

**To set the regression baseline** (commit it to the Siphon repo so the workflow has something to diff against):

```bash
evadex scan --tool siphon --url http://localhost:8080 --api-key $KEY \
  --tier banking --strategy text \
  --baseline evadex_baseline.json
git add evadex_baseline.json
git commit -m "ci: refresh evadex DLP detection baseline"
```

**To tune the gating threshold** (default 85 %), edit the `--min-detection-rate 85` line in either workflow file. A failing scan exits non-zero and fails the workflow, so the threshold doubles as a deploy gate.

**Slack notifications** (daily workflow only): create an incoming webhook in Slack, then add the URL as a repo secret named `SLACK_WEBHOOK`. Without the secret the Slack step skips silently.

---

## Audit log

evadex can append a one-line JSON record to a log file after every scan. This gives you a durable, append-only history of what was tested, when, and what the result was — useful for compliance reviews, trend tracking, and demonstrating that regular scans are being performed.

```bash
evadex scan --tool dlpscan-cli \
  --scanner-label "rust-2.0.0" \
  --strategy text \
  --audit-log /var/log/evadex/audit.jsonl
```

Or set it in `evadex.yaml` so it fires automatically on every run:

```yaml
audit_log: /var/log/evadex/audit.jsonl
```

### Audit record format

Each run appends exactly one line. Fields:

| Field | Type | Description |
|---|---|---|
| `timestamp` | ISO 8601 string | When the scan ran (UTC) |
| `evadex_version` | string | Installed evadex version |
| `operator` | string | OS username of the person who ran the scan |
| `scanner_label` | string | Value of `--scanner-label` (empty if not set) |
| `tool` | string | Adapter used |
| `strategies` | array | Submission strategies used |
| `categories` | array | Categories filtered to (empty = all structured) |
| `include_heuristic` | bool | Whether heuristic categories were included |
| `total` | int | Total test cases run |
| `pass` | int | Variants detected |
| `fail` | int | Variants that evaded scanner |
| `error` | int | Adapter errors |
| `pass_rate` | float | Detection rate percentage |
| `output_file` | string \| null | Path of the report file written, or null |
| `baseline_saved` | string \| null | Path of baseline saved, or null |
| `compare_baseline` | string \| null | Path of baseline compared against, or null |
| `min_detection_rate` | float \| null | Gate threshold used, or null |
| `exit_code` | int | `0` if scan succeeded, `1` if detection-rate gate failed |

### Notes

- The log file is opened in append mode — existing entries are never modified or deleted.
- Parent directories are created automatically if they do not exist.
- A write failure (permissions, disk full, bad path) is silently ignored. The scan result and exit code are never affected by audit log errors.
- The log contains detection rates and category breakdowns but **not** variant values. It is safe to store in shared log aggregation systems.
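
Since each record is a single JSON object per line, trend tracking needs little more than a loop. The sketch below reads records and extracts the `timestamp`, `scanner_label`, and `pass_rate` fields from the table above. The function name `pass_rate_history` is hypothetical, the sample records are made up, and skipping malformed lines is a choice made for this example.

```python
import json


def pass_rate_history(lines):
    """Return (timestamp, scanner_label, pass_rate) for each well-formed record."""
    history = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        if "timestamp" not in record or "pass_rate" not in record:
            continue
        history.append(
            (record["timestamp"], record.get("scanner_label", ""), record["pass_rate"])
        )
    return history


# Made-up records; in practice pass an open file handle for the audit log.
sample = [
    '{"timestamp": "2026-04-06T09:00:00+00:00", "scanner_label": "rust-2.0.0", "pass_rate": 85.2}',
    "not json",
    '{"timestamp": "2026-04-07T09:00:00+00:00", "scanner_label": "rust-2.0.1", "pass_rate": 87.1}',
]

history = pass_rate_history(sample)
for timestamp, label, rate in history:
    print(f"{timestamp}  {label:<12} {rate:5.1f}%")
```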

---

## Feedback loop

evadex Phase 2 implements a GAN-inspired feedback cycle: evadex is the **adversarial fuzzer** and your DLP scanner is the **discriminator**. When the fuzzer finds an evasion that works, the system automatically surfaces what failed and how to close the gap — without requiring manual triage.

After any scan that produces evasions, evadex does three things automatically:

1. **Prints fix suggestions to stderr** — one concrete, actionable normalisation step per unique bypass technique.
2. **Writes `evadex_regressions.py`** to the current directory — a pytest file with one test function per evasion, using dlpscan's `InputGuard` API. These tests fail until the scanner is fixed.
3. **Optionally writes a structured JSON feedback report** via `--feedback-report PATH`.

### Fix suggestions

Suggestions are printed to stderr after the scan summary whenever evasions are found:

```
=== Fix Suggestions ===
  • homoglyph_substitution (unicode_encoding)
    Add Cyrillic/Greek lookalikes to homoglyph normalisation map: О→0, З→3, ο→0, Α→A, Ζ→Z.
    Apply NFKC normalisation then a homoglyph table lookup before scanning
  • zero_width_zwsp (unicode_encoding)
    Strip U+200B (Zero Width Space) from input in the normalisation pipeline before pattern matching
  • base64_standard (encoding)
    Add a base64 decode pass to the normalisation pipeline; scan the decoded content
```

Each suggestion names the technique, the generator group it belongs to, and a specific normalisation step to add to the scanner's input pipeline.

### Regression test file

`evadex_regressions.py` is written to the current directory whenever there are evasions. Each test function:

- Is named after the payload label and evasion technique (`test_visa_16_digit_homoglyph_substitution`)
- Imports and invokes dlpscan's `InputGuard` with the appropriate preset (`PCI_DSS`, `PII`, or `CREDENTIALS`)
- Scans the exact obfuscated variant value that evaded detection
- Asserts `not result.is_clean` — the test passes once the scanner is fixed

```python
def test_visa_16_digit_homoglyph_substitution():
    """Visa 16-digit evaded via homoglyph_substitution — should be detected"""
    from dlpscan import InputGuard, Preset
    guard = InputGuard(presets=[Preset.PCI_DSS])
    result = guard.scan('4532\u041e15112830366')  # Visually similar Cyrillic/Greek characters substituted
    assert not result.is_clean


def test_canada_sin_zero_width_zwsp():
    """Canada SIN evaded via zero_width_zwsp — should be detected"""
    from dlpscan import InputGuard, Preset
    guard = InputGuard(presets=[Preset.PII])
    result = guard.scan('0\u200b4\u200b6\u200b \u200b4\u200b5\u200b4\u200b \u200b2\u200b8\u200b6')  # Zero-width ZWSP between every character
    assert not result.is_clean
```

Run the generated file with:

```bash
pytest evadex_regressions.py
```

Tests fail until the scanner is patched. Each time you fix a technique and re-run evadex, failing tests disappear and the regression file is regenerated to reflect the remaining gaps.

### `--feedback-report PATH`

Saves a structured JSON report containing everything in one file:

```bash
evadex scan --feedback-report feedback.json
```

**Report structure:**

```json
{
  "meta": {
    "timestamp": "2026-04-07T14:22:01.123456+00:00",
    "scanner": "python-1.6.0",
    "total_tests": 590,
    "total_evasions": 76
  },
  "techniques": [
    {
      "technique": "homoglyph_substitution",
      "generator": "unicode_encoding",
      "count": 23,
      "example_variants": ["4532\u041e15112830366", "4\u03bf32015112830366"]
    },
    {
      "technique": "zero_width_zwsp",
      "generator": "unicode_encoding",
      "count": 18,
      "example_variants": ["0\u200b4\u200b6 4\u200b5\u200b4 2\u200b8\u200b6"]
    }
  ],
  "fix_suggestions": [
    {
      "technique": "homoglyph_substitution",
      "generator": "unicode_encoding",
      "description": "Sensitive values bypassed detection by substituting ASCII digits/letters with visually identical Unicode characters from Cyrillic, Greek, or other scripts",
      "suggested_fix": "Add Cyrillic/Greek lookalikes to homoglyph normalisation map: О→0, З→3, ο→0, Α→A, Ζ→Z. Apply NFKC normalisation then a homoglyph table lookup before scanning"
    }
  ],
  "regression_test_code": "\"\"\"Regression tests generated by evadex.\n...\"\"\"\nimport pytest\n\n\ndef test_visa_16_digit_homoglyph_substitution():\n    ..."
}
```

The report is always written, even when there are no evasions; in that case `techniques` and `fix_suggestions` are empty arrays and `regression_test_code` is an empty string.

### Three-phase design

| Phase | Role | Status |
|---|---|---|
| Phase 1 | Adversarial fuzzer — evasion generators test known-sensitive values against the scanner | ✅ Done |
| Phase 2 | Feedback generator — surfaces fix suggestions, regression tests, and structured reports when evasions succeed | ✅ Done |
| Phase 3 | False-positive adversary — generates values that *look* sensitive but aren't, to measure scanner precision | ✅ Done (`evadex falsepos`) |

Together, Phase 1 measures **false negatives** (sensitive values the scanner misses) and Phase 3 measures **false positives** (non-sensitive values the scanner incorrectly flags). Both are needed for a complete picture of scanner accuracy.

---

## Adapters

### Built-in: `dlpscan-cli`

Invokes the [dlpscan](https://github.com/oxide11/dlpscan) CLI directly as a subprocess. evadex was built and tested with dlpscan as the reference scanner. Requires `dlpscan` to be installed and on `PATH` (or provide `--exe`).

```bash
evadex scan --tool dlpscan-cli
```

For file strategies, evadex builds the document in memory, writes it to a temp file, runs the scanner against it, and then immediately deletes the temp file, so test data leaves no persistent disk footprint. File extraction support in dlpscan requires `pip install dlpscan[office]`.

### Built-in: `dlpscan`

Generic HTTP adapter for any DLP tool that exposes a REST API. Sends plain text to `POST /scan` with a `{"content": "..."}` body, and file uploads to `POST /scan/file` as multipart form data. Expects a JSON response with a `detected` boolean (configurable via the `response_detected_key` extra config option).

```bash
evadex scan --tool dlpscan --url http://my-dlpscan-server:8080 --api-key my-key
```

### Built-in: `siphon`

Native adapter for [dlpscan-rs / Siphon](https://github.com/oxide11/dlpscan) via its HTTP API. Use this in production environments where the CLI isn't available — for example, when Siphon runs as a sidecar or dedicated DLP service. Talks to `POST /v1/scan` and parses Siphon's full response (findings, confidence scores, and — when present — BIN brand/country, entropy classification, and validator name).

**Start the Siphon API server:**

```bash
# Bind to localhost:8000 with an API key. Use `0.0.0.0` to expose to other hosts.
DLPSCAN_API_HOST=127.0.0.1 \
DLPSCAN_API_PORT=8000 \
DLPSCAN_API_KEY=$SIPHON_API_KEY \
dlpscan serve
```

**Run evadex against it:**

```bash
# --api-key is sent via the `x-api-key` header; EVADEX_API_KEY works too.
evadex scan --tool siphon \
  --url http://localhost:8000 \
  --api-key $SIPHON_API_KEY
```

**Adapter extras** (via `evadex.yaml`):

```yaml
tool: siphon
url: http://localhost:8000
api_key: ${EVADEX_API_KEY}
presets: [pci_dss, pii]      # compliance presets to enable
categories: []               # optional category allowlist
min_confidence: 0.5          # confidence floor forwarded to Siphon
require_context: false       # require surrounding keywords to flag a match
```

When the Siphon adapter reports a match, the result also carries Siphon-specific detail in the JSON output:

| Field | Description |
|---|---|
| `confidence` | Recognizer confidence from 0.0 – 1.0 |
| `bin_brand` | Card-network brand (Visa, Mastercard, …) for credit card findings |
| `bin_country` | Issuing country from the BIN lookup |
| `entropy_classification` | High-entropy heuristic label (e.g. `api_key`) |
| `validator` | Which validator accepted the match (`luhn`, `mod97`, …) |

`evadex compare` surfaces confidence score changes between two Siphon runs alongside severity transitions, so per-variant regressions are visible even when the pass/fail outcome is unchanged.

#### Entropy-mode testing

Siphon's scanner has four high-entropy-token detection modes, each gating the 4.5 bits/char threshold differently:

| Mode | Gate | When to use |
|---|---|---|
| `gated` | Keyword (`secret`, `key`, `token`, `api_key`, `password`, `bearer`, …) within 80 chars | Default — highest precision, lowest recall |
| `assignment` | Token preceded by an assignment (`KEY=`, `"key":`, `export KEY=`) within 60 chars | Catches `.env` and config-file leaks |
| `all` | Any high-entropy token ≥16 chars passes | Highest recall, noisy — source-code audits |
| `off` | Disabled | Default for Siphon — entropy adds latency |

Siphon's token floor is 16 characters and the Shannon threshold is 4.5 bits/char. A pure-hex secret can reach at most log2(16) = 4.0 bits/char, so it stays below the threshold and passes through **every** mode undetected. This is a real gap to be aware of.
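
The hex ceiling is easy to check numerically. The sketch below computes per-character Shannon entropy from character frequencies within a token. That is the usual definition, but whether Siphon computes it in exactly this way is an assumption, and `shannon_entropy` is a name chosen for this example.

```python
import math
from collections import Counter

THRESHOLD = 4.5  # bits/char, as documented above


def shannon_entropy(token: str) -> float:
    """Shannon entropy in bits per character, from character frequencies."""
    length = len(token)
    counts = Counter(token)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


# Best case for hex: all 16 digits appear equally often.
hex_best_case = "0123456789abcdef" * 2
# 32 distinct characters from a larger alphabet.
mixed_case = "abcdefghijklmnopqrstuvwxyzABCDEF"

print(shannon_entropy(hex_best_case))  # 4.0
print(shannon_entropy(mixed_case))     # 5.0
```

Even a perfectly uniform hex string tops out at 4.0 bits/char, below the 4.5 threshold, while a token drawn from a larger alphabet can clear it.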

`evadex entropy` targets all detection modes at once:

```bash
# Sanity-check every mode against a Siphon instance
evadex entropy --tool siphon --url http://localhost:8000 --api-key $EVADEX_API_KEY

# Score coverage against a specific configured mode
evadex entropy --tool siphon --mode gated        # only gated contexts expected to hit
evadex entropy --tool siphon --mode assignment
evadex entropy --tool siphon --mode all
```

The command submits each high-entropy payload in three contexts — **bare** (value alone), **gated** (value next to `api_key:`), and **assignment** (`SECRET_TOKEN=value`) — and reports which context each category was caught in. It also runs the `entropy_evasion` generator and lists which evasion techniques defeated detection (split, comment-injection, concatenation, low-entropy mixing, double encoding, space breaking).

### Adding a custom adapter

1. Create a file anywhere in your project, e.g. `my_adapter.py`.

2. Subclass `BaseAdapter` and implement `submit()`:

```python
from evadex.adapters.base import BaseAdapter
from evadex.core.registry import register_adapter
from evadex.core.result import Payload, Variant, ScanResult


@register_adapter("my-tool")
class MyToolAdapter(BaseAdapter):
    name = "my-tool"

    async def submit(self, payload: Payload, variant: Variant) -> ScanResult:
        # Send variant.value to your scanner however it expects it.
        # variant.strategy is "text", "docx", "pdf", or "xlsx".
        # Return a ScanResult with detected=True/False.
        # call_my_scanner() is a placeholder for your own client code.
        response = await call_my_scanner(variant.value)
        detected = response.get("found", False)
        return ScanResult(
            payload=payload,
            variant=variant,
            detected=detected,
            raw_response=response,
        )
```

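3. Make sure the adapter module is imported inside the evadex process so that the `@register_adapter` decorator fires. Importing it in a separate `python -c "import my_adapter"` invocation has no effect on a later `evadex` command, because the registration is lost when that interpreter exits. Package the adapter with an entry point in `pyproject.toml`, install the package, and then run `evadex scan --tool my-tool`: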

```toml
[project.entry-points."evadex.adapters"]
my-tool = "my_package.my_adapter"
```

**Optional hooks:**

```python
async def setup(self):
    # Called once before the batch — open connections, authenticate, etc.
    self._session = await open_session()

async def teardown(self):
    # Called once after the batch — clean up connections.
    await self._session.close()

async def health_check(self) -> bool:
    # Optional — verify the scanner is reachable.
    return await ping_scanner()
```

**File strategies:** `variant.strategy` tells you which format evadex wants to use. If your scanner only supports one method, handle what you need:

```python
from evadex.adapters.dlpscan.file_builder import FileBuilder

async def submit(self, payload, variant):
    if variant.strategy == "text":
        raw = await self._scan_text(variant.value)
    else:
        data, mime = FileBuilder.build(variant.value, variant.strategy)
        raw = await self._scan_file(data, mime)
    ...
```

`FileBuilder.build(text, fmt)` returns `(bytes, mime_type)` entirely in memory — no disk writes.

---

## EDM testing

Siphon's **Exact Data Match (EDM)** engine catches *specific known values* — real SSNs, account numbers, tokens — rather than just structurally-plausible patterns. Values are HMAC-SHA256 hashed after normalisation and stored as a hash set; scan tokens are hashed the same way and constant-time compared. EDM complements pattern matching: patterns catch the shape of sensitive data, EDM catches the actual records.

**Normalisation applied before hashing** (from `crates/siphon-core/src/edm.rs :: normalize_value`, in order):

1. NFKC Unicode normalisation
2. Lowercase
3. Trim leading/trailing whitespace
4. Remove every character matching `[\s\-./()]+` (whitespace, hyphens, dots, slashes, parens)

That is the full set of transforms EDM absorbs. Anything else (Cyrillic/Greek homoglyphs, zero-width joiners, other Unicode substitutions) defeats EDM, because NFKC does not fold distinct Unicode scripts together and the strip pattern does not match zero-width characters.
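
The four steps can be approximated in Python to see which variants collapse to the same hash. This is a sketch of the documented behaviour, not a port of `normalize_value`: Python's `\s`, `str.lower()`, and `str.strip()` may differ from the Rust implementation in edge cases, and the key and helper names here are made up for the example.

```python
import hashlib
import hmac
import re
import unicodedata

_STRIP = re.compile(r"[\s\-./()]+")


def normalize_value(value: str) -> str:
    """Apply the four documented normalisation steps in order."""
    value = unicodedata.normalize("NFKC", value)  # 1. NFKC
    value = value.lower()                         # 2. lowercase
    value = value.strip()                         # 3. trim whitespace
    return _STRIP.sub("", value)                  # 4. drop separators


def edm_hash(value: str, key: bytes) -> str:
    """HMAC-SHA256 of the normalised value, hex-encoded."""
    return hmac.new(key, normalize_value(value).encode("utf-8"), hashlib.sha256).hexdigest()


KEY = b"demo-key-not-a-real-secret"  # hypothetical key for the example
registered = edm_hash("4532015112830366", KEY)

variants = {
    "dashes": "4532-0151-1283-0366",
    "nbsp_spaces": "4532\u00a00151\u00a01283\u00a00366",
    "homoglyph": "4532\u041e15112830366",    # Cyrillic O in place of 0
    "zero_width": "4532\u200b015112830366",  # zero-width space inserted
}
matches = {
    name: hmac.compare_digest(edm_hash(value, KEY), registered)
    for name, value in variants.items()
}
print(matches)
```

The separator variants hash to the registered value, while the homoglyph and zero-width variants do not, which matches the behaviour described above.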

**The `evadex edm` command:**

```bash
# Register built-in payloads with Siphon EDM, verify each one, then probe
# which evasion transforms its normaliser absorbs.
evadex edm --url http://localhost:8000 --api-key $SIPHON_KEY

# Restrict to specific categories
evadex edm --category credit_card --category sin --limit 25

# Corpus generation only — no Siphon contact — write a bulk-import file
evadex edm --generate-corpus --output edm_corpus.json
evadex edm --generate-corpus --corpus-format csv --count 1000 --output edm_corpus.csv

# Dry run: print what would be registered without sending anything
evadex edm --dry-run --category credit_card
```

**What the command reports:**

| Section | Meaning |
|---|---|
| **EDM exact-value detection** table | Each registered value resubmitted verbatim — should be 100% if EDM is configured correctly |
| **EDM evasion probe** table | Detection rate per transform (exact, uppercase, dashes, spaces, dots, slashes, nbsp_spaces, homoglyph_0, homoglyph_o, zero_width). `yes` = absorbed by normaliser; `no` = defeats EDM; `partial` = depends on the value |

Registration uses the category namespace `evadex_test_<original_category>` so test hashes never collide with production EDM categories. Note: Siphon's HTTP API exposes `POST /v1/edm/register` and `GET /v1/edm/categories` but **no delete endpoint**, so true cleanup requires clearing the server's EDM state file or restarting the server. The namespace prefix keeps stray hashes clearly identifiable.

**Performance note.** Siphon's EDM does a constant-time scan over every registered hash per token to prevent timing leaks — above `MAX_CONSTANT_TIME_HASHES = 50,000` total hashes the scan cost grows linearly. `evadex edm` prints a warning when a registration run would cross that threshold.

**EDM bulk-registration corpus format** (`--format edm_json` on `evadex generate`):

```json
{
  "values": [
    {"value": "4532015112830366", "category": "credit_card", "label": "Visa test"},
    {"value": "046 454 286",      "category": "sin",         "label": "SIN test"}
  ]
}
```

The shape matches Siphon's `POST /v1/edm/register` request body (flat, one `values[]` array). Split by category and POST each slice, or replay the whole file through a small wrapper — the field names line up with Siphon's API.
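
Splitting by category takes only a few lines. The sketch below groups the corpus entries and rebuilds one `{"values": [...]}` body per category. It stops short of sending anything, and the function name `split_by_category` is hypothetical.

```python
import json
from collections import defaultdict


def split_by_category(corpus: dict) -> dict[str, dict]:
    """Group corpus entries by category, keeping the flat values[] shape."""
    slices = defaultdict(list)
    for entry in corpus["values"]:
        slices[entry["category"]].append(entry)
    return {category: {"values": entries} for category, entries in slices.items()}


# Same entries as the corpus example; in practice, json.load() the generated file.
corpus = json.loads("""
{
  "values": [
    {"value": "4532015112830366", "category": "credit_card", "label": "Visa test"},
    {"value": "046 454 286", "category": "sin", "label": "SIN test"}
  ]
}
""")

bodies = split_by_category(corpus)
for category, body in bodies.items():
    print(category, len(body["values"]))
```

Each value in `bodies` can then be serialised with `json.dumps` and posted as its own request.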

---

## Siphon-C2 integration

Siphon-C2 is the admin web UI and management plane described in the dlpscan-rs architecture docs. It aggregates operational metrics plus test results so detection quality is visible alongside live scanning. evadex can push its scan, false-positive, comparison, and history reports to C2 with one extra line of flags.

**Setup.** Point evadex at your C2 deployment via flags, environment variables, or `evadex.yaml`:

```bash
# CLI flags
evadex scan --tool siphon --tier banking \
  --c2-url http://c2.internal:9090 --c2-key $C2_API_KEY

# Environment variables (picked up by scan / falsepos / compare / history)
export EVADEX_C2_URL=http://c2.internal:9090
export EVADEX_C2_KEY=$C2_API_KEY
evadex scan --tool siphon --tier banking
```

Or in `evadex.yaml`:

```yaml
c2_url: http://c2.internal:9090
c2_key: ${C2_API_KEY}
```

**Endpoints pushed to:**

| Command | C2 endpoint | Payload |
|---|---|---|
| `evadex scan` | `POST /v1/evadex/scan` | counts, pass rate, per-category/per-technique breakdown, top 50 failing variants |
| `evadex falsepos` | `POST /v1/evadex/falsepos` | per-category FP rate + flagged-value list |
| `evadex compare` | `POST /v1/evadex/compare` | full comparison dict (overall delta, per-technique diffs, confidence changes) |
| `evadex history --push-c2` | `POST /v1/evadex/history` | batched audit-log entries for dashboard backfill |

Every push is authenticated via an `x-api-key` header — the same format the core Siphon HTTP API uses — so C2 can reuse one key-management surface. Requests also carry a `User-Agent: evadex/<version>` header and an `evadex_version` field so C2 can surface client-version mix on the dashboard.

**Backfill on first connect:**

```bash
# One-shot push of every historical audit-log entry to a fresh C2
evadex history --push-c2 --c2-url http://c2.internal:9090 --c2-key $C2_API_KEY
```

**Graceful degradation.** Siphon-C2 is explicitly documented as *not critical path* — evadex honours that contract:

- A failed push (network error, 4xx/5xx, timeout, auth failure) prints a single-line warning to stderr and continues.
- The scan / falsepos / compare exit code is never affected by a C2 push failure.
- The `--min-detection-rate` CI/CD gate still fires based on the actual scan result.
- The on-disk output file (`--output`), audit log, regression tests, and baseline comparison all complete normally regardless of C2 reachability.

The only exception is `evadex history --push-c2` without `--c2-url` / `EVADEX_C2_URL` set — that's a user error (no target URL to push to) and exits non-zero.
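
The degradation contract above can be sketched as a best-effort wrapper: any push failure becomes a one-line warning on stderr, and the exit code is decided elsewhere. The names `push_best_effort` and `unreachable` are hypothetical and the outage is simulated; this is not evadex's actual implementation.

```python
import sys
from typing import Callable


def push_best_effort(send: Callable[[], None], label: str = "C2 push") -> bool:
    """Run `send`; on failure, warn on stderr and return False instead of raising."""
    try:
        send()
        return True
    except Exception as exc:  # network error, HTTP error, timeout, auth failure
        print(f"warning: {label} failed: {exc}", file=sys.stderr)
        return False


def unreachable() -> None:
    # Simulated outage standing in for a real HTTP call.
    raise ConnectionError("c2.internal:9090 unreachable")


scan_exit_code = 0  # decided by the scan result (and any CI gate) alone
pushed = push_best_effort(unreachable)
# scan_exit_code is deliberately left untouched by the push outcome.
```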

---

## Output schema

### Top-level

```json
{
  "meta": { ... },
  "results": [ ... ]
}
```

### `meta`

| Field | Type | Description |
|---|---|---|
| `timestamp` | ISO 8601 string | When the scan ran (UTC) |
| `scanner` | string | Scanner label from `--scanner-label` (empty string if not set) |
| `total` | int | Total test cases run |
| `pass` | int | Variants detected by scanner |
| `fail` | int | Variants that evaded scanner |
| `error` | int | Adapter errors |
| `pass_rate` | float | `pass / total * 100`, rounded to one decimal |
| `summary_by_category` | object | Per-category pass/fail/error counts, sorted alphabetically by category name |
| `summary_by_generator` | object | Per-generator pass/fail/error counts, sorted alphabetically by generator name |
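
A minimal sketch of the `pass_rate` formula from the table, using made-up counts. The zero-total branch is an assumption, since the behaviour for an empty run is not documented here.

```python
def pass_rate(passed: int, total: int) -> float:
    """Compute `pass / total * 100`, rounded to one decimal."""
    if total == 0:
        return 0.0  # assumption: the empty-run case is not specified in this document
    return round(passed / total * 100, 1)


# Illustrative counts only, not taken from a real scan.
meta = {"total": 120, "pass": 102, "fail": 15, "error": 3}
rate = pass_rate(meta["pass"], meta["total"])
print(rate)
```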

### `results[]`

| Field | Type | Description |
|---|---|---|
| `payload.value` | string | Original sensitive value |
| `payload.category` | string | Detected category enum value |
| `payload.category_type` | string | `structured` or `heuristic` — see [Structured vs heuristic categories](#structured-vs-heuristic-categories) |
| `payload.label` | string | Human-readable label |
| `variant.value` | string | Transformed/obfuscated value submitted to scanner |
| `variant.generator` | string | Which generator produced this variant |
| `variant.technique` | string | Machine-readable technique name |
| `variant.transform_name` | string | Human-readable description of the transform |
| `variant.strategy` | string | Submission strategy: `text`, `docx`, `pdf`, `xlsx` |
| `detected` | bool | Whether the scanner flagged this variant. `false` for error results — check `severity` to distinguish |
| `severity` | string | `pass` (detected), `fail` (not detected), or `error` (adapter error) |
| `duration_ms` | float | Time for this test case in milliseconds |
| `error` | string \| null | Error message if adapter threw; `null` otherwise |
| `raw_response` | object | Raw parsed response from the adapter. For `dlpscan-cli` this is `{"matches": [...]}`. Match objects may echo the variant value, so handle the output file as sensitive (see [Security notes](#security-notes)). |
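
Because `detected` is `false` for both evasions and adapter errors, consumers should filter on `severity`. The sketch below does this on a tiny hand-written report. It assumes the dotted field names in the table map to nested objects (e.g. `variant.technique` is `result["variant"]["technique"]`), and the technique names are placeholders.

```python
from collections import Counter

# Hand-written sample in the documented shape; all values are illustrative.
report = {
    "meta": {},
    "results": [
        {"variant": {"technique": "technique_a"}, "detected": True,
         "severity": "pass", "error": None},
        {"variant": {"technique": "technique_b"}, "detected": False,
         "severity": "fail", "error": None},
        {"variant": {"technique": "technique_b"}, "detected": False,
         "severity": "fail", "error": None},
        {"variant": {"technique": "technique_c"}, "detected": False,
         "severity": "error", "error": "timeout"},
    ],
}

# Genuine evasions: the scanner ran and did not flag the variant.
evaded = [r for r in report["results"] if r["severity"] == "fail"]
# Adapter errors also have detected == False, so keep them separate.
errors = [r for r in report["results"] if r["severity"] == "error"]

evasions_by_technique = Counter(r["variant"]["technique"] for r in evaded)
print(evasions_by_technique.most_common())
```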

---

## Coverage

The tables below show evadex payload coverage relative to the dlpscan-rs pattern library (**557 individual sub-patterns** across 126 categories).

Each row shows coverage at the **sub-pattern level** — e.g. "Credit Card Numbers — 7/7" means all seven card-network variants (Visa, Amex, Mastercard, Discover, JCB, UnionPay, Diners) have a dedicated seed payload.

### Identity documents

| Region / Category | dlpscan-rs sub-patterns | evadex coverage | Notes |
|---|---:|---:|---|
| Credit Card Numbers | 7 | **7/7** ✓ | Visa, Amex, Mastercard, Discover, JCB, UnionPay, Diners |
| US Driver's Licences | 51 + 1 generic | **52/52** ✓ | All 50 states + DC + generic |
| US — other identifiers | 12 | **12/12** ✓ | SSN, ITIN, EIN, MBI, Passport, Passport Card, NPI, DoD ID, KTN, DEA, USA Routing Number, US Phone Number — completed this release |
| North America — Canada | 29 | **29/29** ✓ | All provincial health/DL/corporate/BN/SIN; 3 DL payloads corrected this release |
| North America — Mexico | 7 | **7/7** ✓ | CURP, RFC, Clave Elector, INE CIC, INE OCR, NSS, Passport — all added this release |
| Europe — United Kingdom | 7 | **7/7** ✓ | NIN, DL, NHS, Passport, Phone, Sort Code, UTR — completed this release |
| Europe — Germany | 6 | **6/6** ✓ | Tax ID, ID, IBAN, Social Insurance, DL, Passport — completed this release |
| Europe — France | 5 | **5/5** ✓ | NIR, CNI, IBAN, DL, Passport — completed this release |
| Europe — Spain | 5 | **5/5** ✓ | DNI, IBAN, NIE, NSS, Passport — completed this release |
| Europe — Italy | 5 | **5/5** ✓ | Codice Fiscale/SSN, DL, Partita IVA, Passport — completed this release |
| Europe — Netherlands | 4 | **4/4** ✓ | BSN, DL, IBAN, Passport — completed this release |
| Europe — Poland | 6 | **6/6** ✓ | PESEL, NIP, REGON, DL, ID Card, Passport — completed this release |
| Europe — Sweden | 4 | **4/4** ✓ | PIN, Org Number, DL, Passport — completed this release |
| Europe — Norway | 4 | **4/4** ✓ | FNR, D-Number, DL, Passport — completed this release |
| Europe — Switzerland | 4 | **4/4** ✓ | AHV, UID, DL, Passport — completed this release |
| Europe — Finland | 3 | **3/3** ✓ | HETU, DL, Passport — completed this release |
| Europe — Austria | 5 | **5/5** ✓ | SVN, Tax, DL, ID Card, Passport — completed this release |
| Europe — Belgium | 4 | **4/4** ✓ | NRN, VAT, DL, Passport — completed this release |
| Europe — (19 other EU/EEA countries) | ~75 | **~75/75** ✓ | Bulgaria, Croatia, Cyprus, Czech, Denmark, EU-ETD, Estonia, Greece, Hungary, Iceland, Ireland, Latvia, Liechtenstein, Lithuania, Luxembourg, Malta, Portugal, Romania, Slovakia, Slovenia, Turkey — all sub-patterns added this release |
| Asia-Pacific — Australia | 11 | **11/11** ✓ | TFN, Medicare, Passport, 8 state DL variants — completed this release |
| Asia-Pacific — China / HK / Macau / TW | 5 | **5/5** ✓ | Resident ID, Passport, HK ID, Macau ID, TW NID — completed this release |
| Asia-Pacific — India | 6 | **6/6** ✓ | Aadhaar, PAN, DL, Passport, Ration Card, Voter ID — completed this release |
| Asia-Pacific — Japan | 6 | **6/6** ✓ | My Number, DL, Health Ins, Juminhyo, Passport, Residence Card — completed this release |
| Asia-Pacific — Singapore | 4 | **4/4** ✓ | NRIC, FIN, DL, Passport — completed this release |
| Asia-Pacific — South Korea | 3 | **3/3** ✓ | RRN, DL, Passport — completed this release |
| Asia-Pacific — New Zealand | 4 | **4/4** ✓ | IRD, NHI, DL, Passport — completed this release |
| Asia-Pacific — Philippines | 6 | **6/6** ✓ | PhilSys, PhilHealth, SSS, TIN, UMID, Passport — completed this release |
| Asia-Pacific — (7 other AP countries) | ~24 | **~24/24** ✓ | Bangladesh, Indonesia, Malaysia, Pakistan, Sri Lanka, Thailand, Vietnam — all sub-patterns added this release |
| Latin America — Brazil | 6 | **6/6** ✓ | CPF, CNPJ, CNH, RG, SUS, Passport — completed this release |
| Latin America — Argentina | 3 | **3/3** ✓ | DNI, CUIL/CUIT, Passport — completed this release |
| Latin America — Chile | 2 | **2/2** ✓ | RUT, Passport — completed this release |
| Latin America — Colombia | 4 | **4/4** ✓ | Cedula, NIT, NUIP, Passport — completed this release |
| Latin America — (8 other LatAm countries) | ~27 | **~27/27** ✓ | Costa Rica, Ecuador, Paraguay, Peru, Uruguay, Venezuela — all sub-patterns added this release |
| Middle East — UAE | 3 | **3/3** ✓ | Emirates ID, Passport, Visa — completed this release |
| Middle East — (10 other ME countries) | ~21 | **~21/21** ✓ | Bahrain, Iran, Iraq, Israel, Jordan, Kuwait, Lebanon, Qatar, Saudi Arabia — all sub-patterns added this release |
| Africa — South Africa | 3 | **3/3** ✓ | ID, DL, Passport — completed this release |
| Africa — (9 other African countries) | ~27 | **~27/27** ✓ | Egypt, Ethiopia, Ghana, Kenya, Morocco, Nigeria, Tanzania, Tunisia, Uganda — all sub-patterns added this release |

### Financial, secrets, and functional

| Category | dlpscan-rs sub-patterns | evadex coverage | Notes |
|---|---:|---:|---|
| Banking & Financial | 5 | **5/5** ✓ | IBAN, SWIFT, ABA, Canada Transit, US Bank Account |
| IBAN (country-specific) | 4 named | **4/4** ✓ | UK, DE, FR, ES, NL IBANs all represented |
| Banking Authentication | 3 | **3/3** ✓ | PIN Block, Encryption Key, HSM Key — completed this release |
| Cryptocurrency | 7 | **7/7** ✓ | Bitcoin (legacy + Bech32), Ethereum, Bitcoin Cash, Litecoin, Monero, Ripple — completed this release |
| Card Track Data | 2 | **2/2** ✓ | Track 1, Track 2 — completed this release |
| Check & MICR | 3 | **3/3** ✓ | MICR, Cashier Check, Check Number — completed this release |
| Cloud Secrets | 3 | **3/3** ✓ | AWS Access Key, AWS Secret Key, Google API Key — completed this release |
| Code Platform Secrets | 5 | **5/5** ✓ | GitHub Classic, OAuth, Fine-Grained PAT, NPM Token, PyPI Token — completed this release |
| Messaging Secrets | 6 | **6/6** ✓ | Slack Bot, Slack User, Slack Webhook, Mailgun, SendGrid, Twilio — completed this release |
| Generic Secrets | 4 | **4/4** ✓ | JWT, Bearer Token, DB Connection String, Private Key — completed this release |
| Payment Secrets | 2 | **2/2** ✓ | Stripe Secret Key, Stripe Publishable Key — completed this release |
| Contact Information | 5 | **5/5** ✓ | Email, Phone (E.164), IPv4, IPv6, MAC Address — completed this release |
| Device Identifiers | 5 | **5/5** ✓ | ICCID, IDFA/IDFV, IMEI, IMEISV, MEID — completed this release |
| Geolocation | 2 | **2/2** ✓ | GPS Coordinates, Geohash — completed this release |
| Securities Identifiers | 6 | **6/6** ✓ | ISIN, CUSIP, FIGI, LEI, SEDOL, Ticker Symbol — completed this release |
| Medical Identifiers | 4 | **4/4** ✓ | NDC Code, DEA Number, Health Plan ID, ICD-10 Code — completed this release |
| Loan & Mortgage | 4 | **4/4** ✓ | Loan Number, ULI, LTV Ratio, MERS MIN — completed this release |
| Legal Identifiers | 2 | **2/2** ✓ | US Federal Case Number, Court Docket Number — completed this release |
| Regulatory Identifiers | 6 | **6/6** ✓ | AML Case ID, CTR, Compliance Case, FinCEN, OFAC SDN, SAR — completed this release |
| Insurance Identifiers | 2 | **2/2** ✓ | Policy Number, Claim Number — completed this release |
| Internal Banking Refs | 2 | **2/2** ✓ | Internal Account Ref, Teller ID — completed this release |
| Property Identifiers | 2 | **2/2** ✓ | Parcel Number, Title Deed — completed this release |
| Social Media | 2 | **2/2** ✓ | Twitter Handle, Hashtag — completed this release |
| Employment | 2 | **2/2** ✓ | Employee ID, Work Permit — completed this release |
| Education | 1 | **1/1** ✓ | EDU Email |
| Dates | 3 | **3/3** ✓ | ISO, US, EU date formats |
| Postal Codes | 5 | **5/5** ✓ | UK, US ZIP+4, Canada, Brazil CEP, Japan — completed this release |
| Personal Identifiers | 2 | **2/2** ✓ | Date of Birth, Gender Marker — completed this release |
| Primary Account Numbers | 2 | **2/2** ✓ | PAN (via credit cards), Masked PAN |
| Customer Financial Data | 4 | **4/4** ✓ | Balance with Currency, Account Balance, DTI Ratio, Income Amount — completed this release |
| Authentication Tokens | 1 | **1/1** ✓ | Session ID |
| Biometric Identifiers | 2 | **2/2** ✓ | Template ID, Biometric Hash (via IDFA payload) |
| VIN | 1 | **1/1** ✓ | Vehicle Identification Number |
| Wire Transfer | 6 | **6/6** ✓ | Fedwire IMAD, CHIPS UID, Wire Reference Number, ACH Trace Number, ACH Batch Number, SEPA Reference — completed this release |

### Classification & governance labels

| Category | dlpscan-rs sub-patterns | evadex coverage | Notes |
|---|---:|---:|---|
| Corporate Classification | 9 | **9/9** ✓ | Confidential, DND, Embargoed, Eyes Only, Highly Conf, Internal Only, NTK, Proprietary, Restricted — completed this release |
| Data Classification Labels | 8 | **8/8** ✓ | Top Secret, CUI, Classified Conf, FOUO, LES, NOFORN, SBU, Secret — completed this release |
| Privacy Classification | 10 | **10/10** ✓ | HIPAA, PCI-DSS, CCPA, FERPA, GDPR, GLBA, NPI, PHI, PII, SOX — completed this release |
| Financial Regulatory Labels | 7 | **7/7** ✓ | MNPI, Draft-Not-for-Circ, Info Barrier, Inside Info, Invest Restricted, Market Sensitive, Pre-Decisional — completed this release |
| Privileged Information | 7 | **7/7** ✓ | Attorney-Client, Legal Privilege, Litigation Hold, Privileged Info, P&C, Protected by Priv, Work Product — completed this release |
| Supervisory Information | 6 | **6/6** ✓ | CSI, Exam Findings, Non-Public, Restricted, Supervisory Conf, Supervisory Ctrl — completed this release |
| URLs with Credentials | 2 | **2/2** ✓ | URL with Password, URL with Token — completed this release |
| PCI Sensitive Data | 1 | **1/1** ✓ | Cardholder Name |

**Summary:** evadex covers **489/557 sub-patterns** (88%) across all 126 dlpscan-rs categories with **554 seed payloads**. Of those 489, 421 fall in structured categories and were confirmed detected by a direct dlpscan-rs seed scan; the other 68 fall in heuristic categories (JWT, API keys, labels) and are excluded from scanner verification by design. The remaining 68 unrepresented sub-patterns are low-specificity numeric patterns (e.g. 6–9 digit sequences) for which the same dlpscan regex already fires on dozens of existing payloads, so no distinct seed value is feasible without a context keyword. See `new_cat_verification.json` for the per-category seed-scan results.
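
The summary figures can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted above.

```python
total_subpatterns = 557
covered = 489
structured_verified = 421
heuristic_excluded = 68

# Covered sub-patterns split into verified and excluded groups.
assert structured_verified + heuristic_excluded == covered

coverage_pct = round(covered / total_subpatterns * 100)  # whole-percent figure
uncovered = total_subpatterns - covered                  # unrepresented sub-patterns
print(coverage_pct, uncovered)
```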

---

## Security notes

- **API keys:** Prefer the `EVADEX_API_KEY` environment variable over the `--api-key` CLI flag. Command-line arguments are visible in process listings (`ps aux`) and may be saved in shell history.
- **Output files:** The JSON report's `raw_response` fields may contain scanner match objects that echo variant values (transformed versions of sensitive test data). Apply appropriate access controls to report files.
- **Temp files:** The `dlpscan-cli` adapter writes each test variant to a temp file for subprocess invocation and deletes it immediately after the scan, so test data leaves no persistent footprint on disk.
- **Network isolation:** Run evadex and the scanner on an isolated test network. Test variant values are obfuscated but structurally derived from real sensitive patterns.
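
For scripts that wrap evadex, the first two notes might be applied as in the sketch below: the key is read from `EVADEX_API_KEY` rather than passed on the command line, and a report is written with owner-only permissions. The helper `write_private_report` is hypothetical and not part of evadex; the permission bits only take effect on POSIX systems.

```python
import json
import os
import stat
import tempfile


def write_private_report(path: str, report: dict) -> None:
    """Write a JSON report that only the owner can read or write (mode 0600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(report, fh)
    os.chmod(path, 0o600)  # also tightens a file that already existed


# Read the key from the environment so it never appears in `ps` output or history.
api_key = os.environ.get("EVADEX_API_KEY", "")

report_path = os.path.join(tempfile.mkdtemp(), "report.json")
write_private_report(report_path, {"meta": {}, "results": []})
mode = stat.S_IMODE(os.stat(report_path).st_mode)
```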

---

## License

MIT — see [LICENSE](LICENSE).
