Metadata-Version: 2.4
Name: exposurecheck
Version: 0.1.0
Summary: Audit your own social-media export for re-identification (mosaic) risk. Local-first, no-dossier, bring-your-own-LLM.
Author: Cora Aegis
License: MIT
Project-URL: Homepage, https://cypherpunkguide.com
Project-URL: Repository, https://github.com/coraaegis/exposurecheck
Project-URL: Issues, https://github.com/coraaegis/exposurecheck/issues
Keywords: privacy,opsec,deanonymization,re-identification,osint-defense,mosaic,reddit,twitter,self-audit
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Security
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Dynamic: license-file

# ExposureCheck

**Audit your own social-media history for re-identification risk — before someone
else does it to you.**

Modern language models can read a few hundred of your ordinary public posts and
infer where you live, where you work, your routine, your family, and link a
pseudonymous account back to your real name — not from one careless post, but
from the *mosaic* of many individually-innocuous ones. Researchers have shown
this works at scale and with unsettling accuracy. `exposurecheck` runs that same
adversarial reading **on your own export, on your own terms**, and shows you
which of *your* posts to generalise or edit.

It is **local-first**, produces **no dossier**, and never writes a profile of you
to disk.

📖 Plain-language explainer of the threat:
**https://cypherpunkguide.com/privacy/social-media-self-audit/** (the companion
article — read it first if "mosaic re-identification" is new to you).

---

## Demo

![ExposureCheck running an offline audit on the bundled sample data](media/demo.gif)

*A ~12-second run on the bundled sample export using the offline `--backend
heuristic` stub — reproducible: clone the repo and run the exact command shown,
and nothing leaves your machine (`--offline` hard-blocks egress). A real
`--backend local` or `cloud` model surfaces far more; the full run also scores
EMPLOYER, FINANCES, FAMILY and SCHEDULE. (Recorded as an
[asciicast](media/demo.cast).)*

## What it does

- Parses your **Reddit** GDPR export and your **X / Twitter** export (directory or `.zip`).
- Runs a **recall-preserving cascade**: a cheap pass ranks every post, an
  expensive pass reads the high-priority ones, and weak signals are *kept* — the
  mosaic is built from weak signals, so throwing them away would be false comfort.
- Extracts the **metadata layer** deterministically (this is where X leaks most):
  the self-set location field, outbound links, image **EXIF/GPS**, device model,
  and posting-time concentration that betrays your timezone.
- Reports **category risk cards** (Location, Employer, Family, Schedule,
  Finances, Account-linkage, …) ranked by **risk contribution**, each with masked
  examples and concrete, *generalise-first* remediation.

## What it deliberately does **not** do

- ❌ No dossier. It never prints "you live in X / work at Y / your name is Z".
  Cards show **masked** snippets; the resolved value only ever appears when *you*
  click through to *your own* original post, **in-session, never saved**.
- ❌ No export of findings. ❌ No scraping (export input only). ❌ No posting/
  deletion on your behalf. ❌ No analysing anyone else's history.
- ❌ It does not make you anonymous. It **reduces** risk. "Low" is not "safe".

## Bring your own model — cloud **or** local

The inference runs on a backend **you** choose:

| backend | what it is | data leaves your machine? |
|---|---|---|
| `local` | a local Ollama (or llama.cpp/LM Studio) model | **no** |
| `cloud` | any OpenAI-compatible endpoint, your own key | **yes** |
| `heuristic` | offline regex stub, **near-zero recall** | no — *dev/CI only, not an audit* |

### ⚠️ The one cloud caveat that actually matters

If the account you are auditing is a **pseudonymous** one you keep separate from
your real identity, **and** your AI/cloud account is registered under your real
name or paid with a real-name method, then sending your history to the cloud lets
the provider link *real identity ↔ anonymous account* on their side (subpoena,
breach, insider). That is the exact deanonymization this tool exists to prevent.

So: **auditing a strictly-anonymous account → use `--backend local`** (or a cloud
account opened and paid for anonymously). Auditing your real-name / public
account → cloud is fine. The CLI states this and requires acknowledgement when it
applies. We never force local (that would shrink the audience to nobody); we make
the trade-off explicit.

---

## Install

Core (parsing, EXIF, cascade, **and** the cloud/local HTTP backends) is **Python
standard library only** — no third-party code touches your export.

```bash
# pipx — isolated, recommended (works today; a PyPI release is coming):
pipx install git+https://github.com/coraaegis/exposurecheck

# or from source:
git clone https://github.com/coraaegis/exposurecheck && cd exposurecheck && pip install -e .

# or run without installing:
python -m exposurecheck --help
```

ExposureCheck is a **command-line tool** today, aimed at people comfortable with a
terminal — which is also where its first reviewers live (GitHub, Hacker News,
privacy forums). A **one-click app for non-technical users** — a packaged build
with a local, in-browser UI and no Python to install — is the next milestone (see
[Status](#status)). The CLI stays for power users.

### Verifying a release

Releases are **not signed with an identity code-signing certificate** — the author
is pseudonymous, and such a certificate would tie the project to a legal identity,
the opposite of the point. Authenticity is cryptographic and verifiable instead:

- Each release is **PGP-signed** by Cora Aegis. Fetch the key via WKD
  (`gpg --locate-keys cora@cypherpunkguide.com`), then
  `gpg --verify exposurecheck-<version>.tar.gz.asc`.
- **SHA-256 checksums** are published with every release.
- Builds are **reproducible** — rebuild from the tagged source and confirm the
  artifact matches.
- Prefer a **package manager** (pip / Scoop / Homebrew) over a downloaded `.exe`.
  An unsigned Windows binary may show a SmartScreen "unknown publisher" prompt;
  that is expected — verify the PGP signature or run from source.

## Usage

Get your data first:
- Reddit → *Settings → Privacy → Request a copy of your data* (the `.zip`).
- X → *Settings → Your account → Download an archive of your data*.

```bash
# Local model — nothing leaves your machine (recommended for anonymous accounts)
exposurecheck audit \
  --reddit ./reddit_export.zip \
  --twitter ./twitter_export \
  --backend local --expensive-model llama3.1 \
  --i-own-this-data

# Cloud (bring your own key; set it in the ENV, never on the command line)
export OPENAI_API_KEY=sk-...
exposurecheck audit --twitter ./twitter_export --backend cloud --i-own-this-data

# See your own posts behind a category (in-session, nothing is saved)
exposurecheck audit --reddit ./reddit_export --backend local -i --i-own-this-data
```

The API key is read from an environment variable on purpose — command-line args
leak into shell history and process listings.

## Cost (cloud)

Roughly **$0.59 per profile** for ~125 posts on a GPT-4-class model; a real
1–3k-post history lands around **$4–15**, trimmed by the recall-preserving
pre-filter. Local models are free (lower accuracy — the tool warns you).

## How it works

```
export ─▶ parse ─▶ prefilter (drop only TRUE-empty) ─┬─▶ deterministic: profile + EXIF + timing ─┐
                                                      └─▶ cascade: cheap route ─▶ expensive read ─┤
                                                                                                  ▼
                                          risk-contribution scoring ─▶ category cards ─▶ no-dossier report
```

See [`docs/THREAT-MODEL.md`](docs/THREAT-MODEL.md) for what is and isn't in scope,
and [`docs/ABUSE-EVAL.md`](docs/ABUSE-EVAL.md) for the dual-use safeguards and the
pre-release abuse evaluation.

## Status

`v0.1` alpha — the **CLI** runs end-to-end (Reddit + X, EXIF/GPS, the
recall-preserving cascade, the no-dossier report), aimed at terminal-comfortable
users for now.

Roadmap:
- A **one-click app** (packaged binary + a hardened, local-only in-browser UI, no
  Python required) so non-technical people can use it too — the CLI stays for
  power users.
- **Image *content* analysis is deliberately not in v1.** Sending images to a
  cloud model is a serious privacy regression, so v1 extracts EXIF/metadata only
  and says so plainly; a local-first multimodal option may come later.
- Real-corpus recall / false-positive evaluation (SynthPAI); more platforms
  (Mastodon); single-post / pre-post checks.

## Security & contact

Found a privacy or safety flaw? See [`SECURITY.md`](SECURITY.md). Reach the author
at `cora@cypherpunkguide.com` (PGP via WKD: `gpg --locate-keys cora@cypherpunkguide.com`).

## License & official source

**MIT** — see [`LICENSE`](LICENSE). Built by **Cora Aegis**
([cypherpunkguide.com](https://cypherpunkguide.com)).

This repository (and the name **ExposureCheck**) is the canonical, official
source. You are free to fork and reuse under the MIT terms — but please don't
present a fork as the official project. ExposureCheck is for auditing **your
own** data; see [`docs/ABUSE-EVAL.md`](docs/ABUSE-EVAL.md) for the dual-use
safeguards that carry that intent (the licence deliberately does not — it can't).
