Metadata-Version: 2.4
Name: icebreak-cli
Version: 0.0.1
Summary: Generate sales ice-breakers for a prospect from public-data signals.
Requires-Python: >=3.11
Requires-Dist: anthropic>=0.40.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: playwright>=1.59.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=15.0.0
Requires-Dist: trafilatura>=2.0.0
Description-Content-Type: text/markdown

# IceBreak

A CLI that turns a free-text prospect prompt into a sales briefing. You type
something like:

```
"Patrick McKenzie at Stripe -- pitching observability"
```

and IceBreak resolves the candidate, walks you through a field-by-field
verification panel (so you catch wrong anchors before they poison the
brief), fans out to ~9 public sources in parallel (GitHub, Hacker News,
Google News, Wikipedia, the company website, your prospect's personal
site, LinkedIn previews, web-search snippets, Crunchbase), and asks an
LLM to synthesize a dossier and a handful of opener candidates you can
read in 30 seconds.

It's a single-process Python CLI. No server, no signup beyond your own API
keys, all output goes to your terminal. Cache is local SQLite.

---

## Setup (one-time)

You need [uv](https://docs.astral.sh/uv/) installed.

```bash
uv sync
cp .env.example .env
```

Then edit `.env`. The minimum to be useful:

| Key | Why |
|---|---|
| One LLM key (`GROQ_API_KEY` is free) | Required. Runs the prompt parser, dossier, and briefing. |
| `TAVILY_API_KEY` (free, no CC) | Strongly recommended. Without it, identity enrichment can't auto-find LinkedIn URL / Twitter handle / company website from just a name. |
| `GITHUB_TOKEN` | Strongly recommended. Without it, GitHub API caps you at 60 req/hour. |

That's it. Crunchbase, LinkedIn previews, Wikipedia, HN, Google News, and
the company website all run unauthenticated.

## Run it

```bash
uv run icebreaker "Patrick McKenzie at Stripe -- pitching observability"
```

You'll see:
1. **Prompt parse** — what name + company + scenario IceBreak extracted.
2. **Identity enrichment** — what GitHub / LinkedIn / Twitter / website it
   resolved.
3. **Field-by-field verification** — for each of the 8 identity fields
   (name, company, website, **personal site**, linkedin, github, twitter,
   bio_snippet) the panel shows the resolved value, where it came from,
   AND any other candidates that were considered. Per field:
   - `[a]ccept` — keep current value, advance
   - `[e]dit` — type a new value
   - `[1]/[2]/[3]...` — pick one of the other candidates the picker
     considered (rendered with title + snippet so you can compare —
     e.g., choose between several "Shivank Prajapati" LinkedIn URLs)
   - `[r]e-search` — fire a per-field tailored Tavily query
     (`site:linkedin.com` for LinkedIn, `site:github.com` for GitHub,
     etc.). Merges new candidates into the list. ONE Tavily call per
     re-search. For prompt-derived fields (name, company), `[r]` falls
     back to full re-enrichment.
   - `[c]lear` — set the field to None
   - `[q]uit` — abort the run
4. **Connector preview** — every connector lists what URL or query it'll
   fire (`github → github.com/janedoe`, `hackernews → "Jane Doe" "Acme"`).
   Type `s <name>` to skip a source you don't want to run, then `g`o.
5. **Connector fan-out** — N sources in parallel, each bounded by the
   per-source timeout.
6. **Aggregation → dossier → briefing** — facts deduped, profile
   synthesized, openers generated.
7. **Brief** — rendered to your terminal.

### Useful flags
- `--inspect` — render per-source facts in tables and dump structured JSON
  to `.cache/inspect-<ts>.json`. Skips the final LLM briefing call. Use this
  when iterating on a connector to see what it actually contributed.
- `--raw` — also capture the raw response per source (HTML, JSON, RSS) to
  `.cache/raw-<ts>.json`. Combine with `--inspect` for full debugging.
- `--browse` — launch Microsoft Edge with a persistent profile so Crunchbase
  can be scraped past Cloudflare. Solve the challenge once interactively;
  cookies persist in `.cache/edge-profile/`.
- `--github <user>` / `--linkedin-url <url>` — override identity
  enrichment when you already know the handles.
- `--no-verify` — skip the human-in-the-loop confirmation. Required for
  non-interactive runs.

## What each source contributes

| Source | Best for | Failure mode |
|---|---|---|
| `github` | Engineers, OSS maintainers — bio, **pinned repos** (when token set, falls back to top-by-stars), READMEs, recent commits, repo homepages → personal site discovery | 60 req/hr without token |
| `hackernews` | Technical founders, infra/devtools — stories + comments | None worth mentioning |
| `googlenews` | Recent press, funding, launches | Rate limit on the RSS endpoint |
| `wikipedia` | Famous people / public companies | Long-tail prospects → empty |
| `website` | Company description, team page, founding year | Marketing-only sites are thin |
| `personal_site` | Engineer's blog or portfolio (auto-discovered from GitHub `user.blog` or repo `homepage` fields) | Skipped when no personal URL discovered |
| `linkedin` (preview) | Headline + short bio (no login needed) | Authwall (status 999) on most pages |
| `crunchbase` | Funding stage, founding date, HQ | Cloudflare blocks plain HTTP — use `--browse` |
| `web_search` | Synthetic source: re-uses the bio snippet from Tavily/Brave web search | Skipped if no bio_snippet was discovered |

The `Orchestrator` runs them all in parallel with a per-source timeout. A
slow or failing source never blocks the brief — it just shows up in
`sources_missed` in the output.

## Caching

A local SQLite cache (`.cache/icebreaker.db`) sits in front of the two
cost-bearing surfaces:

- **Web search (Tavily / Brave):** 24h TTL, keyed by `(provider, query, max_results)`.
- **GitHub:** 6h TTL on the full per-username response bundle (profile +
  events + repos + READMEs).

Re-running the same prompt within these windows pays nothing for those
calls. Force a refresh by deleting the DB:

```bash
rm .cache/icebreaker.db
```

The other connectors aren't cached — they're cheap public endpoints and
their freshness matters more than the saved roundtrip.

## Tests

```bash
uv run pytest
```

147 tests, runs in ~6s. The suite uses `respx` to mock HTTP, so it doesn't
hit live APIs. Each test gets an isolated cache via `tests/conftest.py`.

## Evals (development-only)

There's a small rubric-based eval harness in `evals/` for **regression
testing during development** — it's not part of the production runtime
and doesn't need to be deployed.

```bash
uv run python -m evals.run                  # facts + dossier only (cheap)
uv run python -m evals.run --with-briefing  # also runs Summarizer
```

It scores each prospect in `evals/prospects.yaml` against expected fields
(name resolved, min facts, dossier mentions specific topic, etc.) and
writes a JSONL artifact under `.cache/evals-<ts>.jsonl`. Use it after
non-trivial pipeline changes (picker tweaks, prompt edits, LLM-provider
swaps) to confirm a known-good prospect still produces the expected brief.

End-users running `uv run icebreaker "..."` never touch evals.

### Optional: pre-push regression hook

There's a checked-in `.githooks/pre-push` that runs the eval rubric
before each `git push`, blocking the push when the overall pass rate
drops below 50% (configurable). Skips automatically when:
- only docs / non-code files changed in the commits being pushed
- `.env` is missing (no API keys to run with)
- `uv` isn't on `PATH`

**One-time setup per clone:**
```bash
git config core.hooksPath .githooks
```
That's it — no `chmod` needed on Windows + Git Bash; on Linux/Mac run
`chmod +x .githooks/pre-push` once.

**Tweak the threshold** without editing the script:
```bash
ICEBREAK_EVAL_MIN=0.8 git push     # require 80% pass rate
ICEBREAK_EVAL_MIN=1.0 git push     # strict — every check must pass
```

**Bypass for a one-off WIP push:**
```bash
git push --no-verify
```

## Architecture

`ProspectInput → PromptParser → IdentityResolver → IdentityEnricher →
Orchestrator (fan-out) → Aggregator → DossierBuilder → Summarizer → Brief`.

Full data-flow walkthrough: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).

## Troubleshooting

Quickest answers in [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md).
Common ones:

| Symptom | Likely cause |
|---|---|
| `No LLM provider configured` | Add a `*_API_KEY` to `.env` (Groq is free) |
| Identity has no LinkedIn / X / website | `TAVILY_API_KEY` missing |
| GitHub 403 rate-limit | Add `GITHUB_TOKEN` to `.env` |
| Crunchbase always misses | Cloudflare-blocked. Re-run with `--browse` |
| LinkedIn shows `outcome: authwall` in `--raw` | Expected — public preview endpoint refuses many profiles |
| Brief feels generic | Run with `--inspect` to see which sources actually contributed |
