Metadata-Version: 2.4
Name: placeharvest
Version: 0.1.0
Summary: Harvest businesses from the Google Places API (New) and filter false positives with a configurable LLM backend.
Project-URL: Homepage, https://github.com/destifo/placeharvest
Project-URL: Repository, https://github.com/destifo/placeharvest
Author-email: Estifanos Bireda <estifanosbireda@gmail.com>
License: MIT
License-File: LICENSE
Keywords: data-labeling,geocoding,google-places,llm,scraper
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: GIS
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tenacity>=8.2
Provides-Extra: all
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: openai>=1.40; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-mock>=3.10; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == 'openai'
Description-Content-Type: text/markdown

# placeharvest

Harvest businesses from the **Google Places API (New)** and strip false
positives with a **configurable LLM backend**. The motivating use case is "all
golf simulators in the US and Indonesia," but nothing is hard-coded to golf —
the search terms, the regions, and the meaning of a "false positive" are all
inputs.

The fetch and filter stages are separated by a durable NDJSON file, so re-running
the filter (which makes no Google calls) never re-bills the Google API:

```
fetch  ─▶  data/raw/<region>.ndjson  ─▶  filter  ─▶  data/filtered/<region>.ndjson  ─▶  report
```
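
The value of the intermediate file is that each stage can append to it and pick up where it left off. A minimal sketch of that idea, assuming each NDJSON line is one JSON object carrying an `id` key; the helper names `load_seen_ids` and `append_new` are hypothetical and not part of the package's API:

```python
import json
from pathlib import Path


def load_seen_ids(path):
    """Return the set of ids already present in an NDJSON file (empty if absent)."""
    p = Path(path)
    if not p.exists():
        return set()
    with p.open(encoding="utf-8") as fh:
        # Assumption: every non-blank line is a JSON object with an "id" key.
        return {json.loads(line)["id"] for line in fh if line.strip()}


def append_new(path, places):
    """Append only unseen places, one JSON object per line; return how many were written."""
    seen = load_seen_ids(path)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        for place in places:
            if place["id"] in seen:
                continue  # already on disk from an earlier (possibly interrupted) run
            fh.write(json.dumps(place, ensure_ascii=False) + "\n")
            seen.add(place["id"])
            written += 1
    return written
```

Because the file is append-only and keyed on id, an interrupted run loses at most the line being written, and a second run over the same results writes nothing new.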

## Install

```bash
pip install placeharvest                # core (fetch + cli)
pip install "placeharvest[anthropic]"   # + Anthropic SDK for api/anthropic filtering (default)
pip install "placeharvest[openai]"      # + OpenAI SDK for api/openai filtering
pip install "placeharvest[all]"         # both SDKs
```

`cli` filter mode needs no extra — it shells out to a separately-installed
`claude` or `codex` binary.

## Credentials (two independent domains)

| Stage | Reads | When |
|---|---|---|
| **fetch** | `GOOGLE_MAPS_API_KEY` | always (the only secret the fetcher reads) |
| **filter** `api/anthropic` | `ANTHROPIC_API_KEY` | default |
| **filter** `api/openai` | `OPENAI_API_KEY` | `--provider openai` |
| **filter** `cli/anthropic` | `ANTHROPIC_API_KEY`, or a logged-in `claude` session (`--no-cli-bare`) | `--mode cli` |
| **filter** `cli/openai` | `CODEX_API_KEY` (per-invocation), or a logged-in `codex` session | `--mode cli --provider openai` |

Put them in a `.env` file (see `.env.example`) or export them.

## The filter matrix (mode × provider)

| mode | provider | What runs | Auth |
|---|---|---|---|
| `cli` | `anthropic` | `claude -p` subprocess (headless Claude Code) | `ANTHROPIC_API_KEY` or logged-in session |
| `cli` | `openai` | `codex exec` subprocess (headless Codex) | `CODEX_API_KEY` or logged-in session |
| `api` | `anthropic` | Anthropic Messages API via SDK | `ANTHROPIC_API_KEY` |
| `api` | `openai` | OpenAI Responses API via SDK | `OPENAI_API_KEY` |

`api` mode is the default — self-contained, no external binary, right for CI.
`cli` mode exists for users who already run Claude Code or Codex on a paid plan
and want filtering to ride that session. Impossible combos fail at startup with a
message naming the exact missing piece.
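
The startup check described above can be pictured roughly as follows. This is a sketch built only from the two tables in this README, not the package's actual implementation: `REQUIRED_ENV`, `CLI_BINARY`, and `missing_pieces` are hypothetical names, and the `logged_in` flag stands in for however a real session would be detected.

```python
import shutil

# Transcribed from the credential and matrix tables; the structure is an assumption.
REQUIRED_ENV = {
    ("api", "anthropic"): "ANTHROPIC_API_KEY",
    ("api", "openai"): "OPENAI_API_KEY",
    ("cli", "anthropic"): "ANTHROPIC_API_KEY",
    ("cli", "openai"): "CODEX_API_KEY",
}
CLI_BINARY = {"anthropic": "claude", "openai": "codex"}


def missing_pieces(mode, provider, env, which=shutil.which, logged_in=False):
    """List what a (mode, provider) combination still needs before it can run."""
    key = REQUIRED_ENV.get((mode, provider))
    if key is None:
        return [f"unsupported combination: {mode}/{provider}"]
    missing = []
    # cli mode shells out, so the binary has to be on PATH.
    if mode == "cli" and which(CLI_BINARY[provider]) is None:
        missing.append(f"`{CLI_BINARY[provider]}` binary on PATH")
    # api mode always needs the key; cli mode accepts a logged-in session instead.
    if not env.get(key) and not (mode == "cli" and logged_in):
        missing.append(key)
    return missing
```

Called as `missing_pieces("api", "openai", os.environ)`, an empty list means the combination is usable; anything else is the text for the error message.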

## The golf example, end to end

```bash
# 0. Estimate cost before spending anything.
placeharvest fetch --profile examples/golf_us_id.yaml --region indonesia --dry-run

# 1. Cheap run first (Indonesia is sparse) to validate end-to-end.
placeharvest fetch  --profile examples/golf_us_id.yaml --region indonesia \
    --out data/raw/id.ndjson

# 2. Filter false positives. Target description drives "what is a real match".
placeharvest filter --profile examples/golf_us_id.yaml \
    --in data/raw/id.ndjson --out data/filtered/id.ndjson

# 3. Summarize + export keep.csv / uncertain.csv grouped by country.
placeharvest report --in data/filtered/id.ndjson --csv-dir data/exports/id

# 4. Then the expensive US run (resumable if interrupted).
placeharvest fetch  --profile examples/golf_us_id.yaml --region us \
    --out data/raw/us.ndjson --resume
```

Override anything from the profile on the command line:

```bash
placeharvest filter --in data/raw/id.ndjson --out data/filtered/id.ndjson \
    --mode api --provider openai --model gpt-5.1 --batch-size 25 \
    --target "indoor golf simulator venues; exclude courses, ranges, mini golf, shops"
```

## Library use

```python
import os

from placeharvest import (
    PlacesClient, load_region, resolve_queries, run_fetch,
    make_backend, run_filter, build_report,
)

# Stage 1: fetch raw places for one region (billed Google API calls).
region = load_region("examples/regions/indonesia.yaml")
queries = ["golf simulator", "indoor golf"]
with PlacesClient(api_key=os.environ["GOOGLE_MAPS_API_KEY"]) as client:
    run_fetch(client, region, queries, "data/raw/id.ndjson")

# Stage 2: filter the raw dump with an LLM backend (mode, provider, model).
backend = make_backend("api", "anthropic", "claude-sonnet-4-6")
run_filter(backend, "data/raw/id.ndjson", "data/filtered/id.ndjson",
           target="indoor golf simulator venues; exclude courses and shops")
```

## How it works

**Fetch.** A thin client against the raw `places:searchText` REST endpoint (no
maintained pip package exposes `nextPageToken` on the new Text Search endpoint).
Each region is a bounding box tiled into overlapping circles; every search term
runs against every tile; results dedup on `place.id`. The field mask is the cost
lever — `rating`/`userRatingCount` are included (cheap, help filtering) but
`reviews` are excluded from the bulk pull.
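
To make the tiling concrete, here is one way a bounding box could be covered with overlapping circles. It is an illustrative sketch, not the package's tiling code: `tile_bbox` and its `overlap` parameter are hypothetical, the metres-to-degrees conversion is a flat-earth approximation, and boxes that cross the antimeridian or approach the poles are not handled.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # rough average; good enough for a sketch


def tile_bbox(south, west, north, east, radius_m, overlap=0.9):
    """Return (lat, lng, radius_m) circles on a square grid covering the box."""
    # A circle of radius r circumscribes a square cell of side r * sqrt(2).
    # Shrinking the cell by `overlap` leaves slack for the approximation error.
    step_m = radius_m * math.sqrt(2) * overlap
    dlat = step_m / METERS_PER_DEG_LAT
    tiles = []
    lat = south + dlat / 2
    while lat - dlat / 2 < north:
        # Longitude degrees shrink with latitude, so recompute the step per row.
        dlng = step_m / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
        lng = west + dlng / 2
        while lng - dlng / 2 < east:
            tiles.append((lat, lng, radius_m))
            lng += dlng
        lat += dlat
    return tiles
```

Each `(lat, lng, radius_m)` triple would then be paired with every search term, which is why the tile count multiplies directly into the request count.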

**Adaptive subdivision.** The API caps results at **60 per (query, tile)**: 3
pages of 20. A tile that returns the full 60 is treated as saturated, so it is
split into four half-radius sub-tiles and re-searched, up to `--max-depth`
(default 3). This approaches completeness on dense metros without guessing
density up front.
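
The recursion can be sketched as below. The README does not say where the four sub-tiles are centred, so the diagonal placement here is an assumption (each child circumscribes one quadrant of the square inscribed in its parent). `split_tile`, `search_adaptive`, and the `search` callable are hypothetical names for illustration.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # rough average; good enough for a sketch


def split_tile(lat, lng, radius_m):
    """Return four half-radius child circles, one per quadrant of the parent."""
    # Assumed placement: offset diagonally by r / (2 * sqrt(2)) on each axis.
    offset_m = radius_m / (2 * math.sqrt(2))
    dlat = offset_m / METERS_PER_DEG_LAT
    dlng = offset_m / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
    return [
        (lat + sy * dlat, lng + sx * dlng, radius_m / 2)
        for sy in (-1, 1)
        for sx in (-1, 1)
    ]


def search_adaptive(search, tile, max_depth=3, cap=60, depth=0):
    """Search a tile and recurse into sub-tiles while results hit the cap."""
    places = search(tile)  # hypothetical callable: tile -> list of {"id": ...}
    found = {p["id"]: p for p in places}  # dedup on place id
    if len(places) >= cap and depth < max_depth:
        for child in split_tile(*tile):
            found.update(search_adaptive(search, child, max_depth, cap, depth + 1))
    return found
```

Note that four half-radius circles cannot cover the whole parent circle, only its inscribed square, so a layout like this relies on neighbouring tiles overlapping to pick up the edges.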

**Filter.** The NDJSON is walked in batches (default 50). Each batch is sent to
the configured LLM with a system prompt parameterized by your `--target`. The
model returns a strict JSON verdict per place — `keep` / `reject` / `uncertain`
(three-way, so borderline cases aren't silently dropped). The runner validates
the contract defensively: count mismatches retry then split, hallucinated ids are
dropped, omitted ids become `uncertain`, invalid JSON retries at a smaller batch.
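
The id-level part of that contract check might look like the sketch below. The reply shape (a JSON list of `{"id": ..., "verdict": ...}` objects) and the `reconcile` helper are assumptions made for illustration; the retry and batch-splitting logic is left to the caller and not shown.

```python
import json

VALID_VERDICTS = {"keep", "reject", "uncertain"}


def reconcile(batch_ids, raw_reply):
    """Map every id in the batch to a verdict, tolerating a sloppy model reply."""
    try:
        items = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        # The caller is expected to retry, possibly with a smaller batch.
        raise ValueError("model reply is not valid JSON") from exc
    if not isinstance(items, list):
        raise ValueError("model reply is not a JSON list")

    expected = set(batch_ids)
    verdicts = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        pid, verdict = item.get("id"), item.get("verdict")
        # Hallucinated ids are dropped. Treating an unknown verdict value as
        # omitted is this sketch's own choice, not documented behaviour.
        if pid in expected and verdict in VALID_VERDICTS:
            verdicts[pid] = verdict
    # Anything the model omitted falls back to "uncertain", never to "reject".
    return {pid: verdicts.get(pid, "uncertain") for pid in batch_ids}
```

Defaulting to `uncertain` keeps a model mistake from silently removing a real venue from the output.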

## Cost & coverage caveats (don't ignore these)

- **Completeness is asymptotic.** The 60-result ceiling plus "bias, not restrict"
  location semantics mean some venues are missed even with subdivision. No setting
  guarantees 100%.
- **Cost scales with grid density and term count**, and subdivision fans out on
  dense metros, so real cost can exceed the pre-subdivision `--dry-run` estimate.
  Watch the live counter.
- **Caching terms:** the dump is a point-in-time snapshot. Under Google's Places
  API policies, only the place ID (`place_id`) may be stored indefinitely.
- **The filter is a heuristic over sparse fields** — with no reviews in the bulk
  pull, some judgments ride on name + type + website alone, hence `uncertain`. For
  higher precision, add a targeted second pass that fetches `reviews` for
  `uncertain` places only and re-filters.
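
To see how far subdivision can push cost past a flat estimate, it helps to bound the request count. The arithmetic below follows from the numbers in this README (3 pages per saturated search, four children per split, depth limit `--max-depth`); it is not necessarily how `--dry-run` computes its estimate, and `request_bounds` is a hypothetical helper. No prices are assumed.

```python
def request_bounds(tiles, terms, max_depth=3, pages=3):
    """Return (flat, worst_case) counts of Text Search requests."""
    # Flat estimate: every (tile, term) pair pages through all `pages` pages.
    flat = tiles * terms * pages
    # Worst case: every tile saturates at every level, so each top-level tile
    # expands into 1 + 4 + 16 + ... + 4**max_depth searches.
    searches_per_tile = (4 ** (max_depth + 1) - 1) // 3
    return flat, flat * searches_per_tile


flat, worst = request_bounds(tiles=10, terms=2)
print(flat, worst)  # 60 5100
```

With the default depth of 3 the worst case is 85 times the flat figure. Real runs land far below that because only dense tiles saturate, but the gap is why the live counter matters.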

## License

MIT — see `LICENSE`.
