Metadata-Version: 2.4
Name: gmaps-extractor
Version: 2.0.0
Summary: Extract business data from Google Maps at scale using reverse-engineered internal APIs
Author: promisingcoder
License-Expression: MIT
Project-URL: Homepage, https://github.com/promisingcoder/GoogleMapsCollector
Project-URL: Repository, https://github.com/promisingcoder/GoogleMapsCollector
Project-URL: Changelog, https://github.com/promisingcoder/GoogleMapsCollector/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/promisingcoder/GoogleMapsCollector/issues
Keywords: google-maps,scraper,business-data,extractor,geocoding,places,reviews,async
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Typing :: Typed
Classifier: Framework :: AsyncIO
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.25.0
Provides-Extra: server
Requires-Dist: fastapi>=0.104.0; extra == "server"
Requires-Dist: uvicorn>=0.24.0; extra == "server"
Requires-Dist: pydantic>=2.0.0; extra == "server"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: coverage>=7.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: fastapi>=0.104.0; extra == "dev"
Requires-Dist: uvicorn>=0.24.0; extra == "dev"
Requires-Dist: pydantic>=2.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.5; extra == "docs"
Requires-Dist: mkdocs-material>=9.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.24; extra == "docs"
Dynamic: license-file

# Google Maps Business Extractor

[![PyPI version](https://img.shields.io/pypi/v/gmaps-extractor)](https://pypi.org/project/gmaps-extractor/)
[![Python versions](https://img.shields.io/pypi/pyversions/gmaps-extractor)](https://pypi.org/project/gmaps-extractor/)
[![License](https://img.shields.io/pypi/l/gmaps-extractor)](https://github.com/promisingcoder/GoogleMapsCollector/blob/main/LICENSE)
[![CI](https://github.com/promisingcoder/GoogleMapsCollector/actions/workflows/ci.yml/badge.svg)](https://github.com/promisingcoder/GoogleMapsCollector/actions/workflows/ci.yml)
[![Downloads](https://img.shields.io/pypi/dm/gmaps-extractor)](https://pypi.org/project/gmaps-extractor/)

Extract every business in any geographic area from Google Maps -- no browser needed.

`gmaps-extractor` reverse-engineers Google Maps' internal API to collect business data at scale using raw HTTP requests. Point it at a city and a category, and it systematically covers the entire area using grid-based search with automatic deduplication.

**Capable of 100K+ records/week** with parallel processing and proxy support.

## Features

- **Full area coverage** -- Divides any area into a grid of searchable cells so the whole area is searched.
- **No browser required** -- Pure HTTP requests using httpx. No Selenium, no Puppeteer.
- **Async support** -- `async_collect_v2()` and `stream_collect_v2()` for non-blocking I/O.
- **Streaming** -- Async generator yields businesses as they are found.
- **Event system** -- Lifecycle callbacks for monitoring collection progress.
- **Parallel processing** -- Configurable worker pool (up to 50 concurrent requests).
- **Resumable collection** -- V2 collector saves checkpoints and auto-resumes.
- **Enrichment** -- Fetch place details (hours, phone, website) and reviews concurrently.
- **Adaptive rate limiting** -- Exponential backoff with jitter. Auto-adjusts to Google's limits.
- **Smart deduplication** -- Deduplicates by both `place_id` and `hex_id`.
- **Auto cookie management** -- Builds Google sessions automatically, refreshes on failure.
- **Structured logging** -- Uses Python's `logging` module. Silent by default, configurable.
- **Lightweight core** -- Only requires `httpx`. FastAPI server is optional.
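
To illustrate the deduplication bullet above: neighbouring grid cells can return the same business, so results need to be merged on a stable identifier. The sketch below is not the library's implementation. `dedupe_businesses` is a hypothetical helper, and it assumes a record is a duplicate when either its `place_id` or its `hex_id` has already been seen.

```python
def dedupe_businesses(businesses):
    """Return businesses with duplicates removed, keeping first occurrences.

    A record counts as a duplicate if its place_id OR its hex_id was
    already seen. Records lacking both identifiers are always kept.
    """
    seen_place_ids = set()
    seen_hex_ids = set()
    unique = []
    for biz in businesses:
        place_id = biz.get("place_id")
        hex_id = biz.get("hex_id")
        if place_id and place_id in seen_place_ids:
            continue
        if hex_id and hex_id in seen_hex_ids:
            continue
        if place_id:
            seen_place_ids.add(place_id)
        if hex_id:
            seen_hex_ids.add(hex_id)
        unique.append(biz)
    return unique


# Illustrative records only; real identifiers look different.
cell_a = [{"name": "Acme Law", "place_id": "p1", "hex_id": "h1"}]
cell_b = [
    {"name": "Acme Law", "place_id": "p1"},      # repeats place_id p1
    {"name": "Acme Law LLP", "hex_id": "h1"},    # repeats hex_id h1
    {"name": "Smith & Associates", "place_id": "p2", "hex_id": "h2"},
]
merged = dedupe_businesses(cell_a + cell_b)
print([b["name"] for b in merged])
```

Checking both keys means a record that arrives with only one of the two identifiers can still be matched against an earlier, fuller record.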

## Quick Start

```python
from gmaps_extractor import GMapsExtractor

with GMapsExtractor(proxy="http://user:pass@proxy-host:port") as extractor:
    result = extractor.collect_v2("New York, USA", "lawyers", enrich=True)
    print(f"Found {len(result)} businesses")
    for biz in result:
        print(f"  {biz['name']} - {biz.get('phone', 'N/A')}")
```

## Installation

```bash
# Core library (recommended)
pip install gmaps-extractor

# With FastAPI server support (for CLI or legacy workflows)
pip install gmaps-extractor[server]

# Development
pip install gmaps-extractor[dev]
```

### From Source

```bash
git clone https://github.com/promisingcoder/GoogleMapsCollector.git
cd GoogleMapsCollector
pip install -e ".[dev]"
```

### Requirements

- Python 3.9+
- A residential/sticky proxy (required -- Google blocks datacenter IPs)

## Usage

### Sync Collection (Default)

No server process needed. Requests go directly to Google Maps via `httpx`.

```python
from gmaps_extractor import GMapsExtractor

with GMapsExtractor(proxy="http://user:pass@host:port") as extractor:
    # Basic collection
    result = extractor.collect("London, UK", "dentists")

    # V2 collector with enrichment and reviews
    result = extractor.collect_v2(
        "Paris, France",
        "restaurants",
        enrich=True,
        reviews=True,
        reviews_limit=50,
        workers=30,
    )

    # Access results
    print(result.metadata)      # {"area": "Paris, France", "category": "restaurants", ...}
    print(result.statistics)    # {"total_collected": 1234, ...}
    for biz in result:
        print(biz["name"], biz.get("rating"))
```

### Async Collection

```python
import asyncio
from gmaps_extractor import GMapsExtractor

async def main():
    async with GMapsExtractor(proxy="http://user:pass@host:port") as extractor:
        # Collect all results at once (async)
        result = await extractor.async_collect_v2(
            "Manhattan, NY",
            "lawyers",
            enrich=True,
            reviews=True,
        )
        print(f"Found {len(result)} businesses")

asyncio.run(main())
```

### Streaming Collection

Process businesses as they are found, without waiting for the full collection to finish.

```python
import asyncio
from gmaps_extractor import GMapsExtractor

async def main():
    async with GMapsExtractor(proxy="http://user:pass@host:port") as extractor:
        async for biz in extractor.stream_collect_v2("NYC", "coffee shops"):
            print(f"Found: {biz['name']} at {biz.get('address', 'N/A')}")

asyncio.run(main())
```
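
One common use of a stream is to persist each record as soon as it arrives, so a crash loses little work. The sketch below assumes only that the stream is an async iterator of JSON-serialisable dicts. `write_jsonl` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `extractor.stream_collect_v2(...)` so the example runs without a proxy.

```python
import asyncio
import json
import os
import tempfile


async def write_jsonl(stream, path):
    """Write each item from an async iterator to path, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        async for biz in stream:
            fh.write(json.dumps(biz, ensure_ascii=False) + "\n")
            fh.flush()  # make each record durable as soon as it is found
            count += 1
    return count


async def fake_stream():
    """Stand-in for extractor.stream_collect_v2(...); yields sample records."""
    for name in ("Cafe One", "Cafe Two"):
        await asyncio.sleep(0)
        yield {"name": name}


async def demo():
    path = os.path.join(tempfile.mkdtemp(), "businesses.jsonl")
    written = await write_jsonl(fake_stream(), path)
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    return written, lines


written, lines = asyncio.run(demo())
print(written, lines)
```

The file writes here are blocking, which is usually acceptable for small records; a high-throughput pipeline might hand them to a worker thread instead.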

### Subdivision Mode

Break large areas into named sub-areas (boroughs, districts, neighborhoods) for better coverage.

```python
from gmaps_extractor import GMapsExtractor

with GMapsExtractor(proxy="http://user:pass@host:port") as extractor:
    result = extractor.collect_v2(
        "London, UK",
        "dentists",
        subdivide=True,
        enrich=True,
    )
```

### Event System

Monitor collection progress with lifecycle callbacks.

```python
from gmaps_extractor import GMapsExtractor, EventType, EventEmitter

emitter = EventEmitter()

def on_cell_complete(event):
    print(f"Cell done: +{event.data.get('businesses_found', 0)} businesses")

def on_complete(event):
    total = event.data.get("total_businesses", 0)
    print(f"Collection complete: {total} businesses")

emitter.on(EventType.CELL_COMPLETE, on_cell_complete)
emitter.on(EventType.COLLECTION_COMPLETE, on_complete)

with GMapsExtractor(proxy="http://user:pass@host:port", events=emitter) as extractor:
    result = extractor.collect_v2("NYC", "lawyers")
```

Or use the convenience shortcuts:

```python
from gmaps_extractor import GMapsExtractor

with GMapsExtractor(
    proxy="http://user:pass@host:port",
    on_business_found=lambda e: print(f"Found: {e.data}"),
    on_collection_complete=lambda e: print(f"Done: {e.data}"),
) as extractor:
    result = extractor.collect_v2("NYC", "lawyers")
```
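
Callbacks do not have to be plain functions. As a rough sketch, a small callable object can keep running totals across events. `CellTally` is a hypothetical class, and the only thing it assumes about an event is the `.data` dictionary with a `businesses_found` key used in the `CELL_COMPLETE` example above. The events here are simulated with `SimpleNamespace` so the block runs on its own.

```python
from types import SimpleNamespace


class CellTally:
    """Callable handler that accumulates per-cell counts from event.data."""

    def __init__(self):
        self.cells = 0
        self.businesses = 0

    def __call__(self, event):
        self.cells += 1
        self.businesses += event.data.get("businesses_found", 0)


tally = CellTally()

# Stand-in events. With the library, the handler would be registered via
# emitter.on(EventType.CELL_COMPLETE, tally) instead of being called by hand.
for found in (12, 0, 7):
    tally(SimpleNamespace(data={"businesses_found": found}))

print(f"{tally.cells} cells, {tally.businesses} businesses")
```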

### Logging

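The library logs through Python's `logging` module and installs a `NullHandler` on its logger, so the logger itself produces no output until you attach a handler. `GMapsExtractor` defaults to `verbose=True`, which enables progress output; pass `verbose=False` and configure logging manually if you want full control.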

```python
import logging

from gmaps_extractor import GMapsExtractor

# Option 1: Use verbose=True (default)
with GMapsExtractor(proxy="...", verbose=True) as extractor:
    result = extractor.collect("NYC", "lawyers")  # Progress printed to stdout

# Option 2: Configure logging manually
logging.getLogger("gmaps_extractor").setLevel(logging.DEBUG)
logging.getLogger("gmaps_extractor").addHandler(logging.StreamHandler())

with GMapsExtractor(proxy="...", verbose=False) as extractor:
    result = extractor.collect("NYC", "lawyers")  # DEBUG-level output
```

### Low-Level Client

Use `GMapsClient` or `AsyncGMapsClient` directly for custom workflows.

```python
from gmaps_extractor.client import GMapsClient
from gmaps_extractor.settings import GMapsSettings

settings = GMapsSettings(proxy_url="http://user:pass@host:port")
client = GMapsClient(settings)

# Search
businesses = client.search("lawyers", lat=40.7128, lng=-74.0060)

# Place details
details = client.place_details(hex_id="0x89c259a...:0x25d41...", name="Acme Law")

# Reviews
reviews = client.reviews(hex_id="0x89c259a...:0x25d41...", limit=20)
```

## Configuration

### Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `proxy` | `str` | `None` | Proxy URL. Falls back to `GMAPS_PROXY_*` env vars. |
| `cookies` | `dict` | `None` | Explicit cookie override. Auto-managed if `None`. |
| `workers` | `int` | `20` | Parallel search workers. |
| `use_server` | `bool` | `False` | Use legacy FastAPI server (requires `[server]` extra). |
| `verbose` | `bool` | `True` | Enable progress output via logging. |
| `events` | `EventEmitter` | auto | Event emitter for lifecycle hooks. |
| `progress` | `bool/ProgressReporter` | auto | Progress reporter (attached when `verbose=True`). |
| `on_business_found` | `callable` | `None` | Shortcut callback for `BUSINESS_FOUND` events. |
| `on_collection_complete` | `callable` | `None` | Shortcut callback for `COLLECTION_COMPLETE` events. |
| `server_port` | `int` | `8000` | Port for legacy server mode. |

### Environment Variables

```bash
export GMAPS_PROXY_HOST="proxy-host:port"
export GMAPS_PROXY_USER="username"
export GMAPS_PROXY_PASS="password"
export GMAPS_COOKIES='{"NID":"...","SOCS":"..."}'
```

### Config Resolution Order

1. Constructor arguments (highest priority)
2. Environment variables
3. `config.py` / `_config_defaults.py` defaults (lowest priority)
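
The resolution order above can be sketched for the proxy setting as follows. This is an illustration, not the library's code: `resolve_proxy` is a hypothetical helper, and the way the three `GMAPS_PROXY_*` variables are combined into an `http://user:pass@host` URL is an assumption.

```python
import os
from urllib.parse import quote


def resolve_proxy(proxy=None, env=None, default=None):
    """Resolve a proxy URL: explicit argument, then environment, then default."""
    env = os.environ if env is None else env
    if proxy:  # 1. constructor argument wins
        return proxy
    host = env.get("GMAPS_PROXY_HOST")
    if host:  # 2. environment variables
        user = env.get("GMAPS_PROXY_USER")
        password = env.get("GMAPS_PROXY_PASS")
        if user and password:
            # Percent-encode credentials so characters like '@' stay valid.
            return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}"
        return f"http://{host}"
    return default  # 3. packaged default


env = {
    "GMAPS_PROXY_HOST": "proxy-host:8080",
    "GMAPS_PROXY_USER": "username",
    "GMAPS_PROXY_PASS": "password",
}
print(resolve_proxy("http://a:b@explicit:1", env=env))
print(resolve_proxy(env=env))
print(resolve_proxy(env={}))
```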

## Exception Handling

```python
from gmaps_extractor import GMapsExtractor
from gmaps_extractor.exceptions import (
    GMapsExtractorError,
    BoundaryError,
    ConfigurationError,
    RateLimitError,
    AuthenticationError,
    ServerError,
)

try:
    with GMapsExtractor(proxy="http://user:pass@host:port") as extractor:
        result = extractor.collect_v2("New York, USA", "lawyers")
except BoundaryError:
    print("Could not resolve area boundaries via Nominatim")
except RateLimitError:
    print("Rate limit exceeded after all retries")
except AuthenticationError:
    print("Proxy or cookie authentication failed")
except GMapsExtractorError as e:
    print(f"Extraction failed: {e}")
```
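
The library retries internally, so a `RateLimitError` reaching your code means those retries were exhausted. A long-running job might still want an outer retry with longer waits. The sketch below shows exponential backoff with full jitter under that assumption. `call_with_backoff` and `backoff_delays` are hypothetical helpers, and the local `RateLimitError` class is a placeholder so the block runs standalone; in practice you would import the real one from `gmaps_extractor.exceptions`.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder so the sketch runs standalone; use the library's class in practice."""


def backoff_delays(retries, base=1.0, cap=60.0, rng=random.random):
    """Yield `retries` delays: exponential growth, capped, with full jitter."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_backoff(func, retries=3, base=1.0, cap=60.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait and retry, re-raising once retries run out."""
    delays = backoff_delays(retries, base, cap)
    while True:
        try:
            return func()
        except RateLimitError:
            delay = next(delays, None)
            if delay is None:
                raise
            sleep(delay)


attempts = []


def flaky():
    """Fails twice with RateLimitError, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("slow down")
    return "ok"


waits = []
result = call_with_backoff(flaky, retries=3, sleep=waits.append)  # record waits instead of sleeping
print(result, len(attempts), len(waits))
```

With the library, `func` could be a lambda wrapping `extractor.collect_v2(...)`; the `base` and `cap` values here are placeholders to tune for your proxy.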

## CLI

After installing, these commands are available:

```bash
# V2 collector (recommended)
gmaps-collect-v2 "Manhattan, New York" "lawyers" --enrich --reviews -l 100

# V1 collector
gmaps-collect "New York, USA" "lawyers" --subdivide

# Add reviews to existing collection
gmaps-enrich-reviews output/lawyers_in_manhattan.json -l 50

# Start FastAPI server (only needed for CLI usage)
gmaps-server
```

**Note:** CLI commands require the FastAPI server to be running (`gmaps-server`). The library API does not.

## Output Format

JSON and CSV files are generated in the `output/` directory.

```json
{
  "metadata": {
    "area": "New York, USA",
    "category": "lawyers",
    "boundary": {"name": "New York", "north": 40.91, "south": 40.49, "east": -73.70, "west": -74.25},
    "search_mode": "grid",
    "enrichment": {"details_fetched": true, "reviews_fetched": true, "reviews_limit": 20}
  },
  "statistics": {
    "total_collected": 1234,
    "duplicates_removed": 89,
    "search_time_seconds": 120.5,
    "total_time_seconds": 340.2
  },
  "businesses": [
    {
      "name": "Smith & Associates",
      "address": "123 Broadway, New York, NY 10006",
      "place_id": "ChIJ...",
      "rating": 4.5,
      "review_count": 123,
      "latitude": 40.7128,
      "longitude": -74.0060,
      "phone": "+1 212-555-0123",
      "website": "https://example.com",
      "category": "Lawyer",
      "hours": {"monday": "9:00 AM - 5:00 PM"},
      "reviews_data": [{"author": "John", "rating": 5, "text": "Excellent!", "date": "2 months ago"}]
    }
  ]
}
```
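
If you need a narrower view than the generated files, the JSON structure above is straightforward to post-process. The sketch below filters by rating and renders a chosen subset of columns as CSV. `businesses_to_csv` and `FIELDS` are hypothetical names, and the inline sample simply mirrors the shape shown above; in practice you would `json.load()` a file from `output/`.

```python
import csv
import io
import json

FIELDS = ["name", "phone", "website", "rating", "review_count"]


def businesses_to_csv(collection, min_rating=0.0):
    """Render businesses at or above min_rating as CSV text with FIELDS columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for biz in collection.get("businesses", []):
        rating = biz.get("rating")
        # Skip unrated businesses and those below the threshold.
        if rating is not None and rating >= min_rating:
            writer.writerow(biz)  # missing fields are written as empty cells
    return buffer.getvalue()


collection = json.loads("""
{"businesses": [
  {"name": "Smith & Associates", "rating": 4.5, "review_count": 123,
   "phone": "+1 212-555-0123", "website": "https://example.com"},
  {"name": "Low Rated LLC", "rating": 2.1, "review_count": 4}
]}
""")
csv_text = businesses_to_csv(collection, min_rating=4.0)
print(csv_text)
```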

## Architecture

```
gmaps_extractor/
├── extractor.py          # GMapsExtractor (high-level API) + CollectionResult
├── client.py             # GMapsClient (sync HTTP, default path)
├── async_client.py       # AsyncGMapsClient (async HTTP)
├── settings.py           # GMapsSettings dataclass
├── events.py             # EventEmitter + EventType
├── progress.py           # ProgressReporter
├── exceptions.py         # Exception hierarchy
├── parsers/              # Response parsers (business, place, reviews)
├── geo/                  # Grid generation, Nominatim boundary resolution
├── extraction/           # Collection orchestrators (sync, async, streaming)
├── decoder/              # Protobuf parameter decoder
└── server.py             # Optional FastAPI server
```
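
The `geo/` package is listed as handling grid generation. As a rough idea of what dividing a bounding box into cells involves, here is a minimal sketch. It is not taken from the library: `generate_grid` and the `cell_size_deg` parameter are hypothetical, and it assumes a box that does not cross the antimeridian. The bounds are the New York values from the sample output.

```python
import math


def generate_grid(north, south, east, west, cell_size_deg):
    """Split a bounding box into cells of at most cell_size_deg per side.

    Returns a list of dicts holding each cell's bounds and centre point.
    """
    if not (north > south and east > west and cell_size_deg > 0):
        raise ValueError("expected north > south, east > west, cell_size_deg > 0")
    rows = math.ceil((north - south) / cell_size_deg)
    cols = math.ceil((east - west) / cell_size_deg)
    # Shrink the step so cells tile the box exactly with no overhang.
    lat_step = (north - south) / rows
    lng_step = (east - west) / cols
    cells = []
    for r in range(rows):
        for c in range(cols):
            cell_south = south + r * lat_step
            cell_west = west + c * lng_step
            cells.append({
                "south": cell_south,
                "north": cell_south + lat_step,
                "west": cell_west,
                "east": cell_west + lng_step,
                "lat": cell_south + lat_step / 2,  # centre used as search point
                "lng": cell_west + lng_step / 2,
            })
    return cells


cells = generate_grid(north=40.91, south=40.49, east=-73.70, west=-74.25, cell_size_deg=0.1)
print(len(cells), cells[0]["lat"], cells[0]["lng"])
```

Cells defined in degrees are not equal in area, since longitude spacing narrows away from the equator; a production grid would likely account for that.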

## Contributing

See [CLAUDE.md](CLAUDE.md) for architecture details, common tasks, and development commands.

```bash
git clone https://github.com/promisingcoder/GoogleMapsCollector.git
cd GoogleMapsCollector
pip install -e ".[dev]"
pytest
```

## License

MIT License -- See [LICENSE](LICENSE) for details.
