Metadata-Version: 2.4
Name: gov-websites-collector
Version: 0.1.0
Summary: Collect US business and professional license data from government websites across all 50 states
Project-URL: Homepage, https://github.com/promisingcoder/gov_websites_collector
Project-URL: Repository, https://github.com/promisingcoder/gov_websites_collector
Project-URL: Issues, https://github.com/promisingcoder/gov_websites_collector/issues
Project-URL: Documentation, https://github.com/promisingcoder/gov_websites_collector#readme
Author-email: Youssef Nagy <yossefn68@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: business-data,government,lead-generation,license-lookup,professional-license,real-estate,scraper,secretary-of-state
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Office/Business
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: selectolax>=0.3.0
Requires-Dist: tenacity>=8.0.0
Provides-Extra: all
Requires-Dist: camoufox[geoip]>=0.4.0; extra == 'all'
Requires-Dist: playwright>=1.50.0; extra == 'all'
Provides-Extra: browser
Requires-Dist: playwright>=1.50.0; extra == 'browser'
Provides-Extra: camoufox
Requires-Dist: camoufox[geoip]>=0.4.0; extra == 'camoufox'
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# gov-websites-collector

[![PyPI version](https://badge.fury.io/py/gov-websites-collector.svg)](https://pypi.org/project/gov-websites-collector/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Collect US business and professional license data from government websites across all 50 states + DC.**

A Python library and CLI tool that scrapes Secretary of State business registrations, professional licensing boards, and other government databases. Supports HTTP scraping, browser automation (Playwright), and anti-detection browsing (Camoufox) for sites with bot protection.

## Features

- 🏛️ **51 state collectors** — all 50 states + DC
- 📋 **Business entities** — LLC, Corporation, Nonprofit registrations from Secretary of State
- 🪪 **Professional licenses** — Real estate, contractors, medical, and more
- 🌐 **HTTP + Browser** — HTTP-first with Playwright/Camoufox fallback for JS-heavy sites
- 🔄 **Proxy support** — Configurable HTTP/SOCKS5 proxy rotation
- ⚡ **Async** — Built on `httpx` with async/await throughout
- 🛡️ **Anti-detection** — Camoufox integration bypasses Cloudflare/Incapsula
- 📊 **Structured data** — Pydantic models with consistent schema across states
- 🖥️ **CLI included** — `gov-collect` command for terminal use

## Installation

```bash
# Core (HTTP scraping only)
pip install gov-websites-collector

# With browser automation (for JS-rendered sites)
pip install "gov-websites-collector[browser]"
playwright install chromium

# With Camoufox (anti-detection browser for protected sites)
pip install "gov-websites-collector[camoufox]"

# Everything
pip install "gov-websites-collector[all]"
```

## Quick Start

### Python API

```python
import asyncio
from gov_collector import collect

# Search California real estate licenses
results = asyncio.run(collect("CA", "Smith", category="licenses"))
for r in results:
    print(f"{r.data.holder_name} — {r.data.license_number} ({r.data.status})")

# Search with a proxy (for states with bot protection)
results = asyncio.run(collect("OR", "Portland", proxy="http://user:pass@host:port"))

# Use Camoufox for anti-detection browsing
results = asyncio.run(collect("FL", "Realty", use_camoufox=True))

# Search specific category
results = asyncio.run(collect("TX", "Smith", category="businesses"))
```

### Advanced Usage

```python
import asyncio
from gov_collector import get_collector, SearchParams, DataCategory

async def main():
    params = SearchParams(
        query="Smith",
        state="CA",
        max_results=50,
    )
    
    async with get_collector("CA", proxy=None, timeout=30.0) as collector:
        async for result in collector.collect(params):
            if result.category == "licenses":
                lic = result.data
                print(f"License: {lic.license_number} — {lic.holder_name}")
            elif result.category == "businesses":
                biz = result.data
                print(f"Business: {biz.name} — {biz.status}")

asyncio.run(main())
```
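
To search several states in one run, the individual searches can be scheduled concurrently. The sketch below assumes only what the Quick Start shows: an async function that takes a state and a query and returns a list. `fake_collect` is a hypothetical stand-in so the snippet runs without network access, and `collect_many` is an illustrative helper, not part of the library.

```python
import asyncio
from typing import Awaitable, Callable

# Shape of an async search function such as `collect(state, query)`.
CollectFn = Callable[[str, str], Awaitable[list]]


async def fake_collect(state: str, query: str) -> list:
    """Hypothetical stand-in for gov_collector.collect (no network)."""
    await asyncio.sleep(0)  # yield control like a real I/O call would
    if state == "ZZ":
        raise ValueError(f"unknown state: {state}")
    return [f"{state}:{query}"]


async def collect_many(
    collect_fn: CollectFn, states: list[str], query: str
) -> dict[str, list]:
    """Run one search per state concurrently; failed states map to []."""
    tasks = [collect_fn(state, query) for state in states]
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    results: dict[str, list] = {}
    for state, outcome in zip(states, outcomes):
        # With return_exceptions=True, errors come back as values.
        results[state] = [] if isinstance(outcome, BaseException) else outcome
    return results


by_state = asyncio.run(collect_many(fake_collect, ["CA", "TX", "ZZ"], "Smith"))
print(by_state)
```

Passing the real `collect` in place of `fake_collect` should work as long as it accepts the same two positional arguments. Swallowing errors into an empty list is a design choice for the sketch; logging the exception is usually preferable in practice.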

### CLI

```bash
# Search California licenses
gov-collect search --state CA --query "Smith" --category licenses

# Search Texas businesses (JSON output)
gov-collect search --state TX --query "Acme" --category businesses --format json

# With proxy and Camoufox
gov-collect search --state FL --query "Realty" --camoufox --proxy "http://user:pass@host:port"

# List all available states
gov-collect states

# List states that support license lookups
gov-collect states --category licenses

# Show info about a specific state
gov-collect info CA
```
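
When driving the CLI from another Python script, it can help to assemble the command as an argument list rather than a shell string. The following is a rough sketch that uses only the flags shown in the examples above; `build_search_argv` and `run_search` are hypothetical helper names, and the structure of the JSON output is not assumed.

```python
import shlex
import subprocess
from typing import List, Optional


def build_search_argv(
    state: str,
    query: str,
    category: Optional[str] = None,
    output_format: Optional[str] = None,
    proxy: Optional[str] = None,
    camoufox: bool = False,
) -> List[str]:
    """Assemble a `gov-collect search` command from the documented flags."""
    argv = ["gov-collect", "search", "--state", state, "--query", query]
    if category:
        argv += ["--category", category]
    if output_format:
        argv += ["--format", output_format]
    if camoufox:
        argv.append("--camoufox")
    if proxy:
        argv += ["--proxy", proxy]
    return argv


def run_search(argv: List[str]) -> str:
    """Run the command and return its stdout (requires the CLI on PATH)."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


argv = build_search_argv("TX", "Acme", category="businesses", output_format="json")
print(shlex.join(argv))  # shell-safe rendering of the command, for logging
```

Passing a list to `subprocess.run` avoids shell quoting problems with queries or proxy URLs that contain special characters.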

## Supported States

### HTTP-only (no browser needed)

| State | Categories | Source |
|-------|-----------|--------|
| AL | Licenses | Alabama Real Estate Commission |
| AZ | Licenses | Arizona Dept of Real Estate |
| CA | Licenses, Businesses* | DRE + bizfile Online* |
| CO | Licenses, Businesses | DORA + Secretary of State |
| DE | Licenses | Delaware Professional Regulation |
| GA | Licenses | Georgia Real Estate Commission |
| ID | Businesses | Idaho Secretary of State API |
| IN | Licenses, Businesses | MyLicense + Secretary of State |
| KY | Businesses | Kentucky Secretary of State |
| LA | Licenses | Louisiana Real Estate Commission |
| ME | Licenses | Maine Real Estate Commission |
| MN | Businesses | Minnesota Secretary of State API |
| MS | Licenses | Mississippi Real Estate Commission |
| ND | Businesses | North Dakota FirstStop API |
| NJ | Licenses | New Jersey MyLicense |
| NY | Businesses | New York Dept of State |
| SC | Businesses | South Carolina Business Filings |
| TX | Licenses, Businesses | TREC + Comptroller |

### Browser required (Playwright or Camoufox)

| State | Categories | Notes |
|-------|-----------|-------|
| AK | Licenses | Commerce license search |
| CT | Licenses, Businesses | eLicense + Concord SOTS |
| FL | Licenses, Businesses | SunBiz + DBPR |
| HI | Licenses, Businesses | PVL (Cloudflare) + HBE |
| IA | Businesses | Secretary of State |
| IL | Businesses | Illinois Secretary of State |
| MT | Businesses | Montana Secretary of State |
| NH | Businesses | New Hampshire QuickStart |
| OR | Businesses | Oregon Secretary of State (needs proxy) |
| TN | Licenses, Businesses | Verify TN + Secretary of State |
| VA | Licenses, Businesses | DPOR + SCC |

\* *CA business search requires browser (JavaScript SPA)*

### Additional states

All remaining states have collector modules but may be blocked by CAPTCHAs, WAFs, or other anti-bot measures. See [STATE_STATUS.md](STATE_STATUS.md) for detailed status of each state.

## Configuration

### Proxy

Many government sites block datacenter IPs or use Cloudflare/Incapsula. A residential proxy significantly improves success rates.

```python
import asyncio
import os

from gov_collector import collect

# Via function argument
results = asyncio.run(collect("OR", "Smith", proxy="http://user:pass@host:port"))

# Via environment variable
os.environ["GOV_COLLECTOR_PROXY"] = "http://user:pass@host:port"
results = asyncio.run(collect("OR", "Smith"))
```

```bash
# CLI
gov-collect search --state OR --query "Smith" --proxy "http://user:pass@host:port"

# Or via env var
export GOV_COLLECTOR_PROXY="http://user:pass@host:port"
gov-collect search --state OR --query "Smith"
```
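
When both an explicit proxy and the environment variable are set, some precedence rule has to apply. The sketch below shows one plausible rule, explicit argument first and `GOV_COLLECTOR_PROXY` second. This ordering is an assumption for illustration rather than documented behavior, and `resolve_proxy` is a hypothetical helper.

```python
import os
from typing import Optional

PROXY_ENV_VAR = "GOV_COLLECTOR_PROXY"  # variable name used in the examples above


def resolve_proxy(explicit: Optional[str] = None) -> Optional[str]:
    """Pick the proxy URL: explicit argument first, then the environment.

    The precedence is an assumption for illustration, not documented behavior.
    """
    if explicit:
        return explicit
    # Treat an unset or empty variable as "no proxy".
    return os.environ.get(PROXY_ENV_VAR) or None


print(resolve_proxy("http://arg-host:8080"))  # explicit value is returned as-is
print(resolve_proxy())  # falls back to the environment, or None if unset
```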

### Browser Automation

For states with JavaScript-rendered pages:

```python
import asyncio

from gov_collector import collect

# Playwright (standard browser)
results = asyncio.run(collect("FL", "Smith", use_browser=True))

# Camoufox (anti-detection; recommended for protected sites)
results = asyncio.run(collect("FL", "Smith", use_camoufox=True))
```
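
The feature list describes an HTTP-first approach with a browser fallback. How the library implements this internally is not documented here. The sketch below shows one way such a chain could work, with hypothetical `fetch_http` and `fetch_browser` stand-ins in place of real `httpx` and Playwright calls.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Fetcher = Callable[[str], Awaitable[str]]


class Blocked(Exception):
    """Raised by a fetcher when the page cannot be retrieved this way."""


async def fetch_http(url: str) -> str:
    # Hypothetical stand-in: pretend plain HTTP fails on JavaScript-only pages.
    if "spa" in url:
        raise Blocked("page needs JavaScript")
    return f"<html>http:{url}</html>"


async def fetch_browser(url: str) -> str:
    # Hypothetical stand-in for a Playwright or Camoufox page load.
    return f"<html>browser:{url}</html>"


async def fetch_with_fallback(url: str, fetchers: List[Fetcher]) -> str:
    """Try each fetcher in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return await fetch(url)
        except Blocked as exc:
            last_error = exc  # remember why this strategy failed, try the next
    raise RuntimeError(f"all strategies failed for {url}") from last_error


chain = [fetch_http, fetch_browser]  # cheapest strategy first
static_page = asyncio.run(fetch_with_fallback("https://example.gov/list", chain))
spa_page = asyncio.run(fetch_with_fallback("https://example.gov/spa", chain))
print(static_page)
print(spa_page)
```

Ordering the chain from cheapest to most expensive means a browser is launched only when the plain HTTP attempt has already failed.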

### Rate Limiting

Built-in rate limiting prevents overwhelming government servers:

```python
import asyncio

from gov_collector import collect

# Default: 1 second between requests
results = asyncio.run(collect("CA", "Smith", rate_limit=1.0))

# Slower for sensitive sites
results = asyncio.run(collect("CA", "Smith", rate_limit=2.0))
```
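
For intuition about what "minimum seconds between requests" means, here is a minimal sketch of a minimum-interval limiter built on `asyncio`. It is not the library's implementation, which is not described here; `MinIntervalLimiter` is a hypothetical name, and the demo uses a very short interval so it finishes quickly.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Ensure at least `interval` seconds pass between consecutive calls."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time of the previous request, if any

    async def wait(self) -> None:
        async with self._lock:  # serialize callers so the spacing is global
            now = time.monotonic()
            if self._last is not None:
                remaining = self.interval - (now - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()


async def demo() -> float:
    limiter = MinIntervalLimiter(interval=0.05)  # short interval for the demo
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()  # a real request would follow each wait
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"3 requests took {elapsed:.2f}s")
```

The first call passes immediately and each later call waits out the remainder of the interval, so three calls take roughly two intervals in total.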

## Data Models

All results use Pydantic models with a consistent schema:

### `CollectorResult`
Top-level wrapper containing:
- `category` — "licenses", "businesses", or "properties"
- `state` — Two-letter state code
- `data` — One of `License`, `Business`, or `Property`
- `collected_at` — Timestamp
- `source` — Data source name

### `License`
- `license_number`, `license_type`, `status`
- `holder_name`, `holder` (Person)
- `business_name`, `address`
- `issue_date`, `expiration_date`

### `Business`
- `name`, `entity_type`, `status`
- `filing_number`, `formation_date`
- `address`, `registered_agent`
- `officers`

### `Property`
- `parcel_id`, `address`
- `owner_name`, `assessed_value`, `market_value`
- `year_built`, `square_feet`

See [models.py](gov_collector/models.py) for full field definitions.
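
To illustrate how the wrapper and payload types fit together, the sketch below mirrors a subset of the fields listed above using standard-library dataclasses. These are simplified stand-ins, not the library's Pydantic models: the field types are assumed, the sample values are made up, and `describe` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Union


@dataclass
class License:
    license_number: str
    holder_name: str
    status: str


@dataclass
class Business:
    name: str
    entity_type: str
    status: str


@dataclass
class CollectorResult:
    category: str  # "licenses", "businesses", or "properties"
    state: str
    data: Union[License, Business]
    source: str
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def describe(result: CollectorResult) -> str:
    """Format one result, branching on the type of payload it wraps."""
    data = result.data
    if isinstance(data, License):
        return f"[{result.state}] license {data.license_number}: {data.holder_name} ({data.status})"
    if isinstance(data, Business):
        return f"[{result.state}] business {data.name}: {data.entity_type} ({data.status})"
    raise TypeError(f"unsupported payload: {type(data).__name__}")


# Made-up sample record, for illustration only.
sample = CollectorResult(
    category="licenses",
    state="CA",
    data=License(license_number="01234567", holder_name="Jane Smith", status="Active"),
    source="DRE",
)
line = describe(sample)
print(line)
```

Branching on the payload type with `isinstance` is an alternative to comparing `result.category` against a string, and it lets a type checker narrow `data` in each branch.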

## API Reference

### `collect(state, query, **kwargs)`

High-level async function that returns a list of results.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `state` | `str` | required | Two-letter state code |
| `query` | `str` | required | Search term |
| `category` | `str \| None` | `None` | "licenses", "businesses", or "properties" |
| `proxy` | `str \| None` | `None` | Proxy URL |
| `timeout` | `float` | `30.0` | Request timeout (seconds) |
| `rate_limit` | `float` | `1.0` | Min seconds between requests |
| `use_browser` | `bool` | `False` | Enable Playwright |
| `use_camoufox` | `bool` | `False` | Enable Camoufox |
| `max_results` | `int` | `100` | Max results to return |
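
Callers that accept user input may want to validate `state` and `category` before starting a search. The sketch below is one possible pre-check based only on the values in the table above. Whether `collect` performs similar validation itself is not stated here, and `normalize_request` is a hypothetical helper.

```python
from typing import Optional

# Category names taken from the parameter table above.
VALID_CATEGORIES = {"licenses", "businesses", "properties"}


def normalize_request(
    state: str, category: Optional[str] = None
) -> tuple[str, Optional[str]]:
    """Validate and normalize `state` and `category` before a search."""
    code = state.strip().upper()
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"state must be a two-letter code, got {state!r}")
    if category is not None:
        category = category.strip().lower()
        if category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
    return code, category


print(normalize_request(" ca ", "Licenses"))
```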

### `get_collector(state, **kwargs)`

Returns a collector instance for advanced usage with `async with`.

### `list_states()`

Returns a list of `StateInfo` objects, one per registered collector.

## Contributing

1. Fork the repo
2. Create a feature branch
3. Add or fix a state collector in `gov_collector/states/`
4. Test: `gov-collect search --state XX --query "test" -v`
5. Submit a PR

## License

MIT — see [LICENSE](LICENSE).
