Metadata-Version: 2.4
Name: babygreenflare
Version: 1.0.5
Summary: Technical SEO crawler, CLI, and audit toolkit
Home-page: https://github.com/Grow-Online-Digital/gflareclone
Author: Benjamin Görler
Author-email: ben@greenflare.io
License: GPLv3+
Project-URL: Source, https://github.com/Grow-Online-Digital/gflareclone
Project-URL: Tracker, https://github.com/Grow-Online-Digital/gflareclone/issues
Project-URL: Upstream, https://github.com/beb7/gflare-tk
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests<3,>=2.33.1
Requires-Dist: lxml<7,>=6.1.0
Requires-Dist: cssselect<2,>=1.4.0
Requires-Dist: ua-parser<2,>=1.0.2
Requires-Dist: pillow<13,>=12.2.0
Requires-Dist: packaging<27,>=26.2
Requires-Dist: httpx[http2]<1,>=0.27
Provides-Extra: gsc
Requires-Dist: google-auth<3,>=2.0; extra == "gsc"
Provides-Extra: render
Requires-Dist: playwright<2,>=1.45; extra == "render"
Provides-Extra: accessibility
Requires-Dist: playwright<2,>=1.45; extra == "accessibility"
Provides-Extra: modern-ui
Requires-Dist: ttkbootstrap<2,>=1.10; extra == "modern-ui"
Provides-Extra: cli
Requires-Dist: rich<14,>=13; extra == "cli"
Requires-Dist: argcomplete<4,>=3; extra == "cli"
Provides-Extra: cloud
Requires-Dist: pyarrow<20,>=15; extra == "cloud"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Greenflare SEO Crawler

[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
![Tests](https://img.shields.io/badge/tests-CI%20verified-brightgreen.svg)

Open-source technical SEO crawler with an async engine, 40+ reports, AI-readiness auditing, scheduled crawls, a plugin system, and a REST API. It runs as a desktop app, a headless CLI, an API server, or a Docker container.

## Why Greenflare

Greenflare is an open, scriptable technical SEO workspace. It does not claim to be a benchmarked replacement for commercial crawlers, so verify any competitor-specific comparison in the target environment before using it in sales or migration material.

| Capability | Greenflare |
|---|---|
| **License** | Free, GPLv3 open source |
| **Crawler** | Async engine with HTTP/2 support and a threaded fallback |
| **JS rendering** | Optional Playwright rendering in crawl/enrichment workflows |
| **AI-readiness audit** | Deterministic GEO/AEO modules with labelled evidence sources |
| **REST API** | Local/team API with persisted jobs, cancellation, reports, actions, and token auth |
| **Scheduled crawls** | Built-in scheduler with change detection and webhook alerts |
| **Plugin system** | Python plugin hooks for columns, crawl updates, reports, and completion callbacks |
| **CI/CD integration** | Severity gates, JSON/CSV exports, and API action exports |

## Quick Start

### Install From PyPI

```bash
# One-line macOS/Linux installer
curl -fsSL https://raw.githubusercontent.com/Grow-Online-Digital/gflareclone/master/scripts/install.sh | sh

# Recommended isolated CLI install
pipx install "babygreenflare[cli]"

# Check local setup and optional extras
greenflare-cli doctor

# Create a credential-free sample audit
greenflare-cli demo --db greenflare-demo.gflaredb
greenflare-cli report action-plan --db greenflare-demo.gflaredb

# Crawl a site (async engine, up to 200 concurrent connections)
greenflare-cli https://example.com --db audit.gflaredb

# Crawl with live web dashboard
greenflare-cli https://example.com --db audit.gflaredb --web-ui

# Export issues and fail CI on critical problems
greenflare-cli https://example.com --db audit.gflaredb \
    --issues-csv issues.csv --fail-on critical

# Generate HTML report
greenflare-cli report html --db audit.gflaredb --output report.html

# Open desktop GUI
greenflare
```

No `pipx` yet?

```bash
python3 -m pip install --user pipx
python3 -m pipx ensurepath
python3 -m pipx install "babygreenflare[cli]"
```

For a no-Python setup, run the Docker image and open the workspace:

```bash
mkdir -p greenflare-data
docker run --rm -p 8080:8080 \
  -v "$PWD/greenflare-data:/data" \
  ghcr.io/grow-online-digital/babygreenflare:latest \
  greenflare-cli serve --host 0.0.0.0 --port 8080 --output-dir /data
```

Then open `http://localhost:8080/workspace` and click **Create Demo Audit**.

See [docs/INSTALL.md](docs/INSTALL.md) for desktop, Docker, and optional browser-feature installs.

## Core Capabilities

### Async Crawler Engine
Up to 200 concurrent connections via httpx + asyncio with HTTP/2. Pass `--sync` to fall back to the threaded engine.
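
The connection cap can be pictured as a semaphore around each request. The following is a minimal sketch of that general technique, not Greenflare's engine code: `crawl_all` and `fake_fetch` are hypothetical names, and the fetch coroutine is injected so the example runs without httpx or network access.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrency=200):
    """Fetch every URL with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # Each task waits here until one of the slots is free
        async with semaphore:
            return url, await fetch(url)

    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request (no network access)
    await asyncio.sleep(0)
    return 200


if __name__ == "__main__":
    pages = [f"https://example.com/page-{i}" for i in range(5)]
    print(asyncio.run(crawl_all(pages, fake_fetch, max_concurrency=2)))
```

In a real crawler the injected coroutine would wrap an HTTP client call; keeping it as a parameter makes the concurrency limit easy to test in isolation.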

### 40+ Reports
SEO issues, SERP gaps, keyword maps, cannibalization, content refresh, competitor comparison, version-aware schema validation, accessibility, AI readiness, backlink risk, server logs, and executive summaries.

### pSEO Template Intelligence
Groups generated URL families such as `/locations/:slug` into persisted template rows, then reports thin/duplicate risk, mixed indexability, canonical mismatches, orphan/internal-link weakness, sitemap coverage, and Search Console vs GA4 demand mismatches with representative URL samples.
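
To make the grouping idea concrete, here is a rough sketch that collapses URLs differing only in their last path segment into a `/parent/:slug` template. It is an illustration under simple assumptions, not Greenflare's grouping logic: `group_url_templates` and the `min_family_size` threshold are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_url_templates(urls, min_family_size=3):
    """Group URLs whose paths differ only in the last segment.

    Returns {template: [urls]} for families with at least
    `min_family_size` members, for example "/locations/:slug".
    """
    families = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if len(segments) < 2:
            continue  # top-level pages have no parent directory to group under
        template = "/" + "/".join(segments[:-1]) + "/:slug"
        families[template].append(url)
    # Keep only families large enough to look like generated pages
    return {
        template: sorted(members)
        for template, members in families.items()
        if len(members) >= min_family_size
    }


if __name__ == "__main__":
    sample = [
        "https://example.com/locations/berlin",
        "https://example.com/locations/paris",
        "https://example.com/locations/rome",
        "https://example.com/about",
    ]
    print(group_url_templates(sample))
```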

### AI-Readiness Auditing
Deterministic scoring for extractability, citation readiness, evidence quality, entity clarity, snippet eligibility, and bot access. The `geo-aeo-overview` report groups those findings into technical, content-structure, entity, freshness, evidence, log, query-planning, and citation modules without treating any score as a ranking predictor. Planned AEO/GEO prompt groups can be imported into `fanout-coverage` as planning evidence, while measured AI visibility and Bing AI citation telemetry stay labelled separately.

### JS Rendering in Crawl Loop
Optional Playwright rendering during the crawl. Smart mode auto-renders JS-heavy pages. Stores raw vs rendered evidence for the render-diff report.

```bash
greenflare-cli https://spa-site.com --db audit.gflaredb --rendering-mode smart
```

### Scheduled Crawls + Change Detection
Recurring crawls with automatic comparison against previous runs. Detects new errors, lost pages, indexability changes, title changes, and new redirect chains. Sends Slack-compatible webhook alerts.

```bash
greenflare-cli schedule add --name weekly --url https://example.com --cron "0 9 * * 1" \
    --webhook-url https://hooks.slack.com/services/YOUR/WEBHOOK
greenflare-cli schedule run --name weekly
```
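
The comparison step can be sketched as a diff between two snapshots keyed by URL. This is an illustration only: the `status`, `title`, and `indexable` field names are assumptions, `detect_changes` is a hypothetical helper, and redirect-chain detection is left out for brevity.

```python
def detect_changes(previous, current):
    """Compare two crawl snapshots of the form url -> page fields.

    Assumed page fields: {"status": int, "title": str, "indexable": bool}.
    """
    changes = {
        "new_errors": [],
        "lost_pages": [],
        "title_changes": [],
        "indexability_changes": [],
    }
    for url, before in previous.items():
        after = current.get(url)
        if after is None:
            changes["lost_pages"].append(url)
            continue
        # A page that was fine before and now returns 4xx/5xx
        if before["status"] < 400 <= after["status"]:
            changes["new_errors"].append(url)
        if before["title"] != after["title"]:
            changes["title_changes"].append(url)
        if before["indexable"] != after["indexable"]:
            changes["indexability_changes"].append(url)
    # URLs that first appear in the current crawl already in an error state
    for url, after in current.items():
        if url not in previous and after["status"] >= 400:
            changes["new_errors"].append(url)
    return changes
```

A non-empty result is the kind of payload a webhook alert could summarize.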

### Live Crawl Dashboard
Real-time web UI during crawls with URL feed, directory tree, issue chips, and stats. No dependencies.

```bash
greenflare-cli https://example.com --db audit.gflaredb --web-ui
```

### Post-Crawl Audit Workspace
The REST API server also serves a no-build audit workspace for existing `.gflaredb` files. It brings together the action model, issue rows, pSEO templates, GEO/AEO report previews, report requirements, review state, and JSON export links.

```bash
greenflare-cli serve --port 8080
# open http://localhost:8080/workspace?db=/absolute/path/to/audit.gflaredb
```
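
Database paths that contain spaces or other reserved characters need percent-encoding in the `db` query parameter. A small sketch, assuming the server decodes the query string in the standard way (`workspace_url` is a hypothetical helper, not part of Greenflare):

```python
from pathlib import Path
from urllib.parse import quote, urlencode


def workspace_url(db_path, host="localhost", port=8080):
    """Build a /workspace link with the absolute database path percent-encoded."""
    absolute = Path(db_path).resolve()
    query = urlencode({"db": str(absolute)}, quote_via=quote)
    return f"http://{host}:{port}/workspace?{query}"


if __name__ == "__main__":
    print(workspace_url("audit.gflaredb"))
```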

### REST API Server
Run Greenflare as a headless service for CI/CD pipelines and team access.

```bash
greenflare-cli serve --port 8080 --output-dir ./greenflare-api

# Start a crawl via API
curl -X POST http://localhost:8080/api/crawl \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example.com"}'

# Trusted private network use with token auth
GREENFLARE_API_TOKEN="change-me" greenflare-cli serve --host 0.0.0.0 --port 8080 --output-dir ./greenflare-api
curl -H "Authorization: Bearer change-me" http://localhost:8080/api/health
```
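
The same crawl request can be built from Python with only the standard library. This sketch mirrors the `curl` calls above; `build_crawl_request` is a hypothetical helper, and the shape of the JSON response is not documented here, so the example stops short of interpreting it.

```python
import json
import urllib.request


def build_crawl_request(base_url, target_url, token=None):
    """Build (but do not send) a POST /api/crawl request."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Matches the bearer-token header used in the curl example
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/crawl",
        data=body,
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    request = build_crawl_request(
        "http://localhost:8080", "https://example.com", token="change-me"
    )
    # Sending needs a running `greenflare-cli serve` instance:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     print(json.load(response))
    print(request.get_method(), request.full_url)
```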

### Plugin System
Extend with custom columns, issue checks, reports, and connectors without forking.
During sync and async crawls, `register_columns()` adds columns to the crawl database, `on_page_crawled()` can fill those columns for each parsed URL, and `on_crawl_complete()` runs once after a completed crawl.

```python
from greenflare.plugins.base import BasePlugin


class BrandPlugin(BasePlugin):
    name = "Brand Monitor"

    def register_columns(self):
        # Column name -> column type added to the crawl database
        return {"brand_in_title": "INT"}

    def on_page_crawled(self, url, row_dict, db=None):
        # Called for each parsed URL; the returned dict fills the columns above
        title = str(row_dict.get("page_title", "") or "").lower()
        return {"brand_in_title": 1 if "mybrand" in title else 0}
```
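
Plugin hooks are plain methods, so they can be checked without running a crawl. The sketch below uses a stand-in `BasePlugin` so that it runs without Greenflare installed; `apply_plugin` and its merge behaviour are assumptions for illustration, not the crawler's actual hook dispatch.

```python
class BasePlugin:
    """Stand-in for greenflare.plugins.base.BasePlugin (assumption for this sketch)."""

    name = "base"

    def register_columns(self):
        return {}

    def on_page_crawled(self, url, row_dict, db=None):
        return {}


class BrandPlugin(BasePlugin):
    name = "Brand Monitor"

    def register_columns(self):
        return {"brand_in_title": "INT"}

    def on_page_crawled(self, url, row_dict, db=None):
        title = str(row_dict.get("page_title", "") or "").lower()
        return {"brand_in_title": 1 if "mybrand" in title else 0}


def apply_plugin(plugin, url, row):
    """Merge a plugin's column values into a copy of one crawl row."""
    declared = plugin.register_columns()
    values = plugin.on_page_crawled(url, row) or {}
    # Catch typos early: a hook should only fill columns it registered
    unknown = set(values) - set(declared)
    if unknown:
        raise KeyError(f"undeclared columns: {sorted(unknown)}")
    return {**row, **values}


if __name__ == "__main__":
    row = {"url": "https://example.com/", "page_title": "MyBrand Shoes"}
    print(apply_plugin(BrandPlugin(), row["url"], row))
```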

### Export and Sharing
Export crawl databases to Parquet, JSON Lines, or CSV for team sharing and analytics tools.

```bash
greenflare-cli export jsonl --db audit.gflaredb --output audit.jsonl
greenflare-cli export parquet --db audit.gflaredb --output-dir ./export/
```
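
A JSON Lines export holds one JSON object per line, so it can be streamed with the standard library. This is a minimal sketch: `parse_jsonl` and `count_by_field` are hypothetical helpers, and `status_code` is an assumed column name that may differ in a real export.

```python
import json
from collections import Counter


def parse_jsonl(lines):
    """Yield one dict per non-empty line; accepts a file handle or a list of strings."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_field(rows, field):
    """Count rows by the value of one field (None when the field is missing)."""
    return Counter(row.get(field) for row in rows)


if __name__ == "__main__":
    sample = [
        '{"url": "https://example.com/", "status_code": 200}',
        '{"url": "https://example.com/old", "status_code": 404}',
    ]
    print(count_by_field(parse_jsonl(sample), "status_code"))
    # With a real export:
    # with open("audit.jsonl", encoding="utf-8") as handle:
    #     print(count_by_field(parse_jsonl(handle), "status_code"))
```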

### Docker Deployment

```bash
docker run --rm -p 8080:8080 \
  -v "$PWD/greenflare-data:/data" \
  ghcr.io/grow-online-digital/babygreenflare:latest \
  greenflare-cli serve --host 0.0.0.0 --port 8080 --output-dir /data
```

## Installation

### Requirements
- Python 3.10+

### From PyPI

```bash
curl -fsSL https://raw.githubusercontent.com/Grow-Online-Digital/gflareclone/master/scripts/install.sh | sh
```

Or use `pipx` directly:

```bash
pipx install "babygreenflare[cli]"
greenflare-cli doctor
```

For desktop:

```bash
pipx install "babygreenflare[cli,modern-ui]"
greenflare
```

### From Source

```bash
git clone https://github.com/Grow-Online-Digital/gflareclone.git
cd gflareclone
pip install -e ".[cli]"
```

### Optional Extras

```bash
pip install "babygreenflare[modern-ui]"       # Desktop theme (ttkbootstrap)
pip install "babygreenflare[render]"          # Playwright JS rendering
pip install "babygreenflare[accessibility]"   # Browser-side accessibility checks
pip install "babygreenflare[gsc]"             # Google Search Console auth
pip install "babygreenflare[cloud]"           # Parquet export (pyarrow)
pip install "babygreenflare[cli]"             # Rich output + shell completion

# After installing render or accessibility extras:
python -m playwright install chromium
```

## CLI Commands

```
greenflare-cli <command> [options]

crawl          Spider or list-mode crawl (async by default)
compare        Diff two crawl databases
config         Save/list/run named crawl configs
connectors     Validate external connector credentials
demo           Create a sample audit database
doctor         Check setup and optional extras
enrich         Import GSC, GA4, CrUX, DataForSEO, Apify, backlinks, AI/GEO CSVs, logs
export         Parquet/JSONL/CSV export and import
plugins        List and inspect installed plugins
report         40+ reports (SEO, AI search, executive)
schedule       Recurring crawls with change detection
serve          REST API server
snapshot       Health snapshots
```

## Connectors

Enrich crawl data with external sources. All connectors are bring-your-own-key and disabled by default.

| Connector | Data | Env Vars |
|-----------|------|----------|
| **DataForSEO** | Keywords, SERPs, backlinks | `DATAFORSEO_LOGIN`, `DATAFORSEO_PASSWORD` |
| **Google Search Console** | Queries, URL inspections | `GOOGLE_SEARCH_CONSOLE_PROPERTY`, `GOOGLE_APPLICATION_CREDENTIALS` |
| **Google Analytics** | Page traffic, conversions | `GOOGLE_ANALYTICS_PROPERTY_ID`, `GOOGLE_APPLICATION_CREDENTIALS` |
| **Chrome UX Report** | Field CWV metrics | `GOOGLE_CRUX_API_KEY` |
| **Apify** | Local SEO, mentions, observations | `APIFY_TOKEN` |
| **LLM (BYO-key)** | AI-generated recommendations | `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` / `GEMINI_API_KEY` |
| **Server Logs** | Bot activity, crawl budget | Local file import (Apache/Nginx/Cloudflare/W3C) |
| **Backlinks** | Link equity risk | CSV import (Ahrefs, Semrush, Moz, Majestic, DataForSEO) |

All connector commands default to dry run. Add `--execute` for paid API calls.

## Desktop GUI

8 tabs: Dashboard, Crawl, Reports, Enrich, Compare, Settings, Exclusions, Extractions.

- **Dashboard** — local evidence health score, top actions, snapshots, last-snapshot comparison, and AI visibility metrics
- **Crawl** — start/pause/resume with sortable data table, column picker, and an Open Web Workspace action after completion
- **Reports** — search, filter, preview, and export from the shared report registry with missing-data states
- **Enrich** — check enrichment readiness, copy CLI commands
- **Compare** — crawl-to-crawl regression review
- **Settings** — threads, user-agent, proxy, auth, PageSpeed, JS rendering mode
- **Extractions** — custom CSS/XPath selectors with 8 presets and live preview

The desktop workflow has headless smoke checks for tab/action wiring, report browsing, dashboard health, enrichment readiness, and the web workspace launcher, so CI can cover the GUI contract without a display server.

## SEO Coverage

Greenflare detects issues across: status codes, redirects (chains, loops, protocol downgrades), robots.txt, canonicals, meta robots, X-Robots-Tag, page titles, meta descriptions, headings (H1/H2/H3), hreflang, XML sitemaps, structured data (JSON-LD validation, Schema.org/ruleset version tracking, visible-content mismatches, stale schema dates, primary entity and sameAs summaries), images (alt text, dimensions, lazy loading, format), internal/external links, crawl depth, word count, content duplicates, PageSpeed/CWV, AI extractability, JS dependency risk, and more.

Each issue includes severity (critical/warning/notice), category, evidence, and a recommendation.
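
A severity gate over such issues can be sketched in a few lines. This illustrates the idea behind `--fail-on` rather than its implementation: `gate_exit_code` is hypothetical, and treating the threshold as "this severity or worse" is an assumption.

```python
SEVERITY_RANK = {"notice": 0, "warning": 1, "critical": 2}


def gate_exit_code(issues, fail_on="critical"):
    """Return 1 if any issue is at or above `fail_on`, otherwise 0."""
    threshold = SEVERITY_RANK[fail_on]
    return int(any(SEVERITY_RANK[issue["severity"]] >= threshold for issue in issues))


if __name__ == "__main__":
    issues = [
        {"severity": "warning", "category": "titles",
         "evidence": "Title is 92 characters", "recommendation": "Shorten the title"},
        {"severity": "notice", "category": "images",
         "evidence": "Missing width/height", "recommendation": "Add dimensions"},
    ]
    print(gate_exit_code(issues, fail_on="critical"))
    print(gate_exit_code(issues, fail_on="warning"))
```

A CI step would pass the returned value to `sys.exit()` so the pipeline fails when the gate trips.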

## Server Log Analysis

Auto-detects log format (Apache/Nginx combined, Cloudflare JSON, W3C/IIS). Classifies 28 bot patterns into 7 categories (search, AI, AI-search, AI-training, SEO tool, social, human). Cross-references with crawl data for budget waste, discovery gaps, and orphan page confirmation.

```bash
greenflare-cli enrich server-logs --log access.log --db audit.gflaredb
greenflare-cli report log-analysis --db audit.gflaredb
```
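
As a rough picture of the classification step, the sketch below parses one Apache/Nginx combined-format line and maps its user agent to a category. The regular expression and the five patterns are an illustrative subset chosen for this example; they are not the 28 patterns Greenflare ships, and `classify_line` is a hypothetical helper.

```python
import re

# Apache/Nginx "combined" log format
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative subset only (assumption): substring -> category
BOT_PATTERNS = [
    ("gptbot", "AI-training"),
    ("googlebot", "search"),
    ("bingbot", "search"),
    ("ahrefsbot", "SEO tool"),
    ("facebookexternalhit", "social"),
]


def classify_line(line):
    """Return path, status, and visitor category, or None if the line does not parse."""
    match = COMBINED.match(line)
    if match is None:
        return None
    agent = match["agent"].lower()
    # First matching pattern wins; anything unmatched is treated as human
    category = next((cat for needle, cat in BOT_PATTERNS if needle in agent), "human")
    return {"path": match["path"], "status": int(match["status"]), "category": category}


if __name__ == "__main__":
    line = (
        '66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /locations/berlin HTTP/1.1" 200 5120 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
    )
    print(classify_line(line))
```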

## Development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[cli,modern-ui,render,accessibility,gsc,cloud]"
python -m playwright install chromium
pytest tests/
```

The suite has 282 tests covering:

- crawl engines (async + sync) and plugin crawl hooks
- audit actions, persisted actions, issue explanations, and issue rollups
- pSEO template grouping and canonical, orphan, sitemap, and performance signals
- GEO/AEO module overview reporting and AEO query-group planning imports
- schema version, visible-content, and freshness validation
- entity graph primary/sameAs summaries
- enrichment-driven action impact signals and coverage actions
- web audit workspace routing and report previews
- desktop workspace launching, report browsing, dashboard health, comparisons, and smoke checks
- change detection, scheduler, log analysis, XPath extraction, and web UI events
- API persistence, cancellation, auth, and export endpoints
- first-run doctor/demo commands, render/accessibility imports, and export round-trips

## Architecture

```
greenflare/
  cli/              CLI package (13 lazy-loaded command modules)
  core/             Crawler (async + sync), DB, parser, renderer, log analyzer
  connectors/       6 external data connectors
  reports/          Report registry (40+ reports)
  plugins/          Plugin system (BasePlugin + entry point discovery)
  web/              Live dashboard (SSE) + REST API
  export/           Parquet, JSONL, CSV export/import
  widgets/          Desktop GUI (tkinter/ttkbootstrap)
```

## License

[GPLv3](LICENSE) - Fork of [beb7/gflare-tk](https://github.com/beb7/gflare-tk)

## Links

- [User Guide](docs/USER_GUIDE.md)
- [Product Gap Closeout Plan](docs/PRODUCT_GAP_CLOSEOUT_PLAN.md)
- [Roadmap Closeout Plan](docs/ROADMAP_CLOSEOUT_PLAN.md)
- [Issues](https://github.com/Grow-Online-Digital/gflareclone/issues)
- [Changelog](CHANGELOG.md)
