Metadata-Version: 2.4
Name: screamingfrog
Version: 0.2.4
Summary: Python library for working with Screaming Frog SEO Spider crawl data
Author: Antonio
License-Expression: MIT
Keywords: screaming-frog,seo,crawler,crawl,analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.5.0
Requires-Dist: jaydebeapi>=1.2.3
Requires-Dist: JPype1>=1.5.0
Requires-Dist: sf-config-builder>=0.1.6
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Provides-Extra: derby
Requires-Dist: jaydebeapi>=1.2.3; extra == "derby"
Requires-Dist: JPype1>=1.5.0; extra == "derby"
Provides-Extra: config
Requires-Dist: sf-config-builder>=0.1.6; extra == "config"
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.5.0; extra == "duckdb"
Provides-Extra: alpha
Requires-Dist: duckdb>=1.5.0; extra == "alpha"
Requires-Dist: jaydebeapi>=1.2.3; extra == "alpha"
Requires-Dist: JPype1>=1.5.0; extra == "alpha"
Requires-Dist: sf-config-builder>=0.1.6; extra == "alpha"

# screamingfrog

Python library for working with Screaming Frog SEO Spider crawl data programmatically.

Public alpha is focused on DB-backed crawl workflows:
- query `.dbseospider` crawls without manual exports
- access all `628` mapped export/report surfaces (`601` of them fully mapped; see the status below)
- run sitewide page and link queries, raw SQL, crawl diff, and chain analysis
- convert `.seospider` crawls into queryable DB-backed workflows
- use DuckDB as the default analysis engine, with Derby as the crawl source-of-truth

See `methods.md` for a complete method-level API reference.

## Public alpha status
- `601 / 628` tabs fully mapped
- `15,490 / 15,589` fields mapped
- current `main` passes `195` tests (`2` skipped live/optional tests)

## Known limitations
- Title and meta-description pixel-width filters are not implemented yet.
- Some hreflang edge cases still do not have exact Derby parity (`incorrect language-code` cases).
- `.seospider` conversion requires a local Screaming Frog CLI install.

## Quick start

```python
from screamingfrog import Crawl

crawl = Crawl.load("./exports")
for page in crawl.internal.filter(status_code=404):
    print(page.address)
```

## Loading crawl files

```python
from screamingfrog import Crawl, list_crawls

# CSV exports directory
crawl = Crawl.load("./exports")

# SQLite database
crawl = Crawl.load("./crawl.db")

# DuckDB analytics cache
crawl = Crawl.load("./crawl.duckdb")

# Derby .dbseospider file -> auto-promotes into a sibling DuckDB cache by default
crawl = Crawl.load("./crawl.dbseospider")

# Screaming Frog .seospider crawl (default: convert to DB + DuckDB-backed analysis)
crawl = Crawl.load("./crawl.seospider")

# Disable .dbseospider materialization (still uses Derby from ProjectInstanceData)
crawl = Crawl.load(
    "./crawl.seospider",
    materialize_dbseospider=False,
)

# Force CSV mode for .seospider (CLI export -> CSV backend)
crawl = Crawl.load(
    "./crawl.seospider",
    seospider_backend="csv",
    export_dir="./exports_from_seospider",
    export_tabs=["Internal:All", "External:All", "Response Codes:All"],
)

# Kitchen-sink export profile (all tabs/bulk exports from SF UI)
crawl = Crawl.load(
    "./crawl.seospider",
    seospider_backend="csv",
    export_dir="./exports_kitchen",
    export_profile="kitchen_sink",
)

# DB crawl ID (DB mode) loads DuckDB-backed analysis by default
crawl = Crawl.load("138edb21-61d0-41cd-9e9b-725b592a471c", source_type="db_id")

# DB crawl ID -> export and load a DuckDB analytics cache directly
crawl = Crawl.load(
    "138edb21-61d0-41cd-9e9b-725b592a471c",
    source_type="db_id",
    db_id_backend="duckdb",
    duckdb_path="./crawl.duckdb",
    duckdb_tabs="all",
)

# Discover available DB crawls, then load one by ID
latest = list_crawls()[0]
crawl = Crawl.load(latest.db_id, source_type="db_id")
```

### Loader notes
- `.dbseospider`, DB crawl IDs, and `.seospider` conversions default to DuckDB-backed analysis.
- Use `dbseospider_backend="derby"` / `db_id_backend="derby"` / `seospider_backend="derby"` to stay on Derby.
- `.seospider` defaults to DB conversion (CLI load + Derby source, DuckDB analysis). Use `seospider_backend="csv"` for exports.
- `.seospider` auto-materializes a `.dbseospider` file next to the crawl (an existing file is overwritten by default).
- Set `materialize_dbseospider=False` to avoid creating the `.dbseospider` cache file.
- Set `dbseospider_overwrite=False` to reuse an existing `.dbseospider` cache.
- DB conversion can temporarily set `storage.mode=DB` in `spider.config` (set `ensure_db_mode=False` to skip).
- Internal DB crawl directories (e.g. `ProjectInstanceData/.../results_.../sql`) load via Derby.
- DB crawl IDs can force CSV exports with `db_id_backend="csv"`.
- DuckDB cache refresh defaults to `duckdb_if_exists="auto"` and rebuilds only when the Derby source changed.
- Set `SCREAMINGFROG_CLI` if the CLI executable is not in a standard install path.
- CLI exports default to the `Internal:All` tab unless `export_tabs` is provided.
- `export_profile="kitchen_sink"` uses bundled export lists captured from the SF UI.
- Derby loads can auto-fallback to CSV exports for missing columns or GUI filters (`csv_fallback=True`, `csv_fallback_profile="kitchen_sink"`).
- The CSV fallback cache is written next to the crawl by default (set `csv_fallback_cache_dir` to change the location); set `csv_fallback=False` to disable the fallback.
- `.duckdb` loads use the DuckDB analytics backend directly.

## DuckDB analytics cache

DuckDB is the default analysis layer for DB-backed crawl workflows. Derby remains the crawl source-of-truth:

```python
from screamingfrog import Crawl

derby_crawl = Crawl.load("./crawl.dbseospider", dbseospider_backend="derby", csv_fallback=False)
derby_crawl.export_duckdb("./crawl.duckdb", if_exists="auto")

fast = Crawl.load("./crawl.duckdb")

# one DuckDB file can also hold multiple crawls under separate namespaces
other_crawl = Crawl.load("./other-crawl.dbseospider")  # placeholder path for a second crawl
derby_crawl.export_duckdb("./portfolio.duckdb", namespace="client-a", if_exists="auto")
other_crawl.export_duckdb("./portfolio.duckdb", namespace="client-b", if_exists="auto")

namespaces = Crawl.duckdb_namespaces("./portfolio.duckdb")
client_a = Crawl.from_duckdb("./portfolio.duckdb", namespace="client-a")

pages_404 = fast.pages().filter(status_code=404).collect()
lightweight = fast.pages().select("Address", "Status Code", "Title 1").collect()
broken_inlinks = fast.links("in").select("Source", "Address", "Status Code").filter(status_code=404).collect()
matching_pages = fast.search("canonical", fields=["Address", "Title 1"]).collect()
links = fast.links("in").filter(status_code=404).collect()
rows = (
    fast.query("APP", "URLS")
    .select("ENCODED_URL", "RESPONSE_CODE")
    .where("RESPONSE_CODE >= ?", 400)
    .collect()
)

# if you want the old wide export (raw APP tables + default tabs), opt in explicitly
derby_crawl.export_duckdb("./crawl-full.duckdb", profile="full", if_exists="replace")
```

Notes:
- Derby remains the source-of-truth crawl store.
- DuckDB is the default analysis engine for DB-backed workflows.
- `crawl.export_duckdb()` now defaults to a portable helper cache instead of a full raw mirror, so exported `.duckdb` files open much faster and still support the main page/link/diff/report workflows.
- Use `profile="full"` when you explicitly want raw `APP.*` tables plus the default materialized tabs inside the exported `.duckdb`.
- Default DB-backed loads now create a tiny sidecar DuckDB cache first, keep Derby prewarmed as the lazy source backend, and only materialize heavier relations if you actually ask for them.
- DuckDB caches can now store multiple crawls in one `.duckdb` file via namespaces; pass `namespace=...` on export and `Crawl.from_duckdb(..., namespace=...)` on load.
- Repeated DB-backed loads in the same Python process now reuse the cached Derby source backend for the same crawl fingerprint, so reopening the same crawl avoids paying Derby startup again.
- High-level page workflows (`crawl.pages()`, page counts, page iteration) now read from the internal model directly instead of forcing `internal_all` tab materialization on a cold cache.
- `crawl.pages().select(...)` now projects narrow page field subsets through a shared `internal_common` helper relation or the prewarmed Derby source backend, so lightweight page workflows avoid wide `internal_all` materialization too.
- `crawl.links(...).select(...)` now does the same against the shared `links_core` helper relation, so lightweight sitewide link queries avoid materializing `all_inlinks` / `all_outlinks` tabs on cold caches.
- Cold-cache projected page/link reads now prefer one-shot source-backed projections before writing helper relations into DuckDB, so first-use lightweight workflows stay closer to direct-query cost instead of paying a cache-write penalty up front.
- `compare()` now uses the same source-backed projection path for its wider internal field set, so cold-cache crawl diffs no longer fall back to full `crawl.internal` scans.
- Cold-cache graph workflows (`broken_links_report`, `broken_inlinks_report`, `nofollow_inlinks_report`) can execute directly from the prewarmed Derby source, so they return without first exporting wide `all_inlinks` tables into DuckDB.
- Generic `crawl.tab(...)` / `crawl.tab_columns(...)` calls also fall back to the prewarmed source backend when a tab is not cached yet, so first-use tab access no longer forces a DuckDB export round-trip.
- When DuckDB does need cached subsets, it now materializes narrow helper relations instead of forcing full `internal_all` / `all_inlinks` exports.
- `compare()` now uses a DuckDB-first projection path too, so crawl diffs only pull the internal fields required for diffing instead of full `internal_all` rows.
- `title_meta_audit()` runs DuckDB-first when `internal_all` is already cached, and otherwise falls back to the same high-level internal model.
- DuckDB `inlinks(url)` / `outlinks(url)` fall back to the source backend or narrow cached link relations, so they still work on lean caches without `all_inlinks` / `all_outlinks`.
- Issue-family helpers read DuckDB issue relations directly when they exist in the cache.
- Chain helpers now fall back to raw DuckDB traversal too, so redirect/canonical chain methods no longer require materialized chain tabs on lean caches.
- `summary()` keeps the core crawl counts fast on cold caches; issue-family and chain counts are `None` until those tab families are materialized.
- You can also export directly from a DB crawl id with `export_duckdb_from_db_id(...)`.
- `.dbseospider`, `.seospider`, and DB crawl ID loaders can all auto-promote to DuckDB.
- Use `tabs="all"` if you want to materialize every currently available mapped tab into the DuckDB cache.

## Search and scoped workflows

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

matching_pages = crawl.search("blog", fields=["Address", "Title 1"]).collect()
projected = crawl.pages().select("Address", "Status Code", "Title 1").filter(status_code=404).collect()
nofollow_links = crawl.links("in").search("nofollow", fields=["Follow"]).collect()
blog_inlinks = crawl.section("/blog").tab("all_inlinks").collect()
orphans = crawl.orphan_pages_report(only_indexable=True)
broken_inlinks = crawl.broken_inlinks_report()
security_issues = crawl.security_issues_report()
canonical_issues = crawl.canonical_issues_report()
hreflang_issues = crawl.hreflang_issues_report()
redirect_issues = crawl.redirect_issues_report()
summary = crawl.summary()
```

### Discover DB crawls (`list_crawls`)

Use `list_crawls()` to enumerate DB-mode crawls in your local Screaming Frog
`ProjectInstanceData` directory, without opening Derby or starting Java.

```python
from screamingfrog import list_crawls

for info in list_crawls():
    print(info.db_id, info.url, info.urls_crawled, info.modified)
```

`list_crawls(project_root=...)` returns `CrawlInfo` objects with:
- `db_id`: crawl UUID folder name
- `url`: crawl start URL
- `urls_crawled`: number of crawled URLs
- `percent_complete`: crawl completion percentage
- `modified`: last modified timestamp (UTC)
- `path`: absolute path to the crawl folder
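
The documented fields are enough to pick a crawl programmatically. The sketch below uses a stand-in dataclass (`CrawlInfoStub`, hypothetical, mirroring only the fields listed above) rather than the library's own class. It assumes `percent_complete` is on a 0-100 scale and `modified` is a timezone-aware `datetime`; neither is confirmed here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class CrawlInfoStub:
    """Hypothetical stand-in mirroring the documented CrawlInfo fields."""
    db_id: str
    url: str
    urls_crawled: int
    percent_complete: float  # assumed to be on a 0-100 scale
    modified: datetime       # assumed to be timezone-aware (UTC)
    path: str


def latest_complete(infos: Iterable[CrawlInfoStub], start_url: str) -> Optional[CrawlInfoStub]:
    """Return the most recently modified, fully complete crawl for start_url."""
    candidates = [
        info for info in infos
        if info.url == start_url and info.percent_complete >= 100
    ]
    # max() with a default avoids raising on an empty candidate list
    return max(candidates, key=lambda info: info.modified, default=None)


infos = [
    CrawlInfoStub("a", "https://example.com/", 120, 100.0,
                  datetime(2024, 1, 5, tzinfo=timezone.utc), "/tmp/a"),
    CrawlInfoStub("b", "https://example.com/", 150, 100.0,
                  datetime(2024, 2, 5, tzinfo=timezone.utc), "/tmp/b"),
    CrawlInfoStub("c", "https://example.com/", 40, 35.0,
                  datetime(2024, 3, 5, tzinfo=timezone.utc), "/tmp/c"),
]
best = latest_complete(infos, "https://example.com/")
print(best.db_id if best else None)
```

With the real library, the same selection could be applied to the objects returned by `list_crawls()`, and the chosen `db_id` passed to `Crawl.load(..., source_type="db_id")`.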

## Generic tab access

In addition to the typed `internal` view, you can iterate any exported tab:

```python
from screamingfrog import Crawl

crawl = Crawl.load("./exports")

# List available CSV tabs
print(crawl.tabs)

# Access a tab by file name (extension optional)
for row in crawl.tab("response_codes_all"):
    print(row["Address"], row["Status Code"])

# Filter using column names or snake_case equivalents
for row in crawl.tab("internal_all").filter(status_code="404"):
    print(row["Address"])

# Apply GUI filters (when supported)
for row in crawl.tab("page_titles").filter(gui="Missing"):
    print(row["Address"], row["Title 1"])
```

Notes:
- CSV backend exposes any `*.csv` in the export folder.
- Derby backend exposes tabs mapped in `schemas/mapping.json` (or `SCREAMINGFROG_MAPPING`).
- Hybrid Derby+CSV fallback is enabled by default for `Crawl.load` and will export missing tabs on demand.
- SQLite backend supports only a small set of high-value tabs (response codes, titles, meta description, internal_all).
- For exact GUI filter behavior, use CSV exports (e.g., `export_profile="kitchen_sink"`).
- Derby now natively supports `Response Codes > Internal Redirect Chain` and `Hreflang > Not Using Canonical`.
- HTTP canonical/rel fields in Derby are parsed from `HTTP_RESPONSE_HEADER_COLLECTION` when present.
- Derby-backed `crawl.internal` now materializes computed mapped fields like `Indexability` and `Indexability Status`.
- Derby filters now work against mapped expression fields and header-derived fields in both `crawl.internal` and `crawl.tab(...)`.
- Some link metrics (Link Score, % of Total, JS outlink counts) are not mapped in Derby yet.
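
The "column names or snake_case equivalents" rule in the filter example above can be pictured as a header normalization step. This is a minimal sketch, not the library's implementation: `to_snake_case` and `resolve_column` are hypothetical helpers, and the exact normalization the package applies is an assumption.

```python
import re
from typing import Iterable, Optional


def to_snake_case(header: str) -> str:
    """Lowercase a column header and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", header.lower()).strip("_")


def resolve_column(name: str, headers: Iterable[str]) -> Optional[str]:
    """Map a filter keyword (exact header or snake_case form) to a real header."""
    wanted = to_snake_case(name)
    for header in headers:
        if header == name or to_snake_case(header) == wanted:
            return header
    return None


headers = ["Address", "Status Code", "Title 1", "Meta Description 1"]
print(resolve_column("status_code", headers))
print(resolve_column("Title 1", headers))
```

Under these assumptions, `filter(status_code=...)` and a filter on the `Status Code` header would address the same column.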

## Ergonomic sitewide views

Use first-class page/link views when you do not want to remember tab names:

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

pages_404 = crawl.pages().filter(status_code=404).collect()
nofollow_inlinks = crawl.links("in").filter(rel="nofollow").collect()
blog_pages = crawl.section("/blog").pages().collect()
blog_outlinks = crawl.section("/blog").links("out").collect()
```

Notes:
- `crawl.pages()` is a mapped sitewide page view backed by the internal page model, with DuckDB/source-backed fast paths for counts and iteration.
- `crawl.links("in")` / `crawl.links("out")` are sitewide mapped link views backed by cached link tabs when available and by the source backend when the cache is still lean.
- `crawl.section("/blog")` matches by URL path prefix; pass a full URL prefix if you want host-specific scoping.
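
To make the difference between a path prefix and a full URL prefix concrete, here is a small stand-alone sketch. `in_section` is a hypothetical helper written for illustration; how the library matches prefixes internally (for example, whether it respects path segment boundaries) is not documented here.

```python
from urllib.parse import urlsplit


def in_section(url: str, prefix: str) -> bool:
    """Return True if url falls under prefix.

    A prefix starting with '/' is compared against the URL path only;
    anything else is treated as a full URL prefix (host-specific).
    """
    if prefix.startswith("/"):
        return urlsplit(url).path.startswith(prefix)
    return url.startswith(prefix)


urls = [
    "https://example.com/blog/post-1",
    "https://shop.example.com/blog/sale",
    "https://example.com/about",
]
print([u for u in urls if in_section(u, "/blog")])
print([u for u in urls if in_section(u, "https://example.com/blog")])
```

A path prefix matches the `/blog` pages on every host, while the full URL prefix keeps only those on `example.com`. Note that this plain string comparison would also match a path such as `/blogging`.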

## Inlinks / Outlinks (Derby)

When using a `.dbseospider` crawl, you can read inlinks/outlinks directly from Derby:

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

for link in crawl.inlinks("https://example.com/page"):
    if link.data.get("NoFollow"):
        print(link.source, "->", link.destination, link.data.get("Rel"))
```

## Chain helpers (redirect/canonical)

Dedicated chain helpers are available on `Crawl`:

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

# Redirect chains with 3+ hops and no loop
for row in crawl.redirect_chains(min_hops=3, loop=False):
    print(row["Address"], row.get("Number of Redirects"))

# Canonical chains
for row in crawl.canonical_chains(min_hops=2):
    print(row["Address"], row.get("Number of Canonicals"))

# Mixed redirect+canonical chains
for row in crawl.redirect_and_canonical_chains(min_hops=4):
    print(row["Address"], row.get("Number of Redirects/Canonicals"))
```
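
For readers unfamiliar with the terms, the sketch below shows what "hops" and "loop" mean for a redirect chain, using a plain `{source: target}` dictionary. It is a generic illustration with a hypothetical `follow_chain` helper, not how the library computes `min_hops` or `loop`.

```python
from typing import Dict, List, Tuple


def follow_chain(start: str, redirects: Dict[str, str]) -> Tuple[List[str], bool]:
    """Follow start through a {source: target} redirect map.

    Returns (chain, is_loop) where chain lists every URL visited in order.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        if current in seen:
            # Revisiting a URL means the chain never terminates
            chain.append(current)
            return chain, True
        chain.append(current)
        seen.add(current)
    return chain, False


redirects = {
    "/a": "/b", "/b": "/c", "/c": "/d",   # three hops, ends at /d
    "/x": "/y", "/y": "/x",               # two-URL loop
}
for start in ("/a", "/x"):
    chain, is_loop = follow_chain(start, redirects)
    hops = len(chain) - 1
    print(start, hops, is_loop, " -> ".join(chain))
```

Here `/a` reaches `/d` after three hops without looping, while `/x` returns to itself and is flagged as a loop.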

## Audit helpers

Thin report helpers are available for common workflows:

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

broken = crawl.broken_links_report()
title_meta = crawl.title_meta_audit()
non_indexable = crawl.indexability_audit()
chains = crawl.redirect_chain_report(min_hops=3)
```

Notes:
- `broken_links_report()` returns broken internal URLs with inlink counts and sampled inlink sources when available.
- `title_meta_audit()` currently surfaces missing titles and missing meta descriptions as flat issue rows.
- `indexability_audit()` returns non-indexable pages with the key indexability fields that explain why.
- `redirect_chain_report()` is a collected helper over `crawl.redirect_chains(...)`.

## Escape hatches (raw SQL)

Mapped fields are stable and documented. Raw access is available for advanced users
who want immediate access to Derby/SQLite columns even when mappings are incomplete.

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider", csv_fallback=False)

# Raw table rows (Derby/SQLite only)
for row in crawl.raw("APP.URLS"):
    print(row["ENCODED_URL"], row["RESPONSE_CODE"])

# SQL passthrough (Derby/SQLite only)
for row in crawl.sql(
    "SELECT ENCODED_URL, RESPONSE_CODE FROM APP.URLS WHERE RESPONSE_CODE >= ?",
    [400],
):
    print(row)
```

Notes:
- `raw()` / `sql()` are not supported for CSV/CLI export backends.
- Raw column names may vary by backend and Screaming Frog version.
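
The `?` placeholders in the `crawl.sql(...)` call above are positional query parameters. As a self-contained illustration of the same pattern, the sketch below uses Python's built-in `sqlite3` module with a made-up in-memory `URLS` table; it does not touch a real crawl database, and the two-column schema is an assumption for the demo.

```python
import sqlite3

# In-memory stand-in table; real crawl schemas are larger and version-dependent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE URLS (ENCODED_URL TEXT, RESPONSE_CODE INTEGER)")
conn.executemany(
    "INSERT INTO URLS VALUES (?, ?)",
    [
        ("https://example.com/", 200),
        ("https://example.com/missing", 404),
        ("https://example.com/error", 500),
    ],
)

# Values are bound separately from the SQL text, so no string formatting is needed.
rows = conn.execute(
    "SELECT ENCODED_URL, RESPONSE_CODE FROM URLS "
    "WHERE RESPONSE_CODE >= ? ORDER BY RESPONSE_CODE",
    [400],
).fetchall()
conn.close()

for url, code in rows:
    print(code, url)
```

Binding values this way avoids quoting mistakes and SQL injection when the threshold comes from user input.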

## Query builder (chainable SQL)

Use a chainable API for common SQL without writing full query strings:

```python
from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider", csv_fallback=False)

rows = (
    crawl.query("APP", "URLS")
    .select("ENCODED_URL", "RESPONSE_CODE", "TITLE_1")
    .where("RESPONSE_CODE >= ?", 400)
    .order_by("RESPONSE_CODE DESC", "ENCODED_URL ASC")
    .limit(100)
    .collect()
)
```

Notes:
- `crawl.query(...)` uses the backend SQL engine (Derby/Hybrid/SQLite).
- CSV/CLI export backends do not support SQL/query execution.
- Use `.to_sql()` if you want to inspect the generated SQL + params.
- `InternalView`, `TabView`, `LinkView`, `QueryView`, and `CrawlDiff` also support `to_pandas()` / `to_polars()` with optional dependencies installed.

## Crawl diff (crawl-over-crawl)

```python
from screamingfrog import Crawl

old = Crawl.load("./crawl-2024-01.dbseospider")
new = Crawl.load("./crawl-2024-02.dbseospider")

diff = new.compare(old)

print(diff.summary())

for change in diff.status_changes[:5]:
    print(change.url, change.old_status, "->", change.new_status)
```

Notes:
- Title comparison uses `Title 1` by default (override via `compare(..., title_fields=...)`).
- Redirect changes are best-effort and depend on available columns/headers.
- Additional field changes are captured for canonical + canonical status, meta description/keywords/refresh, H1/H2/H3, word count, indexability, and robots + directives summary by default (override via `compare(..., field_groups=...)`).
- `diff.to_rows()` flattens all change buckets into one row list for export/dataframes.
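
Conceptually, a crawl-over-crawl diff compares two URL-keyed snapshots. The sketch below shows that idea with plain dictionaries. `StatusChange` is a hypothetical stand-in that reuses the attribute names from the example above; it is not the library's class, and the real diff covers many more fields.

```python
from typing import Dict, List, NamedTuple


class StatusChange(NamedTuple):
    url: str
    old_status: int
    new_status: int


def status_changes(old: Dict[str, int], new: Dict[str, int]) -> List[StatusChange]:
    """List URLs present in both crawls whose status code differs."""
    return [
        StatusChange(url, old[url], new[url])
        for url in sorted(old.keys() & new.keys())
        if old[url] != new[url]
    ]


old = {"/": 200, "/pricing": 200, "/old-page": 301}
new = {"/": 200, "/pricing": 404, "/new-page": 200}

added = sorted(new.keys() - old.keys())
removed = sorted(old.keys() - new.keys())
changes = status_changes(old, new)

print(added, removed)
for change in changes:
    print(change.url, change.old_status, "->", change.new_status)
```

Set operations on the dictionary keys give added and removed URLs directly, and only URLs present in both snapshots are checked for changes.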

## Examples

Ready-to-run scripts are available in `examples/`:
- `examples/broken_links_report.py`
- `examples/title_meta_audit.py`
- `examples/crawl_diff.py`

## Tab metadata helpers

```python
from screamingfrog import Crawl

crawl = Crawl.load("./exports")

# List GUI filter names for a tab
print(crawl.tab_filters("Page Titles"))

# Inspect columns (CSV header or Derby mapping)
print(crawl.tab_columns("page_titles"))

# Get both in one shot
print(crawl.describe_tab("page_titles"))
```

## Export profiles

You can access the bundled kitchen-sink export lists directly:

```python
from screamingfrog.config import get_export_profile

profile = get_export_profile("kitchen_sink")
print(len(profile.export_tabs), len(profile.bulk_exports))
```

## CLI wrapper (start crawls + exports)

The package includes Python wrappers around the Screaming Frog CLI:

```python
from screamingfrog import export_crawl, start_crawl

# Start a crawl from a URL
start_crawl(
    "https://example.com",
    "./out",
    save_crawl=True,
    export_tabs=["Internal:All", "Response Codes:All"],
)

# Export from an existing crawl file (.seospider / .dbseospider)
export_crawl(
    "./crawl.seospider",
    "./exports",
    export_tabs=["Internal:All", "Page Titles:Missing"],
)
```

## Packaging .dbseospider files

`.dbseospider` files are zip archives of a DB-mode crawl folder. You can pack or
unpack them with helpers:

```python
from screamingfrog import (
    export_dbseospider_from_seospider,
    pack_dbseospider,
    pack_dbseospider_from_db_id,
    unpack_dbseospider,
)

# Package an internal DB crawl folder
dbseospider = pack_dbseospider(
    r"C:\Users\Antonio\.ScreamingFrogSEOSpider\ProjectInstanceData\<project_id>",
    r"C:\Users\Antonio\my-crawl.dbseospider",
)

# Package by DB crawl ID
dbseospider = pack_dbseospider_from_db_id(
    "7c356a1b-ea14-40f3-b504-36c3046432a2",
    r"C:\Users\Antonio\my-crawl.dbseospider",
)

# Convert a .seospider crawl into .dbseospider
dbseospider = export_dbseospider_from_seospider(
    r"C:\Users\Antonio\schema-discovery\actionnetwork_crawl\crawl.seospider",
    r"C:\Users\Antonio\actionnetwork.dbseospider",
)

# Extract a .dbseospider file
unpack_dbseospider(
    r"C:\Users\Antonio\my-crawl.dbseospider",
    r"C:\Users\Antonio\unpacked_crawl",
)
```

Notes:
- `export_dbseospider_from_seospider` runs the Screaming Frog CLI, then packages
  the newly created DB crawl folder. If your DB storage path is custom, set
  `SCREAMINGFROG_PROJECT_DIR` or pass `project_root=...`.
- The helper can force `storage.mode=DB` via `spider.config` (set `ensure_db_mode=False` to skip).
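
Because `.dbseospider` files are zip archives, the standard library can inspect one without the packaging helpers. The following is a minimal sketch: `list_archive` is a hypothetical helper, and the demo archive and its member names are made up, since the real folder layout is not described here.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def list_archive(path: Path, limit: int = 10) -> List[str]:
    """Return up to `limit` member names from a zip-format archive."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a zip archive")
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()[:limit]


# Demo on a throwaway archive; the member names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.dbseospider"
    with zipfile.ZipFile(demo, "w") as archive:
        archive.writestr("placeholder/a.txt", "a")
        archive.writestr("placeholder/b.txt", "b")
    members = list_archive(demo)

print(members)
```

Pointing `list_archive` at a real `.dbseospider` file is a quick way to confirm that an archive is intact before loading it.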

## Config patches (Custom Search + Custom JavaScript)

Use `ConfigPatches` to build patch JSON for the Java ConfigBuilder:

```python
from screamingfrog import ConfigPatches, CustomSearch, CustomJavaScript

patches = ConfigPatches()
patches.set("mCrawlConfig.mRenderingMode", "JAVASCRIPT")
patches.add_custom_search(CustomSearch(name="Filter 1", query=".*", data_type="REGEX"))
patches.add_custom_javascript(
    CustomJavaScript(name="Extractor 1", javascript="return document.title;")
)

patch_json = patches.to_json()
```

Apply patches directly to a `.seospiderconfig` file:

```python
from screamingfrog import ConfigPatches, write_seospider_config

patches = ConfigPatches().set("mCrawlConfig.mMaxUrls", 5000)

write_seospider_config(
    "base.seospiderconfig",
    "alpha.seospiderconfig",
    patches,
)
```

## Installation

Recommended install from PyPI:

```bash
python -m pip install screamingfrog
```

If you want the latest unreleased `main` branch instead:

```bash
python -m pip install "git+https://github.com/Amaculus/screaming-frog-api.git@main"
```

For local development from a clone:

```bash
python -m pip install -e ".[dev]"
```

Derby support (`.dbseospider`), DuckDB export, and `.seospiderconfig` writing are included in the base install. Optional extras still exist (`[derby]`, `[config]`, `[duckdb]`, `[alpha]`) but are not required for a standard install.

Derby jars (Apache Derby 10.17.1.0) are bundled with this package, so
`DERBY_JAR` is optional. Set `DERBY_JAR` only to override the bundled jars
or to use a different Derby install.

### Java runtime setup (for .dbseospider)

The Derby driver jars are bundled, but you still need a Java runtime (`java.exe` / `java`) available.

If Java is missing, Derby loads raise:

`RuntimeError: Java runtime not found. Set JAVA_HOME or add java to PATH.`

Quick checks and fixes:

```bash
java -version
```

- If Screaming Frog desktop is installed, this library already tries these paths automatically:
  - `C:\Program Files (x86)\Screaming Frog SEO Spider\jre`
  - `C:\Program Files\Screaming Frog SEO Spider\jre`
- Otherwise install a JRE/JDK and set `JAVA_HOME` (or add Java to `PATH`).
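
To check from Python whether a Java runtime is discoverable before loading a Derby-backed crawl, a lookup along these lines can help. This is a sketch under assumptions: `find_java` is a hypothetical helper, and the order shown (`JAVA_HOME` first, then `PATH`) is not necessarily the order the library uses.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Mapping, Optional


def find_java(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a path to a java executable, or None if none is found.

    Checks JAVA_HOME/bin first, then falls back to a PATH lookup.
    """
    java_home = env.get("JAVA_HOME")
    if java_home:
        for name in ("java.exe", "java"):
            candidate = Path(java_home) / "bin" / name
            if candidate.is_file():
                return str(candidate)
    return which("java")


java = find_java()
print(java or "Java runtime not found; set JAVA_HOME or add java to PATH")
```

The `env` and `which` parameters exist only so the lookup can be exercised without changing the real environment.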

Windows PowerShell example:

```powershell
$env:JAVA_HOME = "C:\Program Files\Java\jdk-21"
$env:Path = "$env:JAVA_HOME\bin;$env:Path"
```

Third-party notices for Apache Derby are included in `screamingfrog/vendor/derby/NOTICE`.

Derby tab mapping uses `schemas/mapping.json`. Set `SCREAMINGFROG_MAPPING` if
you store the mapping elsewhere.

## Contributing: tab/column mapping

To help map more GUI tabs to Derby (see [Antonio's LinkedIn](https://www.linkedin.com/in/antoniomaculus/) for progress):

- **Source of truth:** `schemas/mapping.json` (keys = normalized export filenames, e.g. `internal_all.csv`).
- **Workflow:** Compare CSV schema in `schemas/csv/` with Derby schema in `schemas/db/tables/`; prefer `db_column` -> `db_expression` -> `header_extract` / `blob_extract` / `derived_extract` / `multi_row_extract` -> `NULL`; then add/update tests.
- **Automation:** Run from repo root:
  ```bash
  python scripts/suggest_mappings.py --tab hreflang_all.csv   # suggestions for one tab
  python scripts/suggest_mappings.py --tab-family hreflang   # all hreflang_* tabs
  python scripts/suggest_mappings.py --list-unmapped          # tabs with unmapped columns
  python scripts/suggest_mappings.py --patch --tab my_tab    # JSON fragment to merge into mapping.json
  python scripts/suggest_mappings.py --report-nulls          # regenerate mapping_nulls.md content
  ```
- **PRs:** Prefer PRs to `schemas/mapping.json` for new column coverage; for repeated Derby SQL incompatibilities, fix in `screamingfrog/backends/derby_backend.py`; for GUI filter parity, use `screamingfrog/filters/*.py`. See `scripts/README.md`, `schemas/mapping_nulls.md`, `schemas/inlinks_mapping_nulls.md`, and `MAPPING_BACKLOG.md` for current backlog and known hard families.
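
The preference order in the workflow note (`db_column` first, `NULL` last) can be read as a simple first-match rule. The sketch below illustrates that rule only. The entry layout is hypothetical and is not the real `schemas/mapping.json` schema, `pick_strategy` is an invented helper, and the relative order among the four extract kinds is an arbitrary choice made here.

```python
from typing import Any, Dict, Optional, Tuple

# Preference order from the workflow note above; the order among the
# four extract kinds is arbitrary in this sketch.
PRECEDENCE = (
    "db_column",
    "db_expression",
    "header_extract",
    "blob_extract",
    "derived_extract",
    "multi_row_extract",
)


def pick_strategy(entry: Dict[str, Any]) -> Tuple[str, Optional[Any]]:
    """Return the highest-preference strategy present in a mapping entry.

    Falls back to ("NULL", None) when no strategy key is set.
    The entry layout is hypothetical, not the real mapping.json schema.
    """
    for key in PRECEDENCE:
        if entry.get(key) is not None:
            return key, entry[key]
    return "NULL", None


print(pick_strategy({"db_column": "RESPONSE_CODE", "db_expression": "X + 1"}))
print(pick_strategy({"header_extract": "Link"}))
print(pick_strategy({}))
```

When proposing a mapping, the same ordering applies by hand: reach for a direct column first and use `NULL` only when nothing else can produce the value.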

## Development

```bash
python -m pip install -e ".[dev]"
pytest
```

Optional live smoke coverage for a real local SF crawl:

```bash
SCREAMINGFROG_RUN_LIVE_SMOKE=1 pytest -q tests/test_live_smoke.py -rs --basetemp .pytest-tmp
```
