Metadata-Version: 2.4
Name: django-icv-sitemaps
Version: 0.4.1
Summary: Scalable sitemap generation and web discovery infrastructure for Django — background XML sitemaps (standard/image/video/news), robots.txt, llms.txt, ads.txt, security.txt, humans.txt.
Author: Nigel Copley
License-Expression: MIT
Project-URL: Homepage, https://github.com/nigelcopley/icv-oss
Project-URL: Documentation, https://github.com/nigelcopley/icv-oss/tree/main/packages/icv-sitemaps
Project-URL: Changelog, https://github.com/nigelcopley/icv-oss/tree/main/packages/icv-sitemaps/CHANGELOG.md
Project-URL: Issue Tracker, https://github.com/nigelcopley/icv-oss/issues
Project-URL: Source Code, https://github.com/nigelcopley/icv-oss/tree/main/packages/icv-sitemaps
Keywords: django,sitemap,robots-txt,seo,discovery,llms-txt,ads-txt
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Django
Classifier: Framework :: Django :: 5.0
Classifier: Framework :: Django :: 5.1
Classifier: Framework :: Django :: 6.0
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Django>=5.0
Provides-Extra: celery
Requires-Dist: celery>=5.3; extra == "celery"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-django>=4.8; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: factory-boy>=3.3; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: django-stubs>=5.0; extra == "dev"
Dynamic: license-file

# django-icv-sitemaps

[![CI](https://github.com/nigelcopley/icv-oss/actions/workflows/ci.yml/badge.svg)](https://github.com/nigelcopley/icv-oss/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/django-icv-sitemaps)](https://pypi.org/project/django-icv-sitemaps/)
[![Python](https://img.shields.io/pypi/pyversions/django-icv-sitemaps)](https://pypi.org/project/django-icv-sitemaps/)
[![Django](https://img.shields.io/badge/django-5.1%2B-0C4B33)](https://www.djangoproject.com/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

Django's built-in `django.contrib.sitemaps` loads every URL into memory at
request time. On a site with tens of thousands of pages that means slow
responses, high memory pressure, and no incremental updates when content
changes. At a million URLs it simply does not work.

`django-icv-sitemaps` replaces that approach entirely. Sitemaps are built in
the background by Celery tasks, written atomically to any Django storage
backend (local, S3, GCS), and served as static files. Only sections whose
content has changed are ever rebuilt. The full protocol is covered: standard,
image, video, and news sitemaps, automatic file splitting, gzip compression,
and search engine pinging — plus a complete set of web discovery files
(`robots.txt`, `llms.txt`, `ads.txt`, `security.txt`, `humans.txt`) managed
from the database.

Part of the [ICV-Django](https://github.com/nigelcopley/icv-oss) ecosystem,
but **fully standalone** — no other ICV packages required.

---

## Features

- **Background generation** — sitemaps are generated by Celery tasks (optional),
  written to Django storage backends (local, S3, GCS), and served statically
- **Incremental updates** — `post_save`/`post_delete` signals mark affected
  sections as stale; only changed sections are regenerated
- **All four sitemap types** — standard, image, video, and news sitemaps with
  correct XML namespaces per the sitemap protocol
- **Automatic splitting** — files are split at 50,000 URLs or 50 MB per the
  protocol limits
- **SitemapMixin** — declare any Django model as sitemap-includable with a small
  set of class attributes
- **Auto-sections** — `ICV_SITEMAPS_AUTO_SECTIONS` wires signal handlers
  automatically, in the same style as the ICV ecosystem's `ICV_SEARCH_AUTO_INDEX`
- **robots.txt** — dynamic, database-driven rules merged with settings; includes
  `Sitemap:` directive automatically
- **llms.txt** — AI crawler guidance served at `/llms.txt`
- **ads.txt / app-ads.txt** — IAB-format authorised seller declarations
- **security.txt** — RFC 9116 compliant, served at `/.well-known/security.txt`
- **humans.txt** — team credits
- **URL redirects** — database-driven redirect rules (301/302/307/308/410) with
  exact, prefix, and regex matching, priority ordering, expiry, hit tracking,
  and CSV import/export
- **404 tracking** — automatic detection of recurring 404s with hit counts and
  referrer tracking; create redirect rules directly from admin
- **RedirectMiddleware** — opt-in middleware evaluates redirect rules before
  Django's URL resolver; fail-open design never breaks the request cycle
- **Search engine ping** — Google, Bing, Yandex notified on content changes
  (conditional on checksum comparison)
- **Multi-tenancy** — all discovery files are tenant-scoped; sitemap paths
  include tenant prefix to prevent collisions; tenant IDs are sanitised to
  prevent path-traversal attacks
- **Gzip support** — compressed `.xml.gz` output with correct headers
- **Atomic writes** — temp file then rename; no partially-written files served
- **6 management commands** — `setup`, `generate`, `ping`, `validate`, `stats`,
  `redirects`
- **Django admin** — all 8 models registered with actions, list filters, and
  read-only views
- **Celery graceful degradation** — tasks work synchronously when Celery is not
  installed
- **Testing utilities** — 8 factory-boy factories, pytest fixtures, and helpers
  in `icv_sitemaps.testing`
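
The gzip and atomic-write features can be pictured with a short standard-library sketch. This is not the package's implementation, which writes through Django storage backends; `write_atomic_gzip` is a hypothetical helper that assumes a local filesystem.

```python
import gzip
import os
import tempfile


def write_atomic_gzip(path: str, xml: str) -> None:
    """Write gzip-compressed XML to `path` without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(xml.encode("utf-8"))
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the rename happens only after the compressed stream is fully flushed and closed, a crash mid-write leaves the previously published file untouched.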

---

## Installation

```bash
pip install django-icv-sitemaps
```

Add to `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ...
    "icv_sitemaps",
]
```

Run migrations:

```bash
python manage.py migrate icv_sitemaps
```

Include the URL configuration:

```python
# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("icv_sitemaps.urls")),
    # ...
]
```

This registers all discovery file endpoints at the root (`/sitemap.xml`,
`/robots.txt`, `/llms.txt`, `/ads.txt`, `/app-ads.txt`,
`/.well-known/security.txt`, `/humans.txt`).

---

## Quick Start

### 1. Make your model sitemap-includable

```python
# blog/models.py
from django.db import models
from icv_sitemaps.mixins import SitemapMixin


class Article(SitemapMixin, models.Model):
    sitemap_section_name = "articles"
    sitemap_changefreq = "weekly"
    sitemap_priority = 0.7

    title = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    is_published = models.BooleanField(default=True)
    updated_at = models.DateTimeField(auto_now=True)

    def get_absolute_url(self):
        return f"/articles/{self.slug}/"

    @classmethod
    def get_sitemap_queryset(cls):
        return cls.objects.filter(is_published=True)
```

### 2. Configure auto-sections

```python
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"

ICV_SITEMAPS_AUTO_SECTIONS = {
    "articles": {
        "model": "blog.Article",
        "sitemap_type": "standard",
        "changefreq": "weekly",
        "priority": 0.7,
    },
    "product_images": {
        "model": "catalogue.ProductImage",
        "sitemap_type": "image",
    },
    "videos": {
        "model": "media.Video",
        "sitemap_type": "video",
    },
    "breaking_news": {
        "model": "news.BreakingStory",
        "sitemap_type": "news",
    },
}
```

### 3. Set up and generate

```bash
# Create SitemapSection records from config
python manage.py icv_sitemaps_setup

# Generate all sitemaps
python manage.py icv_sitemaps_generate --all

# Validate output
python manage.py icv_sitemaps_validate

# Check stats
python manage.py icv_sitemaps_stats
```

### 4. Automatic regeneration

When an `Article` is saved or deleted, its section is marked stale. The
`regenerate_stale_sitemaps` task picks it up on the next run.

```python
# Celery beat schedule (optional)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "icv-sitemaps-regenerate-stale": {
        "task": "icv_sitemaps.tasks.regenerate_stale_sitemaps",
        "schedule": crontab(minute="*/15"),
    },
    "icv-sitemaps-regenerate-all": {
        "task": "icv_sitemaps.tasks.regenerate_all_sitemaps",
        "schedule": crontab(hour=3, minute=0),
    },
    "icv-sitemaps-cleanup-logs": {
        "task": "icv_sitemaps.tasks.cleanup_generation_logs",
        "schedule": crontab(hour=4, minute=0),
    },
    "icv-sitemaps-cleanup-orphans": {
        "task": "icv_sitemaps.tasks.cleanup_orphan_files",
        "schedule": crontab(day_of_week=0, hour=5, minute=0),
    },
}
```
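
The mark-stale-then-regenerate flow can be sketched in plain Python. `Section`, `mark_stale`, and `regenerate_stale` are hypothetical stand-ins used only for illustration; the package itself tracks staleness on `SitemapSection` records and its field names may differ.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    is_stale: bool = False
    generation_count: int = 0


def mark_stale(sections: dict[str, Section], name: str) -> None:
    """What a save/delete signal handler would do: flag, do not rebuild."""
    sections[name].is_stale = True


def regenerate_stale(sections: dict[str, Section]) -> list[str]:
    """What the periodic task would do: rebuild only the flagged sections."""
    rebuilt = []
    for section in sections.values():
        if section.is_stale:
            section.generation_count += 1  # stand-in for the real rebuild
            section.is_stale = False
            rebuilt.append(section.name)
    return rebuilt


sections = {name: Section(name) for name in ("articles", "videos")}
mark_stale(sections, "articles")   # e.g. an Article was saved
mark_stale(sections, "articles")   # repeated saves collapse into one rebuild
print(regenerate_stale(sections))  # ['articles']
print(regenerate_stale(sections))  # []
```

The point of the design is that signal handlers stay cheap: they only flip a flag, and the expensive work is batched into the scheduled task.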

---

## Sitemap Types

### Standard

Standard XML sitemaps with `<loc>`, `<lastmod>`, `<changefreq>`, and
`<priority>` per the [sitemaps.org](https://www.sitemaps.org/) protocol.
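
As a rough illustration of the output shape, the following standard-library sketch serialises entries into one `<urlset>` document per chunk. `build_urlsets` is a hypothetical helper, not package API, and it splits on URL count only (the 50 MB size limit is not modelled here).

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlsets(entries, max_urls=50_000):
    """Return one serialised <urlset> document per chunk of `max_urls` entries."""
    documents = []
    for start in range(0, len(entries), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for entry in entries[start:start + max_urls]:
            url = ET.SubElement(urlset, "url")
            # Emit only the tags that the entry actually provides.
            for tag in ("loc", "lastmod", "changefreq", "priority"):
                if tag in entry:
                    ET.SubElement(url, tag).text = str(entry[tag])
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents


entries = [{"loc": f"https://example.com/articles/{i}/", "priority": 0.7} for i in range(5)]
documents = build_urlsets(entries, max_urls=2)
print(len(documents))  # 3
```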

### Image

Uses the `http://www.google.com/schemas/sitemap-image/1.1` namespace.
Configure image fields on your model:

```python
class ProductImage(SitemapMixin, models.Model):
    sitemap_section_name = "product_images"
    sitemap_type = "image"
    sitemap_image_field = "image_url"
    sitemap_image_caption_field = "caption"
    sitemap_image_title_field = "title"
```

### Video

Uses the `http://www.google.com/schemas/sitemap-video/1.1` namespace:

```python
class Video(SitemapMixin, models.Model):
    sitemap_section_name = "videos"
    sitemap_type = "video"
    sitemap_video_url_field = "video_url"
    sitemap_video_thumbnail_field = "thumbnail_url"
    sitemap_video_title_field = "title"
    sitemap_video_description_field = "description"
    sitemap_video_duration_field = "duration_seconds"
```

### News

Uses the `http://www.google.com/schemas/sitemap-news/0.9` namespace. Entries
older than `ICV_SITEMAPS_NEWS_MAX_AGE_DAYS` (default 2) are automatically
excluded:

```python
class BreakingStory(SitemapMixin, models.Model):
    sitemap_section_name = "breaking_news"
    sitemap_type = "news"
    sitemap_news_publication_name = "Example News"
    sitemap_news_language = "en"
    sitemap_news_title_field = "headline"
    sitemap_news_date_field = "published_at"
```
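
The age cut-off can be pictured as a simple filter. This is a sketch under assumptions: `filter_recent` is a hypothetical name, entries are plain dicts, and treating an entry exactly at the cut-off as still valid is a choice made here rather than documented behaviour.

```python
from datetime import datetime, timedelta, timezone


def filter_recent(entries, max_age_days=2, now=None):
    """Keep entries whose `published_at` falls within the last `max_age_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [entry for entry in entries if entry["published_at"] >= cutoff]


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
stories = [
    {"headline": "fresh", "published_at": now - timedelta(hours=6)},
    {"headline": "old", "published_at": now - timedelta(days=3)},
]
print([s["headline"] for s in filter_recent(stories, now=now)])  # ['fresh']
```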

---

## Discovery Files

### robots.txt

Database-driven rules managed via Django admin or the service layer:

```python
from icv_sitemaps.services import add_robots_rule

# Block AI crawlers from /private/
add_robots_rule("GPTBot", "disallow", "/private/")
add_robots_rule("CCBot", "disallow", "/")

# Block all bots from /admin/
add_robots_rule("*", "disallow", "/admin/")
```

Extra directives from settings are appended after database rules:

```python
ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES = [
    "Crawl-delay: 10",
]
```
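
To make the merge order concrete, here is a minimal sketch of how grouped rules, extra directives, and the `Sitemap:` line might be combined. `render_robots` is a hypothetical function written for illustration; the exact layout produced by the package's `render_robots_txt` may differ.

```python
def render_robots(rules, extra_directives=(), sitemap_url=""):
    """Render (user_agent, directive, path) tuples into robots.txt text."""
    groups = {}
    for user_agent, directive, path in rules:
        groups.setdefault(user_agent, []).append((directive, path))

    lines = []
    for user_agent, directives in groups.items():
        lines.append(f"User-agent: {user_agent}")
        for directive, path in directives:
            lines.append(f"{directive.capitalize()}: {path}")
        lines.append("")  # blank line between user-agent groups
    lines.extend(extra_directives)  # settings-level lines come after DB rules
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots = render_robots(
    [("GPTBot", "disallow", "/private/"), ("*", "disallow", "/admin/")],
    extra_directives=["Crawl-delay: 10"],
    sitemap_url="https://example.com/sitemap.xml",
)
print(robots)
```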

### ads.txt / app-ads.txt

IAB-format authorised seller declarations:

```python
from icv_sitemaps.services import add_ads_entry

add_ads_entry("google.com", "pub-1234567890", "DIRECT", certification_id="f08c47fec0942fa0")
add_ads_entry("adnetwork.com", "pub-9876543210", "RESELLER")

# For app-ads.txt
add_ads_entry("google.com", "pub-1234567890", "DIRECT", is_app_ads=True)
```
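
Each entry corresponds to one comma-separated record in the served file. The sketch below shows that record format; `format_ads_line` is a hypothetical helper, not part of the package.

```python
def format_ads_line(domain, publisher_id, relationship, certification_id=""):
    """Format one record: domain, publisher ID, relationship[, certification ID]."""
    relationship = relationship.upper()
    if relationship not in {"DIRECT", "RESELLER"}:
        raise ValueError(f"unknown relationship: {relationship!r}")
    fields = [domain, publisher_id, relationship]
    if certification_id:
        fields.append(certification_id)  # optional fourth field
    return ", ".join(fields)


print(format_ads_line("google.com", "pub-1234567890", "DIRECT", "f08c47fec0942fa0"))
# google.com, pub-1234567890, DIRECT, f08c47fec0942fa0
```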

### llms.txt, security.txt, humans.txt

Free-form text content managed via `DiscoveryFileConfig`:

```python
from icv_sitemaps.services import set_discovery_file_content

set_discovery_file_content("llms_txt", """# llms.txt
# AI training and crawl guidance for example.com

Allow: /blog/
Disallow: /private/
""")

set_discovery_file_content("security_txt", """Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Preferred-Languages: en
""")

set_discovery_file_content("humans_txt", """/* TEAM */
Lead: Nigel Copley
Site: example.com
""")
```
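
Because the content is free-form, it can be worth checking `security.txt` before saving it. RFC 9116 requires the `Contact` and `Expires` fields; the sketch below checks for both. `missing_security_fields` is a hypothetical helper and is not known to exist in the package.

```python
REQUIRED_FIELDS = {"contact", "expires"}


def missing_security_fields(content):
    """Return the required RFC 9116 field names that `content` lacks."""
    present = set()
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, separator, _value = line.partition(":")
        if separator:
            present.add(name.strip().lower())  # field names are case-insensitive
    return sorted(REQUIRED_FIELDS - present)


print(missing_security_fields("Preferred-Languages: en\n"))  # ['contact', 'expires']
```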

---

## URL Redirects & 404 Tracking

### Redirect Rules

Database-driven redirect rules evaluated by `RedirectMiddleware` before
Django's URL resolver:

```python
from icv_sitemaps.services import add_redirect

# Permanent redirect
add_redirect("/old-page/", "/new-page/", 301)

# Temporary redirect
add_redirect("/promo/", "/summer-sale/", 302)

# 410 Gone — page permanently removed
add_redirect("/deleted-product/", "", 410)

# Prefix match — all paths under /blog/2023/ redirect
add_redirect("/blog/2023/", "/archive/2023/", 301, match_type="prefix")

# Regex match
add_redirect(r"/product/\d+/", "/products/", 301, match_type="regex")

# Bulk import from CSV
import csv

from icv_sitemaps.services import bulk_import_redirects

with open("redirects.csv", newline="") as f:
    rows = list(csv.DictReader(f))

result = bulk_import_redirects(rows)
# e.g. {"created": 150, "updated": 3, "errors": []}
```
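
The three match types and priority ordering can be sketched as follows. Everything here is illustrative: `Rule`, `matches`, and `resolve` are hypothetical names, and treating a lower priority number as "evaluated first" and returning the destination unchanged for prefix matches are assumptions, not documented behaviour.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    source: str
    destination: str
    status: int = 301
    match_type: str = "exact"  # "exact", "prefix", or "regex"
    priority: int = 0


def matches(rule: Rule, path: str) -> bool:
    if rule.match_type == "exact":
        return path == rule.source
    if rule.match_type == "prefix":
        return path.startswith(rule.source)
    if rule.match_type == "regex":
        try:
            return re.match(rule.source, path) is not None
        except re.error:
            return False  # fail open: a bad pattern never breaks the request
    return False


def resolve(rules: list[Rule], path: str):
    """Return (status, destination) for the first matching rule, or None."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, path):
            return rule.status, rule.destination
    return None


rules = [
    Rule("/old-page/", "/new-page/"),
    Rule("/deleted-product/", "", status=410),
    Rule("/blog/2023/", "/archive/2023/", match_type="prefix", priority=5),
    Rule(r"/product/\d+/", "/products/", match_type="regex", priority=10),
]
print(resolve(rules, "/blog/2023/hello/"))  # (301, '/archive/2023/')
```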

### Enable the Middleware

```python
# settings.py
MIDDLEWARE = [
    # ... security/WAF middleware first ...
    "icv_sitemaps.middleware.RedirectMiddleware",
    "django.middleware.common.CommonMiddleware",
    # ...
]

ICV_SITEMAPS_REDIRECT_ENABLED = True
```

### 404 Tracking

Enable automatic 404 tracking to identify broken URLs:

```python
# settings.py
ICV_SITEMAPS_404_TRACKING_ENABLED = True
ICV_SITEMAPS_404_TRACKING_SAMPLE_RATE = 1.0   # Track all 404s (reduce for high traffic)
ICV_SITEMAPS_404_IGNORE_PATTERNS = [
    r"\.(?:css|js|ico|png|jpg|jpeg|gif|svg|woff2?|ttf|eot|map)$",
]
```
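
How the sample rate and ignore patterns might interact is shown in this sketch. `should_track` is a hypothetical function, and the use of `re.search` and a uniform random draw are assumptions about the approach rather than the package's actual code.

```python
import random
import re

IGNORE_PATTERNS = [
    r"\.(?:css|js|ico|png|jpg|jpeg|gif|svg|woff2?|ttf|eot|map)$",
]


def should_track(path, sample_rate=1.0, patterns=IGNORE_PATTERNS, rng=random.random):
    """Decide whether a 404 for `path` should be recorded."""
    if any(re.search(pattern, path) for pattern in patterns):
        return False  # static-asset noise is never tracked
    # rng() returns a float in [0.0, 1.0), so 1.0 tracks everything
    # and 0.0 tracks nothing.
    return rng() < sample_rate


print(should_track("/missing-page/"))   # True
print(should_track("/static/app.js"))   # False
```

Lowering the sample rate trades completeness for fewer writes on high-traffic sites; frequently hit broken URLs still surface because they get many chances to be sampled.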

Review top 404s and create redirects:

```python
from icv_sitemaps.services import get_top_404s

# Top 50 unresolved 404s with at least 5 hits
for entry in get_top_404s(min_hits=5):
    print(f"{entry.path} — {entry.hit_count} hits, referrers: {entry.referrers}")
```

Or from the command line:

```bash
python manage.py icv_sitemaps_redirects --top-404s
python manage.py icv_sitemaps_redirects --list
python manage.py icv_sitemaps_redirects --import redirects.csv
python manage.py icv_sitemaps_redirects --export redirects.csv
python manage.py icv_sitemaps_redirects --prune   # Remove expired rules
```

---

## Configuration

### Settings Reference

All settings are namespaced under `ICV_SITEMAPS_*`. Every setting except
`ICV_SITEMAPS_BASE_URL` has a sensible default, so the package works out of the
box for local development once the base URL is set.

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `ICV_SITEMAPS_BASE_URL` | `str` | `""` | Base URL for absolute sitemap URLs (e.g. `"https://example.com"`). **Required** — raises `ImproperlyConfigured` at generation time if empty |
| `ICV_SITEMAPS_STORAGE_BACKEND` | `str` | `"django.core.files.storage.default_storage"` | Dotted path to Django storage backend for generated files |
| `ICV_SITEMAPS_STORAGE_PATH` | `str` | `"sitemaps/"` | Base path within the storage backend |
| `ICV_SITEMAPS_MAX_URLS_PER_FILE` | `int` | `50000` | Maximum URLs per file (protocol limit: 50,000) |
| `ICV_SITEMAPS_MAX_FILE_SIZE_BYTES` | `int` | `52428800` | Maximum file size in bytes (protocol limit: 50 MB) |
| `ICV_SITEMAPS_BATCH_SIZE` | `int` | `5000` | Queryset iteration batch size |
| `ICV_SITEMAPS_GZIP` | `bool` | `True` | Compress files with gzip |
| `ICV_SITEMAPS_PING_ENGINES` | `list` | `["google", "bing"]` | Engines to ping after regeneration |
| `ICV_SITEMAPS_PING_ENABLED` | `bool` | `True` | Enable/disable pinging |
| `ICV_SITEMAPS_AUTO_SECTIONS` | `dict` | `{}` | Auto-register model sections (see Quick Start) |
| `ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES` | `list` | `[]` | Extra lines appended to `robots.txt` |
| `ICV_SITEMAPS_ROBOTS_SITEMAP_URL` | `str` | `""` | Override sitemap URL in `robots.txt` (auto-detected if empty) |
| `ICV_SITEMAPS_CACHE_TIMEOUT` | `int` | `3600` | Cache TTL for discovery files (seconds) |
| `ICV_SITEMAPS_TENANT_PREFIX_FUNC` | `str` | `""` | Dotted path to tenant prefix callable |
| `ICV_SITEMAPS_ASYNC_GENERATION` | `bool` | `True` | Use Celery for background generation |
| `ICV_SITEMAPS_STREAMING_THRESHOLD` | `int` | `100000` | URL count above which streaming generation is used |
| `ICV_SITEMAPS_NEWS_MAX_AGE_DAYS` | `int` | `2` | Maximum age in days for news entries (Google News only accepts articles published in the last 2 days) |
| `ICV_SITEMAPS_REDIRECT_ENABLED` | `bool` | `False` | Enable redirect middleware evaluation (opt-in) |
| `ICV_SITEMAPS_REDIRECT_CACHE_TIMEOUT` | `int` | `300` | Cache TTL for redirect rule lookups (seconds) |
| `ICV_SITEMAPS_404_TRACKING_ENABLED` | `bool` | `False` | Enable 404 tracking in the redirect middleware |
| `ICV_SITEMAPS_404_TRACKING_SAMPLE_RATE` | `float` | `1.0` | Fraction of 404s to track (0.0 to 1.0) |
| `ICV_SITEMAPS_404_IGNORE_PATTERNS` | `list` | `[r"\.(?:css\|js\|...)$"]` | Regex patterns for paths to ignore when tracking 404s |

### Auto-Sections Configuration

Each key in `ICV_SITEMAPS_AUTO_SECTIONS` is the section name. The value is a
configuration dict:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `model` | `str` | required | `"app_label.ModelName"` |
| `sitemap_type` | `str` | `"standard"` | `standard`, `image`, `video`, or `news` |
| `changefreq` | `str` | `"daily"` | Default change frequency |
| `priority` | `float` | `0.5` | Default priority (0.0 to 1.0) |
| `on_save` | `bool` | `True` | Mark section stale on model save |
| `on_delete` | `bool` | `True` | Mark section stale on model delete |
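
The defaults in the table can be read as a merge over each section's dict. The sketch below applies them and validates the result; `normalise_section` and its error messages are hypothetical and only illustrate the documented keys and defaults.

```python
SECTION_DEFAULTS = {
    "sitemap_type": "standard",
    "changefreq": "daily",
    "priority": 0.5,
    "on_save": True,
    "on_delete": True,
}
VALID_TYPES = {"standard", "image", "video", "news"}


def normalise_section(name, config):
    """Merge one auto-section entry with the documented defaults."""
    if "model" not in config:
        raise ValueError(f"section {name!r} is missing the required 'model' key")
    merged = {**SECTION_DEFAULTS, **config}  # explicit values win over defaults
    if merged["sitemap_type"] not in VALID_TYPES:
        raise ValueError(f"section {name!r} has an unknown sitemap_type")
    if not 0.0 <= merged["priority"] <= 1.0:
        raise ValueError(f"section {name!r} priority must be within 0.0 to 1.0")
    return merged


section = normalise_section(
    "product_images",
    {"model": "catalogue.ProductImage", "sitemap_type": "image"},
)
print(section["changefreq"], section["priority"])  # daily 0.5
```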

---

## Service Functions

All functions are importable from `icv_sitemaps.services`:

```python
from icv_sitemaps.services import (
    # Sitemap generation
    generate_section,
    generate_all_sections,
    generate_index,
    mark_section_stale,
    get_generation_stats,
    # Section management
    create_section,
    delete_section,
    # Search engine ping
    ping_search_engines,
    # robots.txt
    render_robots_txt,
    add_robots_rule,
    get_robots_rules,
    # ads.txt
    render_ads_txt,
    add_ads_entry,
    # Discovery files
    get_discovery_file_content,
    set_discovery_file_content,
    # Redirects
    check_redirect,
    add_redirect,
    bulk_import_redirects,
    record_404,
    get_top_404s,
)
```

---

## Management Commands

| Command | Purpose |
|---------|---------|
| `icv_sitemaps_setup [--dry-run]` | Create `SitemapSection` records from `ICV_SITEMAPS_AUTO_SECTIONS` and verify storage |
| `icv_sitemaps_generate [--section NAME] [--all] [--index-only] [--force] [--tenant ID]` | Generate sitemaps; defaults to stale sections only |
| `icv_sitemaps_ping [--url URL] [--tenant ID]` | Ping search engines |
| `icv_sitemaps_validate [--section NAME]` | Validate generated sitemaps against protocol |
| `icv_sitemaps_stats [--tenant ID]` | Show generation statistics |
| `icv_sitemaps_redirects [--list] [--import FILE] [--export FILE] [--prune] [--top-404s]` | Manage redirect rules |

---

## Signals

All signals are defined in `icv_sitemaps.signals`:

| Signal | When |
|--------|------|
| `sitemap_section_generated` | After a section is successfully generated |
| `sitemap_generation_complete` | After all sections are generated |
| `sitemap_section_deleted` | After a section and its files are deleted |
| `sitemap_pinged` | After search engines are pinged |
| `sitemap_section_stale` | After a section is marked stale |
| `redirect_rule_saved` | After a redirect rule is saved |
| `redirect_rule_deleted` | After a redirect rule is deleted |
| `redirect_matched` | When a redirect rule matches a request |

---

## Celery Tasks

| Task | Purpose | Schedule |
|------|---------|----------|
| `regenerate_stale_sitemaps` | Regenerate stale sections | Every 15 minutes |
| `regenerate_all_sitemaps` | Full regeneration | Daily at 03:00 |
| `ping_engines_task` | Ping search engines | After generation |
| `cleanup_generation_logs` | Delete old logs (30-day default) | Daily at 04:00 |
| `cleanup_orphan_files` | Remove unreferenced storage files | Weekly |
| `cleanup_expired_redirects` | Delete expired redirect rules | Daily |
| `cleanup_redirect_logs` | Delete old resolved 404 logs (90-day default) | Weekly |

---

## Multi-Tenancy

Enable tenant-scoped discovery files by setting `ICV_SITEMAPS_TENANT_PREFIX_FUNC`
to a dotted path to a callable that returns the tenant identifier:

```python
# myapp/tenancy.py
def get_tenant_id(request):
    return getattr(request, "tenant_id", "")

# settings.py
ICV_SITEMAPS_TENANT_PREFIX_FUNC = "myapp.tenancy.get_tenant_id"
```

Each tenant gets isolated `robots.txt`, `ads.txt`, sitemaps, and all other
discovery files. Sitemap files are stored with tenant-prefixed paths
(e.g. `sitemaps/acme/products-0.xml`).
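
Tenant IDs end up inside storage paths, which is why they need sanitising. The following sketch shows one way to do that; `sanitise_tenant_id` and `tenant_path` are hypothetical helpers, and the allowed character set is an assumption rather than the package's actual rule.

```python
import posixpath
import re


def sanitise_tenant_id(tenant_id: str) -> str:
    """Keep only characters that are safe inside a single path segment."""
    return re.sub(r"[^A-Za-z0-9_-]", "", tenant_id)


def tenant_path(base: str, tenant_id: str, filename: str) -> str:
    """Build a storage path such as sitemaps/acme/products-0.xml."""
    safe = sanitise_tenant_id(tenant_id)
    # An empty tenant ID (single-tenant setup) adds no extra segment.
    parts = [base.rstrip("/")] + ([safe] if safe else []) + [filename]
    return posixpath.join(*parts)


print(tenant_path("sitemaps/", "acme", "products-0.xml"))       # sitemaps/acme/products-0.xml
print(tenant_path("sitemaps/", "../../etc", "products-0.xml"))  # sitemaps/etc/products-0.xml
```

Stripping dots and slashes means a hostile value such as `../../etc` can never climb out of the configured storage path.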

---

## Production Configuration

```python
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"
ICV_SITEMAPS_STORAGE_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"
ICV_SITEMAPS_STORAGE_PATH = "sitemaps/"
ICV_SITEMAPS_GZIP = True
ICV_SITEMAPS_PING_ENGINES = ["google", "bing"]
ICV_SITEMAPS_BATCH_SIZE = 10000
ICV_SITEMAPS_ASYNC_GENERATION = True
```

---

## Testing

The package provides testing utilities for consuming projects:

```python
from icv_sitemaps.testing import (
    SitemapSectionFactory,
    SitemapFileFactory,
    SitemapGenerationLogFactory,
    RobotsRuleFactory,
    AdsEntryFactory,
    DiscoveryFileConfigFactory,
    RedirectRuleFactory,
    RedirectLogFactory,
)
```

To run the package's own tests:

```bash
cd packages/icv-sitemaps
pytest tests/ -v
```

---

## Models

| Model | Purpose |
|-------|---------|
| `SitemapSection` | Logical sitemap section (e.g. "products", "articles") with staleness tracking |
| `SitemapFile` | Individual generated XML file with URL count and checksum |
| `SitemapGenerationLog` | Audit trail for generation runs |
| `RobotsRule` | Database-driven `robots.txt` directives |
| `AdsEntry` | `ads.txt` / `app-ads.txt` authorised seller entries |
| `DiscoveryFileConfig` | Content store for `llms.txt`, `security.txt`, `humans.txt` |
| `RedirectRule` | HTTP redirect and 410 Gone rules with pattern matching |
| `RedirectLog` | Aggregated 404 tracking with hit counts and referrers |

---

## URL Endpoints

| URL | Content-Type | Description |
|-----|-------------|-------------|
| `/sitemap.xml` | `application/xml` | Sitemap index |
| `/sitemaps/<filename>` | `application/xml` | Individual sitemap files |
| `/robots.txt` | `text/plain` | Robots exclusion protocol |
| `/llms.txt` | `text/plain` | AI crawler guidance |
| `/ads.txt` | `text/plain` | Authorised digital sellers |
| `/app-ads.txt` | `text/plain` | Authorised app sellers |
| `/.well-known/security.txt` | `text/plain` | Security contact (RFC 9116) |
| `/security.txt` | 301 redirect | Redirects to `/.well-known/security.txt` |
| `/humans.txt` | `text/plain` | Team credits |

---

## Requirements

- Python 3.11+
- Django 5.0+
- httpx 0.27+ (for search engine pings)
- Celery 5.3+ (optional, for background generation)

---

## Licence

MIT
