Metadata-Version: 2.4
Name: django-icv-sitemaps
Version: 0.4.0
Summary: Scalable sitemap generation and web discovery infrastructure for Django — background XML sitemaps (standard/image/video/news), robots.txt, llms.txt, ads.txt, security.txt, humans.txt.
Author: Nigel Copley
License-Expression: MIT
Project-URL: Homepage, https://github.com/nigelcopley/icv-oss
Project-URL: Documentation, https://github.com/nigelcopley/icv-oss/tree/main/packages/icv-sitemaps
Project-URL: Changelog, https://github.com/nigelcopley/icv-oss/tree/main/packages/icv-sitemaps/CHANGELOG.md
Project-URL: Issue Tracker, https://github.com/nigelcopley/icv-oss/issues
Project-URL: Source Code, https://github.com/nigelcopley/icv-oss/tree/main/packages/icv-sitemaps
Keywords: django,sitemap,robots-txt,seo,discovery,llms-txt,ads-txt
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Django
Classifier: Framework :: Django :: 5.0
Classifier: Framework :: Django :: 5.1
Classifier: Framework :: Django :: 6.0
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Django>=5.0
Provides-Extra: celery
Requires-Dist: celery>=5.3; extra == "celery"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-django>=4.8; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: factory-boy>=3.3; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: django-stubs>=5.0; extra == "dev"
Dynamic: license-file

# django-icv-sitemaps

[![CI](https://github.com/nigelcopley/icv-oss/actions/workflows/ci.yml/badge.svg)](https://github.com/nigelcopley/icv-oss/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/django-icv-sitemaps)](https://pypi.org/project/django-icv-sitemaps/)
[![Python](https://img.shields.io/pypi/pyversions/django-icv-sitemaps)](https://pypi.org/project/django-icv-sitemaps/)
[![Django](https://img.shields.io/badge/django-5.0%2B-0C4B33)](https://www.djangoproject.com/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

Django's built-in `django.contrib.sitemaps` loads every URL into memory at
request time. On a site with tens of thousands of pages that means slow
responses, high memory pressure, and no incremental updates when content
changes. At a million URLs it simply does not work.

`django-icv-sitemaps` replaces that approach entirely. Sitemaps are built in
the background by Celery tasks, written atomically to any Django storage
backend (local, S3, GCS), and served as static files. Only sections whose
content has changed are ever rebuilt. The full protocol is covered: standard,
image, video, and news sitemaps, automatic file splitting, gzip compression,
and search engine pinging — plus a complete set of web discovery files
(`robots.txt`, `llms.txt`, `ads.txt`, `security.txt`, `humans.txt`) managed
from the database.

Part of the [ICV-Django](https://github.com/nigelcopley/icv-oss) ecosystem,
but **fully standalone** — no other ICV packages required.

---

## Features

- **Background generation** — sitemaps are generated by Celery tasks (optional),
  written to Django storage backends (local, S3, GCS), and served statically
- **Incremental updates** — `post_save`/`post_delete` signals mark affected
  sections as stale; only changed sections are regenerated
- **All four sitemap types** — standard, image, video, and news sitemaps with
  correct XML namespaces per the sitemap protocol
- **Automatic splitting** — files are split at 50,000 URLs or 50 MB per the
  protocol limits
- **SitemapMixin** — declare any Django model as sitemap-includable with a small
  set of class attributes
- **Auto-sections** — `ICV_SITEMAPS_AUTO_SECTIONS` wires signal handlers
  automatically, much as `ICV_SEARCH_AUTO_INDEX` does in the ICV search package
- **robots.txt** — dynamic, database-driven rules merged with settings; includes
  `Sitemap:` directive automatically
- **llms.txt** — AI crawler guidance served at `/llms.txt`
- **ads.txt / app-ads.txt** — IAB-format authorised seller declarations
- **security.txt** — RFC 9116 compliant, served at `/.well-known/security.txt`
- **humans.txt** — team credits
- **Search engine ping** — Google, Bing, Yandex notified on content changes
  (conditional on checksum comparison)
- **Multi-tenancy** — all discovery files are tenant-scoped; sitemap paths
  include tenant prefix to prevent collisions; tenant IDs are sanitised to
  prevent path-traversal attacks
- **Gzip support** — compressed `.xml.gz` output with correct headers
- **Atomic writes** — temp file then rename; no partially-written files served
- **5 management commands** — `setup`, `generate`, `ping`, `validate`, `stats`
- **Django admin** — all 6 models registered with actions, list filters, and
  read-only views
- **Celery graceful degradation** — tasks work synchronously when Celery is not
  installed
- **Testing utilities** — 6 factory-boy factories, pytest fixtures, and helpers
  in `icv_sitemaps.testing`
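
The atomic-write behaviour listed above (temp file, then rename) can be sketched with the standard library alone. This is a minimal illustration for a local filesystem, not the package's implementation: `write_atomically` is a hypothetical helper, and remote backends such as S3 have their own semantics.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Hypothetical helper: write data so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays on
    # one filesystem, which is what lets os.replace swap it in atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader opening `path` at any moment sees either the previous complete file or the new complete file, never a mixture.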

---

## Installation

```bash
pip install django-icv-sitemaps
```

Add to `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ...
    "icv_sitemaps",
]
```

Run migrations:

```bash
python manage.py migrate icv_sitemaps
```

Include the URL configuration:

```python
# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("icv_sitemaps.urls")),
    # ...
]
```

This registers all discovery file endpoints at the root (`/sitemap.xml`,
`/robots.txt`, `/llms.txt`, `/ads.txt`, `/app-ads.txt`,
`/.well-known/security.txt`, `/humans.txt`).

---

## Quick Start

### 1. Make your model sitemap-includable

```python
# blog/models.py
from django.db import models
from icv_sitemaps.mixins import SitemapMixin


class Article(SitemapMixin, models.Model):
    sitemap_section_name = "articles"
    sitemap_changefreq = "weekly"
    sitemap_priority = 0.7

    title = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    is_published = models.BooleanField(default=True)
    updated_at = models.DateTimeField(auto_now=True)

    def get_absolute_url(self):
        return f"/articles/{self.slug}/"

    @classmethod
    def get_sitemap_queryset(cls):
        return cls.objects.filter(is_published=True)
```

### 2. Configure auto-sections

```python
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"

ICV_SITEMAPS_AUTO_SECTIONS = {
    "articles": {
        "model": "blog.Article",
        "sitemap_type": "standard",
        "changefreq": "weekly",
        "priority": 0.7,
    },
    "product_images": {
        "model": "catalogue.ProductImage",
        "sitemap_type": "image",
    },
    "videos": {
        "model": "media.Video",
        "sitemap_type": "video",
    },
    "breaking_news": {
        "model": "news.BreakingStory",
        "sitemap_type": "news",
    },
}
```

### 3. Set up and generate

```bash
# Create SitemapSection records from config
python manage.py icv_sitemaps_setup

# Generate all sitemaps
python manage.py icv_sitemaps_generate --all

# Validate output
python manage.py icv_sitemaps_validate

# Check stats
python manage.py icv_sitemaps_stats
```

### 4. Automatic regeneration

When an `Article` is saved or deleted, its section is marked stale. The
`regenerate_stale_sitemaps` task picks it up on the next run.

```python
# Celery beat schedule (optional)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "icv-sitemaps-regenerate-stale": {
        "task": "icv_sitemaps.tasks.regenerate_stale_sitemaps",
        "schedule": crontab(minute="*/15"),
    },
    "icv-sitemaps-regenerate-all": {
        "task": "icv_sitemaps.tasks.regenerate_all_sitemaps",
        "schedule": crontab(hour=3, minute=0),
    },
    "icv-sitemaps-cleanup-logs": {
        "task": "icv_sitemaps.tasks.cleanup_generation_logs",
        "schedule": crontab(hour=4, minute=0),
    },
    "icv-sitemaps-cleanup-orphans": {
        "task": "icv_sitemaps.tasks.cleanup_orphan_files",
        "schedule": crontab(day_of_week=0, hour=5, minute=0),
    },
}
```
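
The stale-then-regenerate cycle can be pictured with a small in-memory stand-in. This is only a sketch of the idea: `StaleTracker` is a hypothetical class, whereas the package tracks staleness on `SitemapSection` records in the database.

```python
from collections.abc import Callable


class StaleTracker:
    """Hypothetical in-memory stand-in for per-section staleness flags."""

    def __init__(self) -> None:
        self._stale: set[str] = set()

    def mark_stale(self, section: str) -> None:
        # Marking is idempotent: many saves between runs cost one rebuild.
        self._stale.add(section)

    def stale_sections(self) -> list[str]:
        return sorted(self._stale)

    def regenerate_stale(self, generate: Callable[[str], object]) -> list[str]:
        """Rebuild each stale section, clearing a flag only after success."""
        rebuilt = []
        for name in self.stale_sections():
            generate(name)  # if this raises, the section stays stale
            self._stale.discard(name)
            rebuilt.append(name)
        return rebuilt
```

Because a flag is cleared only after its section rebuilds successfully, a failed run leaves the section queued for the next scheduled pass.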

---

## Sitemap Types

### Standard

Standard XML sitemaps with `<loc>`, `<lastmod>`, `<changefreq>`, and
`<priority>` per the [sitemaps.org](https://www.sitemaps.org/) protocol.
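
For orientation, the document shape described above can be produced with the standard library. This is a sketch of the sitemaps.org format only; `build_urlset` is a hypothetical helper and says nothing about how the package's generator is structured.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries: list[dict[str, str]]) -> bytes:
    """Hypothetical helper: serialise entries as a standard <urlset>."""
    # Register the protocol namespace as the default so tags are unprefixed.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Only <loc> is mandatory; the other elements are emitted if present.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = entry[tag]
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Calling `build_urlset([{"loc": "https://example.com/articles/a/", "changefreq": "weekly"}])` returns UTF-8 bytes containing a single `<url>` element.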

### Image

Uses the `http://www.google.com/schemas/sitemap-image/1.1` namespace.
Configure image fields on your mixin:

```python
class ProductImage(SitemapMixin, models.Model):
    sitemap_section_name = "product_images"
    sitemap_type = "image"
    sitemap_image_field = "image_url"
    sitemap_image_caption_field = "caption"
    sitemap_image_title_field = "title"
```

### Video

Uses the `http://www.google.com/schemas/sitemap-video/1.1` namespace:

```python
class Video(SitemapMixin, models.Model):
    sitemap_section_name = "videos"
    sitemap_type = "video"
    sitemap_video_url_field = "video_url"
    sitemap_video_thumbnail_field = "thumbnail_url"
    sitemap_video_title_field = "title"
    sitemap_video_description_field = "description"
    sitemap_video_duration_field = "duration_seconds"
```

### News

Uses the `http://www.google.com/schemas/sitemap-news/0.9` namespace. Entries
older than `ICV_SITEMAPS_NEWS_MAX_AGE_DAYS` (default 2) are automatically
excluded:

```python
class BreakingStory(SitemapMixin, models.Model):
    sitemap_section_name = "breaking_news"
    sitemap_type = "news"
    sitemap_news_publication_name = "Example News"
    sitemap_news_language = "en"
    sitemap_news_title_field = "headline"
    sitemap_news_date_field = "published_at"
```
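
The age cut-off amounts to a simple date comparison. The sketch below assumes timezone-aware datetimes and treats an entry exactly at the limit as still fresh; `is_fresh_news` is a hypothetical name, and the package's boundary handling may differ.

```python
from datetime import datetime, timedelta, timezone


def is_fresh_news(published_at: datetime, now: datetime, max_age_days: int = 2) -> bool:
    """Hypothetical check: keep entries no older than max_age_days."""
    return now - published_at <= timedelta(days=max_age_days)


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
recent = datetime(2025, 1, 9, 12, 0, tzinfo=timezone.utc)
old = datetime(2025, 1, 7, 12, 0, tzinfo=timezone.utc)
fresh = [d for d in (recent, old) if is_fresh_news(d, now)]  # keeps only `recent`
```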

---

## Discovery Files

### robots.txt

Database-driven rules managed via Django admin or the service layer:

```python
from icv_sitemaps.services import add_robots_rule

# Block AI crawlers from /private/
add_robots_rule("GPTBot", "disallow", "/private/")
add_robots_rule("CCBot", "disallow", "/")

# Block all bots from /admin/
add_robots_rule("*", "disallow", "/admin/")
```

Extra directives from settings are appended after database rules:

```python
ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES = [
    "Crawl-delay: 10",
]
```
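
To make the merge order concrete, here is a rough sketch of how such output could be assembled: rules grouped by user agent, then the extra directives, then the `Sitemap:` line. `render_robots_sketch` is a hypothetical function, and placing the `Sitemap:` line last is an assumption; the package's `render_robots_txt` may order or format things differently.

```python
from collections import defaultdict


def render_robots_sketch(
    rules: list[tuple[str, str, str]],
    extra_directives: list[str],
    sitemap_url: str = "",
) -> str:
    """Hypothetical renderer: (user_agent, directive, path) rules to text."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for user_agent, directive, path in rules:
        grouped[user_agent].append(f"{directive.capitalize()}: {path}")

    lines: list[str] = []
    for user_agent, directives in grouped.items():
        lines.append(f"User-agent: {user_agent}")
        lines.extend(directives)
        lines.append("")  # blank line separates user-agent groups
    lines.extend(extra_directives)  # settings-level lines follow the rules
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"
```

Grouping keeps every directive for one crawler under a single `User-agent:` line, which is how crawlers expect the file to be laid out.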

### ads.txt / app-ads.txt

IAB-format authorised seller declarations:

```python
from icv_sitemaps.services import add_ads_entry

add_ads_entry("google.com", "pub-1234567890", "DIRECT", certification_id="f08c47fec0942fa0")
add_ads_entry("adnetwork.com", "pub-9876543210", "RESELLER")

# For app-ads.txt
add_ads_entry("google.com", "pub-1234567890", "DIRECT", is_app_ads=True)
```
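
Each entry corresponds to one comma-separated line in the served file. The sketch below shows that line format as defined by the IAB specification; `format_ads_line` is a hypothetical helper, not part of `icv_sitemaps.services`.

```python
def format_ads_line(
    domain: str,
    publisher_id: str,
    relationship: str,
    certification_id: str = "",
) -> str:
    """Hypothetical formatter for a single ads.txt record."""
    relationship = relationship.upper()
    if relationship not in {"DIRECT", "RESELLER"}:
        raise ValueError(f"unknown relationship: {relationship!r}")
    fields = [domain.lower(), publisher_id, relationship]
    # The certification authority ID is the optional fourth field.
    if certification_id:
        fields.append(certification_id)
    return ", ".join(fields)
```

With the first entry above, this yields `google.com, pub-1234567890, DIRECT, f08c47fec0942fa0`.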

### llms.txt, security.txt, humans.txt

Free-form text content managed via `DiscoveryFileConfig`:

```python
from icv_sitemaps.services import set_discovery_file_content

set_discovery_file_content("llms_txt", """# llms.txt
# AI training and crawl guidance for example.com

Allow: /blog/
Disallow: /private/
""")

set_discovery_file_content("security_txt", """Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Preferred-Languages: en
""")

set_discovery_file_content("humans_txt", """/* TEAM */
Lead: Nigel Copley
Site: example.com
""")
```

---

## Configuration

### Settings Reference

All settings are namespaced under `ICV_SITEMAPS_*`. Every setting has a
sensible default so the package works out of the box for local development.

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `ICV_SITEMAPS_BASE_URL` | `str` | `""` | Base URL for absolute sitemap URLs (e.g. `"https://example.com"`). **Required** — raises `ImproperlyConfigured` at generation time if empty |
| `ICV_SITEMAPS_STORAGE_BACKEND` | `str` | `"django.core.files.storage.default_storage"` | Dotted path to Django storage backend for generated files |
| `ICV_SITEMAPS_STORAGE_PATH` | `str` | `"sitemaps/"` | Base path within the storage backend |
| `ICV_SITEMAPS_MAX_URLS_PER_FILE` | `int` | `50000` | Maximum URLs per file (protocol limit: 50,000) |
| `ICV_SITEMAPS_MAX_FILE_SIZE_BYTES` | `int` | `52428800` | Maximum file size in bytes (protocol limit: 50 MB) |
| `ICV_SITEMAPS_BATCH_SIZE` | `int` | `5000` | Queryset iteration batch size |
| `ICV_SITEMAPS_GZIP` | `bool` | `True` | Compress files with gzip |
| `ICV_SITEMAPS_PING_ENGINES` | `list` | `["google", "bing"]` | Engines to ping after regeneration |
| `ICV_SITEMAPS_PING_ENABLED` | `bool` | `True` | Enable/disable pinging |
| `ICV_SITEMAPS_AUTO_SECTIONS` | `dict` | `{}` | Auto-register model sections (see Quick Start) |
| `ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES` | `list` | `[]` | Extra lines appended to `robots.txt` |
| `ICV_SITEMAPS_ROBOTS_SITEMAP_URL` | `str` | `""` | Override sitemap URL in `robots.txt` (auto-detected if empty) |
| `ICV_SITEMAPS_CACHE_TIMEOUT` | `int` | `3600` | Cache TTL for discovery files (seconds) |
| `ICV_SITEMAPS_TENANT_PREFIX_FUNC` | `str` | `""` | Dotted path to tenant prefix callable |
| `ICV_SITEMAPS_ASYNC_GENERATION` | `bool` | `True` | Use Celery for background generation |
| `ICV_SITEMAPS_STREAMING_THRESHOLD` | `int` | `100000` | URL count above which streaming generation is used |
| `ICV_SITEMAPS_NEWS_MAX_AGE_DAYS` | `int` | `2` | Maximum age in days for news entries (Google News sitemaps should only list articles published in the last 2 days) |
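
The two limits `ICV_SITEMAPS_MAX_URLS_PER_FILE` and `ICV_SITEMAPS_MAX_FILE_SIZE_BYTES` act together: a new file starts as soon as either would be exceeded. The sketch below illustrates that rule on pre-rendered `<url>` fragments. `split_into_files` is a hypothetical helper, and it ignores the few bytes of XML header and footer that a real generator would also need to budget for.

```python
from collections.abc import Iterable, Iterator


def split_into_files(
    entries: Iterable[bytes],
    max_urls: int = 50_000,
    max_bytes: int = 52_428_800,
) -> Iterator[list[bytes]]:
    """Hypothetical splitter: yield groups of entries within both limits."""
    current: list[bytes] = []
    current_size = 0
    for entry in entries:
        # Start a new file when adding this entry would break either limit.
        if current and (len(current) >= max_urls or current_size + len(entry) > max_bytes):
            yield current
            current, current_size = [], 0
        current.append(entry)
        current_size += len(entry)
    if current:
        yield current
```

Because it consumes an iterable and yields one group at a time, the full URL set never has to sit in memory at once.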

### Auto-Sections Configuration

Each key in `ICV_SITEMAPS_AUTO_SECTIONS` is the section name. The value is a
configuration dict:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `model` | `str` | required | `"app_label.ModelName"` |
| `sitemap_type` | `str` | `"standard"` | `standard`, `image`, `video`, or `news` |
| `changefreq` | `str` | `"daily"` | Default change frequency |
| `priority` | `float` | `0.5` | Default priority (0.0 to 1.0) |
| `on_save` | `bool` | `True` | Mark section stale on model save |
| `on_delete` | `bool` | `True` | Mark section stale on model delete |
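
The defaults in the table can be read as a merge over each section's dict. The following sketch shows that merge with basic validation; `normalise_section` and its error messages are hypothetical and are not the package's actual loader.

```python
SECTION_DEFAULTS = {
    "sitemap_type": "standard",
    "changefreq": "daily",
    "priority": 0.5,
    "on_save": True,
    "on_delete": True,
}
VALID_TYPES = {"standard", "image", "video", "news"}


def normalise_section(name: str, config: dict) -> dict:
    """Hypothetical helper: apply table defaults and validate one section."""
    if "model" not in config:
        raise ValueError(f"section {name!r} is missing the required 'model' key")
    merged = {**SECTION_DEFAULTS, **config}  # explicit values win over defaults
    if merged["sitemap_type"] not in VALID_TYPES:
        raise ValueError(f"section {name!r} has unknown sitemap_type")
    if not 0.0 <= merged["priority"] <= 1.0:
        raise ValueError(f"section {name!r} priority must be within 0.0 to 1.0")
    return merged
```

For the `videos` entry from the Quick Start, the merged result carries `changefreq="daily"` and `priority=0.5` alongside the explicit `sitemap_type="video"`.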

---

## Service Functions

All functions are importable from `icv_sitemaps.services`:

```python
from icv_sitemaps.services import (
    # Sitemap generation
    generate_section,
    generate_all_sections,
    generate_index,
    mark_section_stale,
    get_generation_stats,
    # Section management
    create_section,
    delete_section,
    # Search engine ping
    ping_search_engines,
    # robots.txt
    render_robots_txt,
    add_robots_rule,
    get_robots_rules,
    # ads.txt
    render_ads_txt,
    add_ads_entry,
    # Discovery files
    get_discovery_file_content,
    set_discovery_file_content,
)
```

---

## Management Commands

| Command | Purpose |
|---------|---------|
| `icv_sitemaps_setup [--dry-run]` | Create `SitemapSection` records from `ICV_SITEMAPS_AUTO_SECTIONS` and verify storage |
| `icv_sitemaps_generate [--section NAME] [--all] [--index-only] [--force] [--tenant ID]` | Generate sitemaps; defaults to stale sections only |
| `icv_sitemaps_ping [--url URL] [--tenant ID]` | Ping search engines |
| `icv_sitemaps_validate [--section NAME]` | Validate generated sitemaps against protocol |
| `icv_sitemaps_stats [--tenant ID]` | Show generation statistics |

---

## Signals

All signals are defined in `icv_sitemaps.signals`:

| Signal | When |
|--------|------|
| `sitemap_section_generated` | After a section is successfully generated |
| `sitemap_generation_complete` | After all sections are generated |
| `sitemap_section_deleted` | After a section and its files are deleted |
| `sitemap_pinged` | After search engines are pinged |
| `sitemap_section_stale` | After a section is marked stale |

---

## Celery Tasks

| Task | Purpose | Schedule |
|------|---------|----------|
| `regenerate_stale_sitemaps` | Regenerate stale sections | Every 15 minutes |
| `regenerate_all_sitemaps` | Full regeneration | Daily at 03:00 |
| `ping_engines_task` | Ping search engines | After generation |
| `cleanup_generation_logs` | Delete old logs (30-day default) | Daily at 04:00 |
| `cleanup_orphan_files` | Remove unreferenced storage files | Weekly |

---

## Multi-Tenancy

Enable tenant-scoped discovery files by setting `ICV_SITEMAPS_TENANT_PREFIX_FUNC`
to a dotted path to a callable that returns the tenant identifier:

```python
# myapp/tenancy.py
def get_tenant_id(request):
    return getattr(request, "tenant_id", "")

# settings.py
ICV_SITEMAPS_TENANT_PREFIX_FUNC = "myapp.tenancy.get_tenant_id"
```

Each tenant gets isolated `robots.txt`, `ads.txt`, sitemaps, and all other
discovery files. Sitemap files are stored with tenant-prefixed paths
(e.g. `sitemaps/acme/products-0.xml`).
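
The path-traversal protection can be illustrated with an allow-list filter on the tenant identifier. This is a sketch only: `tenant_path` and the permitted character set are assumptions, since the package's exact sanitisation rules are not documented here.

```python
import re

# Assumed allow-list: letters, digits, underscore, and hyphen.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def tenant_path(tenant_id: str, filename: str, base: str = "sitemaps/") -> str:
    """Hypothetical helper: build a storage path scoped to one tenant."""
    safe = _UNSAFE_CHARS.sub("", tenant_id)
    if not safe:
        # No usable tenant ID: fall back to the shared, unprefixed path.
        return f"{base}{filename}"
    return f"{base}{safe}/{filename}"
```

Stripping `.` and `/` means an identifier such as `../../etc` cannot escape the base directory. A stricter variant could reject such identifiers outright instead of rewriting them, which would also prevent two different IDs from collapsing to the same prefix.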

---

## Production Configuration

```python
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"
ICV_SITEMAPS_STORAGE_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"
ICV_SITEMAPS_STORAGE_PATH = "sitemaps/"
ICV_SITEMAPS_GZIP = True
ICV_SITEMAPS_PING_ENGINES = ["google", "bing"]
ICV_SITEMAPS_BATCH_SIZE = 10000
ICV_SITEMAPS_ASYNC_GENERATION = True
```
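
With `ICV_SITEMAPS_GZIP = True`, each file is stored as `.xml.gz`. A minimal sketch of the compression step with the standard library follows. `compress_sitemap` is a hypothetical helper, and pinning `mtime=0` is a design choice assumed here, not confirmed package behaviour.

```python
import gzip


def compress_sitemap(xml_bytes: bytes) -> bytes:
    """Hypothetical helper: gzip a rendered sitemap deterministically."""
    # mtime=0 keeps the gzip header free of the current timestamp, so the
    # same input always produces byte-identical output.
    return gzip.compress(xml_bytes, compresslevel=9, mtime=0)
```

Deterministic output matters if checksums are taken over the compressed bytes: an unchanged section then compresses to identical bytes and does not look modified.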

---

## Testing

The package provides testing utilities for consuming projects:

```python
from icv_sitemaps.testing import (
    SitemapSectionFactory,
    SitemapFileFactory,
    SitemapGenerationLogFactory,
    RobotsRuleFactory,
    AdsEntryFactory,
    DiscoveryFileConfigFactory,
)
```

To run the package's own tests:

```bash
cd packages/icv-sitemaps
pytest tests/ -v
```

---

## Models

| Model | Purpose |
|-------|---------|
| `SitemapSection` | Logical sitemap section (e.g. "products", "articles") with staleness tracking |
| `SitemapFile` | Individual generated XML file with URL count and checksum |
| `SitemapGenerationLog` | Audit trail for generation runs |
| `RobotsRule` | Database-driven `robots.txt` directives |
| `AdsEntry` | `ads.txt` / `app-ads.txt` authorised seller entries |
| `DiscoveryFileConfig` | Content store for `llms.txt`, `security.txt`, `humans.txt` |

---

## URL Endpoints

| URL | Content-Type | Description |
|-----|-------------|-------------|
| `/sitemap.xml` | `application/xml` | Sitemap index |
| `/sitemaps/<filename>` | `application/xml` | Individual sitemap files |
| `/robots.txt` | `text/plain` | Robots exclusion protocol |
| `/llms.txt` | `text/plain` | AI crawler guidance |
| `/ads.txt` | `text/plain` | Authorised digital sellers |
| `/app-ads.txt` | `text/plain` | Authorised app sellers |
| `/.well-known/security.txt` | `text/plain` | Security contact (RFC 9116) |
| `/security.txt` | 301 redirect | Redirects to `/.well-known/security.txt` |
| `/humans.txt` | `text/plain` | Team credits |

---

## Requirements

- Python 3.11+
- Django 5.0+
- httpx 0.27+ (for search engine pings)
- Celery 5.3+ (optional, for background generation)

---

## Licence

MIT
