Metadata-Version: 2.4
Name: django-icv-sitemaps
Version: 0.2.1
Summary: Scalable sitemap generation and web discovery infrastructure for Django — background XML sitemaps (standard/image/video/news), robots.txt, llms.txt, ads.txt, security.txt, humans.txt.
Author: Nigel Copley
License-Expression: MIT
Project-URL: Homepage, https://github.com/nigelcopley/icv-django
Project-URL: Documentation, https://github.com/nigelcopley/icv-django/tree/main/packages/icv-sitemaps
Project-URL: Changelog, https://github.com/nigelcopley/icv-django/tree/main/packages/icv-sitemaps/CHANGELOG.md
Project-URL: Issue Tracker, https://github.com/nigelcopley/icv-django/issues
Project-URL: Source Code, https://github.com/nigelcopley/icv-django/tree/main/packages/icv-sitemaps
Keywords: django,sitemap,robots-txt,seo,discovery,llms-txt,ads-txt
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Django
Classifier: Framework :: Django :: 4.2
Classifier: Framework :: Django :: 5.0
Classifier: Framework :: Django :: 5.1
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Django>=4.2
Provides-Extra: celery
Requires-Dist: celery>=5.3; extra == "celery"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-django>=4.8; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: factory-boy>=3.3; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: django-stubs>=5.0; extra == "dev"
Dynamic: license-file

# django-icv-sitemaps

[![PyPI](https://img.shields.io/pypi/v/django-icv-sitemaps)](https://pypi.org/project/django-icv-sitemaps/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

Scalable sitemap generation and web discovery infrastructure for Django —
background XML sitemaps (standard/image/video/news), `robots.txt`, `llms.txt`,
`ads.txt`, `security.txt`, and `humans.txt`.

Designed for sites with millions of URLs where Django's built-in
`django.contrib.sitemaps` is impractical (memory-hungry querysets, blocking
request-time generation, no incremental updates).

Part of the [ICV-Django](https://github.com/nigelcopley/icv-django) ecosystem,
but **fully standalone** — no other ICV packages required.

---

## Features

- **Background generation** — sitemaps are generated by Celery tasks (optional),
  written to Django storage backends (local, S3, GCS), and served statically
- **Incremental updates** — `post_save`/`post_delete` signals mark affected
  sections as stale; only changed sections are regenerated
- **All four sitemap types** — standard, image, video, and news sitemaps with
  correct XML namespaces per the sitemap protocol
- **Automatic splitting** — files are split at 50,000 URLs or 50 MB per the
  protocol limits
- **SitemapMixin** — declare any Django model as sitemap-includable with a small
  set of class attributes
- **Auto-sections** — `ICV_SITEMAPS_AUTO_SECTIONS` wires signal handlers
  automatically, following the same pattern as the `ICV_SEARCH_AUTO_INDEX`
  setting in the ICV-Django ecosystem
- **robots.txt** — dynamic, database-driven rules merged with settings; includes
  `Sitemap:` directive automatically
- **llms.txt** — AI crawler guidance served at `/llms.txt`
- **ads.txt / app-ads.txt** — IAB-format authorised seller declarations
- **security.txt** — RFC 9116 compliant, served at `/.well-known/security.txt`
- **humans.txt** — team credits
- **Search engine ping** — Google, Bing, Yandex notified on content changes
  (conditional on checksum comparison)
- **Multi-tenancy** — all discovery files are tenant-scoped; sitemap paths
  include tenant prefix to prevent collisions; tenant IDs are sanitised to
  prevent path-traversal attacks
- **Gzip support** — compressed `.xml.gz` output with correct headers
- **Atomic writes** — temp file then rename; no partially-written files served
- **5 management commands** — `setup`, `generate`, `ping`, `validate`, `stats`
- **Django admin** — all 6 models registered with actions, list filters, and
  read-only views
- **Celery graceful degradation** — tasks work synchronously when Celery is not
  installed
- **Testing utilities** — 6 factory-boy factories, pytest fixtures, and helpers
  in `icv_sitemaps.testing`
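
The automatic splitting rule above can be illustrated in plain Python. This is a minimal sketch rather than the package's implementation: `split_urls`, `HEADER`, and `FOOTER` are hypothetical names, and only the 50,000-URL and 50 MB limits come from the sitemap protocol.

```python
from collections.abc import Iterable, Iterator
from xml.sax.saxutils import escape

MAX_URLS = 50_000             # protocol limit on URLs per file
MAX_BYTES = 50 * 1024 * 1024  # protocol limit on uncompressed size (52,428,800 bytes)

HEADER = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)
FOOTER = b"</urlset>\n"


def split_urls(
    urls: Iterable[str],
    max_urls: int = MAX_URLS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[bytes]:
    """Yield complete sitemap documents, starting a new one at either limit."""
    entries = []
    size = len(HEADER) + len(FOOTER)
    for url in urls:
        entry = f"  <url><loc>{escape(url)}</loc></url>\n".encode("utf-8")
        # Close the current file before it would exceed either limit.
        if entries and (len(entries) >= max_urls or size + len(entry) > max_bytes):
            yield HEADER + b"".join(entries) + FOOTER
            entries, size = [], len(HEADER) + len(FOOTER)
        entries.append(entry)
        size += len(entry)
    if entries:
        yield HEADER + b"".join(entries) + FOOTER
```

Passing a small `max_urls` in a test exercises the rollover without generating 50,000 entries.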

---

## Installation

```bash
pip install django-icv-sitemaps
```

Add to `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ...
    "icv_sitemaps",
]
```

Run migrations:

```bash
python manage.py migrate icv_sitemaps
```

Include the URL configuration:

```python
# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("icv_sitemaps.urls")),
    # ...
]
```

This registers all discovery file endpoints at the root (`/sitemap.xml`,
`/robots.txt`, `/llms.txt`, `/ads.txt`, `/app-ads.txt`,
`/.well-known/security.txt`, `/humans.txt`).

---

## Quick Start

### 1. Make your model sitemap-includable

```python
# blog/models.py
from django.db import models
from icv_sitemaps.mixins import SitemapMixin


class Article(SitemapMixin, models.Model):
    sitemap_section_name = "articles"
    sitemap_changefreq = "weekly"
    sitemap_priority = 0.7

    title = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    is_published = models.BooleanField(default=True)
    updated_at = models.DateTimeField(auto_now=True)

    def get_absolute_url(self):
        return f"/articles/{self.slug}/"

    @classmethod
    def get_sitemap_queryset(cls):
        return cls.objects.filter(is_published=True)
```

### 2. Configure auto-sections

```python
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"

ICV_SITEMAPS_AUTO_SECTIONS = {
    "articles": {
        "model": "blog.Article",
        "sitemap_type": "standard",
        "changefreq": "weekly",
        "priority": 0.7,
    },
    "product_images": {
        "model": "catalogue.ProductImage",
        "sitemap_type": "image",
    },
    "videos": {
        "model": "media.Video",
        "sitemap_type": "video",
    },
    "breaking_news": {
        "model": "news.BreakingStory",
        "sitemap_type": "news",
    },
}
```

### 3. Set up and generate

```bash
# Create SitemapSection records from config
python manage.py icv_sitemaps_setup

# Generate all sitemaps
python manage.py icv_sitemaps_generate --all

# Validate output
python manage.py icv_sitemaps_validate

# Check stats
python manage.py icv_sitemaps_stats
```

### 4. Automatic regeneration

When an `Article` is saved or deleted, its section is marked stale. The
`regenerate_stale_sitemaps` task picks it up on the next run.

```python
# Celery beat schedule (optional)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "icv-sitemaps-regenerate-stale": {
        "task": "icv_sitemaps.tasks.regenerate_stale_sitemaps",
        "schedule": crontab(minute="*/15"),
    },
    "icv-sitemaps-regenerate-all": {
        "task": "icv_sitemaps.tasks.regenerate_all_sitemaps",
        "schedule": crontab(hour=3, minute=0),
    },
    "icv-sitemaps-cleanup-logs": {
        "task": "icv_sitemaps.tasks.cleanup_generation_logs",
        "schedule": crontab(hour=4, minute=0),
    },
    "icv-sitemaps-cleanup-orphans": {
        "task": "icv_sitemaps.tasks.cleanup_orphan_files",
        "schedule": crontab(day_of_week=0, hour=5, minute=0),
    },
}
```
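
The stale-then-regenerate cycle described in this step can be modelled without Django. The sketch below is illustrative only: `Section` and `SectionRegistry` are hypothetical stand-ins, not the package's `SitemapSection` model or its tasks.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    is_stale: bool = False
    generations: int = 0


class SectionRegistry:
    def __init__(self, names):
        self._sections = {name: Section(name) for name in names}

    def mark_stale(self, name):
        # What a save/delete signal handler does: flip a flag, nothing more.
        self._sections[name].is_stale = True

    def regenerate_stale(self):
        # What a periodic task does: rebuild only the flagged sections.
        rebuilt = []
        for section in self._sections.values():
            if section.is_stale:
                section.generations += 1  # stand-in for writing sitemap files
                section.is_stale = False
                rebuilt.append(section.name)
        return rebuilt
```

Because saving only sets a flag, many saves between two task runs still cost a single regeneration of the affected section.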

---

## Sitemap Types

### Standard

Standard XML sitemaps with `<loc>`, `<lastmod>`, `<changefreq>`, and
`<priority>` per the [sitemaps.org](https://www.sitemaps.org/) protocol.
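
For orientation, the four elements fit together as shown in the sketch below, which builds a one-entry `<urlset>` with the standard library. `build_urlset` and the entry dict keys are assumptions made for illustration; the package's own generator may structure this differently.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Serialise entries (dicts with loc, lastmod, changefreq, priority) to XML."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        ET.SubElement(url, "lastmod").text = entry["lastmod"].isoformat()
        ET.SubElement(url, "changefreq").text = entry["changefreq"]
        ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_urlset([
    {
        "loc": "https://example.com/articles/hello/",
        "lastmod": date(2025, 1, 2),
        "changefreq": "weekly",
        "priority": 0.7,
    }
])
```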

### Image

Uses the `http://www.google.com/schemas/sitemap-image/1.1` namespace.
Configure image fields on your mixin:

```python
class ProductImage(SitemapMixin, models.Model):
    sitemap_section_name = "product_images"
    sitemap_type = "image"
    sitemap_image_field = "image_url"
    sitemap_image_caption_field = "caption"
    sitemap_image_title_field = "title"
```

### Video

Uses the `http://www.google.com/schemas/sitemap-video/1.1` namespace:

```python
class Video(SitemapMixin, models.Model):
    sitemap_section_name = "videos"
    sitemap_type = "video"
    sitemap_video_url_field = "video_url"
    sitemap_video_thumbnail_field = "thumbnail_url"
    sitemap_video_title_field = "title"
    sitemap_video_description_field = "description"
    sitemap_video_duration_field = "duration_seconds"
```

### News

Uses the `http://www.google.com/schemas/sitemap-news/0.9` namespace. Entries
older than `ICV_SITEMAPS_NEWS_MAX_AGE_DAYS` (default 2) are automatically
excluded:

```python
class BreakingStory(SitemapMixin, models.Model):
    sitemap_section_name = "breaking_news"
    sitemap_type = "news"
    sitemap_news_publication_name = "Example News"
    sitemap_news_language = "en"
    sitemap_news_title_field = "headline"
    sitemap_news_date_field = "published_at"
```
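
The age cutoff amounts to a simple date comparison. A minimal sketch, assuming timezone-aware datetimes; `is_within_news_window` is a hypothetical helper, not part of the package API:

```python
from datetime import datetime, timedelta, timezone


def is_within_news_window(published_at, max_age_days=2, now=None):
    """Return True if the story is recent enough for a news sitemap."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return published_at >= cutoff
```

In a Django queryset the equivalent filter would be `filter(published_at__gte=cutoff)`, where `published_at` is the field named by `sitemap_news_date_field`.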

---

## Discovery Files

### robots.txt

Database-driven rules managed via Django admin or the service layer:

```python
from icv_sitemaps.services import add_robots_rule

# Block AI crawlers from /private/
add_robots_rule("GPTBot", "disallow", "/private/")
add_robots_rule("CCBot", "disallow", "/")

# Block all bots from /admin/
add_robots_rule("*", "disallow", "/admin/")
```

Extra directives from settings are appended after database rules:

```python
ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES = [
    "Crawl-delay: 10",
]
```
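
To make the ordering concrete, here is a rough sketch of how rules, extra directives, and the `Sitemap:` line could be assembled into one file. `render_robots` is a hypothetical function written for this example; the real `render_robots_txt` service may group or order its output differently.

```python
def render_robots(rules, extra_directives=(), sitemap_url=""):
    """Render (user_agent, directive, path) tuples into robots.txt text."""
    groups = {}
    for user_agent, directive, path in rules:
        # dicts preserve insertion order, so groups appear in first-seen order
        groups.setdefault(user_agent, []).append((directive, path))

    lines = []
    for user_agent, entries in groups.items():
        lines.append(f"User-agent: {user_agent}")
        lines.extend(f"{directive.capitalize()}: {path}" for directive, path in entries)
        lines.append("")  # blank line between groups
    lines.extend(extra_directives)  # settings-level lines follow the database rules
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"
```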

### ads.txt / app-ads.txt

IAB-format authorised seller declarations:

```python
from icv_sitemaps.services import add_ads_entry

add_ads_entry("google.com", "pub-1234567890", "DIRECT", certification_id="f08c47fec0942fa0")
add_ads_entry("adnetwork.com", "pub-9876543210", "RESELLER")

# For app-ads.txt
add_ads_entry("google.com", "pub-1234567890", "DIRECT", is_app_ads=True)
```
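
Each entry corresponds to one comma-separated line in the served file: domain, publisher account ID, relationship, and an optional certification authority ID. The sketch below shows that line format; `format_ads_line` is a hypothetical helper, not the package's `render_ads_txt`.

```python
VALID_RELATIONSHIPS = {"DIRECT", "RESELLER"}


def format_ads_line(domain, publisher_id, relationship, certification_id=""):
    """Format one ads.txt record as 'domain, publisher_id, RELATIONSHIP[, cert_id]'."""
    relationship = relationship.upper()
    if relationship not in VALID_RELATIONSHIPS:
        raise ValueError(f"relationship must be DIRECT or RESELLER, got {relationship!r}")
    fields = [domain, publisher_id, relationship]
    if certification_id:
        fields.append(certification_id)
    return ", ".join(fields)
```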

### llms.txt, security.txt, humans.txt

Free-form text content managed via `DiscoveryFileConfig`:

```python
from icv_sitemaps.services import set_discovery_file_content

set_discovery_file_content("llms_txt", """# llms.txt
# AI training and crawl guidance for example.com

Allow: /blog/
Disallow: /private/
""")

set_discovery_file_content("security_txt", """Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Preferred-Languages: en
""")

set_discovery_file_content("humans_txt", """/* TEAM */
Lead: Nigel Copley
Site: example.com
""")
```
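
Because the content is free-form, it can be worth checking `security.txt` text before storing it. RFC 9116 requires at least one `Contact` field and exactly one `Expires` field. The following is a small sketch of such a check; `check_security_txt` is a hypothetical helper and it assumes `Expires` carries a timezone, as in the example above.

```python
from datetime import datetime, timezone


def check_security_txt(content, now=None):
    """Return a list of problems found in security.txt content (empty if none)."""
    now = now or datetime.now(timezone.utc)
    fields = {}
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition(":")
        fields.setdefault(name.strip().lower(), []).append(value.strip())

    problems = []
    if not fields.get("contact"):
        problems.append("missing Contact field")
    expires = fields.get("expires", [])
    if len(expires) != 1:
        problems.append("Expires must appear exactly once")
    else:
        # Normalise a trailing "Z" so older Python versions can parse it too.
        expiry = datetime.fromisoformat(expires[0].replace("Z", "+00:00"))
        if expiry <= now:
            problems.append("Expires is in the past")
    return problems
```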

---

## Configuration

### Settings Reference

All settings are namespaced under `ICV_SITEMAPS_*`. Every setting except
`ICV_SITEMAPS_BASE_URL` has a usable default, so the base URL is the only
value that must be set before generating sitemaps.

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `ICV_SITEMAPS_BASE_URL` | `str` | `""` | Base URL for absolute sitemap URLs (e.g. `"https://example.com"`). **Required** — raises `ImproperlyConfigured` at generation time if empty |
| `ICV_SITEMAPS_STORAGE_BACKEND` | `str` | `"django.core.files.storage.default_storage"` | Dotted path to Django storage backend for generated files |
| `ICV_SITEMAPS_STORAGE_PATH` | `str` | `"sitemaps/"` | Base path within the storage backend |
| `ICV_SITEMAPS_MAX_URLS_PER_FILE` | `int` | `50000` | Maximum URLs per file (protocol limit: 50,000) |
| `ICV_SITEMAPS_MAX_FILE_SIZE_BYTES` | `int` | `52428800` | Maximum file size in bytes (protocol limit: 50 MB) |
| `ICV_SITEMAPS_BATCH_SIZE` | `int` | `5000` | Queryset iteration batch size |
| `ICV_SITEMAPS_GZIP` | `bool` | `True` | Compress files with gzip |
| `ICV_SITEMAPS_PING_ENGINES` | `list` | `["google", "bing"]` | Engines to ping after regeneration |
| `ICV_SITEMAPS_PING_ENABLED` | `bool` | `True` | Enable/disable pinging |
| `ICV_SITEMAPS_AUTO_SECTIONS` | `dict` | `{}` | Auto-register model sections (see Quick Start) |
| `ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES` | `list` | `[]` | Extra lines appended to `robots.txt` |
| `ICV_SITEMAPS_ROBOTS_SITEMAP_URL` | `str` | `""` | Override sitemap URL in `robots.txt` (auto-detected if empty) |
| `ICV_SITEMAPS_CACHE_TIMEOUT` | `int` | `3600` | Cache TTL for discovery files (seconds) |
| `ICV_SITEMAPS_TENANT_PREFIX_FUNC` | `str` | `""` | Dotted path to tenant prefix callable |
| `ICV_SITEMAPS_ASYNC_GENERATION` | `bool` | `True` | Use Celery for background generation |
| `ICV_SITEMAPS_STREAMING_THRESHOLD` | `int` | `100000` | URL count above which streaming generation is used |
| `ICV_SITEMAPS_NEWS_MAX_AGE_DAYS` | `int` | `2` | Maximum age in days for news entries (Google News sitemaps should only list articles published in the last 2 days) |

### Auto-Sections Configuration

Each key in `ICV_SITEMAPS_AUTO_SECTIONS` is the section name. The value is a
configuration dict:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `model` | `str` | required | `"app_label.ModelName"` |
| `sitemap_type` | `str` | `"standard"` | `standard`, `image`, `video`, or `news` |
| `changefreq` | `str` | `"daily"` | Default change frequency |
| `priority` | `float` | `0.5` | Default priority (0.0 to 1.0) |
| `on_save` | `bool` | `True` | Mark section stale on model save |
| `on_delete` | `bool` | `True` | Mark section stale on model delete |
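
The defaults in this table imply a merge step along the following lines. This is a sketch under stated assumptions: `resolve_section`, `DEFAULTS`, and `VALID_TYPES` are hypothetical names, and the validation shown is illustrative rather than what the package is known to enforce.

```python
DEFAULTS = {
    "sitemap_type": "standard",
    "changefreq": "daily",
    "priority": 0.5,
    "on_save": True,
    "on_delete": True,
}
VALID_TYPES = {"standard", "image", "video", "news"}


def resolve_section(name, config):
    """Fill in table defaults for one ICV_SITEMAPS_AUTO_SECTIONS entry."""
    if "model" not in config:
        raise ValueError(f"section {name!r}: 'model' is required")
    resolved = {**DEFAULTS, **config}  # explicit keys override defaults
    if resolved["sitemap_type"] not in VALID_TYPES:
        raise ValueError(f"section {name!r}: unknown sitemap_type")
    if not 0.0 <= resolved["priority"] <= 1.0:
        raise ValueError(f"section {name!r}: priority must be between 0.0 and 1.0")
    return resolved
```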

---

## Service Functions

All functions are importable from `icv_sitemaps.services`:

```python
from icv_sitemaps.services import (
    # Sitemap generation
    generate_section,
    generate_all_sections,
    generate_index,
    mark_section_stale,
    get_generation_stats,
    # Section management
    create_section,
    delete_section,
    # Search engine ping
    ping_search_engines,
    # robots.txt
    render_robots_txt,
    add_robots_rule,
    get_robots_rules,
    # ads.txt
    render_ads_txt,
    add_ads_entry,
    # Discovery files
    get_discovery_file_content,
    set_discovery_file_content,
)
```

---

## Management Commands

| Command | Purpose |
|---------|---------|
| `icv_sitemaps_setup [--dry-run]` | Create `SitemapSection` records from `ICV_SITEMAPS_AUTO_SECTIONS` and verify storage |
| `icv_sitemaps_generate [--section NAME] [--all] [--index-only] [--force] [--tenant ID]` | Generate sitemaps; defaults to stale sections only |
| `icv_sitemaps_ping [--url URL] [--tenant ID]` | Ping search engines |
| `icv_sitemaps_validate [--section NAME]` | Validate generated sitemaps against protocol |
| `icv_sitemaps_stats [--tenant ID]` | Show generation statistics |

---

## Signals

All signals are defined in `icv_sitemaps.signals`:

| Signal | When |
|--------|------|
| `sitemap_section_generated` | After a section is successfully generated |
| `sitemap_generation_complete` | After all sections are generated |
| `sitemap_section_deleted` | After a section and its files are deleted |
| `sitemap_pinged` | After search engines are pinged |
| `sitemap_section_stale` | After a section is marked stale |

---

## Celery Tasks

| Task | Purpose | Suggested schedule |
|------|---------|----------|
| `regenerate_stale_sitemaps` | Regenerate stale sections | Every 15 minutes |
| `regenerate_all_sitemaps` | Full regeneration | Daily at 03:00 |
| `ping_engines_task` | Ping search engines | After generation |
| `cleanup_generation_logs` | Delete old logs (30-day default) | Daily at 04:00 |
| `cleanup_orphan_files` | Remove unreferenced storage files | Weekly |
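
The Features list notes that pings are conditional on a checksum comparison. One way such a check could work is sketched below; `should_ping` is a hypothetical helper, and SHA-256 is an assumption, since the hash the package actually uses is not stated here.

```python
import hashlib
from typing import Optional, Tuple


def should_ping(content: bytes, previous_checksum: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, checksum); ping only when the content hash differs."""
    checksum = hashlib.sha256(content).hexdigest()
    return checksum != previous_checksum, checksum
```

Storing the returned checksum alongside the generated file lets the next run skip the ping when regeneration produced identical output.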

---

## Multi-Tenancy

Enable tenant-scoped discovery files by setting `ICV_SITEMAPS_TENANT_PREFIX_FUNC`
to the dotted path of a callable that takes the request and returns the tenant
identifier:

```python
# myapp/tenancy.py
def get_tenant_id(request):
    return getattr(request, "tenant_id", "")

# settings.py
ICV_SITEMAPS_TENANT_PREFIX_FUNC = "myapp.tenancy.get_tenant_id"
```

Each tenant gets isolated `robots.txt`, `ads.txt`, sitemaps, and all other
discovery files. Sitemap files are stored with tenant-prefixed paths
(e.g. `sitemaps/acme/products-0.xml`).
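
Tenant IDs end up in storage paths, which is why they are sanitised. A minimal sketch of one possible approach follows; `tenant_path` and the allowed character set are assumptions for illustration, not the package's actual rules.

```python
import posixpath
import re

# Keep only characters that cannot form path separators or ".." segments.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")


def tenant_path(base, tenant_id, filename):
    """Build a storage path, stripping anything unsafe from the tenant ID."""
    safe_id = _UNSAFE.sub("", tenant_id)
    if not safe_id:
        return posixpath.join(base, filename)  # no tenant: shared location
    return posixpath.join(base, safe_id, filename)
```

Stripping characters keeps the example short; rejecting an ID that contains unsafe characters outright is a stricter alternative.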

---

## Production Configuration

```python
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"
# S3 storage class provided by the separate django-storages package
ICV_SITEMAPS_STORAGE_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"
ICV_SITEMAPS_STORAGE_PATH = "sitemaps/"
ICV_SITEMAPS_GZIP = True
ICV_SITEMAPS_PING_ENGINES = ["google", "bing"]
ICV_SITEMAPS_BATCH_SIZE = 10000
ICV_SITEMAPS_ASYNC_GENERATION = True
```
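
With `ICV_SITEMAPS_GZIP = True`, output is written as `.xml.gz`, and the Features list describes writes as atomic (temp file, then rename). On a local filesystem that combination can be sketched as below. `write_gzip_atomic` is a hypothetical helper; remote backends such as S3 do not expose a rename and would need a different strategy.

```python
import gzip
import os
import tempfile


def write_gzip_atomic(path, xml_bytes):
    """Gzip xml_bytes into a temp file, then rename it over the target path."""
    directory = os.path.dirname(path) or "."
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(xml_bytes)
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```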

---

## Testing

The package provides testing utilities for consuming projects:

```python
from icv_sitemaps.testing import (
    SitemapSectionFactory,
    SitemapFileFactory,
    SitemapGenerationLogFactory,
    RobotsRuleFactory,
    AdsEntryFactory,
    DiscoveryFileConfigFactory,
)
```

To run the package's own tests:

```bash
cd packages/icv-sitemaps
pytest tests/ -v
```

---

## Models

| Model | Purpose |
|-------|---------|
| `SitemapSection` | Logical sitemap section (e.g. "products", "articles") with staleness tracking |
| `SitemapFile` | Individual generated XML file with URL count and checksum |
| `SitemapGenerationLog` | Audit trail for generation runs |
| `RobotsRule` | Database-driven `robots.txt` directives |
| `AdsEntry` | `ads.txt` / `app-ads.txt` authorised seller entries |
| `DiscoveryFileConfig` | Content store for `llms.txt`, `security.txt`, `humans.txt` |

---

## URL Endpoints

| URL | Content-Type | Description |
|-----|-------------|-------------|
| `/sitemap.xml` | `application/xml` | Sitemap index |
| `/sitemaps/<filename>` | `application/xml` | Individual sitemap files |
| `/robots.txt` | `text/plain` | Robots exclusion protocol |
| `/llms.txt` | `text/plain` | AI crawler guidance |
| `/ads.txt` | `text/plain` | Authorised digital sellers |
| `/app-ads.txt` | `text/plain` | Authorised app sellers |
| `/.well-known/security.txt` | `text/plain` | Security contact (RFC 9116) |
| `/security.txt` | 301 redirect | Redirects to `/.well-known/security.txt` |
| `/humans.txt` | `text/plain` | Team credits |

---

## Requirements

- Python 3.11+
- Django 4.2+
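- httpx 0.27+ (for search engine pings; not declared as a dependency in the package metadata, so install it separately if you enable pinging)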
- Celery 5.3+ (optional, for background generation)

---

## Licence

MIT
