# PyClickHouseMigrator

> Lightweight, ORM-free Python CLI for ClickHouse schema migrations.
> Advisory locking for concurrent execution protection.
> Supports single-node and cluster (ON CLUSTER) modes with replicated service tables.

## Overview

PyClickHouseMigrator is a CLI tool for managing ClickHouse schema migrations. It uses plain Python files as migrations — each file contains `up()` and `rollback()` functions that return SQL strings. The tool tracks applied migrations in a ClickHouse table, supports advisory locking for concurrent execution protection, checksum validation to detect modified migrations, dry-run mode, and cluster support with `ON CLUSTER` DDL and replicated service tables.

Dependencies: `clickhouse-driver` and `click`. No ORM, no SQLAlchemy, no heavy frameworks.

## Installation

```sh
pip install py-clickhouse-migrator
```

Or with uv:

```sh
uv add py-clickhouse-migrator
```

PyPI package name: `py-clickhouse-migrator`. CLI entry point: `migrator`.

## CLI Commands

### Global Options

These options apply to all commands and can be set via environment variables:

| Option | Env Variable | Default | Description |
|--------|-------------|---------|-------------|
| `--url` | `CLICKHOUSE_MIGRATE_URL` | (required for DB commands) | ClickHouse connection URL |
| `--path` | `CLICKHOUSE_MIGRATE_DIR` | `./db/migrations` | Migrations directory path |
| `--cluster` | `CLICKHOUSE_MIGRATE_CLUSTER` | (empty) | Cluster name for ON CLUSTER DDL |
| `--connect-retries` | `CLICKHOUSE_MIGRATE_CONNECT_RETRIES` | `0` | Connection retry attempts |
| `--connect-retries-interval` | `CLICKHOUSE_MIGRATE_CONNECT_RETRIES_INTERVAL` | `1` | Seconds between retries |
| `--send-receive-timeout` | `CLICKHOUSE_MIGRATE_SEND_RECEIVE_TIMEOUT` | `600` | Query timeout in seconds |
| `-v, --verbose` | — | off | Enable DEBUG logging |
| `-q, --quiet` | — | off | Suppress all output except errors |
| `--version` | — | — | Show version and exit |

Connection URL format: `clickhouse://user:password@host:port/database`

### `init`

Create the migrations directory. Works offline — no ClickHouse connection required.

```sh
migrator init
migrator --path ./my/migrations init
```

Creates `./db/migrations` (or the path specified by `--path` / `CLICKHOUSE_MIGRATE_DIR`).

### `new`

Create a new timestamped migration file. Works offline — no ClickHouse connection required.

```sh
migrator new create_users_table
migrator new                      # creates file without name suffix (warning issued)
```

| Argument | Required | Description |
|----------|----------|-------------|
| `NAME` | no | Migration name suffix. Only letters, digits, and underscores allowed. |

Generated filename format: `YYYYMMDDHHmmss_name.py` (or `YYYYMMDDHHmmss.py` without name).

Generated file template:

```python
def up() -> str:
    return """
    """


def rollback() -> str:
    return """
    """
```

### `up`

Apply pending migrations. Requires ClickHouse connection.

```sh
migrator up              # apply all pending
migrator up 3            # apply next 3 pending
migrator up --dry-run    # print SQL without executing
migrator up --no-lock    # disable distributed lock
migrator up --allow-dirty  # skip checksum validation
```

| Argument/Option | Default | Description |
|-----------------|---------|-------------|
| `N` | all | Number of migrations to apply (positional, optional) |
| `--lock / --no-lock` | `--lock` | Enable/disable distributed lock |
| `--lock-ttl` | `600` | Lock TTL in seconds (min: 1) |
| `--lock-retry` | `3` | Lock acquire retry attempts (min: 0) |
| `--dry-run` | off | Print SQL without executing |
| `--allow-dirty` | off | Skip checksum validation for modified files |

Output example:

```
20250318090000_create_users.py applied [✔]
20250319120000_create_events.py applied [✔]
```

When `--lock` is enabled and there are no pending migrations, the lock is skipped entirely.

### `rollback`

Rollback applied migrations in reverse order. Requires ClickHouse connection.

```sh
migrator rollback        # rollback last 1
migrator rollback 3      # rollback last 3
migrator rollback --dry-run
```

| Argument/Option | Default | Description |
|-----------------|---------|-------------|
| `N` | `1` | Number of migrations to rollback (positional, optional) |
| `--lock / --no-lock` | `--lock` | Enable/disable distributed lock |
| `--lock-ttl` | `600` | Lock TTL in seconds |
| `--lock-retry` | `3` | Lock acquire retry attempts |
| `--dry-run` | off | Print SQL without executing |

Output example:

```
20250319120000_create_events.py rolled back [✔].
```

### `show`

Display migration status, integrity information, and HEAD pointer. Requires ClickHouse connection.

```sh
migrator show            # last 5 applied + all pending
migrator show --all      # all applied + all pending
```

| Option | Default | Description |
|--------|---------|-------------|
| `--all` | off | Show all applied migrations (default: last 5) |

Output example:

```
Applied:
  [X] 20250320143022_add_indexes.py (HEAD)
  [X] 20250319120000_create_events.py
  [X] 20250318090000_create_users.py

Pending:
  [ ] 20250321100000_add_status_column.py

Applied: 3 | Pending: 1
```

`[X]` is colored green, `[ ]` is dimmed. The most recent applied migration is marked `(HEAD)` in cyan. Modified migrations show `(modified)` in yellow, missing files show `(missing)` in red. If integrity issues exist, a warning block is printed to stderr.

### `repair`

Update stored checksums in `db_migrations` to match current migration files. Use after intentionally editing an already-applied migration. Requires ClickHouse connection.

```sh
migrator repair
```

No options. Shows which migrations have mismatched checksums, updates them, and reports the count. Skips migrations whose files are missing.

### `force-unlock`

Manually release a stuck migration lock. Use when a deployment crashed mid-migration and the lock wasn't released. Requires ClickHouse connection.

```sh
migrator force-unlock
```

No options. Inserts a release row into the lock table regardless of who holds the lock.

### `lock-info`

Show current lock holder and expiration time. Requires ClickHouse connection.

```sh
migrator lock-info
```

Output when locked:

```
Locked by: hostname:pid
Locked at: 2025-03-20 14:30:22
Expires at: 2025-03-20 14:35:22
```

Output when unlocked:

```
No active lock.
```

## How It Works

### Migration Files

Migrations live in the migrations directory (default `./db/migrations`). Each migration is a Python file with two functions:

- `up() -> str` — returns SQL to apply the migration
- `rollback() -> str` — returns SQL to reverse the migration

Filename format: `YYYYMMDDHHmmss_optional_name.py`. Files are sorted lexicographically to determine execution order.

Multiple SQL statements can be separated by `;`. Each statement is executed independently.

Example migration:

```python
def up() -> str:
    return """
    CREATE TABLE IF NOT EXISTS users (
        id UInt64,
        name String,
        created_at DateTime DEFAULT now()
    ) ENGINE = MergeTree()
    ORDER BY id
    """


def rollback() -> str:
    return """
    DROP TABLE IF EXISTS users
    """
```

Empty `up()` functions (returning only whitespace) are skipped with a warning.

### Migration Tracking

Applied migrations are stored in a `db_migrations` table, created automatically on first use:

```sql
CREATE TABLE IF NOT EXISTS db_migrations (
    name String,
    up String,
    rollback String,
    dt DateTime64 DEFAULT now(),
    checksum String DEFAULT ''
) ENGINE = MergeTree()
ORDER BY dt
```

The `up` and `rollback` columns store the SQL text at the time of application, so rollbacks work even if the migration file is later deleted. The `dt` column determines ordering.

Rollback deletes the row from `db_migrations` using a lightweight `DELETE` with `mutations_sync=1`.

### Locking

The lock prevents concurrent migration runs (e.g. multiple CI/CD runners deploying simultaneously).

Lock table (`_migrations_lock`):

```sql
CREATE TABLE IF NOT EXISTS _migrations_lock (
    lock_id    String    DEFAULT 'migration',
    locked_by  String,
    locked_at  DateTime64(3) DEFAULT now64(3),
    expires_at DateTime64(3),
    is_locked  UInt8     DEFAULT 1
) ENGINE = ReplacingMergeTree(locked_at)
ORDER BY lock_id
```

How it works:

1. **Acquire**: Insert a row with `is_locked=1` and `expires_at = now + TTL`. Then query with `FINAL` to get the latest row. If `locked_by` matches the current process, lock is acquired.
2. **Release**: Insert a row with `is_locked=0`. The `ReplacingMergeTree` engine ensures the latest row (by `locked_at`) wins on `FINAL` queries.
3. **Expiration**: The `WHERE expires_at > now64(3)` clause in the active lock query means expired locks are automatically ignored.
4. **Retry**: If the lock is held, the process waits `retry_delay` seconds and retries up to `retry_count` times.
5. **Identity**: Lock holder is identified as `hostname:pid`.

### Checksum Validation

When a migration is applied, the SHA-256 hash of the SQL output from its `up()` and `rollback()` functions is stored in the `checksum` column of `db_migrations`. The two outputs are normalized, joined with a null byte separator, and hashed.

On each `up` run, stored checksums are compared with the current SQL output of each migration's functions. If a mismatch is found:

- Without `--allow-dirty`: raises `ChecksumMismatchError` and aborts
- With `--allow-dirty`: logs a warning and continues

Whitespace normalization: blank lines (containing only whitespace) are removed, trailing whitespace is stripped from remaining lines, then lines are joined by `\n`.

The `repair` command recalculates checksums from current files and updates the stored values. Missing files are skipped.

The `show` command displays per-migration integrity status (ok, modified, or missing).

## Configuration

All configuration can be provided via CLI options, environment variables, or a mix of both. CLI options take precedence over environment variables.

| CLI Option | Env Variable | Default | Type |
|------------|-------------|---------|------|
| `--url` | `CLICKHOUSE_MIGRATE_URL` | — | string (URL) |
| `--path` | `CLICKHOUSE_MIGRATE_DIR` | `./db/migrations` | string (path) |
| `--cluster` | `CLICKHOUSE_MIGRATE_CLUSTER` | — | string |
| `--connect-retries` | `CLICKHOUSE_MIGRATE_CONNECT_RETRIES` | `0` | int (min 0) |
| `--connect-retries-interval` | `CLICKHOUSE_MIGRATE_CONNECT_RETRIES_INTERVAL` | `1` | int (min 0) |
| `--send-receive-timeout` | `CLICKHOUSE_MIGRATE_SEND_RECEIVE_TIMEOUT` | `600` | int |
| `-v, --verbose` | — | off | flag |
| `-q, --quiet` | — | off | flag |

Per-command options (`--lock-ttl`, `--lock-retry`, `--dry-run`, `--allow-dirty`, `--all`) do not have environment variable equivalents.

The `--url` option is required for all commands except `init` and `new`, which work offline.

## Cluster Mode

When `--cluster` (or `CLICKHOUSE_MIGRATE_CLUSTER`) is set:

**What changes:**

- `db_migrations` table uses `ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')` engine and is created with `ON CLUSTER`
- `_migrations_lock` table uses `ReplicatedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}', locked_at)` engine and is created with `ON CLUSTER`
- All writes to service tables use `insert_quorum: auto` and `select_sequential_consistency: 1` settings for strong consistency across replicas

**What does NOT change:**

- Your migration SQL is executed as-is. If you need `ON CLUSTER` in your DDL, you must include it in the migration yourself.

Example:

```sh
export CLICKHOUSE_MIGRATE_CLUSTER=my_cluster
migrator up
```

Migration with ON CLUSTER:

```python
def up() -> str:
    return """
    CREATE TABLE IF NOT EXISTS users ON CLUSTER my_cluster (
        id UInt64,
        name String
    ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
    ORDER BY id
    """


def rollback() -> str:
    return """
    DROP TABLE IF EXISTS users ON CLUSTER my_cluster
    """
```

## Python API

```python
from py_clickhouse_migrator import Migrator

migrator = Migrator(
    database_url="clickhouse://default@localhost:9000/mydb",
    migrations_dir="./db/migrations",
)
migrator.up()
```

The `Migrator` class accepts the same parameters as the CLI global options: `database_url`, `migrations_dir`, `cluster`, `connect_retries`, `connect_retries_interval`, `send_receive_timeout`.

Key public methods: `up()`, `rollback()`, `show_migrations()`, `validate_checksums()`, `repair()`.

Public exports from `py_clickhouse_migrator`: `Migrator`, `MigrationLock`, `LockError`, `LockTimeoutError`, `ChecksumMismatchError`, `DatabaseNotFoundError`, `ShowMigrationsResult`, `ChecksumMismatch`, `compute_checksum`, `create_migration_file`, `create_migrations_dir`, `make_migration_filename`, `normalize_content`.

## Limitations and Gotchas

**No DDL transactions.** ClickHouse does not support transactional DDL. If a migration with multiple statements (separated by `;`) fails halfway, some statements will have been applied. Always use `IF NOT EXISTS` / `IF EXISTS` to make migrations idempotent and safe to re-run.

**SQL splitting by `;`.** Migration SQL is split into statements by `;`. Semicolons inside string literals will break parsing. If you need a literal `;` in a value, use a separate migration or encode the value differently.

**Advisory lock.** The locking mechanism is best-effort, not a true distributed mutex. There is a race condition window of a few milliseconds between INSERT and verification where two processes could both acquire the lock. If you need stronger guarantees, run migrations from a single process (e.g. a Kubernetes Job or a CI/CD pipeline step).

**Async mutations.** The `DELETE` used during rollback runs with `mutations_sync=1` to wait for completion. The `ALTER TABLE UPDATE` used by `repair` also uses `mutations_sync=1`. These are synchronous within a single node but may take time on large tables.

**Database must exist.** The migrator does not create the target database. If the database in the connection URL doesn't exist, the tool fails with a `DatabaseNotFoundError` and suggests the `CREATE DATABASE` command.

**Cluster name validation.** Cluster names must match `[a-zA-Z_][a-zA-Z0-9_]*` (SQL identifier pattern). Same validation applies to database names in the lock module.

## Project Info

- Version: 1.0.0
- License: MIT
- Python: 3.11, 3.12, 3.13, 3.14
- Repository: https://github.com/Maksim-Burtsev/PyClickHouseMigrator
- PyPI: https://pypi.org/project/py-clickhouse-migrator/
