Metadata-Version: 2.4
Name: copernicus-mcp
Version: 0.4.0
Summary: MCP server for safe, validated, cost-aware access to Copernicus Earth observation data.
Author: Ivan Kuznetsov, CliDyn
License: BSD 3-Clause License
        
        Copyright (c) 2026, Ivan Kuznetsov
        
        Redistribution and use in source and binary forms, with or without
        modification, are permitted provided that the following conditions are met:
        
        1. Redistributions of source code must retain the above copyright notice, this
           list of conditions and the following disclaimer.
        
        2. Redistributions in binary form must reproduce the above copyright notice,
           this list of conditions and the following disclaimer in the documentation
           and/or other materials provided with the distribution.
        
        3. Neither the name of the copyright holder nor the names of its
           contributors may be used to endorse or promote products derived from
           this software without specific prior written permission.
        
        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: aiosqlite<1,>=0.20
Requires-Dist: httpx<1,>=0.27
Requires-Dist: mcp<1.28,>=1.27.0
Requires-Dist: pandas<3,>=2
Requires-Dist: platformdirs<5,>=4
Requires-Dist: pyarrow<22,>=14
Requires-Dist: pydantic-settings<3,>=2.0
Requires-Dist: pydantic<3,>=2.0
Requires-Dist: python-dateutil<3,>=2.8
Requires-Dist: pyyaml<7,>=6
Requires-Dist: rich<14,>=13
Requires-Dist: typer<1,>=0.12
Provides-Extra: all
Requires-Dist: cdsapi<1,>=0.7.7; extra == 'all'
Requires-Dist: copernicusmarine<3,>=2.4; extra == 'all'
Provides-Extra: cds
Requires-Dist: cdsapi<1,>=0.7.7; extra == 'cds'
Provides-Extra: cmems
Requires-Dist: copernicusmarine<3,>=2.4; extra == 'cmems'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4; extra == 'dev'
Requires-Dist: pytest-timeout>=2.3; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# copernicus-mcp

`copernicus-mcp` is a Model Context Protocol (MCP) server that gives LLM agents and CLI users safe, validated, cost-aware, reproducible access to Copernicus Earth observation data. It exposes discovery, estimation and subset-download workflows as MCP tools, returns large scientific data as file descriptors (filepath + metadata + provenance) rather than inline bytes, and produces a deterministic provenance record for every retrieval.

## Status

Two backends are shipped:

- **Copernicus Marine (CMEMS)** through the official [`copernicusmarine`](https://help.marine.copernicus.eu/en/articles/7949409-copernicus-marine-toolbox-introduction) toolbox — discovery, estimation, synchronous and async subset retrieval.
- **Climate Data Store family (CDS / ADS / EWDS)** through [`cdsapi`](https://github.com/ecmwf/cdsapi) (single PAT works across all three stores) — discovery via a bundled catalogue snapshot, heuristic estimation, full async lifecycle (submit / poll / download / cancel), and T&C-not-accepted elicitation.

Copernicus Data Space Ecosystem (CDSE), Sentinel Hub and WEkEO are planned for subsequent iterations and are out of scope today.

## Quick start

```bash
# 1. Create and activate a virtual environment.
python -m venv .venv && source .venv/bin/activate

# 2. Install the package with the backends you need.
pip install "copernicus-mcp[cmems,cds]"      # both backends
# pip install "copernicus-mcp[cmems]"        # CMEMS only
# pip install "copernicus-mcp[cds]"          # CDS / ADS / EWDS only

# 3. Configure credentials for the backend(s) you installed.
#    CMEMS (free account at https://data.marine.copernicus.eu/register):
copernicusmarine login
#    CDS / ADS / EWDS (a single PAT works across all three stores; free
#    account at https://cds.climate.copernicus.eu/):
# export CDSAPI_KEY=<your-uuid-pat>
#    or populate ~/.cdsapirc as the cdsapi CLI expects.

# 4. Try a search from the terminal.
copernicus-mcp marine search-datasets --keyword temperature --limit 3
# The `cds` subcommands require opting the backend in — see "Credentials"
# below; otherwise the call exits with `backend_not_configured`.
COPERNICUS_MCP_ENABLED_BACKENDS=cmems,cds \
  copernicus-mcp cds search --keyword reanalysis --limit 3

# 5. Run the MCP server (used by Claude Desktop / Claude Code / any
#    MCP-compatible client over stdio). See "Claude Desktop integration"
#    below.
copernicus-mcp serve
```

## Features

The package provides:

- **Tools (CMEMS)** — `marine_search_groups`, `marine_search_products`, `marine_search_datasets`, `marine_describe_dataset`, `marine_estimate_subset`, `marine_subset_dataset`, `marine_list_files`, `marine_get_files`, `marine_check_status`, `marine_cancel_subset`. The first three implement a three-step hierarchical search (free-text query → groups → products → enriched dataset cards) on top of bundled `groups.json` / `products.json` / `dataset_cards.json` manifests — no LLM, no embeddings, runs offline. `marine_list_files` runs index-driven retrieval for sparse datasets (CORA / EasyCORA / INSITU-BGC): one-time SDK fetch of a per-dataset index, cached locally as Parquet, then offline bbox/time/variable filtering before `marine_get_files` downloads the precise file_list.
- **Tools (CDS / ADS / EWDS)** — `cds_search_datasets`, `cds_describe_dataset`, `cds_estimate_request`, `cds_submit_request`, `cds_check_request_status`, `cds_download_request_result`, `cds_cancel_request`.
- **Diagnostic** — `copernicus_mcp_status` reports configured backends, credential sources, cache metrics, and config snapshot (credential values never appear).
- **Resources** — `copernicus://datasets/cmems/{id}`, `copernicus://files/{cache_key}`, `copernicus://jobs/{request_id}`, `copernicus://provenance/{record_id}`.
- **CLI** (Typer + Rich) — `copernicus-mcp {serve, version, status, marine ..., cds ...}` with a global `--json` flag for scripting.
- **Confirmation flow**: large or approximate-estimate subsets gate on a structured confirmation prompt before any download.
- **Cache + provenance**: each retrieval produces a sidecar JSON record with file MD5, software versions, request envelope, and a deterministic cache key.
- **Sanitisation**: defence-in-depth redaction of credential-shaped strings on every outbound payload.
- **Structured errors**: eleven canonical error classes with recovery hints (e.g. `recovery_action="configure_credentials"`).
- **Cancellation discipline**: `asyncio.CancelledError` propagates without being wrapped, per project invariant.
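
The cache and provenance bullets above rely on two small ideas: a key that depends only on the content of a request, and a checksum of the downloaded file. The following is a minimal sketch of both using only the standard library. The function names, the choice of SHA-256 for the key, and the request fields are assumptions made for illustration; the scheme the package actually uses is not described in this README.

```python
import hashlib
import json


def cache_key(request: dict) -> str:
    """Hash a canonical JSON form of the request so key order does not matter."""
    # sort_keys plus fixed separators give one byte string per logical request.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical request envelopes: same content, different key order.
first = cache_key({"dataset_id": "example", "bbox": [-10, 40, 0, 50]})
second = cache_key({"bbox": [-10, 40, 0, 50], "dataset_id": "example"})
print(first == second)  # True
```

Canonicalising before hashing is what makes the key deterministic: two agents that build the same request in a different field order still land on the same cache entry.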

## Why this exists

LLM agents can already call APIs, but for scientific data three properties matter and are easy to lose:

1. **Reproducibility** — the agent can hand a colleague the exact request and get the exact same file back tomorrow.
2. **Cost-awareness** — multi-gigabyte downloads should be confirmed, not silently triggered by a fuzzy prompt.
3. **Credential isolation** — credentials must never leak into tool output, logs, or provenance, regardless of the prompt or the upstream library's exception messages.

`copernicus-mcp` enforces all three at the protocol layer, so the agent does not need to.
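
To make the credential-isolation point concrete, here is a hedged sketch of redacting credential-shaped strings from an outbound payload. It only covers canonical UUIDs (the shape of a CDS personal access token) and is not the package's sanitiser; the pattern, the `redact` helper, and the placeholder text are all assumptions.

```python
import re

# Matches canonical 8-4-4-4-12 hex UUIDs, the shape of a CDS personal access token.
UUID_PATTERN = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def redact(value):
    """Recursively replace UUID-shaped substrings in strings, lists, and dicts."""
    if isinstance(value, str):
        return UUID_PATTERN.sub("[REDACTED]", value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value  # numbers, booleans, None pass through untouched


# A made-up upstream exception message that echoes a token-shaped string.
payload = {"error": "auth failed for key 123e4567-e89b-12d3-a456-426614174000"}
print(redact(payload))  # {'error': 'auth failed for key [REDACTED]'}
```

A shape-based filter like this is deliberately blunt: it would also mask any legitimate identifier that happens to be a UUID, so a real sanitiser has to decide which fields are exempt.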

## Tool reference, in brief

### CMEMS (synchronous + async download)

- **`marine_search_groups`** / `marine search-groups` — first step of the hierarchical pipeline. Takes a free-text query, returns 1-5 routing groups (region × domain × intent bundles) with confidence + fallback signal. Offline-only.
- **`marine_search_products`** / `marine search-products` — second step. Takes the chosen `group_ids`, returns the candidate CMEMS products with summaries, optionally re-ranked by an additional keyword.
- **`marine_search_datasets`** / `marine search-datasets` — third step (with `product_ids` + optional `bbox` / `time_range`) returns enriched dataset cards. Without those, falls back to the flat slim catalogue by `keyword` / `product_id`. Bundled snapshot is the default; pass `live=true` / `--live` for the live SDK.
- **`marine_describe_dataset`** / `marine describe DATASET_ID` — full metadata: variables, axes, services, derived `spatial_extent` (from variable bboxes), DOI.
- **`marine_estimate_subset`** / `marine estimate ...` — preview byte size and confirmation status without downloading. Returns a `coverage_advisory` when the user bbox does not align with the dataset's extent.
- **`marine_subset_dataset`** / `marine subset ...` — download a spatio-temporal subset. Returns `{filepath, uri, metadata, provenance}` — never inline bytes. Large requests gate on a structured confirmation; pass `confirmed=true` (MCP) or `--yes` (CLI) on the second call. `async_mode=true` returns a `request_id` immediately and the download runs in the background.
- **`marine_list_files`** *(MCP only — no CLI subcommand)* — list the precise files in a sparse dataset that match a `bbox` / `time_range` / `variables` / `platform_types` filter, **before** calling `marine_get_files`. First call per dataset fetches the index from the SDK (~1–5 s for INSITU-BGC, ~210 s for CORA / EasyCORA); subsequent calls hit a local Parquet cache and return in milliseconds. Antimeridian-crossing bboxes are accepted here (unlike `marine_subset_dataset`).
- **`marine_get_files`** / `marine get-files ...` — download native files for sparse / in-situ datasets (`original-files` and `arco-platform-series` services) that don't support `subset`. Returns `{files: [{filepath, uri, metadata, provenance}, ...]}` — one descriptor per file in the bundle. Same confirmation flow as `subset`; the gate almost always fires because the SDK doesn't surface a precise dry-run size for sparse formats. **For filtered subsets, call `marine_list_files` first** to get a precise `file_list` instead of downloading the whole bundle.
- **`marine_check_status`** / `marine wait REQUEST_ID` — poll the workflow row for an in-flight or completed async submit.
- **`marine_cancel_subset`** / *(no CLI subcommand — use the MCP tool)* — cancel an in-flight async submit; best-effort, the underlying toolbox thread may run to completion.
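
Because `marine_subset_dataset` does not accept antimeridian-crossing bboxes, a caller may want to split such a box into two ordinary ones and issue two requests. The sketch below shows one way to do that. It assumes a `[west, south, east, north]` ordering and that `west > east` signals a crossing; both are assumptions for illustration, so check the tool's input schema before relying on them.

```python
def split_antimeridian(bbox):
    """Split a [west, south, east, north] bbox that crosses the antimeridian.

    A box with west > east is read as wrapping across 180 degrees and is
    returned as two boxes, one on each side. Other boxes come back unchanged.
    """
    west, south, east, north = bbox
    if west <= east:
        return [[west, south, east, north]]
    return [[west, south, 180.0, north], [-180.0, south, east, north]]


# A box spanning 170E to 170W, e.g. around Fiji.
print(split_antimeridian([170.0, -20.0, -170.0, -10.0]))
# [[170.0, -20.0, 180.0, -10.0], [-180.0, -20.0, -170.0, -10.0]]
```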

### CDS / ADS / EWDS (async-by-design, queue-backed)

- **`cds_search_datasets`** / `cds search` — discover dataset ids via a bundled catalogue snapshot covering all three stores. Slim records (~30k tokens for the full catalogue) suitable for LLM context.
- **`cds_describe_dataset`** / `cds describe DATASET_ID` — full STAC item for one dataset (description, extent, keywords, license, store, variables).
- **`cds_estimate_request`** / `cds estimate ...` — heuristic byte-size estimator + queue-tier classification (light / medium / heavy). `epistemic_status` is always `approximate`.
- **`cds_submit_request`** / `cds submit ...` — queue a retrieve. Returns `{status: "queued", request_id, cache_key}` immediately. Confirmation gate fires for large bytes OR queue tier `medium`/`heavy`; second call with `confirmed=true` proceeds. **`area` ordering: CDS uses `[north, west, south, east]`, opposite of common GIS `[w, s, e, n]`** — see the `inputs` field description; sending the wrong order silently retrieves the wrong region.
- **`cds_check_request_status`** / `cds check-status REQUEST_ID` (or `cds wait REQUEST_ID`) — poll the workflow row; `wait` blocks until terminal up to a configurable timeout.
- **`cds_download_request_result`** / `cds download REQUEST_ID` — fetch the result file from the canonical cache once status is `successful`.
- **`cds_cancel_request`** / `cds cancel REQUEST_ID` — cancel a queued or running request; idempotent on already-terminal rows.
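
Since the `area` ordering is easy to get wrong, a small conversion helper on the caller's side can remove the guesswork. This is a sketch, not part of the package: `gis_bbox_to_cds_area` is a hypothetical name, and the input is assumed to be in the `[w, s, e, n]` order mentioned above.

```python
def gis_bbox_to_cds_area(bbox):
    """Convert [west, south, east, north] to the CDS [north, west, south, east] order."""
    west, south, east, north = bbox
    if south > north:
        # Catch a likely mix-up early instead of silently fetching the wrong region.
        raise ValueError("south must not exceed north")
    return [north, west, south, east]


# Western Mediterranean, given in GIS order.
print(gis_bbox_to_cds_area([-10.0, 35.0, 5.0, 45.0]))  # [45.0, -10.0, 35.0, 5.0]
```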

A T&C-not-accepted server response is surfaced as the canonical `TermsNotAcceptedError` with `recovery_url` pointing at the licence page — open the URL, accept the licence, and re-submit.

For complete schemas, options and exit codes, run `copernicus-mcp <subcommand> --help` or read the inline tool descriptions surfaced by your MCP client.

## Claude Desktop integration

Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or the equivalent on your platform:

```json
{
  "mcpServers": {
    "copernicus": {
      "command": "copernicus-mcp",
      "args": ["serve"]
    }
  }
}
```

Restart Claude Desktop. Every tool whose backend is configured becomes available. Tool results that wrap large data return a `filepath` plus metadata and provenance — never inline bytes.

### Credentials

#### CMEMS

Resolution precedence:

1. **Toolbox credentials file** (recommended): `~/.copernicusmarine/.copernicusmarine-credentials`. Created by running `copernicusmarine login` once. The same file is used by the official CLI and by us — set it once, share it across tools.
2. **Environment variables**: `COPERNICUSMARINE_SERVICE_USERNAME` and `COPERNICUSMARINE_SERVICE_PASSWORD`. Convenient on CI or in a project-local `direnv` setup.
3. **`env: {...}` block inside `claude_desktop_config.json`** (possible, but **not recommended** for the desktop client). The file is stored in plain text and is backed up by macOS / cloud sync, so credentials embedded there leave a wider trace than necessary.

#### CDS / ADS / EWDS

A single Personal Access Token (PAT) — a canonical UUID — works across all three stores per ECMWF policy. Resolution precedence:

1. **Environment variable**: `CDSAPI_KEY=<your-uuid-pat>` in your shell profile.
2. **`~/.cdsapirc`** (the location the official `cdsapi` CLI uses). YAML-ish two-line file:
   ```
   url: https://cds.climate.copernicus.eu/api
   key: <your-uuid-pat>
   ```

Get a PAT: log in at <https://cds.climate.copernicus.eu/>, open the user-profile page, and copy the "Personal Access Token" UUID. Each new dataset requires accepting its licence once (the page is linked in the recovery URL of `TermsNotAcceptedError`).
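
The precedence above (environment first, then `~/.cdsapirc`) can be sketched as a pure function, which also makes it easy to test without touching real credentials. The helper names are hypothetical, and reusing the `env` / `config_file` / `missing` labels from the status output for this lookup is an assumption.

```python
def parse_cdsapirc(text: str) -> dict:
    """Parse 'name: value' lines; split on the first colon so URLs stay intact."""
    entries = {}
    for line in text.splitlines():
        if ":" not in line or line.lstrip().startswith("#"):
            continue
        name, value = line.split(":", 1)
        entries[name.strip()] = value.strip()
    return entries


def resolve_cds_key(environ, rc_text):
    """Return (key, source), checking the environment before the rc file."""
    key = environ.get("CDSAPI_KEY", "").strip()
    if key:
        return key, "env"
    key = parse_cdsapirc(rc_text or "").get("key", "")
    if key:
        return key, "config_file"
    return None, "missing"


example_rc = "url: https://cds.climate.copernicus.eu/api\nkey: <your-uuid-pat>\n"
print(resolve_cds_key({}, example_rc)[1])                   # config_file
print(resolve_cds_key({"CDSAPI_KEY": "x"}, example_rc)[1])  # env
print(resolve_cds_key({}, None)[1])                         # missing
```

Only the source label is printed here, in the same spirit as the status command: the key itself never needs to reach the terminal.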

By default only CMEMS is enabled; opt in to CDS with `enabled_backends: [cmems, cds]` in `~/.config/copernicus-mcp/config.yaml`, or `COPERNICUS_MCP_ENABLED_BACKENDS=cmems,cds` in your env.

Verify resolution: `copernicus-mcp status --json | jq '.backends'`. The output reports `credential_source` as `config_file`, `env`, or `missing` for each backend — the actual values are never printed.

## Configuration

The system is usable with no configuration file at all — every Pydantic field has a sensible default. Override via environment variables (`COPERNICUS_MCP_LOG_LEVEL`, `COPERNICUS_MCP_CACHE_DIR`, `COPERNICUS_MCP_STATE_DB`, plus `COPERNICUS_MCP_<SECTION>__<FIELD>` for nested fields), the global CLI flag `--cache-dir PATH`, or a YAML file at `~/.config/copernicus-mcp/config.yaml` or `~/.copernicus-mcp.yaml`. Precedence: CLI > env > yaml > defaults.
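
As an illustration of the precedence rule and the `COPERNICUS_MCP_<SECTION>__<FIELD>` convention, the sketch below merges four configuration layers with the standard library. The package itself builds on `pydantic-settings`, so this is not its implementation; the helper names and the `cache.max_gb` field are hypothetical, and values stay as strings here where a typed settings model would coerce them.

```python
PREFIX = "COPERNICUS_MCP_"


def env_overrides(environ):
    """Turn COPERNICUS_MCP_<SECTION>__<FIELD> variables into a nested dict."""
    result = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue
        parts = name[len(PREFIX):].lower().split("__")
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


def deep_merge(base, override):
    """Return a new dict where values in override win, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(defaults, yaml_cfg, environ, cli):
    """Apply layers from lowest to highest precedence: defaults, yaml, env, CLI."""
    config = defaults
    for layer in (yaml_cfg, env_overrides(environ), cli):
        config = deep_merge(config, layer)
    return config


config = resolve(
    defaults={"log_level": "INFO", "cache": {"max_gb": 10}},
    yaml_cfg={"log_level": "WARNING"},
    environ={"COPERNICUS_MCP_LOG_LEVEL": "DEBUG", "COPERNICUS_MCP_CACHE__MAX_GB": "50"},
    cli={},
)
print(config)  # {'log_level': 'DEBUG', 'cache': {'max_gb': '50'}}
```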

State directories are resolved per-OS via `platformdirs`:

| OS      | Cache (downloaded files + `.provenance.json` sidecars)       | State DB (workflow rows, cache index, provenance)                    |
| ------- | ------------------------------------------------------------ | -------------------------------------------------------------------- |
| Linux   | `~/.cache/copernicus-mcp/`                                   | `~/.local/state/copernicus-mcp/state.db`                             |
| macOS   | `~/Library/Caches/copernicus-mcp/`                           | `~/Library/Application Support/copernicus-mcp/state.db`              |
| Windows | `%LOCALAPPDATA%\copernicus-mcp\Cache\`                       | `%LOCALAPPDATA%\copernicus-mcp\state.db`                             |

## Troubleshooting

- **`AuthError`** on tool call → run `copernicus-mcp status` and check `backends.cmems.configured`. If `false`, your env vars are not visible to the running process (common Claude Desktop pitfall — restart the client after editing config) or the credentials file is missing/unreadable.
- **`CoverageUnavailableError`** → bbox or time range is outside the dataset's actual extent. Use `marine_describe_dataset` to inspect coverage and narrow the request.
- **`ValidationError` with `recovery_action="modify_request_parameters"`** → request was structurally invalid (e.g. inverted bbox, antimeridian-crossing bbox, naive datetime). The `next_action_hint` field tells you exactly how to fix it.
- **Subset hangs** → set `COPERNICUS_MCP_LOG_LEVEL=DEBUG` and watch for retry messages. Reduce the bbox or time range if the request is genuinely large.
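
The structural problems named under `ValidationError` can often be caught before a request is sent. The following pre-flight check is a sketch under stated assumptions: the `[west, south, east, north]` ordering, the reading of `west > east` as an antimeridian crossing, and the message wording are illustrative and do not reproduce the server's validation rules.

```python
from datetime import datetime, timezone


def check_request(bbox, start, end):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    west, south, east, north = bbox
    if south > north:
        problems.append("inverted bbox: south is greater than north")
    if west > east:
        problems.append("bbox crosses the antimeridian: west is greater than east")
    # utcoffset() is None for naive datetimes.
    naive = [name for name, moment in (("start", start), ("end", end))
             if moment.utcoffset() is None]
    for name in naive:
        problems.append(f"naive datetime for {name}: attach a timezone such as UTC")
    # Comparing naive with aware datetimes raises TypeError, so compare only
    # when both carry a timezone.
    if not naive and start > end:
        problems.append("inverted time range: start is after end")
    return problems


print(check_request(
    [-10, 35, 5, 45],
    datetime(2024, 1, 1),                       # naive: no tzinfo
    datetime(2024, 1, 2, tzinfo=timezone.utc),
))
# ['naive datetime for start: attach a timezone such as UTC']
```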

## License

BSD 3-Clause. See [`LICENSE`](LICENSE). Dependencies are licensed under EUPL-1.2 (`copernicusmarine`), Apache-2.0 (`cdsapi` and most of the rest), MIT, or BSD. The package does not depend on `sentinelhub-py`; when the Sentinel Hub backend lands in a later iteration, this section will document the relevant CC BY-NC restriction on its SDK.

## Acknowledgements

- [Mercator Ocean International](https://www.mercator-ocean.eu/) for the [`copernicusmarine`](https://github.com/mercator-ocean/copernicus-marine-toolbox) Python toolbox.
- [ECMWF](https://www.ecmwf.int/) for the [`cdsapi`](https://github.com/ecmwf/cdsapi) Python client and for operating the Climate / Atmosphere / Early Warning data stores under the Copernicus programme.
- The [Copernicus Marine Service](https://marine.copernicus.eu/) and the European Commission's Copernicus programme for the underlying data.
- The Anthropic team for the [Model Context Protocol](https://modelcontextprotocol.io/) specification and Python SDK.
