Metadata-Version: 2.4
Name: datagouv_client
Version: 0.3.2
Summary: Wrapper for the data.gouv.fr API
Author-email: Etalab <opendatateam@data.gouv.fr>
License-Expression: MIT
Project-URL: Source, https://github.com/datagouv/datagouv_client
Keywords: api,wrapper,datagouv
Requires-Python: <3.15,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx<1,>=0.28.1
Requires-Dist: tenacity<10,>=9.0.0
Requires-Dist: typer<1,>=0.25.0
Provides-Extra: dev
Requires-Dist: httpx<1,>=0.28.1; extra == "dev"
Requires-Dist: pytest-httpx<1,>=0.35.0; extra == "dev"
Requires-Dist: ruff>=0.11.2; extra == "dev"
Dynamic: license-file

![datagouv-client](docs/banner.png)

# **datagouv-client**

[![CircleCI](https://circleci.com/gh/datagouv/datagouv_client.svg?style=svg)](https://circleci.com/gh/datagouv/datagouv_client)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A Python and [CLI](#-cli) wrapper for the data.gouv.fr API that allows you to interact easily with datasets and resources across all three platforms (`prod`/`www`, `demo`, and `dev`). Install it through [PyPI](https://pypi.org/project/datagouv-client/):
```bash
pip install datagouv-client
```

**Requirements:** Python >= 3.10 and < 3.15

## 🚀 Use

### 📥 Quick Start
```python
from datagouv import Dataset, Resource, Topic

# Get a dataset and its resources
dataset = Dataset("5d13a8b6634f41070a43dff3")
print(f"Dataset: {dataset.title}")
print(f"Resources: {len(dataset.resources)}")

# Download a resource
resource = dataset.resources[0]
resource.download("my_file.csv")

# Get a topic and its elements
topic = Topic("68b6e6dbdac745f47d4ff6e0")
elements = topic.elements
datasets = topic.datasets
```

### 📊 Getting existing objects
If you only want to retrieve existing objects (i.e. you don't want to modify them on datagouv), here is what a workflow could look like:
```python
from datagouv import Dataset, Resource, Organization

dataset = Dataset("5d13a8b6634f41070a43dff3")  # you can find a dataset's id in the `Informations` tab of its landing page

# you can now access a bunch of info about the dataset
print(dataset.title)
print(dataset.description)
print(dataset.created_at)
print(dataset.organization)  # this is an instance of Organization
print(dataset)  # this displays all the attributes of the dataset as a dict

# and of course its resources, which are all Resource instances
for res in dataset.resources:
    print(res.title)
    print(res.url)  # this is the download URL of the resource
    print(res.id)  # the id of the resource itself
    print(res.dataset_id)  # the id of the dataset the resource belongs to
    print(res)  # this displays all the attributes of the resource as a dict

# if you are only interested in a specific resource
resource = Resource("f868cca6-8da1-4369-a78d-47463f19a9a3")  # you can find a resource's id in its `Métadonnées` tab
print(resource)

# if the resource contains tabular data that has been APIfied (see https://www.data.gouv.fr/dataservices/673b0e6774a23d9eac2af8ce for more info), it also has:
resource.columns  # the columns of the table, as a list of strings
resource.profile  # the profile of the table (number of rows, detected formats and types...)
for row in resource.rows(
    filters=[
        ("col1", "==", "6"),
        ("col4", "isnotnull"),
    ],  # filters is an optional argument to retrieve only the rows that match conditions
):
    print(row)

# you can also access a dataset from one of its resources
d = resource.dataset  # this returns an instance of Dataset

# you can also download a resource locally (parent directories are created if they don't exist)
resource.download("./file.csv")  # this saves the resource in your working directory as "file.csv"

# alternatively, you can load the resource directly into memory as a BytesIO buffer to process the content without writing it to disk
buf = resource.download_buffer()
# by default, in-memory downloads are limited to about 95 MiB to prevent excessive memory usage.
# Note: If you expect larger files and your infrastructure can handle them,
# increase the `max_mib` limit:
buf = resource.download_buffer(max_mib=200)  # allow up to about 200 MiB

# you can also download all or a subset of a dataset's resources (parent directories are created if they don't exist)
# the files are named `resource_id.format` (for instance f868cca6-8da1-4369-a78d-47463f19a9a3.csv)
d.download_resources(
    folder="data",  # if not specified, saves them into your working directory
    resources_types=["main", "documentation"],  # default is only main resources
)


organization = Organization("646b7187b50b2a93b1ae3d45")  # you can find an organization's id in the `Informations` tab of its landing page, in "Informations techniques"
# you can loop through the organization's datasets, which are Dataset instances
for dat in organization.datasets:
    print(f"{dat.title} has {len(dat.resources)} resources")
```
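
As an illustration of processing a downloaded buffer without writing to disk, here is a minimal sketch that parses a `BytesIO` as CSV using only the standard library. The buffer below is a hand-made stand-in for the result of `resource.download_buffer()`, and the `read_csv_buffer` helper, the `;` delimiter, and the UTF-8 encoding are assumptions for the example, not part of the package:
```python
import csv
import io

# Stand-in for `resource.download_buffer()`: a small in-memory CSV file.
buf = io.BytesIO("code;name\n01;Ain\n02;Aisne\n".encode("utf-8"))


def read_csv_buffer(buffer, delimiter=";", encoding="utf-8"):
    """Decode a binary buffer and parse it as CSV, returning one dict per row."""
    # getvalue() returns the whole content regardless of the current read position
    text = io.StringIO(buffer.getvalue().decode(encoding), newline="")
    return list(csv.DictReader(text, delimiter=delimiter))


rows = read_csv_buffer(buf)
print(rows[0]["name"])
```
Real files may use another delimiter or encoding, so adjust both arguments to the resource you are reading.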

> **Note:** If you encounter errors during API calls, the client will raise appropriate exceptions (e.g., `PermissionError` for authentication issues, `httpx.HTTPError` for API errors).

> **Note:** If you want to get objects from demo or dev, you must use a client:
```python
from datagouv import Client, Dataset

dataset = Dataset("5d13a8b6634f41070a43dff3", _client=Client("demo"))
```

You can also access objects' metrics (views, downloads) with the `get_monthly_traffic_metrics` method:
```python
for month_metrics in Dataset("5d13a8b6634f41070a43dff3").get_monthly_traffic_metrics(
    start_month="2025-01",  # optional, goes back as far as possible if not set
    end_month="2025-06",  # optional, until today if not set
):
    print(month_metrics)
```
The metrics differ depending on the object:
- for datasets:
```json
{
    "__id": 43110395,
    "dataset_id": "6789251f3a805425afee55e6",
    "metric_month": "2025-01",
    "monthly_visit": 233,
    "monthly_download_resource": 3
}
```
- for resources:
```json
{
    "__id": 58728461,
    "resource_id": "5ffa8553-0e8f-4622-add9-5c0b593ca1f8",
    "dataset_id": "5c4ae55a634f4117716d5656",
    "metric_month": "2025-04",
    "monthly_download_resource": 5669
}
```
- for organizations:
```json
{
    "__id": 7,
    "organization_id": "646b7187b50b2a93b1ae3d45",
    "metric_month": "2023-07",
    "monthly_visit_dataset": 27196,
    "monthly_download_resource": 1085933,
    "monthly_visit_reuse": 123,
    "monthly_visit_dataservice": 456
}
```
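
Since each month comes back as a plain dict, totals over a period are easy to compute. The sketch below sums every `monthly_*` counter over a list of records shaped like the dataset example above; the values are made up, the `sum_monthly_counters` helper is hypothetical, and treating a missing counter as zero is an assumption:
```python
from collections import defaultdict

# Illustrative records shaped like the dataset metrics above (values are made up).
monthly_metrics = [
    {"dataset_id": "6789251f3a805425afee55e6", "metric_month": "2025-01", "monthly_visit": 233, "monthly_download_resource": 3},
    {"dataset_id": "6789251f3a805425afee55e6", "metric_month": "2025-02", "monthly_visit": 180, "monthly_download_resource": None},
    {"dataset_id": "6789251f3a805425afee55e6", "metric_month": "2025-03", "monthly_visit": 90, "monthly_download_resource": 7},
]


def sum_monthly_counters(records):
    """Sum every `monthly_*` counter over an iterable of metric dicts."""
    totals = defaultdict(int)
    for record in records:
        for key, value in record.items():
            # Assumption: a missing (None) counter counts as zero.
            if key.startswith("monthly_") and value is not None:
                totals[key] += value
    return dict(totals)


totals = sum_monthly_counters(monthly_metrics)
print(totals)
```
With real data, you would pass the iterable returned by `get_monthly_traffic_metrics` instead of the sample list.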

### 🛠️ Interacting with objects online
If you want to modify objects on the datagouv platforms, you will need to create an authenticated client:
```python
from datagouv import Client

client = Client(
    environment="www",  # here you can set which platform the client will interact with, default is production
    api_key="MY_SECRET_API_KEY",  # your API key, that grants your rights on the platform
    verbose=True,  # whether or not to display logs in the processes, default is True
)
```
> **Note:** You can find your API key on https://www.data.gouv.fr/fr/admin/me/ (don't forget to change the prefix to get the key from the right environment).

Once your client is set up, you can instantiate datasets and resources from it. Of course, **you will only be allowed to modify objects according to your rights** (so objects created by you or an organization you are part of):
```python
dataset = client.dataset("5d13a8b6634f41070a43dff3")
# this is also a Dataset instance, with all the same attributes as above, but since you're authenticated, you have access to new methods

dataset.update({"title": "A brand new title"})  # update the dataset online with the payload you give, and also update the attributes of the object
print(dataset.title)  # -> "A brand new title"
dataset.delete()  # delete the dataset, use with caution!

# you can also modify the extras
dataset.update_extras(payload)
dataset.delete_extras(payload)

# the methods are the same for resources
for idx, res in enumerate(dataset.resources):
    res.update({"title": f"Resource n°{idx + 1}"})
    print(res.title)  # -> "Resource n°X"
    # delete every third resource
    if idx % 3 == 0:
        res.delete()


# it is also possible to sort a dataset's resources
# either with a specific Resource field and order, or with a custom sorting function that takes and returns a list of Resource objects
dataset.sort_resources(by="title.asc")  # the expected syntax is <field>.<order> (order being 'asc' or 'desc')
```
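
For the custom-function option, a sorter is simply a function that takes a list of resources and returns it reordered. Below is a minimal sketch using stand-in objects rather than real `Resource` instances; the `FakeResource` class and the `main_first` function are hypothetical, and the `type` attribute is assumed to mirror the API field of the same name. Check the signature of `sort_resources` for how to pass such a function:
```python
from dataclasses import dataclass


@dataclass
class FakeResource:
    """Stand-in exposing only the attributes the sorter reads."""

    title: str
    type: str


def main_first(resources):
    """Return the resources with main files first, then ordered by title (case-insensitive)."""
    return sorted(resources, key=lambda r: (r.type != "main", r.title.lower()))


resources = [
    FakeResource("Notice", "documentation"),
    FakeResource("b.csv", "main"),
    FakeResource("A.csv", "main"),
]
ordered_titles = [r.title for r in main_first(resources)]
print(ordered_titles)
```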

With an authenticated client, you are also allowed to create datasets and resources on the environment you specified:
```python
dataset = client.dataset().create(
    {
        "title": "New dataset",
        "description": "A description is required",
        "organization": "646b7187b50b2a93b1ae3d45",  # the organization that will own the dataset
    },
)  # this creates a dataset with the values you specified, and instantiates a Dataset
dataset.update({"tags": ["environment", "water"]})

# alternatively you can create a dataset from an organization, and it will be attached to it
organization = client.organization("646b7187b50b2a93b1ae3d45")
dataset = organization.create_dataset(
    {
        "title": "New dataset",
        "description": "A description is required",
    }
)
```
There are two types of resources on datagouv:
- `static`: a file is uploaded directly on the platform
- `remote`: references the URL of a file that is stored somewhere else on the internet

You have two options to create a resource (of any type):
- from the client itself, by specifying the id of the dataset you want to add it to (you must have the rights on the dataset):
```python
# to create a static resource from a file
resource = client.resource().create_static(
    file_to_upload="path/to/your/file.txt",
    payload={"title": "New static resource"},
    dataset_id="5d13a8b6634f41070a43dff3",
)  # this creates a static resource with the values you specified, and instantiates a Resource

# to create a remote resource from an url
resource = client.resource().create_remote(
    payload={"url": "http://example.com/file.txt", "title": "New remote resource"},
    dataset_id="5d13a8b6634f41070a43dff3",
)  # this creates a remote resource with the values you specified, and instantiates a Resource
```
- from the dataset you want to add it to (you must have the rights on the dataset), in which case you don't have to specify the `dataset_id`:
```python
dataset = client.dataset("5d13a8b6634f41070a43dff3")
# to create a static resource from a file
resource = dataset.create_static(
    file_to_upload="path/to/your/file.txt",
    payload={"title": "New static resource"},
)  # this creates a static resource with the values you specified, and instantiates a Resource

# to create a remote resource from an url
resource = dataset.create_remote(
    payload={"url": "http://example.com/file.txt", "title": "New remote resource"},
)  # this creates a remote resource with the values you specified, and instantiates a Resource

# to update the file of a static resource
resource.update({"title": "New title"}, file_to_upload="path/to/your/new_file.txt")
```
> **Note:** If you are not planning to use an object's attributes, you may prevent the initial API call using `fetch=False`, in order not to unnecessarily ping the API.
```python
dataset = client.dataset("5d13a8b6634f41070a43dff3", fetch=False)
print(dataset.title)  # -> this will fail because the attributes are not set from the initial call
# but you can update the object as usual
dataset.update({"title": "New title"})
print(dataset.title)  # -> "New title"   because the attributes are set from the response
```

### 🤓 CLI
Once you have installed `datagouv-client`, you can also do most of what's possible in Python from the command line. First you must set up your config with:
```bash
datagouv setup
```
You will be asked for the environment you want to interact with, and for your API key. They will be stored in a config file in your home directory. If you only intend to get data, you may leave the API key blank.
> Note: you may skip this setup step if you only intend to fetch data from the production platform.

You can see all available actions with:
```bash
datagouv --help
```
The `--help` option is available for all commands.

#### Displaying data
All objects have a `display` command, that shows the object's main metadata in a human-readable way, for instance:
```bash
datagouv organization display "534fff81a3a7292c64a77e5c"
> badges: [{'kind': 'public-service'}, {'kind': 'certified'}]
> ────────────────────
> business_number_id: 12002701600563
> ────────────────────
> created_at: 2014-04-17T18:21:21.523000+00:00
> ...
```

#### Getting data
All objects also have a `get` command, that outputs all the object's metadata in JSON (directly fed from datagouv's API). You may for instance give the output to `jq` like:
```bash
datagouv organization get "534fff81a3a7292c64a77e5c" | jq .name
> "Institut national de la statistique et des études économiques (Insee)"
```

#### Modifying objects
If you have run the `setup` command and filled in your API key, you may interact with objects (according to your rights on the platform), for instance:
```bash
datagouv dataset create --title "New dataset" --description "Nice description" --organization_id "646b7187b50b2a93b1ae3d45"
> Dataset created successfully ✓ id is 69fb46c2bdeef492539acd61
# use the `--set` argument to update keys (can be used multiple times in one call)
datagouv dataset update "69fb46c2bdeef492539acd61" --set title="New title" --set private=true
> Dataset updated successfully ✓
datagouv resource create "69fb46c2bdeef492539acd61" "First resource" --file-to-upload file.csv --set type=main
> Resource created successfully ✓ id is 49e370df-cd09-4792-915b-95d25c2adc08
datagouv resource delete "49e370df-cd09-4792-915b-95d25c2adc08"
> Resource deleted successfully ✓
```
> NB: you can delete your config file with `datagouv delete-config`

### ⚡ Advanced features
Many datagouv endpoints are paginated, which can make it tedious to retrieve all objects. An instance of `Client` has a method to create an iterator from any endpoint that returns paginated data:
```python
from datagouv import Dataset

for obj in client.get_all_from_api_query(
    "api/1/datasets/?organization=534fff81a3a7292c64a77e5c",  # get all datasets from a specific organization
    mask="data{id,title,resources{id,title}}",  # you can apply a mask to retrieve only specific fields of the objects
    cast_as=Dataset,  # you can get the results as objects to manipulate them more easily
):
    # with cast_as, each obj is a Dataset; without it, each obj is a dict (use obj['title'] and obj['resources'])
    print(f"Dataset {obj.title} has {len(obj.resources)} resources")
```
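
To illustrate what such an iterator does behind the scenes, here is a minimal, generic sketch of pagination handled by a generator. It is not the package's implementation: the `iterate_pages` function and the in-memory `FAKE_PAGES` are hypothetical, and the response shape (a `data` list plus a `next_page` URL that is empty on the last page) is an assumption:
```python
from typing import Callable, Iterator

# Hypothetical in-memory "API": each URL maps to one page of results.
FAKE_PAGES = {
    "page-1": {"data": [{"id": "a"}, {"id": "b"}], "next_page": "page-2"},
    "page-2": {"data": [{"id": "c"}], "next_page": None},
}


def iterate_pages(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every object of a paginated endpoint, following `next_page` links."""
    url = first_url
    while url:
        page = fetch(url)  # one call per page
        yield from page["data"]
        url = page.get("next_page")  # None or missing on the last page


ids = [obj["id"] for obj in iterate_pages("page-1", FAKE_PAGES.__getitem__)]
print(ids)
```
Because the generator is lazy, a page is only requested once the previous one has been fully consumed, so stopping the loop early avoids the remaining calls.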

You can also check whether a dataset contains a resource that has been updated more recently than a given resource:
```python
# Check if any resource in a dataset has been updated more recently than a specific resource
resource = Resource("f868cca6-8da1-4369-a78d-47463f19a9a3")
has_newer_updates = resource.check_if_more_recent_update("5d13a8b6634f41070a43dff3")
```

## 🤝 Contribution
Contributions and feedback are welcome! Main guidelines:
- as few API calls as possible (use responses to create/update objects)
- build on the existing code

Remember to format, lint, and sort imports with [Ruff](https://docs.astral.sh/ruff/) before committing (checks will remind you anyway):
```bash
pip install ".[dev]"
ruff check --fix .
ruff format .
```

### 🏷️ Release

The release process uses the [`tag_version.sh`](tag_version.sh) script to create git tags and update [CHANGELOG.md](CHANGELOG.md) and [pyproject.toml](pyproject.toml) automatically.

**Prerequisites**: [GitHub CLI](https://cli.github.com/) (`gh`) must be installed and authenticated, and you must be on the main branch with a clean working directory.

```bash
# Create a new release
./tag_version.sh <version>

# Example
./tag_version.sh 2.5.0

# Dry run to see what would happen
./tag_version.sh 2.5.0 --dry-run
```

The script automatically:
- Updates the version in `pyproject.toml`
- Extracts commits since the last tag and formats them for `CHANGELOG.md`
- Identifies breaking changes (commits with `!:` in the subject)
- Creates a git tag and pushes it to the remote repository
- Creates a GitHub release with the changelog content
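
As an illustration of the breaking-change rule, here is a small Python sketch that splits commit subjects on the `!:` marker. It only mirrors the rule as described above, not the actual shell script; the `split_breaking` function and the sample subjects are hypothetical:
```python
def split_breaking(subjects):
    """Separate commit subjects containing `!:` (breaking changes) from the others."""
    breaking = [s for s in subjects if "!:" in s]
    regular = [s for s in subjects if "!:" not in s]
    return breaking, regular


# Made-up commit subjects for the example.
subjects = [
    "feat!: drop the old download helper",
    "fix: handle empty resource lists",
    "docs: update README",
]
breaking, regular = split_breaking(subjects)
print(breaking)
```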
