Metadata-Version: 2.4
Name: ckanpy
Version: 0.2.7
Summary: Pythonic wrapper for downloading data from CKAN databases.
Author-email: Patrick Thomas Perrin <ptpdev@duck.com>
License-Expression: MIT
Project-URL: Homepage, https://codeberg.org/ptpdev/ckanpy
Classifier: Development Status :: 3 - Alpha
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.14
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ckanapi>=4.11
Requires-Dist: numpy>=2.4.6
Requires-Dist: pandas>=3.0.3
Requires-Dist: pydantic>=2.13.4
Requires-Dist: requests>=2.34.2
Dynamic: license-file

# ckanpy

### Purpose
`ckanpy` aims to simplify the process of downloading datasets from
CKAN databases. Existing CKAN Python packages (namely `ckanapi`) seem
designed with system administrators in mind rather than data consumers.
By contrast, this package is designed solely for data consumers.

There are two intended audiences:
- Data engineers, who wish to create a wrapper around a specific CKAN Package,
making it trivial for data analysts to then download from it.
- Data analysts, who, seeking to download data from a CKAN Package that does not
yet have a Python wrapper, wish to hack together a simple script that satisfies
their specific use case.

### Dependencies

`ckanpy` only truly depends on `pydantic` and `requests`. All other dependencies
may be removed in the future, for the sake of supply chain cybersecurity.

#### pydantic
`pydantic` is used for creating type-validated, easy-to-access schemas representing
CKAN data structures.

#### requests
Certain requests do not seem possible through `ckanapi`,
and are thus done instead through `requests`.

#### ckanapi
`ckanapi` simplifies downloading most CKAN data. It may be removed in the
future, since `ckanapi` is primarily a sysadmin package and `ckanpy` barely
uses it.

#### pandas
`pandas` is used for parsing CSVs. This may be replaced in the
future by an in-house solution.

#### numpy
`numpy` is used solely to access `np.nan` when cleaning
downloaded CSVs of `None` elements. It may be replaced in the future.

### Use Cases

#### Downloading Tabular Data
The whole point of `ckanpy` is to facilitate downloading
tabular data, be that through a SQL database or a CSV file.
- `download_sql(ckan_url, query)` → download from a CKAN Resource using a SQL query. This
is the preferred method of download, because it lets as much data cleaning as
possible happen server-side.
- `download_csv(url)` → download a CSV. Unless the user wants to download
literally all the data available, this option serves more as a fallback in
case either:
  1) a given `Resource` lacks a Resource ID, and therefore cannot be
  SQL-queried, or
  2) the user, for whatever reason, cannot generate their desired SQL query,
  and therefore must filter the data client-side.

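To make the server-side option more concrete, the sketch below builds the kind of request a SQL download boils down to, assuming the CKAN instance exposes the standard `datastore_search_sql` action of the CKAN Action API. The helper `build_sql_request_url` is hypothetical and not part of `ckanpy`; how `download_sql` issues its request internally is not shown here.

```python
from urllib.parse import urlencode, urljoin


def build_sql_request_url(ckan_url: str, query: str) -> str:
    """Return the URL of a CKAN `datastore_search_sql` request (hypothetical helper)."""
    # Standard CKAN Action API endpoint for SQL queries against the DataStore.
    endpoint = urljoin(ckan_url, "api/3/action/datastore_search_sql")
    # The SQL statement travels URL-encoded in the `sql` query parameter.
    return f"{endpoint}?{urlencode({'sql': query})}"


# Tables are named after Resource IDs, so the ID is quoted like an identifier.
resource_id = "9f4fc839-122d-4595-a146-43bc4ed16f46"
query = f'SELECT "CollisionId" FROM "{resource_id}" LIMIT 5'
url = build_sql_request_url("https://data.ca.gov/", query)
print(url)
```

Because the filtering lives in the query itself, only the matching rows need to cross the network, which is the main reason to prefer SQL over a full CSV download.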
#### Modeling CKAN Data Structures
To facilitate downloading tabular data, `ckanpy` creates type-validated
models of CKAN data structures. Below are examples of each modeled
data structure from the CCRS CKAN package:
- `Package` → CKAN package, e.g. California Crash Reporting System
- `Resource` → CKAN resource, e.g. Crashes_2021
- `ResourceCollection` → group of CKAN `Resource`s with a name pattern,
e.g. r"Crashes_[0-9]+"
- `DatastoreField` → Maps each column / field of a given Resource to its SQL type,
e.g. {"NumberInjured": "numeric"}
- `DatastoreInfo` → Collection of `DatastoreField`s pertaining to a given `Resource`

Downloading Package info is easy:

```python
from ckanpy import Package

package_ccrs = Package(
  ckan_url="https://data.ca.gov/",
  name_or_id="ccrs"
)
# Download occurs when Package.resources is first accessed, and is cached afterward
print(package_ccrs.resources)

# Information about each Resource may then be easily accessed
# Note that resources are stored as a list; this is because Resource names,
# for whatever reason, are not necessarily unique
# (e.g. the sysadmin uploaded a test duplicate)
print(package_ccrs.resources[0].resource_id)

package_duplicate = Package(
  ckan_url="https://data.ca.gov/",
  name_or_id="ccrs"
)
# Package downloads are cached, meaning this second package triggered no superfluous downloads
print(package_duplicate.resources)
```
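The name-pattern idea behind `ResourceCollection` can be illustrated with the standard `re` module. This is only a sketch: `collect_by_pattern` is a hypothetical helper, the resource names are made up for illustration, and whether `ckanpy` matches names in full or in part is an assumption.

```python
import re


def collect_by_pattern(resource_names: list[str], pattern: str) -> list[str]:
    """Return the names that match `pattern` in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    # fullmatch() keeps "Crashes_2021" but rejects "Crashes_2021_test".
    return [name for name in resource_names if compiled.fullmatch(name)]


# Illustrative names only; a real Package may list different Resources.
names = ["Crashes_2021", "Crashes_2022", "Parties_2021", "Crashes_2021_test"]
crashes = collect_by_pattern(names, r"Crashes_[0-9]+")
print(crashes)  # ['Crashes_2021', 'Crashes_2022']
```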

#### Utility
- `download_package_names(ckan_url)` → Downloads list of `Package` names within a CKAN
database. Though the CKAN web GUI is very useful, it does not seem easy to
find the internal name of a given `Package`, so this function fills the gap.

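For orientation, CKAN's Action API exposes a `package_list` action that returns Package names, and a lookup along these lines is presumably what such a utility wraps. The sketch below is not `ckanpy`'s implementation: `package_list_url` and `extract_package_names` are hypothetical helpers, and the sample payload is illustrative rather than a real response.

```python
import json
from urllib.parse import urljoin


def package_list_url(ckan_url: str) -> str:
    """Return the URL of CKAN's `package_list` action (hypothetical helper)."""
    return urljoin(ckan_url, "api/3/action/package_list")


def extract_package_names(response_text: str) -> list[str]:
    """Pull Package names out of an Action API JSON response (hypothetical helper)."""
    payload = json.loads(response_text)
    # Action API responses carry a `success` flag and the data under `result`.
    if not payload.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    return list(payload["result"])


# Illustrative payload shaped like an Action API response, not real data.
sample = '{"success": true, "result": ["ccrs", "example-package"]}'
print(package_list_url("https://data.ca.gov/"))
print(extract_package_names(sample))
```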
#### Constructing SQL Statements
Although `ckanpy` allows the user to input custom queries, doing so is somewhat
unwieldy, in large part due to tables being named after `Resource` IDs. As an
alternative for users who wish to make SQL queries through a pythonic interface,
`ckanpy` comes packaged with the following tools:
- `StatementAssembler` → given inputs, it outputs a simple SQL query that SELECTs
data from a single table and filters it with zero or more WHERE clauses
(see `StatementWhere`).
- `StatementWhere` → given inputs, it outputs a WHERE clause. Based on the
column type the user supplies (the stored type can vary over time), it
automatically CASTs the column so that the comparison can take place (e.g.
when filtering by longitude stored as a string column, the column is CAST
to numeric first).

Examples with and without WHERE statements:
```python
from uuid import UUID
from ckanpy import (
    StatementAssembler,
    StatementWhere,
)

ckan_url = "https://data.ca.gov/"
resource_id_crashes_2025 = UUID("9f4fc839-122d-4595-a146-43bc4ed16f46")
columns_to_select = ["CollisionId","City Name"]

# Without WHERE statements
assembler = StatementAssembler(
    column_names=columns_to_select,
    resource_id=resource_id_crashes_2025,
    ckan_url=ckan_url
)
print(assembler.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46"



# With WHERE statements
where1 = StatementWhere(
    column_name="City Name",
    column_value="San Diego",
    operator="equals"
)
where2 = StatementWhere(
    column_name="Day Of Week",
    column_value="Monday",
    operator="equals"
)

assembler2 = StatementAssembler(
    column_names=columns_to_select,
    resource_id=resource_id_crashes_2025,
    ckan_url=ckan_url,
    where_statements=[
        where1,
        where2
    ]
)
print(assembler2.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46" WHERE ("City Name" = 'San Diego') AND ("Day Of Week" = 'Monday')
```
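To illustrate the casting behaviour described for `StatementWhere`, here is a minimal, standalone sketch of how a WHERE condition with a CAST could be rendered. It does not use or reproduce `ckanpy`'s implementation: `render_where`, its parameters, the operator table, and the `Longitude` column are assumptions chosen for illustration.

```python
from typing import Optional, Union

# Hypothetical mapping from operator names to SQL symbols.
OPERATORS = {"equals": "=", "greater_than": ">", "less_than": "<"}


def quote_identifier(name: str) -> str:
    """Double-quote a column name, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: Union[str, int, float]) -> str:
    """Render a Python value as a SQL literal; strings get single quotes."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def render_where(
    column_name: str,
    value: Union[str, int, float],
    operator: str,
    cast_to: Optional[str] = None,
) -> str:
    """Render one parenthesised condition, optionally CASTing the column first."""
    column = quote_identifier(column_name)
    if cast_to is not None:
        # e.g. a longitude stored as text is compared as a number
        column = f"CAST({column} AS {cast_to})"
    return f"({column} {OPERATORS[operator]} {quote_literal(value)})"


print(render_where("City Name", "San Diego", "equals"))
# ("City Name" = 'San Diego')
print(render_where("Longitude", -118.0, "less_than", cast_to="numeric"))
# (CAST("Longitude" AS numeric) < -118.0)
```

Quoting identifiers and literals separately keeps column names with spaces (such as "City Name") valid and avoids breaking the statement when a value contains a quote character.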
#### Developing CKAN Package Wrappers
The main enterprise of `ckanpy`, however, is serving as a framework for
creating Package-specific wrappers. In fact, I wrote `ckanpy` in order
to write a wrapper for the CCRS package, `pyccrs`.

- `ResourceMapper` → Maps `ResourceCollection`s to named attributes. For
example, `pyccrs` uses `ResourceMapper` to map a `ResourceCollection` for
each CCRS table, so Crashes, Parties, and InjuredWitnessPassenger
(or "People", as I renamed it).
- `Downloader` → Implements, in whatever way is necessary, a
`download_records(**kwargs) → list[dict]` method. Additional public
download methods may be added, as is the case with `pyccrs`, but they should
be derived from `download_records`.
- Pydantic Table Models → each table in the Resource should have a Pydantic model
representing a given row of data. This serves as the parsing and validation layer.
- `ColumnNames` → `EnumStr` which maps Pythonic column names to the CKAN Resource's
actual names. This mapping is necessary not just for cleanliness, but also for
compatibility with `pydantic`, and it makes it easy to switch to whichever
representation is needed (e.g. the user provides Pythonic key names, which are
then translated to the original names when constructing the SQL query).

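The column-name mapping and the downloader contract can be sketched together in a few lines. Everything here is a hypothetical stand-in rather than `ckanpy`'s API: the `ColumnNames` members, the helper functions, the `ExampleDownloader` class, and its canned rows are assumptions, and a plain `str`-based `Enum` plays the role of the column-name mapping.

```python
from enum import Enum


class ColumnNames(str, Enum):
    """Pythonic member names mapped to the Resource's actual column names."""

    collision_id = "CollisionId"
    city_name = "City Name"


def to_original_names(row: dict) -> dict:
    """Translate Pythonic keys to the original column names (lookup by member name)."""
    return {ColumnNames[key].value: value for key, value in row.items()}


def to_pythonic_names(row: dict) -> dict:
    """Translate original column names to Pythonic keys (lookup by member value)."""
    return {ColumnNames(key).name: value for key, value in row.items()}


class ExampleDownloader:
    """Stand-in downloader: every public method derives from download_records."""

    def download_records(self, **kwargs) -> list[dict]:
        # A real wrapper would query CKAN here; these rows are canned placeholders.
        rows = [
            {"CollisionId": 1, "City Name": "San Diego"},
            {"CollisionId": 2, "City Name": "Fresno"},
        ]
        city = kwargs.get("city_name")
        return [r for r in rows if city is None or r["City Name"] == city]

    def download_pythonic(self, **kwargs) -> list[dict]:
        # Derived method: reuses download_records, then renames the keys.
        return [to_pythonic_names(r) for r in self.download_records(**kwargs)]


print(ExampleDownloader().download_pythonic(city_name="San Diego"))
# [{'collision_id': 1, 'city_name': 'San Diego'}]
```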
### Contributing
Though functional, there are still many ways this package can be improved.
Feel free to look around for leftover TODOs, send a pull request suggesting
a change, or reach out to me by email to discuss specific improvements. 
My dream is for `ckanpy` to help improve data quality for public datasets,
enabling data analysts to focus on what they know best: analyzing the data!
