Metadata-Version: 2.4
Name: dsjframe
Version: 0.9.0
Summary: Accurately reading, writing, and validating Dataset-JSON (.json, .ndjson, and .dsjc) with Apache Arrow.
Project-URL: Repository, https://github.com/k-nkmt/dsjframe
Author-email: Ken Nakamatsu <ken-nakamatsu@knworx.com>
License: AGPL-3.0-or-later
License-File: LICENSE
Keywords: arrow,cdisc,dataset-json
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: jsonschema>=4.26.0
Requires-Dist: pyarrow>=23.0.1
Provides-Extra: file-support
Requires-Dist: pandas>=2.3.3; extra == 'file-support'
Requires-Dist: pyreadstat>=1.3.1; extra == 'file-support'
Description-Content-Type: text/markdown

# dsjframe

`dsjframe` is a Python library for accurately reading, writing, and validating Dataset-JSON with Apache Arrow.
It supports plain JSON, NDJSON, and DSJC (`.json`, `.ndjson`, and `.dsjc`).

## Installation

Install the base package:

```bash
pip install dsjframe
```

Install the optional dependencies (pandas and pyreadstat) for additional file-format support:

```bash
pip install "dsjframe[file-support]"
```

Python 3.10 or newer is required.

## Quick Start

```python
import dsjframe

# Read
table = dsjframe.read_dataset("adsl.json")

# Write
metadata = {
    "datasetJSONVersion": "1.1.0",
    "label": "Subject Level Analysis Dataset",
}

dsjframe.write_dataset(table, "adsl.ndjson", metadata)

# Validate
report = dsjframe.validate_dataset("adsl.dsjc")
```

See [example.ipynb](example.ipynb) for a longer walkthrough.

## Metadata for Writing

Dataset-JSON allows extensions, but `dsjframe` targets the standard structure by default.
Unexpected metadata fields are treated as errors.

When writing, metadata is merged in this priority order:

1. Explicit `metadata`
2. `define.xml`
3. Embedded Arrow schema metadata
4. `readstat_meta`
5. Library defaults such as `datasetJSONVersion` and `itemGroupOID`

`datasetJSONCreationDateTime` is filled automatically.
In many cases, you can omit most or all of `metadata` if `define.xml`, embedded schema metadata, or `readstat_meta` already provide the required fields.
When `define.xml` is used, `metaDataRef` is set to `define.xml` automatically.
If you need a different path or reference value, set it explicitly in `metadata`.

When writing from a `pyarrow.Table` or pandas DataFrame, column `dataType` is derived from the actual column type in the frame.
Provided column metadata is still used for fields such as `label`, `length`, `displayFormat`, and `keySequence`, but it does not override the real data type.
Compatible `targetDataType` values are preserved where allowed, and decimal exports are normalized to `"targetDataType": "decimal"`.

## Missing Values

`dsjframe` follows Arrow conventions and represents missing values as null.
In practice, especially when Dataset-JSON is used as an XPT replacement, character missing values are often written as `""` instead of null.
Because Arrow is better aligned with nulls for missing data, empty strings in string-like columns are converted to null by default when reading.
If you need to preserve the distinction between null and an empty string, you can disable that conversion with an option.

## More Than Metadata Validation

Because Dataset-JSON is a text format, its readability and editability are often treated as advantages. Those same properties can also make type drift, malformed values, or accidental edits harder to catch.

To address that, `dsjframe` includes strict validation for both metadata and row data.

In addition to JSON Schema validation for metadata, `dsjframe` also checks combinations of `dataType` and `targetDataType`, row structure, value conversion, record counts, and consistency between file content and file extension.


## DSJC Support

DSJC is treated as gzip-compressed NDJSON.

The current implementation follows the available examples rather than the still-evolving compressed Dataset-JSON v1.1 wording. The specification may change, so DSJC behavior may need to change with it.
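Because DSJC is treated as gzip-compressed NDJSON, a payload can be produced or inspected with the standard library alone. This is a sketch of the container format under the assumption that the first NDJSON line holds the metadata and each following line holds one row, as in the available examples:

```python
import gzip
import json

# One JSON document per NDJSON line: metadata first, then one row per line.
lines = [
    {"datasetJSONVersion": "1.1.0", "records": 2},
    ["01-001", 34],
    ["01-002", 41],
]

ndjson = "\n".join(json.dumps(line) for line in lines) + "\n"
blob = gzip.compress(ndjson.encode("utf-8"))

# Reading back: decompress, then parse line by line.
decoded = [json.loads(l) for l in gzip.decompress(blob).decode("utf-8").splitlines()]
```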

## API Reference

### `read_dataset(source, *, as_pandas=False, out_metadata=False, empty_to_null=True)`

Read Dataset-JSON, NDJSON, or DSJC from a path, bytes object, or file-like object.
It returns a `pyarrow.Table` by default, or a pandas DataFrame when `as_pandas=True`.
Set `out_metadata=True` to also receive `pyreadstat`-compatible metadata.
By default, empty strings in string-like columns are converted to null on read; pass `empty_to_null=False` to keep them as empty strings.

### `write_dataset(frame, destination, metadata=None, *, output_format=None, define_xml=None, readstat_meta=None, compression_level=6, json_indent=2)`

Write a `pyarrow.Table` or pandas DataFrame as JSON, NDJSON, or DSJC.
The output format is inferred from the file suffix unless you pass `output_format` explicitly.
Metadata can come from `metadata`, `define.xml`, Arrow schema metadata, or `readstat_meta`.
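The suffix-based inference can be sketched as follows. This is a hypothetical helper for illustration, not the library's code:

```python
from pathlib import Path

# Hypothetical sketch of suffix-based output-format inference.
SUFFIX_TO_FORMAT = {".json": "json", ".ndjson": "ndjson", ".dsjc": "dsjc"}

def infer_format(destination: str) -> str:
    suffix = Path(destination).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"could not determine dataset format: {suffix!r}")
```

Passing `output_format` explicitly skips this step, which is useful when writing to a path or buffer without a meaningful suffix.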

### `detect_format(source)`

Inspect a source and return a lightweight report describing the detected format.
This is useful when you want to check the input before reading or validating it.

### `validate_dataset(source, *, validate_metadata=True, validate_data=True)`

Validate a dataset and return a diagnostic report.
You can validate only metadata, or validate both metadata and row data.

### `build_metadata(*, frame=None, metadata=None, define_xml=None, readstat_meta=None)`

Merge metadata from the available sources and return a validated Dataset-JSON metadata dictionary.
Use this when you want to inspect or prepare export metadata before writing rows.

### `build_schema(source_or_dataset)`

Build a `pyarrow.Schema` from a Dataset-JSON source or from a Dataset-JSON payload dictionary.
Use this when you need the inferred Arrow schema without loading the dataset rows.

## Common Errors

| Case | Example message |
| --- | --- |
| Unsupported input source | `unsupported input source` |
| Empty file or payload | `empty input payload` |
| Format detection failed | `could not determine dataset format` |
| Invalid DSJC payload | `invalid DSJC payload` |
| Missing required metadata | `missing required field: label` |
| JSON Schema validation failed | `validation failed at records: ...` |
| Decimal column missing required `targetDataType` | `targetDataType is required for dataType decimal` |
| Invalid `targetDataType` and `dataType` combination | `targetDataType is not allowed for dataType float` |
| Row shape does not match columns | `row does not match columns schema` |
| `records` does not match the actual row count | `records does not match actual row count` |
| Value conversion failed | `failed to convert value for column TRTSDT` |
| Unsupported Dataset-JSON `dataType` | `unsupported column type: binary` |
| pandas output requested without pandas installed | `pandas support is not installed` |

## Common Export Errors

| Case | Example message |
| --- | --- |
| `metadata` is not a dictionary | `metadata must be a dictionary` |
| Required export metadata is missing | `missing required export metadata: label` |
| Unexpected metadata key | `unexpected metadata fields: unexpected` |
| Invalid `datasetJSONCreationDateTime` | `datasetJSONCreationDateTime has invalid format` |
| Invalid `datasetJSONVersion` | `datasetJSONVersion must be 1.1.x` |
| Incomplete `sourceSystem` object | `sourceSystem requires name and version` |
| Unexpected column metadata key | `unexpected column metadata fields: badField` |
| Unsupported `targetDataType` | `unsupported targetDataType: float` |
| Decimal value exceeds configured `length` | `decimal value exceeds configured length` |
| Unsupported Arrow or pandas type | `unsupported Arrow type` |
| Invalid Define-XML syntax | `failed to parse define.xml` |
| Missing `itemGroupOID` in Define-XML | `itemGroupOID not found in define.xml` |

## Development

See [AGENTS.md](AGENTS.md).

## License

This project is licensed under the AGPL-3.0-or-later.

Code from this repository that is provided to an AI system, and code produced from that input, is treated as a derivative work.
Redistributing an AI-based reimplementation without preserving this license is considered copyright infringement.

This project is developed and maintained independently.  
To help keep it maintained, consider supporting it through sponsorship or by engaging me for contract work.   
Contact: info@knworx.com.

`tests/data/official_example` and `dsjframe/schema` are taken from https://github.com/cdisc-org/DataExchange-DatasetJson (Copyright (c) 2022 cdisc) under the MIT license.
