Metadata-Version: 2.4
Name: schema-sanitizer
Version: 0.1.1
Summary: Spec-driven data sanitization for CSV, JSON, JSONL, XML, Parquet, and Python objects.
Keywords: arrow,pyarrow,json,xml,csv,schema,sanitization
Author: bgallan
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Project-URL: Homepage, https://github.com/bgallan/schema-sanitizer
Project-URL: Repository, https://github.com/bgallan/schema-sanitizer
Project-URL: Changelog, https://github.com/bgallan/schema-sanitizer/releases
Project-URL: Issues, https://github.com/bgallan/schema-sanitizer/issues
Requires-Python: >=3.11
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pre-commit>=3.7; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: ruff>=0.6.0; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: pyarrow>=14.0.0; extra == "dev"
Requires-Dist: pandas>=2.0; extra == "dev"
Requires-Dist: polars>=0.20; extra == "dev"
Requires-Dist: duckdb>=1.0; extra == "dev"
Provides-Extra: pyarrow
Requires-Dist: pyarrow>=14.0.0; extra == "pyarrow"
Provides-Extra: polars
Requires-Dist: polars>=0.20; extra == "polars"
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == "pandas"
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.0; extra == "duckdb"
Provides-Extra: all
Requires-Dist: pyarrow>=14.0.0; extra == "all"
Requires-Dist: polars>=0.20; extra == "all"
Requires-Dist: pandas>=2.0; extra == "all"
Requires-Dist: duckdb>=1.0; extra == "all"
Description-Content-Type: text/markdown

# schema-sanitizer

**Version 0.1.1:** this project is still in a testing phase. Its core behavior
still needs heavy exercise before it should be treated as a stable production
dependency.

`schema-sanitizer` turns extremely messy semistructured data into stable,
consistent tables. It is built for CSV, JSON, JSON Lines, XML, Parquet, and
Python rows whose real-world values do not agree on one neat schema: fields
appear late, arrays and objects change shape, timestamps arrive in several
formats, scalars collide with nested values, and malformed records still need a
place to go.

The library's main purpose is to make ingestion predictable before data reaches
analytics engines, warehouses, or incremental pipelines. It scans source data,
infers a reconciled Arrow schema, converts compatible values into that schema,
and isolates rows that cannot be represented cleanly. The result is a table
that downstream tools can consume without rediscovering schema drift on every
run.

The hard parts are handled explicitly:

- **Turning messy semistructured data into tables**: mixed scalar, list, struct,
  null, date/time, and string values are reconciled into stable columns.
- **Schema reconciliation for incremental pipelines**: `base_schema` lets later
  batches align to a previous PyArrow schema. Additive mode keeps known field
  types while accepting newly observed fields, and strict mode is available
  when the schema must not drift.
- **Memory safety**: readers and converters use bounded batches, streaming
  writers, spill-to-disk paths where needed, depth limits, row-size budgets, and
  quarantine output so large or malformed inputs do not require loading the
  whole cleaned dataset into memory.
- **Max depth enforcement**: Arrow and Parquet depth budgets can cap deeply
  nested records before they exceed downstream limits such as warehouse nesting
  constraints.

Every public reader and converter returns a `Result` object with clean data,
bad rows, and stats.

The library has two public workflows:

- **In-Memory Analytics**: `read_*` functions return a `Result` whose
  `clean_data` is PyArrow, pandas, Polars, or DuckDB data.
- **File-To-File Converters**: `to_*` functions stream sanitized files to CSV,
  JSON Lines, or Parquet and return a `Result` whose `clean_data` is `None`.

```python
import schema_sanitizer as ss

events = ss.read_jsonl("raw/events.jsonl")
customers = ss.read_csv("raw/customers.csv", output_format="pandas")

table = events.clean_data
df = customers.clean_data

ss.to_parquet("raw/events.jsonl", "clean/events.parquet")
```

## Index

- [Install](#install)
- [In-Memory Analytics](#in-memory-analytics)
- [File-To-File Converters](#file-to-file-converters)
- [Result Object](#result-object)
- [Error Handling](#error-handling)
- [Schema Control](#schema-control)
- [Custom Tokens and Date/Time Patterns](#custom-tokens-and-datetime-patterns)
- [In-Memory Analytics Options](#in-memory-analytics-options)
  - [`read_csv(path, ...)`](#read_csvpath-)
  - [`read_json(path, ...)`](#read_jsonpath-)
  - [`read_json_folder(path, ...)`](#read_json_folderpath-)
  - [`read_jsonl(path, ...)`](#read_jsonlpath-)
  - [`read_xml(path, ...)`](#read_xmlpath-)
  - [`read_xml_folder(path, ...)`](#read_xml_folderpath-)
  - [`read_parquet(path, ...)`](#read_parquetpath-)
  - [`read_python(rows, ...)`](#read_pythonrows-)
- [File-To-File Converter Options](#file-to-file-converter-options)
  - [`to_csv(input_path, output_path, ...)`](#to_csvinput_path-output_path-)
  - [`to_jsonl(input_path, output_path, ...)`](#to_jsonlinput_path-output_path-)
  - [`to_parquet(input_path, output_path, ...)`](#to_parquetinput_path-output_path-)
- [Schema Inference Heuristics](#schema-inference-heuristics)
- [Base Schema Enforcement](#base-schema-enforcement)
- [Max Depth Enforcement](#max-depth-enforcement)
- [Quarantine Rows Pipeline](#quarantine-rows-pipeline)
- [Memory Safety Measures](#memory-safety-measures)
- [PyArrow Filesystem Integration](#pyarrow-filesystem-integration)
- [Supported Inputs](#supported-inputs)
- [Unsupported Inputs](#unsupported-inputs)
- [Examples](#examples)
- [Platform Notes](#platform-notes)
- [Development](#development)
- [License](#license)

## [Install](#index)

`schema-sanitizer` supports Python `>=3.11`.

For Arrow reads and file-to-file converters:

```bash
pip install 'schema-sanitizer[pyarrow]'
```

Install adapter extras for the in-memory analytics tools you use:

```bash
pip install 'schema-sanitizer[pyarrow,pandas]'
pip install 'schema-sanitizer[pyarrow,polars]'
pip install 'schema-sanitizer[pyarrow,duckdb]'
pip install 'schema-sanitizer[all]'
```

Import with an underscore:

```python
import schema_sanitizer as ss
```

## [In-Memory Analytics](#index)

Use `read_*` when you want clean data back in Python with stats.

| Function | Input | Typical use |
|---|---|---|
| `read_csv(path, ...)` | Local or PyArrow FS `.csv` file | Inspect or analyze CSV data. |
| `read_json(path, ...)` | Local or PyArrow FS `.json` file | Read JSON files into a table. |
| `read_json_folder(path, ...)` | Local or PyArrow FS folder of `.json` files | Read the `.json` files directly inside a folder as JSON Lines rows. |
| `read_jsonl(path, ...)` | Local or PyArrow FS `.jsonl` / `.ndjson` file | Read JSON Lines or NDJSON event and log data. |
| `read_xml(path, ...)` | Local or PyArrow FS `.xml` file | Read XML documents through the native sanitizer pipeline. |
| `read_xml_folder(path, ...)` | Local or PyArrow FS folder of `.xml` files | Read the `.xml` files directly inside a folder as XML document rows. |
| `read_parquet(path, ...)` | Local or PyArrow FS `.parquet` / `.pq` file | Read Parquet through the same cleaning pipeline. |
| `read_python(rows, ...)` | `list[dict]` | Clean rows already in memory. |

Readers always return a `Result`. By default, `result.clean_data` is a PyArrow table.

```python
result = ss.read_jsonl("data/events.jsonl")

print(result.clean_data.schema)
print(result.clean_data.num_rows)
print(result.stats)
```

Choose another in-memory analytics target with `output_format`.

```python
pandas_result = ss.read_csv("data/customers.csv", output_format="pandas")
polars_result = ss.read_csv("data/customers.csv", output_format="polars")
duckdb_result = ss.read_csv("data/customers.csv", output_format="duckdb")

pandas_df = pandas_result.clean_data
polars_df = polars_result.clean_data
duckdb_rel = duckdb_result.clean_data
```

Accepted `output_format` values are `pyarrow`, `pandas`, `polars`, and `duckdb`.
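Each non-default `output_format` depends on an optional extra being installed. The sketch below is not part of `schema-sanitizer`; `pick_output_format` is a hypothetical helper that uses only the standard library to find the first preferred format whose module is importable. It assumes each format name matches its Python module name, which holds for the four documented values.

```python
from importlib.util import find_spec
from typing import Optional, Sequence

# The four documented `output_format` values. Each one shares its name with
# the Python module that provides it.
ACCEPTED_FORMATS = ("pyarrow", "pandas", "polars", "duckdb")


def pick_output_format(preferred: Sequence[str] = ACCEPTED_FORMATS) -> Optional[str]:
    """Return the first preferred format whose module is importable, else None.

    Hypothetical helper for illustration; not a schema-sanitizer API.
    """
    for name in preferred:
        if name not in ACCEPTED_FORMATS:
            raise ValueError(f"unsupported output_format: {name!r}")
        # find_spec returns None when a top-level module is not installed.
        if find_spec(name) is not None:
            return name
    return None


chosen = pick_output_format(("polars", "pandas", "pyarrow"))
print(chosen)
```

When `chosen` is not `None`, it can be passed as `output_format=chosen`. The install commands above pair every adapter with `pyarrow`, so a stricter check might require both modules.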

Use `read_python` for rows that are already in memory.

```python
rows = [
    {"id": 1, "active": "yes", "score": "10.5"},
    {"id": 2, "active": "no", "score": 8},
]

result = ss.read_python(
    rows,
    true_tokens=("yes",),
    false_tokens=("no",),
)

table = result.clean_data
```

## [File-To-File Converters](#index)

Use `to_*` when you want a sanitized output file and do not need clean data in
memory. These functions stream sanitized output and return a `Result` with
`clean_data` set to `None`, plus bad rows and stats.

| Function | Output | Typical use |
|---|---|---|
| `to_csv(input_path, output_path, ...)` | CSV | Produce a flat file for spreadsheets or downstream text tools. |
| `to_jsonl(input_path, output_path, ...)` | JSON Lines | Produce one cleaned JSON object per line. |
| `to_parquet(input_path, output_path, ...)` | Parquet | Produce a typed columnar file for analytics systems. |

```python
result = ss.to_parquet("raw/orders.csv", "clean/orders.parquet")

assert result.clean_data is None
print(result.stats)

ss.to_csv("raw/events.jsonl", "clean/events.csv")
ss.to_jsonl("raw/orders.parquet", "clean/orders.jsonl")
```

Converters infer the input format from the input file extension. If the input
path has no useful extension, pass `input_format`.

```python
ss.to_parquet("raw/events", "clean/events.parquet", input_format="jsonl")
```

Accepted `input_format` values are `auto`, `csv`, `json`, `jsonl`, `ndjson`,
`xml`, and `parquet`.
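To decide up front whether a path needs an explicit `input_format`, a caller can mirror the extension check. This is a minimal sketch, not the library's detection logic: `guess_input_format` and its extension table are assumptions built from the extensions named in this README, and case-insensitive matching is also assumed.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed extension table. The values are documented `input_format` names.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "ndjson",
    ".xml": "xml",
    ".parquet": "parquet",
    ".pq": "parquet",
}


def guess_input_format(path: str) -> Optional[str]:
    """Return a format name for a known extension, or None when there is none.

    Hypothetical helper for illustration; not a schema-sanitizer API.
    """
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix)


for candidate in ("raw/events.jsonl", "raw/orders.PQ", "raw/events"):
    print(candidate, "->", guess_input_format(candidate))
```

A `None` result marks a path such as `raw/events`, where passing `input_format` explicitly is the documented fix.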

## [Result Object](#index)

All public `read_*` and `to_*` functions return `schema_sanitizer.Result`.

For readers, `result.clean_data` contains the requested clean in-memory output.
For converters, clean data is written to `output_path`, so `result.clean_data`
is always `None`.

```python
result = ss.read_csv("data/customers.csv", output_format="pandas")

df = result.clean_data
stats = result.stats
bad_rows = result.bad_rows
```

| Property or method | What it returns |
|---|---|
| `clean_data` | Clean data in the requested reader `output_format`: PyArrow table, pandas DataFrame, Polars DataFrame, or DuckDB relation. Always `None` for `to_*` converters. |
| `stats` | Dictionary of counters such as rows inferred, rows materialized, batches, skipped rows, quarantined rows, warnings, and errors. |
| `bad_rows` | Quarantined rows as a `pyarrow.Table`. The table may be empty when no rows were quarantined. |

### [Result Stats](#index)

`result.stats` is a plain `dict`. All values are integers, and a counter
defaults to `0` when the runtime did not report it.

| Property | What it means |
|---|---|
| `inferred_rows` | Rows scanned while inferring the input schema. |
| `inferred_bytes` | Approximate input bytes scanned while inferring the schema. |
| `arrow_schema_depth` | Maximum Arrow container depth found during inference. Struct and list containers count; scalar leaves and top-level field wrappers do not. |
| `parquet_schema_depth` | Maximum Parquet/BigQuery RECORD depth found during inference. Struct containers count; list containers and scalar leaves do not. |
| `materialized_rows` | Clean rows materialized for `read_*` results or written by `to_*` converters. |
| `batches` | Number of output batches materialized or written. |
| `flattened_fields` | Nested fields flattened by the selected flattening options. |
| `scalar_wrappings` | Scalar values wrapped to fit list or struct-like output shapes. |
| `skipped_rows` | Rows dropped by `on_error="skip_row"`. |
| `quarantined_rows` | Rows dropped from clean output and stored in `result.bad_rows`. |
| `warnings` | Non-fatal warnings reported by the runtime. |
| `errors` | Fatal errors reported by the runtime. |
| `soft_errors` | Recoverable row or value errors handled by policy. |

## [Error Handling](#index)

By default (`on_error="emit_null_row"`), a row that fails materialization is
kept as a null row so the row count stays stable. Choose a different policy
with `on_error`.

| Policy | Behavior |
|---|---|
| `stop` | Raise an error as soon as a row cannot be processed. |
| `skip_row` | Drop bad rows from the output. |
| `emit_null_row` | Keep row count stable by emitting a null row. |
| `quarantine` | Drop bad rows from the output and keep them in `result.bad_rows`. |

```python
result = ss.read_jsonl(
    "data/events.jsonl",
    on_error="quarantine",
)

clean = result.clean_data
print(result.stats)

bad_rows = result.bad_rows
```

Converters return the same `Result` shape as readers. Because the clean data is
written to `output_path`, converter results always have `clean_data is None`.

```python
result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    on_error="quarantine",
)

print(result.stats)
bad_rows = result.bad_rows
```
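The four policies can be pictured as branches around a per-row conversion. The following is a pure-Python sketch of that idea, not the native implementation: `apply_policy` and `to_int_id` are hypothetical names, `None` stands in for a null row, and only the `skipped_rows` and `quarantined_rows` counters are modeled.

```python
from typing import Any, Callable, Dict, List, Tuple

POLICIES = ("stop", "skip_row", "emit_null_row", "quarantine")


def apply_policy(
    rows: List[Dict[str, Any]],
    convert: Callable[[Dict[str, Any]], Dict[str, Any]],
    on_error: str = "emit_null_row",
) -> Tuple[List[Any], List[Dict[str, Any]], Dict[str, int]]:
    """Convert rows and route failures according to `on_error` (illustrative only)."""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean: List[Any] = []
    bad: List[Dict[str, Any]] = []
    stats = {"skipped_rows": 0, "quarantined_rows": 0}
    for row in rows:
        try:
            clean.append(convert(row))
        except (KeyError, TypeError, ValueError):
            if on_error == "stop":
                raise  # surface the first failure to the caller
            if on_error == "skip_row":
                stats["skipped_rows"] += 1
            elif on_error == "emit_null_row":
                clean.append(None)  # stand-in for a null row
            else:  # quarantine
                bad.append(row)
                stats["quarantined_rows"] += 1
    return clean, bad, stats


def to_int_id(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical converter that fails on non-numeric ids."""
    return {"id": int(row["id"])}


rows = [{"id": "1"}, {"id": "oops"}, {"id": "3"}]
clean, bad, stats = apply_policy(rows, to_int_id, on_error="quarantine")
print(clean, bad, stats)
```

Only `emit_null_row` preserves the input row count; `skip_row` and `quarantine` both shrink the clean output, and `quarantine` additionally keeps the rejected rows for inspection.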

## [Schema Control](#index)

Pass `base_schema` when the output must match or evolve from an expected
contract.

```python
import pyarrow as pa
import schema_sanitizer as ss

schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=schema,
    schema_mode="strict",
    on_error="quarantine",
)

table = result.clean_data
```

| Mode | Behavior |
|---|---|
| `strict` | Output exactly `base_schema`. Requires `base_schema`; inference is skipped. |
| `additive` | Keep `base_schema` field types and add newly observed fields. |

`column_order` defaults to `base_schema_first`. Use `column_order="sorted"` for
lexicographic field ordering.
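The difference between the two modes and the two orderings can be sketched with plain dictionaries that map field names to type names. This is an illustration only: `reconcile` is a hypothetical function, type names are simple strings rather than Arrow types, and two details are assumptions: new fields are appended in the order they were observed, and `sorted` ordering is applied in both modes.

```python
from typing import Dict, Optional


def reconcile(
    base: Optional[Dict[str, str]],
    observed: Dict[str, str],
    schema_mode: str = "additive",
    column_order: str = "base_schema_first",
) -> Dict[str, str]:
    """Merge an observed schema into a base schema (illustrative only)."""
    if schema_mode == "strict":
        if base is None:
            raise ValueError("strict mode requires base_schema")
        merged = dict(base)  # observed fields are ignored entirely
    elif schema_mode == "additive":
        merged = dict(base or {})
        for name, type_name in observed.items():
            # Known fields keep their base type; only new names are added.
            merged.setdefault(name, type_name)
    else:
        raise ValueError(f"unknown schema_mode: {schema_mode!r}")
    if column_order == "sorted":
        return dict(sorted(merged.items()))
    return merged  # base fields first, then new fields as observed


base = {"id": "int64", "email": "string"}
observed = {"email": "int64", "country": "string", "id": "string"}

additive = reconcile(base, observed)
strict = reconcile(base, observed, schema_mode="strict")
ordered = reconcile(base, observed, column_order="sorted")
print(additive, strict, list(ordered))
```

In the additive result, `email` stays `string` even though the batch observed `int64`, while `country` is appended as a new field.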

## [Custom Tokens and Date/Time Patterns](#index)

Use `true_tokens` and `false_tokens` when boolean values use domain-specific
strings. Use temporal regex options when dates or times do not match the built-in
parsers.

```python
result = ss.read_csv(
    "data/events.csv",
    true_tokens=("yes", "enabled", "1"),
    false_tokens=("no", "disabled", "0"),
    timestamp_patterns=(
        r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$",
    ),
    date_patterns=(
        r"^(\d{4})\.(\d{2})\.(\d{2})$",
    ),
    time_patterns=(
        r"^(\d{2})h(\d{2})m(\d{2})s$",
    ),
)

table = result.clean_data
```

For `timestamp_patterns`, capture groups 1-6 are year, month, day, hour,
minute, and second. Optional group 7 may contain fractions, and group 8 may
contain a timezone. For `date_patterns`, groups 1-3 are year, month, and day.
For `time_patterns`, groups 1-3 are hour, minute, and second.
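Before passing a pattern to a reader, it can help to confirm that its capture groups land in the documented order. The sketch below checks the three patterns from the example with Python's `re` module. `preview_groups` is a hypothetical helper, and the regex engine used inside the library may differ from `re` in edge cases, so treat this as a sanity check rather than a guarantee.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Patterns copied from the example above.
TIMESTAMP_PATTERN = r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$"
DATE_PATTERN = r"^(\d{4})\.(\d{2})\.(\d{2})$"
TIME_PATTERN = r"^(\d{2})h(\d{2})m(\d{2})s$"


def preview_groups(pattern: str, text: str) -> Optional[Tuple[int, ...]]:
    """Return the captured groups as integers, or None when the text does not match.

    Hypothetical helper for illustration; it only handles all-numeric groups,
    so patterns using the optional fraction or timezone groups need more care.
    """
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


timestamp_parts = preview_groups(TIMESTAMP_PATTERN, "2024/03/09 14:05:59")
date_parts = preview_groups(DATE_PATTERN, "2024.03.09")
time_parts = preview_groups(TIME_PATTERN, "14h05m59s")

# Building a datetime catches swapped groups such as month and day.
parsed = datetime(*timestamp_parts)
print(timestamp_parts, date_parts, time_parts, parsed.isoformat())
```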

## [In-Memory Analytics Options](#index)

Each reader accepts the parameters listed in its section.

<a id="read_csvpath-"></a>

### [`read_csv(path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local or PyArrow FS CSV file to read. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. Groups 1-6 map to year, month, day, hour, minute, second; group 7 may hold fractions and group 8 timezone. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. Groups 1-3 map to year, month, day. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. Groups 1-3 map to hour, minute, second. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `csv_has_header` | `True` | `bool` | Whether the first CSV row is a header. |
| `csv_delimiter` | `,` | single-character string | CSV delimiter. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode CSV bytes. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming CSV reads. |

<a id="read_jsonpath-"></a>

### [`read_json(path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local or PyArrow FS JSON file to read. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode JSON bytes. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming JSON reads. |

<a id="read_json_folderpath-"></a>

### [`read_json_folder(path, ...)`](#index)

`read_json_folder` reads the direct `.json` children of a local folder or
PyArrow filesystem folder URI in deterministic filename order. Folder
exploration is not recursive. Each source file must contain one JSON document;
the reader compacts those documents into a temporary JSON Lines stream and then
runs the same sanitizer path used by `read_json`.
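The compaction step can be approximated with the standard library. The sketch below is not the reader's implementation: `compact_json_folder` is a hypothetical function, it assumes "deterministic filename order" means a plain lexicographic sort, and it loads each document fully, whereas the real reader works within memory budgets.

```python
import json
import tempfile
from pathlib import Path


def compact_json_folder(folder: Path, output_path: Path) -> int:
    """Write one compact JSON line per direct `.json` child and return the file count."""
    files = sorted(
        (p for p in Path(folder).iterdir() if p.is_file() and p.suffix == ".json"),
        key=lambda p: p.name,
    )
    with open(output_path, "w", encoding="utf-8") as out:
        for path in files:
            document = json.loads(path.read_text(encoding="utf-8"))
            # Compact separators keep each document on a single line.
            out.write(json.dumps(document, separators=(",", ":")) + "\n")
    return len(files)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "raw"
    (source / "nested").mkdir(parents=True)
    (source / "b.json").write_text('{"id": 2}', encoding="utf-8")
    (source / "a.json").write_text('{"id": 1,\n "tags": ["x"]}', encoding="utf-8")
    # Files in subfolders are ignored because the scan is not recursive.
    (source / "nested" / "c.json").write_text('{"id": 3}', encoding="utf-8")

    target = root / "compacted.jsonl"
    count = compact_json_folder(source, target)
    lines = target.read_text(encoding="utf-8").splitlines()

print(count, lines)
```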

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local folder or PyArrow FS folder URI containing `.json` files. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode each source JSON file. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-document and per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for the compacted JSON Lines stream. |

<a id="read_jsonlpath-"></a>

### [`read_jsonl(path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local or PyArrow FS JSON Lines or NDJSON file to read. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode JSON Lines or NDJSON bytes. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming JSON Lines or NDJSON reads. |

<a id="read_xmlpath-"></a>

### [`read_xml(path, ...)`](#index)

`read_xml` parses a local XML document in the native C++ frontend and sends the
resulting rows through the same schema inference, cleaning, quarantine, and
output adapter pipeline as the JSON and CSV readers.

By default, the root element is treated as one row, like a single JSON object.
Pass `xml_row_tag="row"` when a file contains repeated direct child elements
that should become separate rows; the XML scanner then streams each matching
row element. Attributes become fields prefixed with `@`, repeated child tags
become lists, and mixed element text is stored under `#text`.

```python
result = ss.read_xml(
    "raw/orders.xml",
    xml_row_tag="order",
    read_chunk_bytes=1024 * 1024,
    batch_memory_limit_bytes=256 * 1024 * 1024,
)
```
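To make the mapping rules concrete, here is a rough approximation built on the standard library's `xml.etree.ElementTree`. It is not the native C++ frontend: `element_to_value` and `rows_from_document` are hypothetical names, the sketch loads the whole document instead of streaming, it ignores tail text after child elements, and the handling of empty elements is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Any, List, Optional


def element_to_value(element: ET.Element) -> Any:
    """Map an element to a scalar or dict using the `@attr` and `#text` conventions."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text or None  # plain leaf element becomes a scalar
    record = {f"@{name}": value for name, value in element.attrib.items()}
    for child in children:
        value = element_to_value(child)
        if child.tag in record:
            existing = record[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                record[child.tag] = [existing, value]  # repeated tag becomes a list
        else:
            record[child.tag] = value
    if text:
        record["#text"] = text  # text mixed with attributes or children
    return record


def rows_from_document(xml_text: str, xml_row_tag: Optional[str] = None) -> List[Any]:
    """Return one row for the root, or one row per matching direct child."""
    root = ET.fromstring(xml_text)
    if xml_row_tag is None:
        return [element_to_value(root)]
    return [element_to_value(child) for child in root if child.tag == xml_row_tag]


DOC = """<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2" rush="true">call first<item>pad</item></order>
</orders>"""

xml_rows = rows_from_document(DOC, xml_row_tag="order")
print(xml_rows)
```

In this sketch `item` is a list in the first row and a scalar in the second. That kind of shape drift between rows is what the schema reconciliation step is meant to resolve.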

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local or PyArrow FS XML file to read. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode XML bytes when transcoding is needed. |
| `xml_row_tag` | `None` | XML element tag name or `None` | Direct child element tag to stream as separate rows. `None` treats the whole document as one row. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming text input reads. |

<a id="read_xml_folderpath-"></a>

### [`read_xml_folder(path, ...)`](#index)

`read_xml_folder` reads the direct `.xml` children of a local folder or PyArrow
filesystem folder URI in deterministic filename order. Folder exploration is
not recursive. Each source file must contain one XML document, and all
documents must use the same root tag unless you pass that tag explicitly as
`xml_row_tag`. The reader wraps those documents in a temporary XML stream and
then runs the same sanitizer path used by `read_xml`.

```python
result = ss.read_xml_folder(
    "raw/order-events",
    xml_row_tag="order",
    batch_memory_limit_bytes=256 * 1024 * 1024,
)
```

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local folder or PyArrow FS folder URI containing `.xml` files. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode each source XML file. |
| `xml_row_tag` | `None` | XML element tag name or `None` | Expected XML document root tag. `None` infers it from the first file. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-document-row memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for the compacted XML stream. |

<a id="read_parquetpath-"></a>

### [`read_parquet(path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `path` | required | `str` or path-like object | Local or PyArrow FS Parquet file to read. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |

<a id="read_pythonrows-"></a>

### [`read_python(rows, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `rows` | required | `list[dict]` | In-memory rows to normalize. |
| `output_format` | `pyarrow` | `pyarrow`, `pandas`, `polars`, `duckdb` | Type stored in `Result.clean_data`. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort memory budget for the already-resident Python payload. |

## [File-To-File Converter Options](#index)

Converters accept local paths or PyArrow filesystem URI strings for both input
and output. The input format is inferred from the input file extension unless
you pass `input_format`.
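
As a rough illustration of extension-based format selection, the sketch below maps a path or URI to an `input_format` value. The helper name `guess_input_format`, the lowercase normalization, and the `pq` to `parquet` mapping are assumptions made for this example; it is not the library's actual resolver.

```python
import os

# Hypothetical extension table, built from the input extensions this README lists.
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "ndjson",
    ".xml": "xml",
    ".parquet": "parquet",
    ".pq": "parquet",  # assumption: `pq` is treated as Parquet
}


def guess_input_format(input_path: str, input_format: str = "auto") -> str:
    """Return the explicit format, or infer one from the file extension."""
    if input_format != "auto":
        return input_format  # an explicit selector always wins
    extension = os.path.splitext(input_path)[1].lower()
    if extension not in _EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer input format from {input_path!r}")
    return _EXTENSION_TO_FORMAT[extension]


print(guess_input_format("s3://raw-bucket/events/2026-06-12.jsonl"))
print(guess_input_format("exports/dump.txt", input_format="csv"))
```

Passing `input_format` explicitly is the safer choice when a file's extension does not describe its content.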

<a id="to_csvinput_path-output_path-"></a>

### [`to_csv(input_path, output_path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `input_path` | required | `str` or path-like object | Local file or PyArrow FS URI to sanitize. |
| `output_path` | required | `str` or path-like object | Local or PyArrow FS URI CSV file to create. |
| `input_format` | `auto` | `auto`, `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet` | Input format selector. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `csv_has_header` | `True` | `bool` | Whether CSV input has a header. |
| `csv_delimiter` | `,` | single-character string | CSV input delimiter. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| `xml_row_tag` | `None` | XML element tag name or `None` | Direct child XML element tag to stream as separate rows when reading XML input. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming text input reads. |

<a id="to_jsonlinput_path-output_path-"></a>

### [`to_jsonl(input_path, output_path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `input_path` | required | `str` or path-like object | Local file or PyArrow FS URI to sanitize. |
| `output_path` | required | `str` or path-like object | Local or PyArrow FS URI JSON Lines file to create. |
| `input_format` | `auto` | `auto`, `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet` | Input format selector. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `csv_has_header` | `True` | `bool` | Whether CSV input has a header. |
| `csv_delimiter` | `,` | single-character string | CSV input delimiter. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| `xml_row_tag` | `None` | XML element tag name or `None` | Direct child XML element tag to stream as separate rows when reading XML input. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming text input reads. |

<a id="to_parquetinput_path-output_path-"></a>

### [`to_parquet(input_path, output_path, ...)`](#index)

| Parameter | Default | Accepted values | What it controls |
|---|---:|---|---|
| `input_path` | required | `str` or path-like object | Local file or PyArrow FS URI to sanitize. |
| `output_path` | required | `str` or path-like object | Local or PyArrow FS URI Parquet file to create. |
| `input_format` | `auto` | `auto`, `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet` | Input format selector. |
| `base_schema` | `None` | `pyarrow.Schema` or `None` | Optional base output contract. |
| `schema_mode` | `additive` | `additive`, `strict` | How inferred fields reconcile with `base_schema`. |
| `column_order` | `base_schema_first` | `base_schema_first`, `sorted` | Output field ordering. |
| `parse_integers` | `True` | `bool` | Parse integer-looking strings as integers. |
| `parse_floats` | `True` | `bool` | Parse float-looking strings as floats. |
| `true_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `true`. |
| `false_tokens` | `()` | sequence of strings | String tokens interpreted as boolean `false`. |
| `timestamp_patterns` | `()` | sequence of regex strings | Extra timestamp parsers. |
| `date_patterns` | `()` | sequence of regex strings | Extra date parsers. |
| `time_patterns` | `()` | sequence of regex strings | Extra time parsers. |
| `arrow_max_depth` | `32` | integer `>= 0` | Maximum Arrow container depth for object and array expansion. |
| `parquet_max_depth` | `15` | integer `>= 0` | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| `scalar_object_key` | `default_key` | string | Key used when a scalar must be wrapped as an object. |
| `csv_has_header` | `True` | `bool` | Whether CSV input has a header. |
| `csv_delimiter` | `,` | single-character string | CSV input delimiter. |
| `input_text_encoding` | `utf-8` | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| `xml_row_tag` | `None` | XML element tag name or `None` | Direct child XML element tag to stream as separate rows when reading XML input. |
| `on_error` | `emit_null_row` | `stop`, `skip_row`, `emit_null_row`, `quarantine` | Row-level error policy. |
| `batch_memory_limit_bytes` | `None` | positive integer bytes or `None` | Best-effort per-batch memory budget. |
| `read_chunk_bytes` | `1048576` | positive integer bytes | Chunk size for streaming text input reads. |

## [Schema Inference Heuristics](#index)

Schema inference scans the full source before materialization whenever inference
runs. It is not a sample-based inference step: in inferred mode and additive
`base_schema` mode, every source row is consumed during inference and counted in
`Result.stats["inferred_rows"]`.

For each inferred row, the sanitizer applies two internal passes:

1. The shape pass discovers structural paths: field names,
   objects, arrays, and fields that must be flattened by depth limits.
1. The statistics pass collects scalar type evidence
   for the discovered shape: booleans, integers, floats, timestamps, dates,
   times, strings, nulls, and mixed-type conflicts.

When `schema_mode="strict"` is used with an explicit `base_schema`, the
sanitizer skips inference and uses the schema contract directly; in that fast
path, `inferred_rows` is `0`. Strict mode only works with `base_schema`. Passing
`schema_mode="strict"` without `base_schema` raises an exception.

Separating shape discovery from scalar statistics keeps list and struct
decisions stable across messy inputs. If one row has an object and another row
has a scalar at the same field, the structural shape wins and the scalar is
wrapped under `scalar_object_key` (`default_key` by default). If one row has a
list and another row has a scalar at the same field, the list shape wins and the
scalar is wrapped as a single list element.
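
The wrapping rules above can be pictured with a small pure-Python sketch. The helper `wrap_to_shape` is hypothetical and only mirrors the documented behavior for scalars; leaving nulls untouched is an assumption of this example, not a statement about the native implementation.

```python
from typing import Any

SCALAR_OBJECT_KEY = "default_key"  # documented default of `scalar_object_key`


def wrap_to_shape(value: Any, shape: str, scalar_object_key: str = SCALAR_OBJECT_KEY) -> Any:
    """Illustrative only: fit a scalar into the structural shape that won inference."""
    is_scalar = value is not None and not isinstance(value, (dict, list))
    if not is_scalar:
        return value  # containers and nulls are left alone in this sketch
    if shape == "object":
        return {scalar_object_key: value}  # scalar wrapped under the configured key
    if shape == "list":
        return [value]  # scalar wrapped as a single list element
    return value


rows = [{"owner": {"name": "ada"}}, {"owner": "bob"}]
wrapped = [wrap_to_shape(row["owner"], "object") for row in rows]
print(wrapped)
```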

Scalar inference is conservative:

- Nulls do not choose a type by themselves.
- Boolean JSON values infer `bool`.
- Numeric JSON values infer `int64` or `float64`.
- Strings can infer booleans, integers, floats, timestamps, dates, or times
  when the configured token and parser options match.
- Mixed scalar kinds fall back to `string`.
- Objects or arrays observed where a scalar is required are stringified.
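
A minimal sketch of the null and mixed-kind rules, assuming each observed value has already been classified into a kind name. The function `merge_scalar_kinds` is hypothetical. It does not model numeric promotion between `int64` and `float64`, which this section does not specify, and it returns `None` when only nulls were seen rather than guessing what the library emits in that case.

```python
from typing import Iterable, Optional


def merge_scalar_kinds(kinds: Iterable[Optional[str]]) -> Optional[str]:
    """Illustrative only: reduce the kinds observed for one field to a single kind."""
    seen = {kind for kind in kinds if kind is not None}  # nulls carry no type evidence
    if not seen:
        return None  # only nulls observed, so nothing is decided here
    if len(seen) == 1:
        return next(iter(seen))  # consistent evidence keeps its kind
    return "string"  # mixed scalar kinds fall back to string


print(merge_scalar_kinds([None, "bool", None]))
print(merge_scalar_kinds(["int64", "timestamp"]))
```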

List inference is stricter than top-level object inference. Lists remain typed
only when their element shape is conflict-free. Lists of scalars and lists of
structs are supported; nested lists or conflicts inside a list element fall back
to `list<string>` so each list column has one stable element type.

## [Base Schema Enforcement](#index)

`base_schema` is an output contract and only accepts a `pyarrow.Schema`. The
sanitizer converts it to the same internal logical schema representation used by
inference before planning materialization.

```python
import pyarrow as pa
import schema_sanitizer as ss

user_schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=user_schema,
    schema_mode="strict",
)
```

`schema_mode="strict"` uses `base_schema` as the complete output schema. The
inference loop is skipped, so `Result.stats["inferred_rows"]` is `0`. Strict
mode only works when `base_schema` is provided; otherwise the call raises an
exception before materialization. Extra source fields are rejected because they
are not present in the strict contract.

`schema_mode="additive"` requires inference. The full source is scanned, then
the inferred schema is reconciled with `base_schema`: fields already present in
`base_schema` keep their declared types and nullable flags, while newly observed
fields are added from inference. Row values that cannot be coerced into the
declared base field type are handled by `on_error`.

For fields present in `base_schema`, the base type wins even when source rows
contain conflicting values. A value such as `"unknown"` in a base `int64` field,
an object in a base scalar field, or a scalar in a base struct/list field is a
materialization conflict. The row is stopped, skipped, quarantined, or replaced
with a null row according to `on_error`. For fields not present in
`base_schema`, conflicts are resolved by the normal inference heuristics before
the field is added: mixed scalar kinds fall back to `string`, object/scalar and
list/scalar conflicts use the wrapping rules, and conflicting list element
shapes fall back to `list<string>`.

`column_order` controls only output field order after reconciliation.
`column_order="base_schema_first"` preserves base fields first, then appends new
fields. `column_order="sorted"` emits fields in lexicographic order.
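
The two orderings can be sketched on plain field-name lists. `order_fields` is a hypothetical helper; keeping newly observed fields in first-seen order and sorting by Python string comparison are assumptions of this example.

```python
def order_fields(base_fields, inferred_fields, column_order="base_schema_first"):
    """Illustrative only: order output field names after reconciliation."""
    # Base fields come first; new fields follow in the order they were observed.
    merged = list(base_fields) + [name for name in inferred_fields if name not in base_fields]
    if column_order == "base_schema_first":
        return merged
    if column_order == "sorted":
        return sorted(merged)  # plain string ordering
    raise ValueError(f"unknown column_order: {column_order!r}")


base = ["id", "email"]
inferred = ["zip", "email", "age"]
print(order_fields(base, inferred))
print(order_fields(base, inferred, column_order="sorted"))
```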

## [Max Depth Enforcement](#index)

Depth enforcement uses two independent limits because Arrow and
Parquet/BigQuery count nested data differently:

- `arrow_max_depth` defaults to `32`. It counts Arrow container depth: `struct`
  and `list` containers count, while scalar leaves and top-level field wrappers
  do not.
- `parquet_max_depth` defaults to `15`. It counts Parquet/BigQuery RECORD depth:
  `struct` containers count, while `list` containers, scalar leaves, and
  top-level field wrappers do not.

The sanitizer flattens a named field to `<name>_flattened` when keeping that
field's full nested value would exceed either limit. The flattened value is
stored as a string.

Depth examples:

| Shape | `arrow_schema_depth` | `parquet_schema_depth` |
|---|---:|---:|
| `id: int64` | 0 | 0 |
| `user: struct<id: int64>` | 1 | 1 |
| `tags: list<string>` | 1 | 0 |
| `authors: list<struct<name: string>>` | 2 | 1 |
| `asset: struct<authors: list<struct<name: string>>>` | 3 | 2 |

Use `arrow_max_depth` as a defensive complexity limit for Arrow/Parquet
container nesting. Keep `parquet_max_depth` at its default of `15` when the
output Parquet will be read by BigQuery external tables, where the practical
limit is nested RECORD depth rather than physical list wrapper depth.

The reported `Result.stats["arrow_schema_depth"]` and
`Result.stats["parquet_schema_depth"]` use the same counting rules as the
enforcement options.
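
The counting rules can be reproduced with a short recursive sketch. The type encoding used here (scalar names as strings, `("list", element)` and `("struct", {name: type})` tuples) is invented for the example and is not a `schema-sanitizer` or PyArrow structure.

```python
def nesting_depth(field_type, count_lists):
    """Illustrative only: depth of one field type under the documented rules."""
    if isinstance(field_type, str):
        return 0  # scalar leaves never count
    kind, inner = field_type
    if kind == "list":
        # Lists count for Arrow depth but not for Parquet/BigQuery RECORD depth.
        return (1 if count_lists else 0) + nesting_depth(inner, count_lists)
    if kind == "struct":
        children = [nesting_depth(child, count_lists) for child in inner.values()]
        return 1 + max(children, default=0)
    raise ValueError(f"unknown container kind: {kind!r}")


def arrow_depth(field_type):
    return nesting_depth(field_type, count_lists=True)


def parquet_depth(field_type):
    return nesting_depth(field_type, count_lists=False)


authors = ("list", ("struct", {"name": "string"}))
asset = ("struct", {"authors": authors})
print(arrow_depth(asset), parquet_depth(asset))  # 3 2, matching the last table row
```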

## [Quarantine Rows Pipeline](#index)

Use `on_error="quarantine"` when you want clean output to continue while keeping
failed rows for inspection or replay. Rows that fail materialization are dropped
from `clean_data` or the converter output file and appended to
`Result.bad_rows`.

`bad_rows` is a PyArrow table with diagnostic metadata:

| Column | What it contains |
|---|---|
| `row_index` | Zero-based source row index. |
| `source_offset` | Byte offset or source-relative offset when available. |
| `code` / `code_str` | Machine-readable diagnostic code. |
| `path_id` | Internal field path id associated with the failure. |
| `detail` | Human-readable error detail. |
| `context_snippet` | Short preview of the offending source row. |
| `raw_row` | Full raw source row text when available. |

For in-memory reads, inspect `result.bad_rows` directly:

```python
import schema_sanitizer as ss

# event_schema is a pyarrow.Schema you have defined beforehand.
result = ss.read_jsonl(
    "data/events.jsonl",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

clean = result.clean_data
bad_rows = result.bad_rows
print(result.stats["quarantined_rows"])
```

For file-to-file converters, the clean output is written to `output_path` and
the same `Result.bad_rows` table carries quarantined rows:

```python
import schema_sanitizer as ss

# event_schema is a pyarrow.Schema you have defined beforehand.
result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

bad_rows = result.bad_rows
```

Quarantine is row-level. If one field in a row cannot be coerced into the output
schema, the whole row is excluded from clean output and recorded once in
`bad_rows`. In contrast, `on_error="skip_row"` drops the row without retaining
it, and `on_error="emit_null_row"` keeps row count stable by writing a null row
instead of recording it in `bad_rows`.
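
To compare the four `on_error` policies side by side, here is a pure-Python sketch over a list of rows. `apply_on_error` and `coerce_event` are hypothetical helpers, and the dictionaries stand in for the Arrow tables the library returns; only the row-level semantics follow the description above.

```python
from typing import Any, Callable


def apply_on_error(
    rows: list[dict[str, Any]],
    coerce: Callable[[dict[str, Any]], dict[str, Any]],
    fields: list[str],
    on_error: str = "emit_null_row",
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Illustrative only: route rows that fail coercion according to a policy."""
    clean, bad_rows = [], []
    for row_index, row in enumerate(rows):
        try:
            clean.append(coerce(row))
        except (TypeError, ValueError) as exc:
            if on_error == "stop":
                raise  # abort on the first failing row
            if on_error == "emit_null_row":
                clean.append(dict.fromkeys(fields))  # keeps row count stable
            elif on_error == "quarantine":
                bad_rows.append(
                    {"row_index": row_index, "detail": str(exc), "raw_row": repr(row)}
                )
            elif on_error != "skip_row":  # skip_row drops the row silently
                raise ValueError(f"unknown on_error policy: {on_error!r}") from exc
    return clean, bad_rows


def coerce_event(row: dict[str, Any]) -> dict[str, Any]:
    return {"id": int(row["id"])}  # raises ValueError for values such as "unknown"


rows = [{"id": "1"}, {"id": "unknown"}, {"id": "3"}]
clean, bad_rows = apply_on_error(rows, coerce_event, ["id"], on_error="quarantine")
print(clean, [bad["row_index"] for bad in bad_rows])
```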

## [Memory Safety Measures](#index)

The sanitizer is designed to process large local files and PyArrow filesystem
URI inputs without requiring the whole clean dataset to live in Python memory.

- File-to-file converters stream sanitized batches directly to the output file.
  `Result.clean_data` is `None` for converters, so the clean table is not
  materialized in memory.
- PyArrow filesystem file inputs are opened as seekable streams. CSV, JSON,
  JSON Lines, NDJSON, and XML URI inputs are not copied to a temporary file;
  their bytes are read by the same chunked native scanner used for local files.
- PyArrow filesystem outputs are opened with `pyarrow.fs.open_output_stream`.
  CSV, JSON Lines, and Parquet converters write incrementally to that stream
  instead of staging the full output in a local temporary file.
- CSV, JSON, JSON Lines, and NDJSON readers use `read_chunk_bytes` to bound
  input chunks while scanning.
- XML without `xml_row_tag` is parsed into a native document tree before row
  emission, so `batch_memory_limit_bytes` limits the accumulated document size
  before the tree is built.
- XML with `xml_row_tag` streams matching direct child elements. The scanner
  reads bounded chunks, discards completed row slices, and raises
  `SchemaSanitizerResourceError` if the active XML buffer exceeds
  `batch_memory_limit_bytes`.
- Local and PyArrow filesystem folder readers (`read_json_folder` and
  `read_xml_folder`) list direct child files only, then compact one source
  document at a time into a local temporary JSON Lines or XML stream. The temp
  file is the bridge that lets many single-document files reuse the normal
  streaming sanitizer pipeline without building one large Python object.
- Folder temp streams contain only the compacted input representation, not the
  final clean dataset. With `batch_memory_limit_bytes`, each source document is
  checked before it is decoded and added to that stream. If a PyArrow filesystem
  does not report a child file size, the child is read in bounded chunks and the
  reader stops at `batch_memory_limit_bytes + 1` bytes before raising
  `SchemaSanitizerResourceError`.
- Folder temp files are deleted when the read finishes, and partially written
  temp files are deleted if compaction raises an exception. If the Python
  process is killed externally, for example with `SIGKILL`, the operating
  system may not give `schema-sanitizer` a chance to run that cleanup.
- Parquet inputs are decoded by PyArrow into record batches and exposed to the
  native JSON frontend through a seekable JSON Lines byte reader. Rows are
  produced incrementally; the Parquet-to-JSONL adapter does not stage a full
  conversion file.
- XML DTD and entity declarations are rejected. The XML frontend does not load
  external entities or expand document-defined entities.
- `batch_memory_limit_bytes` maps to the native per-batch
  `memory_limit_bytes` budget. It reduces inference and output batch sizes
  instead of changing the final schema.
- For already-resident Python inputs, `batch_memory_limit_bytes` is enforced as
  a preflight resource guard. If the Python payload is already larger than the
  configured limit, the call raises `SchemaSanitizerResourceError` before native
  ingestion starts.
- `arrow_max_depth` and `parquet_max_depth` cap nested expansion. Values beyond
  those limits are flattened to strings, preventing unbounded container nesting
  from creating very wide or deeply nested Arrow/Parquet schemas.
- Native parsing and materialization use owned streams, arenas, and Arrow C Data
  resources that are closed when the `Result`, stream, or sink is closed or
  dropped. Table-producing readers force stream materialization and close native
  resources before returning.

Configured resource-limit failures raise `SchemaSanitizerResourceError` and
include `limit_name="memory_limit_bytes"` in their detail payload when
available. True allocator failures are reported separately as
`SchemaSanitizerOutOfMemoryError`.
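
The bounded read described above for folder children without a reported size can be sketched as follows. `read_bounded` and `ResourceLimitError` are hypothetical stand-ins written for this example; the real guard lives in the library and raises `SchemaSanitizerResourceError`.

```python
import io


class ResourceLimitError(Exception):
    """Hypothetical stand-in for SchemaSanitizerResourceError."""


def read_bounded(stream, limit_bytes, chunk_bytes=1 << 20):
    """Illustrative only: read a stream but never more than limit_bytes + 1 bytes."""
    budget = limit_bytes + 1  # one byte past the limit is enough to prove oversize
    buffer = bytearray()
    while len(buffer) < budget:
        chunk = stream.read(min(chunk_bytes, budget - len(buffer)))
        if not chunk:
            break  # end of stream reached within budget
        buffer += chunk
    if len(buffer) > limit_bytes:
        raise ResourceLimitError(f"document exceeds {limit_bytes} bytes")
    return bytes(buffer)


print(read_bounded(io.BytesIO(b'{"id": 1}'), limit_bytes=64))
```

Stopping at `limit_bytes + 1` means an oversized document is rejected without ever being read in full.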

## [PyArrow Filesystem Integration](#index)

When PyArrow is installed, every file reader and file-to-file converter can use
`pyarrow.fs` URI strings. This covers `read_csv`, `read_json`,
`read_json_folder`, `read_jsonl`, `read_xml`, `read_xml_folder`,
`read_parquet`, `to_csv`, `to_jsonl`, and `to_parquet`. Supported URI input
extensions include `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet`, and
`pq`. Supported URI converter output extensions include `csv`, `jsonl`, and
`parquet`.

For normal local files, prefer a regular path:

```python
events = ss.read_jsonl("/home/user/data/events.jsonl")
```

Regular local paths are the simplest and usually best choice for local disk
access. They avoid PyArrow URI parsing and filesystem dispatch.

`file://` is PyArrow's local-filesystem URI scheme. On Linux and WSL, absolute
local paths use three slashes: `file:///home/user/data/events.jsonl`. That URI
points to the same file as `/home/user/data/events.jsonl`, but it is opened
through `pyarrow.fs.LocalFileSystem`. Use it when you specifically want to test
the PyArrow filesystem route or when your code passes filesystem URIs
consistently across local and cloud storage. Do not write `file://home/user/...`;
that form has `home` in the URI host position instead of being an absolute local
path.
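
The difference between the two slash forms is visible with Python's standard URI parser. This uses `urllib.parse` rather than PyArrow's own URI handling, so treat it as an illustration of general URI structure, not of PyArrow internals.

```python
from urllib.parse import urlsplit

good = urlsplit("file:///home/user/data/events.jsonl")
bad = urlsplit("file://home/user/data/events.jsonl")

# Three slashes: empty host, full absolute path.
print(repr(good.netloc), good.path)  # '' /home/user/data/events.jsonl

# Two slashes: "home" is parsed as the host and drops out of the path.
print(repr(bad.netloc), bad.path)  # 'home' /user/data/events.jsonl
```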

| Local form | Example | Opens through | Best use |
| --- | --- | --- | --- |
| Regular local path | `/home/user/data/events.jsonl` | schema-sanitizer local path handling | Default for local disk files. |
| Local PyArrow URI | `file:///home/user/data/events.jsonl` | `pyarrow.fs.LocalFileSystem` | Testing or URI-only code paths. |

Common URI forms:

| Storage | Example URI |
| --- | --- |
| Local file through PyArrow | `file:///home/user/data/events.jsonl` |
| Amazon S3 | `s3://raw-bucket/events/2026-06-12.jsonl` |
| Amazon S3 folder | `s3://raw-bucket/events/2026-06-12/` |
| Google Cloud Storage | `gs://raw-bucket/assets/2026-06-12.parquet` |
| Google Cloud Storage folder | `gs://raw-bucket/assets/2026-06-12/` |
| Google Cloud Storage alias | `gcs://raw-bucket/assets/2026-06-12.xml` |
| Azure Data Lake Storage Gen2 | `abfs://container@account.dfs.core.windows.net/events/2026-06-12.jsonl` |
| Azure Data Lake Storage Gen2 folder | `abfs://container@account.dfs.core.windows.net/events/2026-06-12/` |

Cloud URI support depends on the installed PyArrow build and the normal
provider credentials/configuration available to PyArrow.

```python
import schema_sanitizer as ss

events = ss.read_jsonl("s3://raw-bucket/events/2026-06-12.jsonl")
assets = ss.read_parquet("gs://raw-bucket/assets/2026-06-12.parquet")
daily_events = ss.read_json_folder("s3://raw-bucket/events/2026-06-12/")

ss.to_parquet(
    "s3://raw-bucket/events/2026-06-12.jsonl",
    "gs://clean-bucket/events/2026-06-12.parquet",
)
```

URI file inputs are opened as seekable PyArrow files. CSV, JSON, JSON Lines,
NDJSON, and XML bytes are fed directly to the native chunk scanner. Parquet is
decoded with `pyarrow.parquet` into batches, converted incrementally to JSON
Lines bytes, and then fed to the same native sanitizer path. No single-file URI
input is copied to a temporary file by `schema-sanitizer`.

Folder URI inputs are listed with non-recursive `pyarrow.fs.FileSelector`.
`read_json_folder` filters direct `.json` child files and `read_xml_folder`
filters direct `.xml` child files. The matching children are sorted by
filename, then compacted one document at a time into a local temporary stream
before the normal sanitizer pipeline reads that stream.
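
For intuition, the same selection rule (direct children only, filter by extension, sort by filename) looks like this on a local folder with `pathlib`. `list_direct_children` is a hypothetical helper; the library itself uses `pyarrow.fs.FileSelector` for URI folders, and case-sensitive suffix matching is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def list_direct_children(folder, suffix):
    """Illustrative only: non-recursive listing filtered by suffix, sorted by filename."""
    matches = (
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix == suffix
    )
    return sorted(matches, key=lambda path: path.name)


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    for name in ("b.json", "a.json", "notes.xml"):
        (root / name).write_text("{}")
    (root / "nested").mkdir()
    (root / "nested" / "c.json").write_text("{}")  # ignored: not a direct child
    selected = [path.name for path in list_direct_children(root, ".json")]

print(selected)  # ['a.json', 'b.json']
```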

URI outputs are opened with `pyarrow.fs.open_output_stream`. CSV and Parquet
writers stream Arrow batches to that output stream, and JSON Lines writes UTF-8
bytes incrementally. The output URI is not staged through a local temporary file.

## [Supported Inputs](#index)

Supported inputs are intentionally file-oriented:

- Normal local file paths for `read_csv`, `read_json`, `read_jsonl`,
  `read_xml`, `read_parquet`, `to_csv`, `to_jsonl`, and `to_parquet`.
- PyArrow filesystem file URI strings for the same single-file readers and
  converters when PyArrow is installed and can open the URI.
- Normal local folders for `read_json_folder` and `read_xml_folder`.
- PyArrow filesystem folder URI strings for `read_json_folder` and
  `read_xml_folder`; folder exploration is non-recursive.
- Already-resident `list[dict]` rows through `read_python`.

## [Unsupported Inputs](#index)

Unsupported inputs include raw JSON or XML strings, bytes payloads, opened
files, `io.BytesIO`, `io.StringIO`, custom reader objects, URLs that PyArrow
cannot open as files, and recursive folder scans. Write those inputs to a local
file first, or use `read_python` for in-memory `list[dict]` rows.

## [Examples](#index)

The `examples/` directory contains tutorial notebooks and one cloud pipeline
CLI example:

- `01_ingestion_and_core_api.ipynb`
- `02_options_and_stats.ipynb`
- `03_adapters_and_converters.ipynb`
- `04_streaming_large_csv_to_parquet.ipynb`
- `05_full_options_catalog_sweep.ipynb`
- `06_xml_reading_and_memory.ipynb`
- `07_gcs_jsonl_to_silver_parquet.py`: GCS JSONL to GCS Parquet using a
  BigQuery external table schema fetched through ADBC as `base_schema`

## [Platform Notes](#index)

Published PyPI wheels target glibc-based Linux environments
(`manylinux_2_28`). Alpine Linux uses musl, so Alpine users should use a
glibc-based Python environment or build from source.

## [Development](#index)

Install the project for local development:

```bash
pip install -e ".[dev]"
```

Run the tests:

```bash
pytest
```

Build the native core directly with CMake:

```bash
cmake -S . -B build/dev -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build/dev
```

## [License](#index)

`schema-sanitizer` is licensed under the Apache License 2.0. See
[`LICENSE`](LICENSE).
