Metadata-Version: 2.4
Name: schema-classifier
Version: 0.1.0
Summary: MVP: Detect and infer schemas from files/dirs/DataFrames; emit YAML/JSON/TXT/Spark StructType
Author: Aashish Kumar
Keywords: schema,classifier,parquet,csv,json,delta,iceberg,hudi
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyarrow>=12
Requires-Dist: fastavro>=1.7
Requires-Dist: zstandard>=0.20
Requires-Dist: charset-normalizer>=3.0
Provides-Extra: dataframe
Requires-Dist: pandas>=1.5; extra == "dataframe"
Requires-Dist: pyspark>=3.3; extra == "dataframe"
Provides-Extra: orc
Dynamic: license-file

# schema-classifier

**PySchemaClassifier** is a Python library and CLI that detects file, table, and DataFrame formats, infers or extracts their schemas, and emits them as a Spark StructType-like dict, YAML, JSON, or TXT. The MVP focuses on single-level compression, core formats (CSV/JSON/XML/Parquet/Avro/ORC plus Delta/Iceberg/Hudi metadata), sampling policies, and robust exceptions.
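To make "Spark StructType-like dict" concrete, here is a minimal sketch of the JSON shape that Spark uses for a struct schema. The helper names `struct_field` and `struct_type` and the sample columns are hypothetical and only illustrate the layout; they are not this library's API.

```python
import json


def struct_field(name, dtype, nullable=True):
    """Build one field entry in the shape Spark uses for StructField JSON."""
    return {"name": name, "type": dtype, "nullable": nullable, "metadata": {}}


def struct_type(fields):
    """Wrap field entries in the top-level struct envelope."""
    return {"type": "struct", "fields": list(fields)}


# Hypothetical columns, chosen only to show primitive and nested types.
schema = struct_type([
    struct_field("order_id", "long", nullable=False),
    struct_field("amount", "double"),
    struct_field(
        "tags",
        {"type": "array", "elementType": "string", "containsNull": True},
    ),
])

print(json.dumps(schema, indent=2))
```

A dict in this shape can be serialized to JSON or YAML unchanged, which is one reason to use it as the common in-memory form.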

## Status
This is a **design-locked skeleton** for MVP implementation. Modules are scaffolded with docstrings and TODO markers.

## Quick Start
```bash
# create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# or, on Windows:
.\.venv\Scripts\activate

# editable install
pip install -e .
# optional: include the ORC extra (quoted so shells such as zsh do not expand the brackets)
pip install -e ".[orc]"

# run CLI (prints skeleton info)
schema-detect --help

# Try the commands below against the bundled test fixtures.
# Defaults when a flag is omitted:
#   --fmt yaml
#   --output-dir .
#   --output-file schema.yml

schema-detect tests/data/csv/sales_header.csv
schema-detect tests/data/csv/sales_no_header.csv --fmt yaml --output-file schema_no_header.yml
schema-detect tests/data/csv/very_wide.csv --fmt yaml --output-file schema_wide.yml
schema-detect tests/data/csv/sales_utf8_sig.csv --fmt yaml --output-file schema_utf8.yml
schema-detect tests/data/orc/TestOrcFile.testDate1900.orc --fmt yaml --output-file schema_orc.yml
schema-detect tests/data/avro/weather.avro --fmt yaml --output-file schema_avro.yml
schema-detect tests/data/parquet/v0.7.1.all-named-index.parquet --fmt yaml --output-file schema_pqt.yml
schema-detect tests/data/delta/people_countries_delta_dask/ --fmt yaml --output-file schema_delta.yml
schema-detect tests/data/json/events.ndjson --fmt yaml --output-file schema_json.yml
schema-detect tests/data/xml/books.xml --fmt yaml --output-file schema_xml.yml
schema-detect tests/data/csv/ --multi-file-fmt txt
schema-detect tests/data/csv/sales_20250101.csv --fmt json --output-file schema_date.json

# Print the schema on the command line
schema-detect tests/data/json/events.ndjson --fmt dict

```

```bash
# Test the Python APIs (Windows-style path shown)
python .\tests\unit\combine_run_schema.py
```

```bash
# Build the image (PowerShell); pick one target
.\build.ps1 -Target [test|prod]
```

## CLI Overview (MVP)
Single command: `schema-detect <path>` with write options and detection/sampling knobs.

Key flags (subset):
- `--detection-mode {trust_hint,verify_hint,auto_detect}` (default: `trust_hint`)
- `--coverage-mode {any,max,full}` (default: `max`)
- `--sample-records` (default: 500)
- `--sample-bytes` (default: 5MB for `any`; `full` capped at 100MB)
- `--output-dir`, `--output-file`, `--fmt {yaml,json,txt,dict}`
- `--zip-max-size` (default: 500MB), `--zip-max-members` (default: 100)
- `--max-file-size` (default: 50GB)
- `--sample-total-bytes-cap` (soft cap default: 1GB)
- `--max-workers` (default: `os.cpu_count()`)
- `--retries` (default: 3), `--timeout-seconds` (default: 180)
- `--log-json` (opt-in), `-v/--verbose`
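The flags above map naturally onto an `argparse` parser. The following is a minimal sketch under stated assumptions, not the actual `cli.py`: it covers only a subset of the flags, takes the defaults from this README, and treats `-v` as a repeatable counter, which is an assumption.

```python
import argparse
import os


def build_parser():
    """Sketch of a parser for a subset of the documented schema-detect flags."""
    parser = argparse.ArgumentParser(prog="schema-detect")
    parser.add_argument("path", help="file or directory to inspect")
    parser.add_argument(
        "--detection-mode",
        choices=["trust_hint", "verify_hint", "auto_detect"],
        default="trust_hint",
    )
    parser.add_argument(
        "--coverage-mode", choices=["any", "max", "full"], default="max"
    )
    parser.add_argument("--sample-records", type=int, default=500)
    parser.add_argument(
        "--fmt", choices=["yaml", "json", "txt", "dict"], default="yaml"
    )
    parser.add_argument("--output-dir", default=".")
    parser.add_argument("--output-file", default="schema.yml")
    parser.add_argument("--max-workers", type=int, default=os.cpu_count())
    parser.add_argument("--retries", type=int, default=3)
    parser.add_argument("--timeout-seconds", type=int, default=180)
    parser.add_argument("--log-json", action="store_true")
    # Assumption: repeated -v raises verbosity; the README only names the flag.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


args = build_parser().parse_args(
    ["tests/data/csv/sales_header.csv", "--fmt", "json", "-v"]
)
print(args.detection_mode, args.coverage_mode, args.sample_records, args.fmt)
```

`argparse` converts dashes in option names to underscores, so `--sample-records` is read back as `args.sample_records`.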

CSV knobs (MVP): `--csv.header {auto,true,false}` (auto flips to true when confidence ≥ 0.80), `--csv.delimiter`, `--csv.quote`, `--csv.escape`, `--encoding` (utf-8/utf-8-sig/utf-16le/utf-16be).
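The `auto` header rule can be illustrated with a small scoring function. This is a sketch only: the 0.80 threshold comes from this README, while the scoring heuristic (a text label sitting on top of a mostly numeric column counts as header evidence) and the names `header_confidence` and `resolve_header` are assumptions made for illustration.

```python
import csv
import io

HEADER_CONFIDENCE_THRESHOLD = 0.80  # threshold stated in the README


def _is_number(text):
    """Return True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def header_confidence(rows):
    """Share of mostly numeric columns whose first cell is not numeric."""
    if len(rows) < 2:
        return 0.0
    first, body = rows[0], rows[1:]
    votes = []
    for index, cell in enumerate(first):
        column = [r[index] for r in body if index < len(r) and r[index] != ""]
        if not column:
            continue
        numeric_share = sum(_is_number(v) for v in column) / len(column)
        if numeric_share >= 0.5:
            # A text label above numeric data suggests a header row.
            votes.append(0.0 if _is_number(cell) else 1.0)
    return sum(votes) / len(votes) if votes else 0.0


def resolve_header(mode, rows):
    """Map the auto/true/false setting to a boolean decision."""
    if mode == "true":
        return True
    if mode == "false":
        return False
    return header_confidence(rows) >= HEADER_CONFIDENCE_THRESHOLD


rows_with_header = list(csv.reader(io.StringIO(
    "order_id,amount,region\n1,9.5,EU\n2,12.0,US\n")))
rows_without_header = list(csv.reader(io.StringIO(
    "1,9.5,EU\n2,12.0,US\n3,7.25,EU\n")))

print(resolve_header("auto", rows_with_header))
print(resolve_header("auto", rows_without_header))
```

Columns that hold only text carry no vote in this sketch, so a file made up entirely of string columns falls back to "no header"; a real implementation would likely need additional signals for that case.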



<pre>


schema-classifier/
├─ README.md
├─ LICENSE
├─ pyproject.toml                   # packaging, deps, console script
├─ .gitignore
├─ src/
│  └─ pyschemaclassifier/          # library: prefer 'PySchemaClassifier' (or 'open_pyschemaclassifier' if name taken)
│     ├─ __init__.py
│     ├─ cli.py                    # CLI: schema-detect entrypoint
│     ├─ infer.py                  # Orchestrator: classify → detect → normalize → emit
│     ├─ config.py                 # Config model + load/merge logic (flags override YAML)
│     ├─ logging_utils.py          # Colored logs, JSON logs, verbosity levels
│     ├─ exceptions.py             # ArgumentError + taxonomy (DetectionError, etc.)
│     ├─ models/
│     │  ├─ __init__.py
│     │  └─ schema.py              # Normalized schema model + Spark StructType JSON conversion
│     ├─ detection/
│     │  ├─ __init__.py
│     │  ├─ classifier.py          # extension/magic bytes / table markers (delta/_delta_log, iceberg metadata.json, .hoodie)
│     │  ├─ compression.py         # gzip/bz2/xz/zstd/zip one-level handling; size/member caps; corruption checks
│     │  ├─ sampling.py            # Sampling state machine (records/bytes, coverage_mode, error budget)
│     │  ├─ csv.py                 # Basic delimiter/quote/escape/BOM/encoding; header auto w/ ≥0.80
│     │  ├─ json.py                # NDJSON vs JSON object/array; recursive inference; unions off by default
│     │  ├─ xml.py                 # Basic element→object; arrays via repeated elements (iterparse)
│     │  ├─ parquet.py             # Footer-based extraction via pyarrow; logical type mapping
│     │  ├─ avro.py                # Schema extraction via fastavro
│     │  ├─ orc.py                 # Schema via pyorc
│     │  ├─ delta.py               # Latest snapshot from _delta_log JSON (names in schema, IDs in metadata)
│     │  ├─ iceberg.py             # Parse metadata.json; partition transforms to metadata
│     │  └─ hudi.py                # COW support; raise TableFormatError for MOR
│     ├─ writers/
│     │  ├─ __init__.py
│     │  ├─ yaml.py                # schema.yml writer (default)
│     │  ├─ json.py                # Pretty JSON (schema + meta)
│     │  ├─ txt.py                 # Human-friendly text summary
│     │  └─ struct.py              # Return dict exactly matching Spark StructType.jsonValue()
│     ├─ dataframe/
│     │  ├─ __init__.py
│     │  ├─ pandas.py              # detect_schema_from_df(pd.DataFrame)
│     │  └─ spark.py               # detect_schema_from_df(Spark DataFrame)
│     └─ utils/
│        ├─ __init__.py
│        ├─ io.py                  # Safe open/stream, retries (3), timeouts (180s), size pre-checks (50 GB)
│        ├─ path.py                # Path utilities, dir traversal, per-file sampling selection
│        └─ metrics.py             # Confidence scoring; delimiter stability; provenance/meta helpers
├─ tests/
│  ├─ conftest.py
│  ├─ unit/
│  │  ├─ test_cli.py
│  │  ├─ test_config.py
│  │  ├─ test_exceptions.py
│  │  ├─ test_sampling.py
│  │  ├─ test_csv.py
│  │  ├─ test_json.py
│  │  ├─ test_xml.py
│  │  ├─ test_parquet.py
│  │  ├─ test_avro.py
│  │  ├─ test_orc.py
│  │  ├─ test_delta.py
│  │  ├─ test_iceberg.py
│  │  └─ test_hudi.py
│  ├─ integration/
│  │  ├─ test_directory_mode.py
│  │  ├─ test_zip_container.py
│  │  └─ test_parallel_sampling.py
│  ├─ data/                        # Small fixtures per format + compression (MVP-focused)
│  │  ├─ csv/
│  │  │  ├─ sales_header.csv
│  │  │  ├─ sales_no_header.csv
│  │  │  ├─ sales_utf8_sig.csv
│  │  │  ├─ sales.csv.gz
│  │  │  └─ very_wide.csv
│  │  ├─ json/
│  │  │  ├─ events.ndjson
│  │  │  └─ events.ndjson.gz
│  │  ├─ xml/
│  │  │  └─ books.xml
│  │  ├─ parquet/
│  │  │  └─ sample.parquet
│  │  ├─ avro/
│  │  │  └─ sample.avro
│  │  ├─ orc/
│  │  │  └─ sample.orc
│  │  ├─ delta/
│  │  │  └─ _delta_log/           # minimal commit/checkpoint JSONs
│  │  ├─ iceberg/
│  │  │  └─ metadata.json
│  │  ├─ hudi/
│  │  │  └─ .hoodie/              # COW minimal markers
│  │  └─ containers/
│  │     └─ sample.zip            # Small multi-entry zip within 500MB limit
│  └─ golden/
│     ├─ csv_sales_header.schema.yml
│     ├─ csv_sales_no_header.schema.yml
│     ├─ json_events_ndjson.schema.yml
│     ├─ parquet_sample.schema.yml
│     └─ … (one per format/compression)
├─ examples/
│  ├─ configs/
│  │  └─ mvp_defaults.yml         # Shows all overridable knobs; flags override
│  ├─ cli/
│  │  ├─ detect_csv.sh            # Illustrative commands (no actual code execution here)
│  │  ├─ detect_json.sh
│  │  └─ detect_parquet.sh
│  └─ api/
│     ├─ detect_from_path.md      # Usage examples (text only)
│     └─ detect_from_df.md        # Spark/pandas API examples (text only)
└─ docs/
   ├─ mvp_overview.md             # Short doc; full docs post-MVP
   └─ config_reference.md         # Flags and YAML keys


</pre>
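The `detection/classifier.py` entry in the layout above mentions magic bytes and table markers. A minimal sketch of that idea follows, assuming well-known file signatures and the marker directories named in the tree; the function name `classify`, the returned labels, and the Iceberg check are hypothetical and are not the module's actual interface.

```python
from pathlib import Path
import tempfile

# Well-known leading signatures for the formats and codecs listed in this README.
MAGIC_PREFIXES = [
    (b"PAR1", "parquet"),
    (b"Obj\x01", "avro"),
    (b"ORC", "orc"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"PK\x03\x04", "zip"),
]


def classify(path):
    """Return a coarse format label from table markers or leading bytes."""
    p = Path(path)
    if p.is_dir():
        if (p / "_delta_log").is_dir():
            return "delta"
        if (p / ".hoodie").is_dir():
            return "hudi"
        metadata_dir = p / "metadata"
        # Assumption: either a top-level metadata.json or metadata/*.metadata.json.
        if (p / "metadata.json").is_file() or (
            metadata_dir.is_dir() and any(metadata_dir.glob("*.metadata.json"))
        ):
            return "iceberg"
        return "directory"
    with p.open("rb") as handle:
        head = handle.read(8)
    for prefix, label in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return label
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "table" / "_delta_log").mkdir(parents=True)
    (root / "data.bin").write_bytes(b"PAR1" + b"\x00" * 16)
    (root / "notes.txt").write_text("plain text", encoding="utf-8")
    results = {
        "table": classify(root / "table"),
        "data.bin": classify(root / "data.bin"),
        "notes.txt": classify(root / "notes.txt"),
    }

print(results)
```

Text formats such as CSV, JSON, and XML have no fixed signature, so they come back as `unknown` here and would need extension hints or content sniffing in a fuller classifier.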
