Metadata-Version: 2.4
Name: dsvmonkey
Version: 0.1.0
Summary: Detect, profile, normalize and repair delimiter-separated values files (CSV, TSV, pipe, semicolon).
Author-email: rexbytes <pythonic@rexbytes.com>
License: MIT License
        
        Copyright (c) 2026 RexBytes
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/rexbytes/dsvmonkey
Project-URL: Issues, https://github.com/rexbytes/dsvmonkey/issues
Keywords: csv,tsv,dsv,etl,encoding,delimiter,cleaning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cleanmonkey<1.0,>=0.1
Requires-Dist: datemonkey<1.0,>=0.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: hypothesis>=6.0; extra == "dev"
Dynamic: license-file

# dsvmonkey

Detect, profile, normalize and repair delimiter-separated-values files.

CSV is a polite lie. Real files are tab-separated, pipe-separated, or
semicolon-separated; start with decorative title rows; carry BOMs and
mixed encodings; include ragged rows and quoted newlines. `dsvmonkey`
reads them anyway, tells you what it found, and hands you a clean
stream of rows.

## Status

Alpha. API is not yet stable.

## Install

```bash
pip install dsvmonkey
```

For development (editable install with test tooling):

```bash
pip install -e ".[dev]"   # quotes keep zsh from globbing the brackets
# or equivalently:
pip install -r requirements-dev.txt
```

Both `requirements.txt` and `requirements-dev.txt` are thin pointers
to `pyproject.toml` — the single source of truth for dependency
lists. Edit dependencies in `pyproject.toml`; the requirements files
need no maintenance.

## What it does

- **Detect** encoding, delimiter, quote char, header row and line
  endings — each with a confidence score, runner-up alternatives and
  the reasoning behind the choice.
- **Normalize** cells on read using [`cleanmonkey`](https://pypi.org/project/cleanmonkey/)
  (BOMs, NBSPs, zero-width spaces, smart quotes, stray control chars).
- **Profile** date columns via [`datemonkey`](https://pypi.org/project/datemonkey/).
- **Repair** ragged rows, stray BOMs and inconsistent line endings.
- **Stream** row-by-row; large files are fine.
- **Chain** cleanly into `pgmonkey` (DB import), `xlfilldown` (Excel
  output) and `typemonkey` (type inference).
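
To make "confidence score" and "runner-up alternatives" concrete, here is a toy sketch that uses only the standard library. It is not dsvmonkey's detection algorithm; the `score_delimiters` helper and its scoring rule (share of rows that agree on a field count) are assumptions chosen for illustration.

```python
import csv
import io
from collections import Counter

CANDIDATES = [",", "\t", "|", ";"]


def score_delimiters(sample, candidates=CANDIDATES):
    """Rank candidate delimiters by how consistently they split the sample.

    Score = share of rows that have the most common field count, or 0.0
    when that count is 1 (the delimiter never split anything).
    """
    scores = {}
    for delim in candidates:
        rows = list(csv.reader(io.StringIO(sample), delimiter=delim))
        counts = Counter(len(row) for row in rows if row)
        if not counts:
            scores[delim] = 0.0
            continue
        width, hits = counts.most_common(1)[0]
        scores[delim] = hits / sum(counts.values()) if width > 1 else 0.0
    # Best guess first, runner-ups after it.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A semicolon-separated sample with a stray comma inside one cell.
sample = "a;b;c\n1;2;3\n4;5,5;6\n"
ranking = score_delimiters(sample)
print(ranking[0])  # (';', 1.0)
```

The first entry of `ranking` plays the role of the detected value and the remaining entries are the runner-ups; a real detector would weigh more evidence than field counts alone.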

## CLI

```bash
dsvmonkey inspect   file.csv                       # human-readable detection report
dsvmonkey normalize file.csv -o clean.csv          # strip BOM, fix ragged rows, normalize endings
dsvmonkey convert   file.csv -o out.jsonl --to jsonl
```

Run `dsvmonkey --help` for the list of commands, or
`dsvmonkey <command> --help` for that command's flags. Flags are
command-specific:

- `inspect`: `-v/--verbose`, `--no-columns`, `--sample-rows`,
  `--excel-serial-min`, `--no-deep-scan`, `--clean-sample`,
  `--strict` (exit 3 instead of 0 when the profile recommends
  human review — the unattended-pipeline gate).
- `normalize`: `--encoding`, `--line-ending lf|crlf|cr`,
  `--delimiter`, `--field-count`, `--no-clean`, `--no-deep-scan`,
  `--keep-empty-rows`, `--sanitize-formulas`, `--strict` (same
  gate semantics as `inspect --strict`: profile first, exit 3
  with no output written when detection isn't confident enough).
- `convert`: `--to {csv,tsv,jsonl}`, `--no-clean`, `--no-deep-scan`,
  `--keep-empty-rows`, `--sanitize-formulas` (applies to every output
  format, including `jsonl`: JSONL is often converted back to CSV or
  Excel later, at which point formula payloads that survived as JSON
  string values become live formulas), `--strict` (gate as above).
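
In an unattended pipeline, the `--strict` gate can be checked from a wrapper script. The sketch below assumes `dsvmonkey` is on `PATH` and relies only on the documented exit code 3. The `needs_review` helper and its injectable `run` parameter are illustrative names, not part of dsvmonkey, and treating every other non-zero code as a hard failure is an assumption.

```python
import subprocess

REVIEW_EXIT_CODE = 3  # documented: --strict exits 3 when human review is recommended


def needs_review(path, run=subprocess.run):
    """Return True if `dsvmonkey inspect --strict` flags the file for review.

    `run` is injectable so the logic can be exercised without the CLI installed.
    """
    result = run(["dsvmonkey", "inspect", "--strict", path])
    if result.returncode == REVIEW_EXIT_CODE:
        return True
    if result.returncode != 0:
        # Assumption: any other non-zero exit code is a hard failure.
        raise RuntimeError(
            f"dsvmonkey inspect failed with exit code {result.returncode}"
        )
    return False
```

A pipeline step would call `needs_review("file.csv")` and route the file to a manual queue when it returns `True`, instead of passing it on to `normalize` or `convert`.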

## Python API

```python
import dsvmonkey

# Profile a file — encoding, delimiter, headers, etc.
profile = dsvmonkey.profile_file("file.csv")

# Stream cleaned rows as dicts
for row in dsvmonkey.read("file.csv"):
    ...

# Write a cleaned version
report = dsvmonkey.repair("messy.csv", "clean.csv")

# Convert to JSON Lines
dsvmonkey.to_jsonl("file.csv", "file.jsonl")

# Per-column profiling (date-format detection via datemonkey)
columns = dsvmonkey.profile_columns("file.csv")
```

## Limitations

Some behaviours are deliberate design tradeoffs rather than bugs (e.g.
mixed-encoding detection requires UTF-8 multi-byte evidence to avoid
false-positives on cp1252 files; duplicate header names in dict mode
warn-and-collapse rather than raise). See `LIMITATIONS.md` for the
full list with rationale and escape hatches.
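
To see why duplicate header names need a policy in dict mode at all, here is a short sketch using only the standard library's `csv.DictReader`, not dsvmonkey. A dict cannot hold two values under one key, so one of the duplicated columns is lost. Which value dsvmonkey keeps, and how it warns, is documented in `LIMITATIONS.md`; nothing below should be read as its implementation.

```python
import csv
import io

# Two columns share the header "id".
data = io.StringIO("id,name,id\n1,alice,99\n")

rows = list(csv.DictReader(data))

# The stdlib reader keeps the last value for a repeated key and raises no error.
print(rows[0])  # {'id': '99', 'name': 'alice'}
```

The first `id` value (`1`) disappears without any signal, which is the failure mode a warn-and-collapse policy makes visible.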

## Using with AI assistants

`SKILL.md` at the repo root is a drop-in Claude Code / agent skill that
teaches LLMs how to call `dsvmonkey` correctly — decision tree, failure
modes it already handles, worked examples, and a "don't" list so agents
stop reinventing broken CSV parsing. Copy it to `~/.claude/skills/` or
include it in a project's `AGENTS.md` / `CLAUDE.md` for automatic
discovery.

## License

MIT. See `LICENSE`.
