Metadata-Version: 2.4
Name: csvbench
Version: 0.1.0
Summary: Read, inspect and rewrite malformed CSV files with automatic encoding and separator detection.
Author-email: Vinícius Machado <viniciusfm1@outlook.com>
License: MIT
Project-URL: Homepage, https://github.com/viniciusfm1/csvbench
Project-URL: Issues, https://github.com/viniciusfm1/csvbench/issues
Keywords: csv,data,encoding,parser,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: charset-normalizer>=3.3
Requires-Dist: chardet>=5.2
Requires-Dist: pydantic>=2.6
Requires-Dist: rich>=13.7
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: pytest>=8.1; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: hypothesis>=6.100; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.9; extra == "dev"
Requires-Dist: pandas>=2.3.3; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.3; extra == "docs"
Requires-Dist: sphinx-napoleon>=0.7; extra == "docs"
Requires-Dist: furo>=2024.1; extra == "docs"
Dynamic: license-file

# csvbench

csvbench is a Python library for reading, diagnosing, and repairing malformed CSV files.
It is under active development and not yet production-ready.

It deliberately does not use Python's `csv` module. csvbench ships its own parser, because handling broken files is the point.

---

## Status

Early stage. The core pipeline (encoding detection, parsing, diagnosis) works.
Repair strategies are under active development.

Battle-tested in production? Probably not. But you're welcome to try it and to contribute.

---

## Features

- Automatic detection of encoding, delimiter, and quote character
- Multi-character separator support (e.g. `||`, `@@@`)
- Structured diagnostic reports with per-row issue tracking
- Pluggable repair strategies via the Strategy pattern
- CLI with rich terminal output and JSON output for programmatic use
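
As a rough illustration of what "pluggable repair strategies" can look like, here is a minimal Strategy-pattern sketch. Every name in it (`RepairStrategy`, `StripWhitespace`, `PadShortRows`, `apply_strategies`) is hypothetical and chosen for this example; it is not csvbench's actual API, which is still under development.

```python
from typing import Protocol

Row = list[str]


class RepairStrategy(Protocol):
    """Hypothetical interface: anything with a repair() method qualifies."""

    def repair(self, rows: list[Row]) -> list[Row]: ...


class StripWhitespace:
    """Trim leading and trailing whitespace from every cell."""

    def repair(self, rows: list[Row]) -> list[Row]:
        return [[cell.strip() for cell in row] for row in rows]


class PadShortRows:
    """Pad rows that have fewer than `width` cells; longer rows are untouched."""

    def __init__(self, width: int, fill: str = "") -> None:
        self.width = width
        self.fill = fill

    def repair(self, rows: list[Row]) -> list[Row]:
        # A negative multiplier yields an empty list, so long rows stay as-is.
        return [row + [self.fill] * (self.width - len(row)) for row in rows]


def apply_strategies(rows: list[Row], strategies: list[RepairStrategy]) -> list[Row]:
    """Run each strategy in order, feeding the output of one into the next."""
    for strategy in strategies:
        rows = strategy.repair(rows)
    return rows


repaired = apply_strategies(
    [[" a ", "b"], ["c"]],
    [StripWhitespace(), PadShortRows(width=2)],
)
```

Because each strategy only has to provide `repair()`, new repairs can be added without touching the code that runs them.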

---

## Usage

### CLI

```bash
csvbench inspect appointments.csv
```

```
╭────────────────────────────── csvbench inspect ────────────────────────────────╮
│                                                                                │
│   📁 File  ~/data/appointments.csv                                             │
│   🔤 Encoding  utf-8-sig  (100% confidence - bom)                              │
│   🔀 Separator  ';'  (98% confidence - sniffed)                                │
│   💬 Quotechar  '"'  (97% confidence - detected)                               │
│   📊 Columns  12                                                               │
│   📈 Lines  19847                                                              │
│   ❌ Errors  0                                                                 │
│   ⚠️  Warnings  0                                                              │
│   ⏱️  Elapsed  0.0013s                                                         │
│                                                                                │
╰────────────────────────────────────────────────────────────────────────────────╯
  ✔  No issues found.
```

JSON output for scripting:

```bash
csvbench inspect appointments.csv --format json
csvbench inspect appointments.csv --format json --output report.json
```

Reading from stdin:

```bash
cat appointments.csv | csvbench inspect -
```

### Python API

```python
from csvbench import CsvWorkbench

workbench = CsvWorkbench()
csv_file = workbench.read("appointments.csv")

print(csv_file.delimiter)           # ';'
print(csv_file.encoding)            # 'utf-8-sig'
print(csv_file.report.has_errors)   # False
```

Override detection when you already know the parameters:

```python
csv_file = workbench.read("appointments.csv", delimiter=";", encoding="utf-8")
```

---

## Design

**No `csv` module.** csvbench implements its own parser. Python's `csv` module assumes
the file is well-formed enough to be parsed — csvbench doesn't. The parser operates
character by character to correctly handle malformed quoting, embedded newlines, and
inconsistent delimiters.
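
To make the character-by-character idea concrete, here is a minimal sketch of such a parser. It is an illustration written for this README under simplifying assumptions, not csvbench's real implementation: the function name `parse_rows` is hypothetical, and the tolerance rules (a quote in the middle of an unquoted field is kept as a literal character, an unterminated quote runs to end of input) are choices made for the example.

```python
def parse_rows(text: str, delimiter: str = ",", quotechar: str = '"') -> list[list[str]]:
    """Split text into rows of fields, scanning one character at a time."""
    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    in_quotes = False
    i, n = 0, len(text)

    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quotechar:
                if text.startswith(quotechar, i + 1):
                    # Doubled quote inside a quoted field: one literal quote.
                    field.append(quotechar)
                    i += 2
                    continue
                in_quotes = False  # closing quote
            else:
                field.append(ch)  # includes embedded newlines
            i += 1
        elif ch == quotechar and not field:
            in_quotes = True  # a quote only opens at the start of a field
            i += 1
        elif text.startswith(delimiter, i):
            row.append("".join(field))
            field = []
            i += len(delimiter)  # delimiters may be longer than one character
        elif ch in "\r\n":
            if ch == "\r" and text.startswith("\n", i + 1):
                i += 1  # treat \r\n as a single line ending
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            i += 1
        else:
            field.append(ch)  # a stray quote mid-field is kept as a literal
            i += 1

    if field or row:  # flush a final row that has no trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'id||note\n1||"line one\nline two"\n2||say "hi"\n'
parsed = parse_rows(sample, delimiter="||")
```

Here the second row keeps its embedded newline inside one field, and the stray quotes in the third row survive as data instead of aborting the parse.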

**Multi-character separators.** The delimiter detector considers both single-character
(`|`, `;`, `\t`) and multi-character candidates (`||`, `::`) when sniffing the file.

**Pydantic v2 models throughout.** `CSVFile`, `DiagnosticReport`, `Issue`, and all
detector results are Pydantic models. This keeps the data layer typed, validated, and
serializable without extra glue code.

**CLI with two output modes.** `rich` for humans, `json` for pipelines. Both use the
same underlying models — the formatter is swapped, not the data.
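
The "swap the formatter, not the data" idea can be sketched as follows. To stay dependency-free, this example uses standard-library dataclasses in place of Pydantic models and plain text in place of `rich`; the names `Issue`, `Report`, `format_json`, `format_text`, and `render` are hypothetical and do not mirror csvbench's real classes or fields.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Issue:
    line: int
    message: str


@dataclass
class Report:
    path: str
    issues: list[Issue] = field(default_factory=list)


def format_json(report: Report) -> str:
    """Machine-readable output for pipelines."""
    return json.dumps(asdict(report), indent=2)


def format_text(report: Report) -> str:
    """Human-readable output (a stand-in for the rich terminal view)."""
    lines = [f"File: {report.path}", f"Issues: {len(report.issues)}"]
    lines += [f"  line {issue.line}: {issue.message}" for issue in report.issues]
    return "\n".join(lines)


# Both formatters accept the same Report; only the rendering differs.
FORMATTERS = {"json": format_json, "text": format_text}


def render(report: Report, fmt: str = "text") -> str:
    return FORMATTERS[fmt](report)


report = Report("appointments.csv", [Issue(3, "unclosed quote")])
as_text = render(report, "text")
as_json = render(report, "json")
```

Adding a third output mode in this layout means registering one more function; the report model itself stays unchanged.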

---

## Contributing

Issues and pull requests are welcome.

If you find a CSV file that csvbench misparses or misdiagnoses, opening an issue with
the file (or a minimal reproduction) is already a meaningful contribution.

---

## License

MIT
