Metadata-Version: 2.4
Name: dr-dasci
Version: 0.2.0
Summary: Automatic diagnosis for pandas, Polars, NumPy, Arrow, and distributed data pipelines.
Project-URL: Documentation, https://github.com/Arkay92/dr-dasci#readme
Project-URL: Homepage, https://github.com/Arkay92/dr-dasci
Project-URL: Issues, https://github.com/Arkay92/dr-dasci/issues
Project-URL: Repository, https://github.com/Arkay92/dr-dasci
Author: Arkay92
Maintainer: Arkay92
License: MIT License
        
        Copyright (c) 2026 JadeyGraham96
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: dataframe,diagnostics,memory,numpy,pandas,polars,profiling,pyarrow
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: dask[dataframe]>=2024.1; extra == 'all'
Requires-Dist: duckdb>=0.10; extra == 'all'
Requires-Dist: ipython>=8; extra == 'all'
Requires-Dist: ipywidgets>=8; extra == 'all'
Requires-Dist: numpy>=1.23; extra == 'all'
Requires-Dist: pandas>=1.5; extra == 'all'
Requires-Dist: polars>=0.20; extra == 'all'
Requires-Dist: pyarrow>=12; extra == 'all'
Requires-Dist: pyspark>=3.5; extra == 'all'
Provides-Extra: dask
Requires-Dist: dask[dataframe]>=2024.1; extra == 'dask'
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: twine>=5; extra == 'dev'
Provides-Extra: duckdb
Requires-Dist: duckdb>=0.10; extra == 'duckdb'
Provides-Extra: notebook
Requires-Dist: ipython>=8; extra == 'notebook'
Requires-Dist: ipywidgets>=8; extra == 'notebook'
Provides-Extra: numpy
Requires-Dist: numpy>=1.23; extra == 'numpy'
Provides-Extra: pandas
Requires-Dist: pandas>=1.5; extra == 'pandas'
Provides-Extra: polars
Requires-Dist: polars>=0.20; extra == 'polars'
Provides-Extra: spark
Requires-Dist: pyspark>=3.5; extra == 'spark'
Description-Content-Type: text/markdown

# dr-dasci

<p align="center">
  Automatic diagnostics for pandas, Polars, NumPy, Arrow, and distributed data pipelines.
</p>

<p align="center">
  <img width="256" height="256" alt="dr-dasci Logo" src="https://github.com/Arkay92/Dr-DaSci/raw/main/drdasci.png" />
</p>

<p align="center">
  <a href="https://github.com/Arkay92/dr-dasci/actions/workflows/publish.yml"><img alt="Publish" src="https://github.com/Arkay92/dr-dasci/actions/workflows/publish.yml/badge.svg" /></a>
  <a href="https://pypi.org/project/dr-dasci/"><img alt="PyPI" src="https://img.shields.io/pypi/v/dr-dasci.svg" /></a>
  <img alt="Python" src="https://img.shields.io/pypi/pyversions/dr-dasci.svg" />
  <img alt="Downloads" src="https://img.shields.io/pypi/dm/dr-dasci.svg" />
  <img alt="License" src="https://img.shields.io/pypi/l/dr-dasci.svg" />
</p>

**dr-dasci** combines:
- **Dataframe diagnostics** for pandas-like and Polars-like objects.
- **Array diagnostics** for NumPy memory layout, dtype, and copy risks.
- **Operation preflight checks** for joins, groupbys, pivots, conversions, and Parquet reads.
- **Optional engine diagnostics** for Dask, DuckDB, Spark, and Arrow Dataset metadata.
- **Runtime instrumentation** for measuring actual Python allocation peaks.
- **Configurable thresholds** for laptop, CI, and server memory budgets.
- **Suppression config** for accepted finding codes.
- **Machine-readable reports** with stable finding codes, metadata, and JSON export.
- **Notebook rendering** through HTML and optional `ipywidgets`.
- **Safe execution plans** for large tabular transformations.
- **Optional dependencies** so the base package stays lightweight.

---

## Why Diagnostics for Data Pipelines?

pandas, Polars, NumPy, and Arrow are powerful, but many expensive operations look
cheap at the call site. `dr-dasci` inspects the input before the operation runs:

```
Input object or file
  -> Detect runtime and shape
  -> Inspect dtypes, indexes, memory, cardinality, and layout
  -> Estimate operation-specific peak memory
  -> Report findings with stable codes
  -> Suggest safer dtypes and execution plans
  -> Export text or JSON for notebooks, CI, and logs
```

`dr-dasci` is designed to catch common problems before they become production
failures:
- **Hidden copies** from pandas object blocks, index alignment, and NumPy views.
- **Memory blowups** in joins, groupbys, pivots, unstack, fillna, and conversions.
- **String dtype traps** where `object`, high-cardinality text, or repeated labels
  need different treatment.
- **Parquet-to-pandas expansion** when encoded Arrow data becomes pandas blocks.
- **Join cardinality surprises** from duplicate keys, null keys, and many-to-many
  merges.
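
To make the string dtype trap concrete, the sketch below compares the deep memory footprint of a repeated-label column stored as `object` and as `category`. It uses plain pandas calls only and does not call `dr-dasci`; the labels and row count are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 100,000 rows drawn from three repeated labels.
values = (["pending", "shipped", "returned"] * 33_334)[:100_000]
labels = pd.Series(values, dtype=object)

# deep=True counts the Python string objects, not just the pointers.
object_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The categorical version stores one small integer code per row plus a single copy of each label, which is why it wins for repeated labels. For high-cardinality text the saving shrinks or disappears, so the two cases need different treatment.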

---

## Architecture

```
DataFrame / ndarray / file path
    |
    v
Adapter detection
  - pandas DataFrame
  - Polars DataFrame / LazyFrame
  - NumPy ndarray
  - dataframe-like fallback
  - Parquet metadata reader
  - Dask / DuckDB / Spark / Arrow Dataset metadata adapters
    |
    v
Diagnostics
  - shape and memory estimates
  - dtype and cardinality checks
  - pandas index/copy-risk checks
  - NumPy layout checks
  - join/groupby/pivot/conversion preflight
  - runtime peak allocation instrumentation
    |
    v
DoctorReport
  - human-readable show()
  - suggestions via suggest()
  - safe_execution_plan()
  - machine-readable to_dict() / to_json()
```

---

## Install

```bash
pip install dr-dasci
```

For pandas support:

```bash
pip install "dr-dasci[pandas]"
```

For Polars support:

```bash
pip install "dr-dasci[polars]"
```

For Dask, DuckDB, Spark, or notebook support:

```bash
pip install "dr-dasci[dask]"
pip install "dr-dasci[duckdb]"
pip install "dr-dasci[spark]"
pip install "dr-dasci[notebook]"
```

For every optional runtime integration (dataframe, array, Parquet, engine, and notebook support):

```bash
pip install "dr-dasci[all]"
```

For development:

```bash
pip install -e ".[dev,all]"
pytest -q
python -m build
twine check dist/*
```

---

## Quick Start

### Basic Diagnosis

```python
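import pandas as pd

from dr_dasci import diagnose

# Any supported dataframe works; this small pandas frame is only a placeholder.
df = pd.DataFrame({"order_id": [1, 2, 3], "status": ["new", "paid", "new"]})

report = diagnose(df, name="orders")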

report.show()
print(report.suggest())
```

### Machine-Readable Output

```python
from dr_dasci import diagnose

report = diagnose(df)

payload = report.to_dict()
json_text = report.to_json()

print(payload["findings"][0]["code"])
print(json_text)
```
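
One possible use of the JSON export is a CI gate. The sketch below filters findings by severity from a hand-written payload. Only the `findings` list and the `code`, `severity`, and `column` keys come from this README; the payload contents, the `blocking_codes` helper, and the `low`/`medium`/`high` severity labels are assumptions for illustration.

```python
import json

# Hypothetical payload; in practice this would be report.to_json().
json_text = json.dumps({
    "findings": [
        {"code": "EXPENSIVE_OBJECT_COLUMN", "severity": "high", "column": "notes"},
        {"code": "DOWNSIZE_NUMERIC_CANDIDATE", "severity": "low", "column": "qty"},
    ]
})

# Assumed ordering of severity labels, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def blocking_codes(report_json: str, minimum: str = "high") -> list[str]:
    """Return the codes of findings at or above the given severity."""
    payload = json.loads(report_json)
    floor = SEVERITY_RANK[minimum]
    return [
        finding["code"]
        for finding in payload.get("findings", [])
        if SEVERITY_RANK.get(finding.get("severity"), 0) >= floor
    ]


codes = blocking_codes(json_text)
print(codes)  # ['EXPENSIVE_OBJECT_COLUMN']
```

A CI job could exit with a non-zero status when the returned list is not empty.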

### Safe Execution Plan

```python
from dr_dasci import diagnose

report = diagnose(df, name="events")

for step in report.safe_execution_plan():
    print(step)
```

### Configurable Thresholds

```python
from dr_dasci import DoctorConfig, diagnose

config = DoctorConfig(
    available_memory_bytes=8_000_000_000,
    large_memory_bytes=1_500_000_000,
    expensive_column_bytes=150_000_000,
)

report = diagnose(df, config=config)
```

Suppress accepted finding codes:

```python
from dr_dasci import DoctorConfig, diagnose

config = DoctorConfig(suppress_codes=("EXPENSIVE_OBJECT_COLUMN",))
report = diagnose(df, config=config)
```

---

## CLI

Inspect a local data file:

```bash
dr-dasci inspect data.parquet
```

Emit JSON:

```bash
dr-dasci inspect data.parquet --json
```

---

## Main Features

### 1. **Dataframe Diagnosis**

Detect expensive object columns, large shapes, numeric downcast candidates,
nullable dtype candidates, and pandas index risks:

```python
from dr_dasci import diagnose

report = diagnose(df)
report.show()
```

Common finding codes include:

- `EXPENSIVE_OBJECT_COLUMN`
- `DOWNSIZE_NUMERIC_CANDIDATE`
- `DUPLICATE_INDEX`
- `NON_MONOTONIC_INDEX`
- `PANDAS_OBJECT_BLOCK_COPY_RISK`
- `PANDAS_ALIGNMENT_COPY_RISK`
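
As a rough illustration of what a numeric downcast candidate means, the sketch below finds the narrowest signed integer dtype that can hold an array's values. It uses NumPy only; the `smallest_int_dtype` helper is hypothetical and is not part of `dr-dasci`, whose own detection logic may differ.

```python
import numpy as np


def smallest_int_dtype(values: np.ndarray) -> np.dtype:
    """Return the narrowest signed integer dtype that holds every value."""
    lo, hi = int(values.min()), int(values.max())
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise OverflowError("values do not fit in int64")


# Hypothetical column stored as int64 although its range is small.
quantities = np.array([0, 12, 250, 31_000], dtype=np.int64)
target = smallest_int_dtype(quantities)

print(target, quantities.nbytes, quantities.astype(target).nbytes)  # int16 32 8
```

The helper assumes a non-empty integer array. A downcast is only safe if future values stay inside the narrower range.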

### 2. **Join Preflight**

Estimate join cardinality, null-key risk, many-to-many risk, and peak memory:

```python
from dr_dasci import diagnose_join

report = diagnose_join(left, right, on="customer_id", how="left")
report.show()
```
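
The arithmetic behind join cardinality can be sketched with the standard library alone: for a left join, each left row produces one output row per matching right row, or a single row when nothing matches. The `left_join_rows` helper and the key lists are hypothetical, and the sketch skips null keys on the right side, following SQL semantics. pandas `merge` instead matches null keys to each other, so it can produce more rows than this count.

```python
from collections import Counter


def left_join_rows(left_keys, right_keys) -> int:
    """Row count of a left equi-join on one key, with SQL null semantics."""
    right_counts = Counter(k for k in right_keys if k is not None)
    # Unmatched (or null) left keys still yield exactly one output row.
    return sum(max(right_counts[k], 1) for k in left_keys)


# Hypothetical keys: key 1 is duplicated on both sides (many-to-many).
left = [1, 1, 2, 3, None]
right = [1, 1, 1, 2, 2, None]

rows = left_join_rows(left, right)
print(rows)  # 10 output rows from 5 left rows
```

Here two left rows with key `1` each match three right rows, which alone accounts for six of the ten output rows.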

### 3. **Groupby Preflight**

Check high-cardinality grouping keys and aggregation memory pressure:

```python
from dr_dasci import diagnose_groupby

report = diagnose_groupby(events, by=["account_id", "event_day"])
print(report.risky_operations(minimum="high"))
```
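
A simple way to reason about grouping-key cardinality is the ratio of distinct key combinations to rows. The sketch below computes that ratio for a two-column key using the standard library. The `cardinality_ratio` helper and the sample rows are hypothetical; how `dr-dasci` applies thresholds such as `high_cardinality_ratio` is not specified here.

```python
def cardinality_ratio(keys) -> float:
    """Distinct keys divided by total rows (0.0 for empty input)."""
    keys = list(keys)
    return len(set(keys)) / len(keys) if keys else 0.0


# Hypothetical (account_id, event_day) pairs, one tuple per row.
event_keys = [
    ("a1", "2026-01-01"),
    ("a1", "2026-01-01"),
    ("a2", "2026-01-01"),
    ("a2", "2026-01-02"),
]

ratio = cardinality_ratio(event_keys)
print(ratio)  # 0.75: three distinct groups across four rows
```

A ratio close to 1.0 means the groupby produces nearly as many groups as input rows, so the aggregation reduces little while still paying for hashing or sorting the keys.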

### 4. **Pivot and Unstack Preflight**

Estimate dense expansion before reshaping:

```python
from dr_dasci import diagnose_pivot

report = diagnose_pivot(df, index="user_id", columns="event_type")
report.show()
```
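
Dense expansion follows from simple arithmetic: a pivot allocates one cell for every combination of unique index key and unique column key, regardless of how many combinations actually occur. The sketch below works through that estimate with the standard library; the `dense_pivot_cells` helper, the sample data, and the 8-byte cell size are assumptions for illustration.

```python
def dense_pivot_cells(index_values, column_values) -> int:
    """Cells in the dense result: unique index keys x unique column keys."""
    return len(set(index_values)) * len(set(column_values))


# Hypothetical long-format data: 1,000 users, each with one of 50 event types.
user_ids = list(range(1_000))
event_types = [f"type_{i % 50}" for i in range(1_000)]

cells = dense_pivot_cells(user_ids, event_types)
dense_bytes = cells * 8  # assuming one 8-byte float per cell
fill_ratio = len(user_ids) / cells  # share of cells backed by a real row

print(cells, dense_bytes, f"{fill_ratio:.0%}")  # 50000 400000 2%
```

In this example 1,000 input rows become 50,000 cells, of which only 2% hold real data. The same ratio applied to millions of rows is what turns a harmless-looking pivot into a memory problem.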

### 5. **Conversion Diagnostics**

Preflight conversion costs between pandas, Polars, NumPy, and Arrow-backed data:

```python
from dr_dasci import diagnose_conversion

report = diagnose_conversion(df, target="pandas")
print(report.to_json())
```

### 6. **Parquet Metadata Diagnostics**

Inspect Parquet row groups, column counts, compression, encodings, and pandas
conversion risk without loading the full dataset:

```python
from dr_dasci import diagnose_parquet

report = diagnose_parquet("events.parquet")
report.show()
```

### 7. **NumPy Copy-Risk Checks**

Catch object arrays and non-contiguous views:

```python
from dr_dasci import diagnose

report = diagnose(array)
report.show()
```
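
For context on what a non-contiguous view is, the sketch below uses NumPy only (no `dr-dasci` calls) to show a strided slice that shares memory with its parent and the copy that appears once contiguous memory is required. The array contents are arbitrary.

```python
import numpy as np

matrix = np.arange(12, dtype=np.float64).reshape(3, 4)

# Taking every other column returns a strided view: no data is copied yet.
view = matrix[:, ::2]
print(view.flags["C_CONTIGUOUS"])        # False
print(np.shares_memory(matrix, view))    # True

# Code that needs contiguous memory has to materialize a copy.
packed = np.ascontiguousarray(view)
print(packed.flags["C_CONTIGUOUS"])      # True
print(np.shares_memory(matrix, packed))  # False
```

The copy is cheap here, but on a large array it temporarily doubles the memory used by the selected data.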

### 8. **Optional Engine Metadata Diagnostics**

Inspect Dask, DuckDB, Spark, and Arrow Dataset objects from metadata without
triggering computation or collecting rows:

```python
from dr_dasci import diagnose

report = diagnose(lazy_or_distributed_frame)
print(report.safe_execution_plan())
```

### 9. **Runtime Memory Instrumentation**

Measure the actual peak Python allocation of a callable when you intentionally
want to run it:

```python
from dr_dasci import diagnose_runtime

result, report = diagnose_runtime(
    lambda: df.assign(total=df["a"] + df["b"]),
    name="assign_total",
)
print(report.estimates[0].metadata["peak_bytes"])
```

Preflight helpers do not execute transformations; `diagnose_runtime` is the
explicit execution-time measurement API.
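
For readers curious how a peak allocation can be measured at all, the standard library's `tracemalloc` is one option. The sketch below is a minimal stand-alone version; the `measure_peak` helper is hypothetical, and whether `diagnose_runtime` uses `tracemalloc` internally is not stated in this README.

```python
import tracemalloc


def measure_peak(func):
    """Run func() and return (result, peak_bytes) of traced Python allocations."""
    tracemalloc.start()
    try:
        result = func()
        # get_traced_memory() returns (current_bytes, peak_bytes).
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Hypothetical workload: build a list of one million references.
result, peak_bytes = measure_peak(lambda: [0] * 1_000_000)
print(len(result), peak_bytes)
```

`tracemalloc` only sees allocations routed through Python's memory allocators, so memory obtained directly by native code may not be counted.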

### 10. **Notebook Reports**

```python
from dr_dasci import diagnose

diagnose(df).to_notebook()
```

### 11. **Stable Finding Codes**

Every finding includes a stable `code`, `severity`, `suggestion`, optional
`column`, documentation URL, and metadata:

```python
for finding in report.findings:
    print(finding.code, finding.severity, finding.metadata)
```

See [docs/FINDINGS.md](docs/FINDINGS.md) for the finding catalog.

---

## Configuration

Tune behavior via `DoctorConfig`:

```python
from dr_dasci import DoctorConfig

config = DoctorConfig(
    large_memory_bytes=1_000_000_000,
    expensive_column_bytes=100_000_000,
    large_cell_count=50_000_000,
    large_rows=1_000_000,
    very_large_rows=5_000_000,
    pivot_row_warning=250_000,
    pivot_width_warning=25,
    join_high_memory_bytes=500_000_000,
    low_cardinality_ratio=0.2,
    low_cardinality_max_unique=50_000,
    high_cardinality_ratio=0.8,
    index_warning_rows=100_000,
    available_memory_bytes=None,
)
```

---

## Examples

```python
from dr_dasci import diagnose, diagnose_join

orders_report = diagnose(orders, name="orders")
customers_join_report = diagnose_join(orders, customers, on="customer_id")

orders_report.show()
customers_join_report.show()
```

```bash
dr-dasci inspect warehouse/orders.parquet --json
```

---

## Project Structure

```
src/dr_dasci/
  __init__.py          # Public API
  config.py            # DoctorConfig thresholds
  core.py              # Diagnostics and operation preflight helpers
  report.py            # DoctorReport, findings, estimates, JSON export
  cli.py               # Command-line interface
  py.typed             # Typing marker
docs/
  FINDINGS.md          # Stable finding-code catalog
CHANGELOG.md           # Release history
tests/
  test_*.py            # Unit and optional integration tests
.github/
  workflows/
    ci.yml             # Lint, type check, tests, build, twine check
    publish.yml        # PyPI publishing workflow
pyproject.toml         # Project metadata and dependencies
drdasci.png            # Project logo
```

---

## Development

```bash
# Install with dev extras
pip install -e ".[dev,all]"

# Lint
ruff check .

# Type check
mypy src

# Run tests
pytest -q

# Build package
python -m build

# Check distributions
twine check dist/*
```

---

## License

MIT

---

## Contributing

Contributions are welcome. When opening an issue, include a reproducible example
with the dataframe shape, dtypes, the operation, and the observed memory or
runtime behavior.

---

## Citation

If you use dr-dasci in research, please cite:

```bibtex
@software{drdasci2026,
  title={dr-dasci: Automatic Diagnostics for Data Science Pipelines},
  author={Arkay92},
  url={https://github.com/Arkay92/dr-dasci},
  year={2026},
  version={0.2.0},
}
```

---

## Acknowledgments

- [pandas](https://pandas.pydata.org/) for dataframe analytics.
- [Polars](https://pola.rs/) for high-performance dataframe execution.
- [NumPy](https://numpy.org/) for array computing.
- [Apache Arrow](https://arrow.apache.org/) for columnar memory and Parquet tooling.
