Metadata-Version: 2.4
Name: qualink
Version: 0.0.1
Summary: Blazing fast data quality framework for Python, built on Apache DataFusion
Keywords: data,quality,dq
Author: Pavan Kumar Gopidesu
Maintainer: Pavan Kumar Gopidesu
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-Expression: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
License-File: LICENSE
Requires-Dist: datafusion>=51.0.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: pytest>=8.0 ; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23 ; extra == "dev"
Requires-Dist: prek ; extra == "dev"
Requires-Dist: ty>=0.0.18 ; extra == "dev"
Provides-Extra: dev

# qualink

Blazing fast data quality framework for Python, built on Apache DataFusion.

## Features

- **High Performance**: Leverages Apache DataFusion for fast data processing and validation.
- **Flexible Constraints**: Supports various data quality constraints including completeness, uniqueness, and custom assertions.
- **YAML Configuration**: Define validation suites declaratively using YAML files.
- **Multiple Output Formats**: Results can be formatted as human-readable text, JSON, or Markdown.
- **Async Support**: Built with asyncio for non-blocking operations.
- **Easy Integration**: Simple API for defining and running validation suites.

## Installation

Install qualink using pip:

```bash
pip install qualink
```

Or using uv:

```bash
uv add qualink
```

## Quick Start

Here's a basic example of using qualink to validate a CSV file:

```python
import asyncio
from datafusion import SessionContext
from qualink.checks import Check, Level
from qualink.constraints import Assertion
from qualink.core import ValidationSuite
from qualink.formatters import MarkdownFormatter


async def main() -> None:
    ctx = SessionContext()
    ctx.register_csv("users", "examples/users.csv")

    result = await (
        ValidationSuite()
        .on_data(ctx, "users")
        .with_name("User Data Quality")
        .add_check(Check.builder("Critical Checks").with_level(Level.ERROR).is_complete("user_id").build())
        .add_check(
            Check.builder("Data Quality")
            .with_level(Level.WARNING)
            .has_completeness("name", Assertion.greater_than_or_equal(0.95))
            .build()
        )
        .run()
    )

    print(MarkdownFormatter().format(result))


if __name__ == "__main__":
    asyncio.run(main())
```

## YAML Configuration

You can also define validation suites using YAML files for a declarative approach:

```yaml
suite:
  name: "User Data Quality"

data_source:
  type: csv
  path: "examples/users.csv"
  table_name: users

checks:
  - name: "Critical Checks"
    level: error
    rules:
      - is_complete: user_id
      - is_unique: email
      - has_size:
          gt: 0
  - name: "Data Quality"
    level: warning
    rules:
      - has_completeness:
          column: name
          gte: 0.95
```

Run the YAML configuration:

```python
import asyncio
from qualink.config import run_yaml
from qualink.formatters import HumanFormatter


async def main() -> None:
    result = await run_yaml("path/to/your/config.yaml")
    print(HumanFormatter().format(result))


if __name__ == "__main__":
    asyncio.run(main())
```
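The comparison keys in the YAML rules (`gt` under `has_size`, `gte` under `has_completeness`) read as threshold checks against an observed metric. The sketch below shows one way such keys could be evaluated. It is not qualink's implementation: `COMPARATORS` and `threshold_passes` are hypothetical names introduced for illustration, and only the two keys that appear in the example above are mapped.

```python
import operator

# Illustrative mapping only: `gt` and `gte` are the keys shown in the YAML
# example; how qualink interprets them internally is an assumption here.
COMPARATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
}


def threshold_passes(rule: dict, observed: float) -> bool:
    """Return True if `observed` satisfies every comparison key in `rule`.

    Keys that are not comparison operators (such as `column`) are ignored.
    """
    return all(
        COMPARATORS[key](observed, bound)
        for key, bound in rule.items()
        if key in COMPARATORS
    )


# has_completeness: {column: name, gte: 0.95} with 97% non-null values
print(threshold_passes({"column": "name", "gte": 0.95}, 0.97))  # True

# has_size: {gt: 0} on an empty table
print(threshold_passes({"gt": 0}, 0))  # False
```

Read this way, a `warning`-level check on `name` passes as long as at least 95% of its values are present, while the `error`-level `has_size` rule fails on an empty table.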

## Constraints

qualink supports the following constraint types:

- **Completeness**: Ensures a column has no null values or meets a minimum completeness ratio.
- **Uniqueness**: Checks for duplicate values in a column.
- **Assertion**: Custom assertions using SQL expressions.
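To make the completeness and uniqueness definitions concrete, here is a minimal plain-Python sketch of the two metrics. It does not use qualink or DataFusion. The helper names `completeness` and `is_unique` are hypothetical, and both the treatment of nulls in the uniqueness check and the value returned for an empty column are assumptions rather than documented qualink behavior.

```python
def completeness(values: list) -> float:
    """Fraction of non-null (non-None) entries.

    Returns 1.0 for an empty column (an assumed convention).
    """
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)


def is_unique(values: list) -> bool:
    """True when no non-null value appears more than once.

    Nulls are skipped here, which is an assumption.
    """
    non_null = [v for v in values if v is not None]
    return len(non_null) == len(set(non_null))


names = ["ada", None, "grace", "linus"]
print(completeness(names))          # 0.75
print(completeness(names) >= 0.95)  # False: below a 0.95 threshold

print(is_unique(["a@x.io", "b@x.io", "a@x.io"]))  # False: duplicate value
```

A rule such as `is_complete` corresponds to requiring a ratio of exactly 1.0, while `has_completeness` compares the ratio against a chosen threshold.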

## Formatters

Results can be formatted using:

- `HumanFormatter`: Human-readable text output.
- `JsonFormatter`: JSON format for programmatic processing.
- `MarkdownFormatter`: Markdown tables for documentation.

## Development

To set up the development environment:

```bash
git clone https://github.com/gopidesupavan/qualink.git
cd qualink
uv sync
```

Run tests:

```bash
uv run pytest
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [Apache DataFusion](https://datafusion.apache.org/) for the query engine
- [AWS Deequ](https://github.com/awslabs/deequ/) for the inspiration
- [Term Guard](https://github.com/withterm/term-guard)

