Metadata-Version: 2.4
Name: cleaning-agent
Version: 0.1.10
Summary: Intelligent data cleaning agent for automated data quality improvement
Author-email: StepFunction AI <team@stepfunction.ai>
Maintainer-email: StepFunction AI <team@stepfunction.ai>
License-Expression: MIT
Project-URL: Homepage, https://github.com/stepfnAI/cleaning-agent
Project-URL: Documentation, https://cleaning-agent.readthedocs.io
Project-URL: Repository, https://github.com/stepfnAI/cleaning-agent.git
Project-URL: Issues, https://github.com/stepfnAI/cleaning-agent/issues
Project-URL: Changelog, https://github.com/stepfnAI/cleaning-agent/blob/main/CHANGELOG.md
Keywords: data-cleaning,data-quality,ai-agent,machine-learning,data-preprocessing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: python-dotenv
Requires-Dist: sfn-blueprint>=0.7.4
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-mock>=3.10.0; extra == "test"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=1.0.0; extra == "docs"

# Cleaning Agent

Intelligent data cleaning agent for automated data quality improvement.

## 🚀 Features

- **Automated Data Quality Analysis**: Detect missing values, duplicates, outliers, and data type inconsistencies
- **Intelligent Cleaning Strategies**: AI-powered decision making for optimal cleaning approaches
- **LLM-Driven Cleaning**: Leverage Large Language Models to automatically generate and execute Python code for complex data cleaning tasks.
- **Multiple Data Format Support**: CSV, Excel, JSON, Parquet, and pandas DataFrames
- **Comprehensive Reporting**: Detailed cleaning reports with metrics and recommendations
- **Configurable Parameters**: Customize cleaning behavior and thresholds
- **Command Line Interface**: Easy-to-use CLI for batch processing
- **Python API**: Simple integration into existing workflows

## 🏗️ Architecture

The Cleaning Agent follows a modular architecture:

```
CleaningAgent
├── DataQualityAnalyzer    # Analyzes data quality and detects issues
├── CleaningValidator      # Validates cleaned data and provides assessment
├── Configuration          # Manages agent settings and parameters
└── Models                 # Data structures for requests, responses, and reports
```

## Data Quality Metrics
- **Overall Quality Score**: 0-1 scale based on multiple factors
- **Missing Value Analysis**: Per-column missing value statistics
- **Duplicate Analysis**: Duplicate row counts and percentages
- **Data Type Analysis**: Column data type distribution
- **Uniqueness Analysis**: Unique value counts per column

### 🔍 Supported Data Quality Issues
### Missing Values
- **Detection**: Automatic identification of columns with missing data
- **Handling**: Smart imputation strategies (median for numerical, mode for categorical)
- **Thresholds**: Configurable missing value percentage limits

### Duplicate Rows
- **Detection**: Identifies exact and near-duplicate rows
- **Removal**: Configurable duplicate removal strategies
- **Analysis**: Reports duplicate patterns and impact

### Data Type Inconsistencies
- **Detection**: Identifies columns with mixed or inappropriate data types
- **Standardization**: Converts data types for consistency
- **Validation**: Ensures data type appropriateness

### Outliers
- **Detection**: Statistical outlier detection using IQR method
- **Handling**: Configurable outlier treatment (capping, removal, investigation)
- **Impact Assessment**: Reports outlier impact on data quality


## Developer Setup and Testing

### Setup Instructions

1. Clone the repository and checkout the feature branch:
   ```bash
   git clone https://github.com/stepfnAI/cleaning_agent.git 
   cd cleaning_agent
   git checkout review
   ```

2. Install uv (if not already installed):
   ```bash
   # Option A: Using the standalone installer (recommended for macOS/Linux)
   curl -LsSf https://astral.sh/uv/install.sh | sh
   
   # Option B: Using pip (if uv is already in an existing environment)
   pip install uv
   ```

3. Create and activate a virtual environment:
   ```bash
   uv venv --python=3.10 venv
   source venv/bin/activate
   ```

4. Install the project in editable mode with development dependencies:
   ```bash
   uv pip install -e ".[dev]"
   ```

5. Clone and set up the sfn_blueprint dependency:
   ```bash
   cd ..
   git clone https://github.com/stepfnAI/sfn_blueprint.git
   cd sfn_blueprint
   source ../cleaning_agent/venv/bin/activate
   git checkout dev
   uv pip install -e .
   cd ../cleaning_agent
   ```

6. Set your OpenAI API key:
   ```bash
   export OPENAI_API_KEY='your-api-key-here'
   ```

### Example 

1. Run the example script:
   ```bash
   python examples/basic_usage.py
   ```


### Running Tests

1. Run the test suite:
   ```bash
   # Run all tests
   pytest tests/ -s
   
   # Run specific test files
   pytest tests/test_agent.py -s
   pytest tests/test_context_integration.py -s 
   pytest tests/test_execution_validation.py -s 
   pytest tests/test_llm_driven_cleaning.py -s
   pytest tests/test_llm_driven_cleaning_with_sql.py -s
   ```

#####    Test Structure

```
tests/
├── test_agent.py                                        # Agent functionality tests
├── test_context_integration.py                          # Context integration tests
├── test_execution_validation.py                         # Execution validation tests
├── test_llm_driven_cleaning.py                          # LLM-driven cleaning tests
├── tests/test_llm_driven_cleaning_with_sql.py           # SQL cleaning tests
```

##### Test Dependencies
The following testing dependencies are automatically installed:
- `pytest>=7.0.0` - Test framework
- `pytest-cov>=4.0.0` - Coverage reporting
- `black>=23.0.0` - Code formatting
- `isort>=5.12.0` - Import sorting
- `flake8>=6.0.0` - Linting
- `mypy>=1.0.0` - Type checking

## 📊 Output and Reporting

### Cleaning Response
```python
{
    "success": True,
    "cleaned_data": DataFrame,
    "report": {
        "report_id": "uuid",
        "timestamp": "2024-01-01T00:00:00Z",
        "data_summary": {
            "original_shape": (1000, 10),
            "cleaned_shape": (950, 10),
            "rows_removed": 50,
            "columns_processed": 10
        },
        "issues_detected": [...],
        "cleaning_operations": [...],
        "quality_metrics": {
            "original_quality_score": 0.65,
            "final_quality_score": 0.89,
            "improvement": 0.24
        },
        "recommendations": [...],
        "execution_time": 2.34
    },
    "message": "Data cleaning completed successfully",
    "errors": [],
    "metadata": {...}
}
```

## Additional Information

- **Python Version**: 3.10+
- **Dependencies**: Managed through `pyproject.toml`
- **Code Style**: Follows PEP 8 with Black formatting
