Metadata-Version: 2.4
Name: csv-schema-validator
Version: 0.2.0
Summary: CSV file validation against given data schema.
Author: frycz
Requires-Python: >=3.7
Requires-Dist: pydantic>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# CSV Schema Validator

A Python library for validating CSV files against a schema written in JSON. It checks data types, regex patterns, enumerated values, numeric ranges, and required fields, and reports each problem with its row and column.

## Features

- **Type Validation**: Support for `string`, `number`, `integer`, and `boolean` data types
- **Pattern Matching**: Regex pattern validation for strings (e.g., email formats, dates)
- **Enum Validation**: Restrict values to predefined options
- **Range Validation**: Min/max constraints for numeric fields
- **Required Field Checking**: Ensure mandatory fields are present
- **Detailed Error Reporting**: Comprehensive validation results with row/column information
- **Command Line Interface**: Easy-to-use CLI for quick validation
- **Python API**: Programmatic access for integration into larger workflows

## Installation

```bash
pip install csv-schema-validator
```

## Quick Start

### Command Line Usage

```bash
# Validate a CSV file against a schema
csv-schema-validator employees.csv employee_schema.json

# Show help
csv-schema-validator --help

# Show version
csv-schema-validator --version
```

### Python API Usage

```python
from csv_schema_validator import validate_csv

# Define your schema
schema = {
    "name": "Employee Data Schema",
    "description": "Schema for validating employee CSV files",
    "fields": [
        {
            "name": "employee_id",
            "type": "integer",
            "required": True,
            "description": "Unique employee identifier"
        },
        {
            "name": "email",
            "type": "string",
            "required": True,
            "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
            "description": "Valid email address"
        },
        {
            "name": "department",
            "type": "string",
            "required": True,
            "enum": ["Engineering", "Marketing", "Sales", "HR", "Finance"],
            "description": "Employee department"
        },
        {
            "name": "salary",
            "type": "number",
            "required": True,
            "min": 30000,
            "max": 200000,
            "description": "Annual salary in USD"
        }
    ]
}

# Validate your CSV
result = validate_csv("employees.csv", schema)

if result["is_valid"]:
    print("✅ Validation passed")
else:
    print(f"❌ Validation failed: {len(result['errors'])} errors found")
    for error in result["errors"]:
        print(f"Row {error['row']}, Column {error['column']}: {error['error_type']} - {error['error_message']}")
```
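
The command line tool reads the schema from a JSON file, while the Python example above passes a dictionary. A minimal sketch of bridging the two is shown here. `load_schema` is a hypothetical helper, not part of the library, and it assumes only that the file holds a JSON object with a `fields` key, as in the schema above.

```python
import json
from pathlib import Path


def load_schema(path):
    """Read a schema JSON file into a dict (hypothetical helper, not a library function)."""
    with Path(path).open(encoding="utf-8") as handle:
        schema = json.load(handle)
    # Light sanity check only; the library performs its own schema validation.
    if not isinstance(schema, dict) or "fields" not in schema:
        raise ValueError(f"{path} does not look like a schema: missing 'fields'")
    return schema


# Usage, assuming the library is installed and both files exist:
# from csv_schema_validator import validate_csv
# result = validate_csv("employees.csv", load_schema("employee_schema.json"))
```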

## Schema Format

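The schema file is a JSON document with the following structure. In this template, the `type` value lists the allowed options separated by `|`; a real schema uses exactly one of them.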

```json
{
  "name": "Schema Name",
  "description": "Schema description",
  "fields": [
    {
      "name": "field_name",
      "type": "string|number|integer|boolean",
      "required": true,
      "description": "Field description",
      "pattern": "regex_pattern",
      "enum": ["value1", "value2"],
      "min": 0,
      "max": 100
    }
  ]
}
```

### Field Properties

| Property | Type | Required | Description |
|----------|------|----------|-------------|
| `name` | string | YES | Field name (must match CSV header) |
| `type` | string | YES | Data type: `string`, `number`, `integer`, or `boolean` |
| `required` | boolean | YES | Whether the column must be present in the CSV header |
| `description` | string | NO | Human-readable field description |
| `pattern` | string | NO | Regex pattern for string validation |
| `enum` | array | NO | Allowed values for the field |
| `min` | integer | NO | Minimum value (for `number` and `integer` fields) |
| `max` | integer | NO | Maximum value (for `number` and `integer` fields) |
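
The table above can be turned into a quick pre-flight check on a field definition before handing the schema to the validator. The sketch below is illustrative: `check_field_definition` is a hypothetical helper, not the library's own schema validation, and the `min`/`max` ordering check is an extra sanity check that the table does not state.

```python
ALLOWED_TYPES = {"string", "number", "integer", "boolean"}
REQUIRED_KEYS = ("name", "type", "required")


def check_field_definition(field):
    """Return a list of problems found in one schema field dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in field:
            problems.append(f"missing required property '{key}'")
    if "type" in field and field["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type '{field['type']}'")
    if "required" in field and not isinstance(field["required"], bool):
        problems.append("'required' must be a boolean")
    # Assumption: a min above max is never intended.
    if "min" in field and "max" in field and field["min"] > field["max"]:
        problems.append("'min' is greater than 'max'")
    return problems


salary = {"name": "salary", "type": "number", "required": True, "min": 30000, "max": 200000}
print(check_field_definition(salary))                           # []
print(check_field_definition({"name": "age", "type": "float"}))
# ["missing required property 'required'", "unknown type 'float'"]
```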

## Data Types

### String
- Basic string validation
- Optional regex pattern matching
- Optional enum value restriction

### Number
- Floating-point number validation
- Optional min/max range constraints

### Integer
- Whole number validation
- Optional min/max range constraints

### Boolean
- Accepts: `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`
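
To make the four types concrete, here is a rough sketch of how a raw CSV cell could be converted according to its declared type. It is not the library's implementation. `coerce_cell` is a hypothetical name, and the case-insensitive boolean matching and whitespace stripping are assumptions, so the library's exact rules may differ.

```python
TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}


def coerce_cell(raw, type_name):
    """Convert a raw CSV cell to the given schema type, raising ValueError on mismatch."""
    if type_name == "string":
        return raw
    text = raw.strip()  # assumption: surrounding whitespace is ignored
    if type_name == "integer":
        return int(text)  # raises ValueError for "1.5" or "abc"
    if type_name == "number":
        return float(text)  # raises ValueError for "abc"
    if type_name == "boolean":
        lowered = text.lower()  # assumption: matching is case-insensitive
        if lowered in TRUE_VALUES:
            return True
        if lowered in FALSE_VALUES:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    raise ValueError(f"unknown type: {type_name!r}")


print(coerce_cell("75000", "number"))   # 75000.0
print(coerce_cell("yes", "boolean"))    # True
print(coerce_cell("off", "boolean"))    # False
```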

## Examples

### Employee Data Validation

**CSV File (`employees.csv`):**
```csv
employee_id,first_name,last_name,email,department,salary,hire_date,is_active
1,John,Doe,john.doe@company.com,Engineering,75000,2023-01-15,true
2,Jane,Smith,jane.smith@company.com,Marketing,65000,2023-03-22,true
```

**Schema File (`employee_schema.json`):**
```json
{
  "name": "Employee Data Schema",
  "description": "Schema for validating employee CSV files",
  "fields": [
    {
      "name": "employee_id",
      "type": "integer",
      "required": true,
      "description": "Unique employee identifier"
    },
    {
      "name": "first_name",
      "type": "string",
      "required": true,
      "description": "Employee's first name"
    },
    {
      "name": "last_name",
      "type": "string",
      "required": true,
      "description": "Employee's last name"
    },
    {
      "name": "email",
      "type": "string",
      "required": true,
      "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
      "description": "Valid email address"
    },
    {
      "name": "department",
      "type": "string",
      "required": true,
      "enum": ["Engineering", "Marketing", "Sales", "HR", "Finance"],
      "description": "Employee department"
    },
    {
      "name": "salary",
      "type": "number",
      "required": true,
      "min": 30000,
      "max": 200000,
      "description": "Annual salary in USD"
    },
    {
      "name": "hire_date",
      "type": "string",
      "required": true,
      "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
      "description": "Hire date in YYYY-MM-DD format"
    },
    {
      "name": "is_active",
      "type": "boolean",
      "required": true,
      "description": "Whether employee is currently active"
    }
  ]
}
```
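
The `pattern` values above are written with doubled backslashes because they sit inside JSON strings; after JSON decoding, `\\d` becomes the regex `\d`. The sketch below tries the two patterns from this schema with Python's standard `re` module. Which `re` function the library calls internally is not documented here; since both patterns are anchored with `^` and `$`, `search` is used as a reasonable stand-in.

```python
import json
import re

# The same pattern strings as they appear in employee_schema.json.
patterns = json.loads(r'''
{
  "email": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
  "hire_date": "^\\d{4}-\\d{2}-\\d{2}$"
}
''')

email_re = re.compile(patterns["email"])
date_re = re.compile(patterns["hire_date"])

print(bool(email_re.search("john.doe@company.com")))  # True
print(bool(email_re.search("john.doe@company")))      # False: no top-level domain
print(bool(date_re.search("2023-01-15")))             # True
print(bool(date_re.search("15/01/2023")))             # False: wrong layout
print(bool(date_re.search("2023-13-45")))             # True: shape only, not a calendar check
```

As the last line shows, the date pattern checks only the shape of the value, so an impossible date such as `2023-13-45` still matches.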

## Command Line Options

| Option | Description |
|--------|-------------|
| `<csv_file>` | Path to the CSV file to validate |
| `<schema_file>` | Path to the JSON schema file |
| `-h, --help` | Show help message |
| `-v, --version` | Show version information |

## Return Value

`validate_csv` returns a dictionary with the following structure:

```python
{
    "is_valid": bool,
    "errors": [
        {
            "error_type": str,  # e.g., "RequiredFieldError", "TypeValidationError", etc.
            "error_message": str,
            "row": int,  # Row number (-1 for header errors)
            "column": str,  # Column name
            "value": str,  # The value that caused the error
            "details": dict  # Additional error details
        }
    ]
}
```
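
A result in this shape can be turned into a readable report with a few lines of code. The following is a sketch rather than a library feature: `format_report` is a hypothetical helper, and the sample dictionary is hand-written to match the documented structure, not captured from a real run.

```python
def format_report(result):
    """Build a text report from a result dict shaped like the structure above."""
    if result["is_valid"]:
        return "OK: no errors"
    lines = [f"{len(result['errors'])} error(s) found"]
    for error in result["errors"]:
        # Row -1 marks a header-level problem rather than a data row.
        where = "header" if error["row"] == -1 else f"row {error['row']}"
        lines.append(
            f"  [{where}] {error['column']}: "
            f"{error['error_type']}: {error['error_message']}"
        )
    return "\n".join(lines)


# Hand-written sample in the documented shape; messages are made up for illustration.
sample = {
    "is_valid": False,
    "errors": [
        {"error_type": "RequiredFieldError", "error_message": "column missing",
         "row": -1, "column": "hire_date", "value": "", "details": {}},
        {"error_type": "RangeValidationError", "error_message": "value too small",
         "row": 2, "column": "salary", "value": "1000", "details": {}},
    ],
}
print(format_report(sample))
```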

## Error Types

- `RequiredFieldError`: Required field is missing from CSV header
- `TypeValidationError`: Value doesn't match expected data type
- `PatternValidationError`: String doesn't match regex pattern
- `EnumValidationError`: Value not in allowed enum values
- `RangeValidationError`: Numeric value outside allowed range (too small or too large)
- `EmptyFileError`: CSV file is empty or has no data rows
- `CSVFileError`: General CSV file reading errors (file not found, permission denied, etc.)
- `SchemaValidationError`: Schema structure validation errors
- `InvalidJSONError`: Invalid JSON in schema file
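
When a file produces many errors, a tally per error type is often a more useful first look than the full list. A small sketch using the standard library's `collections.Counter` follows; `count_by_type` is a hypothetical helper, and the sample list is hand-written for illustration.

```python
from collections import Counter


def count_by_type(errors):
    """Count entries of a result's "errors" list by their error_type."""
    return Counter(error["error_type"] for error in errors)


# Hand-written sample; only the key used by the helper is filled in.
sample_errors = [
    {"error_type": "TypeValidationError"},
    {"error_type": "PatternValidationError"},
    {"error_type": "TypeValidationError"},
]

for error_type, count in count_by_type(sample_errors).most_common():
    print(f"{error_type}: {count}")
# TypeValidationError: 2
# PatternValidationError: 1
```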

## Requirements

- Python 3.7+
- pydantic >= 2.0.0

## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Support

If you encounter any issues or have questions, please file an issue on the GitHub repository.
