Metadata-Version: 2.4
Name: arff-csv-converter
Version: 1.1.0
Summary: A Python library for converting between CSV and ARFF (Weka) file formats
Author-email: Ricardo Montañana <rmontanana@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/rmontanana/arff-csv-converter
Project-URL: Documentation, https://github.com/rmontanana/arff-csv-converter#readme
Project-URL: Repository, https://github.com/rmontanana/arff-csv-converter.git
Project-URL: Bug Tracker, https://github.com/rmontanana/arff-csv-converter/issues
Project-URL: Changelog, https://github.com/rmontanana/arff-csv-converter/blob/main/CHANGELOG.md
Keywords: arff,csv,weka,converter,machine-learning,data-format,file-conversion
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"
Requires-Dist: mypy>=1.13.0; extra == "dev"
Requires-Dist: pandas-stubs>=2.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Dynamic: license-file

# ARFF-CSV Converter

[![Tests](https://github.com/rmontanana/arff-csv/actions/workflows/tests.yml/badge.svg)](https://github.com/rmontanana/arff-csv/actions/workflows/tests.yml)
[![PyPI version](https://badge.fury.io/py/arff-csv-converter.svg)](https://badge.fury.io/py/arff-csv-converter)
![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-brightgreen)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/rmontanana/arff-csv)

A Python library for converting between CSV and ARFF (Weka) file formats. ARFF (Attribute-Relation File Format) is the standard file format used by the Weka machine learning toolkit.

## Features

- **Bidirectional conversion**: Convert CSV to ARFF and ARFF to CSV
- **Automatic type detection**: Infers numeric, nominal, string, and date types from the data
- **CSV analysis mode**: Analyze CSV files and get suggestions for column types
- **Missing value handling**: Reads and writes missing values using the standard ARFF marker (`?`)
- **Sparse format support**: Read and write sparse ARFF format
- **Command-line interface**: Easy-to-use CLI for quick conversions
- **Pandas integration**: Seamlessly works with pandas DataFrames
- **Type hints**: Full type annotation support for better IDE integration
- **Well tested**: Comprehensive test suite with high coverage

## Installation

```bash
pip install arff-csv-converter
```

For development dependencies:

```bash
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "arff-csv-converter[dev]"
```

## Quick Start

### Python API

#### Convert CSV to ARFF

```python
from arff_csv import csv_to_arff

# Basic conversion
csv_to_arff("data.csv", "data.arff")

# With options
csv_to_arff(
    "data.csv",
    "data.arff",
    relation_name="my_dataset",
    nominal_columns=["class", "category"],
    comments=["Generated by my application"]
)
```

#### Convert ARFF to CSV

```python
from arff_csv import arff_to_csv

# Basic conversion
df = arff_to_csv("data.arff", "data.csv")

# Access the DataFrame directly
print(df.head())
```

#### Using the Converter Class

```python
from arff_csv import ArffConverter

converter = ArffConverter()

# CSV to ARFF
arff_data = converter.csv_to_arff("input.csv", "output.arff")
print(f"Relation: {arff_data.relation_name}")
print(f"Attributes: {len(arff_data.attributes)}")
print(f"Instances: {len(arff_data.data)}")

# ARFF to CSV
df = converter.arff_to_csv("input.arff", "output.csv")

# Work with DataFrames directly
df = converter.arff_to_dataframe("data.arff")
converter.dataframe_to_arff(df, "output.arff", relation_name="my_data")

# Get ARFF as string
arff_string = converter.dataframe_to_arff_string(df, relation_name="my_data")
```

#### Working with ArffData

```python
from arff_csv import ArffParser

parser = ArffParser()
arff_data = parser.parse_file("data.arff")

# Access metadata
print(f"Relation: {arff_data.relation_name}")
print(f"Comments: {arff_data.comments}")

# Access attributes
for attr in arff_data.attributes:
    print(f"  {attr.name}: {attr.type.name}")
    if attr.nominal_values:
        print(f"    Values: {attr.nominal_values}")

# Access data as DataFrame
df = arff_data.data
print(df.describe())

# Get attribute lists
numeric_attrs = arff_data.get_numeric_attributes()
nominal_attrs = arff_data.get_nominal_attributes()
```

### Command Line Interface

The package installs a command-line tool named `arff-csv`. Its subcommands are described below.

#### Analyze CSV (Recommended First Step)

Before converting, you can analyze your CSV file to get suggestions for column types:

```bash
arff-csv csv2arff iris.csv --analyze
```

This will output:
```
======================================================================
CSV ANALYSIS: iris.csv
======================================================================

Rows: 150
Columns: 6

DATA PREVIEW (first 5 rows):
----------------------------------------------------------------------
   Unnamed_0  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  class
0          0                5.1               3.5                1.4               0.2      0
1          1                4.9               3.0                1.4               0.2      0
2          2                4.7               3.2                1.3               0.2      0
3          3                4.6               3.1                1.5               0.2      0
4          4                5.0               3.6                1.4               0.2      0

COLUMN ANALYSIS:
----------------------------------------------------------------------
Column                    Type       Unique   Nulls    Reason
----------------------------------------------------------------------
Unnamed_0                 INTEGER    150      0        Integer values
sepal length (cm)         NUMERIC    35       0        Floating point values
sepal width (cm)          NUMERIC    23       0        Floating point values
petal length (cm)         NUMERIC    43       0        Floating point values
petal width (cm)          NUMERIC    22       0        Floating point values
class                     NOMINAL    3        0        Common target/class column name

COLUMNS SUGGESTED FOR EXCLUSION:
----------------------------------------------------------------------
  - Unnamed_0: Unique value for every row

SUGGESTED COMMAND:
----------------------------------------------------------------------

arff-csv csv2arff iris.csv iris.arff --relation "iris" --nominal \
    class --exclude Unnamed_0

SUMMARY:
----------------------------------------------------------------------
  Numeric columns:  5
  Nominal columns:  1
  String columns:   0
  Suggested excludes: 1

  Nominal: class
  Exclude: Unnamed_0
```

**Analysis options:**

| Option | Description | Default |
|--------|-------------|---------|
| `-a`, `--analyze` | Enable analysis mode (no conversion) | - |
| `--preview-rows N` | Number of rows to preview | 5 |
| `--nominal-threshold N` | Max unique values to consider nominal | 10 |

**Detection criteria:**

- **Nominal columns**: Binary values (0/1, yes/no, true/false), columns named "class"/"target"/"label", integer columns with few unique values
- **String columns**: Text with many unique values, long text (avg > 50 chars)
- **Numeric columns**: Floating point values, integers with many unique values
- **Exclusion suggestions**: Columns with a single unique value or an identifier-like unique value for every row
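
The criteria above can be approximated with a small heuristic. The sketch below is illustrative only and is not the library's implementation: the function name `suggest_type`, the order of the checks, and the rule that short text with few unique values is nominal are assumptions. Only the default threshold of 10 and the 50-character average come from the text above.

```python
TARGET_NAMES = {"class", "target", "label"}
BINARY_SETS = [{"0", "1"}, {"yes", "no"}, {"true", "false"}]


def _parses(text: str, cast) -> bool:
    """Return True if cast(text) succeeds (cast is int or float)."""
    try:
        cast(text)
        return True
    except ValueError:
        return False


def suggest_type(name: str, values: list[str], nominal_threshold: int = 10) -> str:
    """Suggest NOMINAL, NUMERIC, STRING, or EXCLUDE for one CSV column."""
    # Ignore empty cells and the ARFF missing-value marker.
    present = [v.strip() for v in values if v.strip() not in ("", "?")]
    unique = {v.lower() for v in present}
    if len(unique) <= 1:
        return "EXCLUDE"  # constant (or empty) column
    if name.lower() in TARGET_NAMES or unique in BINARY_SETS:
        return "NOMINAL"
    if all(_parses(v, int) for v in present):
        if len(unique) == len(present):
            return "EXCLUDE"  # identifier-like: a different value on every row
        return "NOMINAL" if len(unique) <= nominal_threshold else "NUMERIC"
    if all(_parses(v, float) for v in present):
        return "NUMERIC"
    average_length = sum(len(v) for v in present) / len(present)
    if average_length > 50 or len(unique) > nominal_threshold:
        return "STRING"
    return "NOMINAL"  # assumption: short text with few unique values


print(suggest_type("class", ["0", "1", "2", "0"]))
print(suggest_type("Unnamed_0", [str(i) for i in range(150)]))
```

Unlike the real analysis output, this sketch folds the exclusion suggestion into the returned type instead of reporting it separately.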

#### Convert CSV to ARFF

```bash
# Basic conversion
arff-csv csv2arff input.csv output.arff

# With options
arff-csv csv2arff input.csv output.arff \
    --relation "my_dataset" \
    --nominal class category \
    --string description \
    --exclude id \
    --comment "Generated on 2024-01-15" \
    --verbose
```

**Conversion options:**

| Option | Description | Default |
|--------|-------------|---------|
| `-r`, `--relation NAME` | Relation name | Input filename |
| `-n`, `--nominal COL...` | Columns to treat as nominal | - |
| `-s`, `--string COL...` | Columns to treat as string | - |
| `--exclude COL...` | Columns to exclude from conversion | - |
| `-m`, `--missing VALUE` | Missing value representation | `?` |
| `-c`, `--comment TEXT...` | Comments to add | - |
| `--delimiter CHAR` | CSV delimiter | `,` |
| `--encoding ENC` | File encoding | `utf-8` |
| `-v`, `--verbose` | Verbose output | - |

#### Convert ARFF to CSV

```bash
# Basic conversion
arff-csv arff2csv input.arff output.csv

# With options
arff-csv arff2csv input.arff output.csv \
    --delimiter ";" \
    --include-index \
    --verbose
```

#### Display ARFF file information

```bash
arff-csv info data.arff
```

Output:
```
ARFF File: data.arff
Relation: iris
Instances: 150
Attributes: 5

Attribute Information:
------------------------------------------------------------
  sepallength: NUMERIC
  sepalwidth: NUMERIC
  petallength: NUMERIC
  petalwidth: NUMERIC
  class: NOMINAL {Iris-setosa, Iris-versicolor, Iris-virginica}

Data Preview (first 5 rows):
------------------------------------------------------------
   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
...
```

## ARFF Format Reference

ARFF (Attribute-Relation File Format) is a text format that describes a dataset as a relation with named attributes. The format consists of:

1. **Header section**: Relation name and attribute definitions
2. **Data section**: The actual data instances

### Example ARFF File

```arff
% This is a comment
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
```
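
To make the two-section layout concrete, here is a minimal sketch that splits simple ARFF text into its header and data parts using only the standard library. It is not the library's `ArffParser`: `split_arff` is a hypothetical helper, and it deliberately ignores quoted values, sparse rows, and attribute names that contain spaces.

```python
def split_arff(text: str) -> tuple[str, list[tuple[str, str]], list[list[str]]]:
    """Return (relation, [(attribute name, type)], rows) from simple ARFF text."""
    relation = ""
    attributes: list[tuple[str, str]] = []
    rows: list[list[str]] = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([cell.strip() for cell in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()  # header keywords are case-insensitive
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, attr_type = rest.strip().partition(" ")
            attributes.append((name, attr_type.strip()))
        elif keyword == "@DATA":
            in_data = True  # everything after this line is an instance
    return relation, attributes, rows


sample = """\
% This is a comment
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

@DATA
5.1,3.5,Iris-setosa
"""

relation, attributes, rows = split_arff(sample)
print(relation, len(attributes), len(rows))  # iris 3 1
```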

### Supported Attribute Types

| Type | Description | Example |
|------|-------------|---------|
| NUMERIC | Floating-point numbers | `@ATTRIBUTE value NUMERIC` |
| INTEGER | Integer numbers | `@ATTRIBUTE count INTEGER` |
| REAL | Alias for NUMERIC | `@ATTRIBUTE value REAL` |
| STRING | Text strings | `@ATTRIBUTE name STRING` |
| NOMINAL | Categorical values | `@ATTRIBUTE class {a, b, c}` |
| DATE | Date/time values | `@ATTRIBUTE date DATE 'yyyy-MM-dd'` |

### Missing Values

Missing values are represented by `?` in ARFF format:

```arff
@DATA
5.1,3.5,?,0.2,Iris-setosa
?,3.0,1.4,0.2,?
```

## API Reference

### Main Functions

- `csv_to_arff(csv_path, arff_path, ...)` - Convert CSV file to ARFF
- `arff_to_csv(arff_path, csv_path, ...)` - Convert ARFF file to CSV

### Classes

- `ArffConverter` - Main converter class with full functionality
- `ArffParser` - Parser for reading ARFF files
- `ArffWriter` - Writer for creating ARFF files
- `ArffData` - Container for parsed ARFF data
- `Attribute` - ARFF attribute definition

### Exceptions

- `ArffCsvError` - Base exception for all errors
- `ArffParseError` - Error parsing ARFF files
- `ArffWriteError` - Error writing ARFF files
- `CsvParseError` - Error parsing CSV files
- `InvalidAttributeError` - Invalid attribute definition
- `MissingDataError` - Required data missing

## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/rmontanana/arff-csv-converter.git
cd arff-csv-converter

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode
pip install -e ".[dev]"
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=arff_csv --cov-report=html

# Run specific test file
pytest tests/test_parser.py

# Run with verbose output
pytest -v
```

### Code Quality

```bash
# Run linter
ruff check src tests

# Run formatter
ruff format src tests

# Run type checker
mypy src
```

### Building

```bash
# Install build tools
pip install build twine

# Build the package
python -m build

# Check the package
twine check dist/*
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Related Projects

- [Weka](https://www.cs.waikato.ac.nz/ml/weka/) - The original machine learning toolkit that uses ARFF format
- [liac-arff](https://github.com/renatopp/liac-arff) - Another Python library for ARFF files
- [scipy.io.arff](https://docs.scipy.org/doc/scipy/reference/io.html#module-scipy.io.arff) - SciPy's ARFF reader

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for a list of changes.
