Metadata-Version: 2.4
Name: sigil-pipeline
Version: 2.6.1
Summary: Static analysis pipeline for generating high-quality Rust code datasets for model fine-tuning. Phase 2 dataset generation with Phase 1 format alignment.
Author-email: Dave Tofflemire <davetmire85@gmail.com>
Maintainer-email: Dave Tofflemire <davetmire85@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Superuser666-Sigil/SigilDERG-Data_Production
Project-URL: Documentation, https://github.com/Superuser666-Sigil/SigilDERG-Data_Production/wiki
Project-URL: Repository, https://github.com/Superuser666-Sigil/SigilDERG-Data_Production
Project-URL: Bug Tracker, https://github.com/Superuser666-Sigil/SigilDERG-Data_Production/issues
Project-URL: Related Projects, https://github.com/Superuser666-Sigil
Project-URL: Finetuner, https://github.com/Superuser666-Sigil/SigilDERG-Finetuner
Project-URL: Evaluation, https://github.com/Superuser666-Sigil/human-eval-Rust
Project-URL: Release Notes, https://github.com/Superuser666-Sigil/SigilDERG-Data_Production/releases
Keywords: rust,crate,analysis,pipeline,dataset,fine-tuning,static-analysis
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: System :: Distributed Computing
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests<3.0.0,>=2.31.0
Requires-Dist: aiohttp<4.0.0,>=3.9.0
Requires-Dist: pydantic<3.0.0,>=2.5.0
Requires-Dist: pydantic-settings<3.0.0,>=2.1.0
Requires-Dist: python-dotenv<2.0.0,>=1.0.0
Requires-Dist: click<9.0.0,>=8.1.0
Requires-Dist: rich<14.0.0,>=13.7.0
Requires-Dist: tqdm<5.0.0,>=4.66.0
Requires-Dist: psutil<7.0.0,>=6.1.1
Requires-Dist: humanize<5.0.0,>=4.8.0
Requires-Dist: colorama<1.0.0,>=0.4.6
Requires-Dist: termcolor>=3.2.0
Requires-Dist: tabulate<1.0.0,>=0.9.0
Requires-Dist: asyncio-throttle<2.0.0,>=1.0.0
Requires-Dist: tenacity<9.0.0,>=8.2.0
Requires-Dist: structlog<24.0.0,>=23.2.0
Requires-Dist: marshmallow<4.0.0,>=3.20.0
Requires-Dist: cerberus<2.0.0,>=1.3.0
Requires-Dist: pathspec<1.0.0,>=0.11.0
Requires-Dist: watchdog<4.0.0,>=3.0.0
Requires-Dist: httpx<1.0.0,>=0.25.0
Requires-Dist: urllib3<3.0.0,>=2.0.0
Requires-Dist: orjson<4.0.0,>=3.9.0
Requires-Dist: ujson<6.0.0,>=5.8.0
Requires-Dist: python-dateutil<3.0.0,>=2.8.0
Requires-Dist: pytz>=2023.3
Requires-Dist: pyyaml<7.0.0,>=6.0.0
Requires-Dist: jsonschema<5.0.0,>=4.20.0
Requires-Dist: toml<1.0.0,>=0.10.0
Requires-Dist: tree-sitter<1.0.0,>=0.22.0
Requires-Dist: tree-sitter-rust<1.0.0,>=0.20.0
Provides-Extra: web
Requires-Dist: crawl4ai<1.0.0,>=0.6.0; extra == "web"
Requires-Dist: playwright<2.0.0,>=1.40.0; extra == "web"
Requires-Dist: beautifulsoup4<5.0.0,>=4.12.0; extra == "web"
Requires-Dist: lxml<6.0.0,>=5.3.0; extra == "web"
Provides-Extra: datasets
Requires-Dist: datasets<3.0.0,>=2.14.0; extra == "datasets"
Requires-Dist: huggingface-hub<1.0.0,>=0.19.0; extra == "datasets"
Provides-Extra: finetuning
Requires-Dist: sigilderg-finetuner>=3.0.0; extra == "finetuning"
Provides-Extra: evaluation
Requires-Dist: human-eval-rust>=2.3.0; extra == "evaluation"
Provides-Extra: ecosystem
Requires-Dist: sigilderg-finetuner>=3.0.0; extra == "ecosystem"
Requires-Dist: human-eval-rust>=2.3.0; extra == "ecosystem"
Provides-Extra: ai
Requires-Dist: openai<2.0.0,>=1.3.0; extra == "ai"
Requires-Dist: anthropic<1.0.0,>=0.7.0; extra == "ai"
Requires-Dist: litellm<2.0.0,>=1.30.0; extra == "ai"
Requires-Dist: llama-cpp-python<1.0.0,>=0.2.0; extra == "ai"
Requires-Dist: tiktoken<1.0.0,>=0.5.0; extra == "ai"
Provides-Extra: gpu
Requires-Dist: llama-cpp-python<1.0.0,>=0.2.0; extra == "gpu"
Requires-Dist: nvidia-ml-py>=12.535.0; extra == "gpu"
Requires-Dist: pynvml>=11.5.0; extra == "gpu"
Requires-Dist: gpustat>=1.1.0; extra == "gpu"
Provides-Extra: cuda
Requires-Dist: llama-cpp-python<1.0.0,>=0.2.0; extra == "cuda"
Requires-Dist: nvidia-ml-py>=12.535.0; extra == "cuda"
Requires-Dist: pynvml>=11.5.0; extra == "cuda"
Requires-Dist: gpustat>=1.1.0; extra == "cuda"
Requires-Dist: cupy-cuda12x>=13.0.0; extra == "cuda"
Provides-Extra: ml
Requires-Dist: scikit-learn<2.0.0,>=1.3.0; extra == "ml"
Requires-Dist: numpy<2.0.0,>=1.24.0; extra == "ml"
Requires-Dist: pandas<3.0.0,>=2.0.0; extra == "ml"
Provides-Extra: caching
Requires-Dist: cachetools<6.0.0,>=5.3.0; extra == "caching"
Requires-Dist: aiofiles<25.0.0,>=24.1.0; extra == "caching"
Requires-Dist: redis<6.0.0,>=5.0.0; extra == "caching"
Requires-Dist: requests-cache<2.0.0,>=1.1.1; extra == "caching"
Provides-Extra: microservices
Requires-Dist: PyJWT<3.0.0,>=2.8.0; extra == "microservices"
Requires-Dist: prometheus-client<1.0.0,>=0.17.0; extra == "microservices"
Requires-Dist: asyncio-mqtt<1.0.0,>=0.16.0; extra == "microservices"
Provides-Extra: security
Requires-Dist: presidio-analyzer<3.0.0,>=2.2.0; extra == "security"
Requires-Dist: presidio-anonymizer<3.0.0,>=2.2.0; extra == "security"
Requires-Dist: spacy<4.0.0,>=3.7.0; extra == "security"
Requires-Dist: bandit<2.0.0,>=1.7.0; extra == "security"
Requires-Dist: safety<2.4.0,>=2.3.0; extra == "security"
Provides-Extra: analysis
Requires-Dist: textstat<1.0.0,>=0.7.0; extra == "analysis"
Requires-Dist: vaderSentiment<4.0.0,>=3.3.0; extra == "analysis"
Requires-Dist: textblob<1.0.0,>=0.17.0; extra == "analysis"
Provides-Extra: http2
Requires-Dist: h2<5.0.0,>=3.0.0; extra == "http2"
Requires-Dist: hpack<5.0.0,>=4.0.0; extra == "http2"
Requires-Dist: hyperframe<7.0.0,>=6.0.0; extra == "http2"
Provides-Extra: all
Requires-Dist: crawl4ai<1.0.0,>=0.6.0; extra == "all"
Requires-Dist: playwright<2.0.0,>=1.40.0; extra == "all"
Requires-Dist: beautifulsoup4<5.0.0,>=4.12.0; extra == "all"
Requires-Dist: lxml<6.0.0,>=5.3.0; extra == "all"
Requires-Dist: openai<2.0.0,>=1.3.0; extra == "all"
Requires-Dist: anthropic<1.0.0,>=0.7.0; extra == "all"
Requires-Dist: litellm<2.0.0,>=1.30.0; extra == "all"
Requires-Dist: llama-cpp-python<1.0.0,>=0.2.0; extra == "all"
Requires-Dist: tiktoken<1.0.0,>=0.5.0; extra == "all"
Requires-Dist: nvidia-ml-py>=12.535.0; extra == "all"
Requires-Dist: pynvml>=11.5.0; extra == "all"
Requires-Dist: gpustat>=1.1.0; extra == "all"
Requires-Dist: scikit-learn<2.0.0,>=1.3.0; extra == "all"
Requires-Dist: numpy<2.0.0,>=1.24.0; extra == "all"
Requires-Dist: pandas<3.0.0,>=2.0.0; extra == "all"
Requires-Dist: cachetools<6.0.0,>=5.3.0; extra == "all"
Requires-Dist: aiofiles<25.0.0,>=24.1.0; extra == "all"
Requires-Dist: redis<6.0.0,>=5.0.0; extra == "all"
Requires-Dist: requests-cache<2.0.0,>=1.1.1; extra == "all"
Requires-Dist: PyJWT<3.0.0,>=2.8.0; extra == "all"
Requires-Dist: prometheus-client<1.0.0,>=0.17.0; extra == "all"
Requires-Dist: asyncio-mqtt<1.0.0,>=0.16.0; extra == "all"
Requires-Dist: presidio-analyzer<3.0.0,>=2.2.0; extra == "all"
Requires-Dist: presidio-anonymizer<3.0.0,>=2.2.0; extra == "all"
Requires-Dist: spacy<4.0.0,>=3.7.0; extra == "all"
Requires-Dist: bandit<2.0.0,>=1.7.0; extra == "all"
Requires-Dist: safety<2.4.0,>=2.3.0; extra == "all"
Requires-Dist: h2<5.0.0,>=3.0.0; extra == "all"
Requires-Dist: hpack<5.0.0,>=4.0.0; extra == "all"
Requires-Dist: hyperframe<7.0.0,>=6.0.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest<8.0.0,>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio<1.0.0,>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov<5.0.0,>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock<4.0.0,>=3.11.0; extra == "dev"
Requires-Dist: pytest-benchmark<5.0.0,>=4.0.0; extra == "dev"
Requires-Dist: black>=25.0.0; extra == "dev"
Requires-Dist: flake8<7.0.0,>=6.0.0; extra == "dev"
Requires-Dist: pyright<2.0.0,>=1.1.350; extra == "dev"
Requires-Dist: isort<6.0.0,>=5.12.0; extra == "dev"
Requires-Dist: pre-commit<4.0.0,>=3.3.0; extra == "dev"
Requires-Dist: sphinx<8.0.0,>=7.1.0; extra == "dev"
Requires-Dist: sphinx-rtd-theme<2.0.0,>=1.3.0; extra == "dev"
Requires-Dist: myst-parser<3.0.0,>=2.0.0; extra == "dev"
Requires-Dist: types-toml>=0.10.8; extra == "dev"
Requires-Dist: types-tqdm>=4.67.0; extra == "dev"
Requires-Dist: pandas-stubs>=2.3.0; extra == "dev"
Requires-Dist: types-aiofiles>=24.1.0; extra == "dev"
Requires-Dist: types-cachetools>=6.1.0; extra == "dev"
Requires-Dist: hypothesis<7.0.0,>=6.92.0; extra == "dev"
Requires-Dist: mutmut<4.0.0,>=3.0.0; extra == "dev"
Requires-Dist: locust<3.0.0,>=2.20.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest<8.0.0,>=7.4.0; extra == "test"
Requires-Dist: pytest-asyncio<1.0.0,>=0.21.0; extra == "test"
Requires-Dist: pytest-cov<5.0.0,>=4.1.0; extra == "test"
Requires-Dist: pytest-mock<4.0.0,>=3.11.0; extra == "test"
Requires-Dist: pytest-benchmark<5.0.0,>=4.0.0; extra == "test"
Requires-Dist: hypothesis<7.0.0,>=6.92.0; extra == "test"
Provides-Extra: docs
Requires-Dist: sphinx<8.0.0,>=7.1.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme<2.0.0,>=1.3.0; extra == "docs"
Requires-Dist: myst-parser<3.0.0,>=2.0.0; extra == "docs"
Provides-Extra: observability
Requires-Dist: structlog<24.0.0,>=23.2.0; extra == "observability"
Requires-Dist: opentelemetry-api<2.0.0,>=1.22.0; extra == "observability"
Requires-Dist: opentelemetry-sdk<2.0.0,>=1.22.0; extra == "observability"
Requires-Dist: opentelemetry-exporter-otlp<2.0.0,>=1.22.0; extra == "observability"
Provides-Extra: enterprise
Requires-Dist: structlog<24.0.0,>=23.2.0; extra == "enterprise"
Requires-Dist: opentelemetry-api<2.0.0,>=1.22.0; extra == "enterprise"
Requires-Dist: opentelemetry-sdk<2.0.0,>=1.22.0; extra == "enterprise"
Requires-Dist: opentelemetry-exporter-otlp<2.0.0,>=1.22.0; extra == "enterprise"
Requires-Dist: prometheus-client<1.0.0,>=0.17.0; extra == "enterprise"
Requires-Dist: hypothesis<7.0.0,>=6.92.0; extra == "enterprise"
Requires-Dist: mutmut<4.0.0,>=3.0.0; extra == "enterprise"
Requires-Dist: locust<3.0.0,>=2.20.0; extra == "enterprise"
Requires-Dist: pytest-benchmark<5.0.0,>=4.0.0; extra == "enterprise"
Dynamic: license-file

# Sigil Pipeline v2.6.1

A static analysis pipeline for generating high-quality Rust code datasets for model fine-tuning. The pipeline analyzes Rust crates using static analysis tools and generates training datasets in JSONL format.

> 📖 **Ecosystem Architecture**: For a comprehensive overview of how this project integrates with [SigilDERG-Finetuner](https://github.com/Superuser666-Sigil/SigilDERG-Finetuner) and [human-eval-Rust](https://github.com/Superuser666-Sigil/human-eval-Rust), see [ARCHITECTURE.md](ARCHITECTURE.md).

**Version 2.6.1** includes:

- **Checkpoint/Resume System**: Automatic checkpointing allows resuming long-running pipeline executions without losing progress. Preserves temp directories and skips already-processed crates.
- **Improved Error Injection**: Enhanced error-fixing task generation with fallback to simulated errors when real compilation times out, ensuring more robust task diversity.
- **Enhanced Logging**: Geiger and License checks now always write logs, even when no issues are found, improving observability and debugging.
- **Tool Execution Tracking**: Rejection summaries now include flags indicating which analysis tools were executed or skipped.
- **Enterprise Observability**: Structured logging via structlog, Prometheus-compatible metrics, and optional OpenTelemetry tracing.
- License pre-checking from crates.io API
- Cargo-deny security auditing integration
- Streaming architecture for memory-efficient processing
- Granular filter metrics and observability
- Enhanced quality filtering (unsafe code, outdated dependencies)
- Platform compatibility detection
- Shared cargo target directory for faster builds

## Overview

Sigil Pipeline performs comprehensive static analysis on Rust crates to identify high-quality, idiomatic code suitable for training code generation models. It combines:

- **Curated Rust crates** analyzed through static analysis tools
- **The Stack Rust Clean dataset** files (from HuggingFace)
- **Format validation** to ensure consistent dataset structure

The pipeline generates JSONL datasets with prompt-generation pairs that can be used directly for fine-tuning language models.

## Features

### Static Code Analysis

- **Clippy**: Detects idiomatic code patterns and lint violations
- **Cargo Geiger**: Analyzes unsafe code usage and safety metrics
- **Cargo Outdated**: Assesses dependency maintenance status
- **Cargo License**: Checks license compliance (with centralized verification logic)
- **Cargo Deny**: Performs security and license auditing (optional, configurable)
- **License Pre-Check**: Validates licenses from crates.io API before downloading

### Quality Filtering

- **Rust Edition**: Filters to 2021+ edition crates (modern Rust)
- **Clippy Warnings**: Category-based `max_bad_code_warnings` threshold (default: 0, ignores style/doc lints but blocks unsafe or correctness issues). Legacy `max_clippy_warnings` is still available for total-count filtering.
- **Documentation**: Requires documentation comments on public items
- **Test/Bench Exclusion**: Automatically filters out test and benchmark files
- **Size/Sanity Filters**: Applies Stack dataset filtering criteria (line length, alphabetic ratio)
- **License Filtering**: Only includes permissively licensed code (MIT, Apache-2.0, BSD, etc.) with SPDX expression support
- **Unsafe Code Filtering**: Optional threshold for maximum unsafe code items (from Geiger)
- **Outdated Dependencies**: Optional threshold for maximum outdated dependency ratio
- **Platform Compatibility**: Automatically skips OS-specific crates incompatible with current platform
- **Security Auditing**: Optional cargo-deny integration for security advisories and license violations
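
The size/sanity filters above can be illustrated as simple per-file checks. The sketch below is illustrative only (not the pipeline's actual implementation, which may apply these checks differently); it uses the `max_line_length` and `min_alphabetic_ratio` thresholds from `PipelineConfig`, and assumes an over-long line anywhere in the file is disqualifying:

```python
def passes_sanity_filters(
    source: str,
    max_line_length: int = 100,
    min_alphabetic_ratio: float = 0.3,
) -> bool:
    """Illustrative sketch of Stack-style size/sanity checks on a source file."""
    if not source:
        return False
    # Reject files containing any over-long line (common in generated code).
    if any(len(line) > max_line_length for line in source.splitlines()):
        return False
    # Reject files with too few alphabetic characters (common in minified code).
    alpha = sum(c.isalpha() for c in source)
    return alpha / len(source) >= min_alphabetic_ratio
```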

### Dataset Generation

- **Prompt Generation**: Creates instruction prompts derived from code patterns and doc comments
- **Semantic Chunking**: Splits large files into snippet-sized chunks (functions, impl blocks, modules) for Phase-2
- **Task Type Diversity**: Generates multiple task types for Phase-2:
  - Code generation (70% default)
  - Transformations (15% default): sync→async, match→?, iterator conversions
  - Error fixing (10% default): fix compiler errors in broken code, falling back to simulated errors when real compilation times out
  - Explanations (5% default): explain code functionality
- **Format Validation**: Ensures consistent dataset structure
- **Dataset Merging**: Combines multiple datasets with shuffle and weighting options
- **Extra Shards**: Append pre-generated instruct-style shards (e.g., experimental upscales) via CLI without moving files
- **Train/Val Split by Source**: Splits datasets keeping whole crates/files together (tests true generalization)
- **Streaming Architecture**: Generator-based pipeline for memory-efficient processing of large datasets
- **Granular Metrics**: Detailed filter reason breakdown for observability
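
The split-by-source behavior amounts to assigning whole sources to one side of the split before partitioning records. A minimal sketch, assuming each record carries a source identifier (the `_source` field name here is hypothetical, chosen for illustration):

```python
import random

def split_by_source(records, val_ratio=0.1, seed=42):
    """Sketch: assign whole sources to train or val so no crate straddles the split."""
    by_source = {}
    for rec in records:
        by_source.setdefault(rec["_source"], []).append(rec)
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    n_val = max(1, int(len(sources) * val_ratio))
    val_sources = set(sources[:n_val])
    train = [r for s in sources if s not in val_sources for r in by_source[s]]
    val = [r for s in val_sources for r in by_source[s]]
    return train, val
```

Keeping whole crates together means the validation set measures generalization to unseen code, not memorization of sibling files from the same crate.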

### Checkpoint/Resume System

- **Automatic Checkpointing**: Saves progress periodically (configurable interval, default: every 10 crates)
- **Resume from Interruptions**: Automatically detects and loads checkpoints on startup
- **Temp Directory Preservation**: Reuses existing temp directories when resuming, preserving downloaded crates (saves GBs of re-downloads)
- **Smart Crate Skipping**: Automatically skips already-processed crates to avoid duplicates
- **Config Compatibility Checking**: Verifies config hash to prevent incompatible resumes
- **Checkpoint Location**: Defaults to `output_dir/checkpoint.json`, customizable via `--checkpoint-path`
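
The config-compatibility check can be pictured as hashing the settings that affect output and comparing against the hash stored in the checkpoint. The field names and layout below are assumptions for illustration, not the pipeline's actual checkpoint schema:

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable hash over config fields (keys sorted so dict order doesn't matter)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def is_resume_compatible(checkpoint: dict, config: dict) -> bool:
    """A checkpoint may only be resumed under an equivalent configuration."""
    return checkpoint.get("config_hash") == config_hash(config)
```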

## Requirements

- **Python 3.12+**
- **Rust toolchain** (1.56+ for 2021 edition, 1.85+ for 2024 edition)
- **Cargo subcommands**:
  - `cargo clippy` (included with rustup)
  - `cargo geiger`
  - `cargo outdated`
  - `cargo license`
  - `cargo deny`

See [docs/SETUP.md](docs/SETUP.md) for detailed setup instructions.

## Installation

```bash
# Clone the repository
git clone https://github.com/Superuser666-Sigil/SigilDERG-Data_Production.git
cd SigilDERG-Data_Production

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e ".[datasets]"  # tree-sitter for AST parsing is now included in core deps

# Install Rust analysis tools
cargo install cargo-geiger cargo-outdated cargo-license cargo-deny
rustup component add clippy
```

## Quick Start

### Command Line

```bash
# Analyze specific crates
python -m sigil_pipeline.main --crates serde tokio actix-web

# Use crate list file
python -m sigil_pipeline.main --crate-list data/crate_list.txt

# Phase-2 Instruct Mode (generates diverse task types with semantic chunking)
python -m sigil_pipeline.main \
  --prompt-mode instruct \
  --max-sft-lines 200 \
  --max-sft-chars 8000 \
  --output output/phase2_dataset.jsonl

# Custom task type distribution
python -m sigil_pipeline.main \
  --task-mix '{"code_generation": 0.7, "transformations": 0.15, "error_fixing": 0.1, "explanations": 0.05}'

# Append experimental / pre-generated shards after generation
python -m sigil_pipeline.main \
  --crate-list data/crate_list.txt \
  --extra-phase2-shard experimental/experimental_shard.jsonl \
  --output datasets/phase2_full.jsonl

# Allow longer real error injection (e.g., 3 minutes for cargo check)
python -m sigil_pipeline.main \
  --error-injection-timeout 180 \
  --output datasets/phase2_full.jsonl

# Checkpoint/Resume: Automatically saves progress and can resume from interruptions
# Checkpoint is saved to output_dir/checkpoint.json by default
python -m sigil_pipeline.main \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl \
  --checkpoint-interval 10  # Save checkpoint every 10 crates (default)

# Resume from checkpoint (automatically detected if checkpoint.json exists)
python -m sigil_pipeline.main \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl
# Pipeline will automatically skip already-processed crates and reuse temp directory

# Custom checkpoint path
python -m sigil_pipeline.main \
  --checkpoint-path logs/my_checkpoint.json \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl

# Disable checkpointing
python -m sigil_pipeline.main \
  --no-checkpointing \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl
```
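
The `--task-mix` distribution above amounts to weighted sampling over task types. A minimal sketch of that selection (not the pipeline's actual logic), using the default percentages listed earlier:

```python
import random

DEFAULT_TASK_MIX = {
    "code_generation": 0.70,
    "transformations": 0.15,
    "error_fixing": 0.10,
    "explanations": 0.05,
}

def pick_task_type(mix=DEFAULT_TASK_MIX, rng=random):
    """Draw one task type with probability proportional to its weight."""
    types, weights = zip(*mix.items())
    return rng.choices(types, weights=weights, k=1)[0]
```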

### Python API

```python
import asyncio
from sigil_pipeline.config import PipelineConfig
from sigil_pipeline.main import run_pipeline

async def main():
    config = PipelineConfig(
        crates=["serde", "tokio"],
        output_path="output/dataset.jsonl",
    )
    
    await run_pipeline(config)

if __name__ == "__main__":
    asyncio.run(main())
```

## Configuration

The pipeline uses a `PipelineConfig` dataclass for all settings. Key options:

```python
from sigil_pipeline.config import PipelineConfig

config = PipelineConfig(
    # Crates to analyze
    crates=["serde", "tokio"],
    crate_list_path="data/crate_list.txt",  # Or specify individual crates
    
    # Quality thresholds
    allow_edition_2018=False,  # Only 2021+ edition
    max_bad_code_warnings=0,  # Strict filter for critical lints (style lints ignored)
    require_docs=True,  # Require documentation
    
    # Advanced filtering
    max_unsafe_items=None,  # Optional: max unsafe code items (None = no filter)
    max_outdated_ratio=None,  # Optional: max outdated dependency ratio
    enable_deny_scan=False,  # Optional: cargo-deny security auditing
    
    # File filtering
    max_line_length=100,
    min_alphabetic_ratio=0.3,  # Filters minified code
    
    # Error injection controls
    enable_error_injection=True,
    error_injection_method="both",
    error_injection_timeout=120,
    
    # Performance
    reuse_cargo_target=True,  # Share cargo target directory (output/cargo_target_cache by default)
    
    # Checkpoint/Resume
    enable_checkpointing=True,  # Enable automatic checkpointing (default: True)
    checkpoint_path=None,  # Custom checkpoint path (default: output_dir/checkpoint.json)
    checkpoint_interval=10,  # Save checkpoint every N crates (default: 10)
    
    # Output
    output_path="output/dataset.jsonl",
    max_threads=4,  # Parallel processing
)
```

Configuration can be loaded from JSON or YAML files:

```bash
python -m sigil_pipeline.main --config config.yaml
```
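
An equivalent config file might look like the following; the keys mirror the `PipelineConfig` fields shown above, but the YAML layout itself is illustrative:

```yaml
# config.yaml — illustrative; keys mirror the PipelineConfig fields above
crate_list_path: data/crate_list.txt
allow_edition_2018: false
max_bad_code_warnings: 0
require_docs: true
max_line_length: 100
min_alphabetic_ratio: 0.3
enable_checkpointing: true
checkpoint_interval: 10
output_path: output/dataset.jsonl
max_threads: 4
```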

## Output Format

The pipeline generates JSONL files (one JSON object per line) with the following structure:

```jsonl
{"prompt": "Write a Rust program that demonstrates error handling", "gen": "use anyhow::Result;\n\nfn main() -> Result<()> {\n    // ...\n}"}
{"prompt": "Write a Rust code example that uses iterators", "gen": "fn process_data(items: &[i32]) -> Vec<i32> {\n    items.iter().map(|x| x * 2).collect()\n}"}
```

Each line contains:

- `prompt`: Instruction prompt describing what the code does
- `gen`: Generated code (plain text, UTF-8 encoded)

See [docs/DATASET_SCHEMA.md](docs/DATASET_SCHEMA.md) for detailed format specification.

## Project Structure

```text
sigil_pipeline/          # Main pipeline package
├── main.py             # Pipeline orchestration and CLI entry point
├── config.py           # Configuration management
├── crawler.py          # Crate downloading and Stack dataset integration
├── analyzer.py         # Static analysis tools execution
├── filter.py           # Quality filtering heuristics
├── chunker.py          # Semantic code chunking (Phase-2)
├── task_generator.py   # Task type generation (Phase-2)
├── dataset_builder.py  # Prompt generation and dataset assembly
├── dataset_splitter.py # Train/val splitting by source
├── exporter.py         # JSONL export and dataset merging
├── format_validator.py # Format validation
├── observability.py    # Structured logging and metrics
├── telemetry.py        # OpenTelemetry tracing (optional)
└── utils.py            # Utilities (cargo commands, file I/O, etc.)

tools/                   # Dataset utilities
├── analyze_failures.py         # Analyze pipeline rejection reasons
├── convert_jsonl_to_parquet.py # Convert JSONL to Parquet
├── convert_parquet_to_jsonl.py # Convert Parquet to JSONL
├── split_jsonl.py              # Split large JSONL into chunks
├── split_train_val.py          # Create train/val splits
├── rebalance_task_mix.py       # Adjust task type distribution
└── verify_format_test.py       # Validate format compliance

scripts/                 # Setup and release scripts
├── create_release.py           # Release automation
└── setup/
    └── setup_rust_analysis_tools.py  # Install Rust tools

tests/                   # Test suite
benches/                 # Performance benchmarks
docs/                    # Documentation
```

## Tools

The repository includes utility scripts for dataset manipulation and analysis.

### Failure Analysis

`tools/analyze_failures.py`

- Parses the latest (or specified) analysis logs
- Categorizes Clippy warnings (ignores style warnings, flags unsafe/bad code)
- Detects license rejections from the main pipeline log
- Automatically removes license-rejected crates from `data/crate_list.txt` (unless `--no-cleanup`)
- Can write a full report to disk

```bash
# Auto-detect most recent analysis directory
python tools/analyze_failures.py

# Specify locations explicitly
python tools/analyze_failures.py \
  --log-dir logs/analysis_20251124_180335 \
  --log-file logs/phase2_full_run.log \
  --crate-list data/crate_list.txt \
  --output logs/failure_analysis.txt

# Skip automatic crate_list cleanup
python tools/analyze_failures.py --no-cleanup
```

### Dataset Utilities

`tools/split_train_val.py`

- Splits a dataset into train/val files while keeping whole crates/files together.

```bash
python tools/split_train_val.py \
  --input datasets/phase2_full.jsonl \
  --train output/train.jsonl \
  --val output/val.jsonl \
  --val-ratio 0.1
```

`tools/split_jsonl.py`

- Splits large JSONL files into ~11MB chunks without breaking JSON objects.

```bash
python tools/split_jsonl.py \
  --input datasets/phase2_full.jsonl \
  --output-dir datasets/chunks \
  --prefix phase2_chunk
```

`tools/convert_jsonl_to_parquet.py`

- Converts JSONL datasets to Parquet, supporting both training-ready (metadata stripped) and provenance variants.

```bash
python tools/convert_jsonl_to_parquet.py \
  --input datasets/phase2_full.jsonl \
  --output datasets/phase2_full.parquet \
  --variant training
```

`tools/convert_parquet_to_jsonl.py`

- Converts Parquet datasets back to JSONL (useful for inspection or smaller workflows).

```bash
python tools/convert_parquet_to_jsonl.py \
  --input datasets/phase2_full.parquet \
  --output datasets/phase2_roundtrip.jsonl
```

`tools/verify_format_test.py`

- Quick check to ensure a dataset matches the Phase 1 format specification.

```bash
python tools/verify_format_test.py --input datasets/phase2_full.jsonl
```

`tools/rebalance_task_mix.py`

- Downsamples (or lightly reweights) a JSONL dataset to match a desired `_task_type` distribution and writes a summary report.

```bash
python tools/rebalance_task_mix.py \
  --input datasets/phase2_full.jsonl \
  --output datasets/phase2_balanced.jsonl \
  --target-mix code_generation=0.5,error_fixing=0.25,transformations=0.15,explanations=0.10
```

## Testing

```bash
# Run all tests (672 tests)
pytest tests/

# Run with coverage report
pytest tests/ --cov=sigil_pipeline --cov-report=term-missing

# Run specific test modules
pytest tests/test_api_tracker.py -v          # API evolution tracking
pytest tests/test_ast_patterns.py -v         # AST-based extraction
pytest tests/test_task_generator.py -v       # Task type generation
pytest tests/test_telemetry.py -v            # OpenTelemetry tracing
pytest tests/test_converters.py -v           # Format conversion
pytest tests/test_dataset_splitter.py -v     # Train/val splitting

# Run tests by keyword
pytest tests/ -k "api" -v                    # API-related tests
pytest tests/ -k "ast" -v                    # AST parsing tests

# Run property-based tests
pytest tests/test_properties.py -v --hypothesis-show-statistics

# Run local CI checks
python test_ci_local.py
```

### Test Coverage Summary

| Category | Modules | Coverage |
|----------|---------|----------|
| Core Pipeline | analyzer, filter, config | 81-99% |
| AST Processing | ast_patterns, task_generator | 78-80% |
| API Tracking | api_tracker, usage_analyzer | 79-89% |
| Data Processing | dataset_splitter, converters | 63-98% |
| Infrastructure | telemetry, utils, environment | 77-91% |
| CLI | ecosystem, main | 42-93% |

**Overall Coverage: 75%** (4845 statements, 672 tests passing)

## SigilDERG Ecosystem Integration

This package is part of the **SigilDERG ecosystem** for Rust code model training. It integrates seamlessly with:

- **[sigilderg-finetuner](https://github.com/Superuser666-Sigil/SigilDERG-Finetuner)**: QLoRA fine-tuning for Rust code models
- **[human-eval-rust](https://github.com/Superuser666-Sigil/human-eval-Rust)**: Evaluation harness for Rust code generation

### Install Full Ecosystem

```bash
pip install sigil-pipeline[ecosystem]
```

This installs all three packages with proper version constraints.

### Complete Workflow

1. **Generate dataset** (this package):

   ```bash
   python -m sigil_pipeline.main --output datasets/phase2_full.jsonl
   ```

2. **Fine-tune model** (sigilderg-finetuner):

   ```bash
   sigilderg-train configs/llama8b-phase2.yml  # Uses local:datasets/phase2_full.jsonl
   ```

3. **Evaluate model** (human-eval-rust):

   ```bash
   sigilderg-eval samples.jsonl --use-human-eval
   ```

### Unified CLI

Use the unified orchestrator for the complete workflow:

```bash
sigil-ecosystem \
    --crate-list data/crate_list.txt \
    --dataset-path datasets/phase2_full.jsonl \
    --config-path configs/llama8b-phase2.yml
```

See **[Ecosystem Integration Guide](docs/ECOSYSTEM_INTEGRATION.md)** for detailed documentation.

## Documentation

- **[Architecture](ARCHITECTURE.md)**: Complete ecosystem architecture overview
- **[Setup Guide](docs/SETUP.md)**: Rust toolchain and cargo subcommand installation
- **[Dataset Schema](docs/DATASET_SCHEMA.md)**: Detailed dataset format specification
- **[Ecosystem Integration](docs/ECOSYSTEM_INTEGRATION.md)**: Complete workflow guide for all three packages
- **[Clippy Category Filtering](docs/CLIPPY_CATEGORY_FILTERING.md)**: Quality filter documentation
- **[OS-Agnostic Cargo Commands](docs/OS_AGNOSTIC_CARGO_COMMANDS.md)**: Cross-platform cargo usage
- **[Testing CI Locally](docs/TESTING_CI_LOCALLY.md)**: Local CI workflow testing
- **[Architecture Decision Records](docs/adr/)**: Design decisions and rationale

## Docker

The project includes Docker support for containerized execution:

```bash
# Build image
docker build -t sigil-pipeline:2.6.1 .

# Run pipeline
docker-compose up

# Interactive shell
docker run -it sigil-pipeline:2.6.1 bash

# Run with custom arguments
docker run -v $(pwd)/output:/app/output sigil-pipeline:2.6.1 \
    --crate-list /app/data/crate_list.txt \
    --output /app/output/dataset.jsonl
```

See `docker-compose.yml` and `Dockerfile` for configuration details.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **Rust community** for excellent analysis tools (Clippy, Geiger, etc.)
- **HuggingFace** for the Stack dataset and datasets library
- **The Stack dataset** contributors for providing high-quality Rust code
- **Ammar Nasr** for producing and distributing the Stack Rust Clean Dataset (<https://huggingface.co/datasets/ammarnasr/the-stack-rust-clean>)

---

**Sigil Pipeline** - Generating high-quality Rust code datasets for model fine-tuning.
