Metadata-Version: 2.4
Name: cordon
Version: 0.2.0
Summary: Semantic anomaly detection for system log files
Author-email: Caleb Evans <caevans@redhat.com>
License: Apache-2.0
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: torch>=2.0.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: benchmark
Requires-Dist: matplotlib>=3.7.0; extra == 'benchmark'
Requires-Dist: pyyaml>=6.0; extra == 'benchmark'
Requires-Dist: requests>=2.31.0; extra == 'benchmark'
Requires-Dist: scipy>=1.10.0; extra == 'benchmark'
Requires-Dist: seaborn>=0.12.0; extra == 'benchmark'
Requires-Dist: umap-learn>=0.5.0; extra == 'benchmark'
Provides-Extra: dev
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: llama-cpp
Requires-Dist: huggingface-hub>=0.20.0; extra == 'llama-cpp'
Requires-Dist: llama-cpp-python>=0.3.0; extra == 'llama-cpp'
Description-Content-Type: text/markdown

# Cordon

**Semantic anomaly detection for system log files**

Cordon uses transformer-based embeddings and density-based scoring to identify semantically unusual patterns in large log files. It is designed to reduce massive logs to their most anomalous sections for analysis.

**Key principle:** Repetitive patterns (even errors) are considered "normal background." Cordon surfaces unusual, rare, or clustered events that stand out semantically from the bulk of the logs.

## Features

- **Semantic Analysis**: Uses transformer models to understand log content meaning, not just keyword matching
- **Density-Based Scoring**: Identifies anomalies using k-NN distance in embedding space
- **Noise Reduction**: Filters out repetitive logs, keeping only unusual patterns
- **Multiple Backends**: sentence-transformers (default) or llama.cpp for containers

## Requirements

### GPU Requirements (Optional but Recommended)

For GPU acceleration, you need:
- **NVIDIA GPU**: Pascal architecture or newer, with compute capability 6.0 or higher
- **Compatible GPUs**: GTX 10-series (1050+), RTX 20/30/40 series, Tesla P100, V100, A100, H100

**Not compatible**: GTX 900-series or older (Maxwell/Kepler architectures)

CPU mode is always available as a fallback.

## Installation

### From PyPI (Recommended)

```bash
# With uv (recommended)
uv pip install cordon

# With pip
pip install cordon
```

### From Source

```bash
# Clone the repository
git clone https://github.com/calebevans/cordon.git
cd cordon

# With uv (recommended)
uv pip install -e .

# With pip
pip install -e .
```

For development:

```bash
uv pip install -e ".[dev]"
pre-commit install
```

For llama.cpp backend (GPU acceleration in containers):

```bash
uv pip install -e ".[llama-cpp]"
```

### Container Installation

```bash
make container-build
```

See [Container Guide](./docs/CONTAINER.md) for GPU support and advanced usage.

## Quick Start

### Command Line

```bash
# Basic usage
cordon system.log

# Multiple files
cordon app.log error.log

# With options
cordon --window-size 10 --k-neighbors 10 --anomaly-percentile 0.05 app.log

# With GPU acceleration (scoring batch size auto-detected)
cordon --device cuda --batch-size 64 large.log

# Override auto-detection if needed
cordon --device cuda --batch-size 64 --scoring-batch-size 50000 large.log

# Save results to file
cordon --output anomalies.xml system.log

# Show detailed statistics and save results
cordon --detailed --output results.xml app.log

# llama.cpp backend (for containers)
cordon --backend llama-cpp system.log
```

### Python Library

```python
from pathlib import Path
from cordon import SemanticLogAnalyzer, AnalysisConfig

# Basic usage
analyzer = SemanticLogAnalyzer()
output = analyzer.analyze_file(Path("system.log"))
print(output)

# Advanced configuration with GPU acceleration
config = AnalysisConfig(
    window_size=10,
    k_neighbors=10,
    anomaly_percentile=0.05,
    device="cuda",           # GPU for embedding and scoring
    batch_size=64,           # Embedding batch size
    scoring_batch_size=None  # Auto-detect optimal batch size (default)
)
analyzer = SemanticLogAnalyzer(config)
result = analyzer.analyze_file_detailed(Path("app.log"))
```

## Backend Options

### sentence-transformers (Default)

Best for native installations with GPU access.

```bash
cordon system.log  # Auto-detects GPU (MPS/CUDA)
cordon --device cuda system.log
cordon --device cpu system.log
```

### llama.cpp Backend

Best for container deployments with GPU acceleration via Vulkan.

```bash
# Auto-downloads model on first run
cordon --backend llama-cpp system.log

# With GPU acceleration
cordon --backend llama-cpp --n-gpu-layers 10 system.log

# Custom model
cordon --backend llama-cpp --model-path ./model.gguf system.log
```

See [llama.cpp Guide](./docs/llama-cpp.md) for details on models, performance, and GPU setup.

## Container Usage

### Build

```bash
# Build locally
make container-build
```

### Run

```bash
# Pull published image from GitHub Container Registry
podman pull ghcr.io/calebevans/cordon:latest  # or :dev for development builds

# Run with published image
podman run --rm -v /path/to/logs:/logs:Z ghcr.io/calebevans/cordon:latest /logs/system.log

# Run with locally built image
make container-run DIR=/path/to/logs ARGS="/logs/system.log"

# With GPU (requires Podman with libkrun)
podman run --device /dev/dri -v /path/to/logs:/logs:Z ghcr.io/calebevans/cordon:latest \
  --backend llama-cpp --n-gpu-layers 10 /logs/system.log
```

See [Container Guide](./docs/CONTAINER.md) for full details.

## Primary Use Case: LLM Context Reduction

Log files are often too large to fit in an LLM's context window. Cordon addresses this by reducing them to their semantically significant sections.

**Real-world reduction rates from benchmarks:**
- 1M-line HDFS logs → 20K lines (98% reduction with p=0.02 threshold)
- 5M-line HDFS logs → 100K lines (98% reduction with p=0.02 threshold)

Example workflow:

```python
# Extract anomalies
analyzer = SemanticLogAnalyzer()
anomalies = analyzer.analyze_file(Path("production.log"))

# Send curated context to LLM (now fits in context window)
```

The output is intentionally lossy—it discards repetitive patterns to focus on semantically unusual events.

## How It Works

### Pipeline

1. **Ingestion**: Read log file line-by-line
2. **Segmentation**: Split lines into non-overlapping windows of N lines
3. **Vectorization**: Embed windows using transformer models
4. **Scoring**: Calculate k-NN density scores
5. **Thresholding**: Select top X% based on scores
6. **Merging**: Combine adjacent significant windows into contiguous blocks
7. **Formatting**: Generate XML-tagged output
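Steps 5 and 6 can be sketched in a few lines. This is an illustration of the thresholding and merging logic, not Cordon's internal API; the helper names are hypothetical:

```python
import numpy as np

def select_anomalous(scores: np.ndarray, percentile: float = 0.1) -> np.ndarray:
    """Step 5: keep indices of windows whose score is in the top N%."""
    cutoff = np.quantile(scores, 1.0 - percentile)
    return np.where(scores >= cutoff)[0]

def merge_adjacent(indices: list[int]) -> list[tuple[int, int]]:
    """Step 6: merge runs of consecutive window indices into (start, end) blocks."""
    blocks: list[tuple[int, int]] = []
    for i in sorted(indices):
        if blocks and i == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], i)  # extend the current block
        else:
            blocks.append((i, i))            # start a new block
    return blocks
```

The merged blocks map back to line ranges in the original file, which is what the XML output reports.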

### Scoring

- **Higher score** = Semantically unique = Anomalous
- **Lower score** = Repetitive = Normal background noise

The score for each window is the average cosine distance to its k nearest neighbors in the embedding space.
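In NumPy terms, this scoring rule looks roughly like the following. A minimal sketch for intuition only, assuming row-vector embeddings; Cordon's actual implementation is batched and GPU-accelerated:

```python
import numpy as np

def knn_density_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each window by mean cosine distance to its k nearest neighbors."""
    # Normalize rows so that cosine distance = 1 - dot product
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T      # pairwise cosine distances
    np.fill_diagonal(dist, np.inf)      # exclude each window's self-distance
    knn = np.sort(dist, axis=1)[:, :k]  # k smallest distances per row
    return knn.mean(axis=1)             # higher score = more anomalous
```

A window embedded far from every cluster gets a high score; a window inside a dense cluster of near-duplicates gets a low one.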

**GPU Acceleration**: Both embedding and scoring phases automatically leverage GPU acceleration (CUDA/MPS) when available, providing significant speedups for large log files.

**Important:** Repetitive patterns are filtered even if critical. The same FATAL error repeated 100 times scores as "normal" because it's semantically similar to itself.

See [Cordon's architecture](./docs/architecture.md) for full details.

## Configuration

### Analysis Parameters

| Parameter | Default | CLI Flag | Description |
|-----------|---------|----------|-------------|
| `window_size` | 4 | `--window-size` | Lines per window (non-overlapping) |
| `k_neighbors` | 5 | `--k-neighbors` | Number of neighbors for density calculation |
| `anomaly_percentile` | 0.1 | `--anomaly-percentile` | Top N% to keep (0.1 = 10%) |
| `batch_size` | 32 | `--batch-size` | Batch size for embedding generation |
| `scoring_batch_size` | Auto | `--scoring-batch-size` | Batch size for k-NN scoring (auto-detects based on GPU memory) |

### Backend Options

| Parameter | Default | CLI Flag | Description |
|-----------|---------|----------|-------------|
| `backend` | `sentence-transformers` | `--backend` | Embedding backend |
| `model_name` | `all-MiniLM-L6-v2` | `--model-name` | HuggingFace model |
| `device` | Auto | `--device` | Device for embedding and scoring (cuda/mps/cpu) |
| `model_path` | None | `--model-path` | GGUF model path (llama-cpp) |
| `n_gpu_layers` | 0 | `--n-gpu-layers` | GPU layers (llama-cpp) |

### Output Options

| Parameter | Default | CLI Flag | Description |
|-----------|---------|----------|-------------|
| `detailed` | False | `--detailed` | Show detailed statistics (timing, score distribution) |
| `output` | None | `--output`, `-o` | Save anomalous blocks to file (default: stdout) |

Run `cordon --help` for full CLI documentation.

### ⚠️ Important: Token Limits and Window Sizing

**Transformer models have token limits that affect how much of each window is analyzed.** Windows exceeding the limit are automatically truncated to the first N tokens.

**Cordon will warn you if significant truncation is detected** and suggest better settings for your logs.

**Default model (`all-MiniLM-L6-v2`) has a 256-token limit:**
- Compact logs (20-30 tokens/line): Can increase to `window_size=8` for more context
- Standard logs (40-50 tokens/line): Default works well
- Verbose logs (50-70 tokens/line): Default works, or use larger model for bigger windows
- Very verbose logs (80+ tokens/line): Reduce to `window_size=3` or use larger-context model
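The guidance above can be turned into a rough back-of-the-envelope calculation. The sketch below uses a crude heuristic (about 1.3 tokens per whitespace-separated word); `suggest_window_size` is a hypothetical helper, not part of Cordon's API, and a real tokenizer will give more accurate counts:

```python
def suggest_window_size(sample_lines: list[str], token_limit: int = 256) -> int:
    """Estimate tokens per line and pick a window size under the token limit."""
    words = sum(len(line.split()) for line in sample_lines)
    # Assumed heuristic: ~1.3 tokens per word; guard against empty samples
    tokens_per_line = max(1.0, 1.3 * words / max(1, len(sample_lines)))
    return max(1, int(token_limit // tokens_per_line))
```

For example, lines averaging 10 words (~13 tokens) fit roughly 19 lines under a 256-token limit, while 60-word lines fit only 3.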

**For verbose system logs**, use larger-context models:
```bash
# BAAI/bge-base-en-v1.5 supports 512 tokens (~8-10 verbose lines)
cordon --model-name "BAAI/bge-base-en-v1.5" --window-size 8 your.log
```

**See [Configuration Guidelines](./docs/architecture.md#configuration-guidelines) for detailed recommendations.**

## Use Cases

### What Cordon Is Good For

- **LLM Pre-processing**: Reduce large logs to small anomalous sections prior to analysis
- **Initial Triage**: First-pass screening of unfamiliar logs to find "what's unusual here?"
- **Anomaly Detection**: Surface semantically unique events (rare errors, state transitions, unusual clusters)
- **Exploratory Analysis**: Discover unexpected patterns without knowing what to search for

### What Cordon Is NOT Good For

- Complete error analysis (repetitive errors filtered)
- Specific error hunting (use grep/structured logging)
- Compliance logging (this is lossy by design)

## Performance

### GPU Acceleration

Cordon automatically leverages GPU acceleration for both embedding and scoring phases when available:

- **Embedding**: Uses PyTorch/sentence-transformers with CUDA or MPS
- **Scoring**: Uses PyTorch for GPU-accelerated k-NN computation
- **Speedup**: 5-15x faster scoring on GPU compared to CPU for large datasets

For large log files (millions of lines), GPU acceleration can reduce total processing time from hours to minutes.

### Memory Management

Cordon uses PyTorch for all k-NN scoring operations:

| Strategy | When | RAM Usage | Speed |
|----------|------|-----------|-------|
| PyTorch GPU | GPU available (CUDA/MPS) | Moderate | Fastest |
| PyTorch CPU | No GPU / CPU forced | Moderate | Fast |

**What's a "window"?** A window is a non-overlapping chunk of N consecutive log lines (default: 4 lines). A 10,000-line log with window_size=4 creates 2,500 windows.
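The windowing arithmetic can be sketched directly (an illustration of the chunking described above, not Cordon's internal code):

```python
def make_windows(lines: list[str], window_size: int = 4) -> list[list[str]]:
    """Split lines into non-overlapping windows of window_size lines.

    A trailing partial window is kept rather than dropped.
    """
    return [lines[i:i + window_size] for i in range(0, len(lines), window_size)]

# 10,000 lines with window_size=4 -> 2,500 windows
windows = make_windows([f"line {i}" for i in range(10_000)], window_size=4)
```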
