Metadata-Version: 2.4
Name: llm-detector
Version: 0.1.0
Summary: Transparent, probabilistic classification of text as human-generated or LLM-generated
Author-email: Michael Bommarito <michael.bommarito@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/mjbommar/llm-detector
Project-URL: Bug Tracker, https://github.com/mjbommar/llm-detector/issues
Project-URL: Repository, https://github.com/mjbommar/llm-detector
Keywords: llm-detection,ai-detection,text-classification,machine-learning,nlp,natural-language-processing,gpt-detection,chatgpt-detection,ai-generated-text,human-vs-ai,text-analysis,statistical-nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: nupunkt-rs>=0.1.1
Requires-Dist: scikit-learn>=1.6.0
Requires-Dist: tokenizers>=0.20.0
Provides-Extra: training
Requires-Dist: datasets>=2.19; extra == "training"
Requires-Dist: zstandard>=0.22; extra == "training"
Requires-Dist: pyarrow>=12; extra == "training"
Requires-Dist: tqdm>=4.66; extra == "training"
Provides-Extra: dev
Requires-Dist: pytest>=8.4.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: pytest-benchmark>=5.1.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"

# llm-detector

**Research WIP**: Transparent, probabilistic classification of text as human-generated or LLM-generated.

## Installation

```bash
pip install llm-detector
```

## Quick Start

```python
from llm_detector import classify_text

# Simple classification
result = classify_text("Your text here")
print(f"LLM probability: {result['p_llm']:.2%}")
print(f"Classification: {'LLM' if result['is_llm'] else 'Human'}")
```

## Advanced Usage

### Using the Runtime API

```python
from llm_detector import DetectorRuntime
from llm_detector.assets import default_artifacts

# Initialize detector with default models
with default_artifacts() as (model_path, baseline_path):
    detector = DetectorRuntime(
        model_path=model_path,
        baseline_path=baseline_path
    )

    # Single text classification
    result = detector.predict("This is a sample text.")
    print(f"LLM: {result.p_llm:.2%}, Human: {result.p_human:.2%}")

    # Access detailed metrics
    print(f"Confidence: {result.confidence:.4f}")
    print("Document metrics:", result.details['document_metrics'])
```

### Detailed Results with Diagnostics

```python
from llm_detector import classify_text

result = classify_text(
    "Your text here",
    include_diagnostics=True
)

# Access classification
print(f"Classification: {'LLM' if result['is_llm'] else 'Human'}")
print(f"LLM probability: {result['p_llm']:.4f}")
print(f"Confidence: {result['confidence']:.4f}")

# Diagnostic metrics for analysis
if 'diagnostics' in result:
    diag = result['diagnostics']
    print(f"Simple mean: {diag.get('simple_mean', 0):.4f}")
    print(f"Max score: {diag.get('max_score', 0):.4f}")
```

## CLI Usage

```bash
# Classify text from command line
llm-detector --text "Your text here"

# Classify from file
llm-detector --file input.txt

# Get detailed output with diagnostics
llm-detector --text "Your text" --show-diagnostics --json
```

## Research Notes

This is an active research project exploring transparent statistical methods for LLM detection. The approach combines:

- **Statistical features**: Lexical diversity, punctuation patterns, repetition metrics
- **Tokenizer divergence**: Cross-tokenizer efficiency and consistency metrics
- **Ensemble aggregation**: Logit-weighted mean with diagnostic fallbacks
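
To make the first bullet concrete, here is a minimal sketch of the kind of surface statistics it names. It is not the feature set `llm-detector` actually computes: the function `simple_text_features` and its three output keys are hypothetical names chosen for illustration, and the exact definitions (type-token ratio, punctuation share, repeated-bigram share) are assumptions.

```python
import re
from collections import Counter


def simple_text_features(text: str) -> dict[str, float]:
    """Compute three illustrative surface statistics (hypothetical, not the library's features)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {
            "type_token_ratio": 0.0,
            "punctuation_rate": 0.0,
            "repeated_bigram_rate": 0.0,
        }

    # Lexical diversity: distinct words divided by total words.
    type_token_ratio = len(set(words)) / len(words)

    # Punctuation pattern: share of non-space characters that are not alphanumeric.
    non_space = [ch for ch in text if not ch.isspace()]
    punctuation = [ch for ch in non_space if not ch.isalnum()]
    punctuation_rate = len(punctuation) / len(non_space)

    # Repetition: share of word bigrams whose pair occurs more than once.
    bigrams = list(zip(words, words[1:]))
    counts = Counter(bigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    repeated_bigram_rate = repeated / len(bigrams) if bigrams else 0.0

    return {
        "type_token_ratio": type_token_ratio,
        "punctuation_rate": punctuation_rate,
        "repeated_bigram_rate": repeated_bigram_rate,
    }


if __name__ == "__main__":
    print(simple_text_features("the cat sat. the cat ran."))
```

Each value is a plain ratio between 0 and 1, so a reader can inspect it directly, which is the sense in which features like these are "transparent".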

Current limitations:
- Performance varies by text length (best with 3+ sentences)
- Optimized for general English text
- Models need continuous updates as LLM capabilities evolve
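
Given the text-length limitation, callers may want to check input length before trusting a score. The sketch below is one way to do that, under stated assumptions: `count_sentences` and `is_long_enough` are hypothetical helpers that are not part of `llm-detector`, and the regex split is a naive stand-in for a real sentence segmenter (it will miscount abbreviations such as "Dr.").

```python
import re

# Threshold taken from the "3+ sentences" guidance above.
MIN_SENTENCES = 3


def count_sentences(text: str) -> int:
    """Naively count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for part in parts if part)


def is_long_enough(text: str, min_sentences: int = MIN_SENTENCES) -> bool:
    """Return True when the text has at least `min_sentences` sentences."""
    return count_sentences(text) >= min_sentences


if __name__ == "__main__":
    sample = "This is short. It has two sentences."
    if not is_long_enough(sample):
        print("Input is shorter than three sentences; treat any score with caution.")
```

A caller could run this check before `classify_text` and either skip short inputs or attach a low-reliability note to the result.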

## Development

```bash
# Install with development dependencies
pip install -e ".[dev,training]"

# Run tests
pytest

# Train custom models (requires training extras)
python -m llm_detector.training.cli --help
```

## License

MIT
