Metadata-Version: 2.4
Name: corrdim
Version: 0.2.1
Summary: Correlation Dimension for LLMs - A library for computing correlation dimension of autoregressive large language models
Author-email: kduxin <duxin@tongji.edu.cn>
License-Expression: MIT
Project-URL: Homepage, https://github.com/kduxin/corrdim
Project-URL: Repository, https://github.com/kduxin/corrdim
Project-URL: Documentation, https://corrdim.readthedocs.io
Project-URL: Bug Tracker, https://github.com/kduxin/corrdim/issues
Keywords: correlation-dimension,language-models,fractal-geometry,nlp,machine-learning,natural-language-processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: ninja>=1.11.0
Requires-Dist: torch>=2.0.0
Requires-Dist: setuptools>=68.0.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: accelerate>=0.21.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: triton>=3.5.0; extra != "no-triton"
Provides-Extra: gpu
Provides-Extra: no-triton
Provides-Extra: dev
Requires-Dist: pytest>=8.4.2; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ipykernel>=6.0.0; extra == "dev"
Requires-Dist: matplotlib>=3.10.6; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=5.3.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=0.16.1; extra == "docs"
Provides-Extra: demo
Requires-Dist: streamlit>=1.51.0; extra == "demo"
Dynamic: license-file

# CorrDim: Correlation Dimension for Language Models

CorrDim is a Python library for computing the **correlation dimension** of autoregressive language models from next-token log-probability vectors, based on the paper **["Correlation Dimension of Auto-Regressive Large Language Models"](https://arxiv.org/abs/2510.21258)** (NeurIPS 2025).

## Documentation

Full documentation is available at [corrdim.readthedocs.io](https://corrdim.readthedocs.io).

Use the docs site for:

- installation details and backend notes
- the full Python API reference
- CLI documentation
- examples and usage patterns

## What CorrDim measures

Given a text and an autoregressive language model, CorrDim measures the text's **global structural complexity** as perceived by that model.

In practice:

- repetitive or degenerate text tends to have a lower correlation dimension
- ordinary fluent text tends to have a higher dimension
- richer long-range structure can produce an even higher dimension

CorrDim is complementary to local metrics such as perplexity: it focuses on **sequence-level geometry**, not just token-level prediction quality.

## How it works

At a high level, CorrDim:

1. converts text into a sequence of next-token log-probability vectors
2. optionally reduces the vocabulary dimension
3. computes a correlation-integral curve over epsilon thresholds
4. estimates the correlation dimension by fitting a line in log-log space

For the mathematical details, see the [paper](https://arxiv.org/abs/2510.21258).
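Steps 3 and 4 above can be illustrated with a small NumPy sketch. It is not CorrDim's implementation: the function names `toy_correlation_integral` and `toy_dimension` are hypothetical, Euclidean distance is an assumption, and random points on a line segment stand in for the log-probability vectors of step 1.

```python
import numpy as np


def toy_correlation_integral(vectors, epsilons):
    """Fraction of distinct point pairs closer than each epsilon (Euclidean)."""
    # Pairwise distances via broadcasting: shape (n, n).
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep each unordered pair once (strict upper triangle).
    pair_dists = dists[np.triu_indices(len(vectors), k=1)]
    return np.array([(pair_dists < eps).mean() for eps in epsilons])


def toy_dimension(epsilons, curve):
    """Slope of log C(eps) against log eps, fitted by least squares."""
    mask = curve > 0  # log is undefined where no pair falls within eps
    slope, _ = np.polyfit(np.log(epsilons[mask]), np.log(curve[mask]), deg=1)
    return slope


# Stand-in for step 1: 300 points on a line segment embedded in 8 dimensions.
rng = np.random.default_rng(0)
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)
points = rng.uniform(0.0, 1.0, size=(300, 1)) * direction

epsilons = np.geomspace(0.01, 0.1, num=10)
curve = toy_correlation_integral(points, epsilons)
estimate = toy_dimension(epsilons, curve)
print(f"estimated dimension: {estimate:.2f}")
```

Because the toy points lie on a one-dimensional segment, the fitted slope should come out close to 1. The sketch fits every epsilon in the range, so it has no counterpart to the `linear_region_bounds` value that CorrDim reports.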

## Installation

CorrDim requires Python 3.10 or newer.

```bash
pip install corrdim
```

To install without the Triton dependency:

```bash
pip install "corrdim[no-triton]"
```

To install the development and documentation extras:

```bash
pip install "corrdim[dev,docs]"
```

To compile the CUDA extension when installing from a source checkout:

```bash
CORRDIM_BUILD_CUDA=1 pip install .
```

## Quick start

```python
import torch
import corrdim

result = corrdim.measure_text(
    "Your text here...",
    model="Qwen/Qwen2.5-1.5B",
    precision=torch.float16,
)

print("corrdim:", result.corrdim)
print("fit_r2:", result.fit_r2)
print("linear_region_bounds:", result.linear_region_bounds)
```

For batched input:

```python
import torch
import corrdim

results = corrdim.measure_texts(
    [
        "Short sample A...",
        "Short sample B...",
    ],
    model="Qwen/Qwen2.5-1.5B",
    precision=torch.float16,
)

for result in results:
    print(result.corrdim, result.fit_r2)
```
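When measuring many texts, it may be useful to set aside estimates whose log-log fit is poor. The helper below is a hypothetical sketch, not part of CorrDim: it assumes `fit_r2` is the R² of the linear fit, and the 0.95 threshold is arbitrary. The stand-in objects carry made-up placeholder numbers, so the block runs without loading a model.

```python
from types import SimpleNamespace


def reliable_results(results, min_r2=0.95):
    """Keep results whose log-log fit has R^2 of at least min_r2."""
    return [r for r in results if r.fit_r2 >= min_r2]


# Stand-ins with made-up numbers, exposing the two attributes printed above.
fake_results = [
    SimpleNamespace(corrdim=3.2, fit_r2=0.99),
    SimpleNamespace(corrdim=1.1, fit_r2=0.80),
]
kept = reliable_results(fake_results)
print([r.corrdim for r in kept])
```

The same function could be applied to the list returned by `measure_texts`, provided each element exposes `fit_r2` as in the example above.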

## API overview

The most important entry points are:

- `measure_text` / `measure_texts` for end-to-end text measurement
- `curve_from_text` / `curve_from_vectors` when you want the correlation-integral curve before estimating the dimension
- `estimate_dimension_from_curve` when you already have saved curve data
- `progressive_curve_from_text` for prefix-wise analysis
- `correlation_integral` and related functions for lower-level tensor workflows

For full API details, signatures, return types, and backend behavior, see the [documentation site](https://corrdim.readthedocs.io).

## CLI

CorrDim includes a `corrdim` command-line interface:

```bash
corrdim measure-text \
  --file data/sep60/chaos.txt \
  --model Qwen/Qwen2.5-1.5B
```

Additional CLI commands and options are documented at [corrdim.readthedocs.io](https://corrdim.readthedocs.io).

## Backends

CorrDim supports multiple backends for correlation-integral computation:

- `cuda`
- `triton`
- `pytorch`
- `pytorch_fast`
- `auto`

Set the default backend with an environment variable:

```bash
export CORRDIM_CORRINT_BACKEND=pytorch
```

Or in Python:

```python
import corrdim

print(corrdim.set_corrint_backend("auto"))
print(corrdim.available_corrint_backends())
```
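If you prefer an explicit choice over `auto`, one option is to pick the first backend from a preference list that is actually available. The helper below is a hypothetical sketch: `pick_backend` is not part of CorrDim, and the preference order is an illustrative assumption rather than a statement about which backend is fastest.

```python
def pick_backend(available, preferred=("cuda", "triton", "pytorch_fast", "pytorch")):
    """Return the first preferred backend name present in `available`."""
    available = set(available)
    for name in preferred:
        if name in available:
            return name
    return "auto"  # nothing matched: defer to the library's own selection


chosen = pick_backend(["pytorch", "pytorch_fast"])
print(chosen)
```

Assuming `corrdim.available_corrint_backends()` returns an iterable of backend names, the result could be passed on with `corrdim.set_corrint_backend(pick_backend(corrdim.available_corrint_backends()))`.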

## Citation

```bibtex
@inproceedings{du2025correlation,
  title={Correlation Dimension of Auto-Regressive Large Language Models},
  author={Du, Xin and Tanaka-Ishii, Kumiko},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
  arxiv={2510.21258}
}
```

## Links

- Documentation: https://corrdim.readthedocs.io
- Paper: https://arxiv.org/abs/2510.21258
- Repository: https://github.com/kduxin/corrdim

## License

MIT License
