Metadata-Version: 2.4
Name: corrdim
Version: 0.3.6
Summary: Correlation Dimension for LLMs - A library for computing correlation dimension of autoregressive large language models
Author-email: kduxin <duxin@tongji.edu.cn>
License: MIT
Project-URL: Homepage, https://github.com/kduxin/corrdim
Project-URL: Repository, https://github.com/kduxin/corrdim
Project-URL: Documentation, https://corrdim.readthedocs.io
Project-URL: Bug Tracker, https://github.com/kduxin/corrdim/issues
Keywords: correlation-dimension,language-models,fractal-geometry,nlp,machine-learning,natural-language-processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: ninja>=1.11.0
Requires-Dist: torch>=2.3.0
Requires-Dist: setuptools>=68.0.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: accelerate>=0.21.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: ipython>=8.39.0
Provides-Extra: dev
Requires-Dist: pytest>=8.4.2; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ipykernel>=6.0.0; extra == "dev"
Requires-Dist: matplotlib>=3.10.6; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=5.3.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=0.16.1; extra == "docs"
Provides-Extra: demo
Requires-Dist: streamlit>=1.51.0; extra == "demo"
Dynamic: license-file


# CorrDim: Correlation Dimension for Language Models

CorrDim is a Python library for computing the **correlation dimension** of autoregressive language models from next-token log-probability vectors, based on the paper **["Correlation Dimension of Auto-Regressive Large Language Models"](https://arxiv.org/abs/2510.21258)** (NeurIPS 2025).

## Documentation

Full documentation is available at [corrdim.readthedocs.io](https://corrdim.readthedocs.io).

Use the docs site for:

- installation details and backend notes
- the full Python API reference
- CLI documentation
- examples and usage patterns

## What CorrDim measures

Given a text and an autoregressive language model, CorrDim measures the text's **global structural complexity** as perceived by that model.

In practice:

- repetitive or degenerate text tends to have a lower correlation dimension
- ordinary fluent text tends to have a higher dimension
- richer long-range structure can produce an even higher dimension

CorrDim is complementary to local metrics such as perplexity: it focuses on **sequence-level geometry**, not just token-level prediction quality.

## How it works

At a high level, CorrDim:

1. converts text into a sequence of next-token log-probability vectors
2. optionally reduces the vocabulary dimension
3. computes a correlation-integral curve over epsilon thresholds
4. estimates the correlation dimension by fitting a line in log-log space

For the mathematical details, see the [paper](https://arxiv.org/abs/2510.21258).
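
The four steps above follow the classic Grassberger-Procaccia recipe. As a toy illustration only (this is not CorrDim's implementation; the function name, epsilon grid, and fitting choices here are made up for the example), the correlation-integral curve and log-log slope fit can be sketched on plain vectors:

```python
import numpy as np

def correlation_dimension(points: np.ndarray, n_eps: int = 20) -> float:
    """Toy Grassberger-Procaccia estimate: slope of log C(eps) vs log eps."""
    # pairwise Euclidean distances (upper triangle, i < j)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(points), k=1)
    pair_dists = dists[iu]

    # log-spaced epsilon grid spanning the observed distance range
    eps = np.logspace(np.log10(pair_dists.min()),
                      np.log10(pair_dists.max()), n_eps)
    # correlation integral: fraction of pairs closer than each epsilon
    C = np.array([(pair_dists < e).mean() for e in eps])

    # fit a line in log-log space over the non-degenerate part of the curve
    mask = (C > 0) & (C < 1)
    slope, _ = np.polyfit(np.log(eps[mask]), np.log(C[mask]), 1)
    return slope

# points on a 1-D curve embedded in 3-D; expect an estimate near 1
t = np.linspace(0, 10, 500)
helix = np.stack([np.cos(t), np.sin(t), t], axis=1)
print(correlation_dimension(helix))
```

CorrDim applies the same idea to the sequence of next-token log-probability vectors, with the fitting of the linear region handled for you.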

## Installation

CorrDim requires Python 3.10 or newer.
You can install it with either `pip` or `uv`.

### pip

> **Linux GPU users:** By default, PyPI distributes CPU-only PyTorch on Linux. If you have an NVIDIA GPU, install CUDA PyTorch first.
> Choose based on your driver version:
>
> | CUDA version | Min driver | Install command |
> |---|---|---|
> | cu126 (default) | ≥ 525 | `pip install torch --index-url https://download.pytorch.org/whl/cu126` |
> | cu130 | ≥ 580 | `pip install torch --index-url https://download.pytorch.org/whl/cu130` |
>
> (For NVIDIA DGX Spark with GB10, use cu130)

Then, run
```bash
pip install corrdim
```

### uv

If you use `uv`, install PyTorch before installing corrdim.

> **Linux GPU users:** By default, PyPI distributes CPU-only PyTorch on Linux. If you have an NVIDIA GPU, install CUDA PyTorch first.
> Choose based on your driver version:
>
> | CUDA version | Min driver | Install command |
> |---|---|---|
> | cu126 (default) | ≥ 525 | `uv add torch --index https://download.pytorch.org/whl/cu126` |
> | cu130 | ≥ 580 | `uv add torch --index https://download.pytorch.org/whl/cu130` |
> 
> (For NVIDIA DGX Spark with GB10, use cu130)

Then, run
```bash
uv add corrdim
```


## Quick start

```python
import torch
import corrdim

result = corrdim.measure_text(
    "Your text here...",
    model="Qwen/Qwen3-0.6B",
    precision=torch.float16,
)

print("corrdim:", result.corrdim)
print("fit_r2:", result.fit_r2)
print("linear_region_bounds:", result.linear_region_bounds)
```

For batched input:

```python
import torch
import corrdim

results = corrdim.measure_texts(
    [
        "Short sample A...",
        "Short sample B...",
    ],
    model="Qwen/Qwen3-0.6B",
    precision=torch.float16,
)

for result in results:
    print(result.corrdim, result.fit_r2)
```

### Progressive dimension along the sequence

To fit correlation dimension at multiple prefix lengths without re-running the model for each prefix, use `measure_text_progressive`. It calls `progressive_curve_from_text` once, then subsamples prefix indices:

- `skip_prefix_tokens`: first prefix index to include (shorter prefixes are skipped)
- `measure_every_tokens`: stride between measured indices; `None` (the default) picks the stride from the sequence length: `1` for fewer than 100 tokens, `10` for fewer than 1000, and `100` otherwise

The return value is a `ProgressiveDimensionResult`: `by_prefix` maps prefix index to a full `DimensionResult`; `corrdims` maps index to the fitted scalar only.
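
The default stride rule can be written out as a small helper (an illustrative reimplementation of the documented behavior, not CorrDim's actual code; the name `default_measure_stride` is invented for this sketch):

```python
def default_measure_stride(n_tokens: int) -> int:
    """Stride used when measure_every_tokens is None, per the rule above."""
    if n_tokens < 100:
        return 1
    if n_tokens < 1000:
        return 10
    return 100

for n in (50, 500, 5000):
    print(n, default_measure_stride(n))
```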

```python
import torch
import corrdim

prog_dims = corrdim.measure_text_progressive(
    long_text,
    model="Qwen/Qwen3-0.6B",
    precision=torch.float16,
    skip_prefix_tokens=100,
)

for prefix_len, d in sorted(prog_dims.corrdims.items()):
    print(prefix_len, d)
```

## API overview

The most important entry points are:

- `measure_text` / `measure_texts` for end-to-end text measurement
- `measure_text_progressive` for multiple fitted dimensions along sequence prefixes (one model pass)
- `curve_from_text` / `curve_from_vectors` when you want the curve first
- `estimate_dimension_from_curve` when you already have saved curve data
- `progressive_curve_from_text` for prefix-wise analysis
- `correlation_integral` and related functions for lower-level tensor workflows

For full API details, signatures, return types, and backend behavior, see the [documentation site](https://corrdim.readthedocs.io).

## CLI

CorrDim includes a `corrdim` command-line interface:

```bash
corrdim measure-text \
  --file data/sep60/chaos.txt \
  --model Qwen/Qwen3-0.6B
```

Additional CLI commands and options are documented at [corrdim.readthedocs.io](https://corrdim.readthedocs.io).

## Backends

CorrDim supports multiple backends for correlation-integral computation:

- `triton`
- `pytorch`
- `pytorch_fast`
- `auto`

Set the default backend with:

```bash
export CORRDIM_CORRINT_BACKEND=pytorch
```

Or in Python:

```python
import corrdim

print(corrdim.set_corrint_backend("auto"))
print(corrdim.available_corrint_backends())
```

## Tips for systems with limited GPU RAM (e.g., <10GB)

If you run into out-of-memory errors, reduce `block_size` (default 512) to lower the peak memory usage during correlation-integral computation:

```python
result = corrdim.measure_text(
    text,
    model="Qwen/Qwen3-0.6B",
    block_size=128,
)
```


You can also set `forward_chunk_size` to limit how many tokens are processed per forward pass; lowering it (e.g. to 128) further reduces peak memory:

```python
result = corrdim.measure_text(
    text,
    model="Qwen/Qwen3-0.6B",
    block_size=128,
    forward_chunk_size=128,
)
```

## Citation

```bibtex
@inproceedings{du2025correlation,
  title={Correlation Dimension of Auto-Regressive Large Language Models},
  author={Du, Xin and Tanaka-Ishii, Kumiko},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
  eprint={2510.21258},
  archiveprefix={arXiv}
}
```

## Links

- Documentation: https://corrdim.readthedocs.io
- Paper: https://arxiv.org/abs/2510.21258
- Repository: https://github.com/kduxin/corrdim

## License

MIT License
