Metadata-Version: 2.4
Name: embedding-condensation
Version: 0.1.0
Summary: Measure layer-wise token embedding cosine similarity (embedding condensation diagnostic).
Project-URL: Homepage, https://chenliu-1996.github.io/projects/LM-Dispersion/
Project-URL: Repository, https://github.com/ChenLiu-1996/LM-Dispersion
Author: Chen Liu
License: Non-Commercial License — see ../LICENSE.md in the LM-Dispersion repository.
License-File: LICENSE
Keywords: embeddings,language-models,representation-geometry,transformers
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: datasets>=2.14
Requires-Dist: matplotlib>=3.7
Requires-Dist: nltk>=3.8
Requires-Dist: numpy>=1.24
Requires-Dist: torch>=2.0
Requires-Dist: tqdm>=4.65
Requires-Dist: transformers<4.48,>=4.40
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == 'test'
Description-Content-Type: text/markdown

# embedding-condensation

A minimal library for the **embedding condensation** diagnostic from [LM-Dispersion](https://github.com/ChenLiu-1996/LM-Dispersion). It computes layer-wise cosine-similarity matrices between token embeddings and can optionally plot them as heatmaps.
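
As an illustration of the quantity involved, the sketch below computes the mean pairwise cosine similarity between token embeddings for each layer. It is written in plain Python to stay dependency-free and is not the library's implementation; `cosine_similarity`, `layerwise_mean_cossim`, and the nested-list input layout are assumptions made for this example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_mean_cossim(hidden_states):
    """Mean cosine similarity over all distinct token pairs, per layer.

    hidden_states: hypothetical layout, a list of layers, where each layer
    is a list of token embeddings (lists of floats). Each layer needs at
    least two tokens.
    """
    means = []
    for layer in hidden_states:
        pairs = list(combinations(layer, 2))
        total = sum(cosine_similarity(u, v) for u, v in pairs)
        means.append(total / len(pairs))
    return means


# Toy input: two layers, two tokens each, 2-dimensional embeddings.
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0]],  # orthogonal token embeddings
    [[1.0, 1.0], [2.0, 2.0]],  # token embeddings pointing the same way
]
print(layerwise_mean_cossim(toy_layers))
```

Values near 1 mean the token embeddings in that layer point in nearly the same direction; values near 0 mean they are close to orthogonal.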

## Install

```bash
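# Run from the root of a local clone of the LM-Dispersion repository.
cd pypi
# Editable install, including the optional test dependencies.
pip install -e ".[test]"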
```

## Usage

```python
from transformers import AutoModel, AutoTokenizer
from embedding_condensation import measure_embedding_condensation

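# Load a pretrained model in evaluation mode, along with its tokenizer.
model = AutoModel.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

result = measure_embedding_condensation(
    model,
    tokenizer,
    texts=["Your long input text here. " * 200],  # placeholder; use real text
    repetitions=1,
    plot=False,
)
# Mean token cosine similarity for each layer.
print(result.mean_cossim_by_layer)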
```

## PyPI upload

```bash
cd pypi
pip install build twine
python -m build
twine upload dist/*
```

## Test

```bash
cd pypi
pytest
```
