Metadata-Version: 2.4
Name: dyf-rs
Version: 0.2.1
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: numpy>=1.20
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pyarrow>=10.0 ; extra == 'dev'
Requires-Dist: pyarrow>=10.0 ; extra == 'full'
Requires-Dist: polars>=0.19 ; extra == 'full'
Requires-Dist: scikit-learn>=1.0 ; extra == 'full'
Requires-Dist: sentence-transformers>=2.2 ; extra == 'full'
Provides-Extra: dev
Provides-Extra: full
Summary: Density Yields Features - Rust core for structure discovery in embedding spaces
Keywords: outlier,classification,lsh,pca,embeddings,machine-learning
Author-email: Justin Donaldson <jdonaldson@gmail.com>
License: Proprietary
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/jdonaldson/dyf
Project-URL: Repository, https://github.com/jdonaldson/dyf

# DYF - Outlier Classification

Fast outlier classification using PCA-based locality-sensitive hashing (LSH). It identifies three types of items in an embedding space:

- **Dense**: Items in well-populated semantic buckets
- **Bridge**: Sparse items that find a community via recovery PCA, and so connect clusters
- **Orphan**: Truly unique items with no semantic neighbors
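
To make "PCA-based LSH" and "well-populated buckets" concrete, here is a minimal NumPy sketch. It is not the library's Rust implementation: the helper names `pca_lsh_codes` and `dense_mask` are hypothetical, and it assumes a bucket code is the sign pattern of an item's projections onto the top principal components.

```python
import numpy as np


def pca_lsh_codes(embeddings: np.ndarray, n_bits: int) -> np.ndarray:
    """Hash each row to an integer bucket code (illustrative, not dyf's code)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_bits]
    # One bit per component: is the projection positive?
    bits = (centered @ components.T) > 0
    weights = 1 << np.arange(components.shape[0], dtype=np.int64)
    return bits.astype(np.int64) @ weights


def dense_mask(codes: np.ndarray, dense_threshold: int) -> np.ndarray:
    """True for items whose bucket holds at least `dense_threshold` items."""
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    return counts[inverse] >= dense_threshold


rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2000, 32)).astype(np.float32)
codes = pca_lsh_codes(embeddings, n_bits=6)
is_dense = dense_mask(codes, dense_threshold=10)
print(f"{is_dense.sum()} of {len(codes)} items fall in dense buckets")
```

Items outside the dense mask are the candidates for the bridge and orphan categories.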

## Installation

```bash
pip install dyf
```

## Quick Start

```python
import numpy as np
from dyf import OutlierClassifier

# Create embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Classify outliers
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Get results
print(classifier.report())
bridge = classifier.get_bridge()    # Indices of bridge items
orphans = classifier.get_orphans()  # Indices of orphan items
```

## Performance

About 60 ms for 60K embeddings (384 dimensions), 3.8x faster than a pure Python/sklearn implementation.

## API

### OutlierClassifier

```python
OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,          # Bits for initial PCA LSH
    recovery_bits: int = 8,          # Bits for recovery PCA
    dense_threshold: int = 10,       # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,  # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,   # Min cluster size for "recovered"
    seed: int = 31,                  # Random seed
)
```
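
One plausible reading of how these parameters interact is a two-pass scheme: bucket everything with `initial_bits`, mark large buckets as dense, then re-bucket only the leftover items with `recovery_bits` and mark those that regroup as bridge. The sketch below illustrates that reading in plain NumPy. It is an assumption, not the library's implementation: `sign_codes`, `bucket_sizes`, and `classify` are hypothetical names, and `intra_outlier_std` and `seed` are left out.

```python
import numpy as np


def sign_codes(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Bucket code per row: sign pattern of the top PCA projections."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_bits]  # may hold fewer than n_bits rows for small inputs
    bits = (centered @ components.T > 0).astype(np.int64)
    return bits @ (1 << np.arange(components.shape[0], dtype=np.int64))


def bucket_sizes(codes: np.ndarray) -> np.ndarray:
    """For each item, the number of items sharing its bucket code."""
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    return counts[inverse]


def classify(x, initial_bits=14, recovery_bits=8,
             dense_threshold=10, recovery_cluster_min=3):
    """Assumed two-pass labelling: dense, then bridge, else orphan."""
    statuses = np.full(len(x), "orphan", dtype=object)
    # Pass 1: items in well-populated buckets are dense.
    dense = bucket_sizes(sign_codes(x, initial_bits)) >= dense_threshold
    statuses[dense] = "dense"
    # Pass 2: refit PCA on the sparse items only, with coarser buckets.
    sparse_idx = np.flatnonzero(~dense)
    if len(sparse_idx):
        sizes = bucket_sizes(sign_codes(x[sparse_idx], recovery_bits))
        statuses[sparse_idx[sizes >= recovery_cluster_min]] = "bridge"
    return statuses


rng = np.random.default_rng(31)
x = rng.standard_normal((5000, 64)).astype(np.float32)
labels, counts = np.unique(classify(x).astype(str), return_counts=True)
print(dict(zip(labels, counts)))
```

Under this reading, fewer `recovery_bits` means coarser recovery buckets, so more sparse items regroup as bridge and fewer remain orphans.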

**Methods:**
- `fit(embeddings)` - Fit on a NumPy array of shape `(n_samples, embedding_dim)`
- `fit_arrow(arrow_array)` - Fit on a PyArrow `FixedSizeListArray` (zero-copy)
- `get_bridge()` - Get indices of bridge items
- `get_orphans()` - Get indices of orphan items
- `get_statuses()` - Get status for all items
- `report()` - Get classification report

## License

MIT

