Metadata-Version: 2.4
Name: sbss
Version: 0.0.5
Summary: Similarity-Based Stratified Splitting Algorithm
Author-email: timothyckl <timothy.ckl@outlook.com>
License: MIT
Project-URL: Homepage, https://github.com/timothyckl/similarity-stratified-split
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.23.5
Provides-Extra: dev
Requires-Dist: pytest==7.4.3; extra == "dev"
Requires-Dist: scipy; extra == "dev"
Requires-Dist: scikit-learn; extra == "dev"
Dynamic: license-file

# Similarity Stratified Split

Implementation of the Similarity-Based Stratified Splitting algorithm described in [Similarity Based Stratified Splitting: an approach to train better classifiers](https://arxiv.org/abs/2010.06099).

## Overview

The authors propose Similarity-Based Stratified Splitting (SBSS), a technique that uses both output-space (label) and input-space (feature) information to split a dataset. Splits are generated by applying a similarity function to the samples and placing similar samples in different splits, so that each split gives a better representation of the data during training. According to the paper, this leads to a more realistic performance estimate for models used in real-world applications.
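The idea of placing similar samples in different splits can be illustrated with a small NumPy-only sketch. This is a simplified illustration under stated assumptions, not the code shipped in `sbss` and not a faithful reproduction of the paper's procedure: the names `pairwise_euclidean` and `similarity_split_sketch` are hypothetical, Euclidean distance stands in for the similarity function, and the pivot is chosen at random within each class.

```python
import numpy as np


def pairwise_euclidean(x):
    """Return the square matrix of Euclidean distances between rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def similarity_split_sketch(X, y, n_splits, seed=0):
    """Assign a fold id to every sample (illustrative only, not the sbss code).

    Within each class, pick a random unassigned pivot, gather the unassigned
    samples closest to it (up to n_splits in total), and send each member of
    that group to a different fold.
    """
    rng = np.random.default_rng(seed)
    dist = pairwise_euclidean(X)
    folds = np.full(len(y), -1)
    for label in np.unique(y):
        remaining = list(np.flatnonzero(y == label))
        while remaining:
            pivot = remaining[rng.integers(len(remaining))]
            # Sort the unassigned samples of this class by distance to the
            # pivot; the pivot has distance 0 to itself, so it normally
            # leads the group.
            order = sorted(remaining, key=lambda i: dist[pivot, i])
            group = order[:n_splits]
            # Shuffle fold ids so no fold systematically receives the pivots.
            fold_ids = rng.permutation(n_splits)[: len(group)]
            for idx, fold in zip(group, fold_ids):
                folds[idx] = fold
            remaining = [i for i in remaining if i not in group]
    return folds


rng = np.random.default_rng(0)
X_demo = rng.random((60, 4))
y_demo = np.repeat([0, 1, 2], 20)
folds_demo = similarity_split_sketch(X_demo, y_demo, n_splits=4)
```

With 20 samples per class and 4 folds, every group of four neighbours is spread across all four folds, so each fold ends up with five samples of every class.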

## Install

**PyPI**
```bash
pip install sbss
```

**Local**

```bash
git clone https://github.com/timothyckl/similarity-stratified-split.git
cd ./similarity-stratified-split
pip install -e .
```

## Usage

```python
import numpy as np
# scipy is only a dev extra of sbss, so install it separately if needed
from scipy.spatial import distance
from sbss import SimilarityStratifiedSplit

def get_distances(x):
    # pairwise Euclidean distances as a square (n_samples, n_samples) matrix
    return distance.squareform(distance.pdist(x, metric='euclidean'))

# inputs are recommended to be normalized
X = np.random.rand(1000, 128)
y = np.random.randint(0, 10, (1000,))

n_splits = 3
s = SimilarityStratifiedSplit(n_splits, dist_func=get_distances)

for train_index, test_index in s.split(X, y):
    print(f"Train indices: {train_index}\nTest indices: {test_index}")
    print("=" * 100)
```
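The package metadata lists `scipy` only under the `dev` extra, so a distance function that needs nothing beyond NumPy can be convenient. The sketch below is one possible way to write it, together with a simple normalization step. The names `min_max_normalize` and `get_distances_numpy` are hypothetical, min-max scaling is just one choice of normalization (the README does not say which one is intended), and it is an assumption, based on the example above, that `dist_func` accepts any callable that maps `X` to a square pairwise distance matrix.

```python
import numpy as np


def min_max_normalize(x):
    """Scale each feature column to the [0, 1] range."""
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    # Constant columns have zero span; divide by 1 so they stay at 0.
    span = np.where(span == 0, 1.0, span)
    return (x - lo) / span


def get_distances_numpy(x):
    """Return the (n_samples, n_samples) Euclidean distance matrix."""
    sq_norms = (x ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.clip(sq_dist, 0.0, None))


X_raw = np.random.default_rng(0).normal(size=(50, 8))
X_norm = min_max_normalize(X_raw)
D = get_distances_numpy(X_norm)
```

If the assumption about `dist_func` holds, `get_distances_numpy` could be passed in place of `get_distances`. Because of round-off, the diagonal of `D` is close to zero rather than exactly zero.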

## sklearn Compatibility

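`SimilarityStratifiedSplit` is compatible with sklearn's cross-validation utilities. It can be passed directly as the `cv` argument of `cross_val_score`, `GridSearchCV`, and similar tools. The example below reuses `X`, `y`, and `get_distances` from the Usage section: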

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

splitter = SimilarityStratifiedSplit(n_splits=3, dist_func=get_distances)
cv_scores = cross_val_score(SVC(), X, y, cv=splitter)
```

Note that `y` is always required by the SBSS algorithm: passing `y=None` will raise a `ValueError`.
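Because the split is stratified on `y`, one way to sanity-check a set of folds is to compare the class proportions in each test fold. The helper below is a hypothetical sketch (`fold_class_proportions` is not part of `sbss`) that works on any iterable of `(train_index, test_index)` pairs, such as the pairs yielded by `split(X, y)`. The folds in the demo are hand-built for illustration only.

```python
import numpy as np


def fold_class_proportions(y, folds):
    """Return the classes and a (n_folds, n_classes) array of test-fold
    class proportions for an iterable of (train_index, test_index) pairs."""
    y = np.asarray(y)
    classes = np.unique(y)
    rows = []
    for _, test_index in folds:
        labels = y[test_index]
        # Fraction of the test fold that belongs to each class.
        rows.append([(labels == c).mean() for c in classes])
    return classes, np.array(rows)


# Hand-built folds for illustration; not produced by sbss.
y_small = np.array([0, 0, 0, 0, 1, 1])
folds_small = [
    (np.array([1, 2, 3, 5]), np.array([0, 4])),
    (np.array([0, 2, 3, 4]), np.array([1, 5])),
    (np.array([0, 1, 4, 5]), np.array([2, 3])),
]
classes, proportions = fold_class_proportions(y_small, folds_small)
```

Rows that differ strongly from the overall class proportions, like the third fold in this demo, point to a fold that is not well stratified.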

## References

- Farias, F., Ludermir, T. and Bastos-Filho, C. (2020) Similarity Based Stratified Splitting: an approach to train better classifiers. arXiv.org. Available at: https://arxiv.org/abs/2010.06099 (Accessed: 27 November 2023).
