Metadata-Version: 2.4
Name: soakpy
Version: 0.0.54
Summary: SOAK splitting utility
Author-email: Tung Nguyen <nguyenlamtung10@gmail.com>
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scikit-learn
Requires-Dist: matplotlib
Requires-Dist: scipy
Dynamic: license-file

# SOAK: Same/Other/All K-fold Cross-Validation
SOAK estimates the **similarity of patterns** across subsets of a dataset. It extends K-fold cross-validation: for each test fold drawn from one subset, a model is trained on rows from the same subset ("Same"), from the other subsets ("Other"), or from all subsets ("All"). Comparing the resulting test errors indicates how well patterns learned on one subset carry over to another.
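
To make the three strategies concrete, here is a minimal sketch in plain NumPy of how Same, Other, and All training sets could be formed for each test subset and fold. The function `same_other_all_splits` and its fold assignment are hypothetical and written only for illustration; this is not soakpy's implementation, and it leaves out the downsampled (`-ds`) categories and the repeated random seeds.

```python
import numpy as np

def same_other_all_splits(subset_vec, n_splits=2, seed=0):
    """Yield (subset_value, category, fold_id, train_idx, test_idx).

    Hypothetical helper for illustration; not part of soakpy.
    """
    subset_vec = np.asarray(subset_vec)
    rng = np.random.default_rng(seed)
    subset_values = np.unique(subset_vec)

    # Give every row a fold id in 1..n_splits, balanced within each subset.
    fold_vec = np.empty(len(subset_vec), dtype=int)
    for value in subset_values:
        idx = np.flatnonzero(subset_vec == value)
        fold_vec[idx] = rng.permutation(np.arange(len(idx)) % n_splits) + 1

    for fold_id in range(1, n_splits + 1):
        in_train_folds = fold_vec != fold_id
        for value in subset_values:
            is_same = subset_vec == value
            # Test rows: the held-out fold of the current subset.
            test_idx = np.flatnonzero(is_same & ~in_train_folds)
            train_masks = {
                "same": in_train_folds & is_same,     # same subset only
                "other": in_train_folds & ~is_same,   # every other subset
                "all": in_train_folds,                # all subsets together
            }
            for category, mask in train_masks.items():
                yield value, category, fold_id, np.flatnonzero(mask), test_idx

# Example data: 13 integers labeled by parity.
values = np.append(np.arange(10), [10, 12, 14])
subset_vec = np.where(values % 2 == 0, "even", "odd")

splits = list(same_other_all_splits(subset_vec, n_splits=2))
for value, category, fold_id, train_idx, test_idx in splits:
    print(f"{value:5s} {category:6s} fold {fold_id}: "
          f"train={values[train_idx]} test={values[test_idx]}")
```

In this sketch the test rows always come from a single subset, and only the source of the training rows changes between categories, so any difference in test error between "same" and "other" would reflect how transferable the learned pattern is.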

# Usage
## Low-level: SOAK split only
```python
import numpy as np
import soakpy

# --- synthetic data ---
# np.append flattens its input, so build the 1-D array directly
X = np.append(np.arange(10), [10, 12, 14])
y = X.ravel()
subset_vec = np.array(['even' if x % 2 == 0 else 'odd' for x in X.ravel()])

# --- Iterate over the SOAK splits ---
for subset_value, category, fold_id, random_seed, train_idx_final, test_same_idx in soakpy.split(subset_vec, n_splits=2, n_random_seeds=2):
    print(f"test subset: {subset_value:6s} --- category: {category:6s} --- test fold: {fold_id}")
    print(f"y_test : {y[test_same_idx]}")
    print(f"y_train: {y[train_idx_final]}")
    print("-"*50)
```

Output:

```
test subset: even   --- category: same   --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 6  8 12 14]
--------------------------------------------------
test subset: even   --- category: other  --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [1 9]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 6 14]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 8 14]
--------------------------------------------------
test subset: even   --- category: all    --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 1  6  8  9 12 14]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 1 14]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [12 14]
--------------------------------------------------
test subset: odd    --- category: same   --- test fold: 1
y_test : [3 5 7]
y_train: [1 9]
--------------------------------------------------
test subset: odd    --- category: other  --- test fold: 1
y_test : [3 5 7]
y_train: [ 6  8 12 14]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 1
y_test : [3 5 7]
y_train: [12 14]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 1
y_test : [3 5 7]
y_train: [ 8 14]
--------------------------------------------------
test subset: odd    --- category: all    --- test fold: 1
y_test : [3 5 7]
y_train: [ 1  6  8  9 12 14]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 1
y_test : [3 5 7]
y_train: [8 9]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 1
y_test : [3 5 7]
y_train: [ 8 14]
--------------------------------------------------
test subset: even   --- category: same   --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 0  2  4 10]
--------------------------------------------------
test subset: even   --- category: other  --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [3 5 7]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [0 2]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 4 10]
--------------------------------------------------
test subset: even   --- category: all    --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 0  2  3  4  5  7 10]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [0 5]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 2 10]
--------------------------------------------------
test subset: odd    --- category: same   --- test fold: 2
y_test : [1 9]
y_train: [3 5 7]
--------------------------------------------------
test subset: odd    --- category: other  --- test fold: 2
y_test : [1 9]
y_train: [ 0  2  4 10]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 2
y_test : [1 9]
y_train: [0 4]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 2
y_test : [1 9]
y_train: [ 2 10]
--------------------------------------------------
test subset: odd    --- category: all    --- test fold: 2
y_test : [1 9]
y_train: [ 0  2  3  4  5  7 10]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 2
y_test : [1 9]
y_train: [2 7]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 2
y_test : [1 9]
y_train: [2 5]
--------------------------------------------------
```

## High-level: Analyze dataset and Visualize
```python
import soakpy
import pandas as pd

# Load the example dataset (xz-compressed CSV)
df = pd.read_csv("https://github.com/lamtung16/soak_regression/raw/refs/heads/main/data/WorkersCompensation.csv.xz")
# Subsets come from the "Gender" column; the target is the claim cost
soak_obj = soakpy.SOAK(df=df, subset_col="Gender", target_col="UltimateIncurredClaimCost")
# Run the listed models over the SOAK splits
soak_obj.analyze(model_list=["featureless", "tree"], n_splits=2, n_random_seeds=2, log_target=True)
# Plot one test subset, model, and metric per call
soak_obj.visualize(subset_value='M', model="tree", metric="rmse", figsize=(12, 2.5))
soak_obj.visualize(subset_value='F', model="featureless", metric="mae", figsize=(12, 2.5))
```
