Metadata-Version: 2.4
Name: RE-sLDA
Version: 0.1.1
Summary: Resampling-Enhanced Sparse LDA for ordinal outcomes
Author: RE-sLDA contributors
License: # MIT License
        
        Copyright (c) 2026 Ryan Wang
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Source, https://github.com/your-org/RE-sLDA
Keywords: sparse-lda,ordinal-regression,feature-selection,bootstrap,bioinformatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: joblib
Requires-Dist: tqdm
Requires-Dist: tqdm-joblib
Provides-Extra: plot
Requires-Dist: matplotlib; extra == "plot"
Requires-Dist: seaborn; extra == "plot"
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# RE-sLDA: Resampling-Enhanced Sparse LDA for Ordinal Outcomes
This repository contains the implementation of RE-sLDA, a framework designed to enhance feature selection stability and accuracy when dealing with high-dimensional data and ordinal outcomes. By integrating resampling techniques with Sparse Linear Discriminant Analysis (sLDA), this method identifies robust biomarker signatures that standard sparse models often miss due to selection instability.

## **Key Features**
- **Ordinal Outcome Optimization:** Specifically tuned for categorical outcomes with a natural ordering (e.g., disease severity, treatment response grades).

- **Resampling-Based Stability:** Utilizes bootstrap-based resampling to calculate Variable Inclusion Probabilities (VIP), ensuring the selected features are not artifacts of a single data split.

- **Parallel Computing Support:** Fully integrated with `multiprocessing` for high-performance execution on multi-core machines.

---

## **Installation**

### From PyPI (recommended)
```bash
pip install RE-sLDA
```
This installs the `re_slda` Python package and a `re-slda` command-line entry point. A virtual environment is recommended to avoid dependency conflicts.

### From source
Clone the repository, then install in editable mode:
```bash
git clone https://github.com/your-org/RE-sLDA.git
cd RE-sLDA
pip install -e .
```

---

## **Usage**

RE-sLDA can be used in two ways: as importable functions inside a notebook or script, or as a command-line tool.

### Option A — Import into a notebook
```python
import pandas as pd
import re_slda

X = pd.read_csv("datasets/use_glio_data_filter1000.csv")    # has header
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv",
                header=None).values.squeeze()

# Bootstrapping — returns a DataFrame, one row per iteration
results = re_slda.run_bootstrapping(X, Y, iters=200, base_seed=42)

# Per-feature Variable Inclusion Probability
vip = re_slda.compute_vip(results)
vip.head(20)
```
`run_subsampling` takes the same inputs and also returns a DataFrame with one row per iteration. Pass `out_prefix="MyRun"` to either function to also write a timestamped CSV to `output/`. The public API is:

| Function | Purpose |
|----------|---------|
| `re_slda.run_bootstrapping(X, Y, varnames=None, *, iters=200, ...)` | Resampling-with-replacement pipeline. Returns a DataFrame. |
| `re_slda.run_subsampling(X, Y, varnames=None, *, iters=200, ...)`   | Train/test-split pipeline. Returns a DataFrame. |
| `re_slda.compute_vip(results)` | Variable Inclusion Probabilities from a results DataFrame. |
| `re_slda.predict_asda_ordinal(model, X_new)` | Predict ordinal labels from a fitted ordASDA model. |
| `re_slda.ordASDA(...)` | Low-level ordinal sparse LDA fitter. |

### Option B — Command line
After `pip install`, the `re-slda` console script is available:
```bash
re-slda bootstrapping
re-slda subsampling
```
The legacy invocation still works for users who prefer to clone the repo:
```bash
python pipeline.py bootstrapping
python pipeline.py subsampling
```
Both commands accept `--iters`, `--out-prefix`, `--save-dir`, `--seed`, `--x`, and `--y`. Run `re-slda --help` for details.

---

## **Tutorial: Walkthrough with the Included Glioma Dataset**

This repository ships with a real example dataset so you can run the full pipeline before applying RE-sLDA to your own data. The walkthrough below explains each step, how everything works, and how to read the results.

### Step 1 — Inspect the example data
Two files are provided in [`datasets/`](datasets/):

| File | Role | Shape | Description |
|------|------|-------|-------------|
| `use_glio_data_filter1000.csv` | Feature matrix **X** | 175 samples × 1000 features | Pre-filtered gene-expression features for glioma patients. First row is the feature names (`V4391`, `V708`, …). |
| `use_glio_dataY_filter1000.csv` | Response vector **Y** | 175 values | Ordinal tumor-grade labels (e.g. `1`, `2`, `3`, `4`). No header row. |

You can preview the data with any spreadsheet tool or:
```bash
head -2 datasets/use_glio_data_filter1000.csv
head -5 datasets/use_glio_dataY_filter1000.csv
```

### Step 2 — Run the bootstrapping pipeline
Bootstrapping is the recommended starting point, as it produces Variable Inclusion Probabilities (VIPs) for every feature. Run it either from the command line:

```bash
re-slda bootstrapping            # after `pip install RE-sLDA`
# or, from a source checkout:
python pipeline.py bootstrapping
```

or directly from a notebook:
```python
import pandas as pd, re_slda

X = pd.read_csv("datasets/use_glio_data_filter1000.csv")
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv", header=None).values.squeeze()
results = re_slda.run_bootstrapping(X, Y, iters=200)
```

What happens during this run (using the defaults):
1. The script draws **200 bootstrap replicates** of the rows of X/Y (with replacement).
2. For each replicate it samples a random predictor subspace of size `subspace_size = 5`, repeated until ~80% of the predictor pool has been covered (`target_unique_prob = 0.8`).
3. Inside each subspace it fits an ordinal sLDA model with 4-fold cross-validation to pick `optimal_lambda`.
4. Selected variables are accumulated and held-out MAE / Accuracy are recorded.
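
Steps 1 and 2 can be sketched in plain Python. This is a minimal illustration of the resampling idea, not the package's implementation: the helper names `bootstrap_indices` and `sample_subspaces`, the 100-feature pool, and the use of never-drawn rows as a hold-out set are all assumptions made for the example.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def sample_subspaces(pool, subspace_size, target_coverage, rng):
    """Sample random feature subsets until the target share of the pool is covered."""
    covered, subspaces = set(), []
    while len(covered) < target_coverage * len(pool):
        subspace = rng.sample(pool, subspace_size)  # no repeats within a subspace
        subspaces.append(subspace)
        covered.update(subspace)
    return subspaces, covered

rng = random.Random(42)
rows = bootstrap_indices(175, rng)                # rows drawn for this replicate
never_drawn = set(range(175)) - set(rows)         # assumed hold-out set (illustration only)
pool = [f"V{i}" for i in range(1, 101)]           # hypothetical 100-feature pool
subspaces, covered = sample_subspaces(pool, 5, 0.8, rng)
print(len(rows), len(never_drawn), len(subspaces), len(covered) / len(pool))
```

Because each subspace holds only 5 features, many subspaces are needed before 80% of the pool has been seen at least once, which is a large part of why a full run takes minutes rather than seconds.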

Expect a runtime of several minutes on a modern multi-core laptop. The console will print progress as each replicate completes.

### Step 3 — Run the subsampling pipeline (optional, for comparison)
```bash
re-slda subsampling
```
or from a notebook:
```python
results = re_slda.run_subsampling(X, Y, iters=200)
```
Subsampling replaces the bootstrap with repeated train/test splits (`test_ratio = 0.20`, `n_subspaces = 5`, 200 iterations). It is useful as a sanity check: features that appear stable under *both* schemes are the most trustworthy.
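
For intuition, a repeated train/test split with `test_ratio = 0.20` can be sketched as follows. The helper `train_test_split_indices` is hypothetical, and the split here is a plain shuffle; whether the package stratifies by grade is not something this sketch assumes.

```python
import random

def train_test_split_indices(n_samples, test_ratio, rng):
    """Shuffle row indices and hold out a test_ratio share of them."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

rng = random.Random(0)
splits = [train_test_split_indices(175, 0.20, rng) for _ in range(3)]
for train, test in splits:
    print(len(train), len(test))  # 140 35 for each split
```

Unlike a bootstrap replicate, every row appears at most once per split, so the training set contains no duplicated samples.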

### Step 4 — Locate the output
When invoked from the command line, results are written to [`output/`](output/) with a timestamped filename:
```
output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv      # bootstrapping
output/CVlam_Glio_Subspace_subsampling_<mmddHHMM>.csv   # subsampling
```
From a notebook nothing is written to disk by default — the call returns a DataFrame. Pass `out_prefix="MyRun"` (and optionally `save_dir=...`) to also write a timestamped CSV.

### Step 5 — Interpret the results
Each CSV contains one row per resampling iteration with these columns:

| Column | Meaning | How to read it |
|--------|---------|----------------|
| `Selected_Variables` | Comma-separated features chosen on that iteration | Tally these across rows. Features appearing in many rows are stable. The fraction of rows in which a feature appears is its **Variable Inclusion Probability (VIP)**. |
| `optimal_lambda` | Cross-validated regularisation strength | A tight distribution suggests the regularisation surface is well-behaved; very high variance suggests an under-determined problem. |
| `MAE` | Mean Absolute Error on held-out data | Lower is better. Because Y is ordinal, MAE is the primary performance metric. It penalises a "grade 4 predicted as grade 2" more than a one-step error. |
| `Accuracy` | Exact-match classification accuracy on held-out data | Use as a secondary metric; ordinal models often have modest accuracy but small MAE. |
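
A small worked example shows why MAE and accuracy can disagree on ordinal labels. The label vectors below are made up for illustration, and the two helper functions are written out by hand rather than taken from the package.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted grades."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true grade exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true    = [1, 2, 3, 4, 4]
near_miss = [1, 2, 3, 4, 3]   # one prediction off by one grade
far_miss  = [1, 2, 3, 4, 2]   # one prediction off by two grades

print(accuracy(y_true, near_miss), mean_absolute_error(y_true, near_miss))  # 0.8 0.2
print(accuracy(y_true, far_miss), mean_absolute_error(y_true, far_miss))    # 0.8 0.4
```

Both prediction vectors have the same accuracy, but the two-grade error doubles the MAE, which is the behaviour the table describes.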

A typical post-processing pattern in Python:
```python
import pandas as pd, re_slda

# If you ran from a notebook you already have `results`; otherwise read the CSV.
# Replace <mmddHHMM> with the timestamp in your actual output filename.
results = pd.read_csv("output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv")

# 1. Per-feature VIP across the 200 iterations
vip = re_slda.compute_vip(results)
print(vip.head(20))

# 2. Predictive performance summary
print(results[["MAE", "Accuracy", "optimal_lambda"]].describe())
```
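
To see what the VIP tally amounts to, here is a hand-rolled version in plain Python. It assumes each `Selected_Variables` cell is a comma-separated string, as described in the table above; `vip_by_hand` is a hypothetical helper and `re_slda.compute_vip` may differ in its details and return type.

```python
from collections import Counter

def vip_by_hand(selected_per_iteration):
    """Fraction of iterations in which each feature was selected."""
    counts = Counter()
    for cell in selected_per_iteration:
        # Use a set so a feature counts at most once per iteration.
        names = {name.strip() for name in cell.split(",") if name.strip()}
        counts.update(names)
    n_iterations = len(selected_per_iteration)
    return {name: count / n_iterations for name, count in counts.most_common()}

# Toy data: four iterations; V12 is a made-up feature name.
toy = ["V4391, V708", "V4391", "V708, V12", "V4391, V12"]
vip_toy = vip_by_hand(toy)
print(vip_toy["V4391"], vip_toy["V708"], vip_toy["V12"])  # 0.75 0.5 0.5
```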

**Rules of thumb for the example dataset:**
- Features with **VIP ≥ 0.6** are the candidate stable signature.
- Compare the top VIP features from the bootstrapping and subsampling outputs. The intersection is the most reliable.
- Median MAE on the glioma example should sit well below 1.0 (i.e. on average predictions are off by less than one tumor grade).
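
The first two rules of thumb above can be combined into a short consensus check. The sketch assumes the VIPs from each scheme are available as plain name-to-probability dictionaries; the values are invented for illustration and `stable_features` is a hypothetical helper.

```python
def stable_features(vip, threshold=0.6):
    """Names of features whose VIP meets the threshold."""
    return {name for name, p in vip.items() if p >= threshold}

# Made-up VIP values for three features under each resampling scheme.
vip_bootstrap = {"V4391": 0.82, "V708": 0.64, "V12": 0.41}
vip_subsample = {"V4391": 0.77, "V708": 0.55, "V12": 0.63}

consensus = stable_features(vip_bootstrap) & stable_features(vip_subsample)
print(sorted(consensus))  # ['V4391']
```

Only the feature that clears the threshold under both schemes survives, which is the intersection the second rule refers to.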

### Step 6 — Adapt to your own data
Once you are familiar with the framework, swap in your own data (see *Setup Dataset* below) and tune the parameters described in *Parameter Configuration*.

---

### Parameter Configuration
All tunable parameters are exposed as keyword arguments to `re_slda.run_bootstrapping` / `re_slda.run_subsampling` (and as flags on the `re-slda` CLI). Pass them at the call site — no need to edit `pipeline.py`.

| Parameter | Pipeline | Effect |
|-----------|----------|--------|
| `iters` | both | Number of resampling iterations. More iterations → more stable VIPs, longer runtime. |
| `cv_folds` / `n_folds_cv` | both | Inner CV folds used to pick `lambda`. |
| `predictor_subset` | both | Pool of candidate features per iteration (default 80% of columns). |
| `subspace_size` | bootstrapping | Size of each randomly sampled predictor subspace. |
| `target_unique_prob` | bootstrapping | Target unique-sample coverage; controls bootstrap scale. |
| `n_subspaces` | subsampling | Number of subspaces per train/test split. |
| `test_ratio` | subsampling | Hold-out fraction for each split. |
| `base_seed` | both | Random seed for reproducibility. |
| `out_prefix`, `save_dir` | both | If set, also write a timestamped CSV to disk. |
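
Since some parameters apply to only one pipeline, it can help to keep a single settings dictionary and filter it per call. The sketch below encodes the "Pipeline" column of the table; `PIPELINE_PARAMS` and `kwargs_for` are hypothetical helpers, and the CV-fold parameter is left out because the table lists two names for it.

```python
# Which keyword arguments each pipeline accepts, per the table above.
SHARED = {"iters", "predictor_subset", "base_seed", "out_prefix", "save_dir"}
PIPELINE_PARAMS = {
    "bootstrapping": SHARED | {"subspace_size", "target_unique_prob"},
    "subsampling": SHARED | {"n_subspaces", "test_ratio"},
}

def kwargs_for(pipeline, settings):
    """Keep only the settings that the given pipeline accepts."""
    allowed = PIPELINE_PARAMS[pipeline]
    return {key: value for key, value in settings.items() if key in allowed}

settings = {
    "iters": 200, "base_seed": 42,
    "subspace_size": 5, "target_unique_prob": 0.8,   # bootstrapping only
    "n_subspaces": 5, "test_ratio": 0.20,            # subsampling only
}
print(kwargs_for("bootstrapping", settings))
print(kwargs_for("subsampling", settings))
# Intended use (not executed here):
# results = re_slda.run_bootstrapping(X, Y, **kwargs_for("bootstrapping", settings))
```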

---

### Setup Dataset
A sample dataset is already provided in [`datasets/`](datasets/). To run on your own data:

- **From a notebook:** load any DataFrame / NumPy array and pass it directly — `re_slda.run_bootstrapping(X, Y)` accepts both.
- **From the CLI:** point `re-slda` at your files with `--x path/to/X.csv --y path/to/Y.csv`.

#### **Dataset Requirements:**
- Files must be in `.csv` format
- **Y (response) dataset**
    - Single column, no header
    - Contains **ordinal labels** only
- **X (feature) dataset**
    - Rows represent samples
    - Columns represent features
    - First row must contain feature (variable) names
> **Important:** The number of rows in X must match the number of entries in Y.
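
Before a long run, it may be worth checking these requirements up front. The following is a sketch using only the standard library; `check_inputs` is a hypothetical helper, not part of `re_slda`, and the in-memory CSV text stands in for real files.

```python
import csv
import io

def check_inputs(x_file, y_file):
    """Validate the X/Y layout described above; return (n_samples, n_features)."""
    x_rows = [row for row in csv.reader(x_file) if row]
    y_rows = [row for row in csv.reader(y_file) if row]
    header, samples = x_rows[0], x_rows[1:]   # first X row holds feature names
    if any(len(row) != 1 for row in y_rows):
        raise ValueError("Y must have exactly one column")
    if len(samples) != len(y_rows):
        raise ValueError(f"X has {len(samples)} samples but Y has {len(y_rows)} labels")
    if any(len(row) != len(header) for row in samples):
        raise ValueError("every X row must have one value per feature name")
    return len(samples), len(header)

x_csv = io.StringIO("V4391,V708\n0.1,0.2\n0.3,0.4\n0.5,0.6\n")
y_csv = io.StringIO("1\n2\n3\n")
shape = check_inputs(x_csv, y_csv)
print(shape)  # (3, 2)
```

With real files, pass handles opened with `open(path, newline="")` in place of the `io.StringIO` objects.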

---

## **Output**

When run from the command line, or from Python with `out_prefix` set, either pipeline writes one CSV file per run.

- Files are saved to `output/` by default; pass `save_dir` (`--save-dir` on the CLI) to write elsewhere
- The **filename prefix** is set with `out_prefix` (`--out-prefix` on the CLI)

### Output Naming
Output files are timestamped so that successive runs remain distinguishable and do not overwrite each other. The filename format is:
```
<prefix>_<pipeline>_<mmddHHMM>.csv
```
Example:
```
BS_Glios_group_bootstrapping_01251445.csv
```
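
The naming scheme maps directly onto a `strftime` pattern. The sketch below reproduces the example filename; `output_path` is a hypothetical helper written for illustration, not the package's own function.

```python
from datetime import datetime
from pathlib import Path

def output_path(prefix, pipeline, save_dir="output", now=None):
    """Build <save_dir>/<prefix>_<pipeline>_<mmddHHMM>.csv."""
    stamp = (now or datetime.now()).strftime("%m%d%H%M")  # month, day, hour, minute
    return Path(save_dir) / f"{prefix}_{pipeline}_{stamp}.csv"

example = output_path("BS_Glios_group", "bootstrapping",
                      now=datetime(2026, 1, 25, 14, 45))
print(example.name)  # BS_Glios_group_bootstrapping_01251445.csv
```

Note that the stamp has no year or seconds, so two runs with the same prefix started within the same minute would map to the same filename.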

### Notes on Interpretation

**Note:** Each row's `Selected_Variables` entry is the set of features chosen on that iteration. It varies from iteration to iteration, and between executions with different seeds, because of resampling. This variation is the *signal* the framework exploits; aggregate across iterations to obtain the VIP.

**Important:** Performance metrics are computed on held-out data and may vary depending on the random seed and parameter configuration. Always report summaries (median, IQR) across iterations rather than a single number.
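
A median and IQR summary needs nothing beyond the standard library. The MAE values below are made up for illustration, and `summarize` is a hypothetical helper; with a results DataFrame you would pass the `MAE` column as a list.

```python
import statistics

def summarize(values):
    """Median and interquartile range of a per-iteration metric."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1}

# Made-up MAE values from seven resampling iterations.
mae_per_iteration = [0.40, 0.55, 0.50, 0.45, 0.60, 0.35, 0.50]
summary = summarize(mae_per_iteration)
print(summary)
```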

---

## Reference

> Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011).
> Sparse Discriminant Analysis. Technometrics, 53(4), 406-413.

---
