Metadata-Version: 2.4
Name: maldi-tof-classifier
Version: 0.3.2
Summary: The maldi_tof_classifier package offers a CLI and a Python 3 API for machine-learning-based classification of MALDI-TOF spectra as measured by a Shimadzu 8030 MALDI-TOF mass spectrometer.
License: MIT License
        
        Copyright (c) 2026 Oliver Felix Matthias Klein
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: imbalanced-learn>=0.0
Requires-Dist: joblib>=1.5.3
Requires-Dist: keras>=3.12.1
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: numpy>=2.2.6
Requires-Dist: pandas>=2.3.3
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pyopls>=20.3.post1
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: rpy2>=3.6.7
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: scipy>=1.15.3
Requires-Dist: seaborn>=0.13.2
Requires-Dist: tensorflow>=2.21.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: typer>=0.24.1
Requires-Dist: xgboost>=3.1.3
Provides-Extra: r
Requires-Dist: rpy2>=3.6.7; extra == "r"
Dynamic: license-file

# maldi-tof-classifier

Version: 0.3.2

The **maldi-tof-classifier** package provides functionality for:
- Reading MALDI-TOF spectra
- Preprocessing spectral data
- Machine-learning-based classification

It is designed for spectra generated by a Shimadzu 8030 MALDI-TOF mass spectrometer.

Source code:
https://github.com/ofmk94/maldi-tof-classifier

License: MIT

---

## Installation

Python 3.10 or later is required.

Install the package from PyPI:

    pip install maldi-tof-classifier

Additionally, the R package `MALDIquant` must be installed on the system. It can usually be installed from within R with:

    install.packages("MALDIquant")

---

## Overview

This README consists of two parts:

1. CLI tool usage
2. Python API and typical workflows

The tool is distributed as a Python package, but it is first and foremost a CLI tool.

It is strongly recommended to work with peak data and the default `PeakExtractor`.

Example data for download is available under:
https://github.com/ofmk94/maldi-tof-classifier-data

---

# Part 1 - CLI Tool Usage

## 1.1 Required directory structure

The CLI tool requires the following directory structure:

    data_train/
        A/
            sample1.csv
            sample2.csv
        B/
            sample3.csv

    data_predict/
        unknown1.csv
        unknown2.csv

    cli_files/
        config.yaml

- `data_train`
  Contains subdirectories for the classes to be learned, for example `A`, `B`, `C`.
  The subdirectories may contain either `.txt` files with spectra or `.csv` files with peak data as produced by a Shimadzu 8030 MALDI-TOF mass spectrometer.
  This directory is for training only.

- `data_predict`
  Contains files of the same type, either `.txt` full spectra or `.csv` peak data, to be classified.
  This directory is for prediction.

- `cli_files`
  Contains the files necessary for the CLI setup and the files with results.
  It must contain `config.yaml`.

Training and prediction must use the same file type and the same extractor.
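
As a sketch, the layout above can be scaffolded from Python (the directory and class names simply mirror the example; `config.yaml` may start out empty, since all settings are optional):

```python
from pathlib import Path

# Scaffold the directory layout expected by the CLI tool.
for d in ["data_train/A", "data_train/B", "data_predict", "cli_files"]:
    Path(d).mkdir(parents=True, exist_ok=True)

# config.yaml may be empty: every setting has a default value.
Path("cli_files/config.yaml").touch()
```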

---

## 1.2 Output files

The following files are created inside `cli_files` during usage:

    cli_files/pipeline.joblib
        Created once the classification pipeline has been trained.

    cli_files/training_performance.csv
        Test-set classification performance of the pipeline;
        includes accuracy, precision, recall, F1-score, and the confusion matrix.

    cli_files/predictions.csv
        Predictions on the data in data_predict.

---

## 1.3 CLI commands

There are two commands available:

Train the model:

    mtc train

Predict on new data:

    mtc predict

Both commands need to be executed in an environment with the directories described above.

---

## 1.4 Configuration via config.yaml

The training setup can be fully defined through `cli_files/config.yaml`.

All parameters are optional; every setting has a default value.

---

## 1.5 Extractor settings

### 1.5.1 extractor_cls

Type of extractor.

Options:
- `"PeakExtractor"` for working with `.csv` files containing peak data
- `"FullSpectraExtractor"` for full spectra `.txt` files

Default:

    extractor_cls: "PeakExtractor"

This setting must be consistent between `mtc train` and `mtc predict`.

### 1.5.2 extractor_params

Additional optional parameters for `PeakExtractor` or `FullSpectraExtractor`.

Default:

    extractor_params: null

These parameters are passed directly to the selected extractor constructor.

For `PeakExtractor`, the main parameters are:
- `snr_thresh` default `3.0`
- `rel_shift_tolerance` default `0.002`
- `min_peak_freq` default `0.25`

Example:

    extractor_params:
        snr_thresh: 3.0
        rel_shift_tolerance: 0.002

For `FullSpectraExtractor`, the main parameters are:
- `use_mz_cutoff` default `false`
- `mz_cutoff_mass` default `20000.0`

Example:

    extractor_params:
        use_mz_cutoff: true
        mz_cutoff_mass: 20000.0

The dataclasses for file location and file parsing are advanced options and should generally not be set by the CLI user.

---

## 1.6 Scaling and dimensionality reduction

### 1.6.1 scaler_cls

Scaling object.

Available options from `sklearn.preprocessing`:
- `"StandardScaler"`
- `"MinMaxScaler"`

Optional.

Default:

    scaler_cls: null

### 1.6.2 dim_reducer_cls

Optional dimensionality reduction.

Options:
- `"PCA"` from `sklearn.decomposition`
- `"SVD"` using `sklearn.decomposition.TruncatedSVD`

Default:

    dim_reducer_cls: null

### 1.6.3 n_components

Number of components to use in optional dimensionality reduction.

Type:
- `int`

Default:

    n_components: 20

---

## 1.7 Classifier settings

### 1.7.1 classifier_cls

Classification model.

Available options:
- sklearn models:
  `LogisticRegression`, `LinearDiscriminantAnalysis`, `QuadraticDiscriminantAnalysis`, `PLSRegression` as `PLS-DA`, `SVC`, `RandomForestClassifier`
- xgboost model:
  `XGBClassifier`
- special option:
  `OPLS-DA`, implemented via `LogisticRegression`

Default:

    classifier_cls: "RandomForestClassifier"

### 1.7.2 classifier_params

Optional parameters for the classifier.

Default:

    classifier_params: null

These parameters are passed directly to the selected classifier constructor. Only parameters valid for the selected classifier should be used here.

Examples:

    classifier_params:
        n_estimators: 200
        max_depth: 10

or:

    classifier_params:
        C: 1.0
        max_iter: 1000

Refer to the documentation of `scikit-learn` or `xgboost` for the full list of supported constructor arguments.

---

## 1.8 Train/test split and balancing

### 1.8.1 test_size

Fraction of the data held out as the test set.

Type:
- `float`

Default:

    test_size: 0.2

### 1.8.2 oversample

Whether simple random oversampling should be performed to balance the training classes.

Type:
- `bool`

Default:

    oversample: true
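
Putting sections 1.5–1.8 together, a complete `config.yaml` can also be generated programmatically. A minimal sketch using PyYAML; the values are the documented defaults except for the illustrative scaler, reducer, and classifier parameters:

```python
import yaml
from pathlib import Path

# Combine the options from sections 1.5-1.8 into one config.yaml.
config = {
    "extractor_cls": "PeakExtractor",
    "extractor_params": {"snr_thresh": 3.0, "rel_shift_tolerance": 0.002},
    "scaler_cls": "StandardScaler",
    "dim_reducer_cls": "PCA",
    "n_components": 20,
    "classifier_cls": "RandomForestClassifier",
    "classifier_params": {"n_estimators": 200, "max_depth": 10},
    "test_size": 0.2,
    "oversample": True,
}

Path("cli_files").mkdir(exist_ok=True)
Path("cli_files/config.yaml").write_text(yaml.safe_dump(config, sort_keys=False))
```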

---

## 1.9 Notes

- It is strongly recommended to work with peak data and the default `PeakExtractor`.
- Training and prediction must use the same extractor and the same data format.
- The advanced file parsing options are usually not needed for standard CLI usage.

---

# Part 2 - Python API and typical workflows

## 2.1 Overview

The **maldi-tof-classifier** package provides functionality for:
- Reading MALDI-TOF spectra
- Preprocessing spectral data
- Machine-learning-based classification

Source code:
https://github.com/ofmk94/maldi-tof-classifier

Example data:
https://github.com/ofmk94/maldi-tof-classifier-data

Docstrings contain more detailed information on parameters and behavior.
This section illustrates typical usage.

The directory structure is the same as in Part 1.

---

## 2.2 Step 1 - Loading and preprocessing data

Recommended: peak data using `PeakExtractor`

    from maldi_tof_classifier.extractors import PeakExtractor
    from pathlib import Path
    from sklearn.model_selection import train_test_split

    TRAIN_DIR = Path(".") / "data_train"

    extractor = PeakExtractor(snr_thresh=3.0)

    peaks_dfs, class_labels = extractor.extract_train_data(TRAIN_DIR)

    X_train, X_test, y_train, y_test = train_test_split(
        peaks_dfs, class_labels, test_size=0.2
    )

    X_train = extractor.transform_train_data(X_train)
    X_test = extractor.transform_predict_data(X_test)

Alternative: full spectra using `FullSpectraExtractor`

    from maldi_tof_classifier.extractors import FullSpectraExtractor

    extractor = FullSpectraExtractor(use_mz_cutoff=True, mz_cutoff_mass=20000.0)

    spectra, class_labels, spots = extractor.extract_train_data(TRAIN_DIR)

    X_train, X_test, y_train, y_test = train_test_split(
        spectra, class_labels, test_size=0.2
    )

---

## 2.3 Step 2 - Label encoding

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()

    y_train = le.fit_transform(y_train)
    y_test = le.transform(y_test)
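
After prediction, the same encoder can map integer class indices back to the original class names. A small self-contained sketch of this scikit-learn behavior:

```python
from sklearn.preprocessing import LabelEncoder

# Fit on string class labels; classes_ is sorted alphabetically.
le = LabelEncoder()
y = le.fit_transform(["A", "B", "A", "C"])

# Map integer predictions back to class names.
labels = le.inverse_transform([0, 2, 1])
print(list(labels))  # ['A', 'C', 'B']
```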

---

## 2.4 Step 3 - Handle class imbalance (optional)

    from imblearn.over_sampling import RandomOverSampler

    ros = RandomOverSampler()
    X_train, y_train = ros.fit_resample(X_train, y_train)

---

## 2.5 Step 4 - Scaling (optional)

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

---

## 2.6 Step 5 - Dimensionality reduction (optional)

    from sklearn.decomposition import PCA

    pca = PCA(n_components=20)

    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)

---

## 2.7 Step 6 - Classification

### RandomForestClassifier

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier()

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

---

### XGBClassifier

    from xgboost import XGBClassifier

    classifier = XGBClassifier()

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

---

## 2.8 Neural network models

Available in `maldi_tof_classifier.nn`:

- `CNN1DClassifier`
- `LSTMClassifier`

For neural networks, a train/validation/test split and one-hot encoding are typically used.

### Split

    from sklearn.model_selection import train_test_split

    X_train, X_val_test, y_train, y_val_test = train_test_split(
        spectra, class_labels, test_size=0.3
    )

    X_val, X_test, y_val, y_test = train_test_split(
        X_val_test, y_val_test, test_size=0.333
    )

### One-hot encoding

The labels must already be integer-encoded (see Step 2 in section 2.3), since `to_categorical` expects integer class indices:

    from tensorflow.keras.utils import to_categorical

    n_classes = y_train.max() + 1

    y_train = to_categorical(y_train, n_classes)
    y_val = to_categorical(y_val, n_classes)
    y_test = to_categorical(y_test, n_classes)

### Example

    from maldi_tof_classifier.nn import CNN1DClassifier

    model = CNN1DClassifier(X_train, y_train)

    model.fit(
        X_train,
        y_train,
        epochs=20,
        validation_data=(X_val, y_val)
    )

    y_pred = model.predict(X_test)

---

## 2.9 Pipeline API

Steps 4–6 (sections 2.5–2.7) can be combined into a pipeline:

    from maldi_tof_classifier.core import generate_pipeline

### Components

- Scaler (optional)
- Dimensionality Reduction (optional)
- Classifier (required)

### Parameters

- `classifier_cls`
  Instantiable class of the classifier.

- `classifier_params`
  Parameters passed to the classifier.

- `scaler_cls`
  Optional scaler class.

- `dim_reducer_cls`
  Optional dimensionality reduction class.

- `n_components`
  Number of components for dimensionality reduction.

### Example

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    from maldi_tof_classifier.core import generate_pipeline

    pipeline = generate_pipeline(
        classifier_cls=RandomForestClassifier,
        classifier_params={"n_estimators": 100},
        scaler_cls=StandardScaler,
        dim_reducer_cls=PCA,
        n_components=20
    )

    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test)
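
The metrics written to `training_performance.csv` (section 1.2) can be reproduced with `sklearn.metrics`. A sketch with hypothetical label arrays standing in for `y_test` and `y_pred` from the pipeline:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels standing in for y_test / y_pred from the pipeline.
y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat, average="macro")
rec = recall_score(y_true, y_hat, average="macro")
f1 = f1_score(y_true, y_hat, average="macro")
cm = confusion_matrix(y_true, y_hat)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
print(cm)
```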

---

## Author

Oliver Klein \
oliver.klein@stud.hcw.ac.at \
oliverfmklein@gmail.com

---

## License

This project is licensed under the MIT License.

Copyright (c) 2026 Oliver Felix Matthias Klein (GitHub username: ofmk94)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


## Disclaimer

This README was written based on the original draft and revised into English Markdown format with assistance from ChatGPT (Version 5.3).

No liability is assumed for the provided software or for the contents of this README.

---

_Last edited: April 16th, 2026_
