Metadata-Version: 2.4
Name: ms-toolkit-nrel
Version: 0.1.0
Summary: Tools for mass spectrometry data analysis
License: Apache-2.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: joblib
Requires-Dist: gensim
Requires-Dist: scikit-learn
Dynamic: description
Dynamic: description-content-type
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# ms-toolkit

Tools for mass spectrometry (MS) library searching and model training.

This library provides a pipeline for vectorizing spectra, training Word2Vec
models, preselecting candidates with KMeans or Gaussian mixture model (GMM)
clustering, and searching by weighted cosine or embedding similarity. Portions
of the code are adapted from the [Spec2Vec](https://github.com/iomega/spec2vec)
project.
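
To illustrate what a weighted cosine search score can look like, here is a minimal pure-Python sketch. It is not the code in `similarity.py`: the function name, the nominal-mass binning, and the default exponents are assumptions chosen for illustration. Each peak is weighted as `(m/z ** mz_power) * (intensity ** intensity_power)` before the cosine is taken.

```python
import math


def weighted_cosine(spec_a, spec_b, mz_power=1.0, intensity_power=0.5):
    """Weighted cosine similarity between two spectra.

    Spectra are iterables of (m/z, intensity) pairs. This helper and its
    default exponents are illustrative assumptions, not the toolkit's API.
    """

    def to_weights(spectrum):
        # Sum intensities that fall into the same nominal (integer) m/z bin.
        binned = {}
        for mz, intensity in spectrum:
            key = int(round(mz))
            binned[key] = binned.get(key, 0.0) + intensity
        # Apply the m/z and intensity weighting to each bin.
        return {
            key: (key ** mz_power) * (total ** intensity_power)
            for key, total in binned.items()
        }

    wa, wb = to_weights(spec_a), to_weights(spec_b)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(w * wb[key] for key, w in wa.items() if key in wb)
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two identical spectra score 1.0 and spectra with no shared m/z bins score 0.0. Raising `mz_power` gives more influence to high-mass peaks, while an `intensity_power` below 1 dampens the dominance of the base peak.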

## Features

- Parse MS library text files with optional progress UI
- Create `SpectrumDocument` objects for Word2Vec training
- Train and load Word2Vec models (`w2v.py`)
- Vectorize spectra and perform similarity search (`preprocessing.py`, `similarity.py`)
- Preselect candidates using KMeans or Gaussian Mixture Models (`preselector.py`)
- High-level `MSToolkit` facade wrapping the workflow (`api.py`)
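
For readers unfamiliar with the Spec2Vec approach, the sketch below shows how a spectrum can be turned into a list of "words" suitable for Word2Vec training. It follows the `peak@<m/z>` word convention used by Spec2Vec, but `spectrum_to_words` and its parameters are hypothetical; the toolkit's own `SpectrumDocument` class may differ.

```python
def spectrum_to_words(peaks, decimals=0, min_intensity=0.0):
    """Convert (m/z, intensity) pairs into Spec2Vec-style peak words.

    Hypothetical helper for illustration only; it is not part of ms-toolkit.
    Peaks at or below `min_intensity` are dropped.
    """
    words = []
    for mz, intensity in peaks:
        if intensity > min_intensity:
            # Round m/z to a fixed number of decimals so that similar
            # peaks across spectra map to the same vocabulary entry.
            words.append(f"peak@{mz:.{decimals}f}")
    return words


document = spectrum_to_words([(100, 0.5), (150.04, 1.0), (200, 0.0)], decimals=2)
print(document)
```

A collection of such word lists is the kind of input that `gensim.models.Word2Vec` accepts through its `sentences` argument. Fewer decimals give a smaller vocabulary with more shared words between spectra.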

## Installation

From the repository root, install with pip, which builds the package using the included `setup.py`:

```bash
pip install .
```

The required dependencies (`numpy`, `joblib`, `gensim`, and `scikit-learn`) are
installed automatically. The optional UI features additionally require
`customtkinter` or `PySide6`, which must be installed separately.

## Usage Example

```python
from ms_toolkit.api import MSToolkit

# Initialize toolkit
ms = MSToolkit(library_txt="NIST14.txt", cache_json="library.json")

# Load library (shows progress UI by default)
ms.load_library()

# Vectorize and train models
ms.vectorize_library()
ms.train_preselector()
ms.train_w2v()

# Search using a query spectrum
query = [(100, 0.5), (150, 1.0), (200, 0.8)]
results = ms.search_w2v(query)
for compound, score in results:
    print(compound, score)
```
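
The query above appears to be a list of (m/z, intensity) pairs with intensities scaled so that the largest peak is 1.0. If your raw data uses absolute counts, a small helper like the following can rescale it first. This is a sketch under that assumption; `normalize_to_base_peak` is a hypothetical name, not part of the toolkit, and the toolkit may well perform its own normalization.

```python
def normalize_to_base_peak(peaks):
    """Scale intensities so the most intense (base) peak equals 1.0.

    Hypothetical helper for illustration; not part of ms-toolkit.
    """
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        # No positive signal: return zero intensities rather than divide by zero.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, intensity / base) for mz, intensity in peaks]


raw = [(100, 500), (150, 1000), (200, 800)]
query = normalize_to_base_peak(raw)
print(query)
```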

## License

This project is licensed under the Apache License 2.0. See `LICENSE` for details.
The `NOTICE` file explains that some code derives from Spec2Vec, which is also
Apache 2.0 licensed.

