Metadata-Version: 2.4
Name: mztabwriter
Version: 0.1.0
Summary: Minimal library for writing mzTab 1.0 proteomics files
Author: gluck
Author-email: glucksistemi@gmail.com
Requires-Python: >=3.10
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: pandas
Requires-Dist: pandas (>=1.3) ; extra == "pandas"
Description-Content-Type: text/markdown

# mztabwriter

A minimal, dependency-free Python library for writing **mzTab 1.0** proteomics files.

[mzTab specification (1.0 Proteomics Release)](https://github.com/HUPO-PSI/mzTab/tree/master/specification_document-releases/1_0-Proteomics-Release) · [Format examples](https://github.com/HUPO-PSI/mzTab/tree/master/examples/1_0-Proteomics-Release) · [Russian README](README_RU.md)

---

## Features

- Generates mzTab 1.0 files (proteomics mode)
- No mandatory runtime dependencies — pure Python 3.10+
- Supports both `Complete` and `Summary` modes
- Supports both `Quantification` and `Identification` types
- Handles label-free, iTRAQ, and SILAC experiments
- Full metadata coverage: instruments, contacts, publications, samples, URIs
- Optional **pandas** integration for bulk loading from DataFrames
- `to_string()` and `to_file()` output methods

---

## Installation

```bash
pip install mztabwriter
```

With optional pandas support:

```bash
pip install "mztabwriter[pandas]"
```

---

## mzTab 1.0 File Structure (Proteomics)

An mzTab file is a plain-text file made up of sections. Every line consists of tab-separated fields, and the first field is a three-letter prefix identifying the row type:

| Prefix | Section | Description |
|--------|---------|-------------|
| `MTD` | Metadata | Experiment description, instruments, software, ms_runs, assays, modifications |
| `PRH` | Protein Header | Column names for the protein table |
| `PRT` | Protein | One row per identified protein |
| `PSH` | PSM Header | Column names for the PSM table |
| `PSM` | PSM | One row per peptide-spectrum match |
| `COM` | Comment | Ignored by parsers, human-readable notes |
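
To illustrate how the prefixes partition a file, here is a minimal sketch that counts rows per type using only the standard library. The helper name `count_row_types` and the sample text are made up for this example; neither is part of mztabwriter.

```python
from collections import Counter


def count_row_types(text: str) -> Counter:
    """Count mzTab lines by their three-letter row-type prefix."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between sections
        prefix = line.split("\t", 1)[0]
        counts[prefix] += 1
    return counts


# Illustrative fragment only; a valid file needs more metadata than this.
sample = "\n".join([
    "COM\tillustrative fragment, not a complete file",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tmzTab-mode\tSummary",
    "",
    "PRH\taccession\tdescription",
    "PRT\tP63017\tHeat shock cognate 71 kDa protein",
])

counts = count_row_types(sample)
print(dict(counts))
```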

### MTD — Metadata (required)

Key metadata fields:

| Key | Description | Example |
|-----|-------------|---------|
| `mzTab-version` | Format version | `1.0.0` |
| `mzTab-mode` | `Complete` or `Summary` | `Complete` |
| `mzTab-type` | `Quantification` or `Identification` | `Quantification` |
| `description` | Free-text experiment description | |
| `ms_run[N]-location` | URI of raw data file | `file:///data/run1.mzML` |
| `assay[N]-quantification_reagent` | CV param of label/reagent | `[MS, MS:1002038, unlabeled sample, ]` |
| `assay[N]-ms_run_ref` | Which ms_run this assay uses | `ms_run[1]` |
| `study_variable[N]-assay_refs` | Assays grouped by condition | `assay[1],assay[2]` |
| `study_variable[N]-description` | Condition description | `heat shock control` |
| `fixed_mod[N]` | Fixed search modification (UNIMOD CV) | `[UNIMOD, UNIMOD:4, Carbamidomethyl, ]` |
| `variable_mod[N]` | Variable search modification | `[UNIMOD, UNIMOD:35, Oxidation, ]` |
| `protein_search_engine_score[N]` | Score type for proteins | `[MS, MS:1001171, Mascot:score, ]` |
| `psm_search_engine_score[N]` | Score type for PSMs | `[MS, MS:1001171, Mascot:score, ]` |
| `quantification_method` | Quantification strategy | `[MS, MS:1001835, SILAC, ]` |

Optional metadata fields:

| Key | Description |
|-----|-------------|
| `title` | Experiment title |
| `mzTab-ID` | Repository identifier |
| `instrument[N]-name/source/analyzer/detector` | MS instrument details |
| `software[N]` | Analysis software |
| `publication[N]` | `pubmed:XXXXXXX` or `doi:...` |
| `contact[N]-name/affiliation/email` | Contact person |
| `uri[N]` | Link to data repository |
| `sample[N]-species/cell_type/disease/tissue` | Sample description |

### PRT — Protein rows

Each protein row contains:

| Column | Type | Description |
|--------|------|-------------|
| `accession` | str | Database identifier (e.g. `P63017`) |
| `description` | str\|null | Protein description |
| `taxid` | int\|null | NCBI Taxonomy ID |
| `species` | str\|null | Species name |
| `database` | str\|null | Database name (e.g. `UniProtKB`) |
| `database_version` | str\|null | Database version |
| `search_engine` | CvParam\|null | Search engine |
| `best_search_engine_score[1]` | float\|null | Best score across all runs |
| `search_engine_score[1]_ms_run[N]` | float\|null | Score per run |
| `num_psms_ms_run[N]` | int\|null | Number of PSMs per run |
| `num_peptides_distinct_ms_run[N]` | int\|null | Distinct peptides per run |
| `num_peptides_unique_ms_run[N]` | int\|null | Unique peptides per run |
| `ambiguity_members` | str\|null | Comma-separated accessions of ambiguity group |
| `modifications` | str\|null | Detected modifications (e.g. `12-UNIMOD:35`) |
| `protein_coverage` | float\|null | Sequence coverage fraction (0.0–1.0) |
| `protein_abundance_assay[N]` | float\|null | Abundance per assay |
| `protein_abundance_study_variable[N]` | float\|null | Mean abundance per condition |
| `protein_abundance_stdev_study_variable[N]` | float\|null | Std deviation per condition |
| `protein_abundance_std_error_study_variable[N]` | float\|null | Std error per condition |
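
The three `study_variable` columns summarise the per-assay abundances of one condition. A minimal sketch of that arithmetic follows, assuming the sample standard deviation (n - 1 in the denominator) and the standard error defined as stdev / sqrt(n). Those conventions are an assumption made for this example; the library takes the values as given and does not compute them.

```python
from math import sqrt
from statistics import mean, stdev

# Abundances of one protein in the three assays of a single condition
# (the same illustrative numbers as assay[4]..assay[6] in the examples).
assay_abundances = [267.0, 234.4, 271.0]

sv_mean = mean(assay_abundances)
sv_stdev = stdev(assay_abundances)  # sample standard deviation (n - 1)
sv_std_error = sv_stdev / sqrt(len(assay_abundances))

print(round(sv_mean, 1), round(sv_stdev, 1), round(sv_std_error, 1))
```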

### PSM — Peptide-Spectrum Match rows

| Column | Type | Description |
|--------|------|-------------|
| `sequence` | str | Peptide amino acid sequence |
| `PSM_ID` | int | Unique PSM identifier within the file |
| `accession` | str | Protein accession |
| `unique` | 0\|1\|null | 1 if peptide is unique to this protein |
| `database` | str\|null | Database name |
| `database_version` | str\|null | Database version |
| `search_engine` | CvParam\|null | Search engine |
| `search_engine_score[1]` | float\|null | Score |
| `modifications` | str\|null | Modifications (e.g. `0-UNIMOD:214, 9-UNIMOD:4`) |
| `spectra_ref` | str\|null | Spectrum reference, e.g. `ms_run[1]:scan=1296` |
| `retention_time` | float\|null | Retention time in seconds |
| `charge` | int\|null | Precursor charge state |
| `exp_mass_to_charge` | float\|null | Experimental m/z |
| `calc_mass_to_charge` | float\|null | Theoretical m/z |
| `pre` | str\|null | Amino acid before the peptide N-terminus (`-` = protein N-term) |
| `post` | str\|null | Amino acid after the peptide C-terminus |
| `start` | int\|null | 1-based start position in protein |
| `end` | int\|null | 1-based end position in protein |

---

## API Reference

### `CvParam(cv_label, accession, name, value="")`

A Controlled Vocabulary parameter — the basic annotation unit in mzTab.

```python
from mztabwriter import CvParam

CvParam("MS", "MS:1001207", "Mascot")
# → [MS, MS:1001207, Mascot, ]

CvParam("UNIMOD", "UNIMOD:4", "Carbamidomethyl")
# → [UNIMOD, UNIMOD:4, Carbamidomethyl, ]

CvParam("PRIDE", "PRIDE:0000131", "Instrument model", "Micromass Q-TOF I")
# → [PRIDE, PRIDE:0000131, Instrument model, Micromass Q-TOF I]
```
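
The bracketed form shown in the comments is four comma-separated fields: CV label, accession, name, and value. As a rough, library-independent illustration of that text format, the sketch below renders and parses it. `CvParamSketch` and `parse_cv_param` are hypothetical stand-ins, not the library's `CvParam`, and the parser assumes no field contains a comma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvParamSketch:
    """Stand-in for illustration; not the library's CvParam class."""
    cv_label: str
    accession: str
    name: str
    value: str = ""

    def render(self) -> str:
        # An empty value leaves a trailing ", " before the closing bracket.
        return f"[{self.cv_label}, {self.accession}, {self.name}, {self.value}]"


def parse_cv_param(text: str) -> CvParamSketch:
    """Inverse of render(); assumes no commas inside the individual fields."""
    fields = [field.strip() for field in text.strip()[1:-1].split(",")]
    return CvParamSketch(*fields)


mascot = CvParamSketch("MS", "MS:1001207", "Mascot")
rendered = mascot.render()
print(rendered)
print(parse_cv_param(rendered) == mascot)
```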

### `Modification(position, cv_accession)`

A peptide/protein modification at a specific position.

```python
from mztabwriter import Modification

Modification(0, "UNIMOD:214")    # → 0-UNIMOD:214
Modification(12, "UNIMOD:35")   # → 12-UNIMOD:35
Modification(None, "UNIMOD:4")  # → -UNIMOD:4
```
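
To show how the position part of these strings relates to a peptide, the following sketch parses a `modifications` cell and names each modified site. It assumes the mzTab convention that position 0 denotes the N-terminus and length + 1 the C-terminus. The helper names and the peptide are made up for the example and are not part of mztabwriter.

```python
def parse_modifications(text: str) -> list[tuple[int, str]]:
    """Split comma-separated 'position-ACCESSION' entries into tuples."""
    result = []
    for entry in text.split(","):
        position, _, accession = entry.strip().partition("-")
        result.append((int(position), accession))
    return result


def describe_site(sequence: str, position: int) -> str:
    """Name the modified site: 0 is the N-terminus, len + 1 the C-terminus."""
    if position == 0:
        return "N-term"
    if position == len(sequence) + 1:
        return "C-term"
    return f"{sequence[position - 1]}{position}"


peptide = "LVNELTEFCK"  # made-up sequence with a cysteine at position 9
mods = parse_modifications("0-UNIMOD:214, 9-UNIMOD:4")
sites = [(describe_site(peptide, pos), acc) for pos, acc in mods]
print(sites)
```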

### `MzTabDocument(mode, type_, version, title, description, mztab_id)`

The main document class.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `mode` | `"Complete"` \| `"Summary"` | `"Complete"` | File mode |
| `type_` | `"Quantification"` \| `"Identification"` | `"Quantification"` | Data type |
| `version` | str | `"1.0.0"` | mzTab format version |
| `title` | str \| None | `None` | Experiment title |
| `description` | str \| None | `None` | Experiment description |
| `mztab_id` | str \| None | `None` | Repository ID |

#### Metadata methods

| Method | Returns | Description |
|--------|---------|-------------|
| `add_ms_run(location, format=None, id_format=None)` | `MsRun` | Add a raw data file reference |
| `add_sample(description, species, cell_type, disease, tissue, custom)` | `Sample` | Add sample description |
| `add_assay(ms_run, quantification_reagent, sample=None, quantification_mods=None)` | `Assay` | Add assay (run + label) |
| `add_study_variable(description, assays)` | `StudyVariable` | Group assays into a condition |
| `set_quantification_method(cv)` | `None` | Set experiment-level quantification method |
| `set_protein_quantification_unit(cv)` | `None` | Set abundance unit |
| `add_software(cv)` | `None` | Add analysis software |
| `add_publication(ref)` | `None` | Add `pubmed:XXXXXXX` or `doi:...` |
| `add_contact(name, affiliation=None, email=None)` | `None` | Add contact person |
| `add_uri(uri)` | `None` | Add data repository URI |
| `add_instrument(name, source, analyzer, detector)` | `None` | Add MS instrument description |
| `add_fixed_mod(cv, site=None, position=None)` | `SearchModification` | Add fixed search modification |
| `add_variable_mod(cv, site=None, position=None)` | `SearchModification` | Add variable search modification |
| `add_protein_search_engine_score(cv)` | `SearchEngineScore` | Register protein score type |
| `add_psm_search_engine_score(cv)` | `SearchEngineScore` | Register PSM score type |

#### Data methods

| Method | Returns | Description |
|--------|---------|-------------|
| `add_protein(accession, ...)` | `ProteinRow` | Add a protein row |
| `add_psm(sequence, psm_id, accession, ...)` | `PsmRow` | Add a PSM row |
| `add_proteins_from_dataframe(df)` | `None` | Bulk-load proteins from pandas DataFrame |
| `add_psms_from_dataframe(df)` | `None` | Bulk-load PSMs from pandas DataFrame |

#### Output methods

| Method | Returns | Description |
|--------|---------|-------------|
| `to_string()` | `str` | Return the complete mzTab document as a string |
| `to_file(path)` | `None` | Write the document to a file (UTF-8) |

---

## Examples

### Label-free quantification (2 conditions × 3 replicates)

```python
from mztabwriter import MzTabDocument, CvParam, Modification

doc = MzTabDocument(
    mode="Complete",
    type_="Quantification",
    title="LFQ heat shock experiment",
    description="Label-free quantification of heat shock proteins, 2 conditions",
)

# Raw data files
r1 = doc.add_ms_run("file:///data/ctrl_rep1.mzML")
r2 = doc.add_ms_run("file:///data/ctrl_rep2.mzML")
r3 = doc.add_ms_run("file:///data/ctrl_rep3.mzML")
r4 = doc.add_ms_run("file:///data/treat_rep1.mzML")
r5 = doc.add_ms_run("file:///data/treat_rep2.mzML")
r6 = doc.add_ms_run("file:///data/treat_rep3.mzML")

reagent = CvParam("MS", "MS:1002038", "unlabeled sample")
a1 = doc.add_assay(r1, reagent)
a2 = doc.add_assay(r2, reagent)
a3 = doc.add_assay(r3, reagent)
a4 = doc.add_assay(r4, reagent)
a5 = doc.add_assay(r5, reagent)
a6 = doc.add_assay(r6, reagent)

doc.add_study_variable("control", [a1, a2, a3])
doc.add_study_variable("heat shock treatment", [a4, a5, a6])

# Scores and modifications
doc.add_protein_search_engine_score(CvParam("MS", "MS:1001171", "Mascot:score"))
doc.add_psm_search_engine_score(CvParam("MS", "MS:1001171", "Mascot:score"))
doc.add_fixed_mod(CvParam("UNIMOD", "UNIMOD:4", "Carbamidomethyl"), site="C", position="Anywhere")
doc.add_variable_mod(CvParam("UNIMOD", "UNIMOD:35", "Oxidation"), site="M", position="Anywhere")
doc.set_quantification_method(CvParam("MS", "MS:1001834", "LC-MS label-free quantitation analysis"))
doc.set_protein_quantification_unit(CvParam("PRIDE", "PRIDE:0000393", "Relative quantification unit"))

# Proteins
doc.add_protein(
    accession="P63017",
    description="Heat shock cognate 71 kDa protein",
    taxid=10090,
    species="Mus musculus",
    database="UniProtKB",
    database_version="2013_08",
    search_engine=CvParam("MS", "MS:1001207", "Mascot"),
    best_search_engine_score=46.0,
    search_engine_scores={"ms_run[1]": 46, "ms_run[2]": 26, "ms_run[3]": 36,
                          "ms_run[4]": -3, "ms_run[5]": -1, "ms_run[6]": None},
    num_psms={"ms_run[1]": 1, "ms_run[2]": 1, "ms_run[3]": 1,
              "ms_run[4]": 1, "ms_run[5]": 1, "ms_run[6]": 0},
    num_peptides_distinct={"ms_run[1]": 1, "ms_run[2]": 1, "ms_run[3]": 1,
                           "ms_run[4]": 1, "ms_run[5]": 1, "ms_run[6]": 0},
    num_peptides_unique={"ms_run[1]": 1, "ms_run[2]": 1, "ms_run[3]": 1,
                         "ms_run[4]": 1, "ms_run[5]": 1, "ms_run[6]": 0},
    protein_coverage=0.34,
    protein_abundance_assay={
        "assay[1]": 34.3, "assay[2]": 40.4, "assay[3]": 41.1,
        "assay[4]": 267.0, "assay[5]": 234.4, "assay[6]": 271.0,
    },
    protein_abundance_study_variable={"study_variable[1]": 38.6, "study_variable[2]": 257.5},
    protein_abundance_stdev_study_variable={"study_variable[1]": 3.8, "study_variable[2]": 20.1},
    protein_abundance_std_error_study_variable={"study_variable[1]": 2.2, "study_variable[2]": 11.6},
)

# PSMs
doc.add_psm(
    sequence="QTQTFTTYSDNQPGVL",
    psm_id=1,
    accession="P63017",
    unique=1,
    database="UniProtKB",
    database_version="2013_08",
    search_engine=CvParam("MS", "MS:1001207", "Mascot"),
    search_engine_score=46.0,
    modifications=[Modification(0, "UNIMOD:214")],
    spectra_ref="ms_run[1]:scan=1296",
    retention_time=1336.62,
    charge=3,
    exp_mass_to_charge=600.6218923,
    calc_mass_to_charge=600.6197,
    pre="K",
    post="I",
    start=424,
    end=439,
)

print(doc.to_string())
doc.to_file("lfq_experiment.mzTab")
```

### iTRAQ quantification

```python
from mztabwriter import MzTabDocument, CvParam

doc = MzTabDocument(mode="Complete", type_="Quantification")

run = doc.add_ms_run("file:///data/itraq_run1.mzML")

a1 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000114", "iTRAQ reagent 114"))
a2 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000115", "iTRAQ reagent 115"))
a3 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000116", "iTRAQ reagent 116"))
a4 = doc.add_assay(run, CvParam("PRIDE", "PRIDE:0000117", "iTRAQ reagent 117"))

doc.add_study_variable("t=0", [a1])
doc.add_study_variable("t=1", [a2])
doc.add_study_variable("t=2", [a3])
doc.add_study_variable("t=3", [a4])

doc.set_quantification_method(CvParam("PRIDE", "PRIDE:0000313", "iTRAQ"))
doc.add_fixed_mod(CvParam("UNIMOD", "UNIMOD:214", "iTRAQ4plex"), site="K", position="Anywhere")
doc.add_fixed_mod(CvParam("UNIMOD", "UNIMOD:214", "iTRAQ4plex"), site="N-term", position="Any N-term")
```

### SILAC quantification

```python
from mztabwriter import MzTabDocument, CvParam

doc = MzTabDocument(mode="Complete", type_="Quantification")

run = doc.add_ms_run("file:///data/silac.mzML")
light = CvParam("PRIDE", "PRIDE:0000326", "SILAC light")
heavy = CvParam("PRIDE", "PRIDE:0000325", "SILAC heavy")

heavy_mods = [
    CvParam("UNIMOD", "UNIMOD:267", "Label:13C(6)15N(4)"),
    CvParam("UNIMOD", "UNIMOD:259", "Label:13C(6)15N(2)"),
]
a_light = doc.add_assay(run, light)
a_heavy = doc.add_assay(run, heavy, quantification_mods=heavy_mods)

doc.add_study_variable("control", [a_light])
doc.add_study_variable("treatment", [a_heavy])
doc.set_quantification_method(CvParam("MS", "MS:1001835", "SILAC"))
```

### Loading from pandas DataFrame

```python
import pandas as pd
from mztabwriter import MzTabDocument, CvParam

doc = MzTabDocument(mode="Complete", type_="Quantification")
# ... (add ms_runs, assays, study_variables, scores first) ...

df_proteins = pd.DataFrame([
    {
        "accession": "P63017",
        "description": "Heat shock cognate 71 kDa protein",
        "taxid": 10090,
        "species": "Mus musculus",
        "database": "UniProtKB",
        "database_version": "2013_08",
        "search_engine": CvParam("MS", "MS:1001207", "Mascot"),
        "best_search_engine_score": 46.0,
        "protein_coverage": 0.34,
        "protein_abundance_assay[1]": 34.3,
        "protein_abundance_assay[2]": 266.9,
        "protein_abundance_study_variable[1]": 34.3,
        "protein_abundance_study_variable[2]": 266.9,
        "protein_abundance_stdev_study_variable[1]": 3.8,
        "protein_abundance_stdev_study_variable[2]": 20.1,
        "protein_abundance_std_error_study_variable[1]": 2.2,
        "protein_abundance_std_error_study_variable[2]": 11.6,
    },
])

doc.add_proteins_from_dataframe(df_proteins)
doc.to_file("output.mzTab")
```

---

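## File Structure Summary

In an actual file, fields are separated by single tab characters; the excerpt below aligns them with spaces for readability.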

```
MTD   mzTab-version   1.0.0
MTD   mzTab-mode      Complete
MTD   mzTab-type      Quantification
MTD   description     ...
MTD   ms_run[1]-location   file:///data/run1.mzML
MTD   assay[1]-quantification_reagent   [MS, MS:1002038, unlabeled sample, ]
MTD   assay[1]-ms_run_ref   ms_run[1]
MTD   study_variable[1]-assay_refs   assay[1],assay[2],assay[3]
MTD   study_variable[1]-description  control
...
PRH   accession   description   ...   protein_abundance_assay[1]   ...
PRT   P63017      Heat shock…   ...   34.3                         ...
...
PSH   sequence   PSM_ID   accession   ...   spectra_ref   ...
PSM   QTQTFTT…   1        P63017      ...   ms_run[1]:scan=1296   ...
```
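
Because each table pairs one header row with many data rows, a file can be read back with the standard `csv` module. The following is a minimal sketch of that idea, not a validating mzTab parser; `read_section` is a hypothetical helper and the sample text is a reduced fragment.

```python
import csv
import io


def read_section(text: str, header_prefix: str, row_prefix: str) -> list[dict[str, str]]:
    """Pair each data row with the column names from its header row."""
    header: list[str] = []
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # blank line between sections
        if fields[0] == header_prefix:
            header = fields[1:]
        elif fields[0] == row_prefix:
            rows.append(dict(zip(header, fields[1:])))
    return rows


sample = (
    "MTD\tmzTab-version\t1.0.0\n"
    "\n"
    "PRH\taccession\tprotein_abundance_assay[1]\n"
    "PRT\tP63017\t34.3\n"
)

proteins = read_section(sample, "PRH", "PRT")
print(proteins)
```

All values come back as strings, so numeric columns such as abundances need converting, and `null` cells need handling, before further analysis.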

---

## License

MIT

