Metadata-Version: 2.3
Name: msaf2
Version: 0.3.0
Summary: Building and managing MSA prior to structure inference
Author-email: Guillaume Launay <guillaume.launay@univ-lyon1.fr>
Requires-Python: >=3.12
Requires-Dist: biopython>=1.83
Requires-Dist: fastparquet>=2024.11.0
Requires-Dist: pandas>=2.2.3
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=13.9.4
Description-Content-Type: text/markdown

# Multiple Sequence Align/Alpha Fold

## Streamlining the MSA building stages
Gives you control over the database search and the bundling of MSA files prior to structure inference.

## Installation

### External dependencies
MSAF uses the following tools:
-  [mmseqs2](https://github.com/soedinglab/MMseqs2) for database search
-  [mafft](https://mafft.cbrc.jp/alignment/software/) for multiple sequence alignment

Both tools must be installed on your system.

### Python package
Just run `pip install msaf2`.

### Global setup 
MSAF often requires a configuration file, passed with the `-c` flag.
This configuration file is in YAML format and has the following shape:
```yaml
databases : 
  - /path/to/databases/mmseqs
executables:
  mafft: /usr/local/bin/mafft
  mmseqs: /opt/homebrew/bin/mmseqs
settings:
  cache : /path/to/msaf/cache
cocktails:
  test:
    ingredients:
      - target: swissprot
        label: pif.sto
      - target: uniprot
        label: paf.a3m
```

Where:
* `databases` is a list of folders in which MSAF recursively looks for mmseqs databases
* `executables` maps each external dependency to the path of its executable
* `cache` points to a folder where MSAF stores its working files; it MUST exist
* `cocktails` is a dictionary of recipes

A configuration template file can be generated by the following command
```
python -m msaf2 --generate
```
You can then edit the generated file to match your setup.
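
Before running a search, it can help to sanity-check an edited configuration. The sketch below is not part of msaf2: `check_config` is a hypothetical helper, and treating all four top-level keys as mandatory is an assumption. It takes the configuration as an already-parsed dictionary (for instance the result of `yaml.safe_load`) and reports the problems it finds, including a missing cache folder.

```python
from pathlib import Path

# Assumption: all four top-level keys of the template are expected.
REQUIRED_KEYS = ("databases", "executables", "settings", "cocktails")


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing top-level key: {key}")

    # The cache folder must already exist.
    cache = config.get("settings", {}).get("cache")
    if cache is None:
        problems.append("settings.cache is not set")
    elif not Path(cache).is_dir():
        problems.append(f"cache folder does not exist: {cache}")

    # Every ingredient of every recipe needs both a target and a label.
    for name, recipe in config.get("cocktails", {}).items():
        for i, ingredient in enumerate(recipe.get("ingredients", [])):
            for field in ("target", "label"):
                if field not in ingredient:
                    problems.append(
                        f"cocktails.{name}.ingredients[{i}] lacks '{field}'"
                    )
    return problems
```

An empty list means none of these checks failed; it does not guarantee that MSAF itself will accept the file.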

#### MSAF recipes
Recipes are declared in the configuration file. A recipe is characterized by a name (e.g. `test`) and a list of `ingredients`. Each ingredient describes one database search and how its result is saved, as a `target` and `label` pair: `target` names the database to search, and `label` names the resulting MSA file and, through its extension, its format.
Recipes may also feature an optional `PDQT` parameter which, if set to `TRUE`, wraps all a3m files in an [aligned.pdqt file](https://github.com/chaidiscovery/chai-lab).

In the above example, the `test` recipe will trigger a search in swissprot and uniprot for all supplied queries.
- The result of the swissprot search will be saved in Stockholm format in a file named `pif.sto`
- The result of the uniprot search will be saved in a3m format in a file named `paf.a3m`
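
The way a recipe expands into searches and output files can be sketched as follows. This is an illustration only, not msaf2's code: `plan_recipe` and the `FORMATS` table are hypothetical names, and the table holds only the two extension/format pairs mentioned in this README.

```python
from pathlib import PurePath

# Only the two extension/format pairs named in this README; assumed mapping.
FORMATS = {".sto": "stockholm", ".a3m": "a3m"}


def plan_recipe(recipe: dict) -> list[tuple[str, str, str]]:
    """Return (target database, output file name, output format) per ingredient."""
    plan = []
    for ingredient in recipe["ingredients"]:
        label = ingredient["label"]
        suffix = PurePath(label).suffix.lower()
        if suffix not in FORMATS:
            raise ValueError(f"unrecognised label extension: {label}")
        plan.append((ingredient["target"], label, FORMATS[suffix]))
    return plan


# The `test` recipe from the configuration example, as a parsed dictionary.
test_recipe = {
    "ingredients": [
        {"target": "swissprot", "label": "pif.sto"},
        {"target": "uniprot", "label": "paf.a3m"},
    ]
}
for target, label, fmt in plan_recipe(test_recipe):
    print(f"{target} -> {label} ({fmt})")
```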



## Usage

### List available databases
At startup, MSAF recursively searches every folder listed under `databases` in the configuration file for mmseqs database files (`<database_name>_h`, `<database_name>_.index`, `<database_name>.lookup`, `<database_name>.index`).

The registered database names can be displayed with:
```
msaf2 config.yaml --list
```
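
The discovery step described above can be approximated with a recursive glob. This is a minimal sketch, not msaf2's implementation: `find_databases` is a hypothetical name, and the set of companion files required for a database to count as complete (`_h`, `.lookup`, `.index`) is an assumption that may differ from what MSAF actually checks.

```python
from pathlib import Path

# Assumed companion suffixes; the set MSAF really checks may differ.
COMPANIONS = ("_h", ".lookup", ".index")


def find_databases(roots: list[str]) -> dict[str, Path]:
    """Map each database name to its path prefix, for complete file sets only."""
    found = {}
    for root in roots:
        # Use the header file (<name>_h) as the entry point for each candidate.
        for header in sorted(Path(root).rglob("*_h")):
            name = header.name[: -len("_h")]
            if not name:
                continue
            complete = all(
                (header.parent / (name + suffix)).is_file() for suffix in COMPANIONS
            )
            if complete:
                # Keep the first location found if a name appears twice.
                found.setdefault(name, header.parent / name)
    return found
```

Calling `find_databases(config["databases"])` on a parsed configuration would then return the names that a listing could display.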

### Run a search
```
msaf2 -c config.yaml --query <abs_path_query1.fasta> <abs_path_query2.fasta> --bp test
```
Here `--bp` refers to a recipe defined in the configuration file, and `--query` takes the absolute path(s) of the query sequence file(s) in FASTA format.
#### Multimer search
Results will be saved in the `--output` folder (`msas` by default), with one subfolder per query file, named with sequential one-letter chain identifiers that follow the order of the query files. If the same file is provided more than once as a query, only one folder is created. Hence, the results of a homodimer search are stored under a single `A/` subfolder.
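
The folder naming rule can be illustrated with a short sketch. It is not msaf2's code: `chain_folders` is a hypothetical helper, and it assumes that letters are handed out to unique query files in order of first appearance, so that a repeated file never consumes a new letter.

```python
from pathlib import Path
from string import ascii_uppercase


def chain_folders(queries: list[str], output: str = "msas") -> dict[str, Path]:
    """Map each distinct query file to its result subfolder (A, B, C, ...)."""
    folders = {}
    for query in queries:
        if query not in folders:
            # Assumption: the next unused letter goes to each new file.
            folders[query] = Path(output) / ascii_uppercase[len(folders)]
    return folders


# A homodimer (the same file given twice) maps to a single folder.
print(chain_folders(["/data/a.fasta", "/data/a.fasta"]))
```

As written, the sketch supports at most 26 distinct query files.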

### Wrap a preexisting folder of MSAs
If a preexisting folder is passed with the `--pdqt` flag, the a3m MSA files present in this folder will be archived in an `aligned.pdqt` file.
```
msaf2 --pdqt <results_a3m_folder>
```


