Metadata-Version: 2.4
Name: msaf2
Version: 0.5.0
Summary: Building and managing MSA prior to structure inference
Author-email: Guillaume Launay <guillaume.launay@univ-lyon1.fr>
Requires-Python: >=3.12
Requires-Dist: biopython>=1.83
Requires-Dist: fastparquet>=2024.11.0
Requires-Dist: pandas>=2.2.3
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: rich>=13.9.4
Description-Content-Type: text/markdown

# Multiple Sequence Align/Alpha Fold

## Streamlining the MSA building stages
Gives you control over the database search and the bundling of MSA files prior to structure inference.

## Installation

### External dependencies
MSAF uses the following tools:
- [mmseqs2](https://github.com/soedinglab/MMseqs2) for database search
- [mafft](https://mafft.cbrc.jp/alignment/software/) for multiple sequence alignment

Both tools must be installed.

### Python package
Just run `pip install msaf2`.

### Global setup 
MSAF often requires a configuration file (passed with the `-c` flag).
This configuration file is in YAML format and has the following shape:
```yaml
databases:
  - /path/to/databases/mmseqs
executables:
  mafft: /usr/local/bin/mafft
  mmseqs: /opt/homebrew/bin/mmseqs
settings:
  cache: /path/to/msaf/cache
cocktails:
  test:
    ingredients:
      - target: swissprot
        label: pif.sto
      - target: uniprot
        label: paf.a3m
```

Where:
* `databases` is a list of folders in which MSAF recursively looks for mmseqs databases
* `executables` maps each external dependency to the path of its executable
* `cache` (under `settings`) points to a folder where MSAF stores its working files; it MUST exist
* `cocktails` is a dictionary of recipes
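
The layout described above can be checked before running anything. The following is a minimal sketch, not part of MSAF: the function name `check_config` is hypothetical, and treating all four top-level keys as mandatory is an assumption. It works on an already parsed configuration (for instance the dictionary returned by `yaml.safe_load`).

```python
from pathlib import Path

# Assumption: all four top-level sections are expected to be present.
REQUIRED_KEYS = ("databases", "executables", "settings", "cocktails")


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems found in a parsed config."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append(f"missing top-level key: {key}")

    # The cache folder must already exist on disk.
    cache = cfg.get("settings", {}).get("cache")
    if cache is None:
        problems.append("settings.cache is not set")
    elif not Path(cache).is_dir():
        problems.append(f"cache folder does not exist: {cache}")

    # Every ingredient of every recipe needs both a target and a label.
    for name, recipe in cfg.get("cocktails", {}).items():
        for i, ingredient in enumerate(recipe.get("ingredients", [])):
            for field in ("target", "label"):
                if field not in ingredient:
                    problems.append(
                        f"cocktail {name!r}, ingredient {i}: missing {field}"
                    )
    return problems
```

An empty list means the configuration has the expected shape; any other result lists what to fix.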

A configuration template file can be generated with the following command:
```
python -m msaf2 --generate
```
Which you can then edit according to your settings.

#### MSAF recipes
Recipes are declared in the configuration file. A recipe is characterized by a name (e.g. `test`) and its `ingredients`. The `ingredients` define the database searches and how their results are saved, as a list of `target` and `label` pairs. The `target` key defines the database to search and `label` defines the resulting MSA file (and its format).
Recipes may also feature an optional `PDQT` parameter which, if set to `TRUE`, wraps all a3m files in an [aligned.pdqt file](https://github.com/chaidiscovery/chai-lab).

In the above example, the `test` recipe will trigger a search in swissprot and uniprot for all supplied queries.
- The result of the swissprot search will be saved in Stockholm format in a file named `pif.sto`
- The result of the uniprot search will be saved in a3m format in a file named `paf.a3m`
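
To make the recipe semantics concrete, here is a small sketch that expands a recipe into the searches it implies. It is an illustration only: `plan_recipe` and the `FORMATS` table are hypothetical names, and deriving the format from the label's file extension is an assumption based on the two labels shown above.

```python
from pathlib import PurePath

# Assumed extension-to-format table, based only on the two labels in the example.
FORMATS = {".sto": "stockholm", ".a3m": "a3m"}


def plan_recipe(recipe: dict) -> list[tuple[str, str, str]]:
    """Expand a recipe into (target database, output file, format) triples."""
    plan = []
    for ingredient in recipe["ingredients"]:
        label = ingredient["label"]
        suffix = PurePath(label).suffix.lower()
        if suffix not in FORMATS:
            raise ValueError(f"unsupported label extension: {label}")
        plan.append((ingredient["target"], label, FORMATS[suffix]))
    return plan


# The `test` recipe from the configuration example, as a parsed dictionary.
test_recipe = {
    "ingredients": [
        {"target": "swissprot", "label": "pif.sto"},
        {"target": "uniprot", "label": "paf.a3m"},
    ]
}

for target, label, fmt in plan_recipe(test_recipe):
    print(f"search {target} -> {label} ({fmt})")
```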



## Usage

### List available databases
At startup, MSAF will recursively search inside every `databases` item found in the configuration file for mmseqs database files (`<database_name>_h`, `<database_name>_h.index`, `<database_name>.lookup`, `<database_name>.index`).

The registered database names can be displayed with:
```
msaf2 -c config.yaml --list
```
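
The recursive lookup can be pictured with the sketch below. It is not MSAF's actual implementation: `find_mmseqs_databases` is a hypothetical helper, and recognizing a database by its `.lookup` file plus the MMseqs2 header and index companions is an assumption.

```python
from pathlib import Path

# Companion files expected next to a database, following MMseqs2 naming:
# header file, header index, lookup table, and sequence index.
COMPANION_SUFFIXES = ("_h", "_h.index", ".lookup", ".index")


def find_mmseqs_databases(root: Path) -> dict[str, Path]:
    """Map each database name found under root to the path of its main file."""
    found = {}
    for lookup in sorted(root.rglob("*.lookup")):
        base = lookup.with_suffix("")  # strip the trailing ".lookup"
        companions = [Path(f"{base}{suffix}") for suffix in COMPANION_SUFFIXES]
        # Keep the database only if the main file and all companions exist.
        if base.is_file() and all(p.is_file() for p in companions):
            found[base.name] = base
    return found
```

Calling `find_mmseqs_databases(Path("/path/to/databases/mmseqs"))` would return a dictionary keyed by database name, which is the kind of information `--list` displays.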

### Run a search
```
msaf2 -c config.yaml --query <abs_path_query1.fasta> <abs_path_query2.fasta> --bp test
```
Here `--bp` refers to one recipe defined in the config file and `--query` to the absolute path(s) of the query sequence file(s) (FASTA format).
#### Multimer search
Results will be saved in the `--output` folder (`msas` by default), with one subfolder per query file, named with sequential one-letter chain identifiers following the order of the query files. If the same file is provided more than once as a query, only one folder will be created. Hence, the results of a homodimer search will be stored under a single `A/` subfolder.
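
The folder naming can be illustrated as follows. This is a sketch under assumptions, not MSAF's code: `assign_chain_folders` is a hypothetical helper, and it assumes that letters are handed out in order of first appearance, so that a repeated file does not consume a letter.

```python
from string import ascii_uppercase


def assign_chain_folders(queries: list[str]) -> dict[str, str]:
    """Map each distinct query file to a one-letter chain folder."""
    folders = {}
    for query in queries:
        if query not in folders:
            if len(folders) >= len(ascii_uppercase):
                raise ValueError("more than 26 distinct query files")
            # Next unused letter, in order of first appearance.
            folders[query] = ascii_uppercase[len(folders)]
    return folders


# Homodimer: the same file twice yields a single folder.
print(assign_chain_folders(["/data/a.fasta", "/data/a.fasta"]))
# Heterodimer: two distinct files yield two folders.
print(assign_chain_folders(["/data/a.fasta", "/data/b.fasta"]))
```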

### Wrap a preexisting folder of MSAs
If a preexisting folder is passed with the `--pdqt` flag, the a3m MSA files present in this folder will be archived in an `aligned.pdqt` file.
```
msaf2 --pdqt <results_a3m_folder>
```


# Miscellaneous
How to format a FASTA database file, from the MMseqs2 documentation:

```
Searching
Before searching, you need to convert your FASTA file containing query sequences and target sequences into a sequence DB. You can use the query database examples/QUERY.fasta and target
database examples/DB.fasta to test the search workflow:
mmseqs createdb examples/QUERY.fasta queryDB
mmseqs createdb examples/DB.fasta targetDB
These calls should generate five files each, e.g. queryDB, queryDB_h and its corresponding index file
queryDB.index, queryDB_h.index and queryDB.lookup from the FASTA QUERY.fasta input
sequences.
The queryDB and queryDB.index files contain the amino acid sequences, while the queryDB_h and
queryDB_h.index file contain the FASTA headers. The queryDB.lookup file contains a list of tab
separated fields that map from the internal identifier to the FASTA identifiers.
For the next step, an index file of the targetDB is computed for a fast read-in. It is recommended
to compute the index if the targetDB is reused for several searches. If only few searches against this
database will be done, this step should be skipped.
mmseqs createindex targetDB tmp
This call will create a targetDB.idx file. It is just possible to have one index per database.
Then generate a directory for temporary files. MMseqs2 can produce a high IO on the file system.
It is recommended to create this temporary folder on a local drive.
```
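
The two MMseqs2 calls quoted above can be scripted. The sketch below only builds the command lines; the helper names are hypothetical, and the `mmseqs` argument is assumed to be the executable path from the `executables` section of the configuration.

```python
import shlex


def createdb_cmd(fasta: str, db: str, mmseqs: str = "mmseqs") -> list[str]:
    """Command converting a FASTA file into an MMseqs2 sequence database."""
    return [mmseqs, "createdb", fasta, db]


def createindex_cmd(db: str, tmp: str, mmseqs: str = "mmseqs") -> list[str]:
    """Command precomputing an index for a database reused across searches."""
    return [mmseqs, "createindex", db, tmp]


for cmd in (
    createdb_cmd("examples/DB.fasta", "targetDB"),
    createindex_cmd("targetDB", "tmp"),
):
    # Show the command; to execute it, pass the list to
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```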