Metadata-Version: 2.4
Name: mifaser
Version: 1.64
Summary: a python package for super-fast and accurate annotation of molecular functionality
Home-page: https://bitbucket.org/bromberglab/mifaser
Author: Chengsheng Zhu, Maximilian Miller
Author-email: mmiller@bromberglab.com
License: NPOSL-3.0
Project-URL: Bug Tracker, https://bitbucket.org/bromberglab/mifaser/issues
Project-URL: Documentation, https://bitbucket.org/bromberglab/mifaser/wiki/docs
Project-URL: Source Code, https://bitbucket.org/bromberglab/mifaser
Keywords: microbiome,metagenome,function annotation
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE.md
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary

# mi-faser #
### microbiome - functional annotation of sequencing reads ###

A super-fast ( < 10min/10GB of reads ) and accurate ( > 90% precision ) method for annotation of
molecular functionality encoded in sequencing read data without the need for assembly or gene finding.

**Web Service:** https://bromberglab.org/services/mifaser/

**Docker:** A pre-built docker image is available at https://hub.docker.com/r/bromberglab/mifaser

## Requirements ##

*mi-faser* runs on **LINUX**, **MacOSX** and **WINDOWS** systems.

**Dependencies**

* Python >= 3.6
* DIAMOND >= 0.8.8 (included; source: https://github.com/bbuchfink/diamond)
* WINDOWS: Visual C++ Redistributable

**Note:** *mi-faser* was developed and optimized using ***DIAMOND* v0.8.8**, which is included in all releases. This is also the version used in the accompanying publication [1]. Most newer releases of *mi-faser* use the latest stable release of *DIAMOND*. Results of the first *mi-faser* release (v1.2) were essentially unchanged when run with an updated version of *DIAMOND* (v0.9.13): the difference was <0.1%, based on results for the artificial metagenome supplied as example dataset. According to the *DIAMOND* authors, more recent versions offer substantial improvements in speed and memory usage as well as bug fixes. We therefore strongly recommend always using the latest version of *DIAMOND* (see section *DIAMOND upgrade*). This might alter *mi-faser* results slightly; however, the changes are expected to add new correct annotations rather than introduce mis-annotations.

We recommend downloading and **compiling *DIAMOND* locally** (https://github.com/bbuchfink/diamond), as a locally compiled binary can use
CPU-specific instructions, which might have a significant impact on performance.
A pre-compiled version of *DIAMOND* is nevertheless included in this repository.

Note that different split sizes could, on very rare occasions, result in minor deviations in *mi-faser* annotations. This is due to certain heuristics applied by *DIAMOND* when generating sequence alignments. We suggest keeping the split size constant across analyses that are meant to be compared.

**Optional extensions**

* SRA Toolkit >= 2.9.1 ([NCBI](https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/))

  If installed, the SRA Toolkit enables *mi-faser* to automatically retrieve and process read files deposited in the NCBI Sequence Read Archive ([SRA](https://www.ncbi.nlm.nih.gov/sra)). Currently, SRR, ERR and DRR identifiers are supported.
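
As an illustration of the `sra:<accession_id>` input form, the following minimal sketch checks whether an argument names a supported run accession. The helper name `parse_sra_argument` and the pattern (one of the three prefixes followed by digits) are assumptions made for this example and are not taken from the *mi-faser* code base.

```python
import re

# Assumption: a run accession is one of the supported prefixes followed by digits.
_SRA_RUN = re.compile(r"^(SRR|ERR|DRR)\d+$")


def parse_sra_argument(value):
    """Return the accession from 'sra:<id>', or None if it is not a supported run ID.

    Hypothetical helper for illustration; not part of the mifaser package.
    """
    prefix, sep, accession = value.partition(":")
    if sep != ":" or prefix != "sra":
        return None
    accession = accession.strip().upper()
    return accession if _SRA_RUN.match(accession) else None
```

Arguments without the `sra:` prefix, such as file paths or URLs, yield `None` and would be handled as regular inputs.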

## Reference Databases ##

*mi-faser* was developed using a manually curated reference database of protein functions (*GS* database; [DOI 10.5281/zenodo.1048269](https://doi.org/10.5281/zenodo.1048268)).

*mi-faser* version 1.5 added a *GS+* database, which extends the default *GS* database. The *GS+* database includes an additional 55 manually curated protein sequences, introducing 28 new E.C.s that represent important microbial functions in the environment.

*mi-faser* version 1.61 introduced updated (2021) reference databases *GS-21-all* and *GS-21-bac*. *GS-21-all* is a set of 15,524 manually curated reference proteins and 2,716 E.C. annotations from Eukaryota, Archaea, Bacteria and Viruses. *GS-21-bac* contains 7,288 reference proteins and 1,882 E.C. annotations from Bacteria.

*mi-faser* version 1.63 added three 2024 reference databases: *GS-24-all*, *GS-24-bac* and *GS-24-prok*. *GS-24-all* covers 12,491 manually curated reference proteins and 3,686 E.C. annotations from Eukaryota, Archaea, Bacteria and Viruses. *GS-24-bac* includes only reference proteins from Bacteria, and *GS-24-prok* includes only reference proteins from Bacteria and Archaea.

To create a new reference database, refer to the section *Creating a reference database*.

## Installation ##

**Standalone *VS* Web Service**

The Standalone version of *mi-faser* partitions the user input into subsets in the same way as the Web Service (http://services.bromberglab.org/mifaser/). However, those partitions are processed sequentially rather than in parallel as in the Web Service.
Thus, the Standalone version is only recommended for smaller jobs and is mainly intended to provide the *mi-faser* code base.
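
The partitioning idea can be sketched as follows. This is only an illustration under stated assumptions, not the *mi-faser* implementation: it assumes FASTA input, and the function names `read_fasta` and `partition` are hypothetical. A size of 0 yields a single partition, mirroring the documented meaning of `-s 0`.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records, size):
    """Yield lists of at most `size` records in input order; size <= 0 means no split."""
    records = iter(records)
    if size <= 0:
        yield list(records)
        return
    while True:
        batch = list(islice(records, size))
        if not batch:
            return
        yield batch
```

Each partition could then be processed one after another, which corresponds to the sequential behaviour described above.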

**Python package**

*mi-faser* is available as a Python package. To install *mi-faser* using pip run:
```
pip install mifaser
```
*mi-faser* can then be used directly from the command line:
```
mifaser
```
The *mi-faser* module can be imported in a Python project by `import mifaser`.

**Docker**

The pre-built *mi-faser* docker image is probably the most convenient way to run *mi-faser* locally or in any cloud infrastructure. The docker image can be used in the same way as the standalone version; however, a common working directory has to be mounted into the container.

To create and execute a single instance of *mi-faser* using a locally mounted working directory run:
```
docker run --rm \
    -v <LOCAL_INPUT_DIRECTORY>:/input \
    -v <LOCAL_OUTPUT_DIRECTORY>:/output \
    bromberglab/mifaser -f <INPUT_FILE>
```
`<INPUT_FILE>` is a valid *mi-faser* input file located in `<LOCAL_INPUT_DIRECTORY>` on your host environment. By default, *mi-faser* reads input files relative to `/input` and writes any output to `/output`. Thus, by bind mounting your local `<LOCAL_INPUT_DIRECTORY>` to `/input` inside the docker container, input files can be passed simply as paths relative to your `<LOCAL_INPUT_DIRECTORY>`. Similarly, by mounting a `<LOCAL_OUTPUT_DIRECTORY>` to `/output` inside the docker container, all *mi-faser* outputs can be accessed at the `<LOCAL_OUTPUT_DIRECTORY>`.
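
When the container is launched from a script, the same command can be assembled programmatically. The sketch below builds the argument list shown above; the function name `docker_command` is hypothetical, and the host directories are resolved to absolute paths on the assumption that this avoids any ambiguity between bind mounts and named volumes.

```python
from pathlib import Path


def docker_command(input_dir, output_dir, input_file, image="bromberglab/mifaser"):
    """Build the argv list for a single containerized mi-faser run.

    Hypothetical helper for illustration; it only assembles the command.
    """
    # Resolve host directories to absolute paths before mounting them.
    host_in = Path(input_dir).resolve()
    host_out = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_in}:/input",
        "-v", f"{host_out}:/output",
        image, "-f", input_file,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the container.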

**Python source (git repository)**

Open a terminal and clone the *mi-faser* repository:
```
git clone https://git@bitbucket.org/bromberglab/mifaser.git
```
or download the zipped version:
```
curl --remote-name https://bitbucket.org/bromberglab/mifaser/get/master.zip
unzip master.zip
```

## Usage ##

**In case *mi-faser* was downloaded using the git repository:**

 * navigate to the *mi-faser* repository base directory
 * all examples in the following documentation have to be run using `python -m mifaser` instead of `mifaser`.
 
**Run *mi-faser* (Single or 2-Lane mode)**

**Single:** input file containing DNA reads, a single http[s]/ftp[s] URL or an SRA accession ID (`sra:<accession_id>`):
```
mifaser -f/--inputfile <INPUT_FILE>
```

**2-Lane:** two files (R1/R2), http[s]/ftp[s] URLs or SRA accession IDs (`sra:<accession_id1> sra:<accession_id2>`):
```
mifaser -l/--lanes <R1_FILE> <R2_FILE>
```
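
For batch processing, the choice between the two modes can be made from the number of inputs. The following sketch is one possible wrapper, assuming `mifaser` is on the `PATH`; the function name `mifaser_command` is hypothetical, and only the documented `--inputfile`, `--lanes` and `--outputfolder` options are used.

```python
def mifaser_command(inputs, output_folder=None):
    """Build a mifaser argv list from one input (Single) or two inputs (2-Lane).

    Hypothetical helper for illustration; it does not run mi-faser itself.
    """
    inputs = list(inputs)
    if len(inputs) == 1:
        cmd = ["mifaser", "--inputfile", inputs[0]]
    elif len(inputs) == 2:
        cmd = ["mifaser", "--lanes", inputs[0], inputs[1]]
    else:
        raise ValueError("expected one input (Single) or two inputs (2-Lane)")
    if output_folder is not None:
        cmd += ["--outputfolder", output_folder]
    return cmd
```

The returned list can be executed with `subprocess.run(cmd, check=True)`.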

<div class="pagebreak"></div>

## CLI ##
*mi-faser* help:
```
usage: mifaser [-h] [-f INPUTFILE] [-l R1 R2] [-o OUTPUTFOLDER]
               [-d DATABASEFOLDER] [-i DIAMONDFOLDER] [-m] [-s SPLIT]
               [-S [SPLITMB]] [-t THREADS] [-c CPU] [-p] [-n] [-u UPDATE]
               [-D [arg [arg ...]]] [-v] [-q] [--version]

mi-faser, microbiome - functional annotation of sequencing reads
 
a super-fast ( < 10min/10GB of reads ) and accurate ( > 90% precision ) method
for annotation of molecular functionality encoded in sequencing read data
without the need for assembly or gene finding.
 
Public web service: https://services.bromberglab.org/mifaser
 
Version: 1.64 [07/21/25]

optional arguments:
  -h, --help            show this help message and exit
  -f INPUTFILE, --inputfile INPUTFILE
                        input DNA reads file, http[s]/ftp[s] url or SRA
                        accession id (sra:<id>)
  -l R1 R2, --lanes R1 R2
                        2-Lane format (R1/R2) files, http[s]/ftp[s] url or SRA
                        accession ids (sra:<id_1> sra:<id_2>)
  -o OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
                        path to base output folder; default: INPUTFILE_out
  -d DATABASEFOLDER, --databasefolder DATABASEFOLDER
                        name of database located in database/ directory OR
                        absolute path to folder containing database files
  -i DIAMONDFOLDER, --diamondfolder DIAMONDFOLDER
                        path to folder containing diamond binary
  -m, --mapping         if flag is set all reads mappings will be generated
                        (reads{n=*} -> EC{n=1}, fasta)
  -s SPLIT, --split SPLIT
                        split by X sequences; default: 100k; 0 forces no split
  -S [SPLITMB], --splitmb [SPLITMB]
                        split by X MB; default: 25; (requires split from GNU
                        Coreutils)
  -t THREADS, --threads THREADS
                        number of threads; default: 1
  -c CPU, --cpu CPU     max cpus per thread; default: all available
  -p, --preserve        if flag is set intermediate results are kept
  -n, --no-check        if flag is set check for compatibility between diamond
                        database and binary is omitted
  -u UPDATE, --update UPDATE
                        valid update commands: { diamond[:version] }
  -D [arg [arg ...]], --createdb [arg [arg ...]]
                        create new reference database: <db_name>
                        <db_sequences.fasta> [merge_db=<name of db to merge
                        with>] [update_ec_annotations=<1|0>; default: 0]
  -v, --verbose         set verbosity level; default: log level INFO
  -q, --quiet           if flag is set console output is logged to file
  --version             show program's version number and exit

If you use *mi-faser* in published research, please cite:
 
Zhu, C., Miller, M., ... Bromberg, Y. (2017).
Functional sequencing read annotation for high precision microbiome analysis.
Nucleic Acids Res. [doi:10.1093/nar/gkx1209]
(https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1209/4670955)
 
mi-faser is developed by Chengsheng Zhu and Maximilian Miller.
Feel free to contact us for support at services@bromberglab.org.
 
This project is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0)
 
Test: mifaser -f mifaser/files/test/artificial_mg.fasta -o mifaser/files/test/out
```

**Example**

A demo dataset containing 10k reads is provided to verify a local *mi-faser* installation. Navigate to the *mifaser* repository base directory and run *mi-faser* with the following arguments:
```
mifaser -f mifaser/files/test/artificial_mg.fasta -o mifaser/files/test/out
```
The resulting analysis will be located relative to the *mifaser* base directory at: *mifaser/files/test/out/*.

**DIAMOND upgrade**

As *DIAMOND* (https://github.com/bbuchfink/diamond) is actively developed, we provide an easy way to upgrade (or downgrade) to another version.
If a specific version of *DIAMOND* is given as a parameter, that version is downloaded and installed automatically; otherwise the latest release is used.
```
mifaser --update diamond[:<DIAMOND_VERSION>]
```

**Creating a reference database**

*mi-faser* uses a manually curated reference database of protein functions. To create an alternative reference database, first store the desired set of protein sequences in a multi-FASTA file using the following format for the sequence headers:
> `>id|annotation|e.c.-number|additional_details`

sequences.fasta:
```
>id|annotation|e.c.-number|additional_details
MKPNTDFMLIADGAKVLTQGNLTEHCAIEVSDGIICGLKSTISAEWTADKPHYRLTSGTL
VAGFIDTQVNGGGGLMFNHVPTLETLRLMMQAHRQFGTTAMLPTVITDDIEVMQAAADAV
AEAIDCQVPGIIGIHFEG
>id|annotation|e.c.-number|additional_details
MYYGLDIGGTKIELAIFDTQLALQDKWRLSTPGQDYSAFMATLAEQIEKADQQCGERGTV
GIALPGVVKADGTVISSNVPCLNQRRVAHDLAQLLNRTVAIGNDCRCFALSEAVLGVGRG
YSRVLGMI
```
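
Before building a database, it can be useful to check that every header follows this layout. The sketch below is an illustrative check, not part of *mi-faser*: the names `parse_header` and `check_fasta_headers` are hypothetical, and it assumes that all four `|`-separated fields must be present and that the id and E.C. number must be non-empty.

```python
FIELDS = ("id", "annotation", "ec_number", "additional_details")


def parse_header(line):
    """Split a '>id|annotation|e.c.-number|additional_details' header into a dict."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # Split on the first three separators only, so details may contain '|'.
    parts = line[1:].strip().split("|", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} '|'-separated fields: {line!r}")
    record = dict(zip(FIELDS, parts))
    if not record["id"] or not record["ec_number"]:
        raise ValueError(f"empty id or E.C. number: {line!r}")
    return record


def check_fasta_headers(lines):
    """Return (line_number, message) pairs for malformed header lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            try:
                parse_header(line)
            except ValueError as error:
                problems.append((number, str(error)))
    return problems
```

An empty result from `check_fasta_headers(open("sequences.fasta"))` indicates that all headers match the assumed layout.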

Then run *mi-faser* using the *-D/--createdb* argument to create a new reference database *my_database*:

```
mifaser -D my_database path/to/sequences.fasta
```

To use the new database run:

```
mifaser -d my_database -f mifaser/files/test/artificial_mg.fasta -o mifaser/files/test/out
```

See the *help* menu (`--help`) for more details.

<div class="pagebreak"></div>

## License ##

This project is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0).

## Citation ##

If you use *mi-faser* in published research, please cite:

Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Rühlemann, M. C., Wu, G. H. F.-A., ... Bromberg, Y. (2017). Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. [doi:10.1093/nar/gkx1209](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1209/4670955)


## About ##

*mi-faser* is developed by Chengsheng Zhu and Maximilian Miller. Feel free to contact us for support: [services@bromberglab.org](mailto:services@bromberglab.org).
