Metadata-Version: 2.4
Name: vdj-insights
Version: 0.1.0
Summary: VDJ-Insights provides a robust framework for the accurate annotation of complex genomic immune regions.
Author: Jesse Mittertreiner, Sayed Jamiel Mohammadi, Giang Le, Jesse Bruijnesteijn, Suzan Ott
Author-email: jaimymohammadi@gmail.com
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: biopython==1.85
Requires-Dist: openpyxl==3.1.5
Requires-Dist: PyYAML
Requires-Dist: tqdm==4.67.1
Requires-Dist: psutil==7.0.0
Requires-Dist: matplotlib==3.10.3
Requires-Dist: seaborn==0.13.2
Requires-Dist: bs4==0.0.2
Requires-Dist: venny4py==1.0.3
Requires-Dist: requests==2.32.4
Requires-Dist: Flask==3.1.1
Requires-Dist: flask_caching==2.3.1
Requires-Dist: bokeh==3.7.3
Requires-Dist: plotly==6.2.0
Requires-Dist: matplotlib-venn==1.1.2
Requires-Dist: dna-features-viewer==3.1.5
Requires-Dist: ghostscript==0.8.1
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# VDJ-Insights

## Introduction

VDJ-Insights is a robust software package for accurate annotation of the V, D, and J gene segments within immunoglobulin (IG) and T-cell receptor (TCR) genomic regions. In addition to segment annotation, it evaluates gene functionality, detects recombination signal sequences (RSS), and annotates complementarity-determining regions 1 and 2 (CDR1 and CDR2). These features extend the utility of VDJ-Insights beyond gene annotation, providing a framework for functional immunogenetics and enabling evolutionary and comparative analyses at individual, population, and species levels.

---

## Installation

VDJ-Insights is currently only supported on Linux systems. Before running the pipeline, please ensure that Python (version 3.7 or higher) and Conda are installed on your system. 
You can install VDJ-Insights using one of the following methods:

### Option 1: Clone the repository
1. Clone the VDJ-Insights repository:
   ```bash
   git clone https://github.com/BPRC-CGR/VDJ-insights
   ```

2. Navigate to the repository directory:
   ```bash
   cd VDJ-insights
   ```

3. Run the pipeline using Python's -m option:
   ```bash
   python -m vdj_insights <annotation|html> [arguments]
   ```
**Note:** When running from a cloned repository, always execute the pipeline with `python -m`. This ensures that Python recognizes the package structure and runs the pipeline without additional installation steps.

### Option 2: Install via pip
1. Use pip to install VDJ-Insights:
   ```bash
   pip install vdj_insights
   ```
2. Run the pipeline:
   ```bash
   vdj_insights <annotation|html> [arguments]
   ```

## Using VDJ-Insights
Use the following command to run the annotation script:

```bash
python -m vdj_insights annotation (-a <assembly_directory> | -i <region_directory>) -l <library_directory/library.fasta> -r <receptor_type> -s <species_name> -f <flanking_genes> -t <threads> -m <mapping_tool> -M <metadata_file.xlsx> -o <output_directory> [--default]
```

### **Required Arguments:**
| **Argument**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;                    | **Description**                                                                                                                                                         | **Example**                                                                    |
|-----------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|
| `-r`,<br> `--receptor-type`                      | Type of receptor to analyze. Choices: `IG` (immunoglobulin) or `TR` (T-cell receptor).<br> **Required when using `--default`.**                                            | `-r TR`                                                                        |
| `-i`,<br> `--input` <br><br> **or** <br><br> `-a`, `--assembly` | Directory containing either extracted sequence regions (`--input`), referring to sequences of the region of interest already isolated from a genome assembly <br><br> **or** <br><br> complete genome assembly files (`--assembly`).                                                         | `-i /path/to/region` <br> `-a /path/to/assembly`                                 |
| `-l`,<br> `--library`                            | Path to the FASTA library file containing reference V(D)J segment sequences.                                                                                                     | `-l /path/to/library.fasta`                                                    |
| `-f`,<br> `--flanking-genes`                     | Flanking genes per region, provided as a JSON object that maps each region name to a pair of gene names. If only one flanking gene is present, use `"-"` as a placeholder for the missing side.              | `-f '{"IGH": ["PACS2", "-"], "IGK": ["RPIA", "PAX8"], "IGL": ["GNAZ", "TOP3B"]}'` |
| `-s`,<br> `--species`                           | Scientific species name (e.g., `Homo sapiens`).                                                                                                                                   | `-s "Homo sapiens"`                                                            |

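The JSON value for `-f/--flanking-genes` is easy to mistype on the command line. The following is a minimal Python sketch, not part of VDJ-Insights itself, that builds the value from a dictionary and quotes it for the shell. The helper name `flanking_genes_argument` is hypothetical; the region and gene names are taken from the example in the table above.

```python
import json
import shlex

# Region name -> pair of flanking genes; "-" marks a missing side.
# Values are copied from the README example, not from a validated gene list.
flanking_genes = {
    "IGH": ["PACS2", "-"],
    "IGK": ["RPIA", "PAX8"],
}


def flanking_genes_argument(mapping):
    """Return the JSON text to pass after -f (hypothetical helper)."""
    for region, genes in mapping.items():
        if len(genes) != 2:
            raise ValueError(f"{region}: expected two entries, got {genes!r}")
    return json.dumps(mapping)


json_text = flanking_genes_argument(flanking_genes)
# shlex.quote wraps the JSON in single quotes so the shell passes it as one argument.
print("-f", shlex.quote(json_text))
```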
---

### **Optional Arguments:**
| **Argument**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;                 | **Description**                                                                                    | **Example**              |
|---------------------|----------------------------------------------------------------------------------------------------|-------------------------|
| `-M`,<br> `--metadata`    | Path to the metadata file (.xlsx).<br> [Download example template](https://github.com/BPRC-Bioinfo/VDJ-insights/blob/main/vdj_insights/metadata/metadata.xlsx)                                        | `-M metadata.xlsx`       |
| `-o`,<br> `--output`      | Output directory for the results (Default: `annotation_results`).         | `-o /path/to/output`     |
| `-m`,<br> `--mapping-tool`| Available mapping tools: `minimap2`, `bowtie`, `bowtie2`. (Default: all).                          | `-m minimap2`            |
| `-t`,<br> `--threads`     | Number of threads for parallel processing (Default: `8`).                                          | `-t 16`                  |
| `--default`         | Use default settings (cannot be used with `--flanking-genes`).                                     | `--default`              |
| `-S`,<br> `--scaffolding` | Path to reference genome (FASTA).<br> **Only supported for phased assembly files.** | `-S /path/to/reference.fasta`|

### Important notes

- If using the `-i/--input` flag, do not specify `-f/--flanking-genes`, as flanking genes are only required when defining regions of interest from a complete genome assembly using `-a/--assembly`.
- If using the `-i/--input` flag, input file(s) should be named in the format `<sample-name>_<region>.fasta` and must be located in the indicated directory.
- If using the `--default` flag, do not specify `-f/--flanking-genes` as they are mutually exclusive.
- If using the `--default` flag, the annotation tool automatically downloads the appropriate V(D)J gene segment library based on the specified receptor type (`-r`) and species (`-s`). There is no need to define flanking genes manually or provide a local library file.
- If using the `--scaffolding` flag, RagTag scaffolding requires a phased assembly as input. If the input assembly contains contigs of both haplotypes, it should be phased beforehand.

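The rules in the notes above can be checked before launching a run. The sketch below is illustrative only: `split_input_name` and `check_flags` are hypothetical helpers, not VDJ-Insights functions, and the parser assumes that the region is the part of the file name after the last underscore.

```python
from pathlib import Path


def split_input_name(path):
    """Split '<sample-name>_<region>.fasta' into (sample, region).

    Assumption: the region follows the LAST underscore, so sample
    names may themselves contain underscores.
    """
    p = Path(path)
    if p.suffix != ".fasta":
        raise ValueError(f"{p.name}: expected a .fasta file")
    sample, sep, region = p.stem.rpartition("_")
    if not sep or not sample or not region:
        raise ValueError(f"{p.name}: expected <sample-name>_<region>.fasta")
    return sample, region


def check_flags(flags):
    """Return a list of problems for a set of long flag names."""
    problems = []
    if "--input" in flags and "--flanking-genes" in flags:
        problems.append("--flanking-genes is only used with --assembly")
    if "--default" in flags and "--flanking-genes" in flags:
        problems.append("--default and --flanking-genes are mutually exclusive")
    if "--default" in flags and "--receptor-type" not in flags:
        problems.append("--default requires --receptor-type")
    return problems


print(split_input_name("regions/Sample_001_IGH.fasta"))
print(check_flags({"--input", "--flanking-genes"}))
```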
### Example
1. Download the T2T-CHM13v2.0 assembly file from the T2T Consortium (GCA_009914755.4) using the following command:

   ```bash
   wget https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/genome/Homo_sapiens-GCA_009914755.4-unmasked.fa.gz
   ```

2. Extract the assembly file:
   ```bash
   gunzip Homo_sapiens-GCA_009914755.4-unmasked.fa.gz
   ```

3. Run VDJ-Insights using the T2T assembly:

   ```bash
   python -m vdj_insights annotation -a /path/to/Homo_sapiens-GCA_009914755.4-unmasked.fa -r IG -s "Homo sapiens" --default
   ```
   or
   ```bash
   vdj_insights annotation -a /path/to/Homo_sapiens-GCA_009914755.4-unmasked.fa -r IG -s "Homo sapiens" --default
   ```

When the `--default` flag is used, VDJ-Insights automatically downloads the appropriate V(D)J segment library for the specified receptor type (`-r`) and species (`-s`) from IMGT, when available. It is not necessary to specify flanking genes or provide a local library file.

## Annotation results
The results generated by VDJ-Insights are stored in the **annotation** directory. This directory includes the following Excel files:
- `annotation_report_known.xlsx` contains information on known V, D, and J gene segments, including recombination signal sequences.
- `annotation_report_novel.xlsx` contains information on novel V, D, and J gene segments, including recombination signal sequences.
- `annotation_report_all.xlsx` combines information on both known and novel V, D, and J gene segments.
- `tmp/blast_results.xlsx` contains the BLAST search results used for validation of annotations.  
- `tmp/report.xlsx` provides a summary of the overall findings from the alignment analyses.
  
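The report files listed above can be inspected programmatically. The following is a minimal sketch using `openpyxl` (already a declared dependency); it assumes that the report sits in the first worksheet with a header in the first row and that the sheet contains a `Status` column. The helper names and the sample rows are illustrative, not taken from a real run.

```python
from collections import Counter


def load_rows(xlsx_path):
    """Read the active worksheet into a list of dicts keyed by header name."""
    # Imported lazily so the rest of the sketch works without openpyxl installed.
    from openpyxl import load_workbook

    workbook = load_workbook(xlsx_path, read_only=True)
    rows = workbook.active.iter_rows(values_only=True)
    header = next(rows)
    return [dict(zip(header, row)) for row in rows]


def count_by(rows, column):
    """Count rows per distinct value of one column."""
    return Counter(row.get(column) for row in rows)


# Illustrative rows standing in for load_rows(...); values are made up.
# With real data (path layout is an assumption):
#   rows = load_rows("/path/to/output/annotation/annotation_report_all.xlsx")
rows = [
    {"Sample": "Sample_001", "Segment": "V", "Status": "Known"},
    {"Sample": "Sample_001", "Segment": "V", "Status": "Novel"},
    {"Sample": "Sample_001", "Segment": "J", "Status": "Known"},
]
print(count_by(rows, "Status"))
```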
Each annotation report (known or novel) includes the following columns, providing detailed information about the identified segments:

| **Column**                     | **Explanation**                                                                                                                                                                                                                               | **Example**                  |
|---------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------|
| **Sample**                      | The name of the sample. | `Sample_001` |
| **Haplotype**                   | The haplotype ID (for example, maternal or paternal). | `1` or `mat` |
| **Region**                      | The annotated region. | `IGHV` |
| **Segment**                     | The gene segment type. | `V` |
| **Start coord**                 | The start coordinate on the annotated contig. | `12345`|
| **End coord**                   | The end coordinate on the annotated contig. |`12789`|
| **Strand**                      | Segment orientation: `+` indicates 5' to 3' direction, and `-` indicates 3' to 5' direction. | `+`|
| **Library name**                   | The closest reference gene segment name associated with the identified segment. | `IGHV3-23*01`|
| **Target name**               | The name assigned to the novel gene segment, based on the closest reference gene, with "like" appended to indicate similarity. | `IGHV3-23-like` |
| **Short name**                  | The gene name, as defined by IMGT nomenclature standards. | `IGHV3*01` |
| **Similar references**          | Other reference gene segments sharing the same start and end coordinates; the best match is selected based on the mutation count and the reference gene name.                                                                                               | `IGHV3-33*02`                 |
| **Target sequence**           | The nucleotide sequence of the novel gene segment. | `ATGGTGCAAGC...` |
| **Library sequence**               | The nucleotide sequence of the closest reference gene segment. | `ATGGTGCAAAC...` |
| **Mismatches**                  | The total number of mismatches observed between the novel segment and the reference sequence.                                                                                                                                                              | `3`                           |
| **% Mismatches of total alignment** | The percentage of mismatches relative to the total alignment length between the identified segment and the reference.                                                                                                                   | `1.5%`                        |
| **% identity**                  | The percentage of identical bases between the identified segment and the reference over the full alignment.  | `98.5%` |
| **BTOP**                        | BLAST traceback string that describes the exact location of substitutions, insertions, and deletions in the alignment.| `10AG5GT3` |
| **SNPs**                        | The number of single nucleotide polymorphisms (SNPs) relative to the reference. | `2` |
| **Insertions**                  | The number of insertions relative to the reference. | `1` |
| **Deletions**                   | The number of deletions relative to the reference. | `0` |
| **Mapping tool**                        | The name(s) of the mapping tool(s) used for gene segment annotation. | `Minimap2` |
| **Function**                    | The functional classification of the segment: "F/ORF" for functional/open reading frame, "P" for potentially functional/open reading frame, or "pseudogene" if an early stop codon is detected.                                               | `F/ORF`                       |
| **Status**                      | Indicates whether the gene segment is classified as **Known** or **Novel**. | `Novel`|
| **Message**                     | A generated message for the segment if stop codons are detected at critical positions. | `The STOP-CODON at the 3' end of the V-REGION can be deleted by rearrangement`  |
| **Population**                  | The population group associated with the sample, if metadata is provided. | `Dutch` |                                                  

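The **BTOP** column can be decoded to cross-check the mismatch and identity columns. The sketch below assumes the standard BLAST traceback encoding: a number is a run of matching positions, and each two-character pair is one aligned position written as query character followed by subject character, with `-` marking a gap. Whether a gap counts as an insertion or a deletion in the report depends on which sequence VDJ-Insights uses as the query, so the sketch reports gaps per side instead. The function names and the example string are illustrative.

```python
import re

# One token is either a run of matches (digits) or a two-character pair.
_TOKEN = re.compile(r"(\d+)|([A-Za-z*-]{2})")
_VALID = re.compile(r"(?:\d+|[A-Za-z*-]{2})*")


def parse_btop(btop):
    """Count matches, substitutions, and gaps in a BTOP string."""
    if not _VALID.fullmatch(btop):
        raise ValueError(f"not a valid BTOP string: {btop!r}")
    counts = {"matches": 0, "substitutions": 0,
              "gap_in_query": 0, "gap_in_subject": 0}
    for run, pair in _TOKEN.findall(btop):
        if run:
            counts["matches"] += int(run)
        elif pair[0] == "-":
            counts["gap_in_query"] += 1
        elif pair[1] == "-":
            counts["gap_in_subject"] += 1
        else:
            counts["substitutions"] += 1
    return counts


def percent_identity(counts):
    """Matches as a percentage of the total alignment length."""
    length = sum(counts.values())
    return 100.0 * counts["matches"] / length if length else 0.0


counts = parse_btop("10AG5GT3")  # 18 matches, 2 substitutions
print(counts, percent_identity(counts))
```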
## Web interface report 
The pipeline includes an interactive web interface for visualizing and exploring the annotation results. The web-based Flask report can be generated and opened using the following command:

```bash
python -m vdj_insights html -i /path/to/output --show
```
or
```bash
vdj_insights html -i /path/to/output --show
```

## Citing VDJ-Insights
If VDJ-Insights contributes to your research, please cite:
<cite>

## Acknowledgements
VDJ-Insights was developed by the department of Comparative Genetics & Refinement of the Biomedical Primate Research Centre ([BPRC](https://www.bprc.nl/en)) in Rijswijk, the Netherlands.

- [@Jesse Mittertreiner](https://github.com/AntiCakejesCult)
- [@Sayed Jamiel Mohammadi](https://github.com/sayedjm)
- [@Giang Le](https://github.com/GiangLeN)
- [@SusanOtt](https://github.com/SusanOtt)
- [@Jesse Bruijnesteijn](https://github.com/JesseBNL)

