Metadata-Version: 2.4
Name: sear-pipeline
Version: 0.1.0
Summary: A Structural Error-Aware Refinement Framework for Long-read Genome Assemblies
Author-email: zhangyixin <camille186093469@gmail.com>
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.7
Description-Content-Type: text/markdown

# Genome Assembly Polishing & Scaffolding Pipeline

This repository provides an automated/step-by-step pipeline to detect assembly errors, scaffold contigs using a reference genome, and perform hybrid polishing utilizing both PacBio HiFi long reads and Illumina short reads.

---

## Dependencies

Ensure the following tools are installed and available in your `$PATH`:
* [Inspector](https://github.com/Maggi-Chen/Inspector.git)
* [RagTag](https://github.com/malonge/RagTag)
* [Meryl](https://github.com/marbl/meryl)
* [Winnowmap](https://github.com/marbl/Winnowmap)
* [Samtools](https://github.com/samtools/samtools)
* [Yak](https://github.com/lh3/yak)
* [NextPolish2](https://github.com/Nextomics/NextPolish2)

---

## Step-by-Step Usage

Please export your environmental variables or modify the paths in the commands below before running.

### Environment Setup (Example)
```bash
# Define your project directory and sample ID
export PROJECT_DIR="/path/to/your/workspace"
export SAMPLE_ID="YourSampleName"

# Define raw data inputs
export HIFI_READS="${PROJECT_DIR}/data/hifi.fastq.gz"
export NGS_R1="${PROJECT_DIR}/ngs/short_reads_1.fastq"
export NGS_R2="${PROJECT_DIR}/ngs/short_reads_2.fastq"
export REF_FASTA="/path/to/reference/T2T-CHM13v2.0.fasta"

# Define initial assembly input
export INIT_ASM="${PROJECT_DIR}/data/initial_assembly.fasta"
```

### Step 1: Detect Assembly Errors (Parallel Inspector)

Run the parallelized inspector script to identify structural errors in the initial assembly.

Bash

```
nohup python parallel_inspector.py \
    -i ${HIFI_READS} \
    -c ${INIT_ASM} \
    -o sear \
    --num_subsets 4 \
    --proportion 0.3 \
    --max_concurrent 4 \
    > inspector.log 2>&1 &
```

### Step 2: Filter and Cut Assembly Errors

Filter the detected error regions and slice the problematic contigs.

Bash

```
# Filter errors
./filter_errors.sh \
    -p inspector_out_sear_part \
    -m ${SAMPLE_ID} \
    -n 4 \
    -v 2

# Cut assembly at error positions
bash cut_error.sh \
    -m ${SAMPLE_ID} \
    -i final_best_list_${SAMPLE_ID}.bed \
    -r ${INIT_ASM}
    
# Expected Output: ${PROJECT_DIR}/data/${SAMPLE_ID}_FINAL_CORRECTED.fasta
```

### Step 3: Reference-guided Scaffolding (RagTag)

Scaffold the corrected assembly using a reference genome.

Bash

```
export CORRECTED_ASM="${PROJECT_DIR}/data/${SAMPLE_ID}_FINAL_CORRECTED.fasta"

nohup ragtag.py scaffold \
    -o ragtag_output \
    ${REF_FASTA} \
    ${CORRECTED_ASM} \
    -t 32 > ragtag.log 2>&1 &
```

### Step 4: Hybrid Polishing (NextPolish2)

This step performs the final hybrid accuracy polishing using both PacBio HiFi reads and Illumina short reads (NGS). It includes repetitive K-mer filtering (Meryl), long-read alignment (Winnowmap), short-read K-mer profiling (Yak), and the final consensus correction (NextPolish2).

```bash
export SCAFFOLD_ASM="ragtag_output/ragtag.scaffold.fasta"

# ---------------------------------------------------------------------
# Part 4.1: Long-Read Alignment with Repetitive K-mer Filtering (Winnowmap)
# ---------------------------------------------------------------------
# Count and filter high-frequency 15-mers to mask repetitive regions during alignment
meryl count k=15 output merylDB ${SCAFFOLD_ASM}
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

# Align HiFi reads using Winnowmap
nohup winnowmap -t 64 -W repetitive_k15.txt -ax map-pb \
    ${SCAFFOLD_ASM} \
    ${HIFI_READS} \
    > hifi.winnowmap.sam 2>winnowmap.log &

# Convert SAM to sorted BAM and index it
samtools view -@ 16 -b hifi.winnowmap.sam | samtools sort -@ 16 -o hifi.winnowmap.bam
samtools index hifi.winnowmap.bam


# ---------------------------------------------------------------------
# Part 4.2: Short-Read K-mer Counting (Yak)
# ---------------------------------------------------------------------
# Generate K-mer profiles (K=21 and K=31) from short reads for accuracy calibration
nohup yak count -o sr.k21.yak -k 21 -b 37 ${NGS_R1} ${NGS_R2} > yak_k21.log 2>&1 &
nohup yak count -o sr.k31.yak -k 31 -b 37 ${NGS_R1} ${NGS_R2} > yak_k31.log 2>&1 &


# ---------------------------------------------------------------------
# Part 4.3: Final Polish Execution
# ---------------------------------------------------------------------
# Ensure hifi.winnowmap.bam, sr.k21.yak, and sr.k31.yak are completely generated before running
nohup nextPolish2 -t 32 \
    -r hifi.winnowmap.bam \
    ${SCAFFOLD_ASM} \
    sr.k21.yak sr.k31.yak \
    -o ${SAMPLE_ID}_ragtag.np2.fa \
    > np2.log 2>&1 &
```

## 📂 Expected Output Directory Structure

```
.
├── inspector_out/              # Error detection outputs
├── ragtag_output/              # RagTag scaffolding results
│   └── ragtag.scaffold.fasta   # Scaffolded genome
├── merylDB/                    # Meryl K-mer database
├── repetitive_k15.txt          # Filtered repetitive K-mers
├── hifi.winnowmap.bam          # Sorted and indexed BAM alignment
├── sr.k21.yak / sr.k31.yak     # Yak short-read K-mer profiles
└── [SampleID]_ragtag.np2.fa    # Final Polished Assembly (Success 🏁)
```
