A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing


*Indicates Equal Contribution


Abstract

Nanopore direct RNA sequencing has revolutionized transcriptomics, but artificial chimeric reads compromise the integrity of its data. We present DeepChopper, a novel large language model tailored for biological sequences that accurately identifies and removes artificial sequences in nanopore direct RNA sequencing data without relying on alignment information. DeepChopper's hybrid architecture combines HyenaDNA for long-range dependency modeling with quality-aware processing, achieving both broad context understanding and single-nucleotide resolution. Across multiple cell lines and sequencing platforms, DeepChopper reduced chimeric reads by 62-84% and improved supporting rates from 8-19% to 43-55% compared to existing methods. In gene fusion detection in particular, DeepChopper reduced false positives by 89% while increasing the proportion of supported fusions from 2% to 17%. By improving data quality, DeepChopper makes downstream analyses more reliable, particularly in cancer genomics and transcriptomics. This work demonstrates the potential of large language models for analyzing complex biological data, paving the way for advancements in genomics and biotechnology.
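
To make "identifies and removes artificial sequences" at single-nucleotide resolution more concrete, the sketch below shows one plausible post-processing step: given a per-base flag marking artificial bases, a read is split into the non-artificial segments that remain. This is only an illustration under stated assumptions. The function name `chop_read`, the `min_len` parameter, and the per-base boolean input are hypothetical and are not taken from DeepChopper itself; the abstract does not describe how predictions are turned into trimmed reads.

```python
from itertools import groupby


def chop_read(seq, qual, is_artifact, min_len=1):
    """Split a read at runs of bases flagged as artificial.

    seq         -- nucleotide string
    qual        -- quality string of the same length as seq
    is_artifact -- per-base flags (truthy = artificial); a hypothetical
                   stand-in for a model's per-nucleotide predictions
    min_len     -- drop retained segments shorter than this (assumed knob)

    Returns a list of (sequence, quality) tuples for the retained segments.
    """
    if not (len(seq) == len(qual) == len(is_artifact)):
        raise ValueError("seq, qual and is_artifact must have equal length")

    segments = []
    pos = 0
    # groupby yields consecutive runs of identical flags, in read order.
    for flagged, run in groupby(is_artifact, key=bool):
        run_len = sum(1 for _ in run)
        if not flagged and run_len >= min_len:
            segments.append((seq[pos:pos + run_len], qual[pos:pos + run_len]))
        pos += run_len
    return segments


if __name__ == "__main__":
    # Toy read with a two-base artificial run in the middle.
    flags = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
    for segment, segment_qual in chop_read("ACGTACGTAA", "IIIIIHHHHH", flags):
        print(segment, segment_qual)
```

Splitting rather than discarding the read keeps the genuine segments on either side of an internal artificial run, which is the case that produces chimeric reads; whether short leftovers should be kept is a design choice exposed here through the assumed `min_len` threshold.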
