Metadata-Version: 2.4
Name: sc-arcane
Version: 0.1.1a0
Summary: alignment free scRNAseq expression estimation
Author-email: Jens Zentgraf <zentgraf@cs.uni-saarland.de>, Johanna Elena Schmitz <jschmitz@cs.uni-saarland.de>, Sven Rahmann <rahmann@cs.uni-saarland.de>
License: MIT License
        
        Copyright (c) 2019-2025 Jens Zentgraf & Johanna Schmitz & Sven Rahmann
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://gitlab.com/rahmannlab/arcane
Project-URL: Bug Tracker, https://gitlab.com/rahmannlab/arcane/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: numba
Requires-Dist: pytest
Requires-Dist: jsonargparse>=4.29.0
Requires-Dist: pip
Dynamic: license-file

# Arcane: Alignment-free single cell RNA-seq gene expression estimation

Arcane is a lightweight, alignment-free tool for scRNA-seq quantification.

In case of problems, please file an issue in the [issue tracker](https://gitlab.com/rahmannlab/arcane/issues).

See CHANGELOG.md for recent changes.
Thank you!

----


# Usage Guide

`arcane` is a multi-command tool with several subcommands (like git), in particular
- `arcane index` builds an index (a bucketed 3-way Cuckoo hash table)
- `arcane express` processes a sample and generates a count matrix

It is a good idea to run `arcane express --help` to see all available options.\
Using `--help` works on any subcommand.

### Installation guide

Our software can be obtained by cloning the public git repository:
```
git clone https://gitlab.com/rahmannlab/arcane
```
To run our software, a [conda](https://docs.conda.io/en/latest/) environment with the required libraries needs to be created.\
A list of needed libraries is provided in the ``environment.yml`` file in the cloned repository;\
it can be used to create a new environment:

```
cd arcane  # the directory of the cloned repository
conda env create
```
which will create an environment named ``arcane`` with the required dependencies,
using the provided ``environment.yml`` file in the same directory.

After all dependencies are downloaded, activate the environment and install the package from the repository into this environment.\
Make sure that you are in the root directory of the cloned repository (the one containing `README.md` and `CHANGELOG.md`) and run
```
conda activate arcane  # activate environment
pip install -e .  # install arcane package using pip
```

### Prebuilt index
We provide two prebuilt indices, one for human and one for mouse data.
Each index contains all gapped *k*-mers (*k*=31, *w*=43, mask `####_#_##_###_#_###_###_###_#_###_##_#_####`) of all sequences provided by the GTF file filtered by CellRanger.
The human and mouse indices can be downloaded [here](https://doi.org/10.5281/zenodo.17663235).
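
To make the mask notation concrete, here is a minimal Python sketch. It assumes that `#` marks a position that contributes to the *k*-mer and `_` marks an ignored position, which is consistent with *k*=31 and *w*=43 for the mask above. The helper names `mask_shape` and `gapped_kmer` are hypothetical and not part of the `arcane` package.

```python
MASK = "####_#_##_###_#_###_###_###_#_###_##_#_####"


def mask_shape(mask: str) -> tuple[int, int]:
    """Return (k, w): the number of '#' positions and the total mask width."""
    return mask.count("#"), len(mask)


def gapped_kmer(window: str, mask: str) -> str:
    """Keep only the bases of `window` that sit under a '#' in `mask`."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(base for base, m in zip(window, mask) if m == "#")


k, w = mask_shape(MASK)
print(k, w)  # 31 43

# Toy example with a short mask: the middle base is skipped.
print(gapped_kmer("ACGTA", "##_##"))  # ACTA
```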

### How to classify
![](mapping.png)
To compute the count matrix of a sample (read files `R1` and `R2`), make sure you are in an environment where *arcane* and its dependencies are installed (see **Installation guide**).\
In addition, an index must be available: either a downloaded one (here `myindex.filter` and `myindex.info`) or a custom one that you built yourself (see **How to build a custom index**).\
Then run the `arcane express` command with this index, e.g.,
```
arcane express --index myindex --R1 $R1-files --R2 $R2-files --out outfolder -c v3
```
assuming that your sample was generated with 10x Genomics Chromium chemistry v3 (10x v2-v4 is currently supported).

The parameter `--out` is required and defines the prefix for all output files; this can be a combination of path and file prefix, such as `/path/to/sorted/samplename`.
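
When many samples need to be processed, the call above can be assembled programmatically. The following is a minimal Python sketch that uses only the options shown above. The function name `express_command` and the file names are hypothetical placeholders, and it assumes one FASTQ file per read.

```python
import shlex


def express_command(index, r1, r2, out, chemistry):
    """Build the argument list for one `arcane express` call."""
    return [
        "arcane", "express",
        "--index", str(index),
        "--R1", str(r1),
        "--R2", str(r2),
        "--out", str(out),  # prefix for all output files
        "-c", chemistry,    # 10x chemistry, e.g. "v3"
    ]


cmd = express_command(
    "myindex",
    "sample_R1.fastq.gz",  # hypothetical file name
    "sample_R2.fastq.gz",  # hypothetical file name
    "results/sample",
    "v3",
)
print(shlex.join(cmd))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)` inside the activated environment.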


Use
```
arcane express --help
```
to get a full list of optional parameters.


### How to build a custom index
![](index.png)
To build an index for `arcane`, several parameters must be provided, which are described in the following.

First, a file name and a path for the index must be chosen.
The index is stored in two files. We will use `myindex` to store the index in the current folder.

Second, a reference file and an associated annotation file are required.
The reference and annotation files should have been filtered by CellRanger.
We also need a name and either the *k*-mer size or the mask with which the index should be built.

```
arcane filter --fasta REFERENCE --gtf ANNOTATION --mask '####_#_##_###_#_###_###_###_#_###_##_#_####' --name NAME --prefix outfolder
```

This creates a new FASTA file `prefix/arcane_NAME_ref.fa.gz`, where `prefix` is the folder given by `--prefix` and `NAME` is the value of `--name`.

To build the index, run:
```
arcane index --index myindex --ref REFERENCE --mask '####_#_##_###_#_###_###_###_#_###_##_#_####' -n 2_000_000_000
```

We must specify the size of the hash table:

- `-n` or `--nobjects`: number of *k*-mers that will be stored in the hash table. This depends on the reference genomes used and must be estimated beforehand! As a precise estimate of the number of distinct *k*-mers can be difficult to obtain, you can be on the safe side and provide a generously large estimate, examine the final (low) load factor, and then rebuild the index with a smaller `-n` parameter to achieve the desired load. There are also tools that quickly estimate the number of distinct *k*-mers in large files, such as [ntCard](https://github.com/bcgsc/ntCard) or [KmerEstimate](https://github.com/srbehera11/KmerEstimate). As a guide: the human genome contains roughly 2.5 billion 25-mers.
**This option must be specified; there is no default!**
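
For a small reference, the number of distinct *k*-mers can simply be counted exactly. The following Python sketch illustrates what `-n` is meant to approximate. It counts contiguous *k*-mers exactly as given; whether `arcane` additionally merges a *k*-mer with its reverse complement or applies the gapped mask at this step is not stated here, so treat the result as a rough guide only. The function name `count_distinct_kmers` is hypothetical. For large genomes, use one of the estimation tools mentioned above instead.

```python
VALID_BASES = frozenset("ACGT")


def count_distinct_kmers(sequences, k):
    """Count distinct contiguous k-mers over an iterable of DNA strings.

    Windows containing any character other than A, C, G, T are skipped.
    """
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if VALID_BASES.issuperset(kmer):
                seen.add(kmer)
    return len(seen)


# Toy input: two short sequences, the second contains an ambiguous base N.
toy = ["ACGTACGTAC", "ACGTNACGTT"]
n_estimate = count_distinct_kmers(toy, 4)
print(n_estimate)  # 5
```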


We may further specify additional properties of the hash table:

- `-b` or `--bucketsize` indicates how many elements can be stored in one bucket (or page). This is 4 by default.

- `--fill` between 0.0 and 1.0 describes the desired fill rate or load factor of the hash table.
Together with `-n`, the number of slots in the table is calculated as `ceil(n/fill)`. In our experiments we used 0.88. (The number of buckets is then the smallest odd integer that is at least `ceil(ceil(n/fill)/b)`, where `b` is the bucket size.)

- `--aligned` or `--unaligned`: indicates whether each bucket should consume a number of bits that is a power of 2. Using `--aligned` ensures that each bucket stays within the same cache line, but may waste space (padding bits); this is faster, but may require (much!) more space. With `--unaligned`, no bits are used for padding and buckets may cross cache line boundaries. This is slightly slower, but may save a little or a lot of space (depending on the bucket size in bits). The default is `--unaligned`, because the speed decrease is small and the memory savings can be significant.

- `--hashfunctions` defines the parameters for the hash functions used to store the key-value pairs. If the parameter is unspecified, different random functions are chosen each time. The hash functions can be specified using a colon separated list: `--hashfunctions linear945:linear9123641:linear349341847`. It is recommended to have them chosen randomly unless you need strictly reproducible behavior, in which case the example given here is recommended.
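
The table-size arithmetic described for `--fill` and `--bucketsize` above, and the padding effect of `--aligned`, can be sketched in a few lines of Python. This is an illustration of the formulas in this section, not code taken from `arcane`. The function names are hypothetical, and `slot_bits` (the number of bits per stored element) is an assumed value, since the actual slot width is not documented here.

```python
from math import ceil


def table_geometry(n: int, fill: float, bucketsize: int = 4) -> tuple[int, int]:
    """Return (slots, buckets) for n elements at the desired fill rate."""
    if not 0.0 < fill <= 1.0:
        raise ValueError("fill must be in (0.0, 1.0]")
    slots = ceil(n / fill)
    buckets = ceil(slots / bucketsize)
    if buckets % 2 == 0:
        buckets += 1  # smallest odd integer that is at least the quotient
    return slots, buckets


def bucket_bits(slot_bits: int, bucketsize: int, aligned: bool) -> int:
    """Bits used by one bucket; aligned buckets are padded to a power of 2."""
    bits = slot_bits * bucketsize
    if aligned:
        bits = 1 << (bits - 1).bit_length()  # next power of 2 that is >= bits
    return bits


# Values from this guide: -n 2_000_000_000, --fill 0.88, bucket size 4.
slots, buckets = table_geometry(2_000_000_000, 0.88, 4)
print(slots, buckets)  # 2272727273 568181819

# Assumed slot width of 30 bits, for illustration only.
print(bucket_bits(30, 4, aligned=False))  # 120
print(bucket_bits(30, 4, aligned=True))   # 128
```

With these assumed numbers, aligned buckets would carry 8 padding bits each; with less favorable slot widths the overhead can be much larger.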

Most of the parameters can also be provided in a config file (`.yaml`):
- `--cfg` or `--config` defines the path to the config file.


### Load index into shared memory

To load the index into shared memory, so that several *arcane* instances can run in parallel without each one adding the index size to the memory footprint, run
```
arcane load --name myindex
```
where `myindex` is the path to a prebuilt index.

To remove the index from shared memory, run
```
arcane remove --name myindex
```

### Reproduce results from the paper

To reproduce all results, check the `workflow` folder for more information.
