Metadata-Version: 2.4
Name: polygenic-pgx
Version: 2.5.1
Summary: Polygenic score toolkit
Home-page: https://github.com/marpiech/polygenic
Author: Marcin Piechota, Wojciech Galan
Author-email: piechota@intelliseq.com
License: Intelliseq dual licenses this package. For commercial use, please contact [contact @ intelliseq.com](mailto:contact@intelliseq.com). For non-commercial use, this license permits use of the software only by government agencies, schools, universities, non-profit organizations or individuals on projects that do not receive external funding other than government research grants and contracts. Any other use requires a commercial license. For the full license, please see [LICENSE.md](https://github.com/intelliseq/polygenic/blob/master/LICENSE.md), in this source repository.
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: Free for non-commercial use
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: progressbar2
Requires-Dist: python-utils
Requires-Dist: pysam
Requires-Dist: pytabix
Requires-Dist: pandas
Requires-Dist: plotly
Requires-Dist: kaleido
Requires-Dist: DotMap
Requires-Dist: pyyaml
Requires-Dist: tqdm
Requires-Dist: plotnine
Requires-Dist: kaleido
Requires-Dist: polars
Requires-Dist: importlib-resources
Requires-Dist: logdecorator
Requires-Dist: pyarrow
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# polygenic - the polygenic scores toolkit

## Basic info
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/polygenic.svg)](https://pypi.python.org/pypi/polygenic/)  
[![PyPI](https://img.shields.io/pypi/v/polygenic.svg)](https://pypi.python.org/pypi/polygenic)  
[![Maintainer]](https://img.shields.io/badge/maintainer-marpiech-blue)  

## Downloads
- pip [![PyPI download month](https://img.shields.io/pypi/dm/polygenic.svg)](https://pypi.python.org/pypi/polygenic/) 
- docker with data [![Docker](https://img.shields.io/docker/pulls/marpiech/polygenictk.svg)](https://hub.docker.com/repository/docker/marpiech/polygenictk) 
- docker without data [![Docker](https://img.shields.io/docker/pulls/intelliseq/polygenic.svg)](https://hub.docker.com/repository/docker/intelliseq/polygenic)

## Index
* [Summary](#summary)
* [Diplotyping Algorithm](#diplotyping-algorithm)
* [Installation](#installation)
  * [With pip](#with-pip)
  * [With conda](#with-conda)
  * [With docker](#with-docker)
* [Quick start guide](#quick-start-guide)
* [Manual](#manual)
  * [Tools](#tools)
    * [pgs-compute](#pgs_compute)
    * [pgs-build](#pgs_build)
    * [pgs-validate](#pgs_validate)
    * [vcf-index](#vcf_index)
    * [vcf-validate](#vcf_validate)
    * [vcf-stat](#vcf_stat)
    * [model-biobankuk](#model_biobankuk)
    * [model-pgscat](#model_pgscat)
    * [model-gbe](#model_gbe)
    * [model-pharmvar](#model_pharmvar)
  * [Docker images](#docker_images)
  * [Building models](#building_models)
  * [Example models](#example_models)
  * [Usecases](#usecases)
    * [pgx](#pgx)
* [License](#license)
* [Updates](#updates)

## Summary
Polygenic is a toolkit for a wide range of polygenic scores analysis tasks. The most important use cases include computing scores for samples in vcf files, building scores for GWAS results or fetching scores from repositories.

## Diplotyping Algorithm

We begin by reading individual genetic variants (genotypes) from the patient's VCF file, where each genotype carries two alleles — one per chromosome. The system supports phasing using a custom reference panel to resolve which alleles sit on the same chromosome; when phased data is available, the algorithm preserves this linkage information, while for unphased data both chromosomes are treated symmetrically. We define haplotypes as specific combinations of co-inherited variants that form recognized gene versions, such as pharmacogenomic star alleles, where each haplotype definition distinguishes core defining variants (weighted at 1.0) from supportive sub-lineage variants (weighted at 0.05). The algorithm scores every candidate haplotype against the patient's alleles, then selects the best-matching haplotype for the first chromosome — keeping all candidates within a 2% margin of the top score. The matched alleles are then "claimed" by that haplotype, and the remaining unmatched alleles (leftovers) are passed into a second round, where up to 100 candidate haplotypes are re-scored against only those residual alleles to identify the second haplotype. Each first/second haplotype pair is ranked by combined match percentage and filtered by total missing data, producing the final diplotype call — e.g., CYP2D6 \*1/\*4. Every variant in the result carries a source label — direct genotyping, LD proxy, imputation flag from VCF, allele-frequency-based imputation, reference, or missing — enabling granular quality control and full traceability of the diplotype call.

### Missing variant handling (`--ref-fallback`)

By default, variants called `./.` in the input VCF are treated as **missing** (`source: "missing"`) — they are subtracted from both the numerator and the denominator of the haplotype match score, so they neither support nor penalize any candidate. This matches the approach taken by [PharmCAT's Named Allele Matcher](https://pharmcat.clinpgx.org/methods/NamedAlleleMatcher-101/), which explicitly drops missing positions rather than imputing them: *"If the sample data has missing positions that are required by a named allele definition, the position will be dropped from consideration."*

Passing `--ref-fallback` restores the legacy behavior of filling `./.` with homozygous reference (`source: "reference"`). This is almost always the **wrong** default for pharmacogenomic panels. The canonical example is **CYP2C19**: the variant rs3758581 at chr10:94842866 has GRCh38 reference `A`, but `G` is the major allele in every population and PharmVar v5 defines every "real" star allele (\*1, \*2, \*17 …) as requiring `G` at this position, reserving CYP2C19\*38 (= old \*1.001) for the minority carrying the reference `A` (see [PharmVar GeneFocus: CYP2C19](https://pmc.ncbi.nlm.nih.gov/articles/PMC7769975/)). If rs3758581 is not on the panel and the code fills it with the reference, every sample drifts toward \*38/\*38. Default off avoids this trap. Enable `--ref-fallback` only when your panel's `./.` genuinely means "0/0 and I chose not to spell it out" — for example when re-running test fixtures authored under the old default.

### Match-confidence gate (`--top-n`)

After scoring all candidate haplotype pairs, the algorithm returns a `haplotype_id` when **either** of these holds:
- the best pair's per-chromosome `max_percent_match` is `≥ 50%` (high-confidence call), **or**
- the total scored-candidate pool size is `≤ top_n` (default **15**) — a small, non-ambiguous candidate space is itself a confidence signal.

Otherwise `haplotype_id` is `None` (the caller refuses to guess). This is more conservative than [PharmCAT](https://pharmcat.clinpgx.org/methods/NamedAlleleMatcher-101/), which always returns the top-scoring diplotype with deterministic tie-breaking, and roughly comparable to [Aldy 4](https://pmc.ncbi.nlm.nih.gov/articles/PMC9977157/), which reports "no more than three diplotypes" when it cannot fully disambiguate. Tune `--top-n` higher (laxer) or lower (stricter) depending on how much ambiguity is acceptable downstream. Set `--top-n 0` to rely solely on the 50% threshold.

### CYP2C19*38 worked example

On a clinical panel that does **not** assay rs3758581, running `pgs-compute` with the CYP2C19 PharmVar model on four real samples (three carrying non-reference CYP2C19 variants, one all-reference) produces these calls:

| Sample | Non-ref CYP2C19 variants | Default (no `--ref-fallback`) | Legacy (`--ref-fallback`) |
|---|---|---|---|
| Sample A | none | `*1.001/*1.001` (≡ \*1/\*1) | `*1.001/*1.001` |
| Sample B | rs12769205, rs4244285 (het) | `*2/*1.001` ✓ | `*1.001/*1.001` ✗ |
| Sample C | rs12248560 (hom) | `*17/*17` ✓ | `*1.001/*1.001` ✗ |
| Sample D | rs12769205, rs4244285 (het) | `*2/*1.001` ✓ | `*1.001/*1.001` ✗ |

With the legacy `--ref-fallback` on, the missing rs3758581 becomes `A/A` (genomic reference), which does not match the `G` required by \*1, \*2, \*17 etc., so every sample collapses to the empty-haplotype \*1.001 (equivalent to \*38 in PharmVar v5 terms). With the new default, rs3758581 stays missing, the scoring ignores it, and the panel's actual informative variants drive the call.

## Installation
### With pip
#### Install for user account
```
python3 -m pip install --upgrade polygenic
```
#### Install globally
```
sudo -H python3 -m pip install polygenic
```
### With conda
Run conda image
```
docker run -it conda/miniconda3 /bin/bash
```
Create python3.8 environment and install polygenic
```
yes | conda create --name py38 python=3.8
eval "$(conda shell.bash hook)"
conda activate py38
### should be 3.8
python --version

### gcc is missing to build pytabix
apt -qq update
apt -y install build-essential tabix

pip install polygenic
```
### With docker
#### Large image with all data included
```
docker run intelliseq:polygenictk:2.1.0 *command*
```
#### Thin image with just polygenic package installed
```
docker run intelliseq:polygenic:2.1.0 *command*
```
## Quick start guide
```
mkdir polygenic && cd polygenic # create working directory
wget https://downloads.intelliseq.com/public/polygenic/gbe-INI78-bone-density.yml # download model
wget https://downloads.intelliseq.com/public/polygenic/illu_merged-imputed.vcf.gz # download genotypes
wget https://downloads.intelliseq.com/public/polygenic/illu_merged-imputed.vcf.gz.tbi # download position index
wget https://downloads.intelliseq.com/public/polygenic/illu_merged-imputed.vcf.gz.idx.db # download rsid index
docker run -v $(pwd):/data intelliseq/polygenic:latest --vcf /data/illu_merged-imputed.vcf.gz --model /data/gbe-INI78-bone-density.yml --output-directory /data # compute model
```
## Manual
### Tools
#### pgs-compute
```
usage: pgstk [-h] -i VCF [-m MODEL [MODEL ...]] [-p PARAMETERS] [-s SAMPLE_NAME] [-o OUTPUT_DIRECTORY] [-n OUTPUT_NAME_APPENDIX] [-l LOG_FILE] [--af AF] [--af-field AF_FIELD]
             [-v] [--print]

pgs-compute computes polygenic scores for genotyped sample in vcf format

optional arguments:
  -h, --help            show this help message and exit
  -i, --vcf VCF         vcf.gz file with genotypes
  -m, --model MODEL [MODEL ...]
                        path to .yml model (can be specified multiple times with space as separator)
  -p, --parameters PARAMETERS
                        parameters json (to be used in formula models)
  -s, --sample-name SAMPLE_NAME
                        sample name in vcf.gz to calculate
  -o, --output-directory OUTPUT_DIRECTORY
                        output directory
  -n, --output-name-appendix OUTPUT_NAME_APPENDIX
                        appendix for output file names
  -l, --log-file LOG_FILE
                        path to log file
  --af AF               vcf file containing allele freq data
  --af-field AF_FIELD   name of the INFO field to be used as allele frequency
  -v, --version         show program's version number and exit
  --print               Print output to stdout
```
### Arguments
#### Required
- `--vcf` vcf.gz file with genotypes (tabix index should be available)
- `--model` path to model file
#### Optional
- `--log_file` log file
- `--out_dir` directory for result jsons
- `--population` population code
- `--models_path` path to a directory containing models
- `--af` an indexed vcf.gz file containing allele freq data
- `--version` prints version of package

## Building models in yml

Index:
[Model structure](#model_structure)
[Model types](#model_types)
[Parameters](#parameters)


### Model structure
##### Core structure
Models have two properties which is `model` and `description`. `model` is a specification of computation to be performed and `description` is additional information to be included in the result.
```
model:
description:
```
##### Object keys
Each object that is not collection has a set of predefined keys (required or optional) that can be used for computation. For example: `diplotype_model` object has a required `diplotypes` key.
```
diplotype_model:
  diplotypes:
```
The computation is first delegated to key specified objects and later aggregated by the top level object itself.
##### Collections
There is special category of objects that don't have predefined keys but are collections. Each key within collection becomes element of collection. Collections are easy to recognize, because they are specified in plural form like `diplotypes` or `variants`. Each element of collection will be defined as singular object of collection type. For example key in `variants` collection will becomes objects of `variant` type.
```
      variants:
        rs7041: {diplotype: C/C}
        rs4588: {diplotype: T/T}
```
##### Variants
Variants can be identified by rsid. Variant value will be computed basing on information provided: `diplotype` or `effect_allele`.
Accepted sets of fields are:
- diplotypes
    - `diplotype`
    - `symbol`
- score
    - `effect_allele`
    - `effect_size`
    - `symbol`

### Model types
There are currently implemented four types of models:  
- `score_model`
- `diplotype_model`
- `haplotype_model`
- `formula_model`
The type of model can be specified at the top of yml structure or within the `model` field.  
##### Specification of model type at the top of yml structure
```
diplotype_model:
description:
```
##### Specification of model type within the `model` field
```
model:
  diplotype_model:
description:
```
### Parameters
External parameters can be used in `formula_model` through `@parameters` keyword.  
Example parameters file in `.json` format:
```
{"sex": "F"}
```
Path to file can be provided as argument to polygenic tool:
```
--parameters /path/to/parameters.json
```
Example of use of parameters in the `formula_model`:
```
formula_model:
  formula:
    value: "@female.score_model.value if @parameters.sex == 'F' else @male.score_model.value"
  male:
    score_model:
      variants:
        ...
  female:
    score_model:
      variants:
```
## Example models
### Example diplotype model
This example diplotype model is based on [Randolph 2014](https://pubmed.ncbi.nlm.nih.gov/24447085/).
```
diplotype_model:
  diplotypes:
    1/1:
      variants:
        rs7041: {diplotype: C/C}
        rs4588: {diplotype: T/T}
    1/1s:
      variants:
        rs7041: {diplotype: C/C}
        rs4588: {diplotype: T/G}
    1/1f:
      variants:
        rs7041: {diplotype: C/A}
        rs4588: {diplotype: T/G}
    1/2:
      variants:
        rs7041: {diplotype: C/A}
        rs4588: {diplotype: T/T}
    1s/1s:
      variants:
        rs7041: {diplotype: C/C}
        rs4588: {diplotype: G/G}
    1s/1f:
      variants: 
        rs7041: {diplotype: C/A}
        rs4588: {diplotype: G/G}
    1s/2:
      variants: 
        rs7041: {diplotype: C/A}
        rs4588: {diplotype: G/T}
    1f/1f: 
      variants: 
        rs7041: {diplotype: A/A}
        rs4588: {diplotype: G/G}
    1f/2: 
      variants: 
        rs7041: {diplotype: A/A}
        rs4588: {diplotype: G/T}
    2/2: 
      variants: 
        rs7041: {diplotype: A/A}
        rs4588: {diplotype: T/T}
description:
  pmid: 24447085
  genes: [GC]
  result_diplotype_choice:
    1/1: Moderate
    1/1s: High
    1/1f: High
    1/2: Low
    1s/1s: Very high
    1s/1f: Very high
    1s/2: Moderate
    1f/1f: Very high
    1f/2: Moderate
    2/2: Very low
```

### Example haplotype model

Haplotype model can be used for HLA and PGx.  
To define haplotype models a list of alleles is required (called `variants` in this case, to be consistent with othe rypes of models). Each allele has associated list of defining mutations (alternative SNV alles) defined by Gnomad ID along with `ref`, `alt` and `effect_allele` properties. One star allele should be empty (containing only reference SNV alleles). The algorithm will utilised any phasing information in the vcf.

```
haplotype_model:
  variants:
    CYP2D6*1.001:
    CYP2D6*1.002:
      22-42126963-C-T: {ref: "C", alt: "T", effect_allele: "T"}
    CYP2D6*1.003:
      22-42128813-G-A: {ref: "G", alt: "A", effect_allele: "A"}
    CYP2D6*1.004:
      22-42128216-G-T: {ref: "G", alt: "T", effect_allele: "T"}
    CYP2D6*1.005:
      22-42128922-A-G: {ref: "A", alt: "G", effect_allele: "G"}
    CYP2D6*1.006:
      22-42129726-A-C: {ref: "A", alt: "C", effect_allele: "C"}
      22-42129950-A-C: {ref: "A", alt: "C", effect_allele: "C"}
      22-42130482-C-A: {ref: "C", alt: "A", effect_allele: "A"}
```

For copy-number star alleles (CYP2D6 `*5`/`*1xN`, CYP2C19 `*36`/`*37`) the model gains a
`copy_number:` block and `structural:` haplotypes — see **[docs/pgx-cnv.md](docs/pgx-cnv.md)**
for the VCF contract and YAML schema.

### Example score model with categories rescaling
```
score_model:
  variants:
    rs10012: {effect_allele: G, effect_size: 0.369215857410143}
    rs1014971: {effect_allele: T, effect_size: 0.075546961392531}
    rs10936599: {effect_allele: C, effect_size: 0.086359830674748}
    rs11892031: {effect_allele: C, effect_size: -0.552841968657781}
    rs1495741: {effect_allele: A, effect_size: 0.05307844348342}
    rs17674580: {effect_allele: C, effect_size: 0.187520720836463}
    rs2294008: {effect_allele: T, effect_size: 0.08278537031645}
    rs798766: {effect_allele: T, effect_size: 0.093421685162235}
    rs9642880: {effect_allele: G, effect_size: 0.093421685162235}
  categories:
    High risk: {from: 1.371624087, to: 2.581880425, scale_from: 2, scale_to: 3}
    Potential risk: {from: 1.169616034, to: 1.371624087, scale_from: 1, scale_to: 2}
    Average risk: {from: -0.346748358, to: 1.169616034, scale_from: 0, scale_to: 1}
    Low risk: {from: -1.657132197, to: -0.346748358, scale_from: -1, scale_to: 0}
description:
  about: 
  genes: []
  result_statement_choice:
    Average risk: Avg
    Potential risk: Pot
    High risk: Hig
    Low risk: Low
  science_behind_the_test:
  test_type: Polygenic Risk Score
  trait: Breast cancer
  trait_authors:
    - taken from the PGS catalog
  trait_copyright: Intelliseq all rights reserved
  trait_explained: None
  trait_heritability: None
  trait_pgs_id: PGS000001
  trait_pmids:
    - 25855707
  trait_snp_heritability: None
  trait_title: Breast_Cancer
  trait_version: 1.0
  what_you_can_do_choice:
    Average risk:
    High risk:
    Low risk:
  what_your_result_means_choice:
    Average risk:
    High risk:
    Low risk:
 ```

#### Example Formula Model
```
formula_model:
  formula:
    brownexp: "math.exp(@brown.score_model.value - 2.0769)"
    redexp: "math.exp(@red.score_model.value - 6.3953)"
    blackexp: "math.exp(@black.score_model.value - 2.4029)"
    sumexp: "@brownexp + @redexp + @blackexp"
    brown_prob: "@brownexp / (1 + @sumexp)"
    red_prob: "@redexp / (1 + @sumexp)"
    black_prob: "@blackexp / (1 + @sumexp)"
    blonde_prob: "1 - (@brown_prob + @red_prob + @black_prob)"
  brown:
    score_model:
      variants:
        rs796296176: {effect_allele: CA, effect_size: 1.2522}
        rs11547464: {effect_allele: A, effect_size: -0.61155}
        rs885479: {effect_allele: T, effect_size: 0.2937}
        rs1805008: {effect_allele: T, effect_size: -0.50143}
        rs1805005: {effect_allele: T, effect_size: 0.21172}
        rs1805006: {effect_allele: A, effect_size: 1.9293}
        rs1805007: {effect_allele: T, effect_size: -0.32318}
        rs1805009: {effect_allele: C, effect_size: 0.60861}
        rs1805009: {effect_allele: A, effect_size: 0.25624}
        rs2228479: {effect_allele: A, effect_size: -0.054143}
        rs1110400: {effect_allele: C, effect_size: -0.56315}
        rs28777: {effect_allele: C, effect_size: 0.52168}
        rs16891982: {effect_allele: C, effect_size: 0.75284}
        rs12821256: {effect_allele: G, effect_size: -0.34957}
        rs4959270: {effect_allele: A, effect_size: -0.19171}
        rs12203592: {effect_allele: T, effect_size: 1.6475}
        rs1042602: {effect_allele: T, effect_size: 0.16092}
        rs1800407: {effect_allele: A, effect_size: -0.19111}
        rs2402130: {effect_allele: G, effect_size: 0.35821}
        rs12913832: {effect_allele: T, effect_size: 1.214}
        rs2378249: {effect_allele: C, effect_size: 0.12669}
        rs683: {effect_allele: C, effect_size: 0.21172}
  red:
    score_model:
      variants:
        rs796296176: {effect_allele: CA, effect_size: 25.508}
        rs11547464: {effect_allele: A, effect_size: 2.5381}
        rs885479: {effect_allele: T, effect_size: -0.20889}
        rs1805008: {effect_allele: T, effect_size: 2.801}
        rs1805005: {effect_allele: T, effect_size: 0.93493}
        rs1805006: {effect_allele: A, effect_size: 3.65}
        rs1805007: {effect_allele: T, effect_size: 3.4408}
        rs1805009: {effect_allele: C, effect_size: 4.5868}
        rs1805009: {effect_allele: A, effect_size: 22.107}
        rs2228479: {effect_allele: A, effect_size: 0.62307}
        rs1110400: {effect_allele: C, effect_size: 1.4453}
        rs28777: {effect_allele: C, effect_size: 0.70401}
        rs16891982: {effect_allele: C, effect_size: -0.41869}
        rs12821256: {effect_allele: G, effect_size: -0.57964}
        rs4959270: {effect_allele: A, effect_size: 0.24861}
        rs12203592: {effect_allele: T, effect_size: 0.90233}
        rs1042602: {effect_allele: T, effect_size: 0.45003}
        rs1800407: {effect_allele: A, effect_size: -0.27606}
        rs2402130: {effect_allele: G, effect_size: 0.28313}
        rs12913832: {effect_allele: T, effect_size: -0.093776}
        rs2378249: {effect_allele: C, effect_size: 0.76634}
        rs683: {effect_allele: C, effect_size: -0.053427}
  black:
    score_model:
      variants:
        rs796296176: {effect_allele: CA, effect_size: 2.732}
        rs11547464: {effect_allele: A, effect_size: -16.969}
        rs885479: {effect_allele: T, effect_size: 0.39983}
        rs1805008: {effect_allele: T, effect_size: -0.86062}
        rs1805005: {effect_allele: T, effect_size: -0.0029013}
        rs1805006: {effect_allele: A, effect_size: -16.088}
        rs1805007: {effect_allele: T, effect_size: -1.3757}
        rs1805009: {effect_allele: C, effect_size: 0.060631}
        rs1805009: {effect_allele: A, effect_size: 3.9824}
        rs2228479: {effect_allele: A, effect_size: 0.17012}
        rs1110400: {effect_allele: C, effect_size: 0.29143}
        rs28777: {effect_allele: C, effect_size: 0.82228}
        rs16891982: {effect_allele: C, effect_size: 1.1617}
        rs12821256: {effect_allele: G, effect_size: -0.89824}
        rs4959270: {effect_allele: A, effect_size: -0.36359}
        rs12203592: {effect_allele: T, effect_size: 1.997}
        rs1042602: {effect_allele: T, effect_size: 0.065432}
        rs1800407: {effect_allele: A, effect_size: -0.49601}
        rs2402130: {effect_allele: G, effect_size: 0.26536}
        rs12913832: {effect_allele: T, effect_size: 1.9391}
        rs2378249: {effect_allele: C, effect_size: -0.089509}
        rs683: {effect_allele: C, effect_size: 0.15796}
description:
  name: HirisPlex

```

### Description
### Model keys glossary
- `model` - generic model that can aggregate results of other model types  
- `diplotype_model` 
    Required keys:
    - `diplotypes`
- `description` - all properties to be included in the final results  

### Usecases

#### PGX

```
python3 -m pip install polygenic
pgstk pgs-compute --vcf [PATH_TO_VCF_GZ] --model cyp2d6-pharmvar.yml --print | jq .haplotype_model.haplotypes.match
```

## License
Proprietary (contact@intelliseq.pl)

## Updates
### 2.5.1
- PACKAGING: render the README on the PyPI project page (`long_description` from README.md, `text/markdown`).
### 2.5.0
- FEATURE: copy-number (CNV) star-allele calling for CYP2D6 (`*5`, `*1xN`) and CYP2C19 (`*36`/`*37`). `polygenic` consumes copy number resolved by an upstream caller (symbolic `<DEL>`/`<DUP>` ALT + range + `FORMAT/CN` + phase) and emits non-diploid calls (`*5/*5`, `*1/*5`, `*1/*4x2`). It does no depth/ratio math — allele-specific duplication side is read from phase; unphased input yields `*a/*b (CNx)` + `allele_specific_unphased`. See [docs/pgx-cnv.md](docs/pgx-cnv.md).
- FEATURE: `haplotype_model` gains a `copy_number:` block, `structural:` haplotype definitions (`scope: whole|partial`), and a CYP2D6-only `multiplication:` block.
- TEST: structural VcfRecord unit tests and CNV deletion/duplication/regression integration tests (`test_vcfrecord_structural.py`, `test_haplotypemodel_cnv.py`).
### 2.4.0
- FEATURE: `--ref-fallback` flag (opt-in; default off). `./.` genotypes now stay `source="missing"` by default — matches PharmCAT behavior and fixes spurious CYP2C19*38 calls on panels that don't assay rs3758581.
- FEATURE: `--top-n` flag (default 15). Haplotype caller returns a call when either the match ≥50% OR the scored candidate pool is ≤top_n, enabling calls on sparse panels without ambiguity risk.
- BUG: ScoreModel.compute no longer KeyErrors on `adjusted_score` when a variant has no `af` field (model.py:549).
- BUG: Diplotype QC now correctly counts variant sources (previously miscounted every variant as "missing" due to structural mismatch in compute_qc).
- BUG: ScoreModel handles models without `args` section (previously AttributeError on `self.get("args").get("prevalence")`).
- TEST: Added PGx truth-set integration tests against 1000 Genomes phased GRCh38 slices (NA12878, NA18507, NA19240 × {CYP2C19, CYP2C9, CYP2B6, CYP2D6, SLCO1B1}); rebuildable via `scripts/build_pgx_fixtures.py`.
- TEST: Added PharmVar-named haplotype tests and NA18507 CYP2B6/SLCO1B1 regression tests.
### 2.3.17
- BUG: resolved bug with low weight for missing genotypes
### 2.3.16
- BUG: resolved bug with weight of genotypes in haplotypes
### 2.3.15
- BUG: resolved bug with not enough haplotypes to check
### 2.3.14
- BUG: resolved bug with wrong leftover genotypes
### 2.3.12
- FEATURE: added gene names to genotypes if available
### 2.3.11
- BUG: resolved bug with wrong genotype sources counts
### 2.3.10
- BUG: resolved bug with missing genotype sources counts
### 2.3.9
- FEATURE: add reference as a genotyping source
### 2.3.8
- BUG: resolved bugs inside mobigen wdl task
### 2.3.7
- FEATURE: added ldproxy imputation source
### 2.3.6
- BUG: resolved bug with missing polars package after installation
### 2.3.5
- BUG: resolved bug with 'type' object is not subscriptable running pgstk
### 2.3.4
- BUG: resolved bug with where model does not provide to or from category fields
### 2.3.3
- BUG: resolved bug with missing pyarrow package after installation
### 2.3.2
- BUG: renamed jpg to jpeg outputs from vcfstat
### 2.3.1
- BUG: resolved bug with missing importlib-resources package after installation
### 2.3.0
- FEATURE: added vcf stat tool for zygosities
- FEATURE: added vcf stat tool for baf computation
### 2.2.15
- UPDATE: updated parsing for new version of pan biobankuk
- DEV: updated numpy version to 1.23.4
### 2.2.14
- FEATURE: added module for ldproxy imputing
- FEATURE: added option for merging output as an array instead of dictionary in pgs-compute
### 2.2.13
- BUG: resolved bug with missing score in haplotype model
- DEV: cleaned up test resources
### 2.2.12
- BUG: resolved bug with empty argument in executable
### 2.2.11
- BUG: resolved bug with naming of multiple models in one file
### 2.2.10
- DOC: improved diploty model documentation
### 2.2.9
- BUG: missing effect allele in diplotyp models
### 2.2.8
- BUG: imputed source is based on IMP tag in the INFO field or GT:DS in format field
### 2.2.7
- BUG: repaired bug with missing math library in eval
### 2.2.6
- FEATURE: added qc to model results
### 2.2.5
- ENHACEMENT: libraries updates
### 2.2.0
- ENHANCEMENT: better computing of haplotype models. First one haplotype is identified and further the second haplotype is identified from leftover genotypes
- ENHANCEMENT: moved argparse from tools to pgstk
### 2.1.10
- BUG: resolved bug with wrong plink.clumped path in clumping
### 2.1.9
- BUG: resolved bug with missing index in biobankuk model
### 2.1.8
- BUG: resolved bug with biobankuk model for codenames with special characters
### 2.1.7
- BUG: resolved bug with haplotype model where none of haplotypes matched genotype. Most probable genotype is provided
### 2.1.6
- DOC: added docker badges
- FEATURE: added posibility to output all pgs results in one json file `--merge-outputs`
- FEATURE: added category to diplotype model
- FEATURE: added caching in genotyping module
### 2.1.5
- BUG: biobankuk model output files now contain only alphanumeric characters
- BUG: biobankuk model code names with special characters are now being downloaded
### 2.1.4
- FEATURE: added model_name and sample_name to description
### 2.1.3
- FEATURE: added support for multiple models in pgs-compute
- FEATURE: added missing variants count to haplotype in haplotype model
- BUG: id field in haplotype model
### 2.1.2
- FEATURE: allow gnomadid for variant in yml models
- FEATURE: added printing output option in pgs-compute
### 2.1.1
- BUG: resolved NoneType bug with empty haplotype
### 2.1.0
- FEATURE: haplotype model now works with phased data
### 2.0.0
- FEATURE: switched to yaml model definitions
- FEATURE: implemented formula, score, haplotype and diplotype model types
- FEATURE: added gene symbols to description
- DEVOPS: prepared docker image with resources for building models
