Metadata-Version: 2.4
Name: kingfisher-bnb
Version: 0.1.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Dist: numpy
License-File: LICENSE
Summary: Fast Kingfisher rule mining (top-K non-redundant dependency rules via Fisher's exact test) using parallel Branch and Bound.
Keywords: data-mining,association-rules,dependency-rules,fishers-exact,kingfisher,branch-and-bound
Author: Paavo Reinikka
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/PaavoReinikka/kingfisher-bnb
Project-URL: Repository, https://github.com/PaavoReinikka/kingfisher-bnb

# kingfisher-bnb

Fast **Kingfisher** rule mining — the top-K *non-redundant, statistically
significant* dependency rules of a binary/transactional dataset — implemented as
a parallel **Best-First Branch & Bound** search in Rust with a small Python API.

Kingfisher finds rules `A ⇒ B` (and negative rules `A ⇒ ¬B`) ranked by a
statistical measure, **without a minimum-support threshold**. False discovery is
controlled by algorithmic design — top-K + a non-redundancy rule (a specific rule
must beat all its generalizations) + a raw significance cutoff — *not* by
family-wise/FDR correction, so the Fisher scores it returns are **raw,
uncorrected p-values**.

## Install

```bash
pip install kingfisher-bnb        # prebuilt wheels (Linux/macOS/Windows)
```

To build from source you need a Rust toolchain (>= 1.85) and
[maturin](https://www.maturin.rs/):

```bash
maturin develop --release         # builds + installs into the active venv
```

## Python usage

```python
import math
import kingfisher_bnb as kf

# Dense rows -> sparse item lists (and an id<->name mapping)
dense = [[1, 1, 0],
         [1, 0, 1],
         [0, 1, 1]]
sparse, id_to_name, _ = kf.dense_to_sparse(dense, ["Apple", "Banana", "Cherry"])

rules = kf.find_rules_from_data(
    data=sparse,
    k=2,            # max attribute index (n_columns - 1)
    q=10,           # top-K rules to keep
    l_max=3,        # max rule length (antecedent + consequent)
    t_type=3,       # 1=positive, 2=negative, 3=both
    m_threshold=1.0,  # raw cutoff (p<=1 => keep all top-q for Fisher)
    measure_type=1,   # 1/2=Fisher's exact, 3=chi^2, 4=mutual info, 5=leverage
)

for r in rules:
    ant = " AND ".join(id_to_name[i] for i in r.antecedent)
    cons = id_to_name[r.consequent]
    sign = "NOT " if r.is_negative else ""
    p = math.exp(r.measure_value)      # Fisher: measure_value = ln(p)
    print(f"IF {ant} THEN {sign}{cons}  (p={p:.4g})")
```

Each `Rule` exposes: `antecedent` (list[int]), `consequent` (int),
`is_negative` (bool), `measure_value` (float — `ln(p)` for Fisher, a negated
statistic otherwise so that smaller is always better), and the contingency
counts `frequency_x`, `frequency_xa`, `frequency_a`.

## CLI

```bash
kingfisher --data data/test_data.txt --cols 4 --t-type 3 --measure-type 1
```

Data format: one transaction per line, space-separated attribute indices.

## Measures

| `measure_type` | Measure | Returns |
|---|---|---|
| 1, 2 | Fisher's exact test | `ln(p)` (raw p-value) |
| 3 | Chi-squared | test statistic |
| 4 | Mutual information | bits |
| 5 | Leverage | effect size |

`t_type` is a separate axis (rule direction): `1` positive, `2` negative, `3` both.

## Attribution

This is an independent Rust implementation of the Kingfisher algorithm from
Wilhelmiina Hämäläinen's published work — written from the papers, not ported
from the original C source. If you use it in research, please cite:

> Hämäläinen, W. *Efficient discovery of the top-K optimal dependency rules with
> Fisher's exact test of significance.* ICDM 2010.

## License

MIT — see [LICENSE](LICENSE).

