Metadata-Version: 2.2
Name: atom-hifi
Version: 0.5.1
Summary: Atom-HiFi: atomistic high-fidelity representative-set selection framework
Author-email: Yihua Song <mothinesong@gmail.com>
License: MIT License
        
        Copyright (c) 2024 Yihua Song
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi
Project-URL: Source, https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi
Project-URL: Issue Tracker, https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi/-/issues
Keywords: machine learning,interatomic potentials,training set,SOAP,atomic environments,active learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Physics
Classifier: Topic :: Scientific/Engineering :: Chemistry
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: ase
Requires-Dist: matplotlib
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: dscribe
Requires-Dist: pymoo>=0.6
Requires-Dist: pyyaml

# Atom-HiFi

**Atom**istic **Hi**gh-**Fi**delity representative-set selection framework.

Applications include:
- MLIP training-set curation and active-learning loops
- Chemical motif identification and distribution analysis
- Diversity-aware structure sampling from large databases

---

## What is Atom-HiFi?

Atom-HiFi finds the smallest subset **S** of a structure library that achieves
high **Fidelity**: S covers the library's atomic-environment diversity
efficiently, without redundancy.  The selection is agnostic to the downstream
task.

---

## Key concepts

### Fidelity = L / R

**Fidelity** is the single optimisation objective.  Like a HiFi audio system, it
has two channels — **L** (Left) and **R** (Right) — whose ratio is maximised.
High Fidelity means the selection is both faithful to the library distribution
(high **Likeness**) and compact (low **Redundancy**).

**L (Likeness)** measures how faithfully S reproduces the library's
atomic-environment distribution.  Each atom is assigned to a **microstate**
(a Voronoi cell of a k-means partition in whitened descriptor space); L is the
ratio of the Shannon entropy of the subset's microstate populations to that of
the library's:

```
L = H(sub) / H(lib)      H = -Σ p_i ln p_i
```

Shannon entropy H measures distributional diversity — how evenly the population
is spread across microstates.  L = 1: S perfectly reproduces the library's
diversity.  L < 1: some environments are under-represented; e.g. L = 0.95 means
S retains 95% of the library's distributional diversity.
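
As a minimal sketch of the formula above (not the package's own implementation), L can be computed from per-atom microstate labels with the standard library alone. The function names and the toy label lists are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """H = -sum(p_i * ln p_i) over the microstate populations in `labels`."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def likeness(sub_labels, lib_labels):
    """L = H(sub) / H(lib), both entropies taken over microstate labels."""
    h_lib = shannon_entropy(lib_labels)
    if h_lib == 0.0:
        raise ValueError("library occupies a single microstate; L is undefined")
    return shannon_entropy(sub_labels) / h_lib


# Toy data (hypothetical): the library occupies 4 microstates,
# the subset misses microstate 3.
lib = [0, 0, 0, 0, 1, 1, 2, 3]
sub = [0, 0, 1, 2]
print(f"L = {likeness(sub, lib):.3f}")
```

Because the toy subset drops one microstate, its entropy falls below the library's and L comes out below 1.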

**R — Redundancy** measures how many atoms are packed per occupied microstate,
relative to the full library:

```
R = (N_sub / k_occ^sub) / (N_lib / k_occ^lib)
```

R = 1: same atoms-per-microstate density as the full library (no compression).
R < 1: redundancy has been removed; e.g. R = 0.4 means S holds 60% fewer atoms
per occupied microstate than the library.  If S occupies the same microstates
as the library, that is 60% fewer atoms overall.
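
The ratio can be sketched in a few lines of plain Python. This illustrates the formula only; the function name and the toy label lists are assumptions, not part of the package API.

```python
def redundancy(sub_labels, lib_labels):
    """R = (N_sub / k_occ_sub) / (N_lib / k_occ_lib).

    N is the number of atoms and k_occ the number of occupied microstates.
    """
    if not sub_labels or not lib_labels:
        raise ValueError("both label lists must be non-empty")
    sub_density = len(sub_labels) / len(set(sub_labels))
    lib_density = len(lib_labels) / len(set(lib_labels))
    return sub_density / lib_density


# Toy data (hypothetical): 100 library atoms and 10 subset atoms,
# both spread over the same 4 microstates.
lib = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 10
sub = [0] * 4 + [1] * 3 + [2] * 2 + [3] * 1
print(f"R = {redundancy(sub, lib):.2f}")
```

Here the subset keeps every occupied microstate, so R reduces to N_sub / N_lib.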

The scan sweeps a bandwidth **c** (scaling factor on ε_noise) and finds **c***
that maximises **Fidelity** subject to L ≥ L_TOL (default 0.90).  The optimal
**c*** sits at the elbow of the L/R curve — the point where further reducing
redundancy begins to cost meaningful distributional diversity.
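
The selection rule can be illustrated with a short sketch: keep only scan points with L ≥ L_TOL, then take the one with the largest L/R. The `ScanPoint` type, the `pick_c_star` helper and the numbers below are invented for illustration and do not come from the package or from a real scan.

```python
from typing import NamedTuple, Sequence


class ScanPoint(NamedTuple):
    c: float  # bandwidth scaling factor
    L: float  # Likeness at this c
    R: float  # Redundancy at this c


def pick_c_star(points: Sequence[ScanPoint], l_tol: float = 0.90) -> ScanPoint:
    """Return the feasible point (L >= l_tol) with the highest Fidelity L/R."""
    feasible = [p for p in points if p.L >= l_tol and p.R > 0]
    if not feasible:
        raise ValueError(f"no scan point satisfies L >= {l_tol}")
    return max(feasible, key=lambda p: p.L / p.R)


# Made-up scan values: the last point has the highest L/R but violates L_TOL.
scan = [
    ScanPoint(c=0.5, L=0.99, R=0.90),
    ScanPoint(c=1.0, L=0.97, R=0.60),
    ScanPoint(c=1.5, L=0.93, R=0.45),
    ScanPoint(c=2.0, L=0.85, R=0.30),
]
best = pick_c_star(scan)
print(f"c* = {best.c}, F = {best.L / best.R:.2f}")
```

The constraint matters: without it, the unconstrained maximum of L/R would sit at the point that has already lost too much diversity.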

### ED-SOAP descriptor

**E**mbedded **D**ouble SOAP: two concatenated SOAP power-spectrum vectors per
atom, one short-range (bonding geometry) and one long-range (coordination
shell), normalised by a system-specific `lengthscale`.  No GPU required.  The
full parameter set is exposed in `hifi_workflow_tutorial.py` under the `EDS_*`
variables.

---

## Installation

**Step 1 — install `decaf`** (Descriptor Embedding and Clustering for
Atomistic-environment Framework — the clustering backend; not on PyPI):

```bash
pip install git+https://gitlab.mpcdf.mpg.de/klai/decaf.git
```

**Step 2 — install Atom-HiFi**:

```bash
pip install atom-hifi
```

> Python ≥ 3.9 required.

---

## Quick start

`pip install atom-hifi` installs the `atom-hifi` command. Write a starter config,
edit it, and run:

```bash
atom-hifi init                 # writes a commented config.yaml
# edit config.yaml (at minimum: paths.lib_path, paths.focus_elements)
atom-hifi run config.yaml 2>&1 | tee run.out
```

The generated `config.yaml` documents every setting inline. The minimum to edit:

```yaml
paths:
  lib_path: train_structs.xyz   # ASE-readable structure library
  focus_elements: [Ni, O]       # elements to cluster on
  output_dir: fr_results
descriptor:
  kind: eds                     # 'eds' or 'ace'
```

### Python API / custom descriptors

The CLI supports the `eds` and `ace` descriptors. A **custom** descriptor is a
Python callable and is supplied via the Python API. `hifi_workflow_tutorial.py`
is the annotated example (included in the repo; pip-only users can fetch it):

```bash
curl -O https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi/-/raw/main/hifi_workflow_tutorial.py
```

Edit its top-level variables (including `DESCRIPTOR_FN`) and run
`python hifi_workflow_tutorial.py`, or call the runner directly:

```python
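from atom_hifi.runner import run
# my_descriptor_fn is your own callable; define it first (see DESCRIPTOR_FN in hifi_workflow_tutorial.py)
run({'paths': {'lib_path': 'train_structs.xyz', 'focus_elements': ['Ni', 'O']},
     'descriptor': {'kind': 'custom', 'custom_fn': my_descriptor_fn}})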
```

---

## Output files

| File | Description |
|---|---|
| `representatives.xyz` | Selected representative structures |
| `fine_scan.out` | L, R, F (=L/R), \|S\|, atoms for every fine-scan point |
| `hifi_final.png` | Coarse + fine Fidelity (F = L/R) scan diagnostic plot |
| `learning_curve.png` | AL loop convergence (only with `RUN_LOOP=True`) |
| `eps_noise_raw.npz` | Cached per-element ε_noise values |
| `desc_lib.pkl` | Cached per-structure descriptors |
| `surroundings_{el}.xyz` | Per-group coordination spheres (`EXTRACT_SURROUNDINGS=True`) |

---

## Configuration reference

All settings live in `config.yaml` (run `atom-hifi init` to generate a fully
commented template). Keys are grouped:

| Group | Keys |
|---|---|
| **paths** | `lib_path`, `patient_path`, `focus_elements`, `output_dir` |
| **descriptor** | `kind`, `eds.{lengthscale, s_cut, s_nmax, s_lmax, l_cut, l_nmax, l_lmax, periodic, r_cut}`, `ace.{model_path, device, r_cut}` |
| **selection** | `method` (`mu_tiebreak` recommended) |
| **scan** | `l_tol`, `n_coarse`, `n_fine`, `n_jobs`, `c_factor_range` |
| **eps_noise** | `per_species`, `temperature` (K; sets σ_thermal ∝ √T/√mass for ε_noise calibration) |
| **loop / grid / nsga2** | `run` + per-stage tuning |
| **refit** | `delta`, `grid_point` |
| **output** | `delta_pick`, `extract_surroundings` |

Unknown keys are rejected. The same configuration can be passed as a nested dict
to `atom_hifi.runner.run(...)`; `hifi_workflow_tutorial.py` is the annotated
Python-API equivalent.

---

## Advanced usage

<details>
<summary>Active-learning loop (<code>RUN_LOOP=True</code>)</summary>

Iteratively expands the training pool by sampling batches from the full library.
Inner iterations use a coarse scan only; one final fine scan runs at the end.
Set `INITIAL_SAMPLE` and `LOOP_SKIP_FINE_SCAN` to control the initial pool size
and inner-scan resolution.

</details>

<details>
<summary>Per-element ND grid scan (<code>RUN_GRID_SCAN=True</code>)</summary>

Sweeps independent c-factors per focus element on a Cartesian grid, reusing
cached per-element DECAF fits from the 1-D scan.  With n grid values per element
and N_el focus elements, the cost is O(n^N_el) cover evaluations instead of
O(n^N_el × N_el) DECAF fits, which stays tractable for N_el ≤ 3–4.
Results are written to `scan_grid.csv` and `scan_grid_report.png`.
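
To make the grid size concrete, the sketch below enumerates a Cartesian grid of per-element c-factors with `itertools.product`. The helper name and the c values are hypothetical; the package's own grid construction may differ.

```python
import itertools


def c_factor_grid(focus_elements, c_values):
    """Yield one {element: c} dict per point of the Cartesian grid."""
    for combo in itertools.product(c_values, repeat=len(focus_elements)):
        yield dict(zip(focus_elements, combo))


# 3 candidate c values for each of 2 focus elements gives 3**2 grid points.
grid = list(c_factor_grid(["Ni", "O"], [0.5, 1.0, 1.5]))
print(len(grid), grid[0], grid[-1])
```

The number of points grows as n^N_el, so 10 values per element already means 10,000 cover evaluations at four focus elements.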

</details>

<details>
<summary>NSGA-II Pareto optimisation (<code>RUN_NSGA2=True</code>)</summary>

Stochastic multi-objective optimisation of per-element c-factors via NSGA-II
(requires `pymoo`).  Use when the grid is too large (N_el ≥ 4) or you want a
continuous Pareto front.  Results in `pareto_front.csv` and three diagnostic
PNGs.
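
For readers new to Pareto fronts, here is a minimal non-dominated filter in plain Python. It is not NSGA-II and not the package's code. It also assumes the two objectives are "maximise L" and "minimise R", which this README does not state explicitly; the candidate values are invented.

```python
def pareto_front(points):
    """Return the (L, R) points that no other point dominates.

    Assumed objectives: higher L is better, lower R is better.
    """
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return no_worse and strictly_better

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (L, R) pairs for five candidate c-factor settings.
candidates = [(0.99, 0.90), (0.97, 0.60), (0.95, 0.70), (0.93, 0.45), (0.90, 0.50)]
front = pareto_front(candidates)
print(front)
```

Points on the front are the settings where L cannot be raised without also raising R.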

</details>

<details>
<summary>Representative environment extraction (<code>EXTRACT_SURROUNDINGS=True</code>)</summary>

Exports the local coordination sphere around the centroid-closest atom of each
DECAF group.  Two modes: `'sphere'` (non-periodic ASE Atoms cluster) and
`'full_structure'` (original cell with center/neighbour/rest tags).  Output:
`surroundings_{el}.xyz` per focus element.

</details>

---

## Citation

If you use Atom-HiFi in your research, please cite:

> [paper in preparation — citation will be added upon publication]
