Metadata-Version: 2.4
Name: fault-tolerance
Version: 0.1.2
Summary: Fault-tolerant algorithms for large models based on bit-flip error correction
Author: Ji
Keywords: LLM,fault tolerance,bit flip,error correction,deep learning
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: torch
Requires-Dist: tqdm

# fault-tolerance

Fault-tolerant algorithms for large and small models based on bit-flip error correction.

## Requirements

- Python >= 3.8
- torch
- tqdm

## Installation

Install PyTorch first:

https://pytorch.org/get-started/locally/

Then install this package:

`pip install fault-tolerance`

## Usage

### 1. Error Injection (`eject_error.py`)

The `eject_error.py` module injects bit-flip errors into a model.

Main function:

- **inject_error_to_model**

Parameters:

- **model**: A model implemented using PyTorch.
- **error_rate**: Bit-flip error rate used during error injection. The default value is `1e-6`.
- **seed**
- **chunk_size**

This function injects random bit errors into model parameters to simulate hardware faults.
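
To make concrete what a bit-flip fault does to a `float32` value, here is a minimal pure-Python sketch. It is not the package's implementation: the helper names (`float32_to_bits`, `bits_to_float32`, `flip_bit`, `inject_errors`) are hypothetical, and it assumes each bit flips independently with probability `error_rate`. Only `error_rate` and `seed` mirror parameter names from the list above; `chunk_size` is not modeled.

```python
import random
import struct


def float32_to_bits(value):
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits


def bits_to_float32(bits):
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value


def flip_bit(value, position):
    """Flip one bit (0 = least significant, 31 = sign) of a float32 value."""
    return bits_to_float32(float32_to_bits(value) ^ (1 << position))


def inject_errors(values, error_rate=1e-6, seed=0):
    """Flip every bit of every value independently with probability error_rate."""
    rng = random.Random(seed)  # a fixed seed makes the fault pattern reproducible
    corrupted = []
    for value in values:
        bits = float32_to_bits(value)
        for position in range(32):
            if rng.random() < error_rate:
                bits ^= 1 << position
        corrupted.append(bits_to_float32(bits))
    return corrupted


print(flip_bit(1.0, 31))  # -1.0 (sign bit flipped)
print(flip_bit(1.0, 30))  # inf  (top exponent bit flipped)
```

A single flipped exponent bit can turn an ordinary weight into `inf`, which is why even very low error rates can matter for model outputs.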

---

### 2. FRP Protection for Large Models (`frp_large_model.py`)

This module implements FRP-based protection for large models.

Main functions:

- **encode**

  Encodes model parameters using BCH codes. The encoding result is written **in-place** to the model parameters, where the 63-bit BCH codeword is stored using `int64`.

- **decode**

  Recovers the original `float32` parameters **in-place** from the BCH-encoded `int64` values stored in `param.data`.
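
The storage idea behind `encode` and `decode` can be sketched in pure Python: a 32-bit `float32` pattern is carried inside a 63-bit word, which fits in a signed `int64` without touching the sign bit. This sketch does not compute any BCH parity. The bit layout (data in the low 32 bits, the remaining 31 bits reserved for parity) and the helper names are assumptions for illustration; the package's actual BCH parameters and layout are not stated here.

```python
import struct

DATA_BITS = 32      # width of a float32 bit pattern
CODEWORD_BITS = 63  # codeword length mentioned above


def float32_to_word(value):
    """Place the float32 bit pattern in the low 32 bits of a 63-bit word.

    Assumed layout: bits 32..62 are left as zero here; a real encoder
    would fill the spare bits with parity.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits


def word_to_float32(word):
    """Extract the low 32 bits of the word and reinterpret them as float32."""
    bits = word & ((1 << DATA_BITS) - 1)
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value


word = float32_to_word(0.15625)
assert 0 <= word < (1 << CODEWORD_BITS)  # fits in a non-negative int64
print(word_to_float32(word))  # 0.15625
```

On tensors, the same reinterpretation between floating-point and integer bit patterns of equal width can be done with `torch.Tensor.view(dtype)`.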

---

### 3. FRP Protection for Small Models (`frp_little_model.py`)

This module provides the same FRP-based protection mechanism as `frp_large_model.py`, but is optimized for **smaller models**.

---

### 4. ZMORP Protection (`zmorp_large_model.py` and `zmorp_little_model.py`)

These modules implement ZMORP-based fault-tolerance protection.

Main functions:

- **protect_model**

  Adds fault-tolerance protection to all parameters of the model:
  - `zmorp_large_model.py` protects **float32 parameters**
  - `zmorp_little_model.py` protects **float16 parameters**

- **recover_model**

  Recovers protected parameters of the model after potential bit-flip errors.

---

### Example

```python
import torch
import fault_tolerance as ft

model = ...

# Apply protection before any faults occur
ft.protect_model(model)

# Inject errors to simulate hardware faults
ft.inject_error_to_model(model, error_rate=1e-6)

# Recover model parameters
ft.recover_model(model)
```
