Metadata-Version: 2.4
Name: gekolib
Version: 0.3.1
Summary: Gradient-Efficient Knowledge Optimization - Smart training for any LLM
Home-page: https://github.com/Abd0r/GEKO
Author: Syed Abdur Rehman
Author-email: ra2157218@gmail.com
Keywords: machine learning,deep learning,nlp,llm,training,optimization,curriculum learning,sample selection
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: tqdm>=4.60.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Provides-Extra: transformers
Requires-Dist: transformers>=4.20.0; extra == "transformers"
Provides-Extra: peft
Requires-Dist: peft>=0.6.0; extra == "peft"
Provides-Extra: bnb
Requires-Dist: bitsandbytes>=0.41.0; extra == "bnb"
Provides-Extra: rich
Requires-Dist: rich>=10.0.0; extra == "rich"
Provides-Extra: app
Requires-Dist: gradio>=4.0.0; extra == "app"
Requires-Dist: plotly>=5.0.0; extra == "app"
Requires-Dist: transformers>=4.20.0; extra == "app"
Requires-Dist: datasets>=2.0.0; extra == "app"
Provides-Extra: all
Requires-Dist: transformers>=4.20.0; extra == "all"
Requires-Dist: peft>=0.6.0; extra == "all"
Requires-Dist: bitsandbytes>=0.41.0; extra == "all"
Requires-Dist: rich>=10.0.0; extra == "all"
Requires-Dist: gradio>=4.0.0; extra == "all"
Requires-Dist: plotly>=5.0.0; extra == "all"
Requires-Dist: datasets>=2.0.0; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# GEKO: Gradient-Efficient Knowledge Optimization

<p align="center">
  <img src="https://em-content.zobj.net/source/apple/391/lizard_1f98e.png" width="120" alt="GEKO">
</p>

<p align="center">
  <a href="https://pypi.org/project/gekolib/"><img src="https://img.shields.io/pypi/v/gekolib.svg" alt="PyPI"></a>
  <img src="https://img.shields.io/badge/python-3.8+-blue.svg" alt="Python">
  <img src="https://img.shields.io/badge/pytorch-1.9+-red.svg" alt="PyTorch">
  <img src="https://img.shields.io/badge/license-Apache%202.0-green.svg" alt="License">
  <a href="https://github.com/Abd0r/GEKO/actions"><img src="https://github.com/Abd0r/GEKO/actions/workflows/tests.yml/badge.svg" alt="Tests"></a>
  <a href="https://doi.org/10.5281/zenodo.18750303"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.18750303.svg" alt="DOI"></a>
  <a href="https://pepy.tech/projects/gekolib"><img src="https://static.pepy.tech/personalized-badge/gekolib?period=total&amp;units=INTERNATIONAL_SYSTEM&amp;left_color=grey&amp;right_color=brightgreen&amp;left_text=downloads" alt="Downloads"></a>
  <a href="https://paypal.me/n4z1m"><img src="https://img.shields.io/badge/Support-PayPal-00457C.svg?logo=paypal" alt="Support via PayPal"></a>
</p>

<p align="center">
  <b>Stop paying to train what your model already knows.</b>
</p>

<p align="center"><sub>If GEKO saves you compute, a ⭐ helps others find it — thank you!</sub></p>

---

## Table of Contents

- [What is GEKO?](#what-is-geko)
- [v0.3.1 — No-Code Web App](#v031--no-code-web-app)
- [v0.3.0 — 8 Efficiency Features](#v030--8-efficiency-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [GEKO Web App](#geko-web-app)
- [What GEKO Looks Like](#what-geko-looks-like)
- [LoRA Integration](#lora-integration)
- [Model Compatibility](#model-compatibility)
- [GEKO vs Alternatives](#geko-vs-alternatives)
- [Cost Savings at Scale](#cost-savings-at-scale)
- [The GEKO Algorithm](#the-geko-algorithm)
- [Real Training Results](#real-training-results)
- [API Reference](#api-reference)
- [Checkpoint Resume](#checkpoint-resume)
- [FAQ](#faq)
- [Changelog](#changelog)
- [Citation](#citation)

---

## What is GEKO?

Most training loops treat every sample equally every epoch. That's wasteful — once a model has mastered a sample, continuing to train on it burns compute and can cause overfitting.

**GEKO** tracks each sample's learning state and routes compute to where it actually matters. Mastered samples get skipped. Hard samples (ones the model confidently gets wrong) get up to 5× more attention. The result: faster training, lower cost, same or better final quality.

> *Like LoRA reduced parameters, GEKO reduces wasted compute.*

> **When to use GEKO:** GEKO is a **fine-tuning tool**. It works best after a model already has a base of general knowledge from pre-training. During fine-tuning, the model already understands language — GEKO identifies which task-specific samples it has mastered vs. which ones still need work. Using GEKO during pre-training from scratch (on a randomly initialized model) is much less effective because the model has no prior knowledge to differentiate samples with.

---

## v0.3.1 — No-Code Web App

v0.3.1 adds a full **Gradio web interface** — fine-tune any HuggingFace model with GEKO without writing a single line of code.

```bash
pip install gekolib[app]
geko-app
# opens http://localhost:7860
```

| What you get | Detail |
|:------------|:-------|
| **Model & dataset picker** | Any HuggingFace model or dataset ID |
| **Eval dataset support** | Optional HF eval set + split picker |
| **GEKO config presets** | Aggressive / Balanced / Conservative — one click |
| **All sliders exposed** | freeze/focus/quality thresholds, curriculum, pruning, warmup, weight decay |
| **LoRA panel** | Enable, set rank/alpha/target modules/dropout from the UI |
| **Efficiency toggles** | torch.compile, gradient checkpointing, 8-bit optimizer |
| **Live loss curve** | Plotly chart, updates every batch |
| **Live bucket chart** | FREEZE / LIGHT / FOCUS / HARD distribution in real time |
| **Steps/sec + ETA** | Computed from a rolling 20-step window |
| **Eval loss stat box** | Populated automatically when eval dataset is set |
| **Training summary card** | Final loss, compute saved, steps, output path |
| **Resume from checkpoint** | Point to any checkpoint directory and continue |
| **Custom output directory** | Set where checkpoints are saved |
| **Stop button** | Safely halts training mid-epoch |
| **Clear / Reset button** | Wipes all outputs and charts |
| **Hardware info strip** | Shows PyTorch version + GPU/CPU at page load |

---

## v0.3.0 — 8 Efficiency Features

v0.3.0 makes GEKO production-ready for cheap LLM fine-tuning:

| Feature | What it does | Saving |
|:--------|:-------------|:-------|
| **LoRA / PEFT** | Fine-tune only 0.1–1% of parameters | Up to 10× fewer trainable params |
| **BF16 mixed precision** | Brain float 16, no GradScaler needed | ~50% memory reduction |
| **Gradient checkpointing** | Recompute activations instead of storing | ~4× activation memory reduction |
| **8-bit optimizer** | AdamW states in int8 via bitsandbytes | ~2× optimizer memory reduction |
| **torch.compile** | JIT kernel fusion (PyTorch 2.0+) | 20–50% throughput boost |
| **Fast DataLoader** | Auto workers + persistent + prefetch | Eliminates data loading bottleneck |
| **Stable bucket caching** | Skip DataLoader rebuild if distribution stable | Saves seconds per epoch |
| **Dynamic dataset pruning** | Permanently remove samples frozen N+ epochs | Dataset shrinks as model learns |

---

## Installation

```bash
# Core
pip install gekolib

# With Gradio web app (v0.3.1)
pip install gekolib[app]

# With LoRA support
pip install gekolib[peft]

# With 8-bit optimizer
pip install gekolib[bnb]

# With rich terminal UI
pip install gekolib[rich]

# Everything
pip install gekolib[all]
```

**Launch the web app:**

```bash
geko-app
# or: python -m geko.app
# Opens http://localhost:7860
```

---

## Quick Start

```python
from geko import GEKOTrainer, GEKOConfig, GEKOTrainingArgs

args = GEKOTrainingArgs(
    output_dir="./geko_output",
    num_epochs=3,
    batch_size=16,
    learning_rate=3e-5,
    gradient_accumulation_steps=4,   # effective batch = 64
    bf16=True,                        # BF16 on Ampere+ GPUs
    gradient_checkpointing=True,      # ~4× activation memory reduction
    compile_model=True,               # 20–50% throughput boost
)

config = GEKOConfig(
    prune_frozen_after=2,             # permanently remove mastered samples after 2 frozen epochs
)

trainer = GEKOTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    config=config,
    args=args,
    compute_correctness=your_correctness_fn,
)

trainer.train()
print(trainer.get_efficiency_report())
```

> **Dataset requirement:** `__getitem__` must return a dict with at least an `'input_ids'` key.

---

## GEKO Web App

The GEKO web app gives you a full no-code training UI powered by Gradio.

```bash
pip install gekolib[app]
geko-app
```

![GEKO Web App](https://raw.githubusercontent.com/Abd0r/GEKO/main/assets/GekoUIpreview.png)

**Left panel — Configuration:**
- Model name (any HuggingFace model ID, e.g. `gpt2`, `meta-llama/Llama-2-7b-hf`)
- Dataset name + train split + sample count
- Eval dataset + eval split (optional — enables live eval loss)
- Training hyperparameters: epochs, learning rate, batch size, gradient accumulation
- Precision: BF16 / FP16 / FP32
- Efficiency: torch.compile, gradient checkpointing, 8-bit optimizer
- **GEKO Config:** preset selector (Aggressive / Balanced / Conservative), freeze confidence, focus confidence, freeze quality, curriculum on/off, pruning threshold
- **LoRA:** enable toggle, rank, alpha, target modules, dropout
- Advanced: warmup steps, weight decay, max grad norm, seed
- Output directory + resume-from-checkpoint path

**Right panel — Live training dashboard:**
- Status bar (color-coded: ready → training → complete / error)
- Live loss curve (Plotly, updates each optimizer step)
- Live bucket distribution chart (FREEZE / LIGHT / FOCUS / HARD percentages)
- Stat boxes: Epoch · Step · Compute Saved · Avg Loss · Steps/sec · ETA · Eval Loss
- Training summary card (appears on completion: final loss, compute saved, steps, save path)

**Buttons:**
- `Start Training` — launches training in a background thread
- `Stop` — safely interrupts training between batches (no corrupt checkpoints)
- `Clear` — resets all outputs and charts

---

## What GEKO Looks Like

This is the real output from training GPT-2 on `open-r1/OpenR1-Math-220k`:

```
============================================================
  GEKO Training
============================================================
  Samples           : 92,733
  Epochs            : 3
  Batch size        : 16
  Device            : CUDA
  Precision         : BF16
  Grad accum        : 4
  Grad checkpointing: ON
  torch.compile     : ON
  8-bit optimizer   : OFF
  DataLoader workers: 4
  Warmup steps      : 500
  Curriculum        : ON
============================================================

[GEKO] Gradient checkpointing enabled
[GEKO] Compiling model with torch.compile (first batch will be slow)...

Epoch 1/3: 100%|██████████| 5796/5796 [32:14<00:00, loss=4.20, phase=warmup]
[Epoch 1] Average Loss: 4.2007
[GEKO] Epoch 1 Partition: FREEZE:    0 (  0.0%) | LIGHT:    2 (  0.2%) | FOCUS: 72861 ( 78.6%) | HARD: 19870 ( 21.2%)

Epoch 2/3: 100%|██████████| 5796/5796 [28:41<00:00, loss=1.88, phase=peak]
[Epoch 2] Average Loss: 1.8808
[GEKO] Epoch 2 Partition: FREEZE:    0 (  0.0%) | LIGHT: 18088 ( 19.5%) | FOCUS: 35156 ( 37.9%) | HARD: 39489 ( 42.6%)

Epoch 3/3: 100%|██████████| 5796/5796 [25:03<00:00, loss=1.73, phase=consolidate]
[Epoch 3] Average Loss: 1.7252
[GEKO] Checkpoint saved to ./geko_output/checkpoint-48

============================================================
  Training Complete
============================================================
  Samples total     : 92,733
  Samples mastered  : 18,088
  Compute saved     : 19.5%
  Final accuracy    : 19.5%
  Avg loss          : 2.46
============================================================
```

Loss dropped **58%** across 3 epochs. GEKO identified **42.6% of math samples as HARD** and routed extra compute to them automatically.

---

## LoRA Integration

```python
from peft import LoraConfig, TaskType
from geko import GEKOTrainer, apply_lora, is_peft_available

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

# Pass to trainer — GEKO applies LoRA automatically and prints param counts
trainer = GEKOTrainer(
    model=model,
    train_dataset=dataset,
    lora_config=lora_config,
    args=args,
)
# [GEKO] LoRA applied
#   Trainable params : 4,194,304 (0.32%)
#   Total params     : 1,311,473,664
#   Frozen params    : 1,307,279,360
```

---

## Model Compatibility

GEKO works with **any transformer-based model** — HuggingFace or custom `nn.Module`. It only needs your model to accept a dict batch and return an object with a `.loss` attribute.

| Model Family | GEKO | LoRA | Recommended LoRA targets |
|:-------------|:----:|:----:|:------------------------|
| GPT-2 / GPT-Neo | ✅ | ✅ | `c_attn`, `c_proj` |
| LLaMA 2 / 3 | ✅ | ✅ | `q_proj`, `v_proj` |
| Mistral / Mixtral | ✅ | ✅ | `q_proj`, `v_proj` |
| Phi-2 / Phi-3 | ✅ | ✅ | `q_proj`, `dense` |
| Falcon | ✅ | ✅ | `query_key_value` |
| Qwen / Qwen2 | ✅ | ✅ | `q_proj`, `v_proj` |
| Gemma / Gemma2 | ✅ | ✅ | `q_proj`, `v_proj` |
| Custom `nn.Module` | ✅ | — | Any model with `.loss` output |

---

## GEKO vs Alternatives

| Feature | Plain PyTorch | HF Trainer | SFTTrainer | **GEKO** |
|:--------|:-------------:|:----------:|:----------:|:--------:|
| Per-sample learning tracking | ❌ | ❌ | ❌ | ✅ |
| Skip mastered samples | ❌ | ❌ | ❌ | ✅ |
| Hard-sample prioritization | ❌ | ❌ | ❌ | ✅ |
| Curriculum learning | ❌ | ❌ | ❌ | ✅ |
| LoRA built-in | ❌ | ❌ | ✅ | ✅ |
| BF16 / FP16 | manual | ✅ | ✅ | ✅ |
| Gradient checkpointing | manual | ✅ | ✅ | ✅ |
| torch.compile | manual | ✅ | ✅ | ✅ |
| Dynamic dataset pruning | ❌ | ❌ | ❌ | ✅ |
| Works with any `nn.Module` | ✅ | partial | partial | ✅ |

---

## Cost Savings at Scale

GEKO's savings compound as training progresses — the more the model learns, the more samples get frozen, and the cheaper each subsequent epoch becomes.

Estimates based on GEKO's FREEZE accumulation curve vs standard uniform training:

| Task | Dataset | Standard (A100 $/hr ~$3) | GEKO (est.) | Saving |
|:-----|:--------|:------------------------:|:-----------:|:------:|
| GPT-2 fine-tune | 10k samples, 5 epochs | ~$0.50 | ~$0.35 | **~30%** |
| LLaMA-7B fine-tune | 100k samples, 10 epochs | ~$25 | ~$13 | **~50%** |
| LLaMA-13B fine-tune | 500k samples, 15 epochs | ~$200 | ~$80 | **~60%** |
| LLaMA-70B + LoRA | 1M samples, 20 epochs | ~$2,000 | ~$600 | **~70%** |

> Savings increase with more epochs — GEKO's FREEZE fraction grows each epoch, compounding over time.

---

## The GEKO Algorithm

### Sample Buckets

Every sample is classified each epoch:

| Bucket | Condition | Weight | Meaning |
|:------:|:----------|:------:|:--------|
| 🔵 **FREEZE** | correct ∧ confidence > 0.85 ∧ Q > 0.80 | `0` | Mastered — skipped entirely |
| 🟢 **LIGHT** | correct, but not confident enough to freeze | varies* | Nearly mastered |
| 🟠 **FOCUS** | wrong ∧ low confidence | `1` | Still learning |
| 🔴 **HARD** | wrong ∧ high confidence | `3` | Confident-wrong — highest priority |

*LIGHT weight varies by curriculum phase (0 at PEAK, 3 at WARMUP/CONSOLIDATE).

### Mountain Curriculum

GEKO's LR schedule and sample weights follow a five-phase mountain:

```
  LR ▲
     │         ╭──────╮
     │        ╱        ╲
     │       ╱          ╲
     │      ╱            ╲──────╮
     │─────╱                     ╲─────
     └─────────────────────────────── Progress
      WARMUP ASCENT  PEAK  DESCENT CONSOLIDATE
```

| Phase | Progress | HARD | FOCUS | LIGHT | Strategy |
|:------|:--------:|:----:|:-----:|:-----:|:---------|
| WARMUP | 0–15% | 1 | 2 | 3 | Build foundation on easy samples |
| ASCENT | 15–35% | 2 | 3 | 1 | Ramp up difficulty |
| PEAK | 35–65% | 5 | 2 | 0 | Maximum focus on hard samples |
| DESCENT | 65–85% | 2 | 3 | 1 | Wind down |
| CONSOLIDATE | 85–100% | 1 | 2 | 3 | Reinforce all learned material |

### Q-Value Learning

Each sample maintains a Q-value representing its "learnability":

$$Q_{t+1}(s) = (1 - \alpha) \cdot Q_t(s) + \alpha \cdot \left(1 - \frac{\ell_t(s)}{\ell_{max}}\right)$$

A sample cannot be frozen unless its Q-value exceeds `min_q_for_freeze` — preventing premature freezing.

---

## Real Training Results

**GPT-2 on OpenR1-Math-220k (1k samples, 3 epochs, CPU)**

| Epoch | FREEZE | LIGHT | FOCUS | HARD | Avg Loss |
|:-----:|:------:|:-----:|:-----:|:----:|:--------:|
| 0 | 0.0% | 0.0% | 100.0% | 0.0% | — |
| 1 | 0.0% | 0.2% | 78.6% | 21.2% | 4.20 |
| 2 | 0.0% | 19.5% | 37.9% | 42.6% | 1.88 |
| 3 | 0.0% | 19.5% | 37.9% | 42.6% | 1.73 |

Loss dropped **58%** in 3 epochs. GEKO correctly routed 3× compute to 43% of samples it identified as HARD math reasoning problems. 20% graduated to LIGHT. On a GPU with more epochs, FREEZE samples accumulate rapidly and savings compound.

---

## API Reference

### `GEKOTrainer`

```python
GEKOTrainer(
    model,                    # Any nn.Module
    train_dataset,            # Must return dicts with 'input_ids'
    tokenizer=None,           # Stored but not called internally
    eval_dataset=None,        # Optional evaluation dataset
    config=None,              # GEKOConfig
    args=None,                # GEKOTrainingArgs
    lora_config=None,         # peft.LoraConfig — applied automatically
    compute_confidence=None,  # fn(outputs, batch) → Tensor[B]
    compute_correctness=None, # fn(outputs, batch) → BoolTensor[B]
)
```

| Method | Description |
|:-------|:------------|
| `trainer.train()` | Run full GEKO training loop |
| `trainer.get_efficiency_report()` | Bucket stats + compute savings |
| `trainer.get_pruned_count()` | Number of samples permanently pruned |
| `trainer.save_checkpoint(path)` | Save model weights + GEKO state |
| `trainer.load_checkpoint(path)` | Restore from checkpoint and resume |

---

### `GEKOTrainingArgs`

**Core**

| Argument | Default | Description |
|:---------|:-------:|:------------|
| `output_dir` | `"./geko_output"` | Checkpoint directory |
| `num_epochs` | `3` | Training epochs |
| `batch_size` | `32` | Per-device batch size |
| `learning_rate` | `5e-5` | AdamW learning rate |
| `weight_decay` | `0.01` | AdamW weight decay |
| `warmup_steps` | `100` | Linear LR warmup steps |
| `logging_steps` | `100` | Log every N optimizer steps |
| `save_steps` | `1000` | Checkpoint every N optimizer steps |
| `eval_steps` | `500` | Evaluate every N optimizer steps |
| `max_grad_norm` | `1.0` | Gradient clipping |
| `gradient_accumulation_steps` | `1` | Steps before optimizer update |
| `seed` | `42` | Reproducibility seed |
| `save_at_end` | `True` | Save final checkpoint |

**v0.3.0 — Efficiency**

| Argument | Default | Description |
|:---------|:-------:|:------------|
| `bf16` | `False` | BF16 mixed precision (Ampere+ GPUs, no GradScaler needed) |
| `fp16` | `False` | FP16 mixed precision (older GPUs, with GradScaler) |
| `gradient_checkpointing` | `False` | Recompute activations — ~4× activation memory saving |
| `compile_model` | `False` | `torch.compile` — 20–50% throughput boost (PyTorch 2.0+) |
| `use_8bit_optimizer` | `False` | 8-bit AdamW via bitsandbytes — ~2× optimizer memory saving |
| `dataloader_num_workers` | `-1` | DataLoader workers (-1 = auto-detect; 0 on macOS) |
| `dataloader_persistent_workers` | `True` | Keep workers alive between epochs |
| `dataloader_prefetch_factor` | `2` | Batches to prefetch per worker |

---

### `GEKOConfig`

| Field | Default | Description |
|:------|:-------:|:------------|
| `freeze_confidence` | `0.85` | Confidence threshold to enter FREEZE |
| `freeze_quality` | `0.80` | Q-value threshold to enter FREEZE |
| `focus_confidence` | `0.60` | Confidence split between FOCUS and HARD |
| `bucket_weights` | `(3, 1, 0)` | Sampling weights: (HARD, FOCUS, LIGHT) |
| `use_curriculum` | `True` | Enable Mountain Curriculum |
| `q_value_lr` | `0.1` | EMA rate for Q-value updates |
| `min_q_for_freeze` | `0.8` | Minimum Q-value to freeze a sample |
| `q_value_loss_scale` | `10.0` | Loss normalizer for Q updates |
| `repartition_every` | `1` | Re-partition every N epochs |
| `log_bucket_stats` | `True` | Print bucket distribution each epoch |
| `max_times_seen` | `50` | Max training passes per sample |
| `prune_frozen_after` | `0` | Permanently remove samples frozen N+ consecutive epochs (0 = disabled) |

**Preset configs:**

```python
from geko.core import GEKO_AGGRESSIVE, GEKO_BALANCED, GEKO_CONSERVATIVE

# AGGRESSIVE: freeze_confidence=0.90, weights=(5, 2, 0) — max hard-sample focus
# BALANCED:   freeze_confidence=0.85, weights=(3, 2, 1) — default
# CONSERVATIVE: freeze_confidence=0.80, weights=(2, 2, 1) — gentle
```

---

## Checkpoint Resume

```python
trainer = GEKOTrainer(model=model, train_dataset=dataset, args=args)
trainer.load_checkpoint("./geko_output/checkpoint-1000")
trainer.train()   # picks up from where it left off — all sample states included
```

---

## FAQ

**Should I use GEKO for pre-training or fine-tuning?**
Fine-tuning only. GEKO's value comes from differentiating samples the model has already partially learned from ones it hasn't — that gap only exists meaningfully after pre-training. On a randomly initialized model, every sample looks equally hard, so GEKO's buckets won't diverge and savings will be minimal. Use GEKO for SFT, instruction tuning, domain adaptation, RLHF-style training, or any task-specific fine-tuning on top of a pre-trained base model.

**Does GEKO work with multi-GPU / DDP?**
Not natively in v0.3.0 — GEKO's per-sample state tracking is designed for single-GPU or single-node training. Multi-GPU support via `DistributedDataParallel` is planned.

**What's the memory overhead of tracking sample states?**
Negligible. Each sample stores ~5 floats (confidence, Q-value, loss, bucket, counter). 1 million samples ≈ 40 MB of state — well within RAM for any realistic dataset size.

**Does it work with custom loss functions?**
Yes. GEKO calls your model's forward pass and reads `outputs.loss`. If your model returns a custom output object with `.loss`, it works out of the box. For per-sample correctness (recommended), implement `compute_correctness(outputs, batch) → BoolTensor[B]`.

**Can I use it with HuggingFace `datasets`?**
Yes — wrap it in a `torch.utils.data.Dataset` that returns dicts with `'input_ids'`. GEKO does not call `.map()` or any HuggingFace-specific APIs internally.

**What if my dataset doesn't have a `'labels'` key?**
GEKO only requires `'input_ids'`. Labels are optional — if not present, GEKO falls back to batch-level correctness (one correctness score per batch instead of per sample). Override `compute_correctness` for true per-sample bucketing.

**Does GEKO add training overhead?**
Minimal. Per-epoch overhead is one O(N) pass over sample states to reclassify buckets — typically under 1 second for 100k samples. The weighted sampler rebuild adds another second. Both are negligible compared to actual training time.

---

## Changelog

### v0.3.1 — No-Code Web App
- **Gradio web interface** — full training UI, no code required (`geko-app` CLI entry point)
- **GEKO config presets** — Aggressive / Balanced / Conservative, one click to apply
- **`freeze_quality` slider** — exposes the Q-value gate for FREEZE in the UI
- **Eval dataset support** — enter any HuggingFace eval dataset + split; live eval loss stat box
- **Steps/sec + ETA stat boxes** — computed from a rolling 20-step window
- **Eval Loss stat box** — populated by dedicated `eval_loss` events from the trainer
- **Training summary card** — rendered on completion with final loss, compute saved, steps, output path
- **Resume from checkpoint** — textbox input wired to `trainer.load_checkpoint()`
- **Custom output directory** — user-controlled, defaults to `./geko_output`
- **Clear / Reset button** — wipes all live outputs and charts
- **Hardware info strip** — PyTorch version + GPU name/VRAM or "CPU only" at page load
- **Stop button** — safely halts training between batches
- Added `pip install gekolib[app]` extra (gradio, plotly, transformers, datasets)
- Added `geko-app` console script entry point

### v0.3.0 — Efficiency Update
- **LoRA / PEFT integration** — pass `lora_config` to `GEKOTrainer`; prints trainable/frozen param counts
- **BF16 mixed precision** — `bf16=True` in `GEKOTrainingArgs`; unified autocast with FP16
- **Gradient checkpointing** — `gradient_checkpointing=True`; ~4× activation memory reduction
- **8-bit optimizer** — `use_8bit_optimizer=True`; requires `bitsandbytes`
- **torch.compile** — `compile_model=True`; graceful fallback on unsupported environments
- **Fast DataLoader** — auto num_workers, persistent workers, prefetch factor, non-blocking GPU transfers
- **Stable bucket caching** — skip DataLoader rebuild when bucket distribution changes < 5%
- **Dynamic dataset pruning** — `prune_frozen_after` in `GEKOConfig`; permanently removes mastered samples
- Added `apply_lora` and `is_peft_available` exports
- Added `extras_require`: `peft`, `bnb`, `all` in `setup.py`
- Fixed eval DataLoader worker resolution
- All 120 tests passing

### v0.2.0
- Per-sample correctness via 1D loss — true per-sample bucketing
- One-time warning on batch-level correctness fallback
- `GEKODataset` requires `'input_ids'` key with clear `TypeError`
- All-FREEZE epoch skip
- `PartitionStats` fully serialized to checkpoints
- `load_checkpoint` restores partition history
- Linear warmup LR scheduler
- Optional evaluation loop
- `max_times_seen` enforcement
- GitHub Actions CI (Python 3.9 / 3.10 / 3.11)

### v0.1.1
- Mixed-precision training via `torch.amp`
- Gradient accumulation
- Per-sample state updates inside training loop

### v0.1.0
- Initial release: GEKO algorithm, Mountain Curriculum, Q-value tracking

---

## Citation

```bibtex
@software{geko2026,
  author = {Syed Abdur Rehman},
  title  = {GEKO: Gradient-Efficient Knowledge Optimization},
  year   = {2026},
  url    = {https://github.com/Abd0r/GEKO},
  doi    = {10.5281/zenodo.18750303}
}
```

---

## License

Apache 2.0

---

## Support

GEKO is a solo open-source project. If it's useful to you, any support goes directly toward keeping it maintained and growing.

<a href="https://paypal.me/n4z1m"><img src="https://img.shields.io/badge/Support-PayPal-00457C.svg?logo=paypal" alt="Support via PayPal"></a>

---

<p align="center"><b>GEKO</b> — Train smarter, not harder.</p>
