Metadata-Version: 2.4
Name: Enilnets
Version: 2.1.0
Summary: A simple neural network library written in Python
Author: DoctorEnilno
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: numpy>=2.5.0
Dynamic: license-file

# Enilnets Library Documentation

A pure NumPy-based deep learning library with support for dense, convolutional, pooling, batch normalization, dropout, layer normalization, embedding, upsampling, global pooling, and sparse layers. Includes multiple optimizers, loss functions, activation functions, weight initialization methods, learning rate schedulers, reinforcement learning (REINFORCE, PPO, Actor-Critic), and a full generative AI framework.

---

## Table of Contents

1. [Quick Start](#quick-start)
   - [Discriminative Example](#discriminative-example)
   - [Generative Example (VAE)](#generative-example-vae)
   - [Reinforcement Learning (PPO)](#reinforcement-learning-ppo)
2. [Installation & Project Structure](#installation--project-structure)
3. [Core Architecture](#core-architecture)
   - [NeuralNet Class Overview](#neuralnet-class-overview)
   - [Internal Data Flow](#internal-data-flow)
   - [Training vs Evaluation Mode](#training-vs-evaluation-mode)
4. [Layer Types](#layer-types)
   - [Dense Layer](#dense-layer)
   - [Sparse Layer](#sparse-layer)
   - [Convolutional Layer (Conv2D)](#convolutional-layer-conv2d)
   - [Flatten Layer](#flatten-layer)
   - [Max Pooling 2D](#max-pooling-2d)
   - [Average Pooling 2D](#average-pooling-2d)
   - [Global Average Pooling 2D](#global-average-pooling-2d)
   - [Upsampling 2D](#upsampling-2d)
   - [Batch Normalization](#batch-normalization)
   - [Layer Normalization](#layer-normalization)
   - [Dropout](#dropout)
   - [Embedding Layer](#embedding-layer)
5. [Forward Pass](#forward-pass)
   - [Input Handling](#input-handling)
   - [Layer-by-Layer Computation](#layer-by-layer-computation)
   - [im2col for Convolutions](#im2col-for-convolutions)
6. [Backward Pass](#backward-pass)
   - [Automatic Delta Computation](#automatic-delta-computation)
   - [Manual Output Delta](#manual-output-delta)
   - [Per-Layer Gradient Propagation](#per-layer-gradient-propagation)
7. [Optimizers](#optimizers)
   - [SGD with Momentum](#sgd-with-momentum)
   - [RMSprop](#rmsprop)
   - [Adagrad](#adagrad)
   - [Adam with Bias Correction](#adam-with-bias-correction)
   - [L2 Regularization](#l2-regularization)
8. [Loss Functions](#loss-functions)
   - [Regression Losses](#regression-losses)
   - [Classification Losses](#classification-losses)
   - [Advanced Losses](#advanced-losses)
   - [Generative Losses](#generative-losses)
9. [Activation Functions](#activation-functions)
   - [ReLU Family](#relu-family)
   - [Sigmoid Family](#sigmoid-family)
   - [Advanced Activations](#advanced-activations)
   - [Linear / Identity](#linear--identity)
10. [Weight Initialization](#weight-initialization)
    - [Dense & Sparse Layer Initializers](#dense--sparse-layer-initializers)
    - [Convolutional Layer Initializers](#convolutional-layer-initializers)
    - [Embedding Layer Initializers](#embedding-layer-initializers)
11. [Training Utilities](#training-utilities)
    - [Train Method](#train-method)
    - [TrainBatch Method](#trainbatch-method)
    - [Learning Rate Schedulers](#learning-rate-schedulers)
    - [Metrics Computation](#metrics-computation)
12. [Model Utilities](#model-utilities)
    - [Training / Evaluation Mode](#training--evaluation-mode)
    - [Learning Rate Control](#learning-rate-control)
    - [Gradient Clipping](#gradient-clipping)
    - [Layer Freezing / Unfreezing](#layer-freezing--unfreezing)
    - [Weight Get / Set](#weight-get--set)
    - [Model Copying](#model-copying)
    - [Optimizer State Reset](#optimizer-state-reset)
    - [NaN / Inf Detection](#nan--inf-detection)
    - [Model Summary](#model-summary)
13. [Reinforcement Learning](#reinforcement-learning)
    - [Evolutionary Strategy (Evolve)](#evolutionary-strategy-evolve)
    - [REINFORCE (Policy Gradient)](#reinforce-policy-gradient)
    - [Proximal Policy Optimization (PPO)](#proximal-policy-optimization-ppo)
    - [Actor-Critic](#actor-critic)
    - [Utility Functions](#rl-utility-functions)
14. [Generative AI Framework](#generative-ai-framework)
    - [Variational Autoencoder (VAE)](#variational-autoencoder-vae)
    - [Generative Adversarial Network (GAN)](#generative-adversarial-network-gan)
    - [Diffusion Model (DDPM)](#diffusion-model-ddpm)
    - [Autoregressive Model (MADE)](#autoregressive-model-made)
    - [Normalizing Flows (RealNVP)](#normalizing-flows-realnvp)
    - [Energy-Based Model (EBM)](#energy-based-model-ebm)
    - [UNet Denoiser](#unet-denoiser)
15. [Sampling Utilities](#sampling-utilities)
    - [VAE Reparameterization](#vae-reparameterization)
    - [Langevin Dynamics](#langevin-dynamics)
    - [Gaussian & Uniform Sampling](#gaussian--uniform-sampling)
    - [Gumbel-Softmax](#gumbel-softmax)
    - [Random Masking](#random-masking)
    - [Top-p (Nucleus) Sampling](#top-p-nucleus-sampling)
    - [Discounted Returns](#discounted-returns)
    - [Generalized Advantage Estimation (GAE)](#generalized-advantage-estimation-gae)
16. [Model I/O](#model-io)
    - [JSON Serialization](#json-serialization)
    - [Pickle Serialization](#pickle-serialization)
    - [Loading Models](#loading-models)
17. [Known Limitations](#known-limitations)
18. [Version History](#version-history)

---

## Quick Start

### Discriminative Example

Build and train a classifier on flat data:

```python
from Enilnets import NeuralNet, LRScheduler
import numpy as np

model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.01)
model.add_dense(784, 256, activation="relu")
model.add_dropout(0.3)
model.add_dense(256, 10, activation="softmax")

X_train = np.random.randn(1000, 784)
Y_train = np.eye(10)[np.random.randint(0, 10, 1000)]

# With learning rate scheduler
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=50)
history = model.Train(X_train, Y_train, epochs=50, batch_size=32, scheduler=scheduler)
```

### Generative Example (VAE)

Train a Variational Autoencoder on image-like data:

```python
from Enilnets import VAE
import numpy as np

vae = VAE(input_dim=784, latent_dim=32,
          encoder_hidden=[512, 256], decoder_hidden=[256, 512],
          learning_rate=0.001, optimizer="adam")

X_train = np.random.rand(1000, 784)
history = vae.Train(X_train, epochs=20, batch_size=64)
generated = vae.generate(n_samples=16)
```

### Reinforcement Learning (PPO)

Train a policy network with Proximal Policy Optimization:

```python
from Enilnets import NeuralNet
import numpy as np

policy = NeuralNet(learning_rate=3e-4, optimizer="adam")
policy.add_dense(4, 64, activation="tanh")
policy.add_dense(64, 2, activation="softmax")

# states, actions, old_log_probs, advantages from environment
policy.PPO(states, actions, old_log_probs, advantages, action_type="discrete")
```

---

## Installation & Project Structure

The library is organized as a Python package with the following module layout:

```
Enilnets/
|-- __init__.py          # Package entry point: exports NeuralNet, LRScheduler, generative classes
|-- base.py              # NeuralNet class definition + method binding
|-- layers.py            # Layer factory functions (add_dense, add_conv2d, etc.)
|-- forward.py           # Forward pass implementation + im2col + normalization
|-- backward.py          # Backpropagation for all layer types
|-- optimizer.py         # Gradient update rules (SGD, Adam, RMSprop, Adagrad)
|-- loss.py              # Loss function implementations
|-- activations.py       # Activation functions and their derivatives
|-- weight_init.py       # Weight initialization strategies
|-- train.py             # Training loop, metrics, LRScheduler
|-- io.py                # Model save/load (JSON & Pickle)
|-- reinforce.py         # RL algorithms: Evolve, REINFORCE, PPO, ActorCritic
|-- generative/
|   |-- __init__.py      # Exports all generative classes and utilities
|   |-- vae.py           # Variational Autoencoder
|   |-- gan.py           # Generative Adversarial Network
|   |-- diffusion.py     # Denoising Diffusion Probabilistic Model
|   |-- autoregressive.py # MADE-style autoregressive model
|   |-- flows.py         # RealNVP normalizing flow
|   |-- ebm.py           # Energy-Based Model
|   |-- unet.py          # UNet architecture for diffusion
|   |-- sampling.py      # Sampling utilities (reparameterize, Gumbel, etc.)
|   |-- generative_loss.py # Loss functions for generative models
```

All layer addition methods, forward/backward passes, optimizers, loss functions, training methods, I/O, and RL methods are dynamically bound to the `NeuralNet` class at import time via monkey-patching in `base.py`. This allows each submodule to remain focused while the user interacts with a single unified API.

---

## Core Architecture

### NeuralNet Class Overview

The `NeuralNet` class in `base.py` is the central abstraction. It stores everything needed to define, train, and evaluate a neural network entirely in NumPy.

| Attribute | Type | Description |
|-----------|------|-------------|
| `layers` | `list[dict]` | Layer definitions with weights, biases, and hyperparameters. Each layer is a dictionary containing its type-specific parameters. |
| `learning_rate` | `float` | Global learning rate used by all optimizers. |
| `optimizer_type` | `str` | Optimizer name: `"sgd"`, `"rmsprop"`, `"adagrad"`, `"adam"`. |
| `l2_lambda` | `float` | L2 regularization coefficient applied to weight gradients. |
| `momentum` | `float` | Momentum coefficient for SGD optimizer. |
| `outputs` | `list[ndarray]` | Cached layer outputs during the most recent forward pass. `outputs[0]` is the input, `outputs[i]` is the output of layer `i-1`. |
| `pre_activations` | `list[ndarray]` | Cached pre-activation values (z = Wx + b) for layers that have activations. Used during backprop for computing derivatives. |
| `batchnorm_cache` | `list` | BatchNorm statistics cache storing `(x, x_norm, mean, var, gamma, epsilon, axes)` for each BatchNorm layer during training. |
| `layernorm_cache` | `list` | LayerNorm statistics cache storing `(x, x_norm, mean, var, gamma, epsilon, axes)` for each LayerNorm layer. |
| `deltas` | `list[ndarray]` | Gradient error terms per layer, computed during backpropagation. |
| `opt_state` | `list[dict]` | Optimizer state (momentum, velocity, squared gradients) for each trainable layer. Lazily initialized on first `update()` call. |
| `t` | `int` | Global timestep counter, incremented on every `update()` call. Used for Adam bias correction. |
| `training` | `bool` | Training mode flag. Affects BatchNorm, Dropout, and layer behavior. |

### Internal Data Flow

1. **Layer Definition**: The user calls `add_*` methods which append dictionaries to `self.layers`. Each dictionary stores weights, biases, and type-specific metadata.
2. **Forward Pass**: `Forward(inputs)` iterates through `self.layers`, computes each layer's output, and caches results in `self.outputs`, `self.pre_activations`, `self.batchnorm_cache`, and `self.layernorm_cache`.
3. **Loss Computation**: `ComputeLoss(output, target)` computes the scalar loss value.
4. **Backward Pass**: `Backward(targets)` computes error gradients (`self.deltas`) by propagating from the output layer back to the input, using cached pre-activations and layer-specific backward functions.
5. **Parameter Update**: `update()` computes weight/bias gradients from `self.deltas` and `self.outputs`, applies L2 regularization, and updates parameters using the chosen optimizer.

### Training vs Evaluation Mode

- **Training mode** (`training=True`, set via `.train()`): BatchNorm uses batch statistics and updates running averages. Dropout randomly zeros neurons. All caches are populated.
- **Evaluation mode** (`training=False`, set via `.eval()`): BatchNorm uses running statistics (no cache). Dropout is disabled (identity pass). Caches are not populated.

---

## Layer Types

### Dense Layer

A fully connected (affine) layer: `output = activation(W @ input + b)`.

```python
model.add_dense(n_in, n_out, activation="relu", init_method="xavier_uniform", use_bias=True)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `n_in` | `int` | required | Number of input features. |
| `n_out` | `int` | required | Number of output features (neurons). |
| `activation` | `str` | `"relu"` | Activation function name (see [Activation Functions](#activation-functions)). |
| `init_method` | `str` | `"xavier_uniform"` | Weight initialization strategy (see [Weight Initialization](#weight-initialization)). |
| `use_bias` | `bool` | `True` | Whether to include a bias vector. If `False`, bias is zeros and not updated. |

**Stored in layer dict**: `"type": "dense"`, `"weights"` (shape `(n_out, n_in)`), `"bias"` (shape `(n_out,)`), `"activation"`, `"use_bias"`.

### Sparse Layer

A dense layer with a fixed random connectivity mask. Only a fraction of weights are non-zero and trainable.

```python
model.add_sparse(n_in, n_out, connectivity=0.5, activation="relu", init_method="xavier_uniform")
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `n_in` | `int` | required | Number of input features. |
| `n_out` | `int` | required | Number of output features. |
| `connectivity` | `float` | `0.5` | Fraction of weights to keep (0 to 1). A mask is generated randomly and fixed for the layer's lifetime. |
| `activation` | `str` | `"relu"` | Activation function name. |
| `init_method` | `str` | `"xavier_uniform"` | Weight initialization strategy. |

**Stored in layer dict**: `"type": "sparse"`, `"weights"`, `"bias"`, `"mask"` (binary matrix, same shape as weights), `"activation"`. During forward and backward passes, the mask is applied to zero out masked weights. During updates, gradients are also masked.

### Convolutional Layer (Conv2D)

A 2D convolution with no padding (valid convolution). Uses `im2col` for efficient matrix multiplication.

```python
model.add_conv2d(in_ch, out_ch, k, activation="relu", init_method="he_normal", stride=1)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `in_ch` | `int` | required | Number of input channels. |
| `out_ch` | `int` | required | Number of output channels (filters). |
| `k` | `int` | required | Kernel size (square kernel `k x k`). |
| `activation` | `str` | `"relu"` | Activation function name. |
| `init_method` | `str` | `"he_normal"` | Weight initialization strategy. |
| `stride` | `int` | `1` | Stride (stored but currently only stride=1 is fully supported by `im2col`). |

**Input shape**: `(batch, in_ch, H, W)`  
**Output shape**: `(batch, out_ch, H-k+1, W-k+1)`  
**Stored in layer dict**: `"type": "conv2d"`, `"weights"` (shape `(out_ch, in_ch, k, k)`), `"bias"` (shape `(out_ch,)`), `"in_ch"`, `"out_ch"`, `"k"`, `"activation"`, `"stride"`.

**Note**: There is no padding support. Each conv2d layer reduces spatial dimensions by `k-1` on each side.

### Flatten Layer

Reshapes a multi-dimensional tensor into a 2D matrix `(batch, -1)`.

```python
model.add_flatten()
```

**Stored in layer dict**: `"type": "flatten"`. No parameters. Used to transition from conv layers to dense layers.

### Max Pooling 2D

Downsamples by taking the maximum value in each `p x p` non-overlapping window.

```python
model.add_maxpool2d(pool_size=2)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pool_size` | `int` | `2` | Size of the pooling window. |

**Input shape**: `(batch, C, H, W)`  
**Output shape**: `(batch, C, H//p, W//p)` (dimensions are truncated to multiples of `p`).  
**Stored in layer dict**: `"type": "maxpool2d"`, `"p"`.

**Backward pass**: Uses a strided view to identify maxima and distributes gradients only to the max positions within each window.

### Average Pooling 2D

Downsamples by taking the mean value in each `p x p` non-overlapping window.

```python
model.add_avgpool2d(pool_size=2)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pool_size` | `int` | `2` | Size of the pooling window. |

**Input shape**: `(batch, C, H, W)`  
**Output shape**: `(batch, C, H//p, W//p)`.  
**Stored in layer dict**: `"type": "avgpool2d"`, `"p"`.

**Backward pass**: Distributes gradient evenly across all positions in each `p x p` window.

### Global Average Pooling 2D

Computes the mean across spatial dimensions `(H, W)`, reducing `(batch, C, H, W)` to `(batch, C, 1, 1)`.

```python
model.add_global_avgpool2d()
```

**Stored in layer dict**: `"type": "globalavgpool2d"`. No parameters.

**Backward pass**: Distributes the incoming gradient evenly across all spatial positions.

### Upsampling 2D

Nearest-neighbor upsampling by repeating rows and columns.

```python
model.add_upsample2d(scale_factor=2)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `scale_factor` | `int` | `2` | Factor by which to repeat each spatial dimension. |

**Input shape**: `(batch, C, H, W)`  
**Output shape**: `(batch, C, H*scale, W*scale)`.  
**Stored in layer dict**: `"type": "upsample2d"`, `"scale_factor"`.

**Backward pass**: Sums gradients from the repeated positions back to the original positions.

### Batch Normalization

Normalizes activations across the batch dimension. Supports 2D `(batch, features)` and 4D `(batch, C, H, W)` inputs.

```python
model.add_batchnorm(num_features, epsilon=1e-5, momentum=0.1)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_features` | `int` | required | Number of features/channels to normalize. |
| `epsilon` | `float` | `1e-5` | Small constant for numerical stability. |
| `momentum` | `float` | `0.1` | Momentum for updating running statistics. `running_stat = (1-momentum)*running_stat + momentum*batch_stat`. |

**Stored in layer dict**: `"type": "batchnorm"`, `"num_features"`, `"epsilon"`, `"momentum"`, `"running_mean"`, `"running_var"`, `"gamma"` (scale, initialized to 1), `"beta"` (shift, initialized to 0).

**Training**: Computes batch mean and variance, normalizes, applies gamma/beta, updates running statistics, and stores a cache for backward.  
**Evaluation**: Uses running mean and variance, no cache stored.  
**Backward**: Computes gradients w.r.t. input, gamma, and beta using the cached statistics.

### Layer Normalization

Normalizes across the feature dimension(s) independently for each sample. Supports 2D and 4D inputs.

```python
model.add_layernorm(normalized_shape, epsilon=1e-5)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `normalized_shape` | `int` or `tuple` | required | Shape of the features to normalize. For 2D: an int (number of features). For 4D: a tuple like `(C, H, W)`. |
| `epsilon` | `float` | `1e-5` | Small constant for numerical stability. |

**Stored in layer dict**: `"type": "layernorm"`, `"normalized_shape"`, `"epsilon"`, `"gamma"`, `"beta"`.

Unlike BatchNorm, LayerNorm has no running statistics. It always computes mean and variance on the fly. The backward pass computes `dx`, `dgamma`, and `dbeta`.

### Dropout

Randomly zeros a fraction of activations during training for regularization.

```python
model.add_dropout(rate=0.5)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `rate` | `float` | `0.5` | Fraction of neurons to drop (set to 0). Must be in `[0, 1)`. |

**Stored in layer dict**: `"type": "dropout"`, `"rate"`, `"mask"` (binary mask created during forward pass in training mode).

**Training**: Each element is kept with probability `1-rate`, and surviving elements are scaled by `1/(1-rate)` (inverted dropout). The mask is stored for backward.  
**Evaluation**: Identity pass, no masking.  
**Backward**: Multiplies incoming gradient by the stored mask and the same scaling factor.

### Embedding Layer

A lookup table that maps integer token indices to dense vectors.

```python
model.add_embedding(vocab_size, embed_dim, init_method="normal")
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `vocab_size` | `int` | required | Number of unique tokens in the vocabulary. |
| `embed_dim` | `int` | required | Dimension of each embedding vector. |
| `init_method` | `str` | `"normal"` | Initialization strategy. |

**Stored in layer dict**: `"type": "embedding"`, `"weights"` (shape `(vocab_size, embed_dim)`), `"vocab_size"`, `"embed_dim"`.

**Input**: Integer array of shape `(batch, seq_len)` or `(batch,)` (1D is reshaped to `(batch, 1)`).  
**Output**: Embedding vectors of shape `(batch, seq_len, embed_dim)`.  
**Backward**: Sparse gradient -- only the rows corresponding to seen indices are updated. The `_last_input` key stores the input indices for gradient computation.

---

## Forward Pass

### Input Handling

The `Forward(self, inputs, training=False, dropout_rate=0.0)` method handles input normalization:

- **1D input** `(features,)` -> reshaped to `(1, features)` (single sample batch).
- **3D input** `(C, H, W)` -> reshaped to `(1, C, H, W)` (single image batch).
- **2D input** `(batch, features)` and **4D input** `(batch, C, H, W)` are used as-is.

All inputs are cast to `np.float64` for numerical stability.

### Layer-by-Layer Computation

The forward pass iterates through `self.layers` in order. For each layer:

1. **Dense/Sparse**: `z = x @ W.T + b`, then `x = activation(z)`. Pre-activation `z` is cached.
2. **Conv2D**: Uses `im2col` to unfold the input into columns, performs matrix multiplication with flattened kernels, reshapes back, adds bias, then applies activation.
3. **Flatten**: Reshapes to `(batch, -1)`.
4. **MaxPool2D**: Strided view into `p x p` blocks, takes max along the block axes.
5. **AvgPool2D**: Same strided view, takes mean.
6. **GlobalAvgPool2D**: Mean over axes `(2, 3)` with `keepdims=True`.
7. **Upsample2D**: `x.repeat(scale, axis=2).repeat(scale, axis=3)`.
8. **BatchNorm**: Normalizes using batch stats (training) or running stats (eval), then applies `gamma * x_norm + beta`.
9. **LayerNorm**: Normalizes per-sample using feature stats, then applies `gamma * x_norm + beta`.
10. **Dropout**: Random mask with inverted scaling during training; identity during eval.
11. **Embedding**: Integer index lookup into the weight matrix.

After each layer, the output is appended to `self.outputs`, pre-activations to `self.pre_activations`, and normalization caches to `self.batchnorm_cache` / `self.layernorm_cache`.

### im2col for Convolutions

The `im2col(input_data, filter_h, filter_w, stride=1, pad=0)` function converts image batches into column matrices suitable for efficient matrix multiplication:

1. Pads the input with zeros if `pad > 0`.
2. Uses NumPy's `as_strided` to create a view where each receptive field is a row.
3. Transposes and reshapes to `(N * out_h * out_w, C * filter_h * filter_w)`.

This allows convolutions to be computed as a single large matrix multiplication: `output = col @ W_flat.T`, which is significantly faster than nested loops in pure NumPy.

---

## Backward Pass

### Automatic Delta Computation

`Backward(self, targets=None, output_delta=None)` supports two modes:

**Mode 1: Automatic (targets provided)**

```python
model.Backward(ys)
```

- If the last layer uses `"softmax"` activation, the delta is computed as `(out - targets) / batch_size` (the combined softmax + cross-entropy gradient simplification).
- Otherwise, delta = `(out - targets) * derivative(activation, pre_activation) / batch_size`.

**Mode 2: Manual (output_delta provided)**

```python
model.Backward(None, output_delta=custom_delta)
```

- Used in reinforcement learning and generative models where the output gradient is computed externally.
- `output_delta` is reshaped to `(batch, features)` if 1D.

### Per-Layer Gradient Propagation

After computing the output delta, the backward pass iterates from the second-to-last layer back to the first:

For each layer `l` (current) and `l+1` (next), it computes the error `err` flowing into layer `l` based on the next layer's type:

| Next Layer Type | Error Computation |
|-----------------|-------------------|
| `dense` / `sparse` | `err = next_delta @ W_next` |
| `flatten` | `err = next_delta.reshape(outputs[l+1].shape)` |
| `conv2d` | `err = conv2d_backward_input(next_delta, W_next, outputs[l+1].shape)` -- transposed convolution via im2col |
| `maxpool2d` | `err = maxpool2d_backward(next_delta, outputs[l+1], pool_size)` -- routes gradient to max positions |
| `avgpool2d` | `err = avgpool2d_backward(next_delta, outputs[l+1], pool_size)` -- distributes gradient evenly |
| `globalavgpool2d` | `err = globalavgpool2d_backward(next_delta, outputs[l+1])` -- distributes over spatial dims |
| `upsample2d` | `err = upsample2d_backward(next_delta, outputs[l+1], scale)` -- sums repeated positions |
| `dropout` | `err = next_delta * mask / (1 - rate)` (or identity if mask is None) |
| `batchnorm` | `err, dgamma, dbeta = batchnorm_backward(next_delta, cache)` -- stores `d_gamma`, `d_beta` on the layer dict |
| `layernorm` | `err, dgamma, dbeta = layernorm_backward(next_delta, cache)` -- stores `d_gamma`, `d_beta` on the layer dict |
| `embedding` | `err = next_delta` (gradient flows back to previous layer) |

Then, if the current layer has an activation (dense, sparse, conv2d), the error is multiplied by the activation derivative evaluated at the pre-activation: `self.deltas[l] = err * derivative(activation, pre_activation)`.

**Important**: For BatchNorm and LayerNorm, the backward pass **requires** that `Forward(training=True)` was called first to populate the caches. If a cache is `None`, a `ValueError` is raised.

---

## Optimizers

All optimizers are implemented in `optimizer.py` and applied in the `update()` method. The optimizer is selected via the `optimizer` parameter in `NeuralNet.__init__()`.

### SGD with Momentum

```python
model = NeuralNet(optimizer="sgd", learning_rate=0.01, momentum=0.9)
```

Update rule for weights (same structure for biases, gamma, beta):

```
velocity = momentum * velocity - learning_rate * gradient
weight += velocity
```

Momentum accumulates velocity in the direction of persistent gradients, helping escape shallow local minima and accelerating convergence in consistent gradient directions.

### RMSprop

```python
model = NeuralNet(optimizer="rmsprop", learning_rate=0.001)
```

Update rule:

```
v = 0.999 * v + 0.001 * gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)
```

RMSprop adapts the learning rate per parameter by dividing by a running average of squared gradients. This helps with non-stationary objectives and sparse gradients.

### Adagrad

```python
model = NeuralNet(optimizer="adagrad", learning_rate=0.01)
```

Update rule:

```
v += gradient^2
weight -= learning_rate * gradient / (sqrt(v) + 1e-8)
```

Adagrad accumulates all historical squared gradients. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Note that the learning rate naturally decays over time.

### Adam with Bias Correction

```python
model = NeuralNet(optimizer="adam", learning_rate=0.001)
```

Adam combines momentum (first moment) and RMSprop (second moment) with bias correction:

```
m = 0.9 * m + 0.1 * gradient        # first moment
v = 0.999 * v + 0.001 * gradient^2  # second moment
m_hat = m / (1 - 0.9^t)              # bias-corrected first moment
v_hat = v / (1 - 0.999^t)            # bias-corrected second moment
weight -= learning_rate * m_hat / (sqrt(v_hat) + 1e-8)
```

Where `t` is the global timestep incremented on each `update()` call. Bias correction prevents the initial estimates from being biased toward zero.

### L2 Regularization

L2 regularization (weight decay) is applied to all weight gradients before the optimizer step:

```
grad_w = grad_w + l2_lambda * weights * mask
```

The `mask` term ensures that sparse layers only regularize their active connections. L2 regularization penalizes large weights, encouraging simpler models and reducing overfitting.

**Optimizer state initialization**: On the first call to `update()`, `self.opt_state` is lazily initialized with zero-initialized momentum/velocity buffers matching the shape of each layer's trainable parameters.

---

## Loss Functions

All loss functions are implemented in `loss.py` via `ComputeLoss(self, output, target, function="mse", reduction="mean", **kwargs)`.

### Regression Losses

#### MSE (Mean Squared Error)

```python
model.ComputeLoss(output, target, function="mse", reduction="mean")
```

`loss = (output - target)^2`

Standard regression loss. Penalizes large errors quadratically.

#### MAE (Mean Absolute Error)

```python
model.ComputeLoss(output, target, function="mae", reduction="mean")
```

`loss = |output - target|`

More robust to outliers than MSE since errors are not squared.

#### Huber Loss

```python
model.ComputeLoss(output, target, function="huber", delta=1.0, reduction="mean")
```

```
loss = 0.5 * diff^2              if diff < delta
loss = delta * (diff - 0.5*delta)  otherwise
```

Combines MSE for small errors and MAE for large errors. `delta` controls the transition point.

#### Smooth L1 Loss

```python
model.ComputeLoss(output, target, function="smooth_l1", reduction="mean")
```

Huber loss with `delta=1.0` hardcoded. Commonly used in object detection.

### Classification Losses

#### Binary Cross-Entropy

```python
model.ComputeLoss(output, target, function="binary_cross_entropy", reduction="mean")
```

`loss = -(target * log(output) + (1 - target) * log(1 - output))`

For binary classification with sigmoid output. Output is clipped to `[1e-12, 1-1e-12]` for numerical stability.

#### Cross-Entropy / Categorical Cross-Entropy

```python
model.ComputeLoss(output, target, function="cross_entropy", reduction="mean")
```

`loss = -target * log(output)`

For multi-class classification with softmax output. Output clipped to `[1e-12, 1.0]`. Supports `"mean"`, `"sum"`, and `"none"` (per-element) reduction.

**Note**: When `reduction="mean"`, the loss is divided by the batch size. When `"sum"`, it is summed. When `"none"`, the raw per-element loss array is returned.

#### Focal Loss

```python
model.ComputeLoss(output, target, function="focal", alpha=0.25, gamma=2.0, reduction="mean")
```

Down-weights easy examples and focuses on hard examples:

```
pt = output * target + (1 - output) * (1 - target)
loss = -(alpha * target * (1-pt)^gamma * log(output) + (1-alpha) * (1-target) * pt^gamma * log(1-output))
```

Useful for imbalanced datasets. `alpha` balances positive/negative examples, `gamma` focuses on hard examples.

#### Hinge Loss

```python
model.ComputeLoss(output, target, function="hinge", reduction="mean")
```

`loss = max(0, 1 - target * output)`

For SVM-style classification. Target should be `+1` or `-1`.

#### BCE with Logits (Numerically Stable)

```python
model.ComputeLoss(output, target, function="bce_logits", reduction="mean")
```

`loss = max(output, 0) - output * target + log(1 + exp(-|output|))`

Computes binary cross-entropy directly from logits (pre-sigmoid values) without explicitly computing sigmoid, avoiding numerical issues for extreme values.

### Advanced Losses

#### Wasserstein Loss

```python
model.ComputeLoss(output, target, function="wasserstein", reduction="mean")
```

`loss = -output * target`

Used in Wasserstein GANs. Target is `+1` for real and `-1` for fake.

#### Cosine Similarity Loss

```python
model.ComputeLoss(output, target, function="cosine_similarity", reduction="mean")
```

`loss = 1 - cos(output, target)`

Measures the cosine distance between vectors. Useful for embedding learning and contrastive tasks.

#### Triplet Loss

```python
model.ComputeLoss(output, target, function="triplet", margin=1.0, negative=neg_samples, reduction="mean")
```

```
d_pos = ||anchor - positive||^2
d_neg = ||anchor - negative||^2
loss = max(0, d_pos - d_neg + margin)
```

`output` is the anchor, `target` is the positive, and `negative` kwarg provides the negative samples. Used in metric learning (e.g., FaceNet).

#### NT-Xent (Normalized Temperature-scaled Cross Entropy)

```python
model.ComputeLoss(output, target, function="ntxent", temperature=0.5, reduction="mean")
```

SimCLR contrastive loss. Computes pairwise cosine similarities, applies temperature scaling, and uses a cross-entropy formulation where positive pairs are on the diagonal.

### Generative Losses

#### KL Divergence (for VAE)

```python
model.ComputeLoss(output, target, function="kl_divergence", mu=mu, logvar=logvar, reduction="mean")
```

`loss = -0.5 * sum(1 + logvar - mu^2 - exp(logvar), axis=-1)`

KL divergence between the approximate posterior `q(z|x)` and the prior `N(0, I)`. Used as a regularization term in VAEs.

---

## Activation Functions

All activations and their derivatives are implemented in `activations.py`.

### ReLU Family

| Name | Forward | Derivative | Notes |
|------|---------|------------|-------|
| `relu` | `max(0, x)` | `1 if x > 0 else 0` | Most common default. |
| `leakyrelu` | `x if x > 0 else 0.01*x` | `1 if x > 0 else 0.01` | Small negative slope prevents dying ReLU. |
| `elu` | `x if x > 0 else exp(x)-1` | `1 if x > 0 else exp(x)` | Smooth negative region, mean closer to zero. |
| `selu` | `scale * (x if x>0 else alpha*(exp(x)-1))` | `scale * (1 if x>0 else alpha*exp(x))` | Self-normalizing; `alpha=1.6733`, `scale=1.0507`. |

### Sigmoid Family

| Name | Forward | Derivative | Notes |
|------|---------|------------|-------|
| `sigmoid` | `1 / (1 + exp(-x))` | `sigmoid(x) * (1 - sigmoid(x))` | Clipped to `[-500, 500]` to prevent overflow. |
| `tanh` | `tanh(x)` | `1 - tanh(x)^2` | Zero-centered output. |
| `softmax` | `exp(x - max(x)) / sum(exp(x - max(x)))` | *Handled specially in backward* | Numerically stable via max subtraction. |
| `softplus` | `log(1 + exp(x))` | `sigmoid(x)` | Smooth approximation of ReLU. |

### Advanced Activations

| Name | Forward | Derivative | Notes |
|------|---------|------------|-------|
| `gelu` | `0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))` | `CDF + x*PDF` | Used in Transformer architectures. |
| `swish` | `x * sigmoid(x)` | `sigmoid(x) + x*sigmoid(x)*(1-sigmoid(x))` | Self-gated, smooth. |
| `mish` | `x * tanh(log(1 + exp(x)))` | `tanh(sp) + x*sigmoid(x)*(1-tanh(sp)^2)` | `sp = softplus(x)`. Smooth and self-regularizing. |

### Linear / Identity

| Name | Forward | Derivative | Notes |
|------|---------|------------|-------|
| `linear` | `x` | `1` | No transformation. Used for output layers before softmax/sigmoid. |

---

## Weight Initialization

All initializers are in `weight_init.py` and automatically called by layer addition methods.

### Dense & Sparse Layer Initializers

For a layer with `n_in` inputs and `n_out` outputs:

| Method | Formula | Best For |
|--------|---------|----------|
| `xavier_uniform` | `U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))` | Tanh/sigmoid activations |
| `xavier_normal` | `N(0, sqrt(2/(n_in+n_out)))` | Tanh/sigmoid activations |
| `he_uniform` | `U(-sqrt(6/n_in), sqrt(6/n_in))` | ReLU activations |
| `he_normal` | `N(0, sqrt(2/n_in))` | ReLU activations (default for conv) |
| `normal` | `N(0, 0.1)` | General purpose, small initial values |
| `orthogonal` | SVD-based orthogonal matrix | RNNs, preserving gradient norms |
| `zeros` | All zeros | Biases, or when you want to start from zero |
| `ones` | All ones | Special cases |

### Convolutional Layer Initializers

Same methods as dense, but fan-in is computed as `in_ch * k * k` (number of input connections per filter):

| Method | Formula |
|--------|---------|
| `xavier_uniform` | `U(-sqrt(6/(in_ch*k*k + out_ch)), ...)` |
| `xavier_normal` | `N(0, sqrt(2/(in_ch*k*k + out_ch)))` |
| `he_uniform` | `U(-sqrt(6/(in_ch*k*k)), ...)` |
| `he_normal` | `N(0, sqrt(2/(in_ch*k*k)))` (default) |
| `normal` | `N(0, 0.1)` |
| `orthogonal` | SVD on `(out_ch, in_ch*k*k)`, reshaped to `(out_ch, in_ch, k, k)` |
| `zeros` | All zeros |
| `ones` | All ones |

### Embedding Layer Initializers

| Method | Formula |
|--------|---------|
| `normal` | `N(0, 0.1)` (default) |
| `xavier_uniform` | `U(-sqrt(6/(vocab_size + embed_dim)), ...)` |
| `xavier_normal` | `N(0, sqrt(2/(vocab_size + embed_dim)))` |
| `zeros` | All zeros |

**Note**: All weights and biases are stored as `np.float64` for maximum numerical precision.

---

## Training Utilities

### Train Method

The `Train` method in `train.py` provides a complete training loop with validation support, metric tracking, and learning rate scheduling.

```python
history = model.Train(
    X_train, Y_train,
    epochs=10, batch_size=32,
    X_val=None, Y_val=None,
    loss_function=None,
    verbose=True,
    scheduler=None,
    **loss_kwargs
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `X_train` | `ndarray` | required | Training inputs. |
| `Y_train` | `ndarray` | required | Training targets. |
| `epochs` | `int` | `10` | Number of training epochs. |
| `batch_size` | `int` | `32` | Batch size for mini-batch gradient descent. |
| `X_val` | `ndarray` or `None` | `None` | Validation inputs. If provided, validation metrics are computed each epoch. |
| `Y_val` | `ndarray` or `None` | `None` | Validation targets. |
| `loss_function` | `str` or `None` | `None` | Loss function name. If `None`, auto-detects based on last layer activation (`"cross_entropy"` for softmax, `"mse"` otherwise). |
| `verbose` | `bool` | `True` | Whether to print progress. |
| `scheduler` | `LRScheduler` or `None` | `None` | Learning rate scheduler instance. |
| `**loss_kwargs` | | | Additional arguments passed to `ComputeLoss` (e.g., `delta` for Huber loss). |

**Returns**: `history` dict with keys `"loss"`, `"accuracy"`, `"val_loss"`, `"val_accuracy"`, `"lr"`.

**Training loop details**:
1. If a scheduler is provided, `scheduler.step(epoch)` is called at the start of each epoch to update the learning rate.
2. Training data is shuffled each epoch.
3. Batches are processed sequentially. For each batch:
   - `TrainBatch` is called (forward, loss, backward, update).
   - Loss and accuracy are weighted by actual batch size (handles last incomplete batch).
4. Epoch averages are computed and stored in history.
5. If validation data is provided, the model runs in eval mode for validation (though `Forward` is called with `training=False`, the caches are not used for updates).

### TrainBatch Method

```python
loss, out = model.TrainBatch(xs, ys, loss_function=None, **loss_kwargs)
```

A single training step that:
1. Calls `Forward(xs, training=True)`
2. Auto-detects loss function if not provided
3. Computes loss via `ComputeLoss`
4. Calls `Backward(ys)`
5. Calls `update()` to apply gradients

Returns the scalar loss and the network output.

### Learning Rate Schedulers

The `LRScheduler` class in `train.py` supports multiple decay strategies:

```python
from Enilnets import LRScheduler

# Step decay: halve LR every 10 epochs
scheduler = LRScheduler(initial_lr=0.001, mode="step", drop=0.5, epochs_drop=10)

# Exponential decay: multiply by 0.95 each epoch
scheduler = LRScheduler(initial_lr=0.001, mode="exponential", decay=0.95)

# Cosine annealing
scheduler = LRScheduler(initial_lr=0.001, mode="cosine", max_epochs=100)

# Warmup + cosine
scheduler = LRScheduler(initial_lr=0.001, mode="warmup_cosine", max_epochs=100, warmup_epochs=5)
```

| Mode | Formula | Parameters |
|------|---------|------------|
| `"step"` | `lr * drop^(epoch // epochs_drop)` | `drop=0.5`, `epochs_drop=10` |
| `"exponential"` | `lr * decay^epoch` | `decay=0.95` |
| `"cosine"` | `lr * 0.5 * (1 + cos(pi * epoch / max_epochs))` | `max_epochs=100` |
| `"warmup_cosine"` | Linear warmup then cosine | `max_epochs=100`, `warmup_epochs=5` |
| `"plateau"` | Returns initial_lr (placeholder) | None |

The scheduler's `step(epoch)` method returns the learning rate for that epoch. The `Train` method calls `self.set_lr(lr)` before each epoch.

### Metrics Computation

#### Accuracy

```python
acc = model.compute_accuracy(predictions, targets)
```

- **Multi-class** (`predictions.shape[-1] > 1`): Compares `argmax` of predictions vs targets.
- **Binary** (`predictions.shape[-1] == 1`): Thresholds at 0.5.

Returns mean accuracy as a float.

#### Precision, Recall, F1

```python
metrics = model.compute_precision_recall_f1(predictions, targets)
# Returns: {"precision": float, "recall": float, "f1": float}
```

Binary classification metrics. Uses the same multi-class/binary detection as accuracy. Computed with `1e-12` epsilon to prevent division by zero.

---

## Model Utilities

### Training / Evaluation Mode

```python
model.train()   # Set training=True, returns self (for chaining)
model.eval()    # Set training=False, returns self
```

These affect BatchNorm (batch vs running stats) and Dropout (active vs identity).

### Learning Rate Control

```python
model.set_lr(0.0001)   # Set learning rate
lr = model.get_lr()    # Get current learning rate
```

### Gradient Clipping

```python
model.clip_gradients(max_norm=1.0)
```

Clips the L2 norm of all deltas across all layers:

1. Computes `total_norm = sqrt(sum(||delta||^2))` over all non-None deltas.
2. If `total_norm > max_norm`, scales all deltas by `max_norm / total_norm`.

Call this after `Backward()` and before `update()` to prevent exploding gradients.

### Layer Freezing / Unfreezing

```python
model.freeze()           # Freeze all layers
model.freeze(2)          # Freeze layer index 2 only
model.unfreeze()         # Unfreeze all layers
model.unfreeze(2)        # Unfreeze layer index 2 only
```

Frozen layers are skipped during `update()`. The `_frozen` flag is checked in the optimizer loop. This is useful for transfer learning and fine-tuning.

### Weight Get / Set

```python
weights = model.get_weights()   # Returns list of dicts, one per layer
model.set_weights(weights)      # Restores weights from list of dicts
```

`get_weights()` copies `"weights"`, `"bias"`, `"gamma"`, `"beta"`, and `"mask"` for each layer. `set_weights()` restores them. Useful for checkpointing, model averaging, and transfer.

### Model Copying

```python
net_copy = model.copy()
```

Creates a deep copy of the network including layers and optimizer state. The new network has the same architecture, weights, and optimizer buffers but is an independent object.

### Optimizer State Reset

```python
model.reset_optimizer_state()
```

Clears all optimizer momentum/velocity buffers and resets the timestep `t` to 0. Useful when starting training on a new dataset or after significant hyperparameter changes.

### NaN / Inf Detection

```python
issues = model.check_nan_inf()
# Returns list of strings like ["Layer 2 weights has NaN/Inf", "Delta 5 has NaN/Inf"]
```

Checks all weights, biases, gamma, beta, and deltas for non-finite values. Returns an empty list if everything is clean. Call this periodically during training to catch numerical instability early.

### Model Summary

```python
model.summary()
```

Prints a formatted table showing:
- Optimizer, learning rate, L2 lambda
- Per-layer information: type, input/output shapes, parameter counts
- Total parameter count

Example output:
```
Model Summary
======================================================================
Optimizer: ADAM | LR: 0.001 | L2: 0.01
======================================================================
Layer 0: DENSE        Input:    784 Output:    256 Params: 200960
Layer 1: DROPOUT
Layer 2: DENSE        Input:    256 Output:     10 Params: 2570
Total Parameters: 203530
======================================================================
```

---

## Reinforcement Learning

All RL methods are in `reinforce.py` and bound to `NeuralNet`.

### Evolutionary Strategy (Evolve)

```python
best_score = model.Evolve(inputs, score_fn, noise=0.05, tries=10, sigma=1.0)
```

A black-box optimization method that perturbs network weights with Gaussian noise and keeps the best variant.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `inputs` | `ndarray` | required | Input data to evaluate the network on. |
| `score_fn` | `callable` | required | Function that takes network output and returns a scalar score (higher is better). |
| `noise` | `float` | `0.05` | Standard deviation scale for weight perturbations. |
| `tries` | `int` | `10` | Number of candidate networks to try. |
| `sigma` | `float` | `1.0` | Additional scaling factor for noise. |

**Algorithm**:
1. Evaluate the current network on `inputs` to get a baseline score.
2. For each try, create a deep copy of the network, add Gaussian noise to all weights and biases (respecting sparse masks), evaluate the candidate.
3. If the candidate scores higher, keep it as the new best.
4. Restore the best network.

Returns the best score achieved.

### REINFORCE (Policy Gradient)

```python
mean_return = model.Reinforce(
    states, actions, returns,
    action_type="discrete", std=1.0, normalize_returns=True
)
```

Monte-Carlo policy gradient method.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `states` | `ndarray` | required | Observed states, shape `(N, features)`. |
| `actions` | `ndarray` | required | Discrete: `(N,)` integer indices. Continuous: `(N, action_dim)`. |
| `returns` | `ndarray` | required | Discounted returns for each state-action pair, shape `(N,)` or `(N, 1)`. |
| `action_type` | `str` | `"discrete"` | `"discrete"` (categorical) or `"continuous"` (Gaussian). |
| `std` | `float` | `1.0` | Standard deviation for continuous Gaussian policy. |
| `normalize_returns` | `bool` | `True` | Whether to z-score normalize returns before computing gradients. |

**Discrete actions**: Network output is treated as action probabilities. The gradient is `(out - one_hot(actions)) * returns / batch_size`.  
**Continuous actions**: Network output is treated as action means. The gradient is `-(actions - means) / std^2 * returns / batch_size`.

Returns the mean of the raw (un-normalized) returns.

### Proximal Policy Optimization (PPO)

```python
policy_loss = model.PPO(
    states, actions, old_log_probs, advantages,
    action_type="discrete", epsilon=0.2, std=1.0,
    value_targets=None, value_coeff=0.5, entropy_coeff=0.01
)
```

PPO is a policy gradient method that clips the objective to prevent overly large policy updates.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `states` | `ndarray` | required | Observed states, shape `(N, features)`. |
| `actions` | `ndarray` | required | Discrete: `(N,)` integers. Continuous: `(N, action_dim)`. |
| `old_log_probs` | `ndarray` | required | Log probabilities under the old policy, shape `(N, 1)`. |
| `advantages` | `ndarray` | required | Advantage estimates, shape `(N, 1)`. |
| `action_type` | `str` | `"discrete"` | `"discrete"` or `"continuous"`. |
| `epsilon` | `float` | `0.2` | Clipping parameter. |
| `std` | `float` | `1.0` | Fixed std for continuous Gaussian policy. |
| `value_targets` | `ndarray` or `None` | `None` | Target values for value head (not yet implemented in gradient). |
| `value_coeff` | `float` | `0.5` | Coefficient for value loss (reserved). |
| `entropy_coeff` | `float` | `0.01` | Coefficient for entropy bonus (encourages exploration). |

**Discrete PPO**:
1. Computes action probabilities from network output.
2. Computes log probabilities for the taken actions.
3. Computes probability ratio `ratio = exp(new_log_prob - old_log_prob)`.
4. Computes clipped surrogate objective: `min(ratio * advantage, clip(ratio, 1-eps, 1+eps) * advantage)`.
5. Approximates policy gradient: for each sample, if not clipped, gradient flows through the taken action proportional to `-advantage / prob`.
6. Adds entropy gradient: `(1 + log_probs) * entropy_coeff / batch_size`.

**Continuous PPO**: Uses Gaussian log probabilities and computes gradient as `-(actions - means) / std^2 * advantages / batch_size`.

Returns the mean policy loss (negative of the clipped objective).

### Actor-Critic

```python
value_loss = model.ActorCritic(
    states, actions, returns, values,
    action_type="discrete", std=1.0
)
```

Combines policy gradient with a value function baseline.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `states` | `ndarray` | required | Observed states. |
| `actions` | `ndarray` | required | Taken actions. |
| `returns` | `ndarray` | required | Discounted returns, shape `(N, 1)`. |
| `values` | `ndarray` | required | Predicted values from value network, shape `(N, 1)`. |
| `action_type` | `str` | `"discrete"` | `"discrete"` or `"continuous"`. |
| `std` | `float` | `1.0` | Std for continuous actions. |

**Algorithm**:
1. Computes advantages: `advantages = returns - values`.
2. Uses the same policy gradient as REINFORCE but weighted by advantages instead of raw returns.
3. Returns the mean squared advantage (a proxy for value function error).

**Note**: This implementation uses a single network. In practice, you may want separate actor and critic networks or a network with two output heads.

### RL Utility Functions

#### compute_returns

```python
from Enilnets import compute_returns

returns = compute_returns(rewards, gamma=0.99)
```

Computes discounted returns for a single episode:

```
G_t = reward_t + gamma * G_{t+1}
```

Iterates backwards through the reward array. Returns an array of the same shape.

#### gae (Generalized Advantage Estimation)

```python
from Enilnets.generative.sampling import gae

advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `rewards` | `ndarray` | required | Step rewards, shape `(T,)`. |
| `values` | `ndarray` | required | Value estimates including bootstrap, shape `(T+1,)`. |
| `gamma` | `float` | `0.99` | Discount factor. |
| `lambda_` | `float` | `0.95` | GAE lambda (0 = high bias, 1 = high variance). |

Computes TD-residuals `delta_t = reward_t + gamma * V(s_{t+1}) - V(s_t)`, then accumulates them with exponential decay: `A_t = delta_t + gamma * lambda * A_{t+1}`. Returns `(advantages, returns)` where `returns = advantages + values[:T]`.

---

## Generative AI Framework

All generative models are in the `generative/` subpackage and imported from the top-level `Enilnets` package.

### Variational Autoencoder (VAE)

File: `generative/vae.py`

A VAE learns a probabilistic latent representation of data. It consists of an encoder (maps data to latent distribution parameters) and a decoder (maps latent samples back to data).

```python
from Enilnets import VAE

vae = VAE(
    input_dim=784, latent_dim=32,
    encoder_hidden=[512, 256],
    decoder_hidden=[256, 512],
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_dim` | `int` | required | Dimensionality of input data (flattened). |
| `latent_dim` | `int` | required | Dimensionality of the latent space. |
| `encoder_hidden` | `list[int]` | `[512, 256]` | Hidden layer sizes for the encoder. |
| `decoder_hidden` | `list[int]` | `[256, 512]` | Hidden layer sizes for the decoder. |
| `activation` | `str` | `"swish"` | Activation for hidden layers. |
| `learning_rate` | `float` | `0.001` | Learning rate for both encoder and decoder. |
| `optimizer` | `str` | `"adam"` | Optimizer type. |
| `l2_lambda` | `float` | `0.0` | L2 regularization. |

**Architecture**:
- **Encoder**: Dense layers with specified activation, final layer outputs `latent_dim * 2` values (mu and logvar) with linear activation.
- **Decoder**: Dense layers with specified activation, final layer outputs `input_dim` values with sigmoid activation (assumes data in [0, 1]).

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `encode` | `encode(x)` -> `(mu, logvar)` | Maps input to latent distribution parameters. |
| `decode` | `decode(z)` -> `recon` | Maps latent samples to reconstructed data. |
| `forward` | `forward(x)` -> `(recon, mu, logvar, z)` | Full forward pass through encoder + reparameterization + decoder. |
| `loss` | `loss(x, recon=None, mu=None, logvar=None)` -> `float` | Computes reconstruction loss (binary cross-entropy) + KL divergence. |
| `train_step` | `train_step(x)` -> `float` | One training step: forward, backward through decoder, backward through encoder, update both networks. |
| `Train` | `Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]` | Full training loop. Returns list of average losses per epoch. |
| `generate` | `generate(n_samples=1)` -> `ndarray` | Samples from N(0, I) in latent space and decodes. |
| `reconstruct` | `reconstruct(x)` -> `ndarray` | Encodes and decodes input (reconstruction). |
| `interpolate` | `interpolate(x1, x2, n_steps=10)` -> `ndarray` | Linear interpolation in latent space between two inputs. |

**Training details**:
1. Forward: encode -> reparameterize -> decode.
2. Decoder backward: computes `d_recon = (recon - x) / batch_size`, multiplies by sigmoid derivative `recon * (1 - recon)`, backpropagates through decoder, updates decoder weights.
3. Encoder backward: computes gradient of latent samples w.r.t. decoder input, combines with reparameterization gradients to get `d_mu` and `d_logvar`, backpropagates through encoder, updates encoder weights.
4. Loss = binary cross-entropy reconstruction + KL(q(z|x) || N(0, I)).

### Generative Adversarial Network (GAN)

File: `generative/gan.py`

A GAN trains a generator to produce realistic data and a discriminator to distinguish real from fake.

```python
from Enilnets import GAN

gan = GAN(
    latent_dim=100, data_dim=784,
    generator_hidden=[256, 512],
    discriminator_hidden=[512, 256],
    g_activation="swish", d_activation="leakyrelu",
    loss_type="bce",
    learning_rate=0.0002, optimizer="adam", l2_lambda=0.0
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `latent_dim` | `int` | required | Dimensionality of the noise vector. |
| `data_dim` | `int` | required | Dimensionality of generated data. |
| `generator_hidden` | `list[int]` | `[256, 512]` | Hidden layer sizes for generator. |
| `discriminator_hidden` | `list[int]` | `[512, 256]` | Hidden layer sizes for discriminator. |
| `g_activation` | `str` | `"swish"` | Generator hidden activation. |
| `d_activation` | `str` | `"leakyrelu"` | Discriminator hidden activation. |
| `loss_type` | `str` | `"bce"` | `"bce"`, `"bce_logits"`, or `"wasserstein"`. |
| `learning_rate` | `float` | `0.0002` | LR for both networks. |
| `optimizer` | `str` | `"adam"` | Optimizer type. |
| `l2_lambda` | `float` | `0.0` | L2 regularization. |

**Architecture**:
- **Generator**: Dense layers with `g_activation`, final layer with `tanh` activation.
- **Discriminator**: Dense layers with `d_activation`, final layer with `sigmoid` (for BCE) or `linear` (for Wasserstein).

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `generate` | `generate(n_samples)` -> `ndarray` | Samples noise and runs generator forward. |
| `discriminate` | `discriminate(x)` -> `ndarray` | Runs discriminator forward. |
| `Train` | `Train(X_train, epochs=10, batch_size=64, d_steps=1, g_steps=1, verbose=True)` -> `dict` | Alternates discriminator and generator training. |
| `sample` | `sample(n_samples=16)` -> `ndarray` | Alias for `generate`. |

**Loss types**:

| Type | Discriminator Target | Generator Gradient |
|------|---------------------|-------------------|
| `"bce"` | Real=1, Fake=0 | `-1 / D(fake)` |
| `"bce_logits"` | Real=1, Fake=0 | Logits-based stable gradient |
| `"wasserstein"` | Real=1, Fake=-1 | `-1` (constant) |

**Training loop**:
1. For each batch, train discriminator for `d_steps` iterations on real + fake data.
2. Train generator for `g_steps` iterations by backpropagating through the discriminator to get gradients w.r.t. generator input.
3. Track and report D_loss and G_loss per epoch.

### Diffusion Model (DDPM)

File: `generative/diffusion.py`

Implements Denoising Diffusion Probabilistic Models (DDPM). The model learns to reverse a gradual noising process.

```python
from Enilnets import DiffusionModel

diffusion = DiffusionModel(
    data_shape=(784,), time_steps=1000,
    beta_schedule="linear", beta_start=1e-4, beta_end=0.02,
    denoiser_type="mlp", denoiser_hidden=[512, 512, 512],
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data_shape` | `tuple` | required | Shape of data. `(D,)` for flattened, `(C, H, W)` for images. |
| `time_steps` | `int` | `1000` | Number of diffusion timesteps. |
| `beta_schedule` | `str` | `"linear"` | `"linear"` or `"cosine"` noise schedule. |
| `beta_start` | `float` | `1e-4` | Starting beta value (linear schedule). |
| `beta_end` | `float` | `0.02` | Ending beta value (linear schedule). |
| `denoiser_type` | `str` | `"mlp"` | `"mlp"` or `"conv"` denoiser architecture. |
| `denoiser_hidden` | `list[int]` | `[512, 512, 512]` | Hidden sizes for MLP denoiser. |
| `learning_rate` | `float` | `0.001` | Learning rate. |
| `optimizer` | `str` | `"adam"` | Optimizer type. |
| `l2_lambda` | `float` | `0.0` | L2 regularization. |

**Noise schedules**:

- **Linear**: `betas = linspace(beta_start, beta_end, time_steps)`
- **Cosine**: Uses a cosine-squared schedule with offset `s=0.008` for smoother noise addition.

**Precomputed constants** (computed in `__init__`):
- `alphas = 1 - betas`
- `alphas_cumprod = cumprod(alphas)` -- cumulative product of alphas
- `sqrt_alphas_cumprod`, `sqrt_one_minus_alphas_cumprod` -- for forward diffusion
- `sqrt_recip_alphas`, `posterior_variance` -- for reverse diffusion

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `train_step` | `train_step(x_0)` -> `float` | One training step: sample timestep t, add noise, predict noise, compute MSE loss, backpropagate. |
| `Train` | `Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]` | Full training loop. |
| `sample` | `sample(n_samples=16, shape=None, clip=True)` -> `ndarray` | Generate samples by iteratively denoising from pure noise. |
| `denoise` | `denoise(x_noisy, t_start, t_end=0)` -> `ndarray` | Denoise a partially noised input from timestep `t_start` down to `t_end`. |

**Denoiser architectures**:

- **MLP**: Concatenates flattened input with sinusoidal time embedding, passes through dense layers.
- **Conv**: Stack of conv2d layers. Time embedding is broadcast spatially and added to feature maps.

**Forward diffusion**: `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise`

**Reverse diffusion**: At each step, predict noise, compute mean of p(x_{t-1} | x_t), add scaled noise (except at t=0).

### Autoregressive Model (MADE)

File: `generative/autoregressive.py`

A masked autoregressive model that enforces causality -- each output dimension only depends on previous dimensions.

```python
from Enilnets import AutoregressiveModel

ar = AutoregressiveModel(
    data_dim=784, hidden_dims=[512, 512],
    data_shape=(28, 28), activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data_dim` | `int` | required | Total number of dimensions. |
| `hidden_dims` | `list[int]` | `[512, 512]` | Hidden layer sizes. |
| `data_shape` | `tuple` or `None` | `None` | Original shape for reshaping output (e.g., `(28, 28)`). |
| `activation` | `str` | `"swish"` | Hidden activation. |
| `learning_rate` | `float` | `0.001` | Learning rate. |
| `optimizer` | `str` | `"adam"` | Optimizer type. |
| `l2_lambda` | `float` | `0.0` | L2 regularization. |

**Architecture**: Standard MLP with linear output activation.

**Causality enforcement**: A lower-triangular mask (with zeros on and above diagonal) is applied to the input before feeding it to the network: `x_masked[i] = sum_{j < i} mask[i,j] * x[j]`. This ensures the i-th output only sees dimensions 0 through i-1.

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `forward` | `forward(x, training=True)` -> `ndarray` | Causal forward pass. |
| `loss` | `loss(x)` -> `float` | MSE between predictions and targets. |
| `train_step` | `train_step(x)` -> `float` | One training step with custom backpropagation. |
| `Train` | `Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]` | Full training loop. |
| `generate` | `generate(n_samples=1, shape=None)` -> `ndarray` | Autoregressive sampling: predict dim 0, use it to predict dim 1, etc. |
| `complete` | `complete(partial_x, n_dims=None)` -> `ndarray` | Complete a partial sample by autoregressively filling remaining dimensions. |

**Generation**: Starts with zeros, iteratively predicts each dimension and adds small Gaussian noise for diversity.

### Normalizing Flows (RealNVP)

File: `generative/flows.py`

Real-valued Non-Volume Preserving (RealNVP) flow for density estimation and sampling.

```python
from Enilnets import RealNVP

flow = RealNVP(
    data_dim=784, n_coupling=4, hidden_dim=256,
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data_dim` | `int` | required | Dimensionality of data. |
| `n_coupling` | `int` | `4` | Number of coupling layers. |
| `hidden_dim` | `int` | `256` | Hidden dimension for s and t networks. |
| `activation` | `str` | `"swish"` | Activation for coupling networks. |
| `learning_rate` | `float` | `0.001` | Learning rate. |
| `optimizer` | `str` | `"adam"` | Optimizer type. |
| `l2_lambda` | `float` | `0.0` | L2 regularization. |

**Architecture**: Each coupling layer has two networks:
- **s_net** (scale): Predicts log-scale factors. Output activation is `tanh`.
- **t_net** (translation): Predicts translation. Output activation is `linear`.

Both are 3-layer MLPs: `data_dim//2 -> hidden_dim -> hidden_dim -> data_dim - data_dim//2`.

**Coupling transform**: For input split into `x1` and `x2`:
```
y2 = x2 * exp(s(x1)) + t(x1)
output = concat(x1, y2)
log_det += sum(s(x1))
```

Alternating masks (even/odd splits) ensure all dimensions are transformed.

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `forward` | `forward(x)` -> `(z, log_det)` | Maps data to latent space. Returns transformed data and log determinant. |
| `inverse` | `inverse(z)` -> `x` | Maps latent samples back to data space. |
| `log_prob` | `log_prob(x)` -> `ndarray` | Computes log probability: `log p(z) + log_det` where `p(z)` is N(0, I). |
| `loss` | `loss(x)` -> `float` | Negative mean log probability. |
| `train_step` | `train_step(x)` -> `float` | Returns loss (training uses Evolve, see below). |
| `Train` | `Train(X_train, epochs=10, batch_size=64, verbose=True)` -> `list[float]` | Trains each coupling layer sequentially using evolutionary strategy. |
| `sample` | `sample(n_samples=1)` -> `ndarray` | Samples from base distribution and applies inverse transform. |
| `interpolate` | `interpolate(x1, x2, n_steps=10)` -> `ndarray` | Linear interpolation in latent space. |

**Training**: Uses `Evolve` (evolutionary strategy) rather than analytical backprop through the log-determinant Jacobian. Each coupling layer is trained while keeping previous layers fixed.

### Energy-Based Model (EBM)

File: `generative/ebm.py`

An energy-based model that assigns low energy to real data and high energy to generated samples.

```python
from Enilnets import EnergyBasedModel

ebm = EnergyBasedModel(
    data_dim=784, hidden_dims=[512, 512],
    activation="swish",
    learning_rate=0.001, optimizer="adam", l2_lambda=0.0
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data_dim` | `int` | required | Dimensionality of data. |
| `hidden_dims` | `list[int]` | `[512, 512]` | Hidden layer sizes for energy network. |
| `activation` | `str` | `"swish"` | Hidden activation. |
| `learning_rate` | `float` | `0.001` | Learning rate. |
| `optimizer` | `str` | `"adam"` | Optimizer type. |
| `l2_lambda` | `float` | `0.0` | L2 regularization. |

**Architecture**: MLP mapping data to a scalar energy value. Final layer has `linear` activation.

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `energy` | `energy(x)` -> `ndarray` | Computes scalar energy for input data, shape `(batch, 1)`. |
| `_energy_grad` | `_energy_grad(x)` -> `(energy, grad)` | Finite-difference gradient of energy w.r.t. input (eps=1e-4). |
| `train_step` | `train_step(x_data, n_cd_steps=10, step_size=0.1, noise_scale=0.005)` -> `float` | One contrastive divergence step. |
| `Train` | `Train(X_train, epochs=10, batch_size=64, n_cd_steps=10, step_size=0.1, noise_scale=0.005, verbose=True)` -> `list[float]` | Full training loop. |
| `sample` | `sample(n_samples=1, n_steps=100, step_size=0.1, noise_scale=0.005)` -> `ndarray` | Langevin dynamics sampling from random initialization. |
| `score` | `score(x)` -> `ndarray` | Returns the energy gradient (score function). |

**Training (Contrastive Divergence)**:
1. Generate negative samples by running Langevin dynamics from random noise for `n_cd_steps` iterations.
2. Compute energy on real data and negative samples.
3. Update network to push down energy on real data (target=1) and push up on negative samples (target=-1).
4. Loss = mean(energy(data) - energy(negative_samples)).

**Sampling**: Langevin dynamics iterates `x = x - step_size * grad(energy) + noise_scale * N(0, I)`.

### UNet Denoiser

File: `generative/unet.py`

A UNet architecture for spatial denoising, designed for use with diffusion models.

```python
from Enilnets import UNetDenoiser, time_embedding

unet = UNetDenoiser(
    in_ch=1, base_ch=64, time_emb_dim=128,
    ch_mult=(1, 2, 4)
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `in_ch` | `int` | required | Number of input channels. |
| `base_ch` | `int` | `64` | Base number of channels. |
| `time_emb_dim` | `int` | `128` | Dimensionality of time embedding. |
| `ch_mult` | `tuple` | `(1, 2, 4)` | Channel multipliers for each encoder level. |

**Architecture**:
- **Time embedding MLP**: `time_emb_dim -> time_emb_dim*4 -> time_emb_dim*4` with swish activation.
- **Encoder path**: `len(ch_mult)` levels. Each level has 2 conv2d layers (k=1) with swish activation. Time embedding is added to features (broadcast spatially if channel dims match).
- **Downsampling**: Average pooling by factor 2 between encoder levels.
- **Bottleneck**: 2 conv2d layers at the deepest level.
- **Decoder path**: Mirrors encoder with skip connections. Upsamples, concatenates with skip, applies 2 conv2d layers.
- **Output**: Conv2d (k=1) mapping back to `in_ch` channels with linear activation.

**Methods**:

| Method | Signature | Description |
|--------|-----------|-------------|
| `forward` | `forward(x, t)` -> `ndarray` | Full UNet forward pass with time conditioning. |
| `backward` | `backward(grad_output)` | Raises `NotImplementedError`. |
| `get_params` | `get_params()` -> `list[NeuralNet]` | Returns all sub-networks for external optimization. |

**Important**: The UNet uses `k=1` convolutions to avoid spatial dimension changes (since the library has no padding support). The `backward` method is not implemented; use `DiffusionModel` with `denoiser_type="mlp"` for fully trainable diffusion, or implement custom backpropagation for the UNet.

**time_embedding function**:

```python
emb = time_embedding(t, dim, max_period=10000)
```

Sinusoidal time embedding used in diffusion models:
```
freqs = exp(-log(max_period) * arange(dim//2) / (dim//2))
emb = [sin(t * freqs), cos(t * freqs)]
```

If `dim` is odd, a zero column is appended.

---

## Sampling Utilities

File: `generative/sampling.py`

### VAE Reparameterization

```python
from Enilnets import reparameterize

z = reparameterize(mu, logvar)
```

The reparameterization trick: `z = mu + exp(0.5 * logvar) * eps` where `eps ~ N(0, I)`.

Enables backpropagation through stochastic nodes by separating the randomness from the parameters.

### Langevin Dynamics

```python
from Enilnets import langevin_dynamics

x_sampled = langevin_dynamics(energy_fn, x_init, n_steps=20, step_size=0.1, noise_scale=0.005)
```

Langevin Monte Carlo for sampling from energy-based models:

```
for step in range(n_steps):
    energy, grad = energy_fn(x)
    x = x - step_size * grad + noise_scale * N(0, I)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `energy_fn` | `callable` | required | Function taking `x` and returning `(energy, grad_energy)`. |
| `x_init` | `ndarray` | required | Initial samples. |
| `n_steps` | `int` | `20` | Number of Langevin steps. |
| `step_size` | `float` | `0.1` | Gradient descent step size. |
| `noise_scale` | `float` | `0.005` | Standard deviation of injected noise. |

### Gaussian & Uniform Sampling

```python
from Enilnets import gaussian_sample, uniform_sample

# Gaussian: N(mean, std^2)
samples = gaussian_sample(mean, std, shape=None)
# If shape is None, uses mean.shape

# Uniform: U(low, high)
samples = uniform_sample(low, high, shape)
```

### Gumbel-Softmax

```python
from Enilnets import gumbel_softmax_sample

samples = gumbel_softmax_sample(logits, temperature=1.0, hard=False)
```

Differentiable sampling from categorical distributions:

1. Add Gumbel noise: `y = logits + Gumbel(0, 1)`
2. Apply temperature-scaled softmax: `y_soft = softmax(y / temperature)`
3. If `hard=True`, use straight-through estimator: `y_hard - y_soft + y_soft` (discrete forward, continuous backward).

Useful for training models with discrete latent variables.

### Random Masking

```python
from Enilnets import random_mask

mask = random_mask(shape, ratio)
```

Generates a binary mask where each element is 1 with probability `ratio` (keep ratio), 0 otherwise. Returns `np.float64` array.

### Top-p (Nucleus) Sampling

```python
from Enilnets.generative.sampling import top_p_sampling

samples = top_p_sampling(logits, p=0.9, temperature=1.0)
```

Nucleus sampling for text generation:

1. Convert logits to probabilities with temperature scaling.
2. Sort probabilities in descending order.
3. Find the smallest set of tokens whose cumulative probability exceeds `p`.
4. Sample from this truncated distribution.

Always keeps at least the top token. Returns one-hot encoded samples.

### Discounted Returns

```python
from Enilnets import compute_returns

returns = compute_returns(rewards, gamma=0.99)
```

Computes cumulative discounted returns for a single episode:

```
G_t = reward_t + gamma * G_{t+1}
```

Iterates backward through the reward array. Returns array of same shape.

### Generalized Advantage Estimation (GAE)

```python
from Enilnets.generative.sampling import gae

advantages, returns = gae(rewards, values, gamma=0.99, lambda_=0.95)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `rewards` | `ndarray` | required | Step rewards, shape `(T,)`. |
| `values` | `ndarray` | required | Value estimates including bootstrap `V(s_{T+1})`, shape `(T+1,)`. |
| `gamma` | `float` | `0.99` | Discount factor. |
| `lambda_` | `float` | `0.95` | GAE lambda parameter. |

Algorithm:
```
for t in reversed(range(T)):
    delta = rewards[t] + gamma * values[t+1] - values[t]
    gae_t = delta + gamma * lambda_ * gae_t
    advantages[t] = gae_t
returns = advantages + values[:T]
```

GAE provides a bias-variance tradeoff controlled by `lambda_`: `lambda_=0` gives high-bias TD(0), `lambda_=1` gives high-variance Monte Carlo.

---

## Model I/O

File: `io.py`

### JSON Serialization

```python
model.Save("model.json")
```

Saves the model to a human-readable JSON file. All NumPy arrays are converted to Python lists via a custom encoder. Stores:
- `version`: 3
- `layers`: All layer dictionaries
- `optimizer`: Optimizer type string
- `learning_rate`, `l2_lambda`, `momentum`: Hyperparameters
- `t`: Global timestep

**Note**: JSON does not preserve NumPy array types; arrays are stored as nested lists.

### Pickle Serialization

```python
model.Save("model.pkl")
```

Saves the model to a binary pickle file. Preserves NumPy arrays exactly. More compact and faster than JSON. Detected automatically by `.pkl` extension.

### Loading Models

```python
model = NeuralNet()  # Create a new instance
model.Load("model.json")  # or "model.pkl"
```

The `Load` method:
1. Detects file format by extension (`.pkl` for pickle, otherwise JSON).
2. Loads the payload.
3. Reconstructs layers, converting list data back to `np.float64` arrays for all weight-related keys (`weights`, `bias`, `mask`, `gamma`, `beta`, `running_mean`, `running_var`).
4. Restores hyperparameters (`learning_rate`, `optimizer_type`, `l2_lambda`, `momentum`, `t`).
5. Resets optimizer state (`opt_state = []`) -- you may want to call `reset_optimizer_state()` after loading.

**Important**: The loaded model does not restore optimizer momentum buffers. If you need to resume training exactly, you would need to save and load `opt_state` separately (not currently supported).

---

## Known Limitations

1. **No padding in Conv2D**: All convolutions use `pad=0`, so spatial dimensions shrink by `k-1` per layer. The UNet works around this by using `k=1` convolutions.

2. **UNet backward not implemented**: The `UNetDenoiser.backward()` method raises `NotImplementedError`. Use `DiffusionModel` with `denoiser_type="mlp"` for fully trainable diffusion, or implement custom backpropagation.

3. **Flows use evolutionary training**: `RealNVP` uses `Evolve` (black-box optimization) rather than analytical backprop through the log-determinant Jacobian. This is slower and less precise than gradient-based training.

4. **GAN training can be unstable**: As with all GANs, convergence depends heavily on architecture, learning rates, and data. The library provides three loss types but no spectral normalization or other advanced stabilization techniques.

5. **No GPU acceleration**: Pure NumPy implementation runs on CPU only. Large models and datasets will be slow.

6. **Stride support is limited**: While `stride` is stored in conv2d layers, the `im2col` implementation only fully supports `stride=1`.

7. **No recurrent layers**: No LSTM, GRU, or vanilla RNN support. For sequence modeling, use the embedding layer with dense layers or implement custom recurrence.

8. **No automatic differentiation**: Gradients are hand-coded for each layer type. Adding new layers requires implementing both forward and backward passes.

9. **Optimizer state not saved**: `Save()`/`Load()` do not persist optimizer momentum/velocity buffers. Training resumes from scratch in terms of optimizer state.

10. **Single precision not supported**: All computations use `np.float64`. This provides numerical stability but uses more memory than `float32`.

---

## Version History

### v2.0.0

Major update including:

- **New layer types**: LayerNorm, Embedding, GlobalAvgPool2D, Upsample2D, Sparse
- **Learning rate schedulers**: Step decay, exponential decay, cosine annealing, warmup+cosine, plateau
- **Reinforcement learning**: REINFORCE, PPO, Actor-Critic, Evolutionary Strategy
- **Gradient clipping**: L2 norm clipping across all layers
- **Layer freezing/unfreezing**: Fine-grained control over which layers train
- **NaN/Inf detection**: `check_nan_inf()` for debugging numerical issues
- **Improved BatchNorm**: Full 2D and 4D support with proper running statistics
- **New loss functions**: Cosine similarity, triplet margin, NT-Xent (SimCLR), focal loss, BCE with logits, Wasserstein loss
- **Generative AI framework**: VAE, GAN, Diffusion Model, Autoregressive Model, RealNVP, Energy-Based Model, UNet Denoiser
- **Sampling utilities**: Reparameterization, Langevin dynamics, Gumbel-Softmax, top-p sampling, GAE
- **Perceptual loss utilities**: Placeholder functions for VGG-based perceptual loss
- **Model copying and state reset**: `copy()` and `reset_optimizer_state()`

---

## Complete API Reference

### NeuralNet Methods

| Method | Description |
|--------|-------------|
| `__init__(learning_rate, optimizer, l2_lambda, momentum)` | Constructor |
| `summary()` | Print architecture summary |
| `add_dense(n_in, n_out, activation, init_method, use_bias)` | Add fully connected layer |
| `add_sparse(n_in, n_out, connectivity, activation, init_method)` | Add sparse connected layer |
| `add_conv2d(in_ch, out_ch, k, activation, init_method, stride)` | Add 2D convolution |
| `add_flatten()` | Add flatten layer |
| `add_maxpool2d(pool_size)` | Add max pooling |
| `add_avgpool2d(pool_size)` | Add average pooling |
| `add_global_avgpool2d()` | Add global average pooling |
| `add_upsample2d(scale_factor)` | Add 2x upsampling |
| `add_batchnorm(num_features, epsilon, momentum)` | Add batch normalization |
| `add_layernorm(normalized_shape, epsilon)` | Add layer normalization |
| `add_dropout(rate)` | Add dropout regularization |
| `add_embedding(vocab_size, embed_dim, init_method)` | Add embedding lookup layer |
| `Forward(inputs, training, dropout_rate)` | Forward pass |
| `predict(inputs)` | Alias for Forward |
| `train()` | Set training mode |
| `eval()` | Set evaluation mode |
| `set_lr(lr)` | Set learning rate |
| `get_lr()` | Get learning rate |
| `freeze(layer_idx)` | Freeze layer(s) |
| `unfreeze(layer_idx)` | Unfreeze layer(s) |
| `clip_gradients(max_norm)` | Clip gradient norms |
| `get_weights()` | Copy all weights |
| `set_weights(weights)` | Restore weights |
| `copy()` | Deep copy network |
| `reset_optimizer_state()` | Clear optimizer buffers |
| `check_nan_inf()` | Check for NaN/Inf |
| `Backward(targets, output_delta)` | Backpropagation |
| `update()` | Apply parameter updates |
| `TrainBatch(xs, ys, loss_function, **kwargs)` | Train one batch |
| `Train(X, Y, epochs, batch_size, X_val, Y_val, loss_function, verbose, scheduler, **kwargs)` | Full training loop |
| `ComputeLoss(out, tgt, function, reduction, **kwargs)` | Compute loss |
| `compute_accuracy(pred, tgt)` | Compute classification accuracy |
| `compute_precision_recall_f1(pred, tgt)` | Compute precision, recall, F1 |
| `Evolve(inputs, score_fn, noise, tries, sigma)` | Evolutionary strategy |
| `Reinforce(states, actions, returns, action_type, std, normalize_returns)` | Policy gradient |
| `PPO(states, actions, old_log_probs, advantages, action_type, epsilon, std, value_targets, value_coeff, entropy_coeff)` | Proximal Policy Optimization |
| `ActorCritic(states, actions, returns, values, action_type, std)` | Actor-Critic |
| `Save(file)` | Save model to file |
| `Load(file)` | Load model from file |

### Generative Classes

| Class | Module | Description |
|-------|--------|-------------|
| `VAE` | `generative.vae` | Variational Autoencoder |
| `GAN` | `generative.gan` | Generative Adversarial Network |
| `DiffusionModel` | `generative.diffusion` | DDPM diffusion model |
| `AutoregressiveModel` | `generative.autoregressive` | MADE-style autoregressive model |
| `RealNVP` | `generative.flows` | RealNVP normalizing flow |
| `EnergyBasedModel` | `generative.ebm` | Energy-based model |
| `UNetDenoiser` | `generative.unet` | UNet for diffusion denoising |
| `LRScheduler` | `train` | Learning rate scheduler |

### Generative Loss Functions

| Function | Module | Description |
|----------|--------|-------------|
| `kl_divergence_gaussian(mu, logvar, reduction)` | `generative_loss` | KL(q(z\|x) \|\| N(0,I)) |
| `adversarial_loss_discriminator(real_logits, fake_logits, loss_type)` | `generative_loss` | Discriminator loss (BCE/BCE_logits/Wasserstein) |
| `adversarial_loss_generator(fake_logits, loss_type)` | `generative_loss` | Generator loss |
| `diffusion_loss(pred_noise, true_noise, reduction)` | `generative_loss` | MSE noise prediction |
| `nll_loss(log_px, log_det_jacobian, reduction)` | `generative_loss` | Flow negative log-likelihood |
| `energy_loss(data_energy, sample_energy, margin)` | `generative_loss` | EBM contrastive loss |
| `perceptual_loss(x, y, feature_extractor)` | `generative_loss` | Perceptual loss (falls back to MSE) |
| `vgg_loss(x, y)` | `generative_loss` | Placeholder for VGG perceptual loss |

### Sampling Functions

| Function | Module | Description |
|----------|--------|-------------|
| `reparameterize(mu, logvar)` | `sampling` | VAE reparameterization trick |
| `langevin_dynamics(energy_fn, x_init, n_steps, step_size, noise_scale)` | `sampling` | MCMC sampling for EBMs |
| `gaussian_sample(mean, std, shape)` | `sampling` | Gaussian sampling |
| `uniform_sample(low, high, shape)` | `sampling` | Uniform sampling |
| `gumbel_softmax_sample(logits, temperature, hard)` | `sampling` | Differentiable categorical sampling |
| `random_mask(shape, ratio)` | `sampling` | Random boolean mask |
| `top_p_sampling(logits, p, temperature)` | `sampling` | Nucleus (top-p) sampling |
| `compute_returns(rewards, gamma)` | `sampling` | Discounted returns |
| `gae(rewards, values, gamma, lambda_)` | `sampling` | Generalized Advantage Estimation |

### Weight Initialization Functions

| Function | Description |
|----------|-------------|
| `init_weights(n_in, n_out, method)` | Dense/sparse weight initialization |
| `init_conv_weights(in_ch, out_ch, k, method)` | Conv2D weight initialization |
| `init_embedding_weights(vocab_size, embed_dim, method)` | Embedding weight initialization |

### Utility Functions

| Function | Description |
|----------|-------------|
| `im2col(input_data, filter_h, filter_w, stride, pad)` | Convert images to column matrix for efficient convolution |
| `time_embedding(t, dim, max_period)` | Sinusoidal time embedding for diffusion models |
| `activate(name, x)` | Apply activation function |
| `derivative(name, x)` | Compute activation derivative |
