Metadata-Version: 2.4
Name: Enilnets
Version: 1.0.1
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: numpy>=2.5.0
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist

# Enilnets Library Documentation

A pure NumPy-based deep learning library with support for dense, convolutional, pooling, batch normalization, dropout, and sparse layers. Includes multiple optimizers, loss functions, activation functions, and weight initialization methods.

---

## Table of Contents

1. [Quick Start](#quick-start)
2. [Core Architecture](#core-architecture)
3. [Model Configuration](#model-configuration)
   - [NeuralNet Constructor](#neuralnet-constructor)
   - [Summary](#summary)
4. [Layer Types](#layer-types)
   - [Dense Layer](#dense-layer)
   - [Sparse Layer](#sparse-layer)
   - [Conv2D Layer](#conv2d-layer)
   - [Flatten Layer](#flatten-layer)
   - [MaxPool2D Layer](#maxpool2d-layer)
   - [AvgPool2D Layer](#avgpool2d-layer)
   - [BatchNorm Layer](#batchnorm-layer)
   - [Dropout Layer](#dropout-layer)
5. [Forward Pass](#forward-pass)
6. [Backward Pass](#backward-pass)
7. [Optimizers](#optimizers)
8. [Loss Functions](#loss-functions)
9. [Training](#training)
10. [Activation Functions](#activation-functions)
11. [Weight Initialization](#weight-initialization)
12. [Reinforcement Learning](#reinforcement-learning)
13. [Model I/O](#model-io)
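14. [Utility Functions](#utility-functions)
15. [Complete Example: MNIST Classifier](#complete-example-mnist-classifier)
16. [Architecture Notes](#architecture-notes)
17. [API Reference Summary](#api-reference-summary)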

---

## Quick Start

```python
from Enilnets import NeuralNet
import numpy as np

# Create a simple classifier
model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.01)

# Build architecture
model.add_dense(784, 256, activation="relu")
model.add_dropout(0.3)
model.add_dense(256, 10, activation="softmax")

# Train
X_train = np.random.randn(1000, 784)
Y_train = np.eye(10)[np.random.randint(0, 10, 1000)]

history = model.Train(X_train, Y_train, epochs=10, batch_size=32)

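# Predict (X_test is placeholder data with the same feature count as X_train)
X_test = np.random.randn(5, 784)
predictions = model.Forward(X_test)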
```

---

## Core Architecture

The library is built around the `NeuralNet` class in `base.py`, which maintains:

| Attribute | Type | Description |
|-----------|------|-------------|
| `layers` | `list` | Layer definitions with weights, biases, and hyperparameters |
| `learning_rate` | `float` | Global learning rate |
| `optimizer_type` | `str` | Optimizer name: `"sgd"`, `"rmsprop"`, `"adagrad"`, `"adam"` |
| `l2_lambda` | `float` | L2 regularization coefficient |
| `momentum` | `float` | Momentum coefficient for SGD |
| `outputs` | `list` | Cached layer outputs during forward pass |
| `pre_activations` | `list` | Cached pre-activation values (z) |
| `batchnorm_cache` | `list` | BatchNorm statistics cache |
| `deltas` | `list` | Gradient errors per layer |
| `opt_state` | `list` | Optimizer state (momentum, velocity) |
| `t` | `int` | Global timestep for bias correction (Adam) |

---

## Model Configuration

### NeuralNet Constructor

```python
NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.01, momentum=0.9)
```

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `learning_rate` | `0.001` | Step size for parameter updates |
| `optimizer` | `"adam"` | Optimization algorithm. Options: `"sgd"`, `"rmsprop"`, `"adagrad"`, `"adam"` |
| `l2_lambda` | `0.01` | L2 regularization strength applied to weights |
| `momentum` | `0.9` | Momentum factor for SGD optimizer |

**Example:**

```python
# Adam with low regularization
model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.001)

# SGD with high momentum
model = NeuralNet(learning_rate=0.01, optimizer="sgd", momentum=0.95, l2_lambda=0.0)
```

### Summary

```python
model.summary()
```

Prints a model architecture overview including layer types, dimensions, and total parameter count. No parameters, returns `None`.

**Output Example:**
```
Model Summary
============================================================
Optimizer: ADAM | LR: 0.001 | L2: 0.01
============================================================
Layer 0: DENSE - Input: 784, Output: 256, Params: 200960
Layer 1: DROPOUT
Layer 2: DENSE - Input: 256, Output: 10, Params: 2570
Total Parameters: 203530
============================================================
```

---

## Layer Types

All layer methods are bound to the `NeuralNet` class and return `None` (they mutate `self.layers`).

### Dense Layer

```python
model.add_dense(n_in, n_out, activation="relu", init_method="xavier_uniform")
```

Fully connected (linear) layer: `output = activation(x @ W^T + b)`

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `n_in` | *required* | Number of input features |
| `n_out` | *required* | Number of output features (neurons) |
| `activation` | `"relu"` | Activation function name (see [Activation Functions](#activation-functions)) |
| `init_method` | `"xavier_uniform"` | Weight initialization method (see [Weight Initialization](#weight-initialization)) |

**Layer Dictionary Keys:**
- `type`: `"dense"`
- `weights`: `(n_out, n_in)` ndarray
- `bias`: `(n_out,)` ndarray
- `activation`: activation string

**Example:**
```python
model.add_dense(784, 256, activation="relu", init_method="he_normal")
model.add_dense(256, 10, activation="softmax")
```
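
As a minimal sketch of the formula above, assuming the documented `(n_out, n_in)` weight layout, the forward computation of a ReLU dense layer can be written in plain NumPy. `dense_forward` is a hypothetical helper used only for illustration; it is not part of the Enilnets API.

```python
import numpy as np

def dense_forward(x, weights, bias):
    """Affine map followed by ReLU, using the (n_out, n_in) weight layout."""
    z = x @ weights.T + bias       # (batch, n_in) @ (n_in, n_out) -> (batch, n_out)
    return np.maximum(0.0, z)      # ReLU, chosen here only as an example

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 784))                   # batch of 4 samples
weights = rng.standard_normal((256, 784)) * 0.01    # illustrative values
bias = np.zeros(256)

out = dense_forward(x, weights, bias)
print(out.shape)  # (4, 256)
```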

---

### Sparse Layer

```python
model.add_sparse(n_in, n_out, connectivity=0.5, activation="relu", init_method="xavier_uniform")
```

Dense layer with random connectivity masking. Only a fraction of weights are non-zero.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `n_in` | *required* | Number of input features |
| `n_out` | *required* | Number of output features |
| `connectivity` | `0.5` | Fraction of weights to keep (0.0 to 1.0) |
| `activation` | `"relu"` | Activation function name |
| `init_method` | `"xavier_uniform"` | Weight initialization method |

**Layer Dictionary Keys:**
- `type`: `"sparse"`
- `weights`: `(n_out, n_in)` ndarray (masked)
- `bias`: `(n_out,)` ndarray
- `mask`: `(n_out, n_in)` boolean ndarray
- `activation`: activation string

**Note:** During backpropagation, gradients are masked by `layer["mask"]` to maintain sparsity.

**Example:**
```python
# 30% connectivity sparse layer
model.add_sparse(784, 256, connectivity=0.3, activation="relu")
```
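
The effect of the mask can be sketched in plain NumPy. How Enilnets draws the mask is not specified here, so the independent per-weight sampling below is an assumption, as are all variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, connectivity = 8, 4, 0.5

# Boolean mask: True where a connection is kept (assumed sampling scheme).
mask = rng.random((n_out, n_in)) < connectivity
weights = rng.standard_normal((n_out, n_in)) * mask   # pruned entries start at zero

# A made-up gradient; masking it keeps pruned weights at exactly zero.
grad_w = rng.standard_normal((n_out, n_in))
weights -= 0.1 * (grad_w * mask)

print(np.all(weights[~mask] == 0.0))  # True
```

Because the gradient is multiplied by the same mask on every step, a pruned connection can never become non-zero again.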

---

### Conv2D Layer

```python
model.add_conv2d(in_ch, out_ch, k, activation="relu", init_method="he_normal")
```

2D convolutional layer. Uses `im2col` for efficient convolution computation.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `in_ch` | *required* | Number of input channels |
| `out_ch` | *required* | Number of output channels (filters) |
| `k` | *required* | Kernel size (square kernel: k x k) |
| `activation` | `"relu"` | Activation function name |
| `init_method` | `"he_normal"` | Weight initialization method |

**Layer Dictionary Keys:**
- `type`: `"conv2d"`
- `weights`: `(out_ch, in_ch, k, k)` ndarray
- `bias`: `(out_ch,)` ndarray
- `in_ch`, `out_ch`, `k`: integers
- `activation`: activation string

**Notes:**
- Stride is fixed at 1, padding is fixed at 0.
- Output spatial dimensions: `(H - k + 1, W - k + 1)`
- Input must be 4D: `(batch, channels, height, width)`

**Example:**
```python
# 3x3 conv, 1 input channel -> 32 output channels
model.add_conv2d(1, 32, k=3, activation="relu")
# 3x3 conv, 32 input channels -> 64 output channels
model.add_conv2d(32, 64, k=3, activation="relu")
```
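
Since stride and padding are fixed, the spatial size after a stack of layers can be worked out by hand before choosing `n_in` for the first dense layer. The sketch below traces a 28 x 28 input through two conv/pool blocks; `conv2d_out` and `pool_out` are hypothetical helpers, and the pooling rule assumes non-overlapping windows with trailing rows and columns trimmed, as described in the MaxPool2D notes.

```python
def conv2d_out(size, k):
    """Spatial size after a k x k convolution with stride 1 and no padding."""
    return size - k + 1

def pool_out(size, p):
    """Spatial size after p x p pooling; leftover rows/columns are trimmed."""
    return size // p

size = 28                      # e.g. a 28 x 28 input image
size = conv2d_out(size, 3)     # 26
size = pool_out(size, 2)       # 13
size = conv2d_out(size, 3)     # 11
size = pool_out(size, 2)       # 5 (11 is trimmed to 10 before pooling)

flat_features = 64 * size * size   # 64 output channels in the last conv layer
print(flat_features)               # 1600
```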

---

### Flatten Layer

```python
model.add_flatten()
```

Flattens multi-dimensional input to 2D: `(batch, ...)` -> `(batch, -1)`.

**Parameters:** None

**Layer Dictionary Keys:**
- `type`: `"flatten"`

**Example:**
```python
model.add_conv2d(1, 32, k=3)
model.add_maxpool2d(2)
model.add_flatten()  # Flattens (B, 32, H, W) to (B, 32*H*W)
model.add_dense(32*13*13, 128)  # For 28x28 input: conv 3x3 -> 26x26, pool 2 -> 13x13
```

---

### MaxPool2D Layer

```python
model.add_maxpool2d(pool_size=2)
```

Max pooling with a square kernel. Divides each spatial dimension by `pool_size`.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `pool_size` | `2` | Size of pooling window (p x p) |

**Layer Dictionary Keys:**
- `type`: `"maxpool2d"`
- `p`: pool size integer

**Notes:**
- Input is trimmed to multiples of `pool_size` before pooling.
- Backpropagation routes gradient only to the max-valued position(s) in each window.

**Example:**
```python
model.add_conv2d(1, 32, k=3)
model.add_maxpool2d(pool_size=2)  # Halves spatial dimensions
```
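
The two notes above (trimming, and routing gradient to the max positions) can be sketched with a reshape trick in NumPy. This is an illustrative version under the assumption of non-overlapping windows; the function names are hypothetical and this is not the library's own `maxpool2d_backward`.

```python
import numpy as np

def split_windows(x, p):
    """Trim (N, C, H, W) to multiples of p and expose each p x p window."""
    n, c, h, w = x.shape
    h_out, w_out = h // p, w // p
    trimmed = x[:, :, :h_out * p, :w_out * p]
    return trimmed.reshape(n, c, h_out, p, w_out, p)

def maxpool2d_forward_sketch(x, p):
    return split_windows(x, p).max(axis=(3, 5))

def maxpool2d_backward_sketch(d_out, x, p):
    windows = split_windows(x, p)
    # True at every position that equals its window maximum (ties included).
    is_max = windows == windows.max(axis=(3, 5), keepdims=True)
    d_windows = is_max * d_out[:, :, :, None, :, None]
    n, c, h_out, _, w_out, _ = windows.shape
    dx = np.zeros_like(x)          # trimmed rows/columns receive zero gradient
    dx[:, :, :h_out * p, :w_out * p] = d_windows.reshape(n, c, h_out * p, w_out * p)
    return dx

x = np.arange(25, dtype=float).reshape(1, 1, 5, 5)   # 5 x 5 is trimmed to 4 x 4
out = maxpool2d_forward_sketch(x, 2)
dx = maxpool2d_backward_sketch(np.ones_like(out), x, 2)
print(out[0, 0].tolist())   # [[6.0, 8.0], [16.0, 18.0]]
print(dx.sum())             # 4.0
```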

---

### AvgPool2D Layer

```python
model.add_avgpool2d(pool_size=2)
```

Average pooling with square kernel.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `pool_size` | `2` | Size of pooling window (p x p) |

**Layer Dictionary Keys:**
- `type`: `"avgpool2d"`
- `p`: pool size integer

**Notes:**
- Backpropagation distributes gradient equally across all positions in each window.

**Example:**
```python
model.add_conv2d(1, 32, k=3)
model.add_avgpool2d(pool_size=2)
```
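
The equal distribution of gradient can be sketched as follows. This is a simplified illustration that assumes the input height and width are exact multiples of `p`; `avgpool2d_backward_sketch` is a hypothetical name, not the library function.

```python
import numpy as np

def avgpool2d_backward_sketch(d_out, p):
    """Spread each upstream gradient equally over its p x p window."""
    # Every input position contributed 1 / (p * p) of its window's average.
    upsampled = np.repeat(np.repeat(d_out, p, axis=2), p, axis=3)
    return upsampled / (p * p)

d_out = np.array([[[[4.0, 8.0]]]])          # shape (1, 1, 1, 2)
dx = avgpool2d_backward_sketch(d_out, 2)    # shape (1, 1, 2, 4)
print(dx[0, 0].tolist())  # [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0]]
```

The total gradient is preserved: the entries of `dx` sum to the same value as the entries of `d_out`.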

---

### BatchNorm Layer

```python
model.add_batchnorm(num_features, epsilon=1e-5, momentum=0.1)
```

Batch normalization layer. Normalizes across the batch dimension.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `num_features` | *required* | Number of features (must match flattened input dimension) |
| `epsilon` | `1e-5` | Small constant for numerical stability |
| `momentum` | `0.1` | Running statistics update momentum (0.1 means 90% old, 10% new) |

**Layer Dictionary Keys:**
- `type`: `"batchnorm"`
- `num_features`: integer
- `epsilon`, `momentum`: floats
- `running_mean`, `running_var`: running statistics
- `gamma`: scale parameter, initialized to 1
- `beta`: shift parameter, initialized to 0

**Notes:**
- Input is flattened to 2D, normalized, then reshaped back.
- During training, uses batch statistics and updates running statistics.
- During inference (`training=False`), uses running statistics.

**Example:**
```python
model.add_dense(256, 128)
model.add_batchnorm(128)  # Must match output of previous layer
model.add_dense(128, 10, activation="softmax")
```
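
The difference between training and inference mode can be sketched on a 2D input. The dictionary keys follow the list above; the initial running variance of 1 and the function name `batchnorm_forward_sketch` are assumptions made for this example.

```python
import numpy as np

def batchnorm_forward_sketch(x, layer, training):
    """Normalize a (batch, features) array; update running stats when training."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
        m = layer["momentum"]   # 0.1 means 90% old value, 10% new value
        layer["running_mean"] = (1 - m) * layer["running_mean"] + m * mean
        layer["running_var"] = (1 - m) * layer["running_var"] + m * var
    else:
        mean, var = layer["running_mean"], layer["running_var"]
    x_hat = (x - mean) / np.sqrt(var + layer["epsilon"])
    return layer["gamma"] * x_hat + layer["beta"]

layer = {
    "epsilon": 1e-5, "momentum": 0.1,
    "running_mean": np.zeros(3), "running_var": np.ones(3),
    "gamma": np.ones(3), "beta": np.zeros(3),
}
x = np.random.default_rng(0).standard_normal((32, 3)) * 5.0 + 2.0

y_train = batchnorm_forward_sketch(x, layer, training=True)
y_eval = batchnorm_forward_sketch(x, layer, training=False)
print(np.allclose(y_train.mean(axis=0), 0.0, atol=1e-6))  # True
```

After a single training step the running statistics are still close to their initial values, so `y_eval` is not yet well normalized; it converges toward `y_train` as more batches are seen.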

---

### Dropout Layer

```python
model.add_dropout(rate=0.5)
```

Randomly zeroes elements during training with probability `rate`.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `rate` | `0.5` | Dropout probability (0.0 = no dropout, 1.0 = drop everything) |

**Layer Dictionary Keys:**
- `type`: `"dropout"`
- `rate`: float
- `mask`: cached mask during training (set during forward pass)

**Notes:**
- Active only during training: when `Forward()` is called with `training=True`, or within `TrainBatch()`.
- Scales surviving activations by `1/(1-rate)` (inverted dropout).

**Example:**
```python
model.add_dense(256, 128, activation="relu")
model.add_dropout(0.3)  # 30% dropout
model.add_dense(128, 10, activation="softmax")
```
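
Inverted dropout, as described in the notes, can be sketched in a few lines. `dropout_forward_sketch` is a hypothetical helper, and the way the mask is sampled is an assumption.

```python
import numpy as np

def dropout_forward_sketch(x, rate, rng):
    """Inverted dropout: zero with probability `rate`, rescale the survivors."""
    mask = rng.random(x.shape) >= rate       # True where the unit is kept
    return x * mask / (1.0 - rate), mask

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out, mask = dropout_forward_sketch(x, rate=0.3, rng=rng)

# Survivors are scaled to 1 / 0.7, so the expected activation is unchanged.
print(abs(out.mean() - 1.0) < 0.02)  # True
```

Rescaling at training time is what lets inference skip the layer entirely: no correction factor is needed when `training=False`.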

---

## Forward Pass

### Forward / Predict

```python
output = model.Forward(inputs, training=False, dropout_rate=0.0)
output = model.predict(inputs)  # Alias for Forward
```

Runs the forward pass through all layers.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `inputs` | *required* | Input array. Can be 1D `(features,)`, 2D `(batch, features)`, 3D `(channels, height, width)`, or 4D `(batch, channels, height, width)` |
| `training` | `False` | Whether to enable training-specific behavior (dropout, batch norm updates) |
| `dropout_rate` | `0.0` | Fallback dropout rate if layer doesn't specify one |

**Returns:**
- `output`: ndarray of shape matching the last layer's output

**Side Effects:**
- Populates `self.outputs` (layer-by-layer activations)
- Populates `self.pre_activations` (pre-activation values for dense/conv layers)
- Populates `self.batchnorm_cache` (batch norm statistics)
- Sets `layer["mask"]` for dropout layers during training

**Example:**
```python
# Inference
pred = model.Forward(X_test, training=False)

# Training (enables dropout and batch norm updates)
pred = model.Forward(X_batch, training=True)
```

---

## Backward Pass

### Backward

```python
model.Backward(targets)
```

Computes gradients via backpropagation and stores them in `self.deltas`.

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `targets` | ndarray | Target values. Shape: `(batch, output_dim)` or `(output_dim,)` for single sample |

**Side Effects:**
- Populates `self.deltas` with gradients for each layer
- For batch norm layers, stores `d_gamma` and `d_beta` in the layer dict

**Gradient Computation:**
- Output layer: If softmax activation, uses `delta = (output - targets) / batch_size`
- Otherwise: `delta = (output - targets) * activation_derivative(pre_activation) / batch_size`
- Hidden layers: Propagates error backward through weights, then multiplies by activation derivative

**Example:**
```python
model.Forward(X_batch, training=True)
model.Backward(Y_batch)
model.update()  # Apply gradients
```
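
The softmax shortcut above can be checked numerically. The sketch below assumes the loss paired with softmax is mean categorical cross-entropy and compares the analytic delta with central finite differences; every function name here is local to the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(z, targets):
    return -np.sum(targets * np.log(softmax(z))) / z.shape[0]

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 3))               # pre-activations (logits)
targets = np.eye(3)[rng.integers(0, 3, 5)]    # one-hot labels

# Analytic delta from the softmax rule: (output - targets) / batch_size.
delta = (softmax(z) - targets) / z.shape[0]

# Central finite differences of the loss with respect to each logit.
eps = 1e-6
numeric = np.zeros_like(z)
for idx in np.ndindex(*z.shape):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (mean_cross_entropy(z_plus, targets)
                    - mean_cross_entropy(z_minus, targets)) / (2 * eps)

print(np.allclose(delta, numeric, atol=1e-6))  # True
```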

---

## Optimizers

### Update

```python
model.update()
```

Applies computed gradients to update all layer parameters. No parameters, no return value.

**Supported Optimizers:**

| Optimizer | Description | Hyperparameters Used |
|-----------|-------------|---------------------|
| `"sgd"` | Stochastic Gradient Descent with momentum | `learning_rate`, `momentum` |
| `"rmsprop"` | RMSProp adaptive learning | `learning_rate` |
| `"adagrad"` | AdaGrad adaptive learning | `learning_rate` |
| `"adam"` | Adam (default) | `learning_rate`, `t` (timestep) |

**Adam Configuration (fixed):**
- `beta1 = 0.9` (first moment decay)
- `beta2 = 0.999` (second moment decay)
- `epsilon = 1e-8` (numerical stability)
- Bias correction applied: `m / (1 - beta1^t)`, `v / (1 - beta2^t)`

**L2 Regularization:**
- Applied to all dense, sparse, and conv2d weights: `grad_w += l2_lambda * weights`
- Not applied to biases or batch norm parameters

**Sparse Layer Handling:**
- Gradients are masked by `layer["mask"]` before update

**Example:**
```python
model = NeuralNet(optimizer="adam", learning_rate=0.001)
# ... forward and backward ...
model.update()
```
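
For reference, one Adam step with the fixed constants listed above, plus the L2 term added to the weight gradient, might look like the sketch below. `adam_step` and the toy objective are hypothetical; this is the textbook update rule, not a copy of the library's internals.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
l2_lambda = 0.01

for t in range(1, 4):
    grad = 2 * w                  # gradient of the toy objective sum(w ** 2)
    grad = grad + l2_lambda * w   # L2 regularization on weights
    w, m, v = adam_step(w, grad, m, v, t)

print(np.round(w, 3).tolist())  # [0.997, -1.997]
```

With a gradient of nearly constant sign, each step moves a weight by roughly `lr`, independent of the gradient's magnitude.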

---

## Loss Functions

### ComputeLoss

```python
loss = model.ComputeLoss(output, target, function="mse", reduction="mean", **kwargs)
```

Computes the loss between predictions and targets.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `output` | *required* | Model predictions |
| `target` | *required* | Ground truth values |
| `function` | `"mse"` | Loss function name (see below) |
| `reduction` | `"mean"` | `"mean"`, `"sum"`, or `"none"` (returns per-element loss) |
| `**kwargs` | | Additional arguments for specific loss functions |

**Available Loss Functions:**

| Function | Description | Extra Args | Formula |
|----------|-------------|------------|---------|
| `"mse"` | Mean Squared Error | none | `(o - t)^2` |
| `"mae"` | Mean Absolute Error | none | `abs(o - t)` |
| `"huber"` | Huber Loss | `delta=1.0` | With `diff = abs(o - t)`: `0.5*diff^2` if `diff < delta`, else `delta*(diff - 0.5*delta)` |
| `"smooth_l1"` | Smooth L1 Loss | none | With `diff = abs(o - t)`: `0.5*diff^2` if `diff < 1`, else `diff - 0.5` |
| `"binary_cross_entropy"` | Binary Cross-Entropy | none | `-t*log(o) - (1-t)*log(1-o)` |
| `"cross_entropy"` / `"categorical_cross_entropy"` | Categorical Cross-Entropy | none | `-t*log(o)` |
| `"focal"` | Focal Loss | `alpha=0.25`, `gamma=2.0` | Down-weights easy examples |
| `"hinge"` | Hinge Loss (SVM) | none | `max(0, 1 - t*o)` |

**Notes:**
- For cross-entropy losses, outputs are clipped to `[1e-12, 1.0]` to prevent log(0).
- For binary cross-entropy, outputs are clipped to `[1e-12, 1 - 1e-12]`.

**Example:**
```python
# MSE loss
loss = model.ComputeLoss(pred, target, "mse", "mean")

# Focal loss for imbalanced classification
loss = model.ComputeLoss(pred, target, "focal", alpha=0.25, gamma=2.0)

# Per-sample losses (no reduction)
losses = model.ComputeLoss(pred, target, "cross_entropy", "none")
```
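
As an illustration of how a table formula and the `reduction` argument fit together, here is a standalone sketch of the Huber loss, taking `diff` as the absolute error. `huber_loss` is a hypothetical function written for this example, not the library's implementation.

```python
import numpy as np

def huber_loss(output, target, delta=1.0, reduction="mean"):
    """Quadratic for small errors, linear for large ones."""
    diff = np.abs(output - target)
    per_element = np.where(diff < delta,
                           0.5 * diff ** 2,
                           delta * (diff - 0.5 * delta))
    if reduction == "mean":
        return per_element.mean()
    if reduction == "sum":
        return per_element.sum()
    return per_element            # reduction == "none"

output = np.array([0.5, 3.0])
target = np.array([0.0, 0.0])
print(huber_loss(output, target, reduction="none").tolist())  # [0.125, 2.5]
print(huber_loss(output, target))                             # 1.3125
```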

---

## Training

### TrainBatch

```python
loss = model.TrainBatch(xs, ys, loss_function=None, **loss_kwargs)
```

Trains on a single batch (forward + backward + update).

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `xs` | *required* | Input batch |
| `ys` | *required* | Target batch |
| `loss_function` | `None` | Loss function name. Auto-detected: `"cross_entropy"` if last layer uses `"softmax"`, else `"mse"` |
| `**loss_kwargs` | | Extra arguments passed to `ComputeLoss` |

**Returns:**
- `loss`: float, the computed loss value

**Example:**
```python
loss = model.TrainBatch(X_batch, Y_batch, loss_function="cross_entropy")
```

---

### Train

```python
history = model.Train(X_train, Y_train, epochs=10, batch_size=32, 
                      X_val=None, Y_val=None, loss_function=None, 
                      verbose=True, **loss_kwargs)
```

Full training loop with batching, optional validation, and history tracking.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `X_train` | *required* | Training inputs |
| `Y_train` | *required* | Training targets |
| `epochs` | `10` | Number of training epochs |
| `batch_size` | `32` | Batch size |
| `X_val` | `None` | Validation inputs (optional) |
| `Y_val` | `None` | Validation targets (optional) |
| `loss_function` | `None` | Loss function name (auto-detected if None) |
| `verbose` | `True` | Print progress per epoch |
| `**loss_kwargs` | | Extra arguments for `ComputeLoss` |

**Returns:**
- `history`: dict with keys `"loss"`, `"accuracy"`, `"val_loss"`, `"val_accuracy"` (lists of floats)

**Behavior:**
- Shuffles data each epoch
- Computes average loss and accuracy per epoch
- If validation data provided, computes validation metrics after each epoch
- Accuracy: multi-class uses `argmax`, binary uses `> 0.5` threshold

**Example:**
```python
history = model.Train(
    X_train, Y_train,
    epochs=20, batch_size=64,
    X_val=X_val, Y_val=Y_val,
    loss_function="cross_entropy",
    verbose=True
)

# Plot training history
import matplotlib.pyplot as plt
plt.plot(history["loss"], label="train")
plt.plot(history["val_loss"], label="val")
plt.legend()
plt.show()
```
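
The shuffling and batching behavior can be sketched with a small generator. Whether Enilnets keeps or drops a final partial batch is not stated above; this sketch keeps it, and `iterate_minibatches` is a hypothetical helper.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (inputs, targets) batches covering the data once."""
    order = rng.permutation(len(X))          # new random order each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]                 # same indices keep pairs aligned

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)
Y = X * 2
batches = list(iterate_minibatches(X, Y, batch_size=4, rng=rng))
print([len(xb) for xb, _ in batches])  # [4, 4, 2]
```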

---

### compute_accuracy

```python
acc = model.compute_accuracy(predictions, targets)
```

Standalone accuracy computation.

**Parameters:**

| Parameter | Description |
|-----------|-------------|
| `predictions` | Model output array |
| `targets` | Ground truth array |

**Returns:**
- `acc`: float, proportion of correct predictions (0.0 to 1.0)

**Logic:**
- If last dimension > 1: multi-class, uses `argmax`
- If last dimension == 1: binary, uses `> 0.5` threshold

**Example:**
```python
acc = model.compute_accuracy(model.Forward(X_test), Y_test)
print(f"Accuracy: {acc:.4f}")
```
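
The two branches of the logic above can be sketched as follows, assuming one-hot targets in the multi-class case and 0/1 targets in the binary case. `accuracy_sketch` is a hypothetical stand-in for illustration.

```python
import numpy as np

def accuracy_sketch(predictions, targets):
    if predictions.shape[-1] > 1:
        # Multi-class: compare predicted and true class indices.
        correct = predictions.argmax(axis=-1) == targets.argmax(axis=-1)
    else:
        # Binary: threshold the single output at 0.5.
        correct = (predictions > 0.5) == (targets > 0.5)
    return float(np.mean(correct))

multi_pred = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
multi_true = np.array([[0, 1], [1, 0], [1, 0]])
multi_acc = accuracy_sketch(multi_pred, multi_true)     # 2 of 3 correct

binary_pred = np.array([[0.9], [0.2], [0.6], [0.4]])
binary_true = np.array([[1], [0], [0], [0]])
binary_acc = accuracy_sketch(binary_pred, binary_true)
print(binary_acc)  # 0.75
```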

---

## Activation Functions

Activation functions are defined in `Enilnets.activations`. Both `activate()` and `derivative()` are available as standalone functions, but are typically used internally by the library.

### Standalone Functions

```python
from Enilnets.activations import activate, derivative

out = activate("relu", x)
grad = derivative("relu", x)
```

**Available Activations:**

| Name | Function `activate(x)` | Derivative `derivative(x)` | Notes |
|------|------------------------|---------------------------|-------|
| `"relu"` | `max(0, x)` | `1` if `x > 0`, else `0` | Default for hidden layers |
| `"leakyrelu"` | `x` if `x > 0`, else `0.01*x` | `1` if `x > 0`, else `0.01` | Negative slope 0.01 |
| `"elu"` | `x` if `x > 0`, else `exp(x) - 1` | `1` if `x > 0`, else `exp(x)` | |
| `"selu"` | `scale * (x if x>0 else alpha*(exp(x)-1))` | `scale * (1 if x>0 else alpha*exp(x))` | alpha=1.673, scale=1.051 |
| `"gelu"` | `0.5*x*(1+tanh(sqrt(2/pi)*(x+0.044715*x^3)))` | `cdf + x*pdf` | Gaussian Error Linear Unit |
| `"swish"` | `x * sigmoid(x)` | `s + x*s*(1-s)` | Self-gated |
| `"sigmoid"` | `1/(1+exp(-x))` | `sigmoid(x)*(1-sigmoid(x))` | Input clipped to `[-500, 500]` |
| `"tanh"` | `tanh(x)` | `1 - tanh(x)^2` | |
| `"softmax"` | `exp(x - max(x)) / sum(exp(x - max(x)))` | N/A (handled specially in backprop) | Output layer only |
| `"linear"` | `x` | `1` | Default if no activation specified |

**Usage in Layers:**
```python
model.add_dense(256, 128, activation="gelu")
model.add_dense(128, 10, activation="softmax")
```
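
A derivative entry in the table can be sanity-checked against finite differences. The sketch below does this for swish using standalone NumPy definitions written for this example; it does not call `Enilnets.activations`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def swish(x):
    return x * sigmoid(x)

def swish_derivative(x):
    s = sigmoid(x)
    return s + x * s * (1 - s)

x = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (swish(x + eps) - swish(x - eps)) / (2 * eps)   # central difference
print(np.allclose(swish_derivative(x), numeric, atol=1e-8))  # True
```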

---

## Weight Initialization

Weight initialization functions are defined in `Enilnets.weight_init`. Used internally by layer constructors.

### Standalone Functions

```python
from Enilnets.weight_init import init_weights, init_conv_weights

w, b = init_weights(n_in, n_out, method="xavier_uniform")
w, b = init_conv_weights(in_ch, out_ch, k, method="he_normal")
```

**Available Methods:**

| Method | Dense Formula | Conv Formula | Best For |
|--------|--------------|--------------|----------|
| `"xavier_uniform"` | `U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))` | `U(-sqrt(6/(in_ch*k*k+out_ch)), ...)` | Sigmoid/tanh |
| `"xavier_normal"` | `N(0, sqrt(2/(n_in+n_out)))` | `N(0, sqrt(2/(in_ch*k*k+out_ch)))` | Sigmoid/tanh |
| `"he_uniform"` | `U(-sqrt(6/n_in), sqrt(6/n_in))` | `U(-sqrt(6/(in_ch*k*k)), ...)` | ReLU variants |
| `"he_normal"` | `N(0, sqrt(2/n_in))` | `N(0, sqrt(2/(in_ch*k*k)))` | ReLU variants (default for conv) |
| `"normal"` | `N(0, 0.1)` | `N(0, 0.1)` | General purpose |
| `"orthogonal"` | SVD-based orthogonal init | SVD-based, reshaped | RNNs, deep nets |

**Biases:**
- All methods initialize biases to zeros.
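
Two of the dense formulas from the table can be sketched directly in NumPy. The functions below are illustrative stand-ins with hypothetical names; the `(n_out, n_in)` weight shape follows the Dense Layer section.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))
    weights = rng.uniform(-limit, limit, size=(n_out, n_in))
    return weights, np.zeros(n_out)          # biases start at zero

def he_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / n_in)
    weights = rng.normal(0.0, std, size=(n_out, n_in))
    return weights, np.zeros(n_out)

rng = np.random.default_rng(0)
w, b = xavier_uniform(784, 256, rng)
print(w.shape, b.shape)                                # (256, 784) (256,)
print(np.abs(w).max() <= np.sqrt(6.0 / (784 + 256)))   # True

w_he, b_he = he_normal(784, 256, rng)
```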

---

## Reinforcement Learning

### Reinforce

```python
best_score = model.Reinforce(inputs, score_fn, noise=0.05, tries=10, sigma=1.0)
```

Evolutionary strategy that perturbs weights and keeps the best performing variant.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `inputs` | *required* | Input to evaluate on |
| `score_fn` | *required* | Callable that takes model output and returns a scalar score (higher = better) |
| `noise` | `0.05` | Standard deviation multiplier for weight perturbations |
| `tries` | `10` | Number of candidate networks to try |
| `sigma` | `1.0` | Base standard deviation for perturbations |

**Returns:**
- `best_score`: float, the highest score achieved

**Side Effects:**
- Mutates `self.layers` to the best performing configuration
- Original weights are lost unless saved beforehand

**Algorithm:**
1. Evaluate baseline score with current weights
2. For each try: add Gaussian noise to all weights/biases, evaluate score
3. Keep the candidate with highest score
4. Restore best weights to model

**Example:**
```python
def score_fn(output):
    # Higher score = better policy
    return np.mean(output[:, 0])  # Example: maximize first output dimension

model.Reinforce(state_input, score_fn, noise=0.1, tries=20)
```
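
The four algorithm steps can be sketched on a plain list of parameter arrays. Two details are assumptions made for this example: the perturbation standard deviation is taken as `noise * sigma`, and every candidate is perturbed from the original parameters rather than from the current best. `perturb_and_select` is a hypothetical function, not the `Reinforce` implementation.

```python
import numpy as np

def perturb_and_select(params, evaluate, noise=0.05, tries=10, sigma=1.0, rng=None):
    """Random-perturbation search: keep whichever parameter set scores highest."""
    if rng is None:
        rng = np.random.default_rng()
    best_params = [p.copy() for p in params]
    best_score = evaluate(best_params)                    # step 1: baseline
    for _ in range(tries):
        candidate = [p + rng.normal(0.0, noise * sigma, size=p.shape)
                     for p in params]                     # step 2: add noise
        score = evaluate(candidate)
        if score > best_score:                            # step 3: keep the best
            best_params, best_score = candidate, score
    return best_params, best_score                        # step 4: caller restores

def evaluate(ps):
    # Toy score: higher when the parameters are closer to 1.
    return -float(np.sum((ps[0] - 1.0) ** 2))

params = [np.zeros(2)]
baseline = evaluate(params)
best_params, best_score = perturb_and_select(
    params, evaluate, noise=0.5, tries=20, rng=np.random.default_rng(0))
print(best_score >= baseline)  # True
```

Because the baseline is evaluated first, the returned score can never be worse than the starting point.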

---

## Model I/O

### Save

```python
model.Save(file)
```

Saves model architecture, weights, and training state to disk.

**Parameters:**

| Parameter | Description |
|-----------|-------------|
| `file` | File path. Extension determines format: `.pkl` for pickle, anything else for JSON |

**Saved Data:**
- Layer configurations and weights
- Optimizer type and hyperparameters
- Training timestep `t`

**Example:**
```python
model.Save("model.pkl")      # Binary pickle format
model.Save("model.json")     # Human-readable JSON format
```

---

### Load

```python
model.Load(file)
```

Restores model from saved file.

**Parameters:**

| Parameter | Description |
|-----------|-------------|
| `file` | File path (`.pkl` or `.json`) |

**Restored Data:**
- All layer weights and configurations
- Optimizer settings
- Resets `opt_state` (optimizer momentum buffers are cleared)

**Example:**
```python
model = NeuralNet()
model.Load("model.pkl")
predictions = model.Forward(X_test)
```

---

## Utility Functions

### im2col

```python
from Enilnets.forward import im2col

col = im2col(input_data, filter_h, filter_w, stride=1, pad=0)
```

Converts image batches to column format for efficient convolution.

**Parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `input_data` | *required* | 4D array `(N, C, H, W)` |
| `filter_h` | *required* | Filter height |
| `filter_w` | *required* | Filter width |
| `stride` | `1` | Stride |
| `pad` | `0` | Zero padding |

**Returns:**
- `col`: 2D array `(N * out_h * out_w, C * filter_h * filter_w)`

**Example:**
```python
col = im2col(images, 3, 3, stride=1, pad=1)
# Now matrix multiplication can replace convolution
```
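
To make the column layout concrete, here is a common loop-based formulation of `im2col`, followed by a convolution expressed as a single matrix product. It is a sketch that matches the documented signature and return shape; `im2col_sketch` is a hypothetical name and the library's own implementation may differ.

```python
import numpy as np

def im2col_sketch(input_data, filter_h, filter_w, stride=1, pad=0):
    n, c, h, w = input_data.shape
    out_h = (h + 2 * pad - filter_h) // stride + 1
    out_w = (w + 2 * pad - filter_w) // stride + 1
    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)])
    col = np.zeros((n, c, filter_h, filter_w, out_h, out_w))
    for y in range(filter_h):
        y_max = y + stride * out_h
        for x in range(filter_w):
            x_max = x + stride * out_w
            # All input pixels that filter tap (y, x) touches, for every window.
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    # Rows: one per output position; columns: flattened (C, fh, fw) patch.
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(n * out_h * out_w, -1)

rng = np.random.default_rng(0)
images = rng.standard_normal((2, 3, 5, 5))
weights = rng.standard_normal((4, 3, 3, 3))        # (out_ch, in_ch, k, k)

col = im2col_sketch(images, 3, 3)                  # (2*3*3, 3*3*3)
out = (col @ weights.reshape(4, -1).T).reshape(2, 3, 3, 4).transpose(0, 3, 1, 2)

# Spot-check one output value against the direct sliding-window definition.
direct = np.sum(images[0, :, 1:4, 2:5] * weights[1])
print(col.shape, np.isclose(out[0, 1, 1, 2], direct))  # (18, 27) True
```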

---

## Complete Example: MNIST Classifier

```python
from Enilnets import NeuralNet
import numpy as np

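# Placeholder data with MNIST-like shapes; swap in a real MNIST loader.
# Full MNIST: X_train (60000, 1, 28, 28), Y_train (60000, 10) one-hot,
#             X_test (10000, 1, 28, 28), Y_test (10000, 10)
X_train = np.random.rand(512, 1, 28, 28)
Y_train = np.eye(10)[np.random.randint(0, 10, 512)]
X_test = np.random.rand(128, 1, 28, 28)
Y_test = np.eye(10)[np.random.randint(0, 10, 128)]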

# Build model
model = NeuralNet(learning_rate=0.001, optimizer="adam", l2_lambda=0.0001)

# Conv block 1
model.add_conv2d(1, 32, k=3, activation="relu", init_method="he_normal")
model.add_maxpool2d(2)

# Conv block 2
model.add_conv2d(32, 64, k=3, activation="relu", init_method="he_normal")
model.add_maxpool2d(2)

# Classifier
model.add_flatten()
# Spatial size: 28 -> 26 (conv) -> 13 (pool) -> 11 (conv) -> 5 (pool, 11 trimmed to 10)
model.add_dense(64 * 5 * 5, 256, activation="relu")
model.add_dropout(0.5)
model.add_dense(256, 10, activation="softmax")

# Print summary
model.summary()

# Train
history = model.Train(
    X_train, Y_train,
    epochs=10, batch_size=128,
    X_val=X_test, Y_val=Y_test,
    loss_function="cross_entropy",
    verbose=True
)

# Save
model.Save("mnist_model.pkl")

# Load and predict later
model2 = NeuralNet()
model2.Load("mnist_model.pkl")
predictions = model2.Forward(X_test)
```

---

## Architecture Notes

### Data Format

The library uses **channels-first** format for convolutions:
- 4D input: `(batch, channels, height, width)`
- This matches the PyTorch convention, not TensorFlow's default channels-last layout.

### Backward Pass Details

The backward pass handles layer transitions automatically:

| From Layer | To Layer | Error Propagation |
|------------|----------|-------------------|
| Dense/Sparse | Dense/Sparse | `np.dot(delta, W)` with `W` of shape `(n_out, n_in)` |
| Conv2D | Conv2D | `conv2d_backward_input` (full convolution with flipped weights) |
| Flatten | Dense | `reshape` to match output shape |
| MaxPool2D | Any | Routes to max positions only |
| AvgPool2D | Any | Distributes equally |
| Dropout | Any | Scales by `mask / (1 - rate)` |
| BatchNorm | Any | Backprop through normalization + gamma/beta gradients |

### Numerical Stability

- Sigmoid clips inputs to `[-500, 500]` to prevent overflow
- Cross-entropy clips probabilities to `[1e-12, 1.0]`
- BatchNorm uses `epsilon=1e-5` for division stability
- All computations use `float64` dtype

---

## API Reference Summary

### NeuralNet Methods

| Method | Description |
|--------|-------------|
| `__init__(learning_rate, optimizer, l2_lambda, momentum)` | Constructor |
| `summary()` | Print architecture |
| `add_dense(n_in, n_out, ...)` | Add dense layer |
| `add_sparse(n_in, n_out, ...)` | Add sparse layer |
| `add_conv2d(in_ch, out_ch, k, ...)` | Add conv layer |
| `add_flatten()` | Add flatten layer |
| `add_maxpool2d(pool_size)` | Add max pool |
| `add_avgpool2d(pool_size)` | Add avg pool |
| `add_batchnorm(num_features, ...)` | Add batch norm |
| `add_dropout(rate)` | Add dropout |
| `Forward(inputs, training, dropout_rate)` | Forward pass |
| `predict(inputs)` | Alias for Forward |
| `Backward(targets)` | Backpropagation |
| `update()` | Apply gradients |
| `TrainBatch(xs, ys, ...)` | Train one batch |
| `Train(X_train, Y_train, epochs, ...)` | Full training loop |
| `ComputeLoss(output, target, ...)` | Compute loss |
| `compute_accuracy(predictions, targets)` | Compute accuracy |
| `Reinforce(inputs, score_fn, ...)` | Evolutionary optimization |
| `Save(file)` | Save model |
| `Load(file)` | Load model |

### Standalone Functions

| Function | Module | Description |
|----------|--------|-------------|
| `activate(name, x)` | `Enilnets.activations` | Apply activation |
| `derivative(name, x)` | `Enilnets.activations` | Activation derivative |
| `init_weights(n_in, n_out, method)` | `Enilnets.weight_init` | Init dense weights |
| `init_conv_weights(in_ch, out_ch, k, method)` | `Enilnets.weight_init` | Init conv weights |
| `im2col(input_data, filter_h, filter_w, stride, pad)` | `Enilnets.forward` | Image to columns |
| `batchnorm_forward(x, layer, training)` | `Enilnets.forward` | Batch norm forward |
| `maxpool2d_backward(d, x, p)` | `Enilnets.backward` | Max pool backprop |
| `avgpool2d_backward(d, x, p)` | `Enilnets.backward` | Avg pool backprop |
| `batchnorm_backward(dout, cache)` | `Enilnets.backward` | Batch norm backprop |
| `conv2d_backward_input(d, w, shape)` | `Enilnets.backward` | Conv input gradient |
