🧙
THIS WILL OVERWRITE YOUR EXISTING CONFIGURATION!
What kind of model are you training?

This determines which model architecture to use.

Which model variant do you want to train?

What type of training do you want to perform?

Choose between LoRA (parameter-efficient) or full model training.

LoRA (Recommended)

Low-Rank Adaptation - Efficient, faster, smaller output files

Full Model

Train entire model - Higher quality, requires more resources

Lower ranks produce smaller LoRA files with faster training, higher ranks capture more detail.
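To make the rank trade-off concrete: for one linear layer with d_in inputs and d_out outputs, LoRA learns two small matrices of shapes (rank, d_in) and (d_out, rank), so the adapter size grows linearly with rank. The sketch below is illustrative only; `lora_param_count` is a hypothetical helper and the layer dimensions are made up, not taken from any specific model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_in -> d_out linear layer.

    LoRA learns A with shape (rank, d_in) and B with shape (d_out, rank),
    so the adapter holds rank * (d_in + d_out) values.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


# Illustrative layer size only; real models contain many such layers.
D_IN, D_OUT = 3072, 3072
full_layer = D_IN * D_OUT  # weights in the frozen base layer (bias ignored)

for rank in (4, 16, 64):
    added = lora_param_count(D_IN, D_OUT, rank)
    print(f"rank={rank:>3}: {added:>7} adapter params "
          f"({added / full_layer:.2%} of the base layer)")
```

Doubling the rank doubles the adapter parameters for every adapted layer, which is why higher ranks produce larger files and slower training.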
Quantization

Quantise the base model or text encoders to adjust memory usage. Text encoder quantisation only affects the precaching step—the encoder is unloaded before training—so reserve it for extremely constrained setups or when you need the model to honour quantised encoder outputs. Defaults apply automatically for most runs.

Quantization options are not available for this model configuration.
Acceleration Strategy

Choose how SimpleTuner should manage large-model training. You can enable grouped CPU offloading, Hugging Face Accelerate + DeepSpeed, or PyTorch FSDP2 sharding. Hardware > Accelerate retains all advanced switches for later tweaks.

Keep the default single-process Accelerate launch. You can still enable CPU offload or sharding later from Hardware > Accelerate if your run requires it.
Group Offload Options

Diffusers’ group offload moves configured module groups to CPU (or disk) between forward passes, freeing VRAM on small GPUs. SimpleTuner wires the correct CLI flags automatically.

Block level keeps multiple layers together for higher throughput; leaf level maximises memory savings.
Only used with block-level grouping. Higher values reduce transfers but require more VRAM.
Leave blank to keep tensors in host RAM. Provide a fast NVMe directory when memory is extremely tight.
DeepSpeed Configuration

Configure 🤗 Accelerate’s DeepSpeed integration. Choose the ZeRO stage and optional offload targets. Advanced edits remain available from Hardware > Accelerate after closing the wizard.

NVMe requires a fast disk path; CPU keeps tensors in host memory.
Optimizer offload reduces GPU memory at the cost of host or disk bandwidth.
Matches --offload_param_path. Leave blank to require a path before training starts.

The wizard keeps this JSON in sync. Advanced edits can be made directly from the Hardware tab after closing the wizard.
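For orientation, the ZeRO stage and offload targets above correspond to the `zero_optimization` section of a DeepSpeed config file. The sketch below builds such a section using DeepSpeed's documented key names (`stage`, `offload_optimizer`, `offload_param`, `device`, `nvme_path`). The helper `build_zero_config` is hypothetical, and whether the wizard writes exactly this structure is an assumption.

```python
import json
from typing import Optional


def build_zero_config(stage: int,
                      offload_optimizer: str = "none",
                      offload_param: str = "none",
                      nvme_path: Optional[str] = None) -> dict:
    """Build a minimal DeepSpeed `zero_optimization` section.

    Key names follow DeepSpeed's documented config schema; the exact JSON
    produced by the wizard is not shown in this document.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    if offload_param != "none" and stage != 3:
        # DeepSpeed only supports parameter offload with ZeRO stage 3.
        raise ValueError("parameter offload requires ZeRO stage 3")

    zero = {"stage": stage}
    targets = (("offload_optimizer", offload_optimizer),
               ("offload_param", offload_param))
    for key, device in targets:
        if device == "none":
            continue
        if device not in ("cpu", "nvme"):
            raise ValueError(f"unknown offload device: {device!r}")
        target = {"device": device}
        if device == "nvme":
            if not nvme_path:
                raise ValueError("NVMe offload needs a directory path")
            target["nvme_path"] = nvme_path
        zero[key] = target
    return {"zero_optimization": zero}


# ZeRO stage 2 with optimizer state offloaded to host memory.
print(json.dumps(build_zero_config(2, offload_optimizer="cpu"), indent=2))
```

Refusing to build an NVMe target without a path mirrors the note above that a path is required before training starts.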
FSDP2 Configuration

FSDP2 shards model parameters, optimizer state, and activations across GPUs. SimpleTuner enables Accelerate’s DTensor-backed implementation and exposes the most common toggles here.

Sharded checkpoints save memory by keeping tensors distributed when writing to disk.
Set > 1 to shard attention / context across GPUs. Requires models that support context parallelism.
Transformer-based wrapping covers most diffusion transformers. Size-based lets you wrap by parameter count.
Override Accelerate’s detected layer classes when validation errors request a specific module.
Configure memory optimization strategies

Choose strategies to reduce VRAM usage. Multiple strategies can be combined where compatible.
You can change these settings at any time using the Load Presets button in the Memory Optimization section.

Low system RAM detected. Most memory optimization strategies require 64 GB or more of system RAM to be effective.
No quantization applied.
Recommended default. Good balance of quality and memory savings.
More aggressive savings. Does not work on AMD or Apple machines.
Experimental. Maximum savings but may impact quality.
Advanced strategies require careful configuration and may have stability issues.
No presets available for this tab. Select a different model or check the Advanced tab for distributed strategies.
Set a custom number of blocks to swap if presets don't match your needs. Higher values save more VRAM, use more system RAM, and increase training time.
How long should training run?

Choose whether to stop after a set number of epochs (full passes through your dataset) or a fixed number of steps (individual optimizer updates).

When you pick epochs, SimpleTuner will automatically calculate the matching step count once your dataset size is known.

One epoch equals one full pass over every sample. Helpful when you want consistent coverage of the dataset.

Steps will be derived automatically from dataset size and batch settings.

A step is one optimizer update (forward + backward). Great for matching published schedules or quick experiments.

When steps are set, epochs will be inferred so training stops exactly at this count.
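The relationship between epochs and steps can be sketched as simple arithmetic. This is an approximation under stated assumptions: every sample is seen once per epoch, and one optimizer step consumes batch size × gradient accumulation × number of GPUs samples. The function names are hypothetical, and SimpleTuner's own calculation may differ (for example, if dataset repeats or aspect bucketing change the effective sample count).

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int,
                    grad_accum: int = 1, num_gpus: int = 1) -> int:
    """Optimizer updates needed for one full pass over the dataset."""
    samples_per_step = batch_size * grad_accum * num_gpus
    if num_samples <= 0 or samples_per_step <= 0:
        raise ValueError("all inputs must be positive")
    return math.ceil(num_samples / samples_per_step)


def steps_for_epochs(num_samples: int, epochs: int, batch_size: int,
                     grad_accum: int = 1, num_gpus: int = 1) -> int:
    """Total step count when training is limited by epochs."""
    return epochs * steps_per_epoch(num_samples, batch_size,
                                    grad_accum, num_gpus)


def epochs_for_steps(num_samples: int, max_steps: int, batch_size: int,
                     grad_accum: int = 1, num_gpus: int = 1) -> int:
    """Epochs (rounded up) needed to reach a fixed step count."""
    per_epoch = steps_per_epoch(num_samples, batch_size,
                                grad_accum, num_gpus)
    return math.ceil(max_steps / per_epoch)


# 1000 samples, batch size 4, gradient accumulation 2:
# 8 samples per optimizer step, so 125 steps per epoch.
print(steps_for_epochs(1000, epochs=10, batch_size=4, grad_accum=2))
print(epochs_for_steps(1000, max_steps=3000, batch_size=4, grad_accum=2))
```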
Do you want to publish your model to Hugging Face Hub?

Automatically upload your trained model when complete.

Yes, publish when complete
No, keep local only
Also publish intermediate checkpoints?

Enable this to push every checkpoint to the Hub. This consumes considerably more Hub storage; disable it to upload only the final checkpoint (you can still upload others later from the Checkpoints page).

Sends Hub uploads to a background worker so checkpoints and final saves do not block the training loop.
Provide the destination repo, e.g. yourname/awesome-lora. Required when publishing.
This note appears at the top of the auto-generated model card.
How often should checkpoints be saved?

Checkpoints let you resume training and pick the best model version.

Recommended: 50 for quick experiments, 100 for short runs, 500 for more files, 1000 for very long runs with many files
Epoch-based checkpoints fire when an epoch finishes. Combine with step checkpoints for extra safety.
Do you want to run validations during training?

Validations generate sample images to monitor training progress.

Yes, enable validations

Recommended - helps monitor training quality

No, skip validations

Faster training, but no progress preview

Choose an interval that matches how frequently you want fresh previews.
Write the prompt exactly as you would when sampling the trained model. You can add additional prompts later from the Validation tab once the wizard is complete.
Lyrics provide additional conditioning for audio generation. Use [Section] tags like [Verse], [Chorus], [Bridge] to mark different parts. This field is optional.
Match the aspect ratios or sizes you plan to render (supports comma-separated values).
Use the same number of steps you typically rely on for quality previews.
Enable real-time preview of validation images as they're being generated using Tiny AutoEncoders. Requires webhook configuration and model support (Flux, SDXL, SD3).
Only decode previews every N sampling steps. Higher values reduce Tiny AutoEncoder overhead.
Checkpoint interval: steps. Validation interval: steps.
Great! Your intervals align, so validation images will be stored with each checkpoint's model card.
When these intervals align, validation renders are bundled with checkpoint model cards; otherwise they remain in the output directory. Try using steps for validations to line things up.
Alignment is optional; only set it if saving validation images alongside checkpoints matters to you.
Set both checkpoint and validation intervals to see if they align. Matching values will save your validation images with each checkpoint automatically.
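One way to reason about alignment is to list the steps at which both a checkpoint and a validation would fire. The sketch below assumes both intervals are measured in steps and treats them as aligned when every checkpoint step is also a validation step. That definition is an assumption made for illustration; the wizard's actual rule is not shown here and may require the two values to match exactly. Both function names are hypothetical.

```python
def intervals_align(checkpoint_interval: int, validation_interval: int) -> bool:
    """True when every checkpoint step is also a validation step.

    Assumed definition: the checkpoint interval is a whole multiple of
    the validation interval (equal values are the simplest case).
    """
    if checkpoint_interval <= 0 or validation_interval <= 0:
        raise ValueError("intervals must be positive integers")
    return checkpoint_interval % validation_interval == 0


def shared_steps(checkpoint_interval: int, validation_interval: int,
                 max_steps: int) -> list:
    """Steps up to max_steps where a checkpoint and a validation coincide."""
    if checkpoint_interval <= 0 or validation_interval <= 0:
        raise ValueError("intervals must be positive integers")
    return [step
            for step in range(checkpoint_interval, max_steps + 1,
                              checkpoint_interval)
            if step % validation_interval == 0]


print(intervals_align(500, 100))      # every checkpoint has fresh renders
print(shared_steps(500, 200, 2000))   # only some checkpoints coincide
```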
Enable logging to external platform?

Track metrics, losses, and generated images externally (optional).

A default value is used when left blank.
Leave empty to auto-generate a timestamped name for each run.
File name from simpletuner/custom-trackers (without .py), which must expose a single GeneralTracker subclass.
Logging is currently disabled. These values will be applied automatically once you select a provider.
Configure your training dataset
You have an existing dataset configured. Choose an option:
Keep existing datasets

Continue with your current dataset configuration

Create new dataset

Launch dataset wizard to configure a new dataset

Configure learning rate and optimizer?

Pick a preset tailored to your training mode or adjust the values manually.

Manual configuration

Adjust optimizer, learning rate, batch size, and gradient accumulation directly here

Use small, positive values. Lower rates improve stability.
Higher batch sizes increase memory use but can stabilise updates.
Accumulate gradients across multiple mini-batches to simulate larger batch sizes, as recommended in the quickstarts for low VRAM runs.
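As a quick check on how batch size and gradient accumulation combine, the effective batch size per optimizer update is the product of the per-device batch size, the accumulation steps, and the number of processes. A minimal sketch with hypothetical helper names:

```python
import math


def effective_batch_size(batch_size: int, grad_accum_steps: int,
                         num_processes: int = 1) -> int:
    """Samples contributing to a single optimizer update."""
    if min(batch_size, grad_accum_steps, num_processes) <= 0:
        raise ValueError("all inputs must be positive")
    return batch_size * grad_accum_steps * num_processes


def accumulation_steps_for(target_batch: int, batch_size: int,
                           num_processes: int = 1) -> int:
    """Smallest accumulation count that reaches at least target_batch."""
    if min(target_batch, batch_size, num_processes) <= 0:
        raise ValueError("all inputs must be positive")
    per_pass = batch_size * num_processes
    return max(1, math.ceil(target_batch / per_pass))


# A low-VRAM run that only fits batch size 2 can still approximate
# a batch of 16 by accumulating gradients over 8 mini-batches.
print(effective_batch_size(batch_size=2, grad_accum_steps=8))
print(accumulation_steps_for(target_batch=16, batch_size=2))
```

Accumulation trades time for memory: peak VRAM follows the per-device batch size, while each optimizer update takes proportionally more forward and backward passes.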
No optimizer options available from the registry; defaults will be used.
Review your configuration
Configuration complete! You can now start training or review individual settings in the UI.