Model Settings

Configure training mode, hyperparameters, and inference settings for your VLM in Datature Vi.

Before You Start

Model settings control three things: how the model trains (training mode and quantization), how it learns (hyperparameters), and how it generates output at inference (evaluation settings). You access all of them by clicking the Model node in the workflow canvas.


Model options

Architecture size

The number of parameters in the model, measured in billions (B). Larger models have more capacity to learn complex patterns but require more GPU memory and training time.

| Size range | Parameters | GPU memory needed | Best for |
| --- | --- | --- | --- |
| Small | 0.8–4B | 4–10 GB | Quick experiments, edge and resource-limited deployments |
| Medium | 7–9B | 16–20 GB | Standard production use cases |
| Large | 27–35B | 55–80 GB | Complex tasks, maximum accuracy requirements |

Start with a smaller model to validate your approach, then scale up if you need higher accuracy.

Training mode

Training mode determines which parameters are updated during training.

LoRA (Low-Rank Adaptation) inserts small trainable adapter layers into a frozen base model. Only the adapter weights are updated. This requires 3–5× less memory and trains 2–3× faster than full fine-tuning. The accuracy difference is small for most tasks.

Full fine-tuning (SFT) updates every parameter in the network. This requires more GPU memory and training time but gives maximum flexibility for tasks that differ substantially from the base model's training data.

| Aspect | LoRA | Full fine-tuning |
| --- | --- | --- |
| GPU memory | Low (3–5× reduction) | High |
| Training speed | Faster (2–3×) | Slower |
| Final accuracy | Good for most tasks | Highest possible |
| NVILA-Lite support | Not supported | Yes (only option) |

Start with LoRA for your first training run: it trains 2–3× faster and uses 3–5× less GPU memory, with only a small accuracy gap on most tasks. Switch to full fine-tuning only if (a) LoRA results plateau below your accuracy target and you have the GPU resources to support it, (b) your images look very different from typical photos (microscopy, satellite, X-ray), or (c) you are using NVILA-Lite, which only supports full fine-tuning.

For a deeper explanation, see What Are VLMs and the LoRA glossary entry.
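To make the adapter idea concrete, here is a minimal PyTorch-style sketch of a LoRA linear layer. This is illustrative only, not Datature's implementation; the rank r=8 and scaling alpha=16 are hypothetical defaults.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a small trainable low-rank adapter."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # Low-rank update: W' = W + (alpha / r) * B @ A
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Only A and B receive gradients, so gradient and optimizer
        # memory shrink sharply compared to full fine-tuning.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

Because B is initialized to zero, training starts from the base model's exact behavior and the adapter learns only the delta your data requires.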

Quantization

Quantization reduces model weight precision to save GPU memory. Both available formats reduce memory by approximately 4×:

  • NF4 (Normalized Float 4): Optimized for transformer models. Better quality preservation than FP4 for VLMs. This is the recommended default.
  • FP4 (4-bit Floating Point): Standard 4-bit format. Use if NF4 causes compatibility issues.

Precision type

The numeric format used for calculations during training:

  • BFloat16 (recommended): Best numerical stability for large models on modern GPUs (NVIDIA Ampere+). Runs at ~2× the speed of Float32.
  • Float16: Good for older GPUs without BFloat16 support. Slightly more prone to gradient instability on large models.
  • Float32: Use only when debugging numerical stability issues. Slowest and uses the most memory.
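As a rough sanity check on the memory figures above, you can estimate the weight footprint from parameter count and bytes per value. This counts weights only; activations, gradients, and optimizer state add more, and 4-bit formats carry small per-block scale overheads this sketch ignores.

BYTES_PER_PARAM = {
    "Float32": 4.0,
    "BFloat16": 2.0,
    "Float16": 2.0,
    "NF4": 0.5,  # 4-bit formats pack two weights per byte
    "FP4": 0.5,
}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate GPU memory for the model weights alone, in GB."""
    # params_billions * 1e9 params * bytes / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(8, "BFloat16"))  # 16.0 GB
print(weight_memory_gb(8, "NF4"))       # 4.0 GB -- the ~4x reduction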

Hyperparameters

Each hyperparameter control in the Model node maps 1:1 to a property under hyperparameters on the model block in the saved run.json file you get with a model download: same keys and values, no hidden renames. Use that JSON as the canonical reference when you compare UI training runs to Vi SDK flow specs.

Epochs

The number of complete passes through your training data. One epoch means the model sees every training image once.

Choosing the right number depends on dataset size:

| Dataset size | Recommended epochs |
| --- | --- |
| Under 100 images | 100–300 |
| 100–1,000 images | 50–150 |
| 1,000+ images | 20–100 |

Smaller datasets need more epochs because the model sees fewer unique examples per pass; larger datasets pack more variety into each pass, so fewer are needed.

Watch the validation loss curve during training: if training loss keeps falling while validation loss stalls or rises, the model is overfitting and you should reduce the epoch count. For more on reading loss curves, see How Does VLM Training Work.
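To see why smaller datasets need more epochs, it can help to count total weight updates. A quick sketch (the platform may count steps differently):

import math

def optimizer_steps(num_images: int, epochs: int,
                    batch_size: int, grad_accum: int = 1) -> int:
    """Total weight updates: epochs times batches per pass."""
    steps_per_epoch = math.ceil(num_images / (batch_size * grad_accum))
    return steps_per_epoch * epochs

print(optimizer_steps(200, 150, batch_size=8))   # 3750 updates
print(optimizer_steps(5000, 50, batch_size=8))   # 31250 updates

The small dataset gets many more passes so its total update count still lands in a useful range.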

Learning rate

Controls how much the model's parameters change with each training step. The typical range is 1e-5 to 1e-4.

  • A rate that is too high causes unstable training (loss oscillates or diverges)
  • A rate that is too low makes training converge slowly

Start with the default. If training loss spikes or oscillates, the learning rate is probably too high: try halving it. If loss decreases unusually slowly, try doubling it.

Think of the learning rate like step size when walking downhill: too large and you overshoot, too small and you never arrive.

The notation 1e-5 means 0.00001; the "e" means "times 10 to the power of." The default range of 1e-5 to 1e-4 works for most tasks.
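You can see the step-size effect on a toy one-dimensional loss (purely illustrative; real training losses are far noisier):

def descend(lr: float, steps: int = 10, x: float = 5.0) -> float:
    """Gradient descent on loss(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(lr=0.01))  # ~4.09  -- too low: barely moves toward the minimum at 0
print(descend(lr=0.4))   # ~5e-07 -- well chosen: converges quickly
print(descend(lr=1.2))   # ~145   -- too high: every step overshoots and diverges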

Batch size

The number of training images processed in each step before the model updates its weights. Larger batches make training faster but require more GPU memory.

  • If you see out-of-memory (OOM) errors during training, reduce the batch size first
  • Use gradient accumulation to simulate a larger effective batch size when GPU memory is limited: batch_size=4 with accumulation_steps=2 gives an effective batch of 8

Gradient accumulation steps

Accumulates gradients across multiple steps before applying a weight update. Setting this to N has a similar effect to multiplying your batch size by N, without the extra memory cost. Useful when GPU memory prevents you from using a larger batch size directly.
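In a typical training loop, accumulation looks like the following PyTorch-flavored sketch (the tiny linear model stands in for the VLM; shapes and values are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(4, 10), torch.randn(4, 2)) for _ in range(8)]
accumulation_steps = 2    # effective batch = 4 * 2 = 8

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader, start=1):
    loss = nn.functional.mse_loss(model(inputs), targets)
    # Divide so the accumulated gradient matches one large-batch update
    (loss / accumulation_steps).backward()
    if step % accumulation_steps == 0:
        optimizer.step()       # one weight update per two mini-batches
        optimizer.zero_grad()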

Optimizer

The algorithm that computes weight updates from gradients. The default optimizer is tuned for VLM fine-tuning. Change this only if you have a specific reason: optimizer choice rarely has more impact than learning rate and epoch count.


Evaluation settings

These settings control how the model generates output at inference time. They affect response length, diversity, and coherence.

Max new tokens

The maximum number of tokens the model can generate in a single response. Keep this high enough to accommodate your longest expected output. For short classification answers this can be small; for detailed descriptions it should be larger.

Temperature

Controls output randomness. Lower values (0.1–0.5) produce more consistent, deterministic responses. Higher values (0.7–1.0) produce more varied outputs. For structured outputs (JSON, fixed-format answers) keep temperature low.

Top-p (nucleus sampling)

Limits token selection to the smallest set of candidates whose cumulative probability exceeds the threshold. A value of 0.9 means the model only considers tokens that together account for 90% of the probability mass. Lower values make outputs more focused; higher values allow more variety.

Top-k results

Limits token selection to the k most likely candidates at each step. Works alongside top-p to control generation diversity.

Repetition penalty

Reduces the likelihood of the model repeating the same words or phrases. Values above 1.0 penalize repetition. If your model produces repetitive outputs, increase this slightly (try 1.1–1.3).
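Taken together, temperature, top-k, and top-p shape each sampling step. Here is a minimal NumPy sketch of the filtering order (illustrative, not the production decoder; repetition penalty is omitted for brevity):

import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.7,
                 top_k: int = 50, top_p: float = 0.9) -> int:
    """Pick the next token id from raw logits."""
    logits = logits / temperature              # <1 sharpens, >1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1][:top_k]    # keep the k most likely
    cum = np.cumsum(probs[order])
    keep = order[: max(1, np.searchsorted(cum, top_p) + 1)]  # nucleus cutoff

    kept = probs[keep] / probs[keep].sum()     # renormalize the survivors
    return int(np.random.choice(keep, p=kept))

logits = np.random.randn(1000)                 # a fake 1,000-token vocabulary
print(sample_token(logits))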


Frequently asked questions

Can I change model settings after saving a workflow?

No. Settings are fixed when you save a workflow. To try different settings, create a new workflow within the same training project. You can maintain multiple workflows and compare their training results.

Should I start with LoRA or full fine-tuning?

Start with LoRA. It trains faster, uses less memory, and produces good results for most vision-language tasks. Switch to full fine-tuning only if LoRA accuracy is consistently insufficient for your use case and you have the GPU resources for it. NVILA-Lite does not support LoRA: full fine-tuning is its only option.

Which settings have the biggest impact on results?

In order of impact: (1) model architecture and size, (2) training data quality and quantity, (3) epoch count, (4) learning rate, (5) batch size and gradient accumulation. Focus on architecture, data, and epochs before fine-tuning other settings.

What causes an out-of-memory error, and how do I fix it?

An out-of-memory (OOM) error means the model and batch data exceed your GPU's available VRAM. Fix it by: reducing batch size, enabling quantization (NF4), switching from full fine-tuning to LoRA, or upgrading to a GPU with more VRAM. See resource usage for GPU memory specifications.

Do this with the Vi SDK

import vi

# Authenticate with organization-level credentials
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Fetch the flow, then rebuild its block list: override settings
# on the model block and copy every other block unchanged
flow = client.flows.get("your-flow-id")
blocks = []
for block in flow.spec.blocks:
    settings = dict(block.settings)
    if "model" in block.block:
        settings["trainingMode"] = {"type": "LoRA"}
        settings["quantization"] = {"enabled": True, "type": "NF4"}
        settings["compute"] = {"precisionType": "BFloat16"}
        settings["hyperparameters"] = {
            "epochs": 10,
            "learningRate": 0.0001,
            "batchSize": 8,
            "gradientAccumulation": 1,
            "optimizer": "AdamW"
        }
        settings["evaluation"] = {
            "maxNewTokens": 512,
            "topK": 50,
            "topP": 1,
            "temperature": 1,
            "repetitionPenalty": 1.05
        }
    blocks.append({
        "block": block.block,
        "settings": settings,
        "style": block.style,
    })

# Push the updated block list back to the flow
client.flows.update(flow_id=flow.flow_id, spec={"blocks": blocks})

For more details, see the full SDK reference.

Next steps

Model Architectures

Review the seven available VLM architectures and choose the best fit for your task.

Start A Training Run

Configure training settings, select GPU hardware, and start a training run.

How Does VLM Training Work?

Learn the concepts behind fine-tuning, LoRA, and how VLMs learn from your data.