Model Settings

Configure model architecture, training parameters, and inference settings for optimal VLM performance.

Model settings define the architecture, training approach, and inference behavior of your VLM. Proper configuration of these settings directly impacts training speed, memory usage, model quality, and inference performance.

💡

Looking for a quick start?

For a streamlined workflow setup with recommended default settings, see the Quickstart.

📋

Prerequisites

Before configuring model settings, ensure the prerequisites for your workflow are in place.

Understanding model settings

Model settings are organized into three main categories:

  • Model Options — Architecture size, training mode, and memory optimization
  • Hyperparameters — Training behavior and convergence settings
  • Evaluation — Inference behavior and output generation

These settings work together to define:

  • Training efficiency — How fast your model trains and how much memory it requires
  • Model capacity — The model's ability to learn complex patterns
  • Convergence behavior — How the model improves during training
  • Inference quality — Output diversity, length, and coherence

Model options

Access model configuration by clicking the Model node in the workflow canvas.

Model architecture size

The number of parameters in the neural network, measured in billions (B). This fundamental setting determines the model's capacity to learn and represent complex patterns.

Available sizes:

| Size | Parameters | Best for | Memory requirements |
| --- | --- | --- | --- |
| Small | 1-3B | Quick experiments, limited resources | 8-16 GB GPU |
| Medium | 7-13B | Standard production use cases | 16-32 GB GPU |
| Large | 20-34B | Complex tasks, high accuracy needs | 40-80 GB GPU |

How it impacts training:

  • Larger models have more capacity to learn complex patterns and nuanced relationships
  • Smaller models train faster and require fewer computational resources
  • Memory usage scales roughly linearly with parameter count

📘

Choosing the right size

Start with smaller models for initial experiments to validate your approach quickly. Scale up to larger models when you need higher accuracy or have proven the concept works with your data.

When to use each size:

Small models (1-3B parameters)

Best for:

  • Initial prototyping and experimentation
  • Limited GPU resources (single consumer GPU)
  • Fast iteration during development
  • Simple tasks with clear visual patterns
  • Real-time inference requirements

Tradeoffs:

  • Lower capacity for complex reasoning
  • May struggle with subtle distinctions
  • Faster training and inference
  • Lower memory and compute costs

Example use cases:

  • Binary classification (good/defective)
  • Single object detection
  • Simple quality control

Medium models (7-13B parameters)

Best for:

  • Production deployments
  • Most standard computer vision tasks
  • Multi-class detection and classification
  • Balanced performance and resource usage

Tradeoffs:

  • Good balance of accuracy and efficiency
  • Reasonable training times
  • Moderate GPU requirements
  • Suitable for most use cases

Example use cases:

  • Multi-object detection
  • Complex defect classification
  • Retail product recognition
  • General-purpose visual understanding

Large models (20-34B parameters)

Best for:

  • Maximum accuracy requirements
  • Complex reasoning tasks
  • Fine-grained distinctions
  • Production systems with ample resources

Tradeoffs:

  • Highest accuracy potential
  • Significantly longer training times
  • High GPU memory requirements
  • May require distributed training

Example use cases:

  • Medical image analysis
  • Detailed inspection tasks
  • Open-ended visual reasoning
  • Challenges requiring nuanced understanding

Training mode

Training mode determines how the model's parameters are updated during training. This critical choice affects training speed, memory usage, and model flexibility.

Available modes:

LoRA Training

Trains only small adapter layers while keeping the base model frozen.

  • How it works: Inserts small trainable layers (Low-Rank Adaptation) into the frozen base model
  • Memory requirements: Low (only adapter gradients stored)
  • Training time: Faster (fewer parameters to update)
  • Flexibility: Good for most use cases, may be less flexible for drastically different tasks
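
If your training stack is built on Hugging Face PEFT, a LoRA setup looks roughly like this sketch; the checkpoint name, rank, and target modules are illustrative assumptions, not this platform's actual internals.

```python
# Hedged sketch: LoRA adapters on a frozen base model via Hugging Face PEFT.
# The checkpoint name and target modules are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

base = AutoModelForVision2Seq.from_pretrained("example/base-vlm")  # placeholder

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                         # scaling applied to adapter outputs
    lora_dropout=0.05,                     # dropout on adapter inputs
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(base, lora_config)  # base weights stay frozen
model.print_trainable_parameters()         # adapters are a tiny fraction of weights
```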

Full Finetuning (SFT)

Updates all model parameters during training using Supervised Fine-Tuning.

  • How it works: Every layer in the neural network is adjusted based on your training data
  • Memory requirements: High (requires storing gradients for all parameters)
  • Training time: Longer (more parameters to update)
  • Flexibility: Maximum adaptation to your specific task

Recommendation: Start with LoRA Training

LoRA Training is recommended for most use cases as it offers 2-3x faster training with significantly lower memory usage while maintaining comparable quality to Full Finetuning.

Comparison:

| Aspect | LoRA Training | Full Finetuning (SFT) |
| --- | --- | --- |
| Memory usage | Low (3-5x reduction) | High |
| Training speed | Faster (2-3x speedup) | Slower |
| Adaptation flexibility | Good | Maximum |
| Best for | Most standard use cases | Drastically different tasks |
| GPU requirements | Consumer GPU sufficient | High-end GPU required |

When to use LoRA Training:

Standard use cases (recommended)

Use for most production applications:

  • Standard computer vision tasks (detection, classification)
  • Limited GPU resources
  • Faster iteration during development
  • Domain adaptation (e.g., retail → manufacturing)
  • Cost-effective training

When to use Full Finetuning (SFT):

Maximum adaptation required

Use when your task is significantly different from the base model's training:

  • Highly specialized domain (e.g., microscopy, satellite imagery)
  • Novel visual patterns not seen in general training
  • Maximum accuracy is critical and resources are available
  • Task requires fundamental changes to feature extraction

Quantization

Quantization reduces model precision to save memory, enabling training of larger models on limited GPU resources. This technique uses lower-bit representations for model weights and activations.

Available formats:

NF4 (Normalized Float 4)

4-bit format optimized specifically for neural networks with normalized value distribution.

  • Memory savings: ~4x reduction compared to full precision
  • Quality: Excellent preservation of model quality
  • Best for: Neural network training (recommended default)
  • Designed for: Transformer models and VLMs

FP4 (4-bit Floating Point)

Standard 4-bit floating point quantization for weights.

  • Memory savings: ~4x reduction compared to full precision
  • Quality: Very good
  • Best for: General quantization, compatibility scenarios
  • Use when: NF4 has compatibility issues or specific FP4 requirements

Recommendation: NF4

NF4 is recommended for VLM training as it's specifically optimized for transformer models, providing better quality preservation than FP4 with the same memory savings.

📘

How quantization works

Quantization represents model weights using 4 bits instead of 16-bit or 32-bit precision, reducing memory usage by approximately 4x. This enables training larger models or using larger batch sizes with the same GPU memory.
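
As a hedged illustration, this is how NF4 loading is commonly configured when a stack uses Hugging Face transformers with bitsandbytes; the checkpoint name is a placeholder.

```python
# Hedged sketch: loading a model in 4-bit NF4 with bitsandbytes through
# transformers. The checkpoint name is a placeholder.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # switch to "fp4" for the FP4 format
    bnb_4bit_compute_dtype=torch.bfloat16,  # 4-bit storage, BF16 compute
)

model = AutoModelForVision2Seq.from_pretrained(
    "example/base-vlm",                     # placeholder
    quantization_config=bnb_config,
)
```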

Quantization formats compared:

| Format | Memory savings | Quality | Best for |
| --- | --- | --- | --- |
| NF4 | ~4x reduction | Excellent | Neural network training (recommended) |
| FP4 | ~4x reduction | Very good | General quantization, compatibility |

Impact example:

Without quantization (16-bit):
- 13B model: ~40 GB GPU memory

With NF4 quantization (4-bit):
- 13B model: ~12 GB GPU memory
- 3-4x memory reduction
- Minimal quality loss

When to use each format:

NF4 (recommended)

Use for most scenarios:

  • Standard VLM training workflows
  • Production model training
  • When you want best quality with quantization
  • Recommended default choice

Advantages:

  • Optimized value distribution for neural network weights
  • Better preservation of model quality than FP4
  • Specifically designed for transformer architectures
  • Minimal accuracy loss compared to full precision
  • Industry best practice for VLM training

FP4

Use when:

  • Compatibility issues with NF4
  • Debugging quantization-related issues
  • Specific requirements for standard floating point format
  • Legacy configurations

Note: For most use cases, NF4 is preferred as it provides better quality with the same memory savings.


Precision type

The numerical format used for calculations during training. This setting balances computation speed, memory usage, and numerical stability.

Available options:

BFloat16

  • Precision: 16-bit brain floating point format
  • Accuracy: Better numerical stability than Float16 for large models
  • Speed: ~2x faster than Float32
  • Memory: 2x reduction vs Float32
  • Use case: Preferred for training large models (recommended)

Float16

  • Precision: 16-bit floating point
  • Accuracy: Good for most use cases
  • Speed: ~2x faster than Float32
  • Memory: 2x reduction vs Float32
  • Use case: Standard training with older GPUs

Float32

  • Precision: 32-bit floating point
  • Accuracy: Highest numerical precision
  • Speed: Slowest (baseline)
  • Memory: Highest usage
  • Use case: Debugging numerical issues, research requiring maximum precision

Recommendation: BFloat16

BFloat16 is recommended for VLM training as it provides the speed and memory benefits of 16-bit precision with better numerical stability than Float16, especially important for large models.

Comparison table:

| Precision | Speed | Memory | Stability | Modern GPU support |
| --- | --- | --- | --- | --- |
| BFloat16 | Fast (2x) | Low | Excellent | NVIDIA Ampere+, AMD MI200+ |
| Float16 | Fast (2x) | Low | Good | Universal |
| Float32 | Baseline (1x) | High | Excellent | Universal |
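
For intuition, here is a minimal BFloat16 mixed-precision step in PyTorch; the toy model and data stand in for a real VLM, and a GPU with BF16 support is assumed.

```python
# Minimal BF16 mixed-precision step (toy model as a stand-in for a VLM).
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

# Forward pass runs in BF16; weights and optimizer state stay in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()       # unlike FP16, BF16 typically needs no gradient scaling
optimizer.step()
optimizer.zero_grad()
```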

When to use each precision:

BFloat16 (recommended)

Best for most scenarios:

  • Modern GPUs (NVIDIA A100, H100, RTX 30/40 series)
  • Training large models (7B+ parameters)
  • Balanced speed and stability
  • Production training workflows

Why it's better:

  • Preserves Float32's exponent range (better for extreme values)
  • Reduces gradient underflow/overflow issues
  • Widely supported in modern ML frameworks
  • Minimal accuracy loss compared to Float32

Float16

Use when:

  • Older GPUs without BFloat16 support
  • Maximum speed is critical
  • Working with smaller models (<7B parameters)

Considerations:

  • May encounter numerical instability with very large or small gradients
  • Requires gradient scaling for stability
  • Slightly more prone to training issues than BFloat16

Float32

Use when:

  • Debugging numerical stability issues
  • Research requiring maximum precision
  • Training is unstable with lower precision
  • GPU memory is not a constraint

Tradeoffs:

  • 2x slower than Float16/BFloat16
  • 2x more memory usage
  • Rarely necessary for production training

Hyperparameters

Hyperparameters control the training process dynamics—how the model learns from your data and how quickly it converges.

Epochs

The number of complete passes through your entire training dataset. Each epoch represents one full cycle of training where the model sees every training image once.

How it works:

1 epoch = model sees all training images once
10 epochs = model sees all training images 10 times

Example calculation:

  • 100 training images
  • Batch size of 2
  • 1 epoch = 50 training steps (100 images ÷ 2 per batch)
  • 10 epochs = 500 total training steps
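
The same arithmetic as a quick sanity check in Python:

```python
# Total training steps = (images per epoch / batch size) * epochs.
images, batch_size, epochs = 100, 2, 10
steps_per_epoch = images // batch_size   # 50
total_steps = steps_per_epoch * epochs   # 500
```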

Choosing the right number:

Recommended epoch ranges

Small datasets (<100 images):

  • Recommended: 100-300 epochs
  • Reasoning: More passes needed to learn from limited data
  • Watch for: Overfitting after ~200 epochs

Medium datasets (100-1000 images):

  • Recommended: 50-150 epochs
  • Reasoning: Balanced learning with sufficient data
  • Watch for: Convergence plateau around 100 epochs

Large datasets (1000+ images):

  • Recommended: 20-100 epochs
  • Reasoning: Fewer passes needed with abundant data
  • Watch for: Diminishing returns after 50-75 epochs

Signs you need more epochs

Increase epochs when:

  • Training loss is still decreasing steadily
  • Validation metrics improving each epoch
  • Model hasn't converged yet
  • Early in experimentation phase

Example: Training loss: epoch 50 = 0.45, epoch 100 = 0.32, epoch 150 = 0.28

  • Still improving → continue training

Signs you need fewer epochs

Reduce epochs when:

  • Training loss plateaus early
  • Validation performance stops improving or degrades (overfitting)
  • Training time is excessive
  • Model converges quickly

Example: Training loss: epoch 30 = 0.25, epoch 60 = 0.24, epoch 90 = 0.24

  • Converged at epoch 30 → reduce to 50 epochs

💡

Pro tip: Use early stopping

Monitor validation metrics during training. If performance stops improving for 10-20 epochs, training can often be stopped early. Create multiple runs with different epoch counts to find the optimal value for your dataset.


Learning rate

Learning rate controls how much the model's parameters change with each training step. It's one of the most critical hyperparameters affecting training success.

How it works:

  • High learning rate: Large parameter updates, faster initial learning, risk of instability
  • Low learning rate: Small parameter updates, stable but slow learning, may get stuck
  • Optimal learning rate: Balances speed and stability for efficient convergence

Typical range: 0.00001 (1e-5) to 0.0001 (1e-4)

Recommended starting values:

| Model size | LoRA Training | Full Finetuning (SFT) |
| --- | --- | --- |
| Small (1-3B) | 0.0001 | 0.00005 |
| Medium (7-13B) | 0.0001 | 0.00003 |
| Large (20-34B) | 0.00005 | 0.00001 |

⚠️

Learning rate significantly impacts training

Too high causes training instability and divergence. Too low results in slow training or getting stuck in poor solutions. Start with recommended values and adjust based on training behavior.

Tuning the learning rate:

Learning rate too high (signs and fixes)

Signs:

  • Training loss increases or oscillates wildly
  • Loss suddenly spikes to very large values
  • Model predictions become nonsensical
  • Training diverges or produces NaN values

Example:

Epoch 1: loss = 2.3
Epoch 2: loss = 1.8
Epoch 3: loss = 5.7 ← Spike indicates too high
Epoch 4: loss = NaN ← Training diverged

Fix:

  • Reduce learning rate by 2-5x (e.g., 0.0001 → 0.00002)
  • Start a new training run with adjusted rate
  • Consider using smaller batch size for more stable gradients

Learning rate too low (signs and fixes)

Signs:

  • Training loss decreases very slowly
  • Progress stalls at high loss values
  • Training takes excessively long
  • Model underfits the data

Example:

Epoch 10: loss = 2.1
Epoch 20: loss = 2.08
Epoch 30: loss = 2.06 ← Very slow improvement
Epoch 40: loss = 2.04

Fix:

  • Increase learning rate by 2-3x (e.g., 0.00001 → 0.00003)
  • Start a new training run with adjusted rate
  • Monitor closely to ensure stability

Learning rate just right (what to expect)

Good training behavior:

  • Loss decreases steadily without large spikes
  • Occasional small fluctuations are normal
  • Converges within expected number of epochs
  • Validation metrics improve consistently

Example:

Epoch 5:  loss = 2.1
Epoch 10: loss = 1.6
Epoch 15: loss = 1.3 ← Steady improvement
Epoch 20: loss = 1.1
Epoch 25: loss = 0.95

Characteristics:

  • Smooth loss curve with minor noise
  • Clear downward trend
  • No sudden spikes or divergence
  • Validation performance tracks training improvement

Learning rate schedules:

Advanced: Learning rate scheduling

Learning rate schedules automatically adjust the learning rate during training:

Common schedules:

  1. Constant (default):

    • Same rate throughout training
    • Simple and predictable
    • Good for most use cases
  2. Linear decay:

    • Gradually reduces rate over training
    • Helps fine-tune convergence at the end
    • Useful for long training runs
  3. Cosine annealing:

    • Reduces rate following cosine curve
    • Smoother decay than linear
    • Popular for large model training

When to use schedules:

  • Long training runs (100+ epochs)
  • Fine-tuning pre-trained models
  • Seeking optimal convergence
  • Advanced optimization scenarios

Note: Most training workflows work well with constant learning rate. Schedules are an advanced optimization technique.
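
If your stack exposes raw PyTorch schedulers, the decay schedules above look roughly like this sketch; the model and epoch count are placeholders.

```python
# Hedged sketch: linear decay and cosine annealing in plain PyTorch.
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Pick one. Cosine annealing over a 100-epoch run:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# Linear decay alternative:
# scheduler = torch.optim.lr_scheduler.LinearLR(
#     optimizer, start_factor=1.0, end_factor=0.01, total_iters=100)

for epoch in range(100):
    # ... one epoch of training steps ...
    scheduler.step()  # adjust the learning rate once per epoch
```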


Batch size

The number of images processed simultaneously in each training step. Batch size affects training speed, memory usage, and model convergence.

How it works:

Batch size 2: Process 2 images → compute gradients → update model
Batch size 4: Process 4 images → compute gradients → update model
Batch size 8: Process 8 images → compute gradients → update model

Example:

  • 100 training images, batch size 4
  • Steps per epoch: 100 ÷ 4 = 25 steps
  • 10 epochs = 250 total training steps

Typical range: 1-16 for VLM training

Recommended starting values:

| Model size | GPU memory 16GB | GPU memory 24GB | GPU memory 40GB+ |
| --- | --- | --- | --- |
| Small (1-3B) | 8 | 16 | 32 |
| Medium (7-13B) | 2-4 | 4-8 | 8-16 |
| Large (20-34B) | 1-2 | 2-4 | 4-8 |

📘

Batch size tradeoffs

Larger batch sizes:

  • Faster training (better GPU utilization)
  • More stable gradients
  • Higher memory usage
  • May reduce model generalization

Smaller batch sizes:

  • Lower memory requirements
  • Better generalization (more noisy gradients)
  • Slower training
  • Less stable convergence

Choosing batch size:

Memory-constrained training

When GPU memory is limited:

Start with smallest batch size that trains successfully:

  1. Try batch size 4
  2. If out of memory, reduce to 2
  3. If still issues, try batch size 1
  4. Consider enabling quantization or using gradient accumulation

Memory optimization strategies:

  • Enable quantization (NF4)
  • Use gradient accumulation to simulate larger batches
  • Choose smaller model size
  • Reduce precision type to FP16/BF16

Balancing speed and quality

For optimal training:

Use the largest batch size that:

  • Fits in GPU memory comfortably (~80% utilization)
  • Maintains stable training (no memory errors)
  • Provides reasonable training speed

Practical approach:

  1. Start with recommended value for your GPU
  2. Increase until you encounter memory issues
  3. Reduce by 25-50% for safety margin
  4. Monitor training stability

Example:

  • GPU: 24GB VRAM
  • Model: 13B parameters with LoRA
  • Test batch sizes: 4 → 8 → 16
  • Batch size 16 causes OOM errors
  • Final choice: batch size 8 (safe maximum)

When batch size matters less

Batch size is less critical when:

  • Using gradient accumulation (can simulate larger batches)
  • Training small models with ample GPU memory
  • Dataset is large (1000+ images)

Batch size is more critical when:

  • GPU memory is constrained
  • Training very large models
  • Dataset is small (gradient noise matters more)

Gradient accumulation steps

The number of forward passes before updating model weights. This technique simulates larger batch sizes without requiring additional GPU memory.

How it works:

Instead of updating the model after each batch, gradients are accumulated over multiple batches:

Batch size 4, accumulation steps 1 (default):
- Process 4 images → update model immediately
- Effective batch size: 4

Batch size 4, accumulation steps 4:
- Process 4 images → accumulate gradients
- Process 4 images → accumulate gradients
- Process 4 images → accumulate gradients
- Process 4 images → accumulate gradients → update model
- Effective batch size: 16 (4 × 4)

Effective batch size formula:

Effective batch size = Batch size × Gradient accumulation steps

Typical values: 1-8 (1 means no accumulation)
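
A minimal accumulation loop makes the mechanism concrete: batch size 4 with 4 accumulation steps updates the model once per 16 images. The toy model and data are placeholders.

```python
# Gradient accumulation: effective batch 16 from micro-batches of 4.
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 4

for step in range(100):
    x = torch.randn(4, 16)                  # micro-batch of 4 samples
    target = torch.randn(4, 4)
    loss = torch.nn.functional.mse_loss(model(x), target)
    (loss / accumulation_steps).backward()  # scale so gradients average correctly

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one weight update per 4 micro-batches
        optimizer.zero_grad()
```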

💡

Why use gradient accumulation?

Gradient accumulation lets you train with larger effective batch sizes when GPU memory is limited. This improves training stability and gradient quality without requiring more memory.

When to use gradient accumulation:

Limited GPU memory

Problem: Want larger batch size but GPU memory is insufficient

Solution: Use gradient accumulation

Example:

  • Target effective batch size: 16
  • GPU can only handle batch size 4
  • Set gradient accumulation steps: 4
  • Effective batch size: 4 × 4 = 16 ✓

Tradeoff:

  • Training takes longer (4x more forward passes before each update)
  • Same memory usage as batch size 4
  • Training stability of batch size 16

Improving training stability

Higher effective batch sizes provide:

  • More stable gradient estimates
  • Smoother loss curves
  • Better convergence for large models
  • Reduced gradient noise

Recommended combinations:

| Scenario | Batch size | Accumulation steps | Effective batch |
| --- | --- | --- | --- |
| Memory constrained | 2 | 4 | 8 |
| Balanced | 4 | 2 | 8 |
| Memory available | 8 | 1 | 8 |

All achieve same effective batch size with different memory/speed tradeoffs.

When NOT to use accumulation

Avoid gradient accumulation when:

  • GPU memory is sufficient for desired batch size
  • Small datasets where gradient noise aids generalization
  • Training is already slow and speed is critical
  • Batch size 1-2 is sufficient for your model size

Why avoid unnecessary accumulation:

  • Slower training (more forward passes per update)
  • Adds complexity without benefit
  • Default (accumulation steps = 1) works fine for most cases

Recommended values:

By GPU memory constraint

16GB GPU:

  • Batch size 2, accumulation steps 4 (effective: 8)
  • Batch size 1, accumulation steps 8 (effective: 8)

24GB GPU:

  • Batch size 4, accumulation steps 2 (effective: 8)
  • Batch size 2, accumulation steps 4 (effective: 8)

40GB+ GPU:

  • Batch size 8, accumulation steps 1 (effective: 8)
  • Usually no accumulation needed

Goal: Effective batch size of 8-16 for most VLM training


Optimizer

The optimization algorithm that adjusts the model's parameters to minimize the loss function. The optimizer determines how gradient information is used to update model weights.

Available optimizers:

AdamW

  • Full name: Adam with Weight Decay
  • How it works: Adaptive learning rates per parameter with decoupled weight regularization
  • Best for: Most VLM training scenarios (recommended default)
  • Advantages: Better convergence and generalization through decoupled weight decay

Adam

  • How it works: Adaptive learning rates per parameter
  • Difference from AdamW: Coupled weight decay (less effective regularization)
  • Best for: Legacy compatibility, specific research requirements

Recommendation: AdamW

Use AdamW for VLM training as it provides better convergence and generalization than Adam. It's the standard optimizer for training transformer-based models.
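
In PyTorch terms, the practical difference is just which optimizer class you construct; the toy model is a placeholder.

```python
# Same call signature, different weight-decay behavior.
import torch

model = torch.nn.Linear(16, 4)

# AdamW: weight decay applied directly to the weights (decoupled).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Adam: weight decay folded into the gradient (coupled, weaker regularization).
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
```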

Optimizer comparison:

| Optimizer | Convergence speed | Memory usage | Generalization | Use case |
| --- | --- | --- | --- | --- |
| AdamW | Fast | Moderate | Excellent | Default (recommended) |
| Adam | Fast | Moderate | Good | Legacy/compatibility |

When to use each optimizer:

AdamW (recommended)

Use for:

  • All standard VLM training (recommended for 95%+ of cases)
  • Production models
  • Fine-tuning transformer models
  • When you want best practices defaults

Advantages:

  • Decoupled weight decay improves regularization
  • Adaptive learning rates per parameter
  • Proven effective for large language and vision models
  • Industry standard for transformer training
  • Better generalization than Adam

Why it's better than Adam:

  • Separates weight decay from gradient-based updates
  • More effective regularization
  • Better final model performance
  • Widely adopted as best practice

Adam

Use for:

  • Reproducing older experiments that used Adam
  • Compatibility with existing configurations
  • Specific research requirements

Note: AdamW is generally preferred over Adam. The main reason to use Adam is backward compatibility with older experiments or when reproducing specific published results that used Adam.

Difference from AdamW:

  • Couples weight decay with gradient updates
  • Slightly less effective regularization
  • May lead to marginally lower performance

Evaluation

Evaluation settings control how the model generates outputs during inference and evaluation. These settings affect response quality, diversity, and computational cost.

Max new tokens

The maximum number of tokens the model can generate in a single response. This setting limits output length to control generation time and computational cost.

How it works:

Tokens are roughly words or subwords:

  • "cat" = 1 token
  • "running" might be 1-2 tokens
  • "The quick brown fox" ≈ 5 tokens

Typical range: 128-1024 tokens

Recommended values:

| Task type | Max new tokens | Reasoning |
| --- | --- | --- |
| Short answers (VQA) | 128-256 | Brief responses sufficient |
| Phrase grounding | 512 | JSON with multiple groundings |
| Detailed descriptions | 512-1024 | Comprehensive captions needed |
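
As a hedged sketch, with a Hugging Face-style `generate()` API the cap is a single keyword argument; `model`, `processor`, and `inputs` are placeholders from your inference pipeline.

```python
# Cap generated length at 512 tokens (placeholder model/processor/inputs).
output_ids = model.generate(**inputs, max_new_tokens=512)
text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```
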
📘

Tradeoff: Length vs. Speed

Higher values:

  • Allow longer, more detailed responses
  • Increase evaluation time per sample
  • Higher computational cost
  • Risk of repetitive or rambling outputs

Lower values:

  • Force concise responses
  • Faster evaluation
  • Lower computational cost
  • May truncate important information

Choosing max new tokens:

For phrase grounding tasks

Recommended: 512 tokens

Phrase grounding outputs include:

  • Descriptive caption
  • Multiple grounded phrases
  • Bounding box coordinates
  • JSON structure

Example output size:

{
  "phrase_grounding": {
    "sentence": "A red car parked next to two people wearing blue shirts on the sidewalk near a green tree",
    "groundings": [
      {"phrase": "A red car", "grounding": [[120,340,580,670]]},
      {"phrase": "two people", "grounding": [[600,280,750,720],[780,290,920,710]]},
      {"phrase": "the sidewalk", "grounding": [[0,650,1024,900]]},
      ...
    ]
  }
}

Typical length: 200-400 tokens

Setting 512 provides comfortable margin without waste.

For visual question answering

Recommended: 128-256 tokens

VQA responses are typically:

  • 1-3 sentences
  • 20-80 tokens
  • Focused and concise

Example responses:

  • Short: "There are three dogs in the image" (~8 tokens)
  • Medium: "The image shows three golden retrievers playing in a park on a sunny day" (~15 tokens)
  • Long: "There are three dogs visible: two golden retrievers playing with a ball in the foreground and one black labrador resting under a tree in the background" (~30 tokens)

Setting 256 allows detailed answers without encouraging verbosity.

For custom tasks

Consider:

  1. Measure typical output length:

    • Generate sample outputs
    • Count tokens (roughly 1 token per word)
    • Use 1.5-2x the typical length as max
  2. Balance quality and cost:

    • Longer limits → more flexibility but slower
    • Shorter limits → faster but may truncate
    • Start conservative, increase if needed
  3. Monitor for truncation:

    • If outputs frequently hit the limit, increase
    • If outputs are always much shorter, decrease

Top K results

Limits token sampling to the K most probable next tokens. This setting controls output diversity by restricting the model's choices to the most likely options.

How it works:

At each generation step:

  1. Model computes probability for all possible next tokens
  2. Top K keeps only the K most probable tokens
  3. Model samples from these K tokens

Example:

Top K = 50:
- All tokens ranked by probability
- Keep top 50 most probable
- Sample next token from these 50
- Ignore all other possibilities

Top K = 5:
- Keep only top 5 most probable
- Very focused, deterministic output
- Less diversity

Typical range: 1-100

Recommended starting value: 50
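
The mechanism is easy to see in a standalone sketch that filters a logits vector; the vocabulary size is illustrative.

```python
# Top K filtering: keep only the K highest-scoring tokens, then sample.
import torch

logits = torch.randn(32000)              # a score for every vocabulary token
k = 50

topk = torch.topk(logits, k)
filtered = torch.full_like(logits, float("-inf"))
filtered[topk.indices] = topk.values     # keep only the K best tokens

probs = torch.softmax(filtered, dim=-1)  # all other tokens get probability 0
next_token = torch.multinomial(probs, num_samples=1)
```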

📘

How Top K affects outputs

Lower values (10-30):

  • More focused and deterministic
  • Less diversity and creativity
  • More consistent outputs
  • Risk of repetition

Higher values (50-100):

  • More diversity and variation
  • Less predictable
  • Broader vocabulary usage
  • May reduce coherence if too high

Choosing Top K:

Deterministic tasks (object detection, VQA)

Recommended: 50

For tasks requiring factual accuracy:

  • Object detection
  • Bounding box generation
  • Answering specific questions
  • Structured output generation

Why 50 works well:

  • Enough diversity to avoid repetition
  • Focused enough for accurate outputs
  • Balances creativity and precision

Example: "How many cars are in the image?"

  • Lower K: "There are 3 cars" (consistent)
  • Higher K: Various phrasings but same count

Creative tasks (descriptions, captions)

Recommended: 70-100

For tasks benefiting from diversity:

  • Detailed scene descriptions
  • Creative captions
  • Varied phrasing

Why higher K helps:

  • More vocabulary variety
  • Diverse expression styles
  • Less repetitive phrasing

Example: "Describe this scene"

  • Lower K: More standardized descriptions
  • Higher K: More varied and creative language

Interaction with Top P

Top K and Top P work together:

Both limit the sampling pool:

  • Top K: "Keep top 50 tokens"
  • Top P: "Keep tokens until cumulative probability reaches 0.95"

In practice:

  • Set Top K as hard upper limit
  • Top P provides dynamic threshold
  • Both constrain sampling for quality

Recommended combination:

  • Top K: 50
  • Top P: 0.95
  • Works well for most scenarios

Top P

Nucleus sampling threshold that selects tokens whose cumulative probability reaches P. This dynamic approach to limiting token choices adapts based on the confidence distribution.

How it works:

Instead of fixed K tokens, select tokens until cumulative probability reaches threshold:

Top P = 0.9:
- Sort all tokens by probability
- Add tokens until cumulative probability ≥ 0.9
- Sample from this dynamic set

Example:
Token A: 40% probability
Token B: 30% probability
Token C: 15% probability
Token D: 10% probability
Others: 5% probability

With Top P = 0.9:
- Keep A (40% cumulative)
- Keep B (70% cumulative)
- Keep C (85% cumulative)
- Keep D (95% cumulative ≥ 90% ✓)
- Result: Sample from {A, B, C, D}

Typical range: 0.0-1.0 (commonly 0.90-0.95)

Recommended starting value: 0.95
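
The worked example above corresponds to this standalone sketch; the vocabulary size is illustrative.

```python
# Nucleus (Top P) filtering: keep the smallest token set whose cumulative
# probability reaches the threshold, then renormalize and sample.
import torch

logits = torch.randn(32000)
top_p = 0.95

probs = torch.softmax(logits, dim=-1)
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)

# Drop tokens once the running total has already reached top_p;
# the single most probable token is always kept.
drop = cumulative - sorted_probs >= top_p
sorted_probs[drop] = 0.0
sorted_probs /= sorted_probs.sum()

next_token = sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]
```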

📘

Top P vs Top K

Top K: Fixed number of tokens (e.g., always 50)
Top P: Variable number based on the confidence distribution

Top P adapts:

  • When model is confident: few tokens needed to reach P
  • When model is uncertain: more tokens needed to reach P

Understanding Top P values:

High Top P (0.95-1.0)

More diverse outputs:

  • Includes less probable tokens
  • Greater output variety
  • More creative and unpredictable
  • Risk of lower coherence

Use when:

  • Creative description tasks
  • Varied phrasing desired
  • Output diversity important
  • Repetition is a problem

Example: Top P = 0.95

  • Allows more vocabulary choices
  • Includes less common but valid alternatives
  • Increases expression variety

Medium Top P (0.85-0.94)

Balanced outputs (recommended for most tasks):

  • Moderate diversity
  • Maintains coherence
  • Focused but not repetitive
  • Good default choice

Use when:

  • Standard VLM tasks
  • Balance needed between quality and diversity
  • First-time configuration
  • Phrase grounding and VQA

Example: Top P = 0.90

  • Focuses on high-probability tokens
  • Allows some variation
  • Prevents most low-quality choices

Low Top P (0.7-0.84)

Focused outputs:

  • Very deterministic
  • Minimal diversity
  • Highly consistent
  • Risk of repetition

Use when:

  • Maximum consistency required
  • Factual accuracy critical
  • Template-like outputs desired
  • Debugging generation issues

Example: Top P = 0.80

  • Very focused token selection
  • Predictable outputs
  • May become repetitive

Recommended combinations:

Standard phrase grounding

Configuration:

  • Top K: 50
  • Top P: 0.95
  • Temperature: 1.0

Why this works:

  • Allows diverse phrasing for captions
  • Maintains focus for accurate groundings
  • Balances creativity and precision

Factual VQA

Configuration:

  • Top K: 50
  • Top P: 0.90
  • Temperature: 0.7

Why this works:

  • Focuses on most probable (accurate) answers
  • Reduces creative but potentially incorrect responses
  • Maintains consistency across similar questions

Creative descriptions

Configuration:

  • Top K: 70
  • Top P: 0.95
  • Temperature: 1.2

Why this works:

  • High diversity in vocabulary
  • Varied expression styles
  • Creative but coherent outputs

Sampling temperature

Controls randomness in token selection during generation. Temperature shapes the probability distribution over possible next tokens, directly affecting output diversity and creativity.

How it works:

Temperature scales the probability distribution:

Low temperature (0.3):
- Sharpens distribution → more deterministic
- High-probability tokens become even more likely
- Low-probability tokens become even less likely

High temperature (1.5):
- Flattens distribution → more random
- Probabilities become more uniform
- Low-probability tokens get more chances

Example:

Original probabilities:
Token A: 40%
Token B: 30%
Token C: 20%
Token D: 10%

Temperature = 0.5 (low):
Token A: 60% ← More focused
Token B: 25%
Token C: 10%
Token D: 5%

Temperature = 1.5 (high):
Token A: 35% ← More uniform
Token B: 32%
Token C: 22%
Token D: 11%

Typical range: 0.1-2.0 (commonly 0.7-1.3)

Recommended starting value: 1.0 (neutral)
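
The scaling is literally one division before the softmax, as this runnable sketch shows with four illustrative candidate tokens:

```python
# Temperature reshapes the distribution: divide logits by T before softmax.
import torch

logits = torch.tensor([2.0, 1.7, 1.3, 0.6])  # four candidate tokens

for temperature in (0.5, 1.0, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs)
# Low T sharpens toward the top token; high T flattens toward uniform.
```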

📘

Temperature intuition

Think of temperature like confidence vs. exploration:

Low temperature (0.3-0.7): "Play it safe, use most likely words"

Medium temperature (0.8-1.2): "Balance safety with some variation"

High temperature (1.3-2.0): "Be creative, try different approaches"

Choosing temperature:

Low temperature (0.3-0.7)

Focused, deterministic outputs:

Use for:

  • Factual question answering
  • Structured output generation
  • Consistency critical tasks
  • Bounding box generation

Characteristics:

  • Very predictable outputs
  • High consistency
  • Low diversity
  • Risk of repetition with very low values

Example use case: Counting objects in images:

  • Temperature 0.5
  • Consistent "There are X objects" format
  • Minimal phrasing variation
  • Focus on accuracy

Recommended values:

  • VQA (factual): 0.7
  • Object counting: 0.5
  • Structured outputs: 0.6

Medium temperature (0.8-1.2)

Balanced outputs (recommended for most tasks):

Use for:

  • Standard VLM tasks
  • Phrase grounding
  • General image understanding
  • Most production scenarios

Characteristics:

  • Good balance of quality and diversity
  • Natural-sounding outputs
  • Appropriate variation
  • Reliable performance

Example use case: Phrase grounding captions:

  • Temperature 1.0
  • Varied but accurate descriptions
  • Natural language flow
  • Consistent quality

Recommended values:

  • Phrase grounding: 1.0
  • VQA (descriptive): 1.0
  • General tasks: 0.9-1.1

High temperature (1.3-2.0)

Creative, diverse outputs:

Use for:

  • Creative image descriptions
  • Multiple phrasing alternatives
  • Exploring diverse generations
  • Reducing repetition

Characteristics:

  • High output diversity
  • Creative language use
  • Less predictable
  • Risk of incoherence if too high

Example use case: Creative scene descriptions:

  • Temperature 1.5
  • Varied vocabulary and phrasing
  • Multiple perspectives
  • Interesting but potentially less consistent

Recommended values:

  • Creative descriptions: 1.3-1.5
  • Diverse generations: 1.4
  • Maximum diversity: 1.6-1.8

Caution: Values above 1.5 may reduce quality

Temperature interactions:

Combining with Top K and Top P

Temperature works with Top K and Top P:

Processing order:

  1. Temperature: Reshapes probability distribution
  2. Top K: Limits to K most probable tokens
  3. Top P: Limits to cumulative probability threshold
  4. Sample: Choose next token from remaining options

Recommended combinations:

For factual tasks:

Temperature: 0.7 (focused)
Top K: 50 (moderate limit)
Top P: 0.90 (focused sampling)
→ Deterministic, accurate outputs

For balanced tasks:

Temperature: 1.0 (neutral)
Top K: 50 (moderate limit)
Top P: 0.95 (balanced sampling)
→ Natural, reliable outputs

For creative tasks:

Temperature: 1.3 (creative)
Top K: 70 (wider limit)
Top P: 0.95 (diverse sampling)
→ Varied, interesting outputs
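
If your inference path goes through a Hugging Face-style `generate()` call, the three presets map onto standard sampling kwargs; `model` and `inputs` are placeholders, and sampling must be enabled explicitly.

```python
# The three presets as sampling kwargs (placeholder model/inputs).
factual  = dict(do_sample=True, temperature=0.7, top_k=50, top_p=0.90)
balanced = dict(do_sample=True, temperature=1.0, top_k=50, top_p=0.95)
creative = dict(do_sample=True, temperature=1.3, top_k=70, top_p=0.95)

output_ids = model.generate(**inputs, max_new_tokens=512, **balanced)
```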

Sampling repetition penalty

Reduces the likelihood of repeating tokens that have already been generated. This penalty helps create more diverse and natural-sounding outputs by discouraging repetitive patterns.

How it works:

After generating each token:

  1. Track which tokens have been used
  2. Reduce probability of already-used tokens
  3. Higher penalty = stronger discouragement

Typical range: 1.0-2.0

  • 1.0: No penalty (default behavior)
  • 1.05: Gentle penalty
  • 1.1: Moderate penalty (recommended)
  • 1.3: Strong penalty
  • 1.5+: Very strong penalty (may hurt coherence)

Recommended starting value: 1.05

📘

How repetition penalty works

Example with penalty 1.2:

If "car" was already generated:

  • Original probability: 10%
  • After penalty: 10% ÷ 1.2 = 8.3%

Repeated use further reduces probability:

  • Second use: 8.3% ÷ 1.2 = 6.9%
  • Third use: 6.9% ÷ 1.2 = 5.8%

The model increasingly favors alternative words.
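
The divisive penalty above can be written as a small standalone function (the common CTRL-style formulation):

```python
# Divisive repetition penalty over a logits vector.
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    for token_id in set(generated_ids):
        score = logits[token_id]
        # Divide positive scores, multiply negative ones, so a seen token
        # always becomes less likely regardless of sign.
        logits[token_id] = score / penalty if score > 0 else score * penalty
    return logits
```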

Choosing repetition penalty:

Low penalty (1.0-1.1)

Minimal repetition discouragement:

Use for:

  • Natural language where repetition is acceptable
  • Technical descriptions requiring specific terminology
  • Structured outputs with repeated elements
  • Default starting point

Value 1.0 (no penalty):

  • Natural repetition patterns
  • May repeat common words naturally
  • Good for most standard tasks

Value 1.05 (gentle penalty):

  • Slight preference for variety
  • Maintains natural language flow
  • Recommended default for most VLM tasks

Example: Phrase grounding captions naturally repeat:

  • "The red car next to a blue car" (car repeated appropriately)
  • Penalty 1.05 allows this natural repetition

Moderate penalty (1.1-1.3)

Balanced repetition control (recommended):

Use for:

  • Long-form descriptions
  • Reducing noticeable repetition
  • Creative text generation
  • When repetition becomes problematic

Value 1.1:

  • Good balance for most use cases
  • Reduces obvious repetition
  • Maintains coherence

Value 1.2:

  • Stronger variety encouragement
  • For longer outputs
  • When 1.1 shows too much repetition

Example: Without penalty: "The image shows a person wearing a hat. The person is standing next to another person. Each person has a bag."

With penalty 1.2: "The image shows a person wearing a hat, standing next to someone else. Both individuals carry bags." → More varied vocabulary

High penalty (1.3+)

Strong repetition avoidance:

Use for:

  • Creative writing scenarios
  • Extreme repetition problems
  • Experimental settings

Caution:

  • May force unnatural word choices
  • Can reduce coherence
  • Might avoid necessary repetitions
  • Use only when repetition is severe

Value 1.5:

  • Very strong penalty
  • Significantly alters word choice
  • Risk of awkward phrasing

Example: With very high penalty: "The car is red" → hard to repeat "car" → forced to use "vehicle", "automobile", "auto" even when "car" is most natural

Generally not recommended unless repetition is extreme.

Common scenarios:

Phrase grounding outputs

Recommended: 1.05 (gentle)

Phrase grounding naturally involves:

  • Repeating object names in groundings
  • Similar phrasing across detections
  • Structured JSON format

Why gentle penalty:

  • Allow natural terminology repetition
  • Maintain accurate object references
  • Preserve structured output format

Example:

{
  "groundings": [
    {"phrase": "a red car", ...},
    {"phrase": "a blue car", ...},
    {"phrase": "the parked cars", ...}
  ]
}

→ "car" repeated appropriately

Visual question answering

Recommended: 1.05-1.1

VQA responses are typically:

  • Short (1-3 sentences)
  • Focused answers
  • Limited repetition risk

Why light penalty:

  • Short outputs have less repetition
  • Focus on accuracy over variety
  • Natural answer patterns acceptable

Example question: "What color are the cars?"

Answer with 1.05: "There are two cars: a red car and a blue car." → Natural repetition of "car" acceptable

Long descriptions

Recommended: 1.1-1.2

Longer outputs risk more repetition:

  • Multiple sentences
  • Describing many objects
  • Detailed scene understanding

Why moderate penalty:

  • Encourages vocabulary variety
  • Maintains natural flow
  • Prevents monotonous phrasing

Example without penalty: "The scene shows a person in a red shirt. Next to the person is another person in a blue shirt. Behind these two people is a third person."

Example with penalty 1.15: "The scene shows a person in a red shirt, beside someone wearing blue. A third individual stands behind them." → More varied phrasing


Best practices

Start with recommended defaults

Recommended starting configuration:

Model Options:

  • Architecture size: Medium (7-13B) or small (1-3B) for testing
  • Training mode: LoRA Training
  • Quantization: NF4
  • Precision type: BFloat16

Hyperparameters:

  • Epochs: 100 (adjust based on dataset size)
  • Learning rate: 0.0001 (for LoRA Training)
  • Batch size: 4 (adjust for your GPU)
  • Gradient accumulation: 1
  • Optimizer: AdamW

Evaluation:

  • Max new tokens: 512
  • Top K: 50
  • Top P: 0.95
  • Temperature: 1.0
  • Repetition penalty: 1.05

Why these defaults:

  • Balanced across speed, quality, and resource usage
  • Work well for most VLM tasks
  • Safe starting point for experimentation
  • Proven effective in production

Iterate based on results

Systematic tuning approach:

  1. Start with defaults (above configuration)
  2. Train initial model and evaluate performance
  3. Identify issues from training behavior:
    • Slow convergence → adjust learning rate
    • Memory errors → reduce batch size or enable quantization
    • Poor quality → try larger model
    • Overfitting → reduce epochs or add regularization
  4. Change one thing at a time for clear cause-effect
  5. Compare results across runs using evaluation metrics
  6. Refine iteratively until satisfactory performance

Example iteration:

Run 1 (defaults):

  • Result: Good quality but slow convergence

Run 2 (adjust learning rate):

  • Learning rate: 0.0001 → 0.0002
  • Result: Faster convergence, similar quality

Run 3 (adjust batch size):

  • Batch size: 4 → 8
  • Result: Stable training, faster per-epoch time

Run 4 (final tuning):

  • Epochs: 100 → 75 (converged earlier)
  • Result: Optimal configuration found

Match settings to your resources

For limited GPU memory (8-16GB):

Architecture: Small (1-3B)
Training mode: LoRA Training
Quantization: NF4
Precision: BFloat16
Batch size: 1-2
Gradient accumulation: 4-8

For standard GPU (24-32GB):

Architecture: Medium (7-13B)
Training mode: LoRA Training
Quantization: NF4
Precision: BFloat16
Batch size: 4-8
Gradient accumulation: 1-2

For high-end GPU (40-80GB):

Architecture: Medium to Large
Training mode: LoRA Training or Full Finetuning (SFT)
Quantization: NF4 or FP4
Precision: BFloat16
Batch size: 8-16
Gradient accumulation: 1

Monitor training behavior

Watch for these patterns:

Loss curves:

  • Smooth decrease: Good configuration ✓
  • Erratic spikes: Learning rate too high
  • Plateau early: Learning rate too low or model capacity insufficient
  • Overfitting: Training loss decreases but validation increases

Memory usage:

  • Consistent 70-80% GPU utilization: Optimal ✓
  • Frequent OOM errors: Reduce batch size or enable quantization
  • Low utilization (<50%): Can increase batch size

Training speed:

  • Consistent step times: Good ✓
  • Increasing step times: Memory/swap issues
  • Very slow: Consider larger batch size or better GPU

Learn more about monitoring runs →

Document your configurations

Track successful configurations:

Keep records of:

  • Which settings worked for which tasks
  • Resource requirements (GPU memory, training time)
  • Performance metrics achieved
  • Issues encountered and solutions

Naming convention for workflows:

[Task]-[Model Size]-[Key Settings]-v[Number]

Examples:
- "DefectDetection-7B-LoRA-NF4-v1"
- "ProductRecog-13B-LoRA-FastConv-v2"
- "QualityControl-3B-FullFT-HighAcc-v1"

Benefits:

  • Quickly identify successful configurations
  • Share settings with team members
  • Reproduce results reliably
  • Track iteration progress

Common questions

Which settings have the biggest impact on training quality?

Most critical settings for quality:

  1. Model architecture size (35% impact)

    • Larger models generally achieve higher quality
    • Most important factor for capacity
  2. Learning rate (25% impact)

    • Correct learning rate ensures convergence
    • Too high/low severely impacts results
  3. Epochs (20% impact)

    • Sufficient epochs required to converge
    • More isn't always better (overfitting risk)
  4. Dataset quality and size (not a setting, but 40% impact)

    • High-quality annotations critical
    • More data generally helps

Less critical for quality (but affect speed/resources):

  • Training mode (LoRA Training vs Full Finetuning): Similar quality
  • Quantization: Minimal quality impact
  • Batch size: Affects stability more than final quality
  • Evaluation settings: Only affect inference, not training

My training is too slow. What should I change?

Speed optimization strategies:

1. Reduce model size:

  • Large (20-34B) → Medium (7-13B): 2-3x faster
  • Medium (7-13B) → Small (1-3B): 2-4x faster

2. Enable optimizations:

  • Use LoRA Training instead of Full Finetuning (SFT): 2-3x faster
  • Use NF4 quantization: 1.5-2x faster
  • Use BFloat16 precision: 2x faster vs Float32

3. Increase batch size:

  • Batch 2 → Batch 4: ~1.5x faster
  • Batch 4 → Batch 8: ~1.3x faster
  • Limited by GPU memory

4. Reduce epochs:

  • Monitor for early convergence
  • Stop when validation metrics plateau
  • May be training longer than needed

5. Reduce gradient accumulation:

  • Only if memory allows
  • Accumulation adds overhead

Typical speedup example:

  • Before: Large model, Full Finetuning (SFT), batch 2 → 10 hours
  • After: Medium model, LoRA Training, NF4, batch 4 → 2-3 hours
  • ~3-4x total speedup with minimal quality impact

How do I know if my settings are good?

Indicators of good configuration:

Training behavior:

  • Loss decreases smoothly
  • No frequent spikes or NaN values
  • Reasonable training speed
  • GPU utilization 70-90%
  • No out-of-memory errors

Results quality:

  • Validation metrics improve over training
  • Test set performance meets requirements
  • Generated outputs are coherent and accurate
  • Model generalizes to new examples

Resource usage:

  • Training completes in acceptable time
  • GPU memory usage stable
  • Costs within budget

Comparison approach:

  1. Train with default settings (baseline)
  2. Evaluate quality metrics
  3. Adjust one setting at a time
  4. Compare metrics to baseline
  5. Keep changes that improve results

Learn about model evaluation →

Should I use the same settings for all my projects?

Start similar, then customize:

Reusable baseline:

  • Core settings (LoRA Training, NF4, AdamW) work across projects
  • Training approach generalizes well
  • Resource optimizations apply universally

Project-specific tuning:

  • Epochs: Depends on dataset size
  • Learning rate: May need adjustment per task
  • Batch size: Constrained by your GPU
  • Model size: Depends on task complexity

Best practice:

  1. Create a "baseline" workflow with proven settings
  2. Clone it for new projects
  3. Adjust task-specific parameters:
    • System prompt
    • Dataset
    • Epochs based on data size
  4. Fine-tune if baseline doesn't perform well

Example:

Baseline workflow:

  • 7B model, LoRA Training, NF4, BFloat16
  • Batch 4, AdamW, LR 0.0001
  • Works for most detection tasks

Project A (100 images):

  • Clone baseline
  • Adjust: Epochs 150 (small dataset)

Project B (1000 images):

  • Clone baseline
  • Adjust: Epochs 75 (large dataset)
  • Adjust: Batch 8 (more memory available)

When should I use Full Finetuning (SFT) instead of LoRA Training?

Use Full Finetuning (SFT) when:

  1. Domain is drastically different:

    • Medical/microscopy images (if base model trained on natural images)
    • Satellite/aerial imagery
    • Specialized visual domains
  2. Maximum quality is critical:

    • Production systems with strict accuracy requirements
    • Research requiring state-of-the-art results
    • When LoRA Training results are insufficient
  3. Resources are abundant:

    • Access to high-end GPUs (A100, H100)
    • Training time is not a constraint
    • Budget allows for higher compute costs

Stick with LoRA Training when:

  1. Standard computer vision tasks:

    • Object detection
    • Classification
    • Quality control
    • Most production applications
  2. Limited resources:

    • Consumer GPUs
    • Time constraints
    • Budget limitations
  3. Iterating quickly:

    • Development phase
    • Prototyping
    • A/B testing different approaches

Reality: 90%+ of use cases work excellently with LoRA Training. Try LoRA Training first, only switch to Full Finetuning (SFT) if results are insufficient and you have the resources.

My model generates repetitive outputs. How do I fix this?

Solutions ranked by effectiveness:

1. Increase repetition penalty (first try):

  • Current: 1.05 → Try: 1.1 or 1.15
  • Directly addresses repetition
  • Usually most effective solution

2. Adjust temperature:

  • Current: 1.0 → Try: 1.2
  • Increases output diversity
  • More creative word choices

3. Increase Top P:

  • Current: 0.90 → Try: 0.95
  • Allows more token variety
  • Broader vocabulary usage

4. Increase Top K:

  • Current: 50 → Try: 70-100
  • Widens sampling pool
  • More diverse token selection

5. Training-level solutions (if inference changes don't help):

  • Increase dataset diversity
  • Add more varied training examples
  • Adjust system prompt to encourage variety

Typical fix:

Before:
- Temperature: 1.0
- Repetition penalty: 1.05
- Result: "The car is red. The car is large. The car is parked."

After:
- Temperature: 1.1
- Repetition penalty: 1.15
- Result: "The vehicle is red and large, parked near the building."

Try adjustments incrementally—don't change all at once.

What's the difference between batch size and gradient accumulation?

Both affect effective batch size, but differently:

Batch size:

  • Number of images processed simultaneously
  • Limited by GPU memory
  • Higher = faster training (better GPU utilization)
  • Each batch computes gradients, model updates immediately

Gradient accumulation:

  • Number of batches before updating model
  • Not limited by GPU memory
  • Higher = simulates larger batches without extra memory
  • Gradients accumulated across batches, then single update

Example comparison:

Configuration A:

  • Batch size: 8
  • Gradient accumulation: 1
  • Effective batch: 8
  • Updates per epoch: dataset_size / 8
  • Speed: Fast (8 images at once)
  • Memory: High (8 images in GPU)

Configuration B:

  • Batch size: 2
  • Gradient accumulation: 4
  • Effective batch: 8 (same as A)
  • Updates per epoch: dataset_size / 8 (same as A)
  • Speed: Slower (only 2 images at once, but 4x more forward passes)
  • Memory: Low (only 2 images in GPU)

When to use which:

Prefer higher batch size (less accumulation):

  • When GPU memory allows
  • For faster training
  • Simpler configuration

Use gradient accumulation:

  • When GPU memory is limited
  • To simulate larger batches than GPU allows
  • To improve training stability without upgrading GPU

Best practice: Use largest batch size that fits in memory, then add accumulation if you need larger effective batch sizes.


Next steps

After configuring your model settings:

  • Continue workflow configuration
  • Start training
  • Optimize performance

