Model Settings
Configure model architecture, training parameters, and inference settings for optimal VLM performance.
Model settings
Model settings define the architecture, training approach, and inference behavior of your VLM. Proper configuration of these settings directly impacts training speed, memory usage, model quality, and inference performance.
Looking for a quick start?
For a streamlined workflow setup with recommended default settings, see Quickstart: Create a workflow.
Prerequisites
Before configuring model settings, ensure you have:
- An existing training project
- A configured dataset with train-test split
- Basic understanding of your computational resources and training goals
Understanding model settings
Model settings are organized into three main categories:
- Model Options — Architecture size, training mode, and memory optimization
- Hyperparameters — Training behavior and convergence settings
- Evaluation — Inference behavior and output generation
These settings work together to define:
- Training efficiency — How fast your model trains and how much memory it requires
- Model capacity — The model's ability to learn complex patterns
- Convergence behavior — How the model improves during training
- Inference quality — Output diversity, length, and coherence
Model options
Access model configuration by clicking the Model node in the workflow canvas.
Model architecture size
The number of parameters in the neural network, measured in billions (B). This fundamental setting determines the model's capacity to learn and represent complex patterns.
Available sizes:
| Size | Parameters | Best for | Memory requirements |
|---|---|---|---|
| Small | 1-3B | Quick experiments, limited resources | 8-16 GB GPU |
| Medium | 7-13B | Standard production use cases | 16-32 GB GPU |
| Large | 20-34B | Complex tasks, high accuracy needs | 40-80 GB GPU |
How it impacts training:
- Larger models have more capacity to learn complex patterns and nuanced relationships
- Smaller models train faster and require less computational resources
- Memory usage scales roughly linearly with parameter count
Choosing the right size
Start with smaller models for initial experiments to validate your approach quickly. Scale up to larger models when you need higher accuracy or have proven the concept works with your data.
When to use each size:
Small models (1-3B parameters)
Best for:
- Initial prototyping and experimentation
- Limited GPU resources (single consumer GPU)
- Fast iteration during development
- Simple tasks with clear visual patterns
- Real-time inference requirements
Tradeoffs:
- Lower capacity for complex reasoning
- May struggle with subtle distinctions
- Faster training and inference
- Lower memory and compute costs
Example use cases:
- Binary classification (good/defective)
- Single object detection
- Simple quality control
Medium models (7-13B parameters)
Best for:
- Production deployments
- Most standard computer vision tasks
- Multi-class detection and classification
- Balanced performance and resource usage
Tradeoffs:
- Good balance of accuracy and efficiency
- Reasonable training times
- Moderate GPU requirements
- Suitable for most use cases
Example use cases:
- Multi-object detection
- Complex defect classification
- Retail product recognition
- General-purpose visual understanding
Large models (20-34B parameters)
Best for:
- Maximum accuracy requirements
- Complex reasoning tasks
- Fine-grained distinctions
- Production systems with ample resources
Tradeoffs:
- Highest accuracy potential
- Significantly longer training times
- High GPU memory requirements
- May require distributed training
Example use cases:
- Medical image analysis
- Detailed inspection tasks
- Open-ended visual reasoning
- Challenges requiring nuanced understanding
Training mode
Training mode determines how the model's parameters are updated during training. This critical choice affects training speed, memory usage, and model flexibility.
Available modes:
LoRA Training
Trains only small adapter layers while keeping the base model frozen.
- How it works: Inserts small trainable layers (Low-Rank Adaptation) into the frozen base model
- Memory requirements: Low (only adapter gradients stored)
- Training time: Faster (fewer parameters to update)
- Flexibility: Good for most use cases, may be less flexible for drastically different tasks
Full Finetuning (SFT)
Updates all model parameters during training using Supervised Fine-Tuning.
- How it works: Every layer in the neural network is adjusted based on your training data
- Memory requirements: High (requires storing gradients for all parameters)
- Training time: Longer (more parameters to update)
- Flexibility: Maximum adaptation to your specific task
Recommendation: Start with LoRA Training
LoRA Training is recommended for most use cases as it offers 2-3x faster training with significantly lower memory usage while maintaining comparable quality to Full Finetuning.
Comparison:
| Aspect | LoRA Training | Full Finetuning (SFT) |
|---|---|---|
| Memory usage | Low (3-5x reduction) | High |
| Training speed | Faster (2-3x speedup) | Slower |
| Adaptation flexibility | Good | Maximum |
| Best for | Most standard use cases | Drastically different tasks |
| GPU requirements | Consumer GPU sufficient | High-end GPU required |
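For readers who want to see what a LoRA setup looks like in code, here is a minimal sketch using the Hugging Face peft library. It shows one common way to attach LoRA adapters to a frozen base model; it is not necessarily how the platform implements LoRA Training internally, and the model ID, rank, and target modules are illustrative placeholders.

```python
# Illustrative LoRA setup with Hugging Face peft (placeholder model ID and values).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("your-base-vlm")  # placeholder

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                         # scaling applied to adapter updates
    target_modules=["q_proj", "v_proj"],   # which attention projections get adapters
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)  # base weights stay frozen
model.print_trainable_parameters()               # only the adapters are trainable
```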
When to use LoRA Training:
Standard use cases (recommended)
Use for most production applications:
- Standard computer vision tasks (detection, classification)
- Limited GPU resources
- Faster iteration during development
- Domain adaptation (e.g., retail → manufacturing)
- Cost-effective training
When to use Full Finetuning (SFT):
Maximum adaptation required
Use when your task is significantly different from the base model's training:
- Highly specialized domain (e.g., microscopy, satellite imagery)
- Novel visual patterns not seen in general training
- Maximum accuracy is critical and resources are available
- Task requires fundamental changes to feature extraction
Quantization
Quantization reduces model precision to save memory, enabling training of larger models on limited GPU resources. This technique uses lower-bit representations for model weights and activations.
Available formats:
NF4 (Normalized Float 4)
4-bit format optimized specifically for neural networks with normalized value distribution.
- Memory savings: ~4x reduction compared to full precision
- Quality: Excellent preservation of model quality
- Best for: Neural network training (recommended default)
- Designed for: Transformer models and VLMs
FP4 (4-bit Floating Point)
Standard 4-bit floating point quantization for weights.
- Memory savings: ~4x reduction compared to full precision
- Quality: Very good
- Best for: General quantization, compatibility scenarios
- Use when: NF4 has compatibility issues or specific FP4 requirements
Recommendation: NF4
NF4 is recommended for VLM training as it's specifically optimized for transformer models, providing better quality preservation than FP4 with the same memory savings.
How quantization works
Quantization represents model weights using 4 bits instead of 16-bit precision, reducing weight memory usage by approximately 4x (and even more versus 32-bit). This enables training larger models or using larger batch sizes with the same GPU memory.
Quantization formats compared:
| Format | Memory savings | Quality | Best for |
|---|---|---|---|
| NF4 | ~4x reduction | Excellent | Neural network training (recommended) |
| FP4 | ~4x reduction | Very good | General quantization, compatibility |
Impact example:
Without quantization (16-bit):
- 13B model: ~40 GB GPU memory
With NF4 quantization (4-bit):
- 13B model: ~12 GB GPU memory
- 3-4x memory reduction
- Minimal quality loss
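As a rough cross-check of the figures above, the sketch below estimates weight-only memory for a 13B model at different bit widths. It is a back-of-the-envelope calculation; activations, adapter gradients, and framework overhead account for the rest of the observed GPU usage.

```python
# Back-of-the-envelope weight memory for a 13B model at different precisions.
# Weights only: activations, adapter gradients, and framework overhead add more.
def weight_memory_gb(num_params_billions: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return num_params_billions * 1e9 * bytes_per_param / 1024**3

print(f"16-bit: ~{weight_memory_gb(13, 16):.0f} GB")  # ~24 GB of weights
print(f" 4-bit: ~{weight_memory_gb(13, 4):.0f} GB")   # ~6 GB of weights
```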
When to use each format:
NF4 (recommended)
Use for most scenarios:
- Standard VLM training workflows
- Production model training
- When you want best quality with quantization
- Recommended default choice
Advantages:
- Optimized value distribution for neural network weights
- Better preservation of model quality than FP4
- Specifically designed for transformer architectures
- Minimal accuracy loss compared to full precision
- Industry best practice for VLM training
FP4
Use when:
- Compatibility issues with NF4
- Debugging quantization-related issues
- Specific requirements for standard floating point format
- Legacy configurations
Note: For most use cases, NF4 is preferred as it provides better quality with the same memory savings.
Precision type
The numerical format used for calculations during training. This setting balances computation speed, memory usage, and numerical stability.
Available options:
BFloat16
- Precision: 16-bit brain floating point format
- Accuracy: Better numerical stability than Float16 for large models
- Speed: ~2x faster than Float32
- Memory: 2x reduction vs Float32
- Use case: Preferred for training large models (recommended)
Float16
- Precision: 16-bit floating point
- Accuracy: Good for most use cases
- Speed: ~2x faster than Float32
- Memory: 2x reduction vs Float32
- Use case: Standard training with older GPUs
Float32
- Precision: 32-bit floating point
- Accuracy: Highest numerical precision
- Speed: Slowest (baseline)
- Memory: Highest usage
- Use case: Debugging numerical issues, research requiring maximum precision
Recommendation: BFloat16
BFloat16 is recommended for VLM training as it provides the speed and memory benefits of 16-bit precision with better numerical stability than Float16, especially important for large models.
Comparison table:
| Precision | Speed | Memory | Stability | Modern GPU support |
|---|---|---|---|---|
| BFloat16 | Fast (2x) | Low | Excellent | NVIDIA Ampere+, AMD MI200+ |
| Float16 | Fast (2x) | Low | Good | Universal |
| Float32 | Baseline (1x) | High | Excellent | Universal |
When to use each precision:
BFloat16 (recommended)
Best for most scenarios:
- Modern GPUs (NVIDIA A100, H100, RTX 30/40 series)
- Training large models (7B+ parameters)
- Balanced speed and stability
- Production training workflows
Why it's better:
- Preserves Float32's exponent range (better for extreme values)
- Reduces gradient underflow/overflow issues
- Widely supported in modern ML frameworks
- Minimal accuracy loss compared to Float32
Float16
Use when:
- Older GPUs without BFloat16 support
- Maximum speed is critical
- Working with smaller models (<7B parameters)
Considerations:
- May encounter numerical instability with very large or small gradients
- Requires gradient scaling for stability
- Slightly more prone to training issues than BFloat16
Float32
Use when:
- Debugging numerical stability issues
- Research requiring maximum precision
- Training is unstable with lower precision
- GPU memory is not a constraint
Tradeoffs:
- 2x slower than Float16/BFloat16
- 2x more memory usage
- Rarely necessary for production training
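To make the precision choice concrete, here is a minimal mixed-precision step in PyTorch, assuming a CUDA GPU and PyTorch are available. It illustrates the general technique, not the platform's internal training code; note that BFloat16 does not need a gradient scaler, which is part of why it is the recommended 16-bit format.

```python
# Minimal BFloat16 mixed-precision step in PyTorch (assumes a CUDA GPU).
import torch

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, y = torch.randn(4, 8, device="cuda"), torch.randn(4, 1, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)   # forward computed in BF16
loss.backward()        # no GradScaler needed with BF16 (unlike FP16)
optimizer.step()
```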
Hyperparameters
Hyperparameters control the training process dynamics—how the model learns from your data and how quickly it converges.
Epochs
The number of complete passes through your entire training dataset. Each epoch represents one full cycle of training where the model sees every training image once.
How it works:
1 epoch = model sees all training images once
10 epochs = model sees all training images 10 times
Example calculation:
- 100 training images
- Batch size of 2
- 1 epoch = 50 training steps (100 images ÷ 2 per batch)
- 10 epochs = 500 total training steps
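The same arithmetic applies to any dataset; a quick sketch:

```python
# Steps-per-epoch and total-steps arithmetic from the example above.
import math

def total_training_steps(num_images: int, batch_size: int, epochs: int) -> int:
    steps_per_epoch = math.ceil(num_images / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

print(total_training_steps(num_images=100, batch_size=2, epochs=10))  # 500
```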
Choosing the right number:
Recommended epoch ranges
Small datasets (<100 images):
- Recommended: 100-300 epochs
- Reasoning: More passes needed to learn from limited data
- Watch for: Overfitting after ~200 epochs
Medium datasets (100-1000 images):
- Recommended: 50-150 epochs
- Reasoning: Balanced learning with sufficient data
- Watch for: Convergence plateau around 100 epochs
Large datasets (1000+ images):
- Recommended: 20-100 epochs
- Reasoning: Fewer passes needed with abundant data
- Watch for: Diminishing returns after 50-75 epochs
Signs you need more epochs
Increase epochs when:
- Training loss is still decreasing steadily
- Validation metrics improving each epoch
- Model hasn't converged yet
- Early in experimentation phase
Example: Training loss: epoch 50 = 0.45, epoch 100 = 0.32, epoch 150 = 0.28
- Still improving → continue training
Signs you need fewer epochs
Reduce epochs when:
- Training loss plateaus early
- Validation performance stops improving or degrades (overfitting)
- Training time is excessive
- Model converges quickly
Example: Training loss: epoch 30 = 0.25, epoch 60 = 0.24, epoch 90 = 0.24
- Converged at epoch 30 → reduce to 50 epochs
Pro tip: Use early stopping
Monitor validation metrics during training. If performance stops improving for 10-20 epochs, training can often be stopped early. Create multiple runs with different epoch counts to find the optimal value for your dataset.
Learning rate
Learning rate controls how much the model's parameters change with each training step. It's one of the most critical hyperparameters affecting training success.
How it works:
- High learning rate: Large parameter updates, faster initial learning, risk of instability
- Low learning rate: Small parameter updates, stable but slow learning, may get stuck
- Optimal learning rate: Balances speed and stability for efficient convergence
Typical range: 0.00001 (1e-5) to 0.0001 (1e-4)
Recommended starting values:
| Model size | LoRA Training | Full Finetuning (SFT) |
|---|---|---|
| Small (1-3B) | 0.0001 | 0.00005 |
| Medium (7-13B) | 0.0001 | 0.00003 |
| Large (20-34B) | 0.00005 | 0.00001 |
Learning rate significantly impacts training
Too high causes training instability and divergence. Too low results in slow training or getting stuck in poor solutions. Start with recommended values and adjust based on training behavior.
Tuning the learning rate:
Learning rate too high (signs and fixes)
Signs:
- Training loss increases or oscillates wildly
- Loss suddenly spikes to very large values
- Model predictions become nonsensical
- Training diverges or produces NaN values
Example:
Epoch 1: loss = 2.3
Epoch 2: loss = 1.8
Epoch 3: loss = 5.7 ← Spike indicates too high
Epoch 4: loss = NaN ← Training diverged
Fix:
- Reduce learning rate by 2-5x (e.g., 0.0001 → 0.00002)
- Start a new training run with adjusted rate
- Consider using smaller batch size for more stable gradients
Learning rate too low (signs and fixes)
Signs:
- Training loss decreases very slowly
- Progress stalls at high loss values
- Training takes excessively long
- Model underfits the data
Example:
Epoch 10: loss = 2.1
Epoch 20: loss = 2.08
Epoch 30: loss = 2.06 ← Very slow improvement
Epoch 40: loss = 2.04
Fix:
- Increase learning rate by 2-3x (e.g., 0.00001 → 0.00003)
- Start a new training run with adjusted rate
- Monitor closely to ensure stability
Learning rate just right (what to expect)
Good training behavior:
- Loss decreases steadily without large spikes
- Occasional small fluctuations are normal
- Converges within expected number of epochs
- Validation metrics improve consistently
Example:
Epoch 5: loss = 2.1
Epoch 10: loss = 1.6
Epoch 15: loss = 1.3 ← Steady improvement
Epoch 20: loss = 1.1
Epoch 25: loss = 0.95
Characteristics:
- Smooth loss curve with minor noise
- Clear downward trend
- No sudden spikes or divergence
- Validation performance tracks training improvement
Learning rate schedules:
Advanced: Learning rate scheduling
Learning rate schedules automatically adjust the learning rate during training:
Common schedules:
- Constant (default):
  - Same rate throughout training
  - Simple and predictable
  - Good for most use cases
- Linear decay:
  - Gradually reduces rate over training
  - Helps fine-tune convergence at the end
  - Useful for long training runs
- Cosine annealing:
  - Reduces rate following cosine curve
  - Smoother decay than linear
  - Popular for large model training
When to use schedules:
- Long training runs (100+ epochs)
- Fine-tuning pre-trained models
- Seeking optimal convergence
- Advanced optimization scenarios
Note: Most training workflows work well with constant learning rate. Schedules are an advanced optimization technique.
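If you want to see what these schedules do numerically, the sketch below expresses each one as a plain function of training progress (0.0 at the start, 1.0 at the end). It illustrates the math only and does not correspond to a specific framework's scheduler API.

```python
# Learning-rate schedules as functions of training progress (0.0 → 1.0).
import math

def constant_lr(base_lr: float, progress: float) -> float:
    return base_lr

def linear_decay_lr(base_lr: float, progress: float) -> float:
    return base_lr * (1.0 - progress)

def cosine_annealing_lr(base_lr: float, progress: float, min_lr: float = 0.0) -> float:
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for p in (0.0, 0.5, 1.0):
    print(f"{p:.1f}: {cosine_annealing_lr(1e-4, p):.1e}")  # 1.0e-04, 5.0e-05, 0.0e+00
```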
Batch size
The number of images processed simultaneously in each training step. Batch size affects training speed, memory usage, and model convergence.
How it works:
Batch size 2: Process 2 images → compute gradients → update model
Batch size 4: Process 4 images → compute gradients → update model
Batch size 8: Process 8 images → compute gradients → update model
Example:
- 100 training images, batch size 4
- Steps per epoch: 100 ÷ 4 = 25 steps
- 10 epochs = 250 total training steps
Typical range: 1-16 for VLM training (smaller models on high-memory GPUs can go higher)
Recommended starting values:
| Model size | GPU memory 16GB | GPU memory 24GB | GPU memory 40GB+ |
|---|---|---|---|
| Small (1-3B) | 8 | 16 | 32 |
| Medium (7-13B) | 2-4 | 4-8 | 8-16 |
| Large (20-34B) | 1-2 | 2-4 | 4-8 |
Batch size tradeoffs
Larger batch sizes:
- Faster training (better GPU utilization)
- More stable gradients
- Higher memory usage
- May reduce model generalization
Smaller batch sizes:
- Lower memory requirements
- Better generalization (more noisy gradients)
- Slower training
- Less stable convergence
Choosing batch size:
Memory-constrained training
When GPU memory is limited:
Start with smallest batch size that trains successfully:
- Try batch size 4
- If out of memory, reduce to 2
- If still issues, try batch size 1
- Consider enabling quantization or using gradient accumulation
Memory optimization strategies:
- Enable quantization (NF4)
- Use gradient accumulation to simulate larger batches
- Choose smaller model size
- Reduce precision type to FP16/BF16
Balancing speed and quality
For optimal training:
Use the largest batch size that:
- Fits in GPU memory comfortably (~80% utilization)
- Maintains stable training (no memory errors)
- Provides reasonable training speed
Practical approach:
- Start with recommended value for your GPU
- Increase until you encounter memory issues
- Reduce by 25-50% for safety margin
- Monitor training stability
Example:
- GPU: 24GB VRAM
- Model: 13B parameters with LoRA
- Test batch sizes: 4 → 8 → 16
- Batch size 16 causes OOM errors
- Final choice: batch size 8 (safe maximum)
When batch size matters less
Batch size is less critical when:
- Using gradient accumulation (can simulate larger batches)
- Training small models with ample GPU memory
- Dataset is large (1000+ images)
Batch size is more critical when:
- GPU memory is constrained
- Training very large models
- Dataset is small (gradient noise matters more)
Gradient accumulation steps
The number of forward passes before updating model weights. This technique simulates larger batch sizes without requiring additional GPU memory.
How it works:
Instead of updating the model after each batch, gradients are accumulated over multiple batches:
Batch size 4, accumulation steps 1 (default):
- Process 4 images → update model immediately
- Effective batch size: 4
Batch size 4, accumulation steps 4:
- Process 4 images → accumulate gradients
- Process 4 images → accumulate gradients
- Process 4 images → accumulate gradients
- Process 4 images → accumulate gradients → update model
- Effective batch size: 16 (4 × 4)
Effective batch size formula:
Effective batch size = Batch size × Gradient accumulation steps
Typical values: 1-8 (1 means no accumulation)
Why use gradient accumulation?
Gradient accumulation lets you train with larger effective batch sizes when GPU memory is limited. This improves training stability and gradient quality without requiring more memory.
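The accumulation pattern described above looks like this in a generic PyTorch-style loop. The toy model and random data are stand-ins so the snippet runs on its own, assuming PyTorch is installed.

```python
# Gradient accumulation: accumulate over N small batches, then update once.
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch_size, accumulation_steps = 4, 4            # effective batch size = 16

data = [(torch.randn(batch_size, 8), torch.randn(batch_size, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()       # scale so accumulated grads average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one weight update per accumulated group
        optimizer.zero_grad()
```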
When to use gradient accumulation:
Limited GPU memory
Problem: Want larger batch size but GPU memory is insufficient
Solution: Use gradient accumulation
Example:
- Target effective batch size: 16
- GPU can only handle batch size 4
- Set gradient accumulation steps: 4
- Effective batch size: 4 × 4 = 16 ✓
Tradeoff:
- Training takes longer (4x more forward passes before each update)
- Same memory usage as batch size 4
- Training stability of batch size 16
Improving training stability
Higher effective batch sizes provide:
- More stable gradient estimates
- Smoother loss curves
- Better convergence for large models
- Reduced gradient noise
Recommended combinations:
| Scenario | Batch size | Accumulation steps | Effective batch |
|---|---|---|---|
| Memory constrained | 2 | 4 | 8 |
| Balanced | 4 | 2 | 8 |
| Memory available | 8 | 1 | 8 |
All achieve same effective batch size with different memory/speed tradeoffs.
When NOT to use accumulation
Avoid gradient accumulation when:
- GPU memory is sufficient for desired batch size
- Small datasets where gradient noise aids generalization
- Training is already slow and speed is critical
- Batch size 1-2 is sufficient for your model size
Why avoid unnecessary accumulation:
- Slower training (more forward passes per update)
- Adds complexity without benefit
- Default (accumulation steps = 1) works fine for most cases
Recommended values:
By GPU memory constraint
16GB GPU:
- Batch size 2, accumulation steps 4 (effective: 8)
- Batch size 1, accumulation steps 8 (effective: 8)
24GB GPU:
- Batch size 4, accumulation steps 2 (effective: 8)
- Batch size 2, accumulation steps 4 (effective: 8)
40GB+ GPU:
- Batch size 8, accumulation steps 1 (effective: 8)
- Usually no accumulation needed
Goal: Effective batch size of 8-16 for most VLM training
Optimizer
The optimization algorithm that adjusts the model's parameters to minimize the loss function. The optimizer determines how gradient information is used to update model weights.
Available optimizers:
AdamW
- Full name: Adam with Weight Decay
- How it works: Adaptive learning rates per parameter with decoupled weight regularization
- Best for: Most VLM training scenarios (recommended default)
- Advantages: Better convergence and generalization through decoupled weight decay
Adam
- How it works: Adaptive learning rates per parameter
- Difference from AdamW: Coupled weight decay (less effective regularization)
- Best for: Legacy compatibility, specific research requirements
Recommendation: AdamW
Use AdamW for VLM training as it provides better convergence and generalization than Adam. It's the standard optimizer for training transformer-based models.
Optimizer comparison:
| Optimizer | Convergence speed | Memory usage | Generalization | Use case |
|---|---|---|---|---|
| AdamW | Fast | Moderate | Excellent | Default (recommended) |
| Adam | Fast | Moderate | Good | Legacy/compatibility |
When to use each optimizer:
AdamW (recommended)
Use for:
- All standard VLM training (recommended for 95%+ of cases)
- Production models
- Fine-tuning transformer models
- When you want best practices defaults
Advantages:
- Decoupled weight decay improves regularization
- Adaptive learning rates per parameter
- Proven effective for large language and vision models
- Industry standard for transformer training
- Better generalization than Adam
Why it's better than Adam:
- Separates weight decay from gradient-based updates
- More effective regularization
- Better final model performance
- Widely adopted as best practice
Adam
Use for:
- Reproducing older experiments that used Adam
- Compatibility with existing configurations
- Specific research requirements
Note: AdamW is generally preferred over Adam. The main reason to use Adam is backward compatibility with older experiments or when reproducing specific published results that used Adam.
Difference from AdamW:
- Couples weight decay with gradient updates
- Slightly less effective regularization
- May lead to marginally lower performance
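In code, the choice usually comes down to which optimizer class you instantiate. The PyTorch sketch below shows both: AdamW applies weight decay directly to the weights (decoupled), while Adam folds it into the gradient as an L2 term. The exact classes your training backend uses may differ.

```python
# AdamW vs Adam in PyTorch: same interface, different weight-decay handling.
import torch

model = torch.nn.Linear(8, 1)

adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # decoupled decay
adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)    # L2 added to gradients
```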
Evaluation
Evaluation settings control how the model generates outputs during inference and evaluation. These settings affect response quality, diversity, and computational cost.
Max new tokens
The maximum number of tokens the model can generate in a single response. This setting limits output length to control generation time and computational cost.
How it works:
Tokens are roughly words or subwords:
- "cat" = 1 token
- "running" might be 1-2 tokens
- "The quick brown fox" ≈ 5 tokens
Typical range: 128-1024 tokens
Recommended values:
| Task type | Max new tokens | Reasoning |
|---|---|---|
| Short answers (VQA) | 128-256 | Brief responses sufficient |
| Phrase grounding | 512 | JSON with multiple groundings |
| Detailed descriptions | 512-1024 | Comprehensive captions needed |
Tradeoff: Length vs. Speed
Higher values:
- Allow longer, more detailed responses
- Increase evaluation time per sample
- Higher computational cost
- Risk of repetitive or rambling outputs
Lower values:
- Force concise responses
- Faster evaluation
- Lower computational cost
- May truncate important information
Choosing max new tokens:
For phrase grounding tasks
Recommended: 512 tokens
Phrase grounding outputs include:
- Descriptive caption
- Multiple grounded phrases
- Bounding box coordinates
- JSON structure
Example output size:
{
"phrase_grounding": {
"sentence": "A red car parked next to two people wearing blue shirts on the sidewalk near a green tree",
"groundings": [
{"phrase": "A red car", "grounding": [[120,340,580,670]]},
{"phrase": "two people", "grounding": [[600,280,750,720],[780,290,920,710]]},
{"phrase": "the sidewalk", "grounding": [[0,650,1024,900]]},
...
]
}
}
Typical length: 200-400 tokens
Setting 512 provides comfortable margin without waste.
For visual question answering
Recommended: 128-256 tokens
VQA responses are typically:
- 1-3 sentences
- 20-80 tokens
- Focused and concise
Example responses:
- Short: "There are three dogs in the image" (~8 tokens)
- Medium: "The image shows three golden retrievers playing in a park on a sunny day" (~15 tokens)
- Long: "There are three dogs visible: two golden retrievers playing with a ball in the foreground and one black labrador resting under a tree in the background" (~30 tokens)
Setting 256 allows detailed answers without encouraging verbosity.
For custom tasks
Consider:
- Measure typical output length:
  - Generate sample outputs
  - Count tokens (roughly 1 token per word)
  - Use 1.5-2x the typical length as max
- Balance quality and cost:
  - Longer limits → more flexibility but slower
  - Shorter limits → faster but may truncate
  - Start conservative, increase if needed
- Monitor for truncation:
  - If outputs frequently hit the limit, increase
  - If outputs are always much shorter, decrease
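A simple way to apply the 1.5-2x rule above is to measure sample outputs with a rough word count. The snippet below is a crude heuristic (real token counts depend on the tokenizer), and the sample strings are made up for illustration.

```python
# Crude heuristic for choosing max_new_tokens from sample outputs (~1 token per word).
sample_outputs = [
    "There are three dogs in the image",
    "The image shows two cars parked near a building on a sunny day",
]

approx_tokens = [len(text.split()) for text in sample_outputs]
typical_length = max(approx_tokens)
max_new_tokens = int(typical_length * 2)     # 2x headroom over the longest sample

print(typical_length, max_new_tokens)        # 13 26
```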
Top K results
Limits token sampling to the K most probable next tokens. This setting controls output diversity by restricting the model's choices to the most likely options.
How it works:
At each generation step:
- Model computes probability for all possible next tokens
- Top K keeps only the K most probable tokens
- Model samples from these K tokens
Example:
Top K = 50:
- All tokens ranked by probability
- Keep top 50 most probable
- Sample next token from these 50
- Ignore all other possibilities
Top K = 5:
- Keep only top 5 most probable
- Very focused, deterministic output
- Less diversity
Typical range: 1-100
Recommended starting value: 50
How Top K affects outputs
Lower values (10-30):
- More focused and deterministic
- Less diversity and creativity
- More consistent outputs
- Risk of repetition
Higher values (50-100):
- More diversity and variation
- Less predictable
- Broader vocabulary usage
- May reduce coherence if too high
Choosing Top K:
Deterministic tasks (object detection, VQA)
Recommended: 50
For tasks requiring factual accuracy:
- Object detection
- Bounding box generation
- Answering specific questions
- Structured output generation
Why 50 works well:
- Enough diversity to avoid repetition
- Focused enough for accurate outputs
- Balances creativity and precision
Example: "How many cars are in the image?"
- Lower K: "There are 3 cars" (consistent)
- Higher K: Various phrasings but same count
Creative tasks (descriptions, captions)
Recommended: 70-100
For tasks benefiting from diversity:
- Detailed scene descriptions
- Creative captions
- Varied phrasing
Why higher K helps:
- More vocabulary variety
- Diverse expression styles
- Less repetitive phrasing
Example: "Describe this scene"
- Lower K: More standardized descriptions
- Higher K: More varied and creative language
Interaction with Top P
Top K and Top P work together:
Both limit the sampling pool:
- Top K: "Keep top 50 tokens"
- Top P: "Keep tokens until cumulative probability reaches 0.95"
In practice:
- Set Top K as hard upper limit
- Top P provides dynamic threshold
- Both constrain sampling for quality
Recommended combination:
- Top K: 50
- Top P: 0.95
- Works well for most scenarios
Top P
Nucleus sampling threshold that selects tokens whose cumulative probability reaches P. This dynamic approach to limiting token choices adapts based on the confidence distribution.
How it works:
Instead of fixed K tokens, select tokens until cumulative probability reaches threshold:
Top P = 0.9:
- Sort all tokens by probability
- Add tokens until cumulative probability ≥ 0.9
- Sample from this dynamic set
Example:
Token A: 40% probability
Token B: 30% probability
Token C: 15% probability
Token D: 10% probability
Others: 5% probability
With Top P = 0.9:
- Keep A (40% cumulative)
- Keep B (70% cumulative)
- Keep C (85% cumulative)
- Keep D (95% cumulative ≥ 90% ✓)
- Result: Sample from {A, B, C, D}
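The walkthrough above can be reproduced with a few lines of plain Python. The sketch below filters a toy next-token distribution with top-k and top-p and renormalizes the survivors; real decoders work on logits, but the selection logic is the same.

```python
# Top-k and top-p (nucleus) filtering over a toy next-token distribution.
probs = {"A": 0.40, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.05}

def top_k_filter(p: dict, k: int) -> dict:
    kept = sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(v for _, v in kept)
    return {tok: v / total for tok, v in kept}            # renormalize survivors

def top_p_filter(p: dict, threshold: float) -> dict:
    kept, cumulative = {}, 0.0
    for tok, v in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = v
        cumulative += v
        if cumulative >= threshold:                       # stop once threshold reached
            break
    total = sum(kept.values())
    return {tok: v / total for tok, v in kept.items()}

print(top_p_filter(probs, 0.9))   # keeps A, B, C, D (matches the example above)
print(top_k_filter(probs, 3))     # keeps only A, B, C
```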
Typical range: 0.0-1.0 (commonly 0.90-0.95)
Recommended starting value: 0.95
Top P vs Top K
Top K: Fixed number of tokens (e.g., always 50)
Top P: Variable number based on confidence distribution
Top P adapts:
- When model is confident: few tokens needed to reach P
- When model is uncertain: more tokens needed to reach P
Understanding Top P values:
High Top P (0.95-1.0)
More diverse outputs:
- Includes less probable tokens
- Greater output variety
- More creative and unpredictable
- Risk of lower coherence
Use when:
- Creative description tasks
- Varied phrasing desired
- Output diversity important
- Repetition is a problem
Example: Top P = 0.95
- Allows more vocabulary choices
- Includes less common but valid alternatives
- Increases expression variety
Medium Top P (0.85-0.94)
Balanced outputs (recommended for most tasks):
- Moderate diversity
- Maintains coherence
- Focused but not repetitive
- Good default choice
Use when:
- Standard VLM tasks
- Balance needed between quality and diversity
- First-time configuration
- Phrase grounding and VQA
Example: Top P = 0.90
- Focuses on high-probability tokens
- Allows some variation
- Prevents most low-quality choices
Low Top P (0.7-0.84)
Focused outputs:
- Very deterministic
- Minimal diversity
- Highly consistent
- Risk of repetition
Use when:
- Maximum consistency required
- Factual accuracy critical
- Template-like outputs desired
- Debugging generation issues
Example: Top P = 0.80
- Very focused token selection
- Predictable outputs
- May become repetitive
Recommended combinations:
Standard phrase grounding
Configuration:
- Top K: 50
- Top P: 0.95
- Temperature: 1.0
Why this works:
- Allows diverse phrasing for captions
- Maintains focus for accurate groundings
- Balances creativity and precision
Factual VQA
Configuration:
- Top K: 50
- Top P: 0.90
- Temperature: 0.7
Why this works:
- Focuses on most probable (accurate) answers
- Reduces creative but potentially incorrect responses
- Maintains consistency across similar questions
Creative descriptions
Configuration:
- Top K: 70
- Top P: 0.95
- Temperature: 1.2
Why this works:
- High diversity in vocabulary
- Varied expression styles
- Creative but coherent outputs
Sampling temperature
Controls randomness in token selection during generation. Temperature shapes the probability distribution over possible next tokens, directly affecting output diversity and creativity.
How it works:
Temperature scales the probability distribution:
Low temperature (0.3):
- Sharpens distribution → more deterministic
- High-probability tokens become even more likely
- Low-probability tokens become even less likely
High temperature (1.5):
- Flattens distribution → more random
- Probabilities become more uniform
- Low-probability tokens get more chances
Example:
Original probabilities:
Token A: 40%
Token B: 30%
Token C: 20%
Token D: 10%
Temperature = 0.5 (low):
Token A: 60% ← More focused
Token B: 25%
Token C: 10%
Token D: 5%
Temperature = 1.5 (high):
Token A: 35% ← More uniform
Token B: 32%
Token C: 22%
Token D: 11%
Typical range: 0.1-2.0 (commonly 0.7-1.3)
Recommended starting value: 1.0 (neutral)
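The reshaping above can be written as a tiny softmax-with-temperature function. This sketch works on a toy distribution, and its exact outputs differ slightly from the rounded illustrative figures below.

```python
# Temperature scaling: divide log-probabilities by T, then re-normalize.
import math

def apply_temperature(probs: dict, temperature: float) -> dict:
    scaled = {tok: math.log(p) / temperature for tok, p in probs.items()}
    max_logit = max(scaled.values())                       # subtract max for stability
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    return {tok: round(v / total, 3) for tok, v in exp.items()}

probs = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
print(apply_temperature(probs, 0.5))   # sharper: A dominates even more
print(apply_temperature(probs, 1.5))   # flatter: probabilities move closer together
```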
Temperature intuition
Think of temperature like confidence vs. exploration:
Low temperature (0.3-0.7): "Play it safe, use most likely words"
Medium temperature (0.8-1.2): "Balance safety with some variation"
High temperature (1.3-2.0): "Be creative, try different approaches"
Choosing temperature:
Low temperature (0.3-0.7)
Focused, deterministic outputs:
Use for:
- Factual question answering
- Structured output generation
- Consistency critical tasks
- Bounding box generation
Characteristics:
- Very predictable outputs
- High consistency
- Low diversity
- Risk of repetition with very low values
Example use case: Counting objects in images:
- Temperature 0.5
- Consistent "There are X objects" format
- Minimal phrasing variation
- Focus on accuracy
Recommended values:
- VQA (factual): 0.7
- Object counting: 0.5
- Structured outputs: 0.6
Medium temperature (0.8-1.2)
Balanced outputs (recommended for most tasks):
Use for:
- Standard VLM tasks
- Phrase grounding
- General image understanding
- Most production scenarios
Characteristics:
- Good balance of quality and diversity
- Natural-sounding outputs
- Appropriate variation
- Reliable performance
Example use case: Phrase grounding captions:
- Temperature 1.0
- Varied but accurate descriptions
- Natural language flow
- Consistent quality
Recommended values:
- Phrase grounding: 1.0
- VQA (descriptive): 1.0
- General tasks: 0.9-1.1
High temperature (1.3-2.0)
Creative, diverse outputs:
Use for:
- Creative image descriptions
- Multiple phrasing alternatives
- Exploring diverse generations
- Reducing repetition
Characteristics:
- High output diversity
- Creative language use
- Less predictable
- Risk of incoherence if too high
Example use case: Creative scene descriptions:
- Temperature 1.5
- Varied vocabulary and phrasing
- Multiple perspectives
- Interesting but potentially less consistent
Recommended values:
- Creative descriptions: 1.3-1.5
- Diverse generations: 1.4
- Maximum diversity: 1.6-1.8
Caution: Values above 1.5 may reduce quality
Temperature interactions:
Combining with Top K and Top P
Temperature works with Top K and Top P:
Processing order:
- Temperature: Reshapes probability distribution
- Top K: Limits to K most probable tokens
- Top P: Limits to cumulative probability threshold
- Sample: Choose next token from remaining options
Recommended combinations:
For factual tasks:
Temperature: 0.7 (focused)
Top K: 50 (moderate limit)
Top P: 0.90 (focused sampling)
→ Deterministic, accurate outputs
For balanced tasks:
Temperature: 1.0 (neutral)
Top K: 50 (moderate limit)
Top P: 0.95 (balanced sampling)
→ Natural, reliable outputs
For creative tasks:
Temperature: 1.3 (creative)
Top K: 70 (wider limit)
Top P: 0.95 (diverse sampling)
→ Varied, interesting outputs
Sampling repetition penalty
Reduces the likelihood of repeating tokens that have already been generated. This penalty helps create more diverse and natural-sounding outputs by discouraging repetitive patterns.
How it works:
After generating each token:
- Track which tokens have been used
- Reduce probability of already-used tokens
- Higher penalty = stronger discouragement
Typical range: 1.0-2.0
- 1.0: No penalty (default behavior)
- 1.05: Gentle penalty
- 1.1: Moderate penalty (recommended)
- 1.3: Strong penalty
- 1.5+: Very strong penalty (may hurt coherence)
Recommended starting value: 1.05
How repetition penalty works
Example with penalty 1.2:
If "car" was already generated:
- Original probability: 10%
- After penalty: 10% ÷ 1.2 = 8.3%
Repeated use further reduces probability:
- Second use: 8.3% ÷ 1.2 = 6.9%
- Third use: 6.9% ÷ 1.2 = 5.8%
The model increasingly favors alternative words.
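As a simplified sketch of this mechanism, the function below divides the scores of already-generated tokens by the penalty and renormalizes. Production decoders typically apply the penalty to raw logits rather than probabilities, but the effect is in the same direction: previously used tokens become less likely.

```python
# Simplified repetition penalty: already-used tokens have their scores divided by the penalty.
def penalize_repeats(probs: dict, generated: set, penalty: float) -> dict:
    adjusted = {tok: (p / penalty if tok in generated else p) for tok, p in probs.items()}
    total = sum(adjusted.values())
    return {tok: round(p / total, 4) for tok, p in adjusted.items()}  # renormalize

probs = {"car": 0.10, "vehicle": 0.08, "truck": 0.05, "road": 0.77}
print(penalize_repeats(probs, generated={"car"}, penalty=1.2))
# "car" drops from 10% to roughly 8.3% before renormalization, as in the example above.
```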
Choosing repetition penalty:
Low penalty (1.0-1.1)
Minimal repetition discouragement:
Use for:
- Natural language where repetition is acceptable
- Technical descriptions requiring specific terminology
- Structured outputs with repeated elements
- Default starting point
Value 1.0 (no penalty):
- Natural repetition patterns
- May repeat common words naturally
- Good for most standard tasks
Value 1.05 (gentle penalty):
- Slight preference for variety
- Maintains natural language flow
- Recommended default for most VLM tasks
Example: Phrase grounding captions naturally repeat:
- "The red car next to a blue car" (car repeated appropriately)
- Penalty 1.05 allows this natural repetition
Moderate penalty (1.1-1.3)
Balanced repetition control (recommended):
Use for:
- Long-form descriptions
- Reducing noticeable repetition
- Creative text generation
- When repetition becomes problematic
Value 1.1:
- Good balance for most use cases
- Reduces obvious repetition
- Maintains coherence
Value 1.2:
- Stronger variety encouragement
- For longer outputs
- When 1.1 shows too much repetition
Example: Without penalty: "The image shows a person wearing a hat. The person is standing next to another person. Each person has a bag."
With penalty 1.2: "The image shows a person wearing a hat, standing next to someone else. Both individuals carry bags." → More varied vocabulary
High penalty (1.3+)
Strong repetition avoidance:
Use for:
- Creative writing scenarios
- Extreme repetition problems
- Experimental settings
Caution:
- May force unnatural word choices
- Can reduce coherence
- Might avoid necessary repetitions
- Use only when repetition is severe
Value 1.5:
- Very strong penalty
- Significantly alters word choice
- Risk of awkward phrasing
Example: With very high penalty: "The car is red" → hard to repeat "car" → forced to use "vehicle", "automobile", "auto" even when "car" is most natural
Generally not recommended unless repetition is extreme.
Common scenarios:
Phrase grounding outputs
Recommended: 1.05 (gentle)
Phrase grounding naturally involves:
- Repeating object names in groundings
- Similar phrasing across detections
- Structured JSON format
Why gentle penalty:
- Allow natural terminology repetition
- Maintain accurate object references
- Preserve structured output format
Example:
{
"groundings": [
{"phrase": "a red car", ...},
{"phrase": "a blue car", ...},
{"phrase": "the parked cars", ...}
]
}→ "car" repeated appropriately
Visual question answering
Recommended: 1.05-1.1
VQA responses are typically:
- Short (1-3 sentences)
- Focused answers
- Limited repetition risk
Why light penalty:
- Short outputs have less repetition
- Focus on accuracy over variety
- Natural answer patterns acceptable
Example question: "What color are the cars?"
Answer with 1.05: "There are two cars: a red car and a blue car." → Natural repetition of "car" acceptable
Long descriptions
Recommended: 1.1-1.2
Longer outputs risk more repetition:
- Multiple sentences
- Describing many objects
- Detailed scene understanding
Why moderate penalty:
- Encourages vocabulary variety
- Maintains natural flow
- Prevents monotonous phrasing
Example without penalty: "The scene shows a person in a red shirt. Next to the person is another person in a blue shirt. Behind these two people is a third person."
Example with penalty 1.15: "The scene shows a person in a red shirt, beside someone wearing blue. A third individual stands behind them." → More varied phrasing
Best practices
Start with recommended defaults
Recommended starting configuration:
Model Options:
- Architecture size: Medium (7-13B) or small (1-3B) for testing
- Training mode: LoRA Training
- Quantization: NF4
- Precision type: BFloat16
Hyperparameters:
- Epochs: 100 (adjust based on dataset size)
- Learning rate: 0.0001 (for LoRA Training)
- Batch size: 4 (adjust for your GPU)
- Gradient accumulation: 1
- Optimizer: AdamW
Evaluation:
- Max new tokens: 512
- Top K: 50
- Top P: 0.95
- Temperature: 1.0
- Repetition penalty: 1.05
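For reference, here are the defaults above collected in one place as a plain Python dictionary; the key names are illustrative and do not correspond to a specific SDK or API schema.

```python
# Recommended starting configuration as a plain dictionary (illustrative key names).
default_config = {
    "model": {
        "architecture_size": "7-13B",      # or "1-3B" for quick tests
        "training_mode": "lora",
        "quantization": "nf4",
        "precision": "bfloat16",
    },
    "hyperparameters": {
        "epochs": 100,
        "learning_rate": 1e-4,
        "batch_size": 4,
        "gradient_accumulation_steps": 1,
        "optimizer": "adamw",
    },
    "evaluation": {
        "max_new_tokens": 512,
        "top_k": 50,
        "top_p": 0.95,
        "temperature": 1.0,
        "repetition_penalty": 1.05,
    },
}
```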
Why these defaults:
- Balanced across speed, quality, and resource usage
- Work well for most VLM tasks
- Safe starting point for experimentation
- Proven effective in production
Iterate based on results
Systematic tuning approach:
- Start with defaults (above configuration)
- Train initial model and evaluate performance
- Identify issues from training behavior:
  - Slow convergence → adjust learning rate
  - Memory errors → reduce batch size or enable quantization
  - Poor quality → try larger model
  - Overfitting → reduce epochs or add regularization
- Change one thing at a time for clear cause-effect
- Compare results across runs using evaluation metrics
- Refine iteratively until satisfactory performance
Example iteration:
Run 1 (defaults):
- Result: Good quality but slow convergence
Run 2 (adjust learning rate):
- Learning rate: 0.0001 → 0.0002
- Result: Faster convergence, similar quality
Run 3 (adjust batch size):
- Batch size: 4 → 8
- Result: Stable training, faster per-epoch time
Run 4 (final tuning):
- Epochs: 100 → 75 (converged earlier)
- Result: Optimal configuration found
Match settings to your resources
For limited GPU memory (8-16GB):
Architecture: Small (1-3B)
Training mode: LoRA Training
Quantization: NF4
Precision: BFloat16
Batch size: 1-2
Gradient accumulation: 4-8
For standard GPU (24-32GB):
Architecture: Medium (7-13B)
Training mode: LoRA Training
Quantization: NF4
Precision: BFloat16
Batch size: 4-8
Gradient accumulation: 1-2
For high-end GPU (40-80GB):
Architecture: Medium to Large
Training mode: LoRA Training or Full Finetuning (SFT)
Quantization: NF4 or FP4
Precision: BFloat16
Batch size: 8-16
Gradient accumulation: 1
Monitor training behavior
Watch for these patterns:
Loss curves:
- Smooth decrease: Good configuration ✓
- Erratic spikes: Learning rate too high
- Plateau early: Learning rate too low or model capacity insufficient
- Overfitting: Training loss decreases but validation increases
Memory usage:
- Consistent 70-80% GPU utilization: Optimal ✓
- Frequent OOM errors: Reduce batch size or enable quantization
- Low utilization (<50%): Can increase batch size
Training speed:
- Consistent step times: Good ✓
- Increasing step times: Memory/swap issues
- Very slow: Consider larger batch size or better GPU
Document your configurations
Track successful configurations:
Keep records of:
- Which settings worked for which tasks
- Resource requirements (GPU memory, training time)
- Performance metrics achieved
- Issues encountered and solutions
Naming convention for workflows:
[Task]-[Model Size]-[Key Settings]-v[Number]
Examples:
- "DefectDetection-7B-LoRA-NF4-v1"
- "ProductRecog-13B-LoRA-FastConv-v2"
- "QualityControl-3B-FullFT-HighAcc-v1"Benefits:
- Quickly identify successful configurations
- Share settings with team members
- Reproduce results reliably
- Track iteration progress
Common questions
Which settings have the biggest impact on training quality?
Most critical settings for quality:
- Model architecture size (35% impact)
  - Larger models generally achieve higher quality
  - Most important factor for capacity
- Learning rate (25% impact)
  - Correct learning rate ensures convergence
  - Too high/low severely impacts results
- Epochs (20% impact)
  - Sufficient epochs required to converge
  - More isn't always better (overfitting risk)
- Dataset quality and size (not a setting, but 40% impact)
  - High-quality annotations critical
  - More data generally helps
Less critical for quality (but affect speed/resources):
- Training mode (LoRA Training vs Full Finetuning): Similar quality
- Quantization: Minimal quality impact
- Batch size: Affects stability more than final quality
- Evaluation settings: Only affect inference, not training
My training is too slow. What should I change?
Speed optimization strategies:
1. Reduce model size:
- Large (20-34B) → Medium (7-13B): 2-3x faster
- Medium (7-13B) → Small (1-3B): 2-4x faster
2. Enable optimizations:
- Use LoRA Training instead of Full Finetuning (SFT): 2-3x faster
- Use NF4 quantization: 1.5-2x faster
- Use BFloat16 precision: 2x faster vs Float32
3. Increase batch size:
- Batch 2 → Batch 4: ~1.5x faster
- Batch 4 → Batch 8: ~1.3x faster
- Limited by GPU memory
4. Reduce epochs:
- Monitor for early convergence
- Stop when validation metrics plateau
- May be training longer than needed
5. Reduce gradient accumulation:
- Only if memory allows
- Accumulation adds overhead
Typical speedup example:
- Before: Large model, Full Finetuning (SFT), batch 2 → 10 hours
- After: Medium model, LoRA Training, NF4, batch 4 → 2-3 hours
- ~3-4x total speedup with minimal quality impact
How do I know if my settings are good?
Indicators of good configuration:
Training behavior:
- Loss decreases smoothly
- No frequent spikes or NaN values
- Reasonable training speed
- GPU utilization 70-90%
- No out-of-memory errors
Results quality:
- Validation metrics improve over training
- Test set performance meets requirements
- Generated outputs are coherent and accurate
- Model generalizes to new examples
Resource usage:
- Training completes in acceptable time
- GPU memory usage stable
- Costs within budget
Comparison approach:
- Train with default settings (baseline)
- Evaluate quality metrics
- Adjust one setting at a time
- Compare metrics to baseline
- Keep changes that improve results
Should I use the same settings for all my projects?
Start similar, then customize:
Reusable baseline:
- Core settings (LoRA Training, NF4, AdamW) work across projects
- Training approach generalizes well
- Resource optimizations apply universally
Project-specific tuning:
- Epochs: Depends on dataset size
- Learning rate: May need adjustment per task
- Batch size: Constrained by your GPU
- Model size: Depends on task complexity
Best practice:
- Create a "baseline" workflow with proven settings
- Clone it for new projects
- Adjust task-specific parameters:
- System prompt
- Dataset
- Epochs based on data size
- Fine-tune if baseline doesn't perform well
Example:
Baseline workflow:
- 7B model, LoRA Training, NF4, BFloat16
- Batch 4, AdamW, LR 0.0001
- Works for most detection tasks
Project A (100 images):
- Clone baseline
- Adjust: Epochs 150 (small dataset)
Project B (1000 images):
- Clone baseline
- Adjust: Epochs 75 (large dataset)
- Adjust: Batch 8 (more memory available)
When should I use Full Finetuning (SFT) instead of LoRA Training?
Use Full Finetuning (SFT) when:
- Domain is drastically different:
  - Medical/microscopy images (if base model trained on natural images)
  - Satellite/aerial imagery
  - Specialized visual domains
- Maximum quality is critical:
  - Production systems with strict accuracy requirements
  - Research requiring state-of-the-art results
  - When LoRA Training results are insufficient
- Resources are abundant:
  - Access to high-end GPUs (A100, H100)
  - Training time is not a constraint
  - Budget allows for higher compute costs
Stick with LoRA Training when:
- Standard computer vision tasks:
  - Object detection
  - Classification
  - Quality control
  - Most production applications
- Limited resources:
  - Consumer GPUs
  - Time constraints
  - Budget limitations
- Iterating quickly:
  - Development phase
  - Prototyping
  - A/B testing different approaches
Reality: 90%+ of use cases work excellently with LoRA Training. Try LoRA Training first, only switch to Full Finetuning (SFT) if results are insufficient and you have the resources.
My model generates repetitive outputs. How do I fix this?
Solutions ranked by effectiveness:
1. Increase repetition penalty (first try):
- Current: 1.05 → Try: 1.1 or 1.15
- Directly addresses repetition
- Usually most effective solution
2. Adjust temperature:
- Current: 1.0 → Try: 1.2
- Increases output diversity
- More creative word choices
3. Increase Top P:
- Current: 0.90 → Try: 0.95
- Allows more token variety
- Broader vocabulary usage
4. Increase Top K:
- Current: 50 → Try: 70-100
- Widens sampling pool
- More diverse token selection
5. Training-level solutions (if inference changes don't help):
- Increase dataset diversity
- Add more varied training examples
- Adjust system prompt to encourage variety
Typical fix:
Before:
- Temperature: 1.0
- Repetition penalty: 1.05
- Result: "The car is red. The car is large. The car is parked."
After:
- Temperature: 1.1
- Repetition penalty: 1.15
- Result: "The vehicle is red and large, parked near the building."Try adjustments incrementally—don't change all at once.
What's the difference between batch size and gradient accumulation?
Both affect effective batch size, but differently:
Batch size:
- Number of images processed simultaneously
- Limited by GPU memory
- Higher = faster training (better GPU utilization)
- Each batch computes gradients, model updates immediately
Gradient accumulation:
- Number of batches before updating model
- Not limited by GPU memory
- Higher = simulates larger batches without extra memory
- Gradients accumulated across batches, then single update
Example comparison:
Configuration A:
- Batch size: 8
- Gradient accumulation: 1
- Effective batch: 8
- Updates per epoch: dataset_size / 8
- Speed: Fast (8 images at once)
- Memory: High (8 images in GPU)
Configuration B:
- Batch size: 2
- Gradient accumulation: 4
- Effective batch: 8 (same as A)
- Updates per epoch: dataset_size / 8 (same as A)
- Speed: Slower (only 2 images at once, but 4x more forward passes)
- Memory: Low (only 2 images in GPU)
When to use which:
Prefer higher batch size (less accumulation):
- When GPU memory allows
- For faster training
- Simpler configuration
Use gradient accumulation:
- When GPU memory is limited
- To simulate larger batches than GPU allows
- To improve training stability without upgrading GPU
Best practice: Use largest batch size that fits in memory, then add accumulation if you need larger effective batch sizes.
Next steps
After configuring your model settings:
Continue workflow configuration
- Configure your system prompt — Define VLM task instructions
- Configure your dataset — Set up data splitting
- Create a workflow — Complete workflow setup
Start training
- Manage runs — Launch training and monitor progress
- Monitor a run — Track training metrics in real-time
- Evaluate a model — Assess performance on test data
Optimize performance
- Configure training settings — Advanced optimization options
- View resource usage — Monitor compute and storage costs
Additional resources
Training guides
- Create a training project — Set up training projects
- Train a model — Complete training workflow
- Configure training settings — Advanced parameters
Model configuration
- Configure your model — Model selection and setup
- Configure your dataset — Dataset splitting
- Configure your system prompt — Task instructions
Concept guides
- Phrase grounding — Understanding visual grounding
- Visual question answering — Understanding VQA
- Glossary — VLMOps terminology reference
Quickstart
- Quickstart: Train a model — Fast-track training guide
- Quickstart: Create a workflow — Streamlined workflow setup
Related resources
- Model architectures — Compare VLM architectures and sizes
- Configure your model — Complete model configuration overview
- Create a workflow — Combine model settings into workflows
- Configure training settings — Set checkpoint strategy and GPU
- Train a model — Complete training workflow guide
- Evaluate a model — Assess model performance
- System prompts — Define VLM behavior
- Dataset configuration — Set train-test split
- Resource usage — Understanding Compute Credits
- Quickstart — End-to-end training tutorial
- Vi SDK — Python SDK for model management
- Contact us — Get help from the Datature team
Need help?
We're here to support your VLMOps journey. Reach out to the Datature team via the Contact us link above.