Configure Training Settings

Configure advanced settings, hardware, and validation before starting your training run.

The training configuration wizard guides you through four essential steps: advanced settings, hardware selection, dataset validation, and final review.

📋 Prerequisites

Before configuring training settings, ensure you have:


Access training configuration

From your workflow canvas, click Run Training at the bottom right to open the training configuration dialog.

The configuration wizard presents four sequential steps that you must complete before starting training.


Step 1: Advanced Settings

Configure checkpoint and evaluation settings that control how your model is saved and monitored during training.

Checkpoint Strategy

Evaluation Interval Epochs — Controls how frequently the training process creates evaluation checkpoints.

| Setting | Behavior | Best for |
| --- | --- | --- |
| 1 epoch | Evaluate after every epoch | Short training runs (3-10 epochs), critical monitoring |
| 2-5 epochs | Evaluate every few epochs | Standard training (10-50 epochs), balanced monitoring |
| 10+ epochs | Evaluate less frequently | Long training runs (100+ epochs), reduced overhead |

Learn more about checkpoint strategies →
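
A quick sanity check on the interval you choose: the number of evaluation checkpoints a run produces is the epoch count divided by the interval. A minimal sketch in Python (illustrative only, not a platform API):

```python
# Illustrative: relate total epochs and evaluation interval to the
# number of evaluation checkpoints a run produces.
def num_checkpoints(total_epochs: int, eval_interval_epochs: int) -> int:
    return total_epochs // eval_interval_epochs

print(num_checkpoints(10, 1))    # 10 checkpoints: evaluate after every epoch
print(num_checkpoints(50, 5))    # 10 checkpoints: balanced monitoring
print(num_checkpoints(100, 10))  # 10 checkpoints: reduced overhead
```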

Advanced Evaluation

Enable Advanced Evaluation — Enables detailed evaluation metrics and preview visualizations during training.

| Option | Description | Impact |
| --- | --- | --- |
| Enable | Generate visual previews at each checkpoint | Higher GPU memory usage, slower checkpointing |
| Disable | Basic metrics only, no preview generation | Faster checkpointing, lower memory footprint |

💡 Recommendation: Enable Advanced Evaluation for initial experiments to understand model behavior. Disable it for production runs where speed is critical.

View complete checkpoint configuration options →

Click Next to continue to hardware configuration.


Step 2: Hardware Configuration

Select your compute infrastructure and GPU resources for training. Your choices directly impact training speed, cost, and model capacity.

Infrastructure Options

Choose where your training will run:

| Infrastructure | Description | Availability |
| --- | --- | --- |
| Vi Cloud | Train on Vi's managed GPU infrastructure with automatic scaling and monitoring | Available now |
| Custom Runner | Train on your own infrastructure with full control over environment and resources | Coming soon |

💡 For most users: Vi Cloud provides the fastest setup with no infrastructure management required.

GPU Type

Select the GPU model that matches your performance requirements and budget. GPU types differ in:

  • Memory capacity (VRAM) — Determines maximum batch size and model size
  • Compute performance — Affects training speed
  • Cost multiplier — Higher-performance GPUs consume more Compute Credits per minute

| GPU Tier | Example Models | Best For |
| --- | --- | --- |
| Entry-level | NVIDIA T4 | Small models, experimentation, tight budgets |
| Balanced | NVIDIA L4, A10G | Standard VLM training, production workflows |
| High-performance | NVIDIA A100 (40GB/80GB) | Large models, faster iteration, distributed training |
| Cutting-edge | NVIDIA H100, H200, B200 | Maximum performance, ultra-large models, research |

View complete GPU specifications and pricing →

Number of GPUs

Choose how many GPUs to allocate for distributed training. Multi-GPU setups can significantly reduce training time for large models and datasets.

| Configuration | Training Speed | Use Case |
| --- | --- | --- |
| 1 GPU | Baseline | Small to medium models, single experiments |
| 2-4 GPUs | 1.8-3.5× faster | Standard distributed training, faster iteration |
| 8+ GPUs | 6-12× faster | Large-scale training, production pipelines |

💡 Compute Credit calculation: Multi-GPU configurations have their own usage multipliers rather than a simple per-GPU rate times the GPU count. For example, a 4× A10G setup consumes 10.0 credits per minute.

View usage multipliers →

Cost Estimation

Usage Multiplier displays the Compute Credits consumed per minute for your selected configuration:

Example calculation:
  4× NVIDIA A10G GPUs → 10.0 credits/minute
  Estimated training time: 45 minutes
  Total cost: 450 Compute Credits

Plan your Compute Credit usage →

Click Next to validate your dataset.


Step 3: Dataset Validation

The platform automatically validates your dataset to ensure it meets training requirements.

Validation Checks

The system performs several checks on your configured dataset:

| Check | Purpose | Common Issues |
| --- | --- | --- |
| Asset availability | Verify all training assets are accessible | Missing or deleted assets |
| Annotation completeness | Ensure annotations exist for training split | Empty training set |
| Format compatibility | Check annotation format matches model requirements | Incorrect annotation types |
| Split configuration | Validate train/validation/test splits | Invalid split ratios |
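
For intuition, here is a rough sketch of what checks like these amount to. This is not the platform's validation code; the data shapes (asset paths, annotation lists, split assignments) are assumed for illustration:

```python
from pathlib import Path

def validate_dataset(assets: dict[str, Path],
                     annotations: dict[str, list],
                     splits: dict[str, list[str]]) -> list[str]:
    """Return validation errors; an empty list means ready for training."""
    errors = []
    # Asset availability: every referenced asset must be accessible.
    for asset_id, path in assets.items():
        if not path.exists():
            errors.append(f"Missing or deleted asset: {asset_id}")
    # Annotation completeness: the training split must contain annotated assets.
    train_ids = splits.get("train", [])
    if not train_ids:
        errors.append("Empty training split")
    elif not any(annotations.get(a) for a in train_ids):
        errors.append("No annotations in the training split")
    # Split configuration: no asset may appear in more than one split.
    all_ids = [a for split in splits.values() for a in split]
    if len(all_ids) != len(set(all_ids)):
        errors.append("Assets assigned to multiple splits")
    return errors
```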

Validation Results

| Status | Meaning | Action |
| --- | --- | --- |
| Ready for Training | All checks passed, dataset is valid | Proceed to summary |
| Warnings | Non-critical issues detected | Review warnings, optionally fix |
| Errors | Critical issues preventing training | Fix issues before proceeding |

❗️ Critical validation errors must be resolved before training can start. Common errors include missing assets, empty annotation sets, or invalid split configurations.

If issues are found:

  1. Review the error or warning details
  2. Click Back to exit the configuration
  3. Fix the identified issues in your dataset or workflow
  4. Return and restart the training configuration

Click Next when validation shows "Ready for Training".


Step 4: Review Summary

Review your complete training configuration before launching the run.

Summary Information

The summary displays key configuration details:

| Component | Description | Source |
| --- | --- | --- |
| System Prompt | Character count of your prompt configuration | Configure your system prompt |
| Architecture | Selected VLM model architecture and size | Configure your model |
| Batch Size | Number of samples processed per training step | Model settings |
| Training Epochs | Number of complete passes through training data | Training settings |
| Usage Multiplier | Compute Credits consumed per minute | Resource usage |

Pre-launch Checklist

Before clicking Run Training, verify:

  • System prompt matches your use case requirements
  • Model architecture is appropriate for your dataset size
  • Batch size fits within selected GPU memory (VRAM)
  • Training epochs provide sufficient learning time
  • Usage multiplier aligns with your Compute Credit budget

Launch Training

Click Run Training to start the training run. The system will:

  1. Allocate your selected GPU resources
  2. Load your dataset and model configuration
  3. Initialize training with your specified settings
  4. Begin the fine-tuning process

Training runs in the background. You can safely close the browser or navigate away. The platform will notify you when training completes.

Learn how to monitor your training run →


Configuration Best Practices

Start conservative, then scale

For your first training run:

  1. Use 1 GPU (typically 1× T4 or L4) to validate configuration
  2. Enable Advanced Evaluation to understand model behavior
  3. Set checkpointing to 1-2 epochs for frequent feedback
  4. Monitor resource usage and training progress

After validation:

  1. Scale to multi-GPU for faster training
  2. Disable Advanced Evaluation if speed is priority
  3. Adjust checkpoint frequency based on training duration
  4. Optimize batch size for GPU memory utilization

Match hardware to model size

Model size considerations:

| Architecture Size | Minimum GPU | Recommended GPU | Batch Size |
| --- | --- | --- | --- |
| 0.5-3B params | T4 (16 GB) | L4 (24 GB) | 8-16 |
| 3-7B params | A10G (24 GB) | A100 (40 GB) | 4-8 |
| 7-13B params | A100 (40 GB) | A100 (80 GB) | 2-4 |
| 13B+ params | A100 (80 GB) | H100 (80 GB) | 1-2 |

Out of memory errors? Reduce batch size or upgrade to a GPU with more VRAM.

View detailed GPU specifications →
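
The sizing table above can be read as a simple lookup. A minimal sketch, with thresholds taken directly from the table (treat them as starting points, not hard limits):

```python
def recommended_gpu(params_billions: float) -> tuple[str, str]:
    """Return (minimum GPU, recommended GPU) for a given model size."""
    if params_billions <= 3:
        return ("T4 (16 GB)", "L4 (24 GB)")
    if params_billions <= 7:
        return ("A10G (24 GB)", "A100 (40 GB)")
    if params_billions <= 13:
        return ("A100 (40 GB)", "A100 (80 GB)")
    return ("A100 (80 GB)", "H100 (80 GB)")

print(recommended_gpu(7))   # ('A10G (24 GB)', 'A100 (40 GB)')
print(recommended_gpu(30))  # ('A100 (80 GB)', 'H100 (80 GB)')
```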

Optimize checkpoint frequency

Balance monitoring needs with training efficiency:

Frequent checkpoints (1-2 epochs):

  • ✅ Detailed progress tracking
  • ✅ More recovery points if training fails
  • ✅ Better for short training runs (< 10 epochs)
  • ❌ Slower training due to evaluation overhead

Infrequent checkpoints (5-10 epochs):

  • ✅ Faster training with less overhead
  • ✅ Better for long runs (50+ epochs)
  • ❌ Less granular progress visibility
  • ❌ Fewer recovery points

Configure checkpoint strategies →

Plan for compute costs

Before starting training:

  1. Estimate duration: Similar models with similar datasets provide a baseline
  2. Calculate credits: Usage Multiplier × Estimated Minutes = Total Credits
  3. Add buffer: Include 20-30% buffer for unexpected training time
  4. Verify budget: Ensure sufficient Compute Credits available

Example calculation:

Configuration: 4× A100 (40GB) → 16.0 credits/minute
Estimated time: 2 hours (120 minutes)
Buffer (25%): 30 minutes

Total credits needed: 16.0 × 150 = 2,400 credits
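
The same arithmetic as a small helper, following the planning steps above (multiplier × minutes, plus a buffer); the function name is illustrative:

```python
def credits_needed(credits_per_minute: float, estimated_minutes: float,
                   buffer: float = 0.25) -> float:
    """Total Compute Credits including a 20-30% safety buffer."""
    return credits_per_minute * estimated_minutes * (1 + buffer)

# The example above: 4× A100 (40GB) at 16.0 credits/minute, 120 minutes + 25%.
print(credits_needed(16.0, 120))  # 2400.0 credits
```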

View pricing and examples →


Common Questions

Can I change settings after training starts?

No. Training configuration is locked when the run starts. If you need different settings:

  1. Cancel the current run
  2. Return to your workflow
  3. Click Run Training to configure a new run with updated settings

Tip: Test configurations with short runs (few epochs) before committing to long training sessions.

What happens if I run out of Compute Credits during training?

Training will pause when your Compute Credits are depleted:

  1. The run enters a "paused" state
  2. You receive a notification
  3. Training resumes automatically when credits are available
  4. No progress is lost—training continues from the last checkpoint

To avoid interruptions:

  • Monitor your resource usage before starting
  • Purchase additional Compute Credits if needed
  • Use smaller GPU configurations for longer availability

How do I choose between 1 GPU and multiple GPUs?

Use 1 GPU when:

  • Testing new configurations or hyperparameters
  • Training small models (< 3B parameters)
  • Working with limited Compute Credit budget
  • Dataset is small (< 1,000 images)

Use multiple GPUs when:

  • Training large models (7B+ parameters)
  • Working with large datasets (10,000+ images)
  • Speed is critical for iteration velocity
  • You have sufficient Compute Credit budget

Performance scaling: Multi-GPU training doesn't scale linearly. 4 GPUs are typically 2.5-3.5× faster than 1 GPU, not 4× faster.
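
To make the non-linear scaling concrete, here is a sketch with an assumed per-GPU efficiency factor. The factors are chosen to match the ranges quoted in this guide, not measured values:

```python
def effective_speedup(n_gpus: int, efficiency: float = 0.8) -> float:
    """Approximate speedup over a single GPU with imperfect scaling."""
    return 1.0 if n_gpus == 1 else n_gpus * efficiency

print(effective_speedup(4))        # 3.2x, within the quoted 2.5-3.5x range
print(effective_speedup(8, 0.75))  # 6.0x, matching the 6-12x table entry
```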

Learn about distributed training →

What's the difference between checkpointing and saving?

Checkpointing:

  • Occurs during training at specified intervals (every N epochs)
  • Creates recovery points for interrupted training
  • Generates evaluation metrics and visualizations
  • Temporary until training completes

Final model saving:

  • Occurs when training completes successfully
  • Creates the deployable model artifact
  • Includes final weights and configuration
  • Permanent and available for deployment

Configure checkpoint behavior →

Can I use the same GPU configuration for different models?

Maybe. GPU requirements depend on:

Model architecture size:

  • 0.5-3B parameter models: T4 or L4 sufficient
  • 3-7B parameter models: A10G or A100 (40GB) recommended
  • 7-13B parameter models: A100 (40GB) minimum, A100 (80GB) recommended

Batch size:

  • Larger batch sizes require more VRAM
  • Reduce batch size if you encounter out-of-memory errors

Dataset complexity:

  • High-resolution images need more memory
  • Complex annotations increase memory usage

Rule of thumb: Start with the GPU configuration from similar successful runs. Upgrade if you encounter memory errors.
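
A back-of-the-envelope check behind this guidance: model weights alone occupy roughly the parameter count times the bytes per parameter, and training adds substantial overhead (gradients, optimizer state, activations) on top. A minimal sketch:

```python
def weights_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for model weights only (2 bytes/param in fp16/bf16)."""
    return params_billions * bytes_per_param

print(weights_vram_gb(7))   # 14.0 GB for weights alone, before training overhead
print(weights_vram_gb(13))  # 26.0 GB, in line with the A100 (40 GB) guidance
```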

View GPU selection guide →

Why is my estimated training time different from actual time?

Training time estimates are based on historical data and may vary due to:

Factors that increase training time:

  • Large batch sizes require more computation per step
  • Complex model architectures have slower forward/backward passes
  • Advanced Evaluation enabled adds checkpoint overhead
  • High-resolution images increase processing time

Factors that decrease training time:

  • Multi-GPU configurations (but not linearly)
  • Smaller datasets finish epochs faster
  • Disabled Advanced Evaluation reduces overhead

Tip: Track actual training time for your specific configurations to improve future estimates.


Troubleshooting

Training won't start - insufficient resources

Issue: Cannot start training due to resource constraints.

Potential causes:

  • Insufficient Compute Credits for selected GPU configuration
  • All GPUs of selected type currently in use
  • Account limits exceeded

Solutions:

  1. Check Compute Credits:

    • View your resource usage
    • Calculate required credits: Usage Multiplier × Estimated Duration
    • Purchase additional credits or wait for monthly renewal
  2. Try a different GPU type:

    • Select an alternative GPU with similar capabilities
    • Consider using fewer GPUs (e.g., 1× instead of 4×)
  3. Wait and retry:

    • GPU availability varies; retry after a few minutes
    • Schedule training during off-peak hours if possible

Dataset validation fails

Issue: Validation step shows errors preventing training.

Common validation errors:

| Error | Cause | Solution |
| --- | --- | --- |
| Missing assets | Assets deleted after workflow configuration | Re-upload missing assets |
| Empty training split | No assets assigned to training split | Adjust split configuration |
| Invalid annotations | Annotation format incompatible with model | Fix annotation format |
| Insufficient data | Training set too small for selected model | Add more annotated assets |

To resolve:

  1. Note the specific error message from validation
  2. Click Back to exit configuration
  3. Fix the underlying issue in your dataset or workflow
  4. Restart the configuration and validate again

Learn about dataset requirements →

Configuration resets after going back

Issue: Changes made in earlier steps are lost when navigating back.

Why this happens:

  • Configuration wizard maintains session state
  • Browser back button may cause state loss
  • Session timeout resets configuration

To avoid:

  • Use the Back button within the wizard, not browser back
  • Complete configuration in one session
  • Don't leave configuration dialog idle for extended periods

If configuration is lost:

  • Your workflow settings are preserved
  • Only the training run configuration needs to be redone
  • Previous selections may auto-populate based on workflow defaults

Can't see my preferred GPU type

Issue: Desired GPU model not available in dropdown.

Reasons for unavailability:

  1. Plan restrictions: Some GPU types require specific subscription plans
  2. Region limitations: Certain GPUs may be region-specific
  3. Temporary unavailability: High-demand GPU types may have no free capacity at peak times
  4. Beta access: Cutting-edge GPUs (H200, B200) may require beta enrollment

Solutions:

  • Upgrade plan: Contact sales for Enterprise plans with more GPU options
  • Use available alternative: Select similar GPU with comparable performance
  • Join waitlist: Request access to beta GPU types
  • Check documentation: Verify GPU is generally available

Contact support for GPU access →


Next Steps

After configuring your training settings:

