Configure Your Model

Choose your model architecture and configure training settings for optimal VLM performance

Configuring your model involves two key decisions: selecting the right model architecture for your task and tuning training settings to optimize performance. These choices directly impact training speed, accuracy, and resource usage.

💡 New to VLM training?

Start with the quickstart guide to train your first model with recommended settings. Return here when you're ready to optimize and customize your configuration.

📋 Prerequisites

Get started with training →

What you'll configure

Model configuration happens in two stages when creating a workflow:

  1. Choose your model architecture based on your task, accuracy requirements, and compute budget
  2. Configure model settings such as training mode, hyperparameters, and evaluation behavior

Step 1: Choose Your Model Architecture

Select the vision-language model that best fits your use case, accuracy requirements, and available compute resources.

Available Architectures

Vi supports four powerful VLM architectures, each optimized for different scenarios, with more coming soon:

| Model | Sizes Available | Best For | Key Strength |
| --- | --- | --- | --- |
| Qwen2.5-VL | 3B, 7B, 32B | General-purpose VLM tasks | Dynamic resolution, extended context (128K tokens) |
| NVIDIA NVILA-Lite | 2B | Resource-constrained deployments | Efficiency, fast inference |
| NVIDIA Cosmos-Reason1 | 7B | Complex reasoning tasks | Logical inference, multi-step analysis |
| OpenGVLab InternVL3.5 | 8B | Balanced performance | Fine-grained visual understanding |
| DeepSeek OCR (Coming Soon) | TBA | Document understanding and OCR | Specialized text extraction capabilities |
| LLaVA-NeXT (Coming Soon) | TBA | Advanced multimodal reasoning | Improved visual comprehension |
📘 Recommended Starting Point

Qwen2.5-VL (7B) offers the best balance of performance and efficiency for most use cases. It handles diverse tasks including visual question answering, phrase grounding, and image understanding.

Quick Selection Guide

Choose based on your primary goal:

I need maximum accuracy

Recommended: Qwen2.5-VL 32B

The largest available model provides the highest accuracy across all task types. Requires substantial GPU resources (40-80GB) but delivers state-of-the-art results for:

  • Production deployments with strict quality requirements
  • Complex visual reasoning tasks
  • Multi-image and video understanding
  • Tasks requiring deep contextual understanding
I have limited compute resources

Recommended: NVILA-Lite 2B or Qwen2.5-VL 3B

These compact models run efficiently on single consumer GPUs (8-16GB) while maintaining good performance:

  • NVILA-Lite: Optimized for efficiency, fastest inference
  • Qwen2.5-VL 3B: Broader capabilities, general-purpose tasks

Both are ideal for edge deployments and real-time applications.

I need OCR or text extraction

Recommended: Qwen2.5-VL 7B or 32B

Qwen2.5-VL handles OCR tasks effectively along with other vision-language capabilities:

  • Document understanding and processing
  • Text extraction from images
  • Multilingual text recognition
  • General vision-language tasks

Coming Soon: DeepSeek OCR will be available for specialized OCR and document understanding tasks.

I need logical reasoning capabilities

Recommended: Cosmos-Reason1 7B

Optimized specifically for complex reasoning tasks:

  • Multi-step logical inference
  • Cause-and-effect analysis
  • Visual reasoning puzzles
  • Contextual understanding and decision-making

Best for analytical applications requiring deep reasoning over visual and textual information.

Explore all architectures in detail →


Step 2: Configure Model Settings

After selecting your architecture, fine-tune training and inference settings to optimize for your specific requirements.

Settings Categories

Model settings are organized into three main areas:

Model Options

Control the fundamental training approach:

  • Architecture Size: Choose parameter count (2B-32B) to balance capacity and efficiency
  • Training Mode: Select between full fine-tuning or LoRA for parameter-efficient training
  • Quantization: Reduce memory usage with 4-bit or 8-bit quantization
  • Precision Type: Trade speed for accuracy with different numeric precision levels

Impact: These settings determine memory requirements, training speed, and model capacity.
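To make these options concrete, here is a minimal sketch of how the four knobs fit together. The `model_options` dictionary below is hypothetical — the field names are invented for illustration and are not Vi's actual configuration schema — but the trade-offs it encodes mirror the bullets above.

```python
# Hypothetical model-options sketch; field names are illustrative only,
# not Vi's actual configuration schema.
model_options = {
    "architecture": "Qwen2.5-VL",  # base VLM family
    "size": "7B",                  # parameter count: capacity vs. memory
    "training_mode": "lora",       # "full" fine-tuning or parameter-efficient "lora"
    "quantization": "4bit",        # None, "8bit", or "4bit" to reduce GPU memory
    "precision": "bf16",           # numeric precision: bf16/fp16 trade accuracy for speed
}
```

As a rough rule of thumb, a 7B model stored in bf16 needs about 14 GB for weights alone (7B parameters × 2 bytes), while 4-bit quantization brings that down to roughly 4 GB — which is why the quantization and precision options dominate memory requirements.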

Learn more about model options →

Hyperparameters

Control how the model learns during training:

  • Epochs: Number of complete passes through your training data
  • Learning Rate: Speed at which the model adapts to training data
  • Batch Size: Number of training examples processed simultaneously
  • Gradient Accumulation: Simulate larger batch sizes on limited hardware
  • Optimizer: Algorithm that updates model weights during training

Impact: These settings affect training convergence, final model quality, and training stability.
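As a quick sanity check on how batch size and gradient accumulation interact, the sketch below computes the effective batch size and an approximate number of optimizer steps. The values are illustrative examples, not Vi's defaults, and the dataset size is assumed.

```python
# Illustrative hyperparameter sketch; values are examples, not Vi's defaults.
hyperparameters = {
    "epochs": 5,
    "learning_rate": 2e-5,
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "optimizer": "adamw",
}

num_training_examples = 2_000  # assumed dataset size, for illustration only

# Gradient accumulation multiplies the batch size the optimizer "sees"
# without increasing per-step GPU memory.
effective_batch_size = (
    hyperparameters["per_device_batch_size"]
    * hyperparameters["gradient_accumulation_steps"]
)  # 4 * 4 = 16

steps_per_epoch = num_training_examples // effective_batch_size        # 125
total_optimizer_steps = steps_per_epoch * hyperparameters["epochs"]    # 625
print(effective_batch_size, steps_per_epoch, total_optimizer_steps)
```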

Learn more about hyperparameters →

Evaluation Settings

Control how the model generates predictions during inference:

  • Temperature: Controls randomness and creativity in responses
  • Top P: Restricts sampling to the most probable tokens
  • Repetition Penalty: Discourages repeated phrases in outputs
  • Maximum Length: Caps how long generated responses can be

Impact: These settings control output quality, diversity, and response length.
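These evaluation settings correspond to standard text-generation sampling parameters. As a reference for what each knob does, here is a hedged sketch using Hugging Face's `GenerationConfig` as a stand-in; Vi's own field names may differ, but the semantics are the same.

```python
# Sketch of typical inference/sampling settings using Hugging Face's
# GenerationConfig as a stand-in; Vi's evaluation settings may be named differently.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=256,      # cap on response length
    do_sample=True,          # enable sampling instead of greedy decoding
    temperature=0.7,         # higher = more diverse, creative outputs
    top_p=0.9,               # nucleus sampling: restrict to most probable tokens
    repetition_penalty=1.1,  # values above 1.0 discourage repeated phrases
)
```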

Learn more about evaluation settings →

Default vs. Custom Settings

For most use cases, Vi's default settings provide a solid starting point:

| Setting Category | Default Behavior | When to Customize |
| --- | --- | --- |
| Architecture Size | Recommended size for selected model | When optimizing for specific hardware or accuracy needs |
| Training Mode | Full fine-tuning | When memory is limited (use LoRA) |
| Hyperparameters | Balanced for convergence and speed | When training isn't converging or you need faster iterations |
| Evaluation | Moderate creativity, balanced length | When outputs are too short, too long, or too repetitive |

Start with Defaults

Vi's default settings are tuned for most common use cases. Start with defaults and adjust only if you encounter specific issues like slow training, poor convergence, or suboptimal outputs.

View all settings and customization options →


Configuration Workflow

Follow this process when configuring your model:

  1. Select model architecture based on your task and resources
  2. Configure dataset with appropriate train-test split
  3. Define system prompt to guide model behavior
  4. Adjust model settings if needed (or use defaults)
  5. Create workflow to save your configuration
  6. Start training run to train your model
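Conceptually, the result of this process is a single workflow configuration. The sketch below is purely hypothetical — Vi workflows are created through the UI, and every field name here is invented for illustration — but it shows how the pieces from each step fit together.

```python
# Hypothetical workflow configuration for illustration only;
# this is not Vi's actual API or schema.
workflow = {
    "name": "vqa-workflow-v1",
    "architecture": {"model": "Qwen2.5-VL", "size": "7B"},        # step 1
    "dataset": {"id": "my-dataset", "train_split": 0.8},          # step 2
    "system_prompt": "Answer questions about the given image.",   # step 3
    "model_settings": {                                           # step 4 (or use defaults)
        "training_mode": "lora",
        "epochs": 5,
        "learning_rate": 2e-5,
    },
}
# Steps 5-6: create the workflow from this configuration, then start a training run.
```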

Configuration Tips

Experiment with multiple configurations

Create multiple workflows within your training project to test different configurations:

  • Compare different model architectures on the same dataset
  • Test various hyperparameter combinations
  • Optimize for different deployment scenarios (cloud vs. edge)

Each workflow maintains its own configuration and training runs, making it easy to track and compare results.

Monitor resource usage

Keep track of your compute consumption as you experiment with different configurations:

  • Larger models consume more compute credits
  • Longer training (more epochs) increases costs
  • Full fine-tuning uses more resources than LoRA

Balance accuracy requirements with available resources by testing smaller models first.

Iterate based on results

After your first training run completes:

  1. Evaluate model performance on test data
  2. Identify issues: Poor accuracy? Slow training? Repetitive outputs?
  3. Adjust settings: Change architecture, hyperparameters, or evaluation settings
  4. Create new workflow with updated configuration
  5. Compare results across multiple runs

Use the evaluation metrics to guide your configuration decisions.


Common Configuration Scenarios

Fast prototyping and experimentation

Goal: Quick iterations to validate your approach

Recommended Configuration:

  • Architecture: Qwen2.5-VL 3B or NVILA-Lite 2B
  • Training Mode: LoRA (faster, memory-efficient)
  • Epochs: Start with 3-5 for quick validation
  • Batch Size: Larger (8-16) for faster training

This configuration minimizes training time and cost while providing enough signal to validate your data quality and approach.

Production deployment with high accuracy

Goal: Maximum performance for production use

Recommended Configuration:

  • Architecture: Qwen2.5-VL 32B or InternVL3.5 8B
  • Training Mode: Full fine-tuning for maximum quality
  • Epochs: 10-20 for thorough training
  • Learning Rate: Conservative (1e-5 to 5e-5)
  • Evaluation: Lower temperature (0.1-0.5) for consistent outputs

This configuration prioritizes accuracy and consistency for production-quality results.

Resource-constrained environments

Goal: Train effectively with limited GPU resources

Recommended Configuration:

  • Architecture: NVILA-Lite 2B or Qwen2.5-VL 3B
  • Training Mode: LoRA with 4-bit quantization
  • Batch Size: Small (1-4) with gradient accumulation
  • Precision: Mixed precision (BF16 or FP16)

This configuration enables training on consumer GPUs (8-16GB) while maintaining reasonable quality.
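If you wanted to reproduce this recipe outside Vi with open-source tooling, it corresponds roughly to a QLoRA-style setup. The sketch below uses Hugging Face `transformers`, `peft`, and `bitsandbytes` as stand-ins — it is an assumption about equivalent settings, not Vi's implementation.

```python
# Rough open-source equivalent of the resource-constrained recipe
# (4-bit quantization + LoRA + small batch + gradient accumulation).
# A sketch for orientation, not Vi's implementation.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization for the base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # mixed-precision compute
)

lora_config = LoraConfig(
    r=16,                                   # low-rank adapter size
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,          # small batch to fit in 8-16 GB of VRAM
    gradient_accumulation_steps=8,          # effective batch size of 16
    bf16=True,                              # mixed precision
    num_train_epochs=5,
    learning_rate=2e-5,
)
```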

Creative or diverse outputs

Goal: Generate varied, creative responses

Recommended Configuration:

  • Architecture: Qwen2.5-VL 7B or Cosmos-Reason1 7B
  • Temperature: Higher (0.7-1.0) for more creativity
  • Top P: Moderate (0.8-0.95) for diverse sampling
  • Repetition Penalty: Moderate (1.1-1.3) to avoid repetition

This configuration encourages diverse, creative outputs while maintaining coherence.


Next Steps

Once you've configured your model:

  • Create your workflow to save the configuration
  • Start a training run to train your model
  • Monitor the run and evaluate the results to guide your next iteration


Common Questions

Can I change model settings after creating a workflow?

No, settings are fixed when you create a workflow. To try different settings, create a new workflow with the desired configuration.

You can maintain multiple workflows within a training project, each with different settings. This makes it easy to compare configurations and choose the best performing one.

What's the difference between model architecture and model settings?

Model Architecture (detailed guide) refers to the fundamental VLM design:

  • Which base model (Qwen2.5-VL, NVILA-Lite, etc.)
  • Model family and capabilities
  • Available parameter sizes

Model Settings (detailed guide) refers to configuration options:

  • How to train the model (training mode, hyperparameters)
  • Memory optimizations (quantization, precision)
  • Inference behavior (temperature, sampling parameters)

Think of architecture as "which car model" and settings as "how you tune and drive it."

Should I use full fine-tuning or LoRA?

Use Full Fine-Tuning when:

  • Maximum accuracy is critical
  • You have sufficient GPU resources (32GB+)
  • Training time is not a constraint
  • Deploying to production with quality requirements

Use LoRA when:

  • GPU memory is limited (< 32GB)
  • You need faster training iterations
  • Experimenting with multiple configurations
  • Training budget is constrained

Both can achieve excellent results. LoRA trades a small amount of potential accuracy for significant efficiency gains. Learn more about training modes.
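The efficiency gain comes from training a small set of adapter weights instead of every parameter. As an illustration using the open-source `peft` library (not Vi's internals, which you configure through the UI), wrapping a model in a LoRA adapter typically leaves well under 1% of parameters trainable. The small text-only checkpoint below is just a lightweight stand-in to show the mechanics.

```python
# Illustration of LoRA's parameter efficiency with the open-source peft library;
# the checkpoint is a small stand-in, not one of Vi's supported VLMs.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
lora = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Prints trainable vs. total parameters; typically well under 1% trainable.
lora.print_trainable_parameters()
```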

How do I know if my settings are working?

Monitor these indicators during and after training:

During training (monitor runs):

  • Loss should steadily decrease
  • Training shouldn't stall or diverge
  • Memory usage should be stable

After training (evaluate model):

  • Test accuracy meets your requirements
  • Outputs are relevant and coherent
  • Generation length is appropriate
  • No excessive repetition in responses

If results are poor, adjust settings and create a new workflow to test the changes.

What settings have the biggest impact on results?

In order of impact:

  1. Model Architecture & Size: Larger models generally perform better
  2. Training Data Quality: Good data matters more than settings
  3. Epochs: Too few = underfitting, too many = overfitting
  4. Learning Rate: Affects convergence speed and final quality
  5. Batch Size & Gradient Accumulation: Impacts training stability
  6. Evaluation Settings: Control output quality and diversity

Start by selecting the right model architecture, then focus on epochs and learning rate before fine-tuning other settings.

Can I use the same settings for different datasets?

Settings that typically transfer well:

  • Model architecture (if tasks are similar)
  • Training mode (LoRA vs. full fine-tuning)
  • Evaluation settings (temperature, top P, etc.)

Settings that may need adjustment:

  • Epochs: Depends on dataset size and complexity
  • Learning Rate: May need tuning for different data distributions
  • Batch Size: Might change based on dataset size

Create a new workflow for each dataset to track settings and results independently. You can start with settings that worked well on similar datasets and adjust as needed.


Related Resources
