Train a Model
Fine-tune vision-language models on your data to create custom AI tailored to your specific use case.
Train custom vision-language models (VLMs) by fine-tuning state-of-the-art architectures on your annotated data. Transform general-purpose models into specialized AI that understands your domain, terminology, and visual patterns.
New to VLM training? Start with the quickstart guide to train your first model in under an hour. This section provides comprehensive documentation for production training workflows.
Prerequisites
- Prepared dataset with images and annotations
- Training project set up in Datature Vi
- Understanding of VLM concepts — phrase grounding or VQA
- Compute Credits for GPU training resources
Training workflow overview
Training a VLM on Datature Vi follows a structured workflow with five key stages:
1. Create Workflow
↓
2. Configure System Prompt
↓
3. Configure Dataset
↓
4. Configure Model
↓
5. Launch Training Run
Each stage builds on the previous one, culminating in a reusable workflow that you can execute across multiple training runs with consistent configuration.
Why workflows?
Workflows are reusable training configurations that capture your complete setup:
- System prompt — Natural language instructions that guide model behavior
- Dataset configuration — Train-test splits, shuffling, and data distribution
- Model selection — Architecture choice and training parameters
- Reproducibility — Run the same configuration multiple times for experimentation
Once configured, workflows enable rapid iteration and experimentation without reconfiguring settings each time.
Training process stages
- Create a workflow — Set up a reusable training configuration with system prompt, dataset, and model settings
- Configure your system prompt — Write natural language instructions that define your VLM's task and behavior
- Configure your dataset — Set train-test splits, enable shuffling, and optimize data distribution
- Configure your model — Select a VLM architecture and tune training parameters for optimal performance
- Configure training settings — Set checkpoint strategy, select GPU hardware, and validate your dataset before launch
- Monitor training runs — Track training progress, view metrics, and manage active or completed runs
Stage 1: Create a workflow
Workflows are the foundation of your training configuration. They encapsulate all settings required to train a model and can be reused across multiple training runs.
What you'll configure:
- System prompt configuration — Instructions that guide model behavior
- Dataset selection and splitting — Choose data source and configure train-test splits
- Model architecture and settings — Select VLM and configure training parameters
Complete guide: Create a Workflow →
Stage 2: Configure system prompt
System prompts are natural language instructions that define what your VLM should learn during training. They shape model behavior for both phrase grounding and VQA tasks.
Key decisions:
- Task definition — What should the model detect or answer?
- Domain terminology — Industry-specific language and concepts
- Output format — How should the model structure responses?
- Edge case handling — Behavior for ambiguous or out-of-scope inputs
Example system prompts:
Phrase grounding prompt example
You are an AI assistant specialized in identifying printed circuit board (PCB)
components in manufacturing images. Given a text description of a component
type (e.g., "capacitor", "resistor", "IC chip"), locate and mark all instances
of that component in the image with precise bounding boxes.
Focus on:
- Component type identification based on visual characteristics
- Clear distinction between similar components (e.g., capacitors vs. resistors)
- Accurate bounding box placement around component boundaries
- Detection of partially visible or occluded components
If the described component is not present in the image, respond with no
bounding boxes.
VQA prompt example
You are a quality control AI assistant for automotive manufacturing. Answer
questions about vehicle assembly images with clear, specific responses based
only on what is visible in the image.
Your responses should:
- Be concise and factual (1-2 sentences maximum)
- Reference specific visual evidence when possible
- Indicate uncertainty when details are unclear or ambiguous
- Use standard automotive terminology
For yes/no questions, provide reasoning in parentheses. For count questions,
provide the exact number visible. For defect questions, describe the type and
location clearly.
Complete guide: Configure Your System Prompt →
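To make the relationship between the system prompt and your annotations concrete, the sketch below shows how a system prompt, an image, and an annotated answer typically combine into a single chat-style training sample. The field names follow the common "messages" convention used by many VLM toolkits and are illustrative only, not the Datature Vi data format; the file name and question are hypothetical.

```python
# Illustrative chat-style VLM training sample (not the Datature Vi schema).
system_prompt = (
    "You are a quality control AI assistant for automotive manufacturing. "
    "Answer questions about vehicle assembly images with clear, specific "
    "responses based only on what is visible in the image."
)

sample = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "assembly_0421.jpg"},  # hypothetical image file
                {"type": "text", "text": "Are all four wheel bolts installed?"},
            ],
        },
        # The annotated answer becomes the assistant turn the model learns to produce.
        {"role": "assistant", "content": "Yes (all four bolts are visible and fully seated)."},
    ]
}
```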
Stage 3: Configure dataset
Dataset configuration determines how your training data is split, shuffled, and distributed across training, validation, and test sets. Proper configuration ensures your model learns effectively and generalizes to new data.
Configuration options:
| Setting | Purpose | Best Practice |
|---|---|---|
| Dataset source | Which dataset to use | Select dataset matching your task (phrase grounding or VQA) |
| Train-test split | Ratio of training vs. validation data | 80/20 split for most use cases |
| Shuffling | Randomize data order | Enable to prevent bias from sequential data |
| Stratification | Balance classes across splits | Enable when classes have unequal representation |
Common split strategies:
- 80/20 split — Standard for most datasets (1000+ images)
- 90/10 split — When the dataset is small (< 500 images) and you need to maximize training data
- 70/20/10 split — Add dedicated test set for final evaluation
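Datature Vi applies these options for you when you configure the dataset stage, but if you want to see what an 80/20 shuffled, stratified split means in practice, here is a minimal sketch using scikit-learn. The record structure and class labels are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical annotation records: each entry pairs an image with its class label.
records = [{"image": f"img_{i:04d}.jpg", "label": "defect" if i % 5 == 0 else "ok"}
           for i in range(1000)]
labels = [r["label"] for r in records]

# 80/20 split with shuffling and stratification, so the minority 'defect' class
# keeps the same proportion in both the training and validation sets.
train_set, val_set = train_test_split(
    records,
    test_size=0.2,      # 20% held out for validation
    shuffle=True,       # randomize order to avoid sequential bias
    stratify=labels,    # preserve class balance across splits
    random_state=42,    # fixed seed for reproducibility
)

print(len(train_set), len(val_set))  # 800 200
```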
Complete guide: Configure Your Dataset →
Stage 4: Configure model
Model configuration involves two key choices: selecting the right architecture for your task and tuning training settings to optimize performance.
Model architecture selection
Choose from 4 state-of-the-art VLM architectures (with more coming soon):
| Architecture | Sizes Available | Best For | Key Strengths |
|---|---|---|---|
| Qwen2.5-VL | 3B, 7B, 32B | General VLM tasks, phrase grounding, VQA | Dynamic resolution, extended context (128K tokens) |
| NVIDIA NVILA-Lite | 2B | Resource-constrained deployments | Efficiency, fast inference |
| NVIDIA Cosmos-Reason1 | 7B | Complex reasoning tasks | Logical inference, multi-step analysis |
| OpenGVLab InternVL3.5 | 8B | Balanced performance | Fine-grained visual understanding |
Coming Soon:
- DeepSeek OCR — Specialized OCR model for document understanding
- LLaVA-NeXT — Advanced multimodal reasoning with improved visual comprehension
Architecture selection factors:
- Task complexity — Larger models (7B+) for complex reasoning; smaller models (1-2B) for speed
- Compute budget — Larger models require more GPU memory and training time
- Inference speed — Smaller models deploy faster with lower latency
- Domain requirements — Specialized architectures for OCR, multilingual, or document tasks
View complete architecture comparison →
Model settings and hyperparameters
Configure training behavior and optimization parameters:
Core settings:
- Training mode — LoRA fine-tuning (efficient) or full fine-tuning (maximum customization)
- Batch size — Number of samples per training step (affects GPU memory and convergence)
- Learning rate — Controls speed and stability of model updates
- Epochs — Number of complete passes through training data
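As a rough illustration of how these core settings fit together, the sketch below collects them into a single configuration object. The field names and default values are illustrative assumptions, not the Datature Vi settings schema.

```python
from dataclasses import dataclass

# Illustrative grouping of the core settings above; not the Datature Vi schema.
@dataclass
class TrainingConfig:
    training_mode: str = "lora"   # "lora" (efficient) or "full" (maximum customization)
    batch_size: int = 8           # samples per step; limited mainly by GPU memory
    learning_rate: float = 2e-4   # typical starting point for LoRA runs; lower for full fine-tuning
    epochs: int = 3               # complete passes through the training data

config = TrainingConfig(training_mode="lora", epochs=3)
print(config)
```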
Complete guide: Configure Your Model →
Stage 5: Launch training run
After configuring your workflow, launch a training run with specific hardware and checkpoint settings.
Training run configuration:
- Advanced settings — Checkpoint frequency and evaluation options
- Hardware configuration — GPU type and quantity selection
- Dataset validation — Automatic checks ensure data readiness
- Review summary — Verify configuration before launch
GPU selection guidance:
| Model Size | Minimum GPU | Recommended GPU | Estimated Training Time* |
|---|---|---|---|
| 0.5-2B params | T4 (16 GB) | L4 (24 GB) | 1-2 hours |
| 3-7B params | A10G (24 GB) | A100 (40 GB) | 2-4 hours |
| 7-13B params | A100 (40 GB) | A100 (80 GB) | 4-8 hours |
| 13B+ params | A100 (80 GB) | H100 (80 GB) | 8-16 hours |
*For 1,000 images, 3 epochs, with LoRA fine-tuning
Compute costs:
Training consumes Compute Credits based on GPU type and duration:
Example: 4× A10G GPUs for 2 hours
- Usage multiplier: 10.0 credits/minute
- Duration: 120 minutes
- Total cost: 1,200 Compute Credits
Complete guide: Configure Training Settings →
View GPU pricing and specifications →
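The cost arithmetic in the example above is a straight multiplication of the usage multiplier by the run duration. A minimal sketch, assuming the multiplier already accounts for the number and type of GPUs:

```python
def estimate_credits(credits_per_minute: float, duration_minutes: float) -> float:
    """Compute Credits consumed = usage multiplier x wall-clock duration."""
    return credits_per_minute * duration_minutes

# Example from above: 4x A10G GPUs at a 10.0 credits/minute multiplier for 2 hours.
print(estimate_credits(credits_per_minute=10.0, duration_minutes=120))  # 1200.0
```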
After training completes
Once training finishes, you can evaluate performance, compare runs, and deploy your model.
- Evaluate a model — Review metrics, analyze predictions, and assess model performance
- Compare runs — Analyze different configurations to find optimal settings
- Export your model — Export the trained model for deployment or external use
Training best practices
Start with default settings
For your first training run:
- Use default system prompts for your task type
- Configure 80/20 train-test split with shuffling enabled
- Select a 2-7B parameter model for balanced performance
- Start with LoRA training mode for efficiency
- Use 1-2 GPUs to validate configuration
After validation:
- Customize system prompt for domain-specific terminology
- Experiment with larger models for improved accuracy
- Scale to multi-GPU for faster training
- Fine-tune hyperparameters based on initial results
Optimize for your use case
For high accuracy requirements:
- Use larger models (7B-32B parameters)
- Train for more epochs (5-10)
- Enable Advanced Evaluation for detailed monitoring
- Increase dataset size (2,000+ annotated images)
For fast inference:
- Use smaller models (0.5-2B parameters)
- Optimize batch size for your deployment hardware
- Test inference speed after training
- Consider quantization for deployment
For limited compute budget:
- Start with smaller models (1-2B)
- Use LoRA fine-tuning instead of full fine-tuning
- Train with fewer epochs initially (1-3)
- Use 1× T4 GPU for experimentation
Iterate systematically
Effective experimentation workflow:
- Baseline run — Train with default settings to establish baseline performance
- One change at a time — Modify a single variable per run (e.g., learning rate only)
- Document results — Track metrics, settings, and observations for each run
- Compare objectively — Use evaluation metrics to assess improvements
- Scale gradually — Apply successful changes to larger models or datasets
Variables to experiment with (in order of impact):
- Dataset size and quality (biggest impact)
- System prompt customization
- Model architecture selection
- Learning rate and training epochs
- Batch size and optimization settings
Monitor and validate
During training:
- Monitor training runs for loss curves and convergence
- Watch for overfitting indicators
- Check GPU utilization and memory usage
- Review checkpoint metrics at regular intervals
After training:
- Evaluate on validation set to assess generalization
- Test on held-out examples not seen during training
- Compare predictions against ground truth annotations
- Verify model performs well on edge cases and difficult examples
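One simple way to spot overfitting from exported loss curves is to check whether validation loss has started rising while training loss keeps falling. A minimal sketch with made-up loss values:

```python
# Made-up per-epoch losses for illustration; substitute the values from your run.
train_loss = [1.90, 1.42, 1.10, 0.85, 0.66, 0.52]
val_loss   = [1.95, 1.50, 1.21, 1.05, 1.08, 1.16]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
if val_loss[-1] > val_loss[best_epoch]:
    print(f"Validation loss bottomed out at epoch {best_epoch + 1} and is now rising "
          "while training loss keeps falling: a classic overfitting signal.")
else:
    print("Validation loss is still improving; consider training for more epochs.")
```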
Common questions
How much data do I need to train a VLM?
Minimum requirements:
- Phrase grounding: 100-500 annotated image-text pairs
- VQA: 200-1,000 question-answer pairs
Recommended for production:
- Phrase grounding: 1,000+ annotated pairs across diverse scenarios
- VQA: 2,000+ diverse question-answer pairs
Quality over quantity: 500 high-quality, diverse annotations outperform 2,000 repetitive or low-quality annotations.
How long does training take?
Typical training times:
| Model Size | Dataset Size | GPU Configuration | Estimated Time |
|---|---|---|---|
| 1-2B params | 500 images | 1× T4 | 1-2 hours |
| 2-7B params | 1,000 images | 1× A10G | 2-4 hours |
| 7-13B params | 2,000 images | 2× A100 (40GB) | 4-8 hours |
| 13B+ params | 5,000 images | 4× A100 (80GB) | 8-16 hours |
Factors affecting training time:
- Model architecture size (larger = slower)
- Dataset size and image resolution
- Number of training epochs
- GPU type and quantity
- Training mode (LoRA vs. full fine-tuning)
Can I pause or resume training?
Automatic pausing:
Training pauses automatically when:
- Compute Credits are depleted
- Infrastructure issues occur
- Manual cancellation is requested
Resuming training:
- Training resumes from the last saved checkpoint
- No progress is lost between checkpoints
- Refill Compute Credits to resume automatically
Manual control:
- Kill a run to stop permanently
- Saved checkpoints are preserved even after cancellation
- Create new runs from existing workflows to restart with different settings
What's the difference between LoRA and full fine-tuning?
LoRA (Low-Rank Adaptation) fine-tuning:
- ✅ Faster training (2-3× speedup)
- ✅ Lower GPU memory requirements
- ✅ Smaller model files for deployment
- ✅ Recommended for most use cases
- ❌ Slightly lower accuracy ceiling than full fine-tuning
Full fine-tuning:
- ✅ Maximum customization potential
- ✅ Highest possible accuracy
- ✅ Best for highly specialized domains
- ❌ Longer training time
- ❌ Requires more GPU memory
- ❌ Larger deployment model size
Recommendation: Start with LoRA. Move to full fine-tuning only if LoRA results are insufficient for your requirements.
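Datature Vi manages LoRA internally, so you only choose the training mode, but the mechanics are worth seeing once. The sketch below uses the Hugging Face peft library on a small stand-in language model; the model ID, rank, and target modules are illustrative assumptions, not the platform's internal settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# A small language model stands in for the much larger VLM; the mechanics are identical.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank adapter matrices
    lora_alpha=32,                        # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted
)

model = get_peft_model(base, lora_config)
# Only the small adapter matrices are trainable; the frozen base weights are
# untouched, which is where LoRA's memory and speed savings come from.
model.print_trainable_parameters()
```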
How do I improve model accuracy?
Most effective improvements (in order):
1. Add more training data (biggest impact)
  - Increase dataset size
  - Improve annotation quality and consistency
  - Add examples covering edge cases
2. Refine system prompt (second biggest impact)
  - Add domain-specific terminology
  - Clarify ambiguous instructions
  - Provide examples of desired behavior
3. Optimize hyperparameters
  - Adjust learning rate
  - Increase training epochs
  - Tune batch size
4. Try a larger model architecture
  - Upgrade from 2B to 7B parameters
  - Consider specialized architectures for your task
5. Improve data quality
  - Fix annotation errors
  - Remove ambiguous or low-quality examples
  - Balance class distribution
Related resources
- Create a training project — Set up your training environment
- Prepare your dataset — Upload images and create annotations
- Phrase grounding concepts — Understanding text-guided object detection
- Visual question answering concepts — Understanding VQA tasks
- Resource usage — Understanding Compute Credits and GPU pricing
- Monitor training runs — Track training progress in real-time
- Manage workflows — Edit, delete, and organize workflows
- Create a workflow — Define training configuration
- Configure your model — Select model architecture and settings
- Configure training settings — Set checkpoint strategy and GPU
- Evaluate a model — Assess model performance and quality
- Manage runs — Kill or delete runs
- Quickstart — End-to-end training tutorial
Need help?
We're here to support your VLMOps journey. Reach out through any of our support channels.