Train a Model

Fine-tune vision-language models on your data to create custom AI tailored to your specific use case.

Train custom vision-language models (VLMs) by fine-tuning state-of-the-art architectures on your annotated data. Transform general-purpose models into specialized AI that understands your domain, terminology, and visual patterns.

💡 New to VLM training?

Start with the quickstart guide to train your first model in under an hour. This section provides comprehensive documentation for production training workflows.

Get started with quickstart →


Training workflow overview

Training a VLM on Datature Vi follows a structured workflow with five key stages:

1. Create Workflow
   ↓
2. Configure System Prompt
   ↓
3. Configure Dataset
   ↓
4. Configure Model
   ↓
5. Launch Training Run

Each stage builds on the previous one, culminating in a reusable workflow that you can execute across multiple training runs with consistent configuration.

Why workflows?

Workflows are reusable training configurations that capture your complete setup:

  • System prompt — Natural language instructions that guide model behavior
  • Dataset configuration — Train-test splits, shuffling, and data distribution
  • Model selection — Architecture choice and training parameters
  • Reproducibility — Run the same configuration multiple times for experimentation

Once configured, workflows enable rapid iteration and experimentation without reconfiguring settings each time.


Training process stages


Stage 1: Create a workflow

Workflows are the foundation of your training configuration. They encapsulate all settings required to train a model and can be reused across multiple training runs.

What you'll configure: a workflow name plus the system prompt, dataset, and model settings detailed in the stages below.

Complete guide: Create a Workflow →


Stage 2: Configure system prompt

System prompts are natural language instructions that define what your VLM should learn during training. They shape model behavior for both phrase grounding and VQA tasks.

Key decisions:

  • Task definition — What should the model detect or answer?
  • Domain terminology — Industry-specific language and concepts
  • Output format — How should the model structure responses?
  • Edge case handling — Behavior for ambiguous or out-of-scope inputs

Example system prompts:

Phrase grounding prompt example
You are an AI assistant specialized in identifying printed circuit board (PCB)
components in manufacturing images. Given a text description of a component
type (e.g., "capacitor", "resistor", "IC chip"), locate and mark all instances
of that component in the image with precise bounding boxes.

Focus on:
- Component type identification based on visual characteristics
- Clear distinction between similar components (e.g., capacitors vs. resistors)
- Accurate bounding box placement around component boundaries
- Detection of partially visible or occluded components

If the described component is not present in the image, respond with no
bounding boxes.

VQA prompt example
You are a quality control AI assistant for automotive manufacturing. Answer
questions about vehicle assembly images with clear, specific responses based
only on what is visible in the image.

Your responses should:
- Be concise and factual (1-2 sentences maximum)
- Reference specific visual evidence when possible
- Indicate uncertainty when details are unclear or ambiguous
- Use standard automotive terminology

For yes/no questions, provide reasoning in parentheses. For count questions,
provide the exact number visible. For defect questions, describe the type and
location clearly.
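Under the hood, a system prompt is typically combined with each training example in a chat-style message structure. The sketch below is illustrative only; the field names are assumptions, not Datature Vi's internal format:

```python
# A chat-style message structure pairing a system prompt with one training
# example. Field names here are illustrative assumptions, not Vi's schema.
system_prompt = (
    "You are a quality control AI assistant for automotive manufacturing. "
    "Answer questions about vehicle assembly images with clear, specific "
    "responses based only on what is visible in the image."
)

example = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image", "path": "assembly_0042.jpg"},  # hypothetical file
            {"type": "text", "text": "Is the weld seam on the door frame complete?"},
        ]},
        {"role": "assistant",
         "content": "No (a visible gap remains at the lower hinge)."},
    ]
}

roles = [m["role"] for m in example["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```

Every training pair in your dataset shares the same system message, which is why refining the prompt affects behavior across the whole model.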

Complete guide: Configure Your System Prompt →


Stage 3: Configure dataset

Dataset configuration determines how your training data is split, shuffled, and distributed across training, validation, and test sets. Proper configuration ensures your model learns effectively and generalizes to new data.

Configuration options:

| Setting | Purpose | Best Practice |
| --- | --- | --- |
| Dataset source | Which dataset to use | Select the dataset matching your task (phrase grounding or VQA) |
| Train-test split | Ratio of training vs. validation data | 80/20 split for most use cases |
| Shuffling | Randomize data order | Enable to prevent bias from sequential data |
| Stratification | Balance classes across splits | Enable when classes have unequal representation |

Common split strategies:

  • 80/20 split — Standard for most datasets (1000+ images)
  • 90/10 split — When validation data is limited (< 500 images)
  • 70/20/10 split — Add dedicated test set for final evaluation
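The split strategies above can be sketched in plain Python. This is illustrative only, since Vi performs splitting and shuffling for you, and it omits stratification for brevity:

```python
import random

def split_dataset(samples, train_frac, val_frac, test_frac=0.0, seed=42):
    """Shuffle, then slice into train/val(/test) by the given fractions."""
    assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-9
    items = list(samples)
    random.Random(seed).shuffle(items)  # deterministic shuffle
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

images = [f"image_{i:04d}.jpg" for i in range(1000)]

train, val, _ = split_dataset(images, 0.8, 0.2)    # standard 80/20
tr, va, te = split_dataset(images, 0.7, 0.2, 0.1)  # with dedicated test set
print(len(train), len(val))       # 800 200
print(len(tr), len(va), len(te))  # 700 200 100
```

Shuffling before slicing is what prevents sequential bias: without it, a dataset sorted by capture date or class would put systematically different images in each split.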

Complete guide: Configure Your Dataset →


Stage 4: Configure model

Model configuration involves two key choices: selecting the right architecture for your task and tuning training settings to optimize performance.

Model architecture selection

Choose from 4 state-of-the-art VLM architectures (with more coming soon):

Available architectures:

| Architecture | Sizes Available | Best For | Key Strengths |
| --- | --- | --- | --- |
| Qwen2.5-VL | 3B, 7B, 32B | General VLM tasks, phrase grounding, VQA | Dynamic resolution, extended context (128K tokens) |
| NVIDIA NVILA-Lite | 2B | Resource-constrained deployments | Efficiency, fast inference |
| NVIDIA Cosmos-Reason1 | 7B | Complex reasoning tasks | Logical inference, multi-step analysis |
| OpenGVLab InternVL3.5 | 8B | Balanced performance | Fine-grained visual understanding |

Coming Soon:

  • DeepSeek OCR — Specialized OCR model for document understanding
  • LLaVA-NeXT — Advanced multimodal reasoning with improved visual comprehension

Architecture selection factors:

  • Task complexity — Larger models (7B+) for complex reasoning; smaller models (1-2B) for speed
  • Compute budget — Larger models require more GPU memory and training time
  • Inference speed — Smaller models deploy faster with lower latency
  • Domain requirements — Specialized architectures for OCR, multilingual, or document tasks

View complete architecture comparison →

Model settings and hyperparameters

Configure training behavior and optimization parameters:

Core settings:

  • Training mode — LoRA fine-tuning (efficient) or full fine-tuning (maximum customization)
  • Batch size — Number of samples per training step (affects GPU memory and convergence)
  • Learning rate — Controls speed and stability of model updates
  • Epochs — Number of complete passes through training data
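As a rough illustration, the core settings above map to a configuration like the following; the field names and values are hypothetical, not Vi's actual settings schema. The step-count arithmetic shows how batch size and epochs interact:

```python
# Hypothetical training configuration -- field names are illustrative,
# not Datature Vi's actual settings schema.
config = {
    "training_mode": "lora",  # "lora" (efficient) or "full" (maximum customization)
    "batch_size": 8,          # samples per optimizer step; bounded by GPU memory
    "learning_rate": 2e-4,    # controls speed and stability of model updates
    "epochs": 3,              # complete passes through the training data
}

# Optimizer steps implied by these settings on a 1,000-image dataset:
dataset_size = 1000
steps_per_epoch = -(-dataset_size // config["batch_size"])  # ceil division
total_steps = steps_per_epoch * config["epochs"]
print(total_steps)  # 375
```

Doubling the batch size halves the number of steps per epoch (while roughly doubling per-step GPU memory), which is why the two settings are usually tuned together.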

Complete guide: Configure Your Model →


Stage 5: Launch training run

After configuring your workflow, launch a training run with specific hardware and checkpoint settings.

Training run configuration:

  1. Advanced settings — Checkpoint frequency and evaluation options
  2. Hardware configuration — GPU type and quantity selection
  3. Dataset validation — Automatic checks ensure data readiness
  4. Review summary — Verify configuration before launch

GPU selection guidance:

GPU recommendations by model size:

| Model Size | Minimum GPU | Recommended GPU | Estimated Training Time* |
| --- | --- | --- | --- |
| 0.5-2B params | T4 (16 GB) | L4 (24 GB) | 1-2 hours |
| 3-7B params | A10G (24 GB) | A100 (40 GB) | 2-4 hours |
| 7-13B params | A100 (40 GB) | A100 (80 GB) | 4-8 hours |
| 13B+ params | A100 (80 GB) | H100 (80 GB) | 8-16 hours |

*For 1,000 images, 3 epochs, with LoRA fine-tuning

Compute costs:

Training consumes Compute Credits based on GPU type and duration:

Example: 4× A10G GPUs for 2 hours
  Usage multiplier: 10.0 credits/minute
  Duration: 120 minutes
  Total cost: 1,200 Compute Credits
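The credit arithmetic in the example above is simply usage multiplier × duration. A one-line sketch (the 10.0 credits/minute figure is taken from the example; actual rates vary by GPU type and count):

```python
def training_cost(credits_per_minute: float, minutes: float) -> float:
    """Compute Credits consumed = usage multiplier x duration in minutes."""
    return credits_per_minute * minutes

# 4x A10G for 2 hours at a 10.0 credits/minute usage multiplier:
cost = training_cost(10.0, 120)
print(cost)  # 1200.0
```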

Complete guide: Configure Training Settings →

View GPU pricing and specifications →


After training completes

Once training finishes, you can evaluate performance, compare runs, and deploy your model.


Training best practices

Start with default settings

For your first training run:

  1. Use default system prompts for your task type
  2. Configure 80/20 train-test split with shuffling enabled
  3. Select a 2-7B parameter model for balanced performance
  4. Start with LoRA training mode for efficiency
  5. Use 1-2 GPUs to validate configuration

After validation:

  • Customize system prompt for domain-specific terminology
  • Experiment with larger models for improved accuracy
  • Scale to multi-GPU for faster training
  • Fine-tune hyperparameters based on initial results

Optimize for your use case

For high accuracy requirements:

  • Use larger models (7B-32B parameters)
  • Train for more epochs (5-10)
  • Enable Advanced Evaluation for detailed monitoring
  • Increase dataset size (2,000+ annotated images)

For fast inference:

  • Use smaller models (0.5-2B parameters)
  • Optimize batch size for your deployment hardware
  • Test inference speed after training
  • Consider quantization for deployment

For limited compute budget:

  • Start with smaller models (1-2B)
  • Use LoRA fine-tuning instead of full fine-tuning
  • Train with fewer epochs initially (1-3)
  • Use 1× T4 GPU for experimentation

Iterate systematically

Effective experimentation workflow:

  1. Baseline run — Train with default settings to establish baseline performance
  2. One change at a time — Modify single variable per run (e.g., learning rate only)
  3. Document results — Track metrics, settings, and observations for each run
  4. Compare objectively — Use evaluation metrics to assess improvements
  5. Scale gradually — Apply successful changes to larger models or datasets

Variables to experiment with (in order of impact):

  1. Dataset size and quality (biggest impact)
  2. System prompt customization
  3. Model architecture selection
  4. Learning rate and training epochs
  5. Batch size and optimization settings

Monitor and validate

During training:

  • Monitor training runs for loss curves and convergence
  • Watch for overfitting indicators
  • Check GPU utilization and memory usage
  • Review checkpoint metrics at regular intervals

After training:

  • Evaluate on validation set to assess generalization
  • Test on held-out examples not seen during training
  • Compare predictions against ground truth annotations
  • Verify model performs well on edge cases and difficult examples

Learn about model evaluation →


Common questions

How much data do I need to train a VLM?

Minimum requirements:

  • Phrase grounding: 100-500 annotated image-text pairs
  • VQA: 200-1,000 question-answer pairs

Recommended for production:

  • Phrase grounding: 1,000+ annotated pairs across diverse scenarios
  • VQA: 2,000+ diverse question-answer pairs

Quality over quantity: 500 high-quality, diverse annotations outperform 2,000 repetitive or low-quality annotations.

Learn about dataset preparation →

How long does training take?

Typical training times:

| Model Size | Dataset Size | GPU Configuration | Estimated Time |
| --- | --- | --- | --- |
| 1-2B params | 500 images | 1× T4 | 1-2 hours |
| 2-7B params | 1,000 images | 1× A10G | 2-4 hours |
| 7-13B params | 2,000 images | 2× A100 (40GB) | 4-8 hours |
| 13B+ params | 5,000 images | 4× A100 (80GB) | 8-16 hours |

Factors affecting training time:

  • Model architecture size (larger = slower)
  • Dataset size and image resolution
  • Number of training epochs
  • GPU type and quantity
  • Training mode (LoRA vs. full fine-tuning)

Plan training duration and costs →

Can I pause or resume training?

Automatic pausing:

Training pauses automatically when:

  • Compute Credits are depleted
  • Infrastructure issues occur
  • Manual cancellation requested

Resuming training:

  • Training resumes from the last saved checkpoint
  • No progress is lost between checkpoints
  • Refill Compute Credits to resume automatically

Manual control:

  • Kill a run to stop permanently
  • Saved checkpoints are preserved even after cancellation
  • Create new runs from existing workflows to restart with different settings

What's the difference between LoRA and full fine-tuning?

LoRA (Low-Rank Adaptation) fine-tuning:

  • ✅ Faster training (2-3× speedup)
  • ✅ Lower GPU memory requirements
  • ✅ Smaller model files for deployment
  • ✅ Recommended for most use cases
  • ❌ Slightly lower maximum accuracy ceiling

Full fine-tuning:

  • ✅ Maximum customization potential
  • ✅ Highest possible accuracy
  • ✅ Best for highly specialized domains
  • ❌ Longer training time
  • ❌ Requires more GPU memory
  • ❌ Larger deployment model size

Recommendation: Start with LoRA. Move to full fine-tuning only if LoRA results are insufficient for your requirements.
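LoRA's memory and file-size advantages follow directly from its low-rank math: instead of updating a full weight matrix, training only touches two small factor matrices. A sketch of the parameter counts, assuming a 4096 hidden size (typical for 7B-class transformers):

```python
# For a weight matrix W of shape (d_out, d_in), LoRA freezes W and trains two
# small factors B (d_out x r) and A (r x d_in), so the update is W + B @ A.
def trainable_params(d_in, d_out, rank=None):
    if rank is None:                  # full fine-tuning: every weight trains
        return d_in * d_out
    return rank * (d_in + d_out)      # LoRA: only the low-rank factors train

d = 4096  # assumed hidden size of one projection layer in a 7B-class model
full = trainable_params(d, d)            # 16,777,216 trainable weights
lora = trainable_params(d, d, rank=16)   # 131,072 trainable weights
print(f"LoRA trains {lora / full:.2%} of the weights in this layer")  # 0.78%
```

Training well under 1% of the weights per layer is what yields the faster runs, lower GPU memory use, and small adapter files listed above, at the cost of a slightly lower accuracy ceiling than updating every weight.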

Learn about training modes →

How do I improve model accuracy?

Most effective improvements (in order):

  1. Add more training data (biggest impact)

    • Increase dataset size
    • Improve annotation quality and consistency
    • Add examples covering edge cases
  2. Refine system prompt (second biggest impact)

    • Add domain-specific terminology
    • Clarify ambiguous instructions
    • Provide examples of desired behavior
  3. Optimize hyperparameters

  4. Try larger model architecture

    • Upgrade from 2B to 7B parameters
    • Consider specialized architectures for your task
  5. Improve data quality

    • Fix annotation errors
    • Remove ambiguous or low-quality examples
    • Balance class distribution

Complete guide to model evaluation →


Related resources