Train a Model

Fine-tune vision-language models on your data to create custom AI tailored to your specific use case.

Train custom vision-language models (VLMs) by fine-tuning state-of-the-art architectures on your annotated data. Transform general-purpose models into specialized AI that understands your domain, terminology, and visual patterns.

💡 New to VLM training?

Start with the quickstart guide to train your first model in under an hour. This section provides comprehensive documentation for production training workflows.

Get started with quickstart →


Training workflow overview

Training a VLM on Datature Vi follows a structured workflow with five key stages:

1. Create Workflow
   ↓
2. Configure System Prompt
   ↓
3. Configure Dataset
   ↓
4. Configure Model
   ↓
5. Launch Training Run

Each stage builds on the previous one, culminating in a reusable workflow that you can execute across multiple training runs with consistent configuration.

Why workflows?

Workflows are reusable training configurations that capture your complete setup:

  • System prompt — Natural language instructions that guide model behavior
  • Dataset configuration — Train-test splits, shuffling, and data distribution
  • Model selection — Architecture choice and training parameters
  • Reproducibility — Run the same configuration multiple times for experimentation

Once configured, workflows enable rapid iteration and experimentation without reconfiguring settings each time.


Training process stages


Stage 1: Create a workflow

Workflows are the foundation of your training configuration. They encapsulate all settings required to train a model and can be reused across multiple training runs.

What you'll configure: a workflow name plus the system prompt, dataset, and model settings detailed in the stages below.

Complete guide: Create a Workflow →


Stage 2: Configure system prompt

System prompts are natural language instructions that define what your VLM should learn during training. They shape model behavior for both phrase grounding and VQA tasks.

Key decisions:

  • Task definition — What should the model detect or answer?
  • Domain terminology — Industry-specific language and concepts
  • Output format — How should the model structure responses?
  • Edge case handling — Behavior for ambiguous or out-of-scope inputs

Example system prompts:

Phrase grounding prompt example
You are an AI assistant specialized in identifying printed circuit board (PCB)
components in manufacturing images. Given a text description of a component
type (e.g., "capacitor", "resistor", "IC chip"), locate and mark all instances
of that component in the image with precise bounding boxes.

Focus on:
- Component type identification based on visual characteristics
- Clear distinction between similar components (e.g., capacitors vs. resistors)
- Accurate bounding box placement around component boundaries
- Detection of partially visible or occluded components

If the described component is not present in the image, respond with no
bounding boxes.

VQA prompt example
You are a quality control AI assistant for automotive manufacturing. Answer
questions about vehicle assembly images with clear, specific responses based
only on what is visible in the image.

Your responses should:
- Be concise and factual (1-2 sentences maximum)
- Reference specific visual evidence when possible
- Indicate uncertainty when details are unclear or ambiguous
- Use standard automotive terminology

For yes/no questions, provide reasoning in parentheses. For count questions,
provide the exact number visible. For defect questions, describe the type and
location clearly.
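Under the hood, a system prompt is typically combined with each training example in a chat-style message structure. The sketch below is illustrative only; the field names are assumptions, not Datature Vi's internal format:

```python
# A chat-style message structure pairing a system prompt with one training
# example. Field names here are illustrative assumptions, not Vi's schema.
system_prompt = (
    "You are a quality control AI assistant for automotive manufacturing. "
    "Answer questions about vehicle assembly images with clear, specific "
    "responses based only on what is visible in the image."
)

example = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image", "path": "assembly_0042.jpg"},  # hypothetical file
            {"type": "text", "text": "Is the weld seam on the door frame complete?"},
        ]},
        {"role": "assistant",
         "content": "No (a visible gap remains at the lower hinge)."},
    ]
}

roles = [m["role"] for m in example["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```

Every training pair in your dataset shares the same system message, which is why refining the prompt affects behavior across the whole model.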

Complete guide: Configure Your System Prompt →


Stage 3: Configure dataset

Dataset configuration determines how your training data is split, shuffled, and distributed across training, validation, and test sets. Proper configuration ensures your model learns effectively and generalizes to new data.

Configuration options:

| Setting | Purpose | Best Practice |
| --- | --- | --- |
| Dataset source | Which dataset to use | Select the dataset matching your task (phrase grounding or VQA) |
| Train-test split | Ratio of training vs. validation data | 80/20 split for most use cases |
| Shuffling | Randomize data order | Enable to prevent bias from sequential data |
| Stratification | Balance classes across splits | Enable when classes have unequal representation |

Common split strategies:

  • 80/20 split — Standard for most datasets (1000+ images)
  • 90/10 split — When validation data is limited (< 500 images)
  • 70/20/10 split — Add dedicated test set for final evaluation
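The split strategies above can be sketched in plain Python. This is illustrative only, since Vi performs splitting and shuffling for you, and it omits stratification for brevity:

```python
import random

def split_dataset(samples, train_frac, val_frac, test_frac=0.0, seed=42):
    """Shuffle, then slice into train/val(/test) by the given fractions."""
    assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-9
    items = list(samples)
    random.Random(seed).shuffle(items)  # deterministic shuffle
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

images = [f"image_{i:04d}.jpg" for i in range(1000)]

train, val, _ = split_dataset(images, 0.8, 0.2)    # standard 80/20
tr, va, te = split_dataset(images, 0.7, 0.2, 0.1)  # with dedicated test set
print(len(train), len(val))       # 800 200
print(len(tr), len(va), len(te))  # 700 200 100
```

Shuffling before slicing is what prevents sequential bias: without it, a dataset sorted by capture date or class would put systematically different images in each split.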

Complete guide: Configure Your Dataset →


Stage 4: Configure model

Model configuration involves two key choices: selecting the right architecture for your task and tuning training settings to optimize performance.

Model architecture selection

Choose from 4 state-of-the-art VLM architectures (with more coming soon):

Available architectures:

| Architecture | Sizes Available | Best For | Key Strengths |
| --- | --- | --- | --- |
| Qwen2.5-VL | 3B, 7B, 32B | General VLM tasks, phrase grounding, VQA | Dynamic resolution, extended context (128K tokens) |
| NVIDIA NVILA-Lite | 2B | Resource-constrained deployments | Efficiency, fast inference |
| NVIDIA Cosmos-Reason1 | 7B | Complex reasoning tasks | Logical inference, multi-step analysis |
| OpenGVLab InternVL3.5 | 8B | Balanced performance | Fine-grained visual understanding |

Coming Soon:

  • DeepSeek OCR — Specialized OCR model for document understanding
  • LLaVA-NeXT — Advanced multimodal reasoning with improved visual comprehension

Architecture selection factors:

  • Task complexity — Larger models (7B+) for complex reasoning; smaller models (1-2B) for speed
  • Compute budget — Larger models require more GPU memory and training time
  • Inference speed — Smaller models deploy faster with lower latency
  • Domain requirements — Specialized architectures for OCR, multilingual, or document tasks

View complete architecture comparison →

Model settings and hyperparameters

Configure training behavior and optimization parameters:

Core settings:

  • Training mode — LoRA fine-tuning (efficient) or full fine-tuning (maximum customization)
  • Batch size — Number of samples per training step (affects GPU memory and convergence)
  • Learning rate — Controls speed and stability of model updates
  • Epochs — Number of complete passes through training data
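As a rough illustration, the core settings above map to a configuration like the following; the field names and values are hypothetical, not Vi's actual settings schema. The step-count arithmetic shows how batch size and epochs interact:

```python
# Hypothetical training configuration -- field names are illustrative,
# not Datature Vi's actual settings schema.
config = {
    "training_mode": "lora",  # "lora" (efficient) or "full" (maximum customization)
    "batch_size": 8,          # samples per optimizer step; bounded by GPU memory
    "learning_rate": 2e-4,    # controls speed and stability of model updates
    "epochs": 3,              # complete passes through the training data
}

# Optimizer steps implied by these settings on a 1,000-image dataset:
dataset_size = 1000
steps_per_epoch = -(-dataset_size // config["batch_size"])  # ceil division
total_steps = steps_per_epoch * config["epochs"]
print(total_steps)  # 375
```

Doubling the batch size halves the number of steps per epoch (while roughly doubling per-step GPU memory), which is why the two settings are usually tuned together.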

Complete guide: Configure Your Model →


Stage 5: Launch training run

After configuring your workflow, launch a training run with specific hardware and checkpoint settings.

Training run configuration:

  1. Advanced settings — Checkpoint frequency and evaluation options
  2. Hardware configuration — GPU type and quantity selection
  3. Dataset validation — Automatic checks ensure data readiness
  4. Review summary — Verify configuration before launch

GPU selection guidance:

GPU recommendations by model size:

| Model Size | Minimum GPU | Recommended GPU | Estimated Training Time* |
| --- | --- | --- | --- |
| 0.5-2B params | T4 (16 GB) | L4 (24 GB) | 1-2 hours |
| 3-7B params | A10G (24 GB) | A100 (40 GB) | 2-4 hours |
| 7-13B params | A100 (40 GB) | A100 (80 GB) | 4-8 hours |
| 13B+ params | A100 (80 GB) | H100 (80 GB) | 8-16 hours |

*For 1,000 images, 3 epochs, with LoRA fine-tuning

Compute costs:

Training consumes Compute Credits based on GPU type and duration:

Example: 4× A10G GPUs for 2 hours
  Usage multiplier: 10.0 credits/minute
  Duration: 120 minutes
  Total cost: 1,200 Compute Credits
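The credit arithmetic in the example above is simply usage multiplier × duration. A one-line sketch (the 10.0 credits/minute figure is taken from the example; actual rates vary by GPU type and count):

```python
def training_cost(credits_per_minute: float, minutes: float) -> float:
    """Compute Credits consumed = usage multiplier x duration in minutes."""
    return credits_per_minute * minutes

# 4x A10G for 2 hours at a 10.0 credits/minute usage multiplier:
cost = training_cost(10.0, 120)
print(cost)  # 1200.0
```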

Complete guide: Configure Training Settings →

View GPU pricing and specifications →


After training completes

Once training finishes, you can evaluate performance, compare runs, and deploy your model.


Training best practices

Start with default settings

For your first training run:

  1. Use default system prompts for your task type
  2. Configure 80/20 train-test split with shuffling enabled
  3. Select a 2-7B parameter model for balanced performance
  4. Start with LoRA training mode for efficiency
  5. Use 1-2 GPUs to validate configuration

After validation:

  • Customize system prompt for domain-specific terminology
  • Experiment with larger models for improved accuracy
  • Scale to multi-GPU for faster training
  • Fine-tune hyperparameters based on initial results

Optimize for your use case

For high accuracy requirements:

  • Use larger models (7B-32B parameters)
  • Train for more epochs (5-10)
  • Enable Advanced Evaluation for detailed monitoring
  • Increase dataset size (2,000+ annotated images)

For fast inference:

  • Use smaller models (0.5-2B parameters)
  • Optimize batch size for your deployment hardware
  • Test inference speed after training
  • Consider quantization for deployment

For limited compute budget:

  • Start with smaller models (1-2B)
  • Use LoRA fine-tuning instead of full fine-tuning
  • Train with fewer epochs initially (1-3)
  • Use 1× T4 GPU for experimentation

Iterate systematically

Effective experimentation workflow:

  1. Baseline run — Train with default settings to establish baseline performance
  2. One change at a time — Modify single variable per run (e.g., learning rate only)
  3. Document results — Track metrics, settings, and observations for each run
  4. Compare objectively — Use evaluation metrics to assess improvements
  5. Scale gradually — Apply successful changes to larger models or datasets

Variables to experiment with (in order of impact):

  1. Dataset size and quality (biggest impact)
  2. System prompt customization
  3. Model architecture selection
  4. Learning rate and training epochs
  5. Batch size and optimization settings

Monitor and validate

During training:

  • Monitor training runs for loss curves and convergence
  • Watch for overfitting indicators
  • Check GPU utilization and memory usage
  • Review checkpoint metrics at regular intervals

After training:

  • Evaluate on validation set to assess generalization
  • Test on held-out examples not seen during training
  • Compare predictions against ground truth annotations
  • Verify model performs well on edge cases and difficult examples

Learn about model evaluation →


Common questions

How much data do I need to train a VLM?

Minimum requirements:

  • Phrase grounding: 100-500 annotated image-text pairs
  • VQA: 200-1,000 question-answer pairs

Recommended for production:

  • Phrase grounding: 1,000+ annotated pairs across diverse scenarios
  • VQA: 2,000+ diverse question-answer pairs

Quality over quantity: 500 high-quality, diverse annotations outperform 2,000 repetitive or low-quality annotations.

Learn about dataset preparation →

How long does training take?

Typical training times:

| Model Size | Dataset Size | GPU Configuration | Estimated Time |
| --- | --- | --- | --- |
| 1-2B params | 500 images | 1× T4 | 1-2 hours |
| 2-7B params | 1,000 images | 1× A10G | 2-4 hours |
| 7-13B params | 2,000 images | 2× A100 (40GB) | 4-8 hours |
| 13B+ params | 5,000 images | 4× A100 (80GB) | 8-16 hours |

Factors affecting training time:

  • Model architecture size (larger = slower)
  • Dataset size and image resolution
  • Number of training epochs
  • GPU type and quantity
  • Training mode (LoRA vs. full fine-tuning)

Plan training duration and costs →

Can I pause or resume training?

Automatic pausing:

Training pauses automatically when:

  • Compute Credits are depleted
  • Infrastructure issues occur
  • Manual cancellation requested

Resuming training:

  • Training resumes from the last saved checkpoint
  • No progress is lost between checkpoints
  • Refill Compute Credits to resume automatically

Manual control:

  • Kill a run to stop permanently
  • Saved checkpoints are preserved even after cancellation
  • Create new runs from existing workflows to restart with different settings

What's the difference between LoRA and full fine-tuning?

LoRA (Low-Rank Adaptation) fine-tuning:

  • ✅ Faster training (2-3× speedup)
  • ✅ Lower GPU memory requirements
  • ✅ Smaller model files for deployment
  • ✅ Recommended for most use cases
  • ❌ Slightly lower maximum accuracy ceiling

Full fine-tuning:

  • ✅ Maximum customization potential
  • ✅ Highest possible accuracy
  • ✅ Best for highly specialized domains
  • ❌ Longer training time
  • ❌ Requires more GPU memory
  • ❌ Larger deployment model size

Recommendation: Start with LoRA. Move to full fine-tuning only if LoRA results are insufficient for your requirements.
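LoRA's memory and file-size advantages follow directly from its low-rank math: instead of updating a full weight matrix, training only touches two small factor matrices. A sketch of the parameter counts, assuming a 4096 hidden size (typical for 7B-class transformers):

```python
# For a weight matrix W of shape (d_out, d_in), LoRA freezes W and trains two
# small factors B (d_out x r) and A (r x d_in), so the update is W + B @ A.
def trainable_params(d_in, d_out, rank=None):
    if rank is None:                  # full fine-tuning: every weight trains
        return d_in * d_out
    return rank * (d_in + d_out)      # LoRA: only the low-rank factors train

d = 4096  # assumed hidden size of one projection layer in a 7B-class model
full = trainable_params(d, d)            # 16,777,216 trainable weights
lora = trainable_params(d, d, rank=16)   # 131,072 trainable weights
print(f"LoRA trains {lora / full:.2%} of the weights in this layer")  # 0.78%
```

Training well under 1% of the weights per layer is what yields the faster runs, lower GPU memory use, and small adapter files listed above, at the cost of a slightly lower accuracy ceiling than updating every weight.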

Learn about training modes →

How do I improve model accuracy?

Most effective improvements (in order):

  1. Add more training data (biggest impact)

    • Increase dataset size
    • Improve annotation quality and consistency
    • Add examples covering edge cases
  2. Refine system prompt (second biggest impact)

    • Add domain-specific terminology
    • Clarify ambiguous instructions
    • Provide examples of desired behavior
  3. Optimize hyperparameters

  4. Try larger model architecture

    • Upgrade from 2B to 7B parameters
    • Consider specialized architectures for your task
  5. Improve data quality

    • Fix annotation errors
    • Remove ambiguous or low-quality examples
    • Balance class distribution

Complete guide to model evaluation →


Related resources