Learn how vision-language model training works in Datature Vi. Understand epochs, batch size, learning rate, loss curves, overfitting, and validation splits in plain language.
Training a VLM means showing the model your annotated images repeatedly until it learns the patterns in your data. Datature Vi handles the GPU infrastructure. You configure the model, select your dataset, set a few training parameters, and launch. This page explains what happens during training and what each setting controls.
During training, Datature Vi feeds your annotated images through the model in small groups. The model tries to predict the correct output for each image, checks how far off it was, and adjusts its internal parameters to do better next time. This cycle repeats thousands of times.
1. Model sees a batch of annotated images. A small group of images (the "batch") is loaded from your dataset along with the corresponding annotations.
2. Model makes predictions. The model generates text output for each image based on your system prompt and the image content.
3. Predictions are compared to your annotations. The difference between the model's predictions and your ground-truth annotations is calculated using cross-entropy loss. This difference is called the loss.
4. Model adjusts its parameters to reduce the loss. The model updates its internal weights so it will make smaller errors on similar images next time.
5. Repeat across all images. Steps 1-4 repeat until the model has processed every image in the training set. One complete pass through all images is called one epoch. Training runs for multiple epochs.
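The five-step loop above can be sketched in plain Python. This is a toy stand-in, not Datature Vi's actual code: it fits a single weight with squared-error loss, whereas real VLM training swaps in a transformer, cross-entropy loss on tokens, and the AdamW optimizer.

```python
import random

# Toy training loop: fit one weight w so that prediction = w * x
# matches target = 2 * x. The structure mirrors the five steps above.
random.seed(0)
dataset = [(x, 2.0 * x) for x in range(1, 21)]  # (image, annotation) stand-ins

w = 0.0            # the model's single "parameter"
lr = 0.001         # learning rate
batch_size = 4
epochs = 50        # one epoch = one full pass over the dataset

for epoch in range(epochs):
    random.shuffle(dataset)
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]                    # step 1: load a batch
        preds = [w * x for x, _ in batch]                    # step 2: predict
        errors = [p - y for p, (_, y) in zip(preds, batch)]  # step 3: compare to annotations
        # step 4: adjust the parameter to reduce the (squared-error) loss
        grad = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        w -= lr * grad
    # step 5: repeat for the next epoch

print(round(w, 3))  # → 2.0
```

The same skeleton, with a much bigger model and a token-level loss, is what runs on Datature Vi's GPUs.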
After training completes, you get a trained model you can download and run inference on. See Train a Model for the full step-by-step guide.
Under the hood, Datature Vi uses HuggingFace Transformers with DeepSpeed for distributed training optimization. The training loop follows standard HuggingFace conventions: a cross-entropy loss on the generated tokens, the AdamW optimizer, and cosine learning rate scheduling with warmup. DeepSpeed handles memory-efficient training through ZeRO optimization stages, which allows fine-tuning larger models on fewer GPUs than would otherwise be possible.
You do not need to configure DeepSpeed or HuggingFace directly. Datature Vi selects the right optimization strategy based on the model size and GPU hardware you choose.
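The cosine learning rate schedule with warmup mentioned above has a simple shape: the rate ramps up linearly from zero, peaks, then decays along a half-cosine. A minimal sketch of the generic formula (the function name and exact step counts are illustrative, not Datature Vi's internal configuration):

```python
import math

def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Generic cosine learning-rate schedule with linear warmup:
    ramp from 0 to peak_lr over warmup_steps, then decay to 0
    along a half-cosine over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total, warmup, peak = 1000, 100, 1e-4
print(cosine_lr_with_warmup(50, total, warmup, peak))    # mid-warmup: 5e-05
print(cosine_lr_with_warmup(100, total, warmup, peak))   # peak: 0.0001
print(cosine_lr_with_warmup(1000, total, warmup, peak))  # end of training: 0.0
```

Warmup avoids large, destabilizing updates while the randomly initialized adapter weights are still far from sensible values; the cosine decay lets the model settle into a minimum near the end.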
Phrase grounding labels combine a caption, grounded phrases, and bounding boxes. Fine-tuning treats the supervised output as one text sequence the model should generate: natural language plus the numeric box corners in the same normalized [0, 1024] layout the model uses at inference ([x_min, y_min, x_max, y_max] per box; see Bounding box format). Cross-entropy runs over every token in that target, so coordinates are corrected through the same next-token objective as words and punctuation.
Dataset files store boxes in the import schema under Format specifications (Vi JSONL phrase grounding). For how validation turns emitted boxes into F1 and IoU, see Training metrics.
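Mapping a pixel-space box into the normalized [0, 1024] layout is a simple rescale. A sketch, assuming pixel boxes in [x_min, y_min, x_max, y_max] order; the helper name and rounding choice are illustrative:

```python
def to_model_coords(box_px, img_w, img_h):
    """Map a pixel-space [x_min, y_min, x_max, y_max] box into the
    normalized 0-1024 grid, independent of image resolution."""
    x_min, y_min, x_max, y_max = box_px
    return [
        round(x_min / img_w * 1024),
        round(y_min / img_h * 1024),
        round(x_max / img_w * 1024),
        round(y_max / img_h * 1024),
    ]

# A box covering the right half of a 1920x1080 image:
print(to_model_coords([960, 0, 1920, 1080], 1920, 1080))  # [512, 0, 1024, 1024]
```

Because the grid is resolution-independent, the model learns one coordinate vocabulary regardless of the source image sizes in your dataset.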
Training settings explained
You don't need to change most settings for your first run. Here's a quick reference, with details below.
| Setting | What it controls | Default advice |
| --- | --- | --- |
| Epochs | How many times the model sees all your images | More for small datasets, fewer for large |
| Batch size | Images processed at once before updating | Reduce if you get out-of-memory errors |
| Learning rate | How aggressively the model adjusts | Use the default unless loss spikes |
| Validation split | % of data held out for testing | 20% default works for most cases |
| LoRA vs full fine-tuning | How many parameters get updated | Start with LoRA, switch if you need more accuracy |
| System prompt | What the model looks for and how it responds | Must match between training and inference |
An epoch is one complete pass through your entire training dataset. If you set 50 epochs, the model sees every image 50 times.
More epochs means the model gets more practice. Too many epochs can cause overfitting, where the model memorizes your training data instead of learning general patterns. When this happens, training loss keeps dropping but validation loss starts rising.
| Dataset size | Recommended epochs |
| --- | --- |
| Under 100 images | 100-300 |
| 100-1,000 images | 50-150 |
| 1,000+ images | 20-100 |
Smaller datasets need more epochs because the model has fewer examples to learn from. Larger datasets need fewer passes because there is more variety in each epoch.
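What the model actually experiences is total optimizer steps and total views per image, which you can compute directly. A sketch with an illustrative helper (not a platform API):

```python
import math

def training_volume(num_images, epochs, batch_size):
    """How much practice a run provides: optimizer steps per epoch,
    total steps, and how many times each image is seen."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
        "views_per_image": epochs,
    }

# Small dataset, many epochs:
print(training_volume(num_images=80, epochs=200, batch_size=4))
# {'steps_per_epoch': 20, 'total_steps': 4000, 'views_per_image': 200}

# Large dataset, few epochs:
print(training_volume(num_images=5000, epochs=20, batch_size=4))
# {'steps_per_epoch': 1250, 'total_steps': 25000, 'views_per_image': 20}
```

Both runs deliver thousands of optimizer steps; the large dataset just spreads them over far more unique examples, which is why it needs fewer passes.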
Batch size is how many images the model processes at once before updating its parameters. A batch size of 4 means the model looks at 4 images, calculates the average error, then adjusts.
Larger batches train faster because more images are processed in parallel. But larger batches require more GPU memory. If you get an out-of-memory error, reduce your batch size first.
Typical values range from 1 to 8. The default works for most setups.
The learning rate controls how much the model adjusts its parameters after each batch. A learning rate of 0.0001 means small, careful adjustments. A rate of 0.01 means large, aggressive adjustments.
Too high and the model overshoots good solutions (you'll see the loss spike or oscillate). Too low and training takes much longer than necessary. The default works for most tasks.
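A toy demonstration of "too high": minimize f(w) = w² (gradient 2w) with plain gradient descent. At a careful rate the error shrinks every step; past a threshold each update overshoots the minimum and the error grows instead. The function and values are illustrative, not VLM-scale settings:

```python
def descend(lr, steps=20, w0=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w,
    starting from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(descend(lr=0.01)))  # careful steps: |w| shrinks toward 0
print(abs(descend(lr=1.1)))   # too aggressive: overshoots, |w| grows every step
```

The same dynamic produces the spiking or oscillating loss curves described later on this page.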
Datature Vi holds out 20% of your data by default. This held-out portion is called the validation set. The model never trains on these images.
At regular intervals during training, the model is tested on the validation set. This check tells you whether the model is learning general patterns or just memorizing the training images. If validation loss starts rising while training loss keeps falling, the model is overfitting.
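Conceptually, the split works like this sketch: shuffle once with a fixed seed, hold out the last 20%, and never train on it (the helper is illustrative; Datature Vi performs the split for you):

```python
import random

def split_dataset(items, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out the last val_fraction
    of items as a validation set the model never trains on."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))  # always hold out at least one item
    return items[:-n_val], items[-n_val:]  # (train, validation)

train, val = split_dataset(range(100))
print(len(train), len(val))   # 80 20
print(set(train) & set(val))  # set() — no image appears in both
```

Keeping the two sets disjoint is the whole point: any overlap would let memorization masquerade as generalization.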
LoRA (Low-Rank Adaptation) updates only a small fraction of the model's parameters during training. The rest stay frozen. It trains 2-3x faster and uses 3-5x less GPU memory.
Full fine-tuning updates every parameter. It gives the model more flexibility to adapt but costs more compute time and memory.
Start with LoRA. Switch to full fine-tuning only if you need higher accuracy and have the GPU budget for it. For a deeper explanation, see How Do LoRA and Quantization Work?
The system prompt defines the task. It tells the model what to look for, how to format its output, and what domain knowledge to apply. The same system prompt is used during both training and inference.
If you change the system prompt after training, the model's behavior will degrade. The prompt it learned with is the prompt it expects at inference time.
For your first training run, the defaults work well. See Model Settings when you're ready to tune.
How do training settings interact?
Training settings do not work in isolation. Changing one often affects how others behave. Here are the interactions that matter most.
Batch size and learning rate
Batch size and learning rate are linked. A larger batch means the model averages gradients over more images, which smooths out noise. This smoother signal can handle a slightly higher learning rate. A smaller batch produces noisier gradients, so a lower learning rate works better.
The rule of thumb: if you double the batch size, you can try increasing the learning rate by 1.4x (the square root of 2). If you halve the batch size, reduce the learning rate by a similar factor.
In practice, leave the learning rate at the default unless you see instability. Adjust batch size first to fix memory issues, and only change the learning rate if loss curves look abnormal.
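The square-root rule of thumb is a one-line calculation. A sketch with an illustrative helper name:

```python
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root rule of thumb: scale the learning rate by
    sqrt(new_batch / base_batch) when the batch size changes."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(scale_lr(1e-4, base_batch=4, new_batch=8))  # batch doubled: ~1.41e-4
print(scale_lr(1e-4, base_batch=4, new_batch=2))  # batch halved:  ~7.07e-5
```

Treat the output as a starting point to test, not a guarantee: it is a heuristic, and the default learning rate remains the safer choice.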
Epochs and dataset size
Smaller datasets need more epochs because the model sees fewer unique examples per pass. A dataset of 50 images at 200 epochs means the model sees each image 200 times. A dataset of 5,000 images at 20 epochs gives the model enough variety per pass that it does not need as many repetitions.
The risk with high epoch counts on small datasets is overfitting. The model starts memorizing individual images rather than learning reusable patterns. Watch your validation loss: if it climbs while training loss keeps falling, you have gone too far.
Model size and data requirements
Larger models have more parameters, which gives them more capacity to learn. But that capacity is wasted without enough data. A 32B model trained on 50 images will overfit faster than a 4B model on the same data, because the larger model has so many parameters that it can memorize small datasets more easily.
Match model size to data size. For datasets under 200 images, start with a 2B-4B model. For 200-1,000 images, a 7B-9B model works well. Scale to 27B+ only when you have 1,000+ annotated images.
Cross-entropy loss is the function that measures how wrong the model's predictions are. At each token position, the model predicts a probability distribution over its vocabulary. Cross-entropy measures the gap between that predicted distribution and the correct answer (your annotation).
If the model assigns 90% probability to the correct token, the loss is low. If it assigns only 5% to the correct token, the loss is high. The training process adjusts weights to reduce this loss over time.
You do not need to configure cross-entropy loss directly. Datature Vi uses it by default for all VLM training. What matters is understanding that loss values represent "how wrong" the model is, and that lower values mean better predictions.
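At a single token position, cross-entropy reduces to the negative natural log of the probability the model assigned to the correct token. A minimal sketch:

```python
import math

def token_loss(p_correct):
    """Cross-entropy contribution of one token position: -ln(p), where p
    is the probability the model assigned to the correct token."""
    return -math.log(p_correct)

print(round(token_loss(0.90), 3))  # confident and correct → low loss (0.105)
print(round(token_loss(0.05), 3))  # correct token nearly missed → high loss (2.996)
```

The total loss averages this quantity over all token positions in the target sequence, which is why confidently correct predictions drive the curve down.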
How to read loss curves
The loss curve shows how fast the model's errors decrease over time. Orange represents training loss (calculated at every step). Blue represents validation loss (calculated at evaluation intervals).
| What you see | What it means | What to do |
| --- | --- | --- |
| Both curves decrease smoothly | Training is healthy | Continue training |
| Training drops, validation rises | Overfitting: model is memorizing training data | Add more data or reduce epochs |
| Both curves plateau high | Underfitting: model cannot learn the patterns | Train longer or try a larger model |
| Loss spikes or oscillates | Learning rate is too high | Halve the learning rate |
| Starting loss is above 10 | Possible data format issue | Check your annotations |
Overfitting means the model has memorized the training examples rather than learning reusable patterns. It performs well on images it has seen but poorly on new ones. The fix is more data, fewer epochs, or both.
Underfitting means the model hasn't learned enough. It performs poorly on both training and new images. The fix is more training time or a larger model architecture.
Loss is a number that represents how wrong the model's predictions are. Lower is better.
Typical ranges for VLM training in Datature Vi:

- Starting loss: 2.0-6.0 (depends on model size and data)
- Good final loss: 0.5-1.5 (for most tasks)
- Suspiciously low loss (below 0.1): may indicate overfitting or data leakage
Loss values are not directly comparable across different model architectures. A loss of 1.0 on Qwen2.5-VL 7B does not mean the same thing as a loss of 1.0 on NVILA-Lite 2B. Compare loss values only within the same architecture and dataset.
For a full breakdown of all metrics tracked during training, see Training Metrics.
When training goes wrong
Training does not always converge on the first attempt. Here are the most common failure patterns and what causes them.
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Loss does not decrease after several epochs | Learning rate too low, or data format mismatch | Try doubling the learning rate. Check that annotations match the expected format for your dataset type. |
| Loss spikes suddenly mid-training | Learning rate too high | Halve the learning rate and restart. |
| Training loss drops but validation loss rises | Overfitting | Reduce epochs, add more training data, or try a smaller model. |
| Both losses plateau at a high value (above 3.0) | Model too small for the task, or annotations are inconsistent | Try a larger model architecture. Review annotation quality. |
| GPU out-of-memory error | Model + batch exceeds GPU VRAM | Reduce batch size. Enable NF4 quantization. Switch to LoRA if using full fine-tuning. |
| Training completes but model outputs are wrong | System prompt mismatch or annotation errors | Verify the system prompt matches your task. Spot-check 20-30 annotations for consistency. |
Poor annotations are the most common root cause of training issues that are not related to hyperparameters. Check for these patterns:
- Inconsistent labels: The same object described differently across images ("red car," "vehicle," "sedan"). Pick one term and apply it consistently.
- Missing annotations: Images where visible objects are not annotated. The model learns that unmarked objects should be ignored, which hurts recall.
- Wrong bounding boxes: Boxes that are too loose (include too much background) or too tight (clip the object). Both hurt IoU scores.
- Vague answers in VQA: Short, uninformative answers like "yes" when the model needs richer training signal. Add context: "Yes, there is a crack along the left edge of the tile."
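The IoU (intersection-over-union) score mentioned above quantifies why loose and tight boxes both hurt. A minimal sketch for [x_min, y_min, x_max, y_max] boxes:

```python
def iou(a, b):
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes:
    overlap area divided by combined area, ranging from 0 to 1."""
    def area(box):
        return max(0, box[2] - box[0]) * max(0, box[3] - box[1])
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # width of the overlap
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # height of the overlap
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

gt = [100, 100, 200, 200]                       # ground-truth box
print(iou(gt, [100, 100, 200, 200]))            # exact match: 1.0
print(round(iou(gt, [50, 50, 250, 250]), 2))    # too loose: 0.25
print(round(iou(gt, [120, 120, 180, 180]), 2))  # too tight: 0.36
```

A box that is only slightly off still scores well, but heavy padding or clipping drops the score sharply, so consistent, snug boxes pay off directly in metrics.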
Training requires GPUs. Datature Vi manages the infrastructure so you don't need to provision or configure hardware yourself. You select a GPU tier based on your model size and training mode.
| GPU | VRAM | CUDA cores | Available counts | Good for |
| --- | --- | --- | --- | --- |
| T4 | 16 GB | 2,560 | 1, 4, 8 | Small models (2-3B) with LoRA |
| L4 | 24 GB | 7,680 | 1, 4, 8 | Small to medium models (2-7B) with LoRA |
| A10G | 24 GB | 9,216 | 1, 4, 8 | Medium models (7-8B) with LoRA |
| A100 (40 GB) | 40 GB | 6,912 | 8 only | Large models (up to 32B), full fine-tuning |
| A100 (80 GB) | 80 GB | 6,912 | 8 only | Large models (32B+), full fine-tuning |
| H100 | 80 GB | 14,592 | 1, 8 | Maximum speed, largest models |
If you're unsure, start with the GPU tier Datature Vi recommends for your chosen model. The platform prevents you from selecting a GPU that doesn't have enough memory for your configuration.
Training consumes Compute Credits from your organization's plan. You can monitor your usage and remaining Compute Credits in Resource Usage.
Frequently asked questions
How long does training take?
Most training runs finish in 1-3 hours. The exact time depends on three factors: dataset size, model architecture, and GPU tier.
A small dataset (100 images) on a 7B model with LoRA and an A10G GPU typically completes in about 1 hour. Larger datasets, bigger models, or full fine-tuning will take longer.
Can I close my browser while training runs?
Yes. Training runs on Datature Vi's servers, not on your machine. You can close the browser, shut down your computer, or switch to another task. You'll receive a notification when the run finishes.
What should I do if my training run fails?
Open the run and check the Training Logs. The most common causes are:
- GPU out of memory: Reduce batch size or switch to LoRA mode.
- Annotation format errors: Check that your annotations match the expected format for your task type.
- Dataset too small: The minimum is 20 images with annotations. Below that, training may fail to converge.
How many images do I need?
The minimum is 20 annotated images. For reliable results, aim for 100+ images. For production-quality models, 500+ images is the target.
Quality matters as much as quantity. Fifty well-annotated images with clear, specific labels will outperform 500 images with vague or inconsistent annotations.