Learn how vision-language model training works in Datature Vi. Understand epochs, batch size, learning rate, loss curves, overfitting, and validation splits in plain language.
Training a VLM means showing the model your annotated images repeatedly until it learns the patterns in your data. Datature Vi handles the GPU infrastructure. You configure the model, select your dataset, set a few training parameters, and launch. This page explains what happens during training and what each setting controls.
During training, Datature Vi feeds your annotated images through the model in small groups. The model tries to predict the correct output for each image, checks how far off it was, and adjusts its internal parameters to do better next time. This cycle repeats thousands of times.
1. Model sees a batch of annotated images. A small group of images (the "batch") is loaded from your dataset along with the corresponding annotations.
2. Model makes predictions. The model generates text output for each image based on your system prompt and the image content.
3. Predictions are compared to your annotations. The difference between the model's predictions and your ground-truth annotations is calculated using cross-entropy loss. This difference is called the loss.
4. Model adjusts its parameters to reduce the loss. The model updates its internal weights so it will make smaller errors on similar images next time.
5. Repeat across all images. Steps 1-4 repeat until the model has processed every image in the training set. One complete pass through all images is called one epoch. Training runs for multiple epochs.
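The five-step loop above can be sketched in plain Python. This is a toy stand-in, not Datature Vi's actual code: it fits a single weight with squared-error loss, whereas real VLM training swaps in a transformer, cross-entropy loss on tokens, and the AdamW optimizer.

```python
import random

# Toy training loop: fit one weight w so that prediction = w * x
# matches target = 2 * x. The structure mirrors the five steps above.
random.seed(0)
dataset = [(x, 2.0 * x) for x in range(1, 21)]  # (image, annotation) stand-ins

w = 0.0            # the model's single "parameter"
lr = 0.001         # learning rate
batch_size = 4
epochs = 50        # one epoch = one full pass over the dataset

for epoch in range(epochs):
    random.shuffle(dataset)
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]                    # step 1: load a batch
        preds = [w * x for x, _ in batch]                    # step 2: predict
        errors = [p - y for p, (_, y) in zip(preds, batch)]  # step 3: compare to annotations
        # step 4: adjust the parameter to reduce the (squared-error) loss
        grad = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        w -= lr * grad
    # step 5: repeat for the next epoch

print(round(w, 3))  # → 2.0
```

The same skeleton, with a much bigger model and a token-level loss, is what runs on Datature Vi's GPUs.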
After training completes, you get a trained model you can download and run inference on. See Train a Model for the full step-by-step guide.
Under the hood, Datature Vi uses HuggingFace Transformers with DeepSpeed for distributed training optimization. The training loop follows standard HuggingFace conventions: a cross-entropy loss on the generated tokens, the AdamW optimizer, and cosine learning rate scheduling with warmup. DeepSpeed handles memory-efficient training through ZeRO optimization stages, which allows fine-tuning larger models on fewer GPUs than would otherwise be possible.
You do not need to configure DeepSpeed or HuggingFace directly. Datature Vi selects the right optimization strategy based on the model size and GPU hardware you choose.
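The cosine learning rate schedule with warmup mentioned above has a simple shape: the rate ramps up linearly from zero, peaks, then decays along a half-cosine. A minimal sketch of the generic formula (the function name and exact step counts are illustrative, not Datature Vi's internal configuration):

```python
import math

def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Generic cosine learning-rate schedule with linear warmup:
    ramp from 0 to peak_lr over warmup_steps, then decay to 0
    along a half-cosine over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total, warmup, peak = 1000, 100, 1e-4
print(cosine_lr_with_warmup(50, total, warmup, peak))    # mid-warmup: 5e-05
print(cosine_lr_with_warmup(100, total, warmup, peak))   # peak: 0.0001
print(cosine_lr_with_warmup(1000, total, warmup, peak))  # end of training: 0.0
```

Warmup avoids large, destabilizing updates while the randomly initialized adapter weights are still far from sensible values; the cosine decay lets the model settle into a minimum near the end.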
Phrase grounding labels combine a caption, grounded phrases, and bounding boxes. Fine-tuning treats the supervised output as one text sequence the model should generate: natural language plus the numeric box corners in the same normalized [0, 1024] layout the model uses at inference ([x_min, y_min, x_max, y_max] per box; see Bounding box format). Cross-entropy runs over every token in that target, so coordinates are corrected through the same next-token objective as words and punctuation.
Dataset files store boxes in the import schema under Format specifications (Vi JSONL phrase grounding). For how validation turns emitted boxes into F1 and IoU, see Training metrics.
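Mapping a pixel-space box into the normalized [0, 1024] layout is a simple rescale. A sketch, assuming pixel boxes in [x_min, y_min, x_max, y_max] order; the helper name and rounding choice are illustrative:

```python
def to_model_coords(box_px, img_w, img_h):
    """Map a pixel-space [x_min, y_min, x_max, y_max] box into the
    normalized 0-1024 grid, independent of image resolution."""
    x_min, y_min, x_max, y_max = box_px
    return [
        round(x_min / img_w * 1024),
        round(y_min / img_h * 1024),
        round(x_max / img_w * 1024),
        round(y_max / img_h * 1024),
    ]

# A box covering the right half of a 1920x1080 image:
print(to_model_coords([960, 0, 1920, 1080], 1920, 1080))  # [512, 0, 1024, 1024]
```

Because the grid is resolution-independent, the model learns one coordinate vocabulary regardless of the source image sizes in your dataset.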
Training settings explained
You don't need to change most settings for your first run. Here's a quick reference, with details below.
| Setting | What it controls | Default advice |
| --- | --- | --- |
| Epochs | How many times the model sees all your images | More for small datasets, fewer for large |
| Batch size | Images processed at once before updating | Reduce if you get out-of-memory errors |
| Learning rate | How aggressively the model adjusts | Use the default unless loss spikes |
| Validation split | % of data held out for testing | 20% default works for most cases |
| LoRA vs full fine-tuning | How many parameters get updated | Start with LoRA, switch if you need more accuracy |
| System prompt | What the model looks for and how it responds | Must match between training and inference |
An epoch is one complete pass through your entire training dataset. If you set 50 epochs, the model sees every image 50 times.
More epochs means the model gets more practice. Too many epochs can cause overfitting, where the model memorizes your training data instead of learning general patterns. When this happens, training loss keeps dropping but validation loss starts rising.
| Dataset size | Recommended epochs |
| --- | --- |
| Under 100 images | 100-300 |
| 100-1,000 images | 50-150 |
| 1,000+ images | 20-100 |
Smaller datasets need more epochs because the model has fewer examples to learn from. Larger datasets need fewer passes because there is more variety in each epoch.
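What the model actually experiences is total optimizer steps and total views per image, which you can compute directly. A sketch with an illustrative helper (not a platform API):

```python
import math

def training_volume(num_images, epochs, batch_size):
    """How much practice a run provides: optimizer steps per epoch,
    total steps, and how many times each image is seen."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
        "views_per_image": epochs,
    }

# Small dataset, many epochs:
print(training_volume(num_images=80, epochs=200, batch_size=4))
# {'steps_per_epoch': 20, 'total_steps': 4000, 'views_per_image': 200}

# Large dataset, few epochs:
print(training_volume(num_images=5000, epochs=20, batch_size=4))
# {'steps_per_epoch': 1250, 'total_steps': 25000, 'views_per_image': 20}
```

Both runs deliver thousands of optimizer steps; the large dataset just spreads them over far more unique examples, which is why it needs fewer passes.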
Batch size is how many images the model processes at once before updating its parameters. A batch size of 4 means the model looks at 4 images, calculates the average error, then adjusts.
Larger batches train faster because more images are processed in parallel. But larger batches require more GPU memory. If you get an out-of-memory error, reduce your batch size first.
Typical values range from 1 to 8. The default works for most setups.
The learning rate controls how much the model adjusts its parameters after each batch. A learning rate of 0.0001 means small, careful adjustments. A rate of 0.01 means large, aggressive adjustments.
Too high and the model overshoots good solutions (you'll see the loss spike or oscillate). Too low and training takes much longer than necessary. The default works for most tasks.
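A toy demonstration of "too high": minimize f(w) = w² (gradient 2w) with plain gradient descent. At a careful rate the error shrinks every step; past a threshold each update overshoots the minimum and the error grows instead. The function and values are illustrative, not VLM-scale settings:

```python
def descend(lr, steps=20, w0=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w,
    starting from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(descend(lr=0.01)))  # careful steps: |w| shrinks toward 0
print(abs(descend(lr=1.1)))   # too aggressive: overshoots, |w| grows every step
```

The same dynamic produces the spiking or oscillating loss curves described later on this page.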
Datature Vi holds out 20% of your data by default. This held-out portion is called the validation set. The model never trains on these images.
At regular intervals during training, the model is tested on the validation set. This check tells you whether the model is learning general patterns or just memorizing the training images. If validation loss starts rising while training loss keeps falling, the model is overfitting.
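Conceptually, the split works like this sketch: shuffle once with a fixed seed, hold out the last 20%, and never train on it (the helper is illustrative; Datature Vi performs the split for you):

```python
import random

def split_dataset(items, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out the last val_fraction
    of items as a validation set the model never trains on."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))  # always hold out at least one item
    return items[:-n_val], items[-n_val:]  # (train, validation)

train, val = split_dataset(range(100))
print(len(train), len(val))   # 80 20
print(set(train) & set(val))  # set() — no image appears in both
```

Keeping the two sets disjoint is the whole point: any overlap would let memorization masquerade as generalization.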
LoRA (Low-Rank Adaptation) updates only a small fraction of the model's parameters during training. The rest stay frozen. It trains 2-3x faster and uses 3-5x less GPU memory.
Full fine-tuning updates every parameter. It gives the model more flexibility to adapt but costs more compute time and memory.
Start with LoRA. Switch to full fine-tuning only if you need higher accuracy and have the GPU budget for it. For a deeper explanation, see How Do LoRA and Quantization Work?
The system prompt defines the task. It tells the model what to look for, how to format its output, and what domain knowledge to apply. The same system prompt is used during both training and inference.
If you change the system prompt after training, the model's behavior will degrade. The prompt it learned with is the prompt it expects at inference time.
For your first training run, the defaults work well. See Model Settings when you're ready to tune.
How do training settings interact?
Training settings do not work in isolation. Changing one often affects how others behave. Here are the interactions that matter most.
Batch size and learning rate
Batch size and learning rate are linked. A larger batch means the model averages gradients over more images, which smooths out noise. This smoother signal can handle a slightly higher learning rate. A smaller batch produces noisier gradients, so a lower learning rate works better.
The rule of thumb: if you double the batch size, you can try increasing the learning rate by 1.4x (the square root of 2). If you halve the batch size, reduce the learning rate by a similar factor.
In practice, leave the learning rate at the default unless you see instability. Adjust batch size first to fix memory issues, and only change the learning rate if loss curves look abnormal.
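The square-root rule of thumb is a one-line calculation. A sketch with an illustrative helper name:

```python
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root rule of thumb: scale the learning rate by
    sqrt(new_batch / base_batch) when the batch size changes."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(scale_lr(1e-4, base_batch=4, new_batch=8))  # batch doubled: ~1.41e-4
print(scale_lr(1e-4, base_batch=4, new_batch=2))  # batch halved:  ~7.07e-5
```

Treat the output as a starting point to test, not a guarantee: it is a heuristic, and the default learning rate remains the safer choice.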
Epochs and dataset size
Smaller datasets need more epochs because the model sees fewer unique examples per pass. A dataset of 50 images at 200 epochs means the model sees each image 200 times. A dataset of 5,000 images at 20 epochs gives the model enough variety per pass that it does not need as many repetitions.
The risk with high epoch counts on small datasets is overfitting. The model starts memorizing individual images rather than learning reusable patterns. Watch your validation loss: if it climbs while training loss keeps falling, you have gone too far.
Model size and data requirements
Larger models have more parameters, which gives them more capacity to learn. But that capacity is wasted without enough data. A 32B model trained on 50 images will overfit faster than a 4B model on the same data, because the larger model has so many parameters that it can memorize small datasets more easily.
Match model size to data size. For datasets under 200 images, start with a 2B-4B model. For 200-1,000 images, a 7B-9B model works well. Scale to 27B+ only when you have 1,000+ annotated images.
Cross-entropy loss is the function that measures how wrong the model's predictions are. At each token position, the model predicts a probability distribution over its vocabulary. Cross-entropy measures the gap between that predicted distribution and the correct answer (your annotation).
If the model assigns 90% probability to the correct token, the loss is low. If it assigns only 5% to the correct token, the loss is high. The training process adjusts weights to reduce this loss over time.
You do not need to configure cross-entropy loss directly. Datature Vi uses it by default for all VLM training. What matters is understanding that loss values represent "how wrong" the model is, and that lower values mean better predictions.
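At a single token position, cross-entropy reduces to the negative natural log of the probability the model assigned to the correct token. A minimal sketch:

```python
import math

def token_loss(p_correct):
    """Cross-entropy contribution of one token position: -ln(p), where p
    is the probability the model assigned to the correct token."""
    return -math.log(p_correct)

print(round(token_loss(0.90), 3))  # confident and correct → low loss (0.105)
print(round(token_loss(0.05), 3))  # correct token nearly missed → high loss (2.996)
```

The total loss averages this quantity over all token positions in the target sequence, which is why confidently correct predictions drive the curve down.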
How to read loss curves
The loss curve shows how fast the model's errors decrease over time. Orange represents training loss (calculated at every step). Blue represents validation loss (calculated at evaluation intervals).
| What you see | What it means | What to do |
| --- | --- | --- |
| Both curves decrease smoothly | Training is healthy | Continue training |
| Training drops, validation rises | Overfitting: model is memorizing training data | Add more data or reduce epochs |
| Both curves plateau high | Underfitting: model cannot learn the patterns | Train longer or try a larger model |
| Loss spikes or oscillates | Learning rate is too high | Halve the learning rate |
| Starting loss is above 10 | Possible data format issue | Check your annotations |
Overfitting means the model has memorized the training examples rather than learning reusable patterns. It performs well on images it has seen but poorly on new ones. The fix is more data, fewer epochs, or both.
Underfitting means the model hasn't learned enough. It performs poorly on both training and new images. The fix is more training time or a larger model architecture.
Loss is a number that represents how wrong the model's predictions are. Lower is better.
Typical ranges for VLM training in Datature Vi:

- Starting loss: 2.0-6.0 (depends on model size and data)
- Good final loss: 0.5-1.5 (for most tasks)
- Suspiciously low loss (below 0.1): may indicate overfitting or data leakage
Loss values are not directly comparable across different model architectures. A loss of 1.0 on Qwen2.5-VL 7B does not mean the same thing as a loss of 1.0 on NVILA-Lite 2B. Compare loss values only within the same architecture and dataset.
For a full breakdown of all metrics tracked during training, see Training Metrics.
When training goes wrong
Training does not always converge on the first attempt. Here are the most common failure patterns and what causes them.
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Loss does not decrease after several epochs | Learning rate too low, or data format mismatch | Try doubling the learning rate. Check that annotations match the expected format for your dataset type. |
| Loss spikes suddenly mid-training | Learning rate too high | Halve the learning rate and restart. |
| Training loss drops but validation loss rises | Overfitting | Reduce epochs, add more training data, or try a smaller model. |
| Both losses plateau at a high value (above 3.0) | Model too small for the task, or annotations are inconsistent | Try a larger model architecture. Review annotation quality. |
| GPU out-of-memory error | Model + batch exceeds GPU VRAM | Reduce batch size. Enable NF4 quantization. Switch to LoRA if using full fine-tuning. |
| Training completes but model outputs are wrong | System prompt mismatch or annotation errors | Verify the system prompt matches your task. Spot-check 20-30 annotations for consistency. |
Poor annotations are the most common root cause of training issues that are not related to hyperparameters. Check for these patterns:
- Inconsistent labels: The same object described differently across images ("red car," "vehicle," "sedan"). Pick one term and apply it consistently.
- Missing annotations: Images where visible objects are not annotated. The model learns that unmarked objects should be ignored, which hurts recall.
- Wrong bounding boxes: Boxes that are too loose (include too much background) or too tight (clip the object). Both hurt IoU scores.
- Vague answers in VQA: Short, uninformative answers like "yes" when the model needs richer training signal. Add context: "Yes, there is a crack along the left edge of the tile."
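The IoU (intersection-over-union) score mentioned above quantifies why loose and tight boxes both hurt. A minimal sketch for [x_min, y_min, x_max, y_max] boxes:

```python
def iou(a, b):
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes:
    overlap area divided by combined area, ranging from 0 to 1."""
    def area(box):
        return max(0, box[2] - box[0]) * max(0, box[3] - box[1])
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # width of the overlap
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # height of the overlap
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

gt = [100, 100, 200, 200]                       # ground-truth box
print(iou(gt, [100, 100, 200, 200]))            # exact match: 1.0
print(round(iou(gt, [50, 50, 250, 250]), 2))    # too loose: 0.25
print(round(iou(gt, [120, 120, 180, 180]), 2))  # too tight: 0.36
```

A box that is only slightly off still scores well, but heavy padding or clipping drops the score sharply, so consistent, snug boxes pay off directly in metrics.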
Training requires GPUs. Datature Vi manages the infrastructure so you don't need to provision or configure hardware yourself. You select a GPU tier based on your model size and training mode.
| GPU | VRAM | CUDA cores | Available counts | Good for |
| --- | --- | --- | --- | --- |
| T4 | 16 GB | 2,560 | 1, 4, 8 | Small models (2-3B) with LoRA |
| L4 | 24 GB | 7,680 | 1, 4, 8 | Small to medium models (2-7B) with LoRA |
| A10G | 24 GB | 9,216 | 1, 4, 8 | Medium models (7-8B) with LoRA |
| A100 (40 GB) | 40 GB | 6,912 | 8 only | Large models (up to 32B), full fine-tuning |
| A100 (80 GB) | 80 GB | 6,912 | 8 only | Large models (32B+), full fine-tuning |
| H100 | 80 GB | 14,592 | 1, 8 | Maximum speed, largest models |
If you're unsure, start with the GPU tier Datature Vi recommends for your chosen model. The platform prevents you from selecting a GPU that doesn't have enough memory for your configuration.
Training consumes Compute Credits from your organization's plan. You can monitor your usage and remaining Compute Credits in Resource Usage.
Frequently asked questions
How long does training take?
Most training runs finish in 1-3 hours. The exact time depends on three factors: dataset size, model architecture, and GPU tier.
A small dataset (100 images) on a 7B model with LoRA and an A10G GPU typically completes in about 1 hour. Larger datasets, bigger models, or full fine-tuning will take longer.
Can I close my browser while training runs?
Yes. Training runs on Datature Vi's servers, not on your machine. You can close the browser, shut down your computer, or switch to another task. You'll receive a notification when the run finishes.
What should I do if my training run fails?
Open the run and check the Training Logs. The most common causes are:
- GPU out of memory: Reduce batch size or switch to LoRA mode.
- Annotation format errors: Check that your annotations match the expected format for your task type.
- Dataset too small: The minimum is 20 images with annotations. Below that, training may fail to converge.
How many images do I need?
The minimum is 20 annotated images. For reliable results, aim for 100+ images. For production-quality models, 500+ images is the target.
Quality matters as much as quantity. Fifty well-annotated images with clear, specific labels will outperform 500 images with vague or inconsistent annotations.