What Are Vision-Language Models?

Learn what vision-language models are, how they combine image understanding with language generation, and why they matter for computer vision tasks.

A vision-language model (VLM) takes an image and a text prompt as input and produces a text response. It can answer questions about images, locate objects described in natural language, and follow instructions that reference visual content. Datature Vi lets you fine-tune VLMs on your own data so the model learns your specific domain and task.


How did we get here? From classification to VLMs

Computer vision has evolved through three major phases. Each phase gave models more flexibility in how they understand and describe images.

Image classification (2012-2015) was the first breakthrough. Given an image, the model picks one label from a fixed list: "cat," "dog," "car." The model can only choose categories it was trained on. Adding a new category means retraining from scratch.

Object detection (2015-2020) added spatial awareness. The model draws bounding boxes around objects and labels each one. Still limited to pre-defined categories, but now the model can find multiple objects and say where they are. This is where formats like YOLO and COCO became standard.

Vision-language models (2021-present) removed the fixed-category constraint. Instead of picking from a list, VLMs generate free-form text. You can ask a VLM "find the dented can on the second shelf" and it will locate it, even if "dented can" was never a training category. The input is flexible (any text prompt) and the output is flexible (any text response, with optional bounding boxes).

| Task | Input | Output | Flexibility |
| --- | --- | --- | --- |
| Image Classification | Image | Single label | Fixed categories |
| Object Detection | Image | Boxes + labels | Fixed categories |
| VLM (Phrase Grounding) | Image + text description | Boxes + text | Open vocabulary |
| VLM (VQA) | Image + question | Text answer | Any question |

How do VLMs work?

A VLM has three main parts that work together. Traditional CV models have eyes but no language comprehension. VLMs have both.

  1. Visual encoder: Processes the image by breaking it into patches and converting each patch into a numerical representation the model can work with.
  2. Language model: Reads your prompt, processes the visual features from the encoder, and generates a text response token by token.
  3. Cross-modal bridge: Connects the visual encoder to the language model. Translates visual features into a format the language model can understand, so the model can "see" and "speak" about the same content.

The visual encoder produces a grid of feature vectors (one per image patch). These vectors live in a different numerical space than the language model expects. The cross-modal bridge is a projection layer, a small neural network that converts each visual feature vector into the same dimensional space that the language model uses for text tokens.

After projection, the image patch representations are treated the same way as text token embeddings. The language model's attention mechanism can then attend to both image patches and text tokens in the same sequence. This is how the model connects the word "car" in your prompt to the actual car pixels in the image.

Different architectures implement this bridge differently. Some use a single linear projection (PaliGemma), some use a multi-layer perceptron (LLaVA-style), and some use cross-attention layers where text queries attend directly to visual features (Flamingo-style). Datature Vi handles this automatically based on the model you select.
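In code, the simplest form of this bridge (the single-linear-projection style used by PaliGemma-like models) is just a matrix multiply. Here is a minimal NumPy sketch; the dimensions are made up for illustration, and the random weights stand in for values that a real model learns during training:

```python
import numpy as np

rng = np.random.default_rng(0)

vision_dim, lm_dim = 768, 2048   # hypothetical encoder and LM embedding sizes
num_patches = 1024               # e.g., a 448x448 image with 14x14 patches

# Learned projection weights (random placeholders here)
W = rng.normal(size=(vision_dim, lm_dim))
b = np.zeros(lm_dim)

# Feature vectors from the visual encoder, one per patch
patch_features = rng.normal(size=(num_patches, vision_dim))

# After projection, each patch lives in the LM's embedding space
visual_tokens = patch_features @ W + b
print(visual_tokens.shape)  # (1024, 2048)
```

After this step the 1,024 projected vectors can sit in the same input sequence as text token embeddings, which is what lets attention connect words to patches.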

The visual encoder splits each image into a grid of fixed-size patches (typically 14x14 or 16x16 pixels). Each patch becomes one visual token after passing through the encoder and the cross-modal bridge. A 448x448 image split into 14x14 patches produces 1,024 visual tokens.

Higher-resolution images produce more visual tokens. Some architectures (like Qwen-VL) use dynamic resolution, resizing images to different tile configurations depending on their aspect ratio. A tall document image might be split into more tiles than a square photo, consuming more of the model's context window.

This is why large images cost more tokens: a 1344x1344 image uses roughly 9x the tokens of a 448x448 image, since token count scales with pixel area. When working with high-resolution images (medical scans, drone footage, documents), check the token monitor in the workflow canvas to see how much of the context window your images consume.

Resolution directly affects two things: how much detail the model can see, and how many tokens the image consumes.

Detail: A small defect that spans 5x5 pixels in a 4000x3000 image might vanish when the image is resized to 448x448 for processing. Higher-resolution input preserves more detail. Architectures like Qwen3-VL and Qwen3.5 support dynamic resolution, which adapts the number of image tiles to the input size. Instead of resizing every image to a fixed square, dynamic resolution preserves the original aspect ratio and allocates more tiles to larger images.

Token cost: More tiles mean more visual tokens, which consume more of the model's context window. A context window has a fixed capacity (e.g., 32,768 tokens for Qwen2.5-VL, 128K+ for Qwen3.5). Visual tokens from images share this window with your system prompt, user prompt, and the model's generated response. If your images consume most of the window, there is less room for long prompts or detailed answers.

Practical guidance:

  • For most tasks, the default resolution handling works well. You do not need to manually resize images.
  • For high-resolution inputs (medical scans, satellite imagery, dense documents), check the token monitor in the workflow canvas to verify images fit within the context window alongside your prompts.
  • If you hit context limits, consider cropping images to the region of interest rather than downscaling the entire image, which preserves detail in the area that matters.
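The budget behind that guidance is simple arithmetic. The function below is illustrative, not a Vi SDK API; the 32,768-token default mirrors the Qwen2.5-VL window mentioned above, and the token monitor in the workflow canvas reports the exact numbers:

```python
def fits_in_context(image_tokens: int, prompt_tokens: int,
                    max_new_tokens: int, context_window: int = 32_768) -> bool:
    """Rough check: do the image, the prompts, and room for the
    generated response all fit in the model's context window?"""
    return image_tokens + prompt_tokens + max_new_tokens <= context_window

# A 1344x1344 image (~9,216 visual tokens) with a 500-token prompt
# and room for a 512-token answer fits comfortably:
print(fits_in_context(9216, 500, 512))  # True
```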

The architecture that makes VLMs possible is called a transformer. A transformer uses attention mechanisms to decide which parts of the input (both image and text) are most relevant to each other.

When you ask "What color is the car on the left?", the attention mechanism helps the model focus on the left side of the image and on the car specifically, rather than processing every pixel equally.

You don't need to understand transformers to use Datature Vi. For more, see the transformer glossary entry and attention mechanism glossary entry.


Key concepts you'll encounter

These are the terms you'll see throughout the docs, each explained in plain language below. You don't need to understand all of them before starting the quickstart.

Tokens

Models process text as tokens, which are word pieces. The word "understanding" might become two tokens: "under" and "standing." Short words like "the" are usually one token. Numbers and punctuation get their own tokens too.

When you set "max new tokens" to 512, you're telling the model it can write up to about 400 words. When a model has a "128K context window," it can process roughly 96,000 words of combined input and output.

1 token is approximately 0.75 words. This ratio holds for English text but varies for other languages and for code.
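As a sketch of that arithmetic, using the document's own examples:

```python
WORDS_PER_TOKEN = 0.75  # rough ratio for English text

def tokens_to_words(tokens: int) -> int:
    """Estimate English word count from a token count."""
    return round(tokens * WORDS_PER_TOKEN)

print(tokens_to_words(512))      # 384 — a "max new tokens" of 512 is ~400 words
print(tokens_to_words(128_000))  # 96000 — a 128K window is ~96,000 words
```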

Parameters

A model's parameters are its learned values. Every decision the model makes is based on these numbers. "7B parameters" means the model has 7 billion numbers that were set during pre-training.

More parameters generally means more capability: the model can represent more patterns and handle more complex tasks. But more parameters also means more GPU memory and longer training times.

In Datature Vi, you choose a model architecture (Qwen3.5 4B, 9B, or 27B, for example) based on your accuracy needs and hardware budget. See model architectures for a comparison.

Pre-training and fine-tuning

Pre-training is the first phase of model development. The model is trained on millions of images and text pairs to learn general visual and language understanding. Pre-trained models can perform many tasks out of the box, but they aren't specialized for any specific domain.

Fine-tuning is the second phase. You take a pre-trained model and train it further on YOUR data. After fine-tuning, the model becomes specialized for your domain and task.

The analogy: A pre-trained model is like a general-knowledge assistant who can discuss any topic at a surface level. Fine-tuning is like spending a month teaching that assistant the specifics of your industry. They keep all their general knowledge but become a domain specialist.

In Datature Vi, fine-tuning is what happens when you train a model. You don't need thousands of images. For many tasks, 100+ annotated image-text pairs is enough to start. 500+ pairs is the recommended amount.

For more on how fine-tuning works, see How Does VLM Training Work?

LoRA and full fine-tuning

LoRA (Low-Rank Adaptation) adjusts a small subset of model parameters during training, roughly 1-5% of the total. The rest of the model stays frozen. LoRA is like adding sticky notes to a textbook: the original text stays the same, but the notes guide you to interpret it differently for your use case.

Full fine-tuning adjusts every parameter. It's like rewriting the textbook from scratch for your domain.

When to use LoRA: Start here for your first training run. It trains 2-3x faster and uses 3-5x less GPU memory. For many tasks, LoRA matches full fine-tuning quality.

When to use full fine-tuning: When you need the highest possible accuracy and have the compute budget. Also required when using NVILA-Lite, which does not support LoRA.

In Datature Vi, you choose the training mode in model settings. For a deeper explanation, see How Do LoRA and Quantization Work?

# LoRA-trained models load the same way as fully fine-tuned models
from vi.inference import ViModel
model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)
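To see why LoRA touches so few parameters, count them for a single weight matrix. The hidden size and rank below are hypothetical, chosen only to make the arithmetic concrete:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a rank-r LoRA adapter adds to one d_in x d_out weight:
    a d_in x r down-projection plus an r x d_out up-projection."""
    return d_in * rank + rank * d_out

d = 4096        # hypothetical hidden size
full = d * d    # parameters in one full weight matrix
adapter = lora_params(d, d, rank=16)

print(adapter / full)  # 0.0078125 — under 1% of that matrix's parameters
```

The per-matrix fraction grows with the chosen rank, and the model-wide total lands in the few-percent range the docs cite because adapters are only attached to selected layers.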

Inference

Inference is running the trained model on new images to get predictions. You provide an image and a text prompt, and the model returns a response.

In Datature Vi, you run inference using the Vi SDK (locally on your machine) or NVIDIA NIM containers (production deployment). For a deeper look at how inference works and how generation settings control output, see How Does Inference Work?

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)
result, error = model(source="image.jpg", user_prompt="Describe this image", stream=False)
print(result.result)

System prompts

A system prompt is a set of natural language instructions that tells the model how to behave. It defines the model's role, what to look for in images, how to format responses, and what domain-specific terminology to use.

The analogy: It's like giving a photographer detailed instructions before a shoot: "Focus on product defects. Describe each defect using our internal severity scale. Ignore cosmetic scratches under 2mm."

The same system prompt is used during both training and inference. A mismatch between training and inference prompts will degrade performance.

Writing a good system prompt comes down to four elements:

  1. Define the role: "You are a quality control inspector analyzing PCB images" or "You are a medical imaging assistant describing radiographic findings."
  2. Specify what to look for: "Identify solder bridges, missing components, and cold joints" or "Describe visible crop health indicators."
  3. Set the output format: "Return a JSON object with fields: defect_found, defect_type, location, severity" or "Answer in one sentence."
  4. Add hallucination guards: "Only describe what is directly visible. Do not speculate about areas outside the frame."

Datature Vi provides default prompts for phrase grounding and VQA. Customize them when you need domain-specific behavior. See the system prompt configuration guide for full examples across industries.
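Assembled as plain text, a prompt covering all four elements might look like this hypothetical PCB-inspection example, built from the snippets above:

```python
# Hypothetical system prompt combining the four elements
system_prompt = "\n".join([
    # 1. Role
    "You are a quality control inspector analyzing PCB images.",
    # 2. What to look for
    "Identify solder bridges, missing components, and cold joints.",
    # 3. Output format
    "Return a JSON object with fields: defect_found, defect_type, location, severity.",
    # 4. Hallucination guard
    "Only describe what is directly visible. Do not speculate about areas outside the frame.",
])

print(system_prompt)
```

Remember that the same string must be used during both training and inference.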


Which VLM architectures does Datature Vi support?

Datature Vi supports seven model architectures across three Qwen generations plus specialist models. Start with Qwen3.5 4B to validate your setup, then scale to Qwen3.5 9B for production.

| If you need... | Choose | Why |
| --- | --- | --- |
| General-purpose starting point | Qwen3.5 9B | Best quality-to-cost ratio, 201-language support, hybrid DeltaNet architecture |
| Fast experimentation | Qwen3.5 4B | Quick iteration on small datasets with built-in thinking mode |
| Lowest memory / fastest inference | NVILA-Lite 2B | Runs on smaller GPUs, handles up to 4K resolution, full fine-tuning only |
| Complex multi-step reasoning | Cosmos-Reason1 7B | Optimized attention for reasoning chains, supports LoRA |
| Fine-grained visual details | InternVL3.5 8B | Best at attribute recognition and spatial awareness, supports LoRA |
| Maximum accuracy (large GPU budget) | Qwen3.5 27B | Highest dense model quality, strong reasoning and production-grade outputs |

For full architecture details, hardware requirements, and benchmark citation pointers, see Model Architectures.

How do I choose a model size?

Model size (parameter count) is not "bigger is better." The right choice depends on your dataset size, task complexity, and deployment constraints.

| Your situation | Recommended size | Why |
| --- | --- | --- |
| Under 200 images, first experiment | 2B-4B (NVILA-Lite, Qwen3.5 4B) | Fewer parameters means less risk of overfitting on small data. Trains faster, iterates quicker. |
| 200-1,000 images, production use | 7B-9B (Qwen3.5 9B, InternVL3.5 8B) | Enough capacity for most tasks without wasting compute. Good quality-to-cost ratio. |
| 1,000+ images, high accuracy needed | 27B+ (Qwen3.5 27B) | Large models benefit from large datasets. Requires more GPU memory and compute credits. |
| Strict latency requirements | 2B-4B | Smaller models produce tokens faster. Consider this for real-time applications. |

A 27B model trained on 50 images will often perform worse than a 4B model on the same data. Large models have so many parameters that they can memorize small datasets instead of learning general patterns. Match model size to data size first, then scale up if evaluation metrics plateau.
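Those rules of thumb can be condensed into a few lines. This helper is illustrative only; the thresholds come from the table above and are guidelines, not Vi settings:

```python
def suggested_size(num_images: int, latency_critical: bool = False) -> str:
    """Rule-of-thumb model size for a given dataset size."""
    if latency_critical or num_images < 200:
        return "2B-4B"
    if num_images < 1000:
        return "7B-9B"
    return "27B+"

print(suggested_size(50))    # 2B-4B
print(suggested_size(500))   # 7B-9B
print(suggested_size(5000))  # 27B+
```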

For more on how model size interacts with training settings, see How Does VLM Training Work?


Frequently asked questions

How much annotated data do I need?

The minimum is 20 annotated images with 100+ annotation pairs. The recommended amount is 100+ images with 500+ pairs. For production use, aim for 500+ images with 1,000+ pairs.

Quality matters more than quantity. 50 accurate, specific annotations will produce better results than 500 vague ones.

Can I run inference on my own GPU?

Yes, for local inference. The Vi SDK detects available GPU hardware automatically and uses it. For GPUs with limited memory, the SDK supports 8-bit and 4-bit quantization to reduce memory requirements.

For production deployment without managing GPUs directly, use NVIDIA NIM containers.

Do pre-trained models work without fine-tuning?

Yes. Pre-trained models work out of the box for general tasks, but fine-tuning on your specific data improves accuracy for your domain. If the model needs to understand your specific terminology, equipment, or visual patterns, fine-tuning is the way to get there.

What's the difference between phrase grounding and VQA?

Phrase grounding returns bounding boxes: you describe an object in text, and the model locates it in the image.

Visual question answering (VQA) returns text answers: you ask a question about the image, and the model answers in natural language.

Use phrase grounding when you need to know WHERE something is. Use VQA when you need to know WHAT something is, how many there are, or whether a condition is met.



Related resources

Phrase Grounding

Object localization with natural language descriptions.

Visual Question Answering

Answer questions about images with text responses.

Quickstart

Train and deploy your first VLM in 30 minutes.