What Are Vision-Language Models?

Learn what vision-language models are, how they combine image understanding with language generation, and why they matter for computer vision tasks.

A vision-language model (VLM) takes an image and a text prompt as input and produces a text response. It can answer questions about images, locate objects described in natural language, and follow instructions that reference visual content. Datature Vi lets you fine-tune VLMs on your own data so the model learns your specific domain and task.


How did we get here? From classification to VLMs

Computer vision has evolved through three major phases. Each phase gave models more flexibility in how they understand and describe images.

Image classification (2012-2015) was the first breakthrough. Given an image, the model picks one label from a fixed list: "cat," "dog," "car." The model can only choose categories it was trained on. Adding a new category means retraining from scratch.

Object detection (2015-2020) added spatial awareness. The model draws bounding boxes around objects and labels each one. Still limited to pre-defined categories, but now the model can find multiple objects and say where they are. This is where formats like YOLO and COCO became standard.

Vision-language models (2021-present) removed the fixed-category constraint. Instead of picking from a list, VLMs generate free-form text. You can ask a VLM "find the dented can on the second shelf" and it will locate it, even if "dented can" was never a training category. The input is flexible (any text prompt) and the output is flexible (any text response, with optional bounding boxes).

| Task | Input | Output | Flexibility |
| --- | --- | --- | --- |
| Image Classification | Image | Single label | Fixed categories |
| Object Detection | Image | Boxes + labels | Fixed categories |
| VLM (Phrase Grounding) | Image + text description | Boxes + text | Open vocabulary |
| VLM (VQA) | Image + question | Text answer | Any question |

How do VLMs work?

A VLM has three main parts that work together. Traditional CV models have eyes but no language comprehension. VLMs have both.

  1. Visual encoder: Processes the image by breaking it into patches and converting each patch into a numerical representation the model can work with.
  2. Language model: Reads your prompt, processes the visual features from the encoder, and generates a text response token by token.
  3. Cross-modal bridge: Connects the visual encoder to the language model. Translates visual features into a format the language model can understand, so the model can "see" and "speak" about the same content.

The visual encoder produces a grid of feature vectors (one per image patch). These vectors live in a different numerical space than the language model expects. The cross-modal bridge is a projection layer, a small neural network that converts each visual feature vector into the same dimensional space that the language model uses for text tokens.

After projection, the image patch representations are treated the same way as text token embeddings. The language model's attention mechanism can then attend to both image patches and text tokens in the same sequence. This is how the model connects the word "car" in your prompt to the actual car pixels in the image.

Different architectures implement this bridge differently. Some use a single linear projection (PaliGemma), some use a multi-layer perceptron (LLaVA-style), and some use cross-attention layers where text queries attend directly to visual features (Flamingo-style). Datature Vi handles this automatically based on the model you select.
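In code, the simplest form of this bridge (the single-linear-projection style used by PaliGemma-like models) is just a matrix multiply. Here is a minimal NumPy sketch; the dimensions are made up for illustration, and the random weights stand in for values that a real model learns during training:

```python
import numpy as np

rng = np.random.default_rng(0)

vision_dim, lm_dim = 768, 2048   # hypothetical encoder and LM embedding sizes
num_patches = 1024               # e.g., a 448x448 image with 14x14 patches

# Learned projection weights (random placeholders here)
W = rng.normal(size=(vision_dim, lm_dim))
b = np.zeros(lm_dim)

# Feature vectors from the visual encoder, one per patch
patch_features = rng.normal(size=(num_patches, vision_dim))

# After projection, each patch lives in the LM's embedding space
visual_tokens = patch_features @ W + b
print(visual_tokens.shape)  # (1024, 2048)
```

After this step the 1,024 projected vectors can sit in the same input sequence as text token embeddings, which is what lets attention connect words to patches.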

The visual encoder splits each image into a grid of fixed-size patches (typically 14x14 or 16x16 pixels). Each patch becomes one visual token after passing through the encoder and the cross-modal bridge. A 448x448 image split into 14x14 patches produces 1,024 visual tokens.

Higher-resolution images produce more visual tokens. Some architectures (like Qwen-VL) use dynamic resolution, resizing images to different tile configurations depending on their aspect ratio. A tall document image might be split into more tiles than a square photo, consuming more of the model's context window.

This is why large images cost more tokens: a 1344x1344 image uses roughly 9x the tokens of a 448x448 image, since token count scales with pixel area. When working with high-resolution images (medical scans, drone footage, documents), check the token monitor in the workflow canvas to see how much of the context window your images consume.

Resolution directly affects two things: how much detail the model can see, and how many tokens the image consumes.

Detail: A small defect that spans 5x5 pixels in a 4000x3000 image might vanish when the image is resized to 448x448 for processing. Higher-resolution input preserves more detail. Architectures like Qwen3-VL and Qwen3.5 support dynamic resolution, which adapts the number of image tiles to the input size. Instead of resizing every image to a fixed square, dynamic resolution preserves the original aspect ratio and allocates more tiles to larger images.

Token cost: More tiles mean more visual tokens, which consume more of the model's context window. A context window has a fixed capacity (e.g., 32,768 tokens for Qwen2.5-VL, 128K+ for Qwen3.5). Visual tokens from images share this window with your system prompt, user prompt, and the model's generated response. If your images consume most of the window, there is less room for long prompts or detailed answers.

Practical guidance:

  • For most tasks, the default resolution handling works well. You do not need to manually resize images.
  • For high-resolution inputs (medical scans, satellite imagery, dense documents), check the token monitor in the workflow canvas to verify images fit within the context window alongside your prompts.
  • If you hit context limits, consider cropping images to the region of interest rather than downscaling the entire image, which preserves detail in the area that matters.
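The budget behind that guidance is simple arithmetic. The function below is illustrative, not a Vi SDK API; the 32,768-token default mirrors the Qwen2.5-VL window mentioned above, and the token monitor in the workflow canvas reports the exact numbers:

```python
def fits_in_context(image_tokens: int, prompt_tokens: int,
                    max_new_tokens: int, context_window: int = 32_768) -> bool:
    """Rough check: do the image, the prompts, and room for the
    generated response all fit in the model's context window?"""
    return image_tokens + prompt_tokens + max_new_tokens <= context_window

# A 1344x1344 image (~9,216 visual tokens) with a 500-token prompt
# and room for a 512-token answer fits comfortably:
print(fits_in_context(9216, 500, 512))  # True
```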

The architecture that makes VLMs possible is called a transformer. A transformer uses attention mechanisms to decide which parts of the input (both image and text) are most relevant to each other.

When you ask "What color is the car on the left?", the attention mechanism helps the model focus on the left side of the image and on the car specifically, rather than processing every pixel equally.

You don't need to understand transformers to use Datature Vi. For more, see the transformer glossary entry and attention mechanism glossary entry.


Key concepts you'll encounter

These are the terms you'll see throughout the docs, each explained in plain language below. You don't need to understand all of them before starting the quickstart.

Tokens

Models process text as tokens, which are word pieces. The word "understanding" might become two tokens: "under" and "standing." Short words like "the" are usually one token. Numbers and punctuation get their own tokens too.

When you set "max new tokens" to 512, you're telling the model it can write up to about 400 words. When a model has a "128K context window," it can process roughly 96,000 words of combined input and output.

1 token is approximately 0.75 words. This ratio holds for English text but varies for other languages and for code.
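As a sketch of that arithmetic, using the document's own examples:

```python
WORDS_PER_TOKEN = 0.75  # rough ratio for English text

def tokens_to_words(tokens: int) -> int:
    """Estimate English word count from a token count."""
    return round(tokens * WORDS_PER_TOKEN)

print(tokens_to_words(512))      # 384 — a "max new tokens" of 512 is ~400 words
print(tokens_to_words(128_000))  # 96000 — a 128K window is ~96,000 words
```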

Parameters

A model's parameters are its learned values. Every decision the model makes is based on these numbers. "7B parameters" means the model has 7 billion numbers that were set during pre-training.

More parameters generally means more capability: the model can represent more patterns and handle more complex tasks. But more parameters also means more GPU memory and longer training times.

In Datature Vi, you choose a model architecture (Qwen3.5 4B, 9B, or 27B, for example) based on your accuracy needs and hardware budget. See model architectures for a comparison.

Pre-training and fine-tuning

Pre-training is the first phase of model development. The model is trained on millions of images and text pairs to learn general visual and language understanding. Pre-trained models can perform many tasks out of the box, but they aren't specialized for any specific domain.

Fine-tuning is the second phase. You take a pre-trained model and train it further on YOUR data. After fine-tuning, the model becomes specialized for your domain and task.

The analogy: A pre-trained model is like a general-knowledge assistant who can discuss any topic at a surface level. Fine-tuning is like spending a month teaching that assistant the specifics of your industry. They keep all their general knowledge but become a domain specialist.

In Datature Vi, fine-tuning is what happens when you train a model. You don't need thousands of images. For many tasks, 100+ annotated image-text pairs is enough to start. 500+ pairs is the recommended amount.

For more on how fine-tuning works, see How Does VLM Training Work?

LoRA and full fine-tuning

LoRA (Low-Rank Adaptation) adjusts a small subset of model parameters during training, roughly 1-5% of the total. The rest of the model stays frozen. LoRA is like adding sticky notes to a textbook: the original text stays the same, but the notes guide you to interpret it differently for your use case.

Full fine-tuning adjusts every parameter. It's like rewriting the textbook from scratch for your domain.

When to use LoRA: Start here for your first training run. It trains 2-3x faster and uses 3-5x less GPU memory. For many tasks, LoRA matches full fine-tuning quality.

When to use full fine-tuning: When you need the highest possible accuracy and have the compute budget. Also required when using NVILA-Lite, which does not support LoRA.

In Datature Vi, you choose the training mode in model settings. For a deeper explanation, see How Do LoRA and Quantization Work?

# LoRA-trained models load the same way as fully fine-tuned models
from vi.inference import ViModel
model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)
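To see why LoRA touches so few parameters, count them for a single weight matrix. The hidden size and rank below are hypothetical, chosen only to make the arithmetic concrete:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a rank-r LoRA adapter adds to one d_in x d_out weight:
    a d_in x r down-projection plus an r x d_out up-projection."""
    return d_in * rank + rank * d_out

d = 4096        # hypothetical hidden size
full = d * d    # parameters in one full weight matrix
adapter = lora_params(d, d, rank=16)

print(adapter / full)  # 0.0078125 — under 1% of that matrix's parameters
```

The per-matrix fraction grows with the chosen rank, and the model-wide total lands in the few-percent range the docs cite because adapters are only attached to selected layers.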

Inference

Inference is running the trained model on new images to get predictions. You provide an image and a text prompt, and the model returns a response.

In Datature Vi, you run inference using the Vi SDK (locally on your machine) or NVIDIA NIM containers (production deployment). For a deeper look at how inference works and how generation settings control output, see How Does Inference Work?

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)
result, error = model(source="image.jpg", user_prompt="Describe this image", stream=False)
print(result.result)

System prompts

A system prompt is a set of natural language instructions that tells the model how to behave. It defines the model's role, what to look for in images, how to format responses, and what domain-specific terminology to use.

The analogy: It's like giving a photographer detailed instructions before a shoot: "Focus on product defects. Describe each defect using our internal severity scale. Ignore cosmetic scratches under 2mm."

The same system prompt is used during both training and inference. A mismatch between training and inference prompts will degrade performance.

Writing a good system prompt comes down to four elements:

  1. Define the role: "You are a quality control inspector analyzing PCB images" or "You are a medical imaging assistant describing radiographic findings."
  2. Specify what to look for: "Identify solder bridges, missing components, and cold joints" or "Describe visible crop health indicators."
  3. Set the output format: "Return a JSON object with fields: defect_found, defect_type, location, severity" or "Answer in one sentence."
  4. Add hallucination guards: "Only describe what is directly visible. Do not speculate about areas outside the frame."

Datature Vi provides default prompts for phrase grounding and VQA. Customize them when you need domain-specific behavior. See the system prompt configuration guide for full examples across industries.
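Assembled as plain text, a prompt covering all four elements might look like this hypothetical PCB-inspection example, built from the snippets above:

```python
# Hypothetical system prompt combining the four elements
system_prompt = "\n".join([
    # 1. Role
    "You are a quality control inspector analyzing PCB images.",
    # 2. What to look for
    "Identify solder bridges, missing components, and cold joints.",
    # 3. Output format
    "Return a JSON object with fields: defect_found, defect_type, location, severity.",
    # 4. Hallucination guard
    "Only describe what is directly visible. Do not speculate about areas outside the frame.",
])

print(system_prompt)
```

Remember that the same string must be used during both training and inference.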


Which VLM architectures does Datature Vi support?

Datature Vi supports seven model architectures across three Qwen generations plus specialist models. Start with Qwen3.5 4B to validate your setup, then scale to Qwen3.5 9B for production.

| If you need... | Choose | Why |
| --- | --- | --- |
| General-purpose starting point | Qwen3.5 9B | Best quality-to-cost ratio, 201-language support, hybrid DeltaNet architecture |
| Fast experimentation | Qwen3.5 4B | Quick iteration on small datasets with built-in thinking mode |
| Lowest memory / fastest inference | NVILA-Lite 2B | Runs on smaller GPUs, handles up to 4K resolution, full fine-tuning only |
| Complex multi-step reasoning | Cosmos-Reason1 7B | Optimized attention for reasoning chains, supports LoRA |
| Fine-grained visual details | InternVL3.5 8B | Best at attribute recognition and spatial awareness, supports LoRA |
| Maximum accuracy (large GPU budget) | Qwen3.5 27B | Highest dense model quality, strong reasoning and production-grade outputs |

For full architecture details, hardware requirements, and benchmark citation pointers, see Model Architectures.

How do I choose a model size?

Model size (parameter count) is not "bigger is better." The right choice depends on your dataset size, task complexity, and deployment constraints.

| Your situation | Recommended size | Why |
| --- | --- | --- |
| Under 200 images, first experiment | 2B-4B (NVILA-Lite, Qwen3.5 4B) | Fewer parameters means less risk of overfitting on small data. Trains faster, iterates quicker. |
| 200-1,000 images, production use | 7B-9B (Qwen3.5 9B, InternVL3.5 8B) | Enough capacity for most tasks without wasting compute. Good quality-to-cost ratio. |
| 1,000+ images, high accuracy needed | 27B+ (Qwen3.5 27B) | Large models benefit from large datasets. Requires more GPU memory and compute credits. |
| Strict latency requirements | 2B-4B | Smaller models produce tokens faster. Consider this for real-time applications. |

A 27B model trained on 50 images will often perform worse than a 4B model on the same data. Large models have so many parameters that they can memorize small datasets instead of learning general patterns. Match model size to data size first, then scale up if evaluation metrics plateau.
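Those rules of thumb can be condensed into a few lines. This helper is illustrative only; the thresholds come from the table above and are guidelines, not Vi settings:

```python
def suggested_size(num_images: int, latency_critical: bool = False) -> str:
    """Rule-of-thumb model size for a given dataset size."""
    if latency_critical or num_images < 200:
        return "2B-4B"
    if num_images < 1000:
        return "7B-9B"
    return "27B+"

print(suggested_size(50))    # 2B-4B
print(suggested_size(500))   # 7B-9B
print(suggested_size(5000))  # 27B+
```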

For more on how model size interacts with training settings, see How Does VLM Training Work?


Frequently asked questions

How much annotated data do I need?

The minimum is 20 annotated images with 100+ annotation pairs. The recommended amount is 100+ images with 500+ pairs. For production use, aim for 500+ images with 1,000+ pairs.

Quality matters more than quantity. 50 accurate, specific annotations will produce better results than 500 vague ones.

Can I run inference on my own GPU?

Yes, for local inference. The Vi SDK detects available GPU hardware automatically and uses it. For GPUs with limited memory, the SDK supports 8-bit and 4-bit quantization to reduce memory requirements.

For production deployment without managing GPUs directly, use NVIDIA NIM containers.

Do pre-trained models work without fine-tuning?

Yes. Pre-trained models work out of the box for general tasks, but fine-tuning on your specific data improves accuracy for your domain. If the model needs to understand your specific terminology, equipment, or visual patterns, fine-tuning is the way to get there.

What's the difference between phrase grounding and VQA?

Phrase grounding returns bounding boxes: you describe an object in text, and the model locates it in the image.

Visual question answering (VQA) returns text answers: you ask a question about the image, and the model answers in natural language.

Use phrase grounding when you need to know WHERE something is. Use VQA when you need to know WHAT something is, how many there are, or whether a condition is met.



Related resources

Phrase Grounding

Object localization with natural language descriptions.

Visual Question Answering

Answer questions about images with text responses.

Quickstart

Train and deploy your first VLM in 30 minutes.