Model Architectures
Compare the VLM architectures available in Datature Vi and choose the right one for your task, resource budget, and accuracy requirements.
Datature Vi supports seven vision-language model (VLM) architectures for fine-tuning. Each has different parameter sizes, strengths, and hardware requirements. The right choice depends on your task type, how much accuracy you need, and what GPU resources you have available.
Start with Qwen3.5 4B for your first training run. It trains fast, costs less, and handles most tasks well. Once you see good results, scale up to Qwen3.5 9B for production accuracy. You can skip the detailed comparison below and come back to it later.
A VLM architecture is the underlying model design: how the model processes images, encodes text, and generates predictions. Larger models (more parameters) can learn more complex patterns but cost more to train and run. Terms like "7B" mean 7 billion parameters. See VLM concepts for background on parameters, tokens, and other terms used on this page.
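As a rough rule of thumb (standard arithmetic, not a Vi-specific figure), the memory needed just to hold a model's weights scales linearly with parameter count; training adds several multiples on top for gradients, optimizer state, and activations:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory for model weights alone.

    bytes_per_param=2 assumes bf16/fp16 weights. Training needs
    several times more for gradients, optimizer state, and activations.
    """
    return params_billions * bytes_per_param  # 1e9 params * bytes, divided by 1e9 bytes/GB

print(weight_memory_gb(7))   # 14.0 -> ~14 GB just for a 7B model's weights in bf16
print(weight_memory_gb(35))  # 70.0 -> ~70 GB for 35B total weights
```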
Modalities and task types
All seven architectures support image and video modalities and the full set of training tasks: phrase grounding, visual question answering (VQA), and freeform text.
Which architecture should I use?
Start here before reading the full details below.
Start with a smaller model (Qwen3.5 4B or Qwen3-VL 4B) to validate your dataset, pipeline, and task setup with fast iteration cycles. Once your approach is working, scale up to Qwen3.5 9B or larger for production-grade accuracy. This way you spend less time and fewer compute credits while experimenting, and only commit to longer training runs when the fundamentals are right.
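If you manage flows through the Vi SDK, scaling up is a one-line settings change. Here is a minimal sketch using the same flow-update pattern as the full SDK example at the bottom of this page; the "qwen3.5" architecture identifier and size strings are illustrative, so confirm the exact names in the SDK reference:

```python
import vi

client = vi.Client(secret_key="your-secret-key", organization_id="your-organization-id")
flow = client.flows.get("your-flow-id")

# Validate on the small model first, then switch to the larger one.
# The "qwen3.5" identifier and size strings are illustrative; confirm
# the exact names in the SDK reference.
architecture = {"name": "qwen3.5", "size": "4B"}  # later: {"name": "qwen3.5", "size": "9B"}

blocks = []
for block in flow.spec.blocks:
    settings = dict(block.settings)
    if "model" in block.block:
        settings["architecture"] = architecture
    blocks.append({"block": block.block, "settings": settings, "style": block.style})

client.flows.update(flow_id=flow.flow_id, spec={"blocks": blocks})
```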
Which upstream checkpoint does Vi start from?
The sections below compare behavior and benchmarks first. Skip this table unless you need Hugging Face repo IDs and the exact base versus instruct line Vi loads for each architecture.
Fine-tuning always begins from a fixed public weight line per architecture. Qwen2.5-VL and Qwen3-VL are instruction-tuned only: each size uses the matching *-Instruct Hugging Face repo, and base checkpoints (non-Instruct or *-Base VL builds) are not loaded or offered. Qwen3.5 uses the post-trained Qwen/Qwen3.5-{size} release, not Qwen3.5-{size}-Base. NVIDIA and OpenGVLab families use the single Hugging Face checkpoints linked in the sections below.
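These naming rules translate mechanically into Hugging Face repo IDs. A small illustrative helper (confirm exact IDs against the references table before depending on them):

```python
def upstream_repo(family: str, size: str) -> str:
    # Qwen3.5 uses the post-trained release, never the -Base line.
    if family == "qwen3.5":
        return f"Qwen/Qwen3.5-{size}"
    # Qwen2.5-VL and Qwen3-VL load only the matching *-Instruct repos.
    if family == "qwen2.5-vl":
        return f"Qwen/Qwen2.5-VL-{size}-Instruct"
    if family == "qwen3-vl":
        return f"Qwen/Qwen3-VL-{size}-Instruct"
    raise ValueError("NVIDIA and OpenGVLab families use single fixed checkpoints")

print(upstream_repo("qwen3-vl", "8B"))  # Qwen/Qwen3-VL-8B-Instruct
print(upstream_repo("qwen3.5", "9B"))   # Qwen/Qwen3.5-9B
```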
Qwen3.5
Qwen3.5 is a natively multimodal architecture from Alibaba Cloud. Every model in this family accepts text, images, and video as input, including the 0.8B variant. Unlike Qwen3-VL, which bolts a vision encoder onto a language model, Qwen3.5 is multimodal from the ground up.
The architecture uses a hybrid design that alternates between Gated DeltaNet layers (linear attention with O(n) complexity) and standard Gated Attention layers. This pattern allocates 75% of layers to linear attention and 25% to full attention. The result is lower memory usage on long sequences while keeping quality high at key layers. Qwen3.5 supports 201 languages and a 262K-token native context window (extensible to 1M). It also supports Multi-Token Prediction (MTP), which speeds up inference through speculative decoding.
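To make the 75/25 split concrete, here is a toy sketch of one way such an interleave can be laid out; the repeating 3:1 DeltaNet-to-attention pattern is an illustrative assumption, since the exact layer ordering is fixed by the upstream model:

```python
# Toy illustration: three linear-attention (Gated DeltaNet) layers for
# every full (Gated Attention) layer yields the 75% / 25% allocation.
def layer_plan(num_layers: int) -> list[str]:
    pattern = ["deltanet", "deltanet", "deltanet", "attention"]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

plan = layer_plan(48)
print(plan.count("deltanet") / len(plan))   # 0.75
print(plan.count("attention") / len(plan))  # 0.25
```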
Available sizes:
0.8B
Smallest natively multimodal VLM. Runs on minimal hardware for basic image tasks.
2B
Edge-capable VLM with decent grounding and visual reasoning.
4B
Thinking mode enabled by default. Strong at visual agents and document tasks.
9B
Best quality-to-cost ratio in the family. 201-language support and strong tool calling.
27B
Highest dense model quality. Strong reasoning, coding, and production-grade outputs.
35B-A3B
35B total parameters, only 3B active per inference step. Near-27B quality at 3B compute cost.
Key strengths:
- Every size is natively multimodal (not a separate vision encoder added to a language model)
- Hybrid DeltaNet + Attention architecture reduces memory usage on long contexts
- 201-language support for text and OCR
- 262K native context window, extensible to 1M tokens
- Multi-Token Prediction (MTP) for faster inference via speculative decoding
- Built-in thinking/non-thinking mode toggle across all sizes
- Supports LoRA and full fine-tuning
Best for:
- Multilingual VLM tasks across 201 languages
- Long-context workloads (extended documents and video)
- Visual agent and tool-calling applications
- Deployments where the hybrid architecture's memory savings matter
- MoE variant for high-quality output on limited inference compute
Not recommended for:
- Cases where Qwen3-VL already meets your accuracy needs with a simpler architecture
- Teams that prefer the established Qwen2.5-VL/Qwen3-VL ecosystem and tooling
- MoE variant on single-GPU setups (35B total weights require ~70 GB VRAM)
Qwen3-VL
Qwen3-VL is Alibaba Cloud's latest transformer-based VLM and the direct successor to Qwen2.5-VL. It processes images and videos with three architectural upgrades over its predecessor: Interleaved-MRoPE for better video reasoning, DeepStack multi-level vision feature fusion, and Text-Timestamp Alignment for precise temporal event localization. The model supports a 256K-token native context window, extensible to 1M tokens.
Available sizes:
2B
Smallest Qwen3-VL variant. Good for edge deployment and basic image understanding.
4B
Fits on a single consumer GPU. Decent quality for document understanding and basic video.
8B
Best quality-to-cost ratio. Strong across image, video, OCR, and agent tasks.
32B
Highest benchmark scores in the family. Requires multi-GPU setup (40-80 GB).
Key strengths:
- 256K native context window, extensible to 1M tokens (Qwen2.5-VL tops out at 128K)
- OCR support for 32 languages, up from 19 in Qwen2.5-VL
- Interleaved-MRoPE for stronger video temporal reasoning
- DeepStack fuses multi-level vision features for sharper image-text alignment
- Supports LoRA and full fine-tuning
- Optional thinking mode variants available for chain-of-thought (CoT) reasoning
Best for:
- General-purpose VLM tasks with the proven Qwen transformer architecture
- Document understanding and multi-language OCR
- Long video comprehension with second-level temporal indexing
- Visual agent applications (PC and mobile GUI interaction)
- Teams that want the largest single dense model (32B)
Not recommended for:
- Ultra-lightweight edge deployments (consider NVILA-Lite 2B or Qwen3.5 0.8B)
- Physical world reasoning tasks (consider Cosmos-Reason2)
- Cases where Qwen2.5-VL already meets your accuracy needs and you prefer a mature ecosystem
Qwen2.5-VL
Qwen2.5-VL is developed by Alibaba Cloud. It processes images and videos at their native resolution without forced resizing, which preserves fine detail in high-resolution images, dense documents, and video frames. The model supports up to 128K tokens in its context window.
Available sizes:
3B
Fast training and inference. Good for initial experiments and resource-constrained deployments.
7B
Best balance of performance and efficiency for most production use cases.
32B
Highest accuracy across all task types. Requires 40-80 GB GPU memory.
Key strengths:
- Dynamic resolution processing: images are not resized to a fixed size
- Extended 128K token context window for long documents and multi-image inputs
- Multimodal Rotary Position Embedding (M-RoPE) for strong spatial and temporal understanding
- Supports LoRA and full fine-tuning
Best for:
- Phrase grounding with high-resolution images
- Visual question answering
- Document understanding and OCR tasks
- General-purpose VLM applications
Not recommended for:
- Deployments where the 3B/7B models exceed memory limits (use NVILA-Lite 2B instead)
- Tasks that need maximum reasoning depth on a single 7B budget (consider Cosmos-Reason1)
NVILA-Lite
NVILA-Lite is part of NVIDIA's NVILA model family. It uses a "scale-then-compress" approach: images are processed at high resolution first, then visual tokens are compressed efficiently. This lets a 2B-parameter model handle high-resolution inputs (up to 4K) with lower memory than comparable models.
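As a back-of-the-envelope illustration of why token compression matters at high resolution (the patch size and compression factor below are assumed for the arithmetic, not NVILA-Lite's published configuration):

```python
# Visual token count for one frame, before and after compression.
# patch=14 and compression=4 are illustrative assumptions, not
# NVILA-Lite's actual configuration.
def visual_tokens(width: int, height: int, patch: int = 14, compression: int = 4) -> tuple[int, int]:
    raw = (width // patch) * (height // patch)
    return raw, raw // compression

raw, compressed = visual_tokens(3840, 2160)  # one 4K frame
print(raw, compressed)  # tens of thousands of raw patch tokens vs. a quarter of that
```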
Available size:
2B
Single compact size. Handles high-resolution inputs (up to 4K) with the lowest memory footprint in the catalog.
NVILA-Lite does not support LoRA fine-tuning. It still supports full fine-tuning on your datasets. If you need LoRA, use Qwen3-VL, Qwen3.5, Qwen2.5-VL, or InternVL3.5 instead.
Key strengths:
- Lowest memory footprint of any available architecture
- Fastest inference latency
- Effective on high-resolution inputs despite compact size
- Suited for edge and real-time deployments
Best for:
- Edge device deployment with strict memory limits
- Real-time or high-throughput inference applications
- Initial proof-of-concept training where speed matters more than maximum accuracy
Not recommended for:
- Tasks requiring LoRA fine-tuning
- Complex reasoning tasks where model capacity is critical
- Production deployments where accuracy must match larger models
NVIDIA Cosmos-Reason1
Cosmos-Reason1 is a 7B-parameter model from NVIDIA designed for tasks that require logical inference and multi-step analysis. It performs well on cause-and-effect reasoning, contextual understanding, and analytical tasks where the model needs to draw conclusions from visual evidence.
Available size:
7B
Single size, tuned for multi-step analysis and cause-and-effect reasoning from visual evidence.
Key strengths:
- Optimized attention mechanisms for complex reasoning chains
- Strong at contextual understanding across visual and textual modalities
- Good performance on diagnostic and analytical applications
Best for:
- Multi-step visual reasoning (defect root cause analysis, diagnostic imaging)
- Tasks requiring contextual inference beyond simple detection
- Visual reasoning puzzles and structured analysis
Not recommended for:
- Simple object detection or counting tasks (Qwen2.5-VL 3B/7B is sufficient and faster)
- Tasks where general-purpose VLM capabilities matter more than reasoning depth
NVIDIA Cosmos-Reason2
Cosmos-Reason2 is NVIDIA's second-generation physical reasoning VLM, built on the Qwen2.5-VL architecture. It focuses on understanding how the physical world works: spatial relationships, object dynamics, affordances, and cause-and-effect reasoning from visual input. NVIDIA trained these models with a two-stage pipeline of supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO), which produces structured chain-of-thought (CoT) reasoning traces before final answers.
The 2B variant is based on Qwen2.5-VL-2B-Instruct. The 8B variant builds on Qwen2.5-VL-7B-Instruct and offers stronger multi-step reasoning and finer spatial accuracy.
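Because these models emit a reasoning trace before the final answer, downstream code usually separates the two. A minimal sketch, assuming the trace arrives in a `<think>...</think>` block; the actual delimiters depend on the model's chat template, so check the model card:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a chain-of-thought trace from the final answer.

    Assumes <think>...</think> delimiters, which vary by chat template.
    """
    match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if not match:
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()

trace, answer = split_reasoning(
    "<think>The box overhangs the shelf edge, so gravity will tip it.</think> The box will fall."
)
print(answer)  # The box will fall.
```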
Available sizes:
2B
Based on Qwen2.5-VL-2B-Instruct. The lighter, faster variant.
8B
Builds on Qwen2.5-VL-7B-Instruct. Stronger multi-step reasoning and finer spatial accuracy.
Key strengths:
- Purpose-built for physical world understanding: spatial reasoning, object permanence, and causal inference
- Chain-of-thought output provides interpretable reasoning steps (valuable for safety-critical applications)
- GRPO-trained for higher reasoning accuracy compared to SFT alone
- Processes both images and video natively
- Compatible with Qwen2.5-VL tooling and inference infrastructure
- Supports LoRA and full fine-tuning
Best for:
- Robotics perception and task planning
- Spatial relationship verification in manufacturing and quality inspection
- Video surveillance with physical event understanding
- Autonomous driving scene analysis
- Any task where understanding physical world behavior matters more than general VLM capability
Not recommended for:
- General-purpose VLM tasks like OCR, VQA, or document understanding (use Qwen3-VL or Qwen2.5-VL)
- Workloads where physical reasoning is not the primary requirement
OpenGVLab InternVL3.5
InternVL3.5 is an 8B-parameter model from OpenGVLab that provides strong performance across diverse multimodal tasks. It is designed for detailed visual understanding: fine-grained object recognition, scene comprehension, and spatial reasoning.
Available size:
8B
Single size, balancing accuracy and resource cost across diverse multimodal tasks.
Key strengths:
- Fine-grained recognition of small objects and subtle visual details
- Strong spatial awareness and scene understanding
- Supports LoRA and full fine-tuning
- Good balance between accuracy and resource cost at 8B parameters
Best for:
- Tasks requiring detailed visual attribute recognition
- Scene-level understanding with multiple interacting objects
- General-purpose applications where Qwen2.5-VL 7B results are borderline
Not recommended for:
- Deployments needing the smallest possible model
- Tasks where the Qwen2.5-VL family has a large established advantage
Coming soon
Three additional architectures are in development:
- Gemma 4: Google's open multimodal model family with strong visual reasoning and instruction following
- DeepSeek OCR: Specialized for text extraction and document understanding, with strong handling of handwritten text and multilingual documents
- LLaVA-NeXT: Advanced multimodal reasoning with improved instruction following
How do model generations compare?
Datature Vi includes three generations of Qwen VLMs. Each generation brings architectural improvements, but the older models remain available and production-tested.
Official references and benchmark sources
Use this table for citations. Numbers in the benchmark section below are copied only where the upstream source publishes them as text; otherwise we link to the figure, leaderboard, or paper table.
Benchmark performance
These scores come from the official model cards and papers published by each architecture's authors. They measure base model performance before fine-tuning. Your fine-tuned model's accuracy depends on your dataset quality, annotation volume, and training configuration.
Do this with the Vi SDK
```python
import vi

# Authenticate with your Datature Vi credentials.
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
)

# Fetch the flow whose model block you want to update.
flow = client.flows.get("your-flow-id")

# Rebuild the block list, swapping the architecture on the model block.
blocks = []
for block in flow.spec.blocks:
    settings = dict(block.settings)
    if "model" in block.block:
        settings["architecture"] = {"name": "qwen3vl", "size": "8B"}
    blocks.append({
        "block": block.block,
        "settings": settings,
        "style": block.style,
    })

client.flows.update(flow_id=flow.flow_id, spec={"blocks": blocks})
```

For more details, see the full SDK reference.
Further reading
- How to Fine-Tune Qwen2.5-VL: A hands-on guide to fine-tuning one of Datature Vi's most popular architectures.
- Containerized VLM Deployment with NVIDIA NIM: Deploy trained VLMs in production using NVIDIA NIM containers.
Related resources
Model Settings
Configure training mode, hyperparameters, and inference settings after choosing an architecture.
Start A Training Run
Configure training settings, select GPU hardware, and start a training run.
What Are VLMs?
Learn the fundamentals of vision-language models, how they process images, and why they matter.