Model Architectures

Compare the VLM architectures available in Datature Vi and choose the right one for your task, resource budget, and accuracy requirements.

Datature Vi supports seven vision-language model (VLM) architectures for fine-tuning. Each has different parameter sizes, strengths, and hardware requirements. The right choice depends on your task type, how much accuracy you need, and what GPU resources you have available.

Not sure which model to pick?

Start with Qwen3.5 4B for your first training run. It trains fast, costs less, and handles most tasks well. Once you see good results, scale up to Qwen3.5 9B for production accuracy. You can skip the detailed comparison below and come back to it later.

Jump to the quickstart guide →

A VLM architecture is the underlying model design: how the model processes images, encodes text, and generates predictions. Larger models (more parameters) can learn more complex patterns but cost more to train and run. Terms like "7B" mean 7 billion parameters. See VLM concepts for background on parameters, tokens, and other terms used on this page.

Modalities and task types

All seven architectures support image and video modalities and the full set of training tasks: phrase grounding, visual question answering (VQA), and freeform text.

Which architecture should I use?

Start here before reading the full details below.

| Architecture | Sizes | Best for | LoRA support |
| --- | --- | --- | --- |
| Qwen3.5 | 0.8B, 2B, 4B, 9B, 27B, 35B-A3B | Best overall performance, native multimodal | Yes |
| Qwen3-VL | 2B, 4B, 8B, 32B | General-purpose tasks with extended context and OCR | Yes |
| Qwen2.5-VL | 3B, 7B, 32B | Proven general-purpose tasks, phrase grounding, VQA | Yes |
| NVILA-Lite | 2B | Fast inference, edge deployment | No (full fine-tuning only) |
| Cosmos-Reason1 | 7B | Complex multi-step reasoning | Yes |
| Cosmos-Reason2 | 2B, 8B | Physical world and spatial reasoning | Yes |
| InternVL3.5 | 8B | Fine-grained visual understanding | Yes |

Start with a smaller model (Qwen3.5 4B or Qwen3-VL 4B) to validate your dataset, pipeline, and task setup with fast iteration cycles. Once your approach is working, scale up to Qwen3.5 9B or larger for production-grade accuracy. This way you spend less time and fewer compute credits while experimenting, and only commit to longer training runs when the fundamentals are right.

If the detailed table above feels overwhelming, use this guide instead. Pick the row that best describes your situation, then use the recommended model.

  • First experiment or proof of concept: Qwen3.5 4B. Trains fast, low cost, good enough to test whether your data and task setup work.
  • Manufacturing quality inspection: Qwen3.5 9B with phrase grounding. Good at finding and describing defects in product images.
  • Document processing (invoices, receipts, forms): Qwen3-VL 8B or Qwen3.5 9B. Both have strong OCR and text extraction. Qwen3-VL supports 32 languages; Qwen3.5 supports 201.
  • Medical imaging: Qwen3.5 9B. Scores well on medical VQA benchmarks (SLAKE, PMC-VQA). Pair with chain-of-thought annotations for explainable diagnoses.
  • Retail shelf analysis or inventory: Qwen3.5 9B with phrase grounding. Locates and identifies products in complex scenes.
  • Agriculture or drone imagery: Qwen3.5 9B with VQA or freeform text. Handles aerial perspectives and varied lighting.
  • Physical reasoning (counting, spatial relationships): Cosmos-Reason2 8B. Designed for multi-step reasoning about the physical world.
  • Edge or mobile deployment (low memory): NVILA-Lite 2B. Smallest footprint, designed to run on constrained hardware.

When in doubt, start with Qwen3.5 4B and upgrade later.


Which upstream checkpoint does Vi start from?

The sections below compare behavior and benchmarks first. Skip this table unless you need the Hugging Face repo IDs and the exact base-versus-instruct checkpoint Vi loads for each architecture.

Fine-tuning always begins from a fixed public checkpoint for each architecture. Qwen2.5-VL and Qwen3-VL are instruction-tuned only: each size uses the matching *-Instruct Hugging Face repo, and base checkpoints (non-Instruct or *-Base VL builds) are not loaded or offered. Qwen3.5 uses the post-trained Qwen/Qwen3.5-{size} release, not Qwen3.5-{size}-Base. The NVIDIA and OpenGVLab families use the single Hugging Face checkpoints linked in the sections below.

Starting weights per architecture

| Architecture | Datature Vi starts from |
| --- | --- |
| Qwen3.5 | Post-trained `Qwen/Qwen3.5-{size}` on Hugging Face (same repos as the size cards below). Not `Qwen3.5-{size}-Base`. |
| Qwen3-VL | Instruction-tuned only: `Qwen3-VL-{size}-Instruct`. No base checkpoint. |
| Qwen2.5-VL | Instruction-tuned only: `Qwen2.5-VL-{size}-Instruct`. No base checkpoint. |
| NVILA-Lite | `nvidia/NVILA-Lite-2B-hf` as published on Hugging Face. |
| Cosmos-Reason1 | `nvidia/Cosmos-Reason1-7B` as published on Hugging Face. |
| Cosmos-Reason2 | `nvidia/Cosmos-Reason2-2B` or `nvidia/Cosmos-Reason2-8B` as published. NVIDIA documents these as built on Qwen2.5-VL Instruct backbones. |
| InternVL3.5 | `OpenGVLab/InternVL3-8B` as published on Hugging Face (same repo as the 8B card below). |
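
If you want to inspect one of these upstream checkpoints outside of Datature Vi, the sketch below loads the Qwen2.5-VL 7B Instruct line with Hugging Face transformers. This is a local-exploration sketch under assumptions (a recent transformers release with Qwen2.5-VL support and enough memory to hold the weights); Vi downloads and manages these checkpoints for you during training.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Instruction-tuned repo that Vi starts from for the Qwen2.5-VL 7B size.
repo_id = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(repo_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(repo_id, torch_dtype="auto")

# The total parameter count covers the language model plus the vision encoder,
# so it lands somewhat above the nominal "7B" label.
print(sum(p.numel() for p in model.parameters()))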

Qwen3.5

Qwen3.5 is a natively multimodal architecture from Alibaba Cloud. Every model in this family accepts text, images, and video as input, including the 0.8B variant. Unlike Qwen3-VL, which bolts a vision encoder onto a language model, Qwen3.5 is multimodal from the ground up.

The architecture uses a hybrid design that alternates between Gated DeltaNet layers (linear attention with O(n) complexity) and standard Gated Attention layers. This pattern allocates 75% of layers to linear attention and 25% to full attention. The result is lower memory usage on long sequences while keeping quality high at key layers. Qwen3.5 supports 201 languages and a 262K-token native context window (extensible to 1M). It also supports Multi-Token Prediction (MTP), which speeds up inference through speculative decoding.
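
To make the 75/25 split concrete, here is a toy sketch of a 3:1 interleave between Gated DeltaNet layers and full Gated Attention layers. The exact layer schedule Qwen3.5 uses may differ in detail; this only illustrates the stated ratio.

# Toy illustration of a 3:1 linear-attention to full-attention interleave.
def layer_schedule(num_layers: int) -> list[str]:
    # Assumed pattern: every 4th layer is standard gated attention,
    # the other three are Gated DeltaNet (linear attention).
    return [
        "gated_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

schedule = layer_schedule(48)
print(schedule.count("gated_deltanet") / len(schedule))  # 0.75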

Available sizes:

MoE stands for Mixture of Experts. The 35B-A3B model contains 35 billion total parameters split across 256 expert sub-networks, but only 8 experts (plus 1 shared expert) activate for any given input. This means the model stores 35B parameters on disk and in GPU memory, but each inference step uses about 3B parameters of compute. The result: benchmark scores close to the 27B dense model, with inference compute cost closer to a 3B model. The tradeoff is higher total VRAM to hold all expert weights (~70 GB).
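
A quick back-of-the-envelope check of those numbers, assuming bf16 weights at 2 bytes per parameter (actual VRAM use is higher once activations and the KV cache are included):

BYTES_PER_PARAM = 2          # bf16 weights

total_params = 35e9          # all expert weights must sit in GPU memory
active_params = 3e9          # roughly what each token actually computes through

weight_memory_gb = total_params * BYTES_PER_PARAM / 1e9
print(f"Weights held in VRAM: ~{weight_memory_gb:.0f} GB")              # ~70 GB
print(f"Per-token compute scales with ~{active_params / 1e9:.0f}B parameters")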

Key strengths:

  • Every size is natively multimodal (not a separate vision encoder added to a language model)
  • Hybrid DeltaNet + Attention architecture reduces memory usage on long contexts
  • 201-language support for text and OCR
  • 262K native context window, extensible to 1M tokens
  • Multi-Token Prediction (MTP) for faster inference via speculative decoding
  • Built-in thinking/non-thinking mode toggle across all sizes
  • Supports LoRA and full fine-tuning
Best for
  • Multilingual VLM tasks across 201 languages
  • Long-context workloads (extended documents and video)
  • Visual agent and tool-calling applications
  • Deployments where the hybrid architecture's memory savings matter
  • MoE variant for high-quality output on limited inference compute
Not for
  • Cases where Qwen3-VL already meets your accuracy needs with a simpler architecture
  • Teams that prefer the established Qwen2.5-VL/Qwen3-VL ecosystem and tooling
  • MoE variant on single-GPU setups (35B total weights require ~70 GB VRAM)

Qwen3-VL

Qwen3-VL is Alibaba Cloud's latest transformer-based VLM and the direct successor to Qwen2.5-VL. It processes images and videos with three architectural upgrades over its predecessor: Interleaved-MRoPE for better video reasoning, DeepStack multi-level vision feature fusion, and Text-Timestamp Alignment for precise temporal event localization. The model supports a 256K-token native context window, extensible to 1M tokens.

Available sizes:

Key strengths:

  • 256K native context window (8x longer than Qwen2.5-VL), extensible to 1M tokens
  • OCR support for 32 languages, up from 19 in Qwen2.5-VL
  • Interleaved-MRoPE for stronger video temporal reasoning
  • DeepStack fuses multi-level vision features for sharper image-text alignment
  • Supports LoRA and full fine-tuning
  • Optional thinking mode variants available for chain-of-thought (CoT) reasoning
Best for
  • General-purpose VLM tasks with the proven Qwen transformer architecture
  • Document understanding and multi-language OCR
  • Long video comprehension with second-level temporal indexing
  • Visual agent applications (PC and mobile GUI interaction)
  • Teams that want the largest single dense model (32B)
Not for
  • Ultra-lightweight edge deployments (consider NVILA-Lite 2B or Qwen3.5 0.8B)
  • Physical world reasoning tasks (consider Cosmos-Reason2)
  • Cases where Qwen2.5-VL already meets your accuracy needs and you prefer a mature ecosystem

Qwen2.5-VL

Qwen2.5-VL is developed by Alibaba Cloud. It processes images and videos at their native resolution without forced resizing, which preserves fine detail in high-resolution images, dense documents, and video frames. The model supports up to 128K tokens in its context window.
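
A practical consequence of native-resolution processing is that the visual token count grows with image size. The rough estimate below assumes the roughly 28x28-pixels-per-visual-token patching described in the Qwen2.5-VL report; real counts also depend on patch rounding and the processor's min/max pixel limits.

def approx_visual_tokens(width: int, height: int, pixels_per_token: int = 28 * 28) -> int:
    # Rough estimate only; treat the output as an order-of-magnitude guide.
    return (width * height) // pixels_per_token

print(approx_visual_tokens(1280, 720))    # roughly 1,200 tokens
print(approx_visual_tokens(3840, 2160))   # roughly 10,600 tokens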

Available sizes:

Lightweight

3B

Fast training and inference. Good for initial experiments and resource-constrained deployments.

View model card
Recommended

7B

Best balance of performance and efficiency for most production use cases.

View model card
Maximum accuracy

32B

Highest accuracy across all task types. Requires 40-80 GB GPU memory.

View model card

Key strengths:

  • Dynamic resolution processing: images are not resized to a fixed size
  • Extended 128K token context window for long documents and multi-image inputs
  • Multimodal Rotary Position Embedding (M-RoPE) for strong spatial and temporal understanding
  • Supports LoRA and full fine-tuning
Best for
  • Phrase grounding with high-resolution images
  • Visual question answering
  • Document understanding and OCR tasks
  • General-purpose VLM applications
Not for
  • Deployments where the 3B/7B models exceed memory limits (use NVILA-Lite 2B instead)
  • Tasks that need maximum reasoning depth on a single 7B budget (consider Cosmos-Reason1)

NVILA-Lite

NVILA-Lite is part of NVIDIA's NVILA model family. It uses a "scale-then-compress" approach: images are processed at high resolution first, then visual tokens are compressed efficiently. This lets a 2B-parameter model handle high-resolution inputs (up to 4K) with lower memory than comparable models.

Available size:

Edge

2B

Compact VLM with scale-then-compress processing for high-resolution inputs on constrained hardware.

View model card
LoRA Not Supported

NVILA-Lite does not support LoRA fine-tuning. It still supports full fine-tuning on your datasets. If you need LoRA, use Qwen3-VL, Qwen3.5, Qwen2.5-VL, or InternVL3.5 instead.

Key strengths:

  • Lowest memory footprint of any available architecture
  • Fastest inference latency
  • Effective on high-resolution inputs despite compact size
  • Suited for edge and real-time deployments
Best for
  • Edge device deployment with strict memory limits
  • Real-time or high-throughput inference applications
  • Initial proof-of-concept training where speed matters more than maximum accuracy
Not for
  • Tasks requiring LoRA fine-tuning
  • Complex reasoning tasks where model capacity is critical
  • Production deployments where accuracy must match larger models

NVIDIA Cosmos-Reason1

Cosmos-Reason1 is a 7B-parameter model from NVIDIA designed for tasks that require logical inference and multi-step analysis. It performs well on cause-and-effect reasoning, contextual understanding, and analytical tasks where the model needs to draw conclusions from visual evidence.

Available size:

Reasoning

7B

Multi-step logical inference and cause-and-effect analysis from visual evidence.

View model card

Key strengths:

  • Optimized attention mechanisms for complex reasoning chains
  • Strong at contextual understanding across visual and textual modalities
  • Good performance on diagnostic and analytical applications
Best for
  • Multi-step visual reasoning (defect root cause analysis, diagnostic imaging)
  • Tasks requiring contextual inference beyond simple detection
  • Visual reasoning puzzles and structured analysis
Not for
  • Simple object detection or counting tasks (Qwen2.5-VL 3B/7B is sufficient and faster)
  • Tasks where general-purpose VLM capabilities matter more than reasoning depth

NVIDIA Cosmos-Reason2

Cosmos-Reason2 is NVIDIA's second-generation physical reasoning VLM, built on the Qwen2.5-VL architecture. It focuses on understanding how the physical world works: spatial relationships, object dynamics, affordances, and cause-and-effect reasoning from visual input. NVIDIA trained these models with a two-stage pipeline of supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO), which produces structured chain-of-thought (CoT) reasoning traces before final answers.

The 2B variant is based on Qwen2.5-VL-2B-Instruct. The 8B variant builds on Qwen2.5-VL-7B-Instruct and offers stronger multi-step reasoning and finer spatial accuracy.

Available sizes:

Edge physical AI

2B

Lightweight physical reasoning for edge and embedded systems. Based on Qwen2.5-VL-2B.

View model card
Full physical reasoning

8B

Strongest physical reasoning at single-GPU scale. Based on Qwen2.5-VL-7B.

View model card

Key strengths:

  • Purpose-built for physical world understanding: spatial reasoning, object permanence, and causal inference
  • Chain-of-thought output provides interpretable reasoning steps (valuable for safety-critical applications)
  • GRPO-trained for higher reasoning accuracy compared to SFT alone
  • Processes both images and video natively
  • Compatible with Qwen2.5-VL tooling and inference infrastructure
  • Supports LoRA and full fine-tuning
Best for
  • Robotics perception and task planning
  • Spatial relationship verification in manufacturing and quality inspection
  • Video surveillance with physical event understanding
  • Autonomous driving scene analysis
  • Any task where understanding physical world behavior matters more than general VLM capability
Not for
  • General-purpose VLM tasks like OCR, VQA, or document understanding (use Qwen3-VL or Qwen2.5-VL)
  • Workloads where physical reasoning is not the primary requirement

OpenGVLab InternVL3.5

InternVL3.5 is an 8B-parameter model from OpenGVLab that provides strong performance across diverse multimodal tasks. It is designed for detailed visual understanding: fine-grained object recognition, scene comprehension, and spatial reasoning.

Available size:

Fine-grained vision

8B

Detailed visual understanding with strong spatial awareness and fine-grained object recognition.

View model card

Key strengths:

  • Fine-grained recognition of small objects and subtle visual details
  • Strong spatial awareness and scene understanding
  • Supports LoRA and full fine-tuning
  • Good balance between accuracy and resource cost at 8B parameters
Best for
  • Tasks requiring detailed visual attribute recognition
  • Scene-level understanding with multiple interacting objects
  • General-purpose applications where Qwen2.5-VL 7B results are borderline
Not for
  • Deployments needing the smallest possible model
  • Tasks where the Qwen2.5-VL family has a large established advantage

Coming soon

Three additional architectures are in development:

  • Gemma 4: Google's open multimodal model family with strong visual reasoning and instruction following
  • DeepSeek OCR: Specialized for text extraction and document understanding, with strong handling of handwritten text and multilingual documents
  • LLaVA-NeXT: Advanced multimodal reasoning with improved instruction following

How do model generations compare?

Datature Vi includes three generations of Qwen VLMs. Each generation brings architectural improvements, but the older models remain available and production-tested.

| | Qwen2.5-VL | Qwen3-VL | Qwen3.5 |
| --- | --- | --- | --- |
| Native context | 32K (128K extended) | 256K (1M extended) | 262K (1M extended) |
| OCR languages | 19 | 32 | 201 |
| Thinking mode | No | Separate variants | Built-in toggle |
| Architecture | Transformer + ViT | Transformer + ViT (upgraded) | Hybrid DeltaNet + Attention + ViT |
| Multi-Token Prediction | No | No | Yes |
| Smallest available size | 3B | 2B | 0.8B |
| MoE variant | No | No | 35B-A3B |

Official references and benchmark sources

Use this table for citations. Numbers in the benchmark section below are copied only where the upstream source publishes them as text; otherwise we link to the figure, leaderboard, or paper table.

Per-architecture primary references

| Architecture | Technical report or overview | Benchmark tables on this page |
| --- | --- | --- |
| Qwen3.5 | https://qwen.ai/blog?id=qwen3.5 and https://arxiv.org/abs/2505.09388 (Qwen3 Technical Report) | Accordion: Qwen3.5 benchmarks (4B and 9B); scores match the Hugging Face model cards for those sizes |
| Qwen3-VL | https://arxiv.org/abs/2511.21631 (Qwen3-VL Technical Report) | Accordion: Qwen3-VL benchmarks; figures and tables on https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct |
| Qwen2.5-VL | https://arxiv.org/abs/2502.13923 (Qwen2.5-VL Technical Report) | Accordion: Qwen2.5-VL benchmarks (7B) from https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct |
| NVILA-Lite | https://arxiv.org/abs/2412.04468 (NVILA) | Accordion: NVILA-Lite; tabulated scores in the paper and on https://huggingface.co/nvidia/NVILA-Lite-2B-hf |
| Cosmos-Reason1 | https://arxiv.org/abs/2503.15558 (Cosmos-Reason1) | Accordion: Cosmos-Reason1 benchmarks (7B) from https://huggingface.co/nvidia/Cosmos-Reason1-7B |
| Cosmos-Reason2 | https://arxiv.org/abs/2501.03575 (Cosmos platform) and https://docs.nvidia.com/cosmos/latest/reason2/index.html | Accordion: Cosmos-Reason2; leaderboard https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard and https://huggingface.co/nvidia/Cosmos-Reason2-8B |
| InternVL3.5 | https://arxiv.org/abs/2504.10479 (InternVL3) | Accordion: InternVL3.5 benchmarks (8B); scores from the paper for InternVL3-8B |

Benchmark performance

These scores come from the official model cards and papers published by each architecture's authors. They measure base model performance before fine-tuning. Your fine-tuned model's accuracy depends on your dataset quality, annotation volume, and training configuration.

Scores from https://huggingface.co/Qwen/Qwen3.5-4B and https://huggingface.co/Qwen/Qwen3.5-9B on standard multimodal benchmarks. Higher is better.

| Benchmark | Category | Qwen3.5 4B | Qwen3.5 9B |
| --- | --- | --- | --- |
| MMMU | General multimodal | 77.6 | 78.4 |
| MMMU-Pro | General multimodal | 66.3 | 70.1 |
| MMBench EN v1.1 | General VQA | 89.4 | 90.1 |
| MMStar | General VQA | 78.3 | 79.7 |
| RealWorldQA | General VQA | 79.5 | 80.3 |
| MathVista (mini) | Math reasoning | 85.1 | 85.7 |
| MathVision | Math reasoning | 74.6 | 78.9 |
| AI2D | Document/diagram | 89.6 | 90.2 |
| OCRBench | OCR | 85.0 | 89.2 |
| CC-OCR | OCR | 76.7 | 79.3 |
| DocBench 1.5 | Document | 86.2 | 87.7 |
| RefCOCO (avg) | Grounding | 88.1 | 89.7 |
| CountBench | Counting | 96.3 | 97.2 |
| VideoMME (w/ sub) | Video | 83.5 | 84.5 |
| MLVU | Video | 82.8 | 84.4 |
| SLAKE | Medical VQA | 76.1 | 79.0 |
| PMC-VQA | Medical VQA | 55.5 | 57.9 |

Scores from the Qwen2.5-VL-7B model card. Higher is better.

| Benchmark | Category | Qwen2.5-VL 7B |
| --- | --- | --- |
| MMMU | General multimodal | 58.6 |
| MMMU-Pro | General multimodal | 41.0 |
| MMBench EN v1.1 | General VQA | 82.6 |
| MMStar | General VQA | 63.9 |
| MathVista (mini) | Math reasoning | 68.2 |
| MathVision | Math reasoning | 25.1 |
| DocVQA | Document | 95.7 |
| ChartQA | Document | 87.3 |
| TextVQA | OCR | 84.9 |
| OCRBench | OCR | 86.4 |
| HallusionBench | Hallucination | 52.9 |
| MVBench | Video | 69.6 |
| VideoMME (w/ sub) | Video | 71.6 |
| MLVU | Video | 70.2 |
| ScreenSpot | Agent | 84.7 |

Scores from the Cosmos-Reason1 model card. Cosmos models are evaluated on embodied reasoning tasks rather than standard VQA benchmarks.

| Benchmark | Category | Cosmos-Reason1 7B |
| --- | --- | --- |
| RoboVQA | Robotics reasoning | 87.3 |
| Autonomous Vehicle | Driving reasoning | 70.8 |
| BridgeDataV2 | Manipulation | 63.7 |
| Agibot | Embodied AI | 48.9 |
| HoloAssist | AR assistance | 62.7 |
| RoboFail | Failure detection | 57.2 |
| Average | Overall | 65.1 |

Standard multimodal benchmarks (MMMU, MMBench) are not published in the model card. See the Cosmos-Reason1 paper for extended evaluations.

Scores from the InternVL3 paper. Higher is better.

| Benchmark | Category | InternVL3 8B |
| --- | --- | --- |
| MMMU | General multimodal | 62.7 |
| MMBench EN v1.1 | General VQA | 81.7 |
| MMStar | General VQA | 68.2 |
| MMVet | General VQA | 81.3 |
| MathVista | Math reasoning | 71.6 |
| MathVision | Math reasoning | 29.3 |
| AI2D | Document/diagram | 85.2 |
| DocVQA | Document | 92.7 |
| ChartQA | Document | 86.6 |
| TextVQA | OCR | 80.2 |
| OCRBench | OCR | 880 (0-1000 scale) |
| RefCOCO (avg) | Grounding | 89.6 |
| MME | Comprehensive | 2415.4 (0-2800 scale) |

Dense sizes 2B, 4B, 8B, and 32B share the same evaluation setup. Text tables and charts for the 8B instruct model are on https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct (multimodal and pure-text figures). The technical report is https://arxiv.org/abs/2511.21631 and is the right place to cite when you need paragraph-level claims about architecture (Interleaved-MRoPE, DeepStack, text-timestamp alignment).

For a numeric comparison against Qwen2.5-VL 7B and Qwen3.5 9B, read the plotted benchmarks on that model card side by side with the Qwen2.5-VL 7B and Qwen3.5 9B accordions above.

NVILA-Lite targets edge latency and memory. Tabulated scores appear in https://arxiv.org/abs/2412.04468 (NVILA) and on https://huggingface.co/nvidia/NVILA-Lite-2B-hf. Compare those tables to Qwen2.5-VL 3B and Qwen3-VL 4B when you are choosing a compact model rather than a reasoning specialist.

Cosmos-Reason2 is scored on physical and embodied reasoning suites, not on the same public VQA leaderboard mix as Qwen-Instruct models. Start from https://huggingface.co/spaces/shi-labs/physical-ai-bench-leaderboard for ranked numbers, then open the model card at https://huggingface.co/nvidia/Cosmos-Reason2-8B for variant-specific notes. Platform context for the Cosmos family is in https://arxiv.org/abs/2501.03575; product docs live at https://docs.nvidia.com/cosmos/latest/reason2/index.html.

Benchmark scores measure base model ability before fine-tuning. After fine-tuning on your data, accuracy on your specific task will differ from these numbers.

Key benchmarks explained:

  • MMMU measures college-level multimodal reasoning across 30 subjects
  • MMBench tests general visual perception and reasoning
  • OCRBench measures text recognition accuracy across fonts, languages, and layouts
  • DocVQA tests document understanding and question answering
  • MathVista tests mathematical reasoning from visual inputs
  • RefCOCO measures object grounding accuracy (locating objects from text descriptions)
  • VideoMME tests video comprehension and temporal reasoning
  • MME is a comprehensive evaluation covering perception and cognition (scores are on a 0-2800 scale, not a percentage; see the conversion sketch below)
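
To compare the MME figure above against percentage-style benchmarks, you can divide it by the 2800-point maximum. This is a convenience conversion only, not an official metric.

mme_score = 2415.4     # InternVL3 8B from the table above
mme_max = 2800.0
print(f"{mme_score / mme_max * 100:.1f}%")   # about 86.3%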

Frequently asked questions

Can I change the model architecture after saving a workflow?

No. The model architecture is fixed when you save a workflow. To try a different architecture, create a new workflow. You can maintain multiple workflows within one training project and compare their results.

How does model size affect training time and cost?

Larger models require more GPU memory and longer training time, which increases compute credit consumption. As a rough guide: sub-4B models train fastest with the lowest cost; 7B-9B models are moderate; 27B-35B models take the longest and cost the most. Training time also depends on dataset size and epoch count. MoE models (like Qwen3.5 35B-A3B) store more total weights but use fewer active parameters per step, so their training cost falls between their total size and active size.

Which architecture should I start with?

Start with Qwen3.5 4B for your first training run. Smaller models train faster and cost less, which lets you iterate on your dataset and settings before committing to a larger model. Once you are satisfied with the results, scale up to Qwen3.5 9B or 27B to push accuracy higher. Follow the quickstart guide to train your first model with default settings.

Do all architectures support the same tasks and modalities?

Yes. Each architecture supports phrase grounding, visual question answering (VQA), and freeform text on image and video data. For NVIDIA NIM input modes by architecture, see Run NIM inference. Cosmos-Reason1 and Cosmos-Reason2 are tuned for reasoning workloads (logical inference and physical world reasoning, respectively). NVILA-Lite is the only listed model that does not support LoRA fine-tuning. DeepSeek OCR (coming soon) will be specialized for text extraction.

Should I choose Qwen3.5 or Qwen3-VL?

For most users, Qwen3.5 9B is the best starting point. It scores higher on benchmarks at comparable sizes, supports 201 languages, and its hybrid DeltaNet architecture is more memory-efficient on long contexts. Choose Qwen3-VL if you want the established Qwen transformer architecture and ecosystem, or if you need the 32B size (Qwen3.5's largest dense model is 27B). Both families support LoRA and full fine-tuning.

When should I choose Cosmos-Reason2 over a general-purpose VLM?

Pick Cosmos-Reason2 when your primary task involves physical world understanding: robotics, spatial verification, manufacturing inspection, or autonomous driving. These specialist models outperform general-purpose VLMs on physical reasoning benchmarks, but they trade off general capability. If your task is mostly OCR, VQA, or document understanding, a general-purpose model like Qwen3-VL will perform better.


Do this with the Vi SDK

import vi

# Authenticate against your Datature Vi organization.
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Fetch the workflow, then rebuild its block list, swapping the architecture
# on any model block while copying every other block through unchanged.
flow = client.flows.get("your-flow-id")
blocks = []
for block in flow.spec.blocks:
    settings = dict(block.settings)
    if "model" in block.block:
        settings["architecture"] = {"name": "qwen3vl", "size": "8B"}
    blocks.append({
        "block": block.block,
        "settings": settings,
        "style": block.style,
    })

# Save the updated block spec back to the workflow.
client.flows.update(flow_id=flow.flow_id, spec={"blocks": blocks})
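
As a follow-up sketch, you can re-fetch the workflow to confirm the change took effect, reusing only the calls shown above (assumes the same client and flow ID):

# Re-fetch the workflow and print the architecture setting of each model block.
updated = client.flows.get("your-flow-id")
for block in updated.spec.blocks:
    if "model" in block.block:
        print(block.block, dict(block.settings).get("architecture"))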

For more details, see the full SDK reference.


Related resources

Model Settings

Configure training mode, hyperparameters, and inference settings after choosing an architecture.

Start A Training Run

Configure training settings, select GPU hardware, and start a training run.

What Are VLMs?

Learn the fundamentals of vision-language models, how they process images, and why they matter.