Model Architectures

Understand the available vision-language model architectures and choose the right one for your use case

Vi supports multiple state-of-the-art vision-language model (VLM) architectures, each optimized for different use cases and performance requirements. Choose the right model based on your task complexity, resource constraints, and accuracy needs.

💡 New to VLMs?

Vision-language models can understand both images and text, allowing you to train AI that responds to natural language prompts. Learn more about VLM concepts and phrase grounding.

Available Model Architectures

Vi offers four powerful VLM architectures across different parameter sizes. The model size (measured in billions of parameters) generally correlates with capability—larger models offer better accuracy but require more compute resources.


Qwen Qwen2.5-VL

Qwen2.5-VL is a powerful multimodal vision-language model developed by Alibaba Cloud, designed to process and understand text, images, and videos. The model excels at visual question answering, image understanding, and multimodal reasoning tasks.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 3B | 3 billion | Resource-constrained environments, faster inference, lightweight applications |
| 7B | 7 billion | Balanced performance and efficiency, most common use cases |
| 32B | 32 billion | Maximum accuracy, complex reasoning tasks, production deployments |

📘 Recommended Starting Point

The 7B version offers an excellent balance between performance and resource requirements for most use cases. Start here unless you have specific constraints or accuracy requirements.

Key Features

Dynamic Resolution Processing

Qwen2.5-VL processes images and videos at their native resolutions without forced resizing or padding. This preserves fine details and improves accuracy, especially for:

  • High-resolution images with small objects
  • Documents with dense text
  • Videos with rapid motion or small details

Unlike traditional models that resize all inputs to a fixed size (like 224×224), Qwen2.5-VL adapts to each image's dimensions.
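
As a concrete example, here is a minimal sketch of dynamic-resolution inference using the Hugging Face transformers integration for Qwen2.5-VL (shown for illustration; this is not the Vi training API). The min_pixels/max_pixels bounds and the document_page.png path are placeholder values:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# min_pixels / max_pixels bound the dynamic-resolution token budget per image;
# within those bounds the image keeps its native aspect ratio and detail.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document_page.png"},  # placeholder path
        {"type": "text", "text": "Read the fine print in the bottom-right corner."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```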

Multimodal Rotary Position Embedding (M-RoPE)

M-RoPE extends standard positional encoding to handle spatial and temporal dimensions. This helps the model understand:

  • Spatial relationships — Where objects are located in images
  • Temporal dynamics — How scenes change across video frames
  • Text alignment — How visual elements correspond to text descriptions

This makes Qwen2.5-VL particularly effective for video understanding and complex visual reasoning tasks.
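
As a rough intuition for how M-RoPE differs from ordinary 1-D RoPE, the toy sketch below splits the rotary channels across temporal, height, and width position indices. This is an illustration of the idea only, not the actual Qwen2.5-VL implementation, which uses its own channel split and frequency schedule:

```python
import torch

def mrope_tables(t_pos, h_pos, w_pos, dim=128, base=10000.0):
    """Toy multimodal RoPE: split the rotary channels across three position
    axes (frame index, patch row, patch column) instead of one sequence index.

    t_pos, h_pos, w_pos: 1-D integer tensors, one entry per token.
    Returns cos/sin tables of shape (num_tokens, dim), used like standard RoPE.
    """
    half = dim // 2
    splits = (half // 3, half // 3, half - 2 * (half // 3))  # channels per axis
    angles = []
    for pos, n in zip((t_pos, h_pos, w_pos), splits):
        inv_freq = 1.0 / (base ** (torch.arange(n, dtype=torch.float32) / n))
        angles.append(pos[:, None].float() * inv_freq[None, :])  # (tokens, n)
    ang = torch.cat(angles, dim=-1)          # (tokens, dim/2)
    ang = torch.cat([ang, ang], dim=-1)      # duplicate for the rotate-half trick
    return ang.cos(), ang.sin()

# Text tokens use identical t/h/w indices, so this reduces to ordinary 1-D RoPE;
# image and video tokens vary the row/column (and frame) indices, which is how
# the model sees where and when each visual patch appears.
cos, sin = mrope_tables(torch.zeros(4, dtype=torch.long),
                        torch.tensor([0, 0, 1, 1]),
                        torch.tensor([0, 1, 0, 1]))
print(cos.shape)  # torch.Size([4, 128])
```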

Extended Context Length

Qwen2.5-VL supports up to 128K tokens in its context window, enabling:

  • Analysis of long-form videos
  • Processing of multi-page documents
  • Complex conversations about multiple images
  • Detailed system prompts with extensive examples

This extended context is especially valuable for document understanding and video analysis tasks.
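
As a back-of-the-envelope illustration of what a 128K-token window buys you, the sketch below estimates visual tokens per scanned page, assuming roughly one visual token per 28×28-pixel block (14-pixel patches merged 2×2). The real processor also rescales and caps resolution, so treat these numbers as rough approximations:

```python
import math

def approx_visual_tokens(width_px: int, height_px: int) -> int:
    """Very rough estimate: ~one visual token per 28x28-pixel block
    (14-px patches merged 2x2). Ignores the processor's resize limits."""
    return math.ceil(width_px / 28) * math.ceil(height_px / 28)

context_window = 128_000                      # advertised context length (tokens)
reserved_for_text = 4_000                     # prompt, instructions, and answers
per_page = approx_visual_tokens(1240, 1754)   # an A4 page scanned at ~150 DPI

pages_that_fit = (context_window - reserved_for_text) // per_page
print(f"~{per_page} visual tokens per page, ~{pages_that_fit} pages per request")
```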

Window Attention Mechanism

The Vision Transformer (ViT) encoder uses window attention to accelerate training and inference. This optimization:

  • Reduces computational complexity for large images
  • Enables faster training iterations
  • Improves inference speed without sacrificing accuracy
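
A toy sketch of the mechanism described above: patch tokens are grouped into non-overlapping tiles and attention is computed only within each tile, so cost grows with the window size rather than with the full image. The projections and sizes here are placeholders, not the actual encoder configuration:

```python
import torch
import torch.nn.functional as F

def window_attention(x, grid_h, grid_w, window=8, num_heads=4):
    """Toy window attention: tokens attend only within non-overlapping
    window x window tiles of the patch grid, not across the whole image.

    x: (grid_h * grid_w, dim) patch embeddings for one image.
    """
    dim = x.shape[-1]
    head_dim = dim // num_heads

    # Fold the token sequence back into its 2-D patch grid, then cut tiles.
    x = x.view(grid_h // window, window, grid_w // window, window, dim)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, dim)  # (tiles, win^2, dim)

    # Stand-in for the learned q/k/v projections of a real ViT block.
    q = k = v = x
    def split_heads(t):  # (tiles, win^2, dim) -> (tiles, heads, win^2, head_dim)
        return t.view(t.shape[0], t.shape[1], num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
    return out.transpose(1, 2).reshape(-1, window * window, dim)

tokens = torch.randn(32 * 32, 256)               # a 32x32 patch grid
print(window_attention(tokens, 32, 32).shape)    # torch.Size([16, 64, 256])
```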

Architecture Overview

The Qwen2.5-VL architecture consists of three main components:

  1. Vision Encoder: A Vision Transformer (ViT) with approximately 600 million parameters, supporting dynamic resolution for both images and videos
  2. Language Model Decoder: Based on the Qwen2.5 transformer with Grouped Query Attention (GQA), enabling efficient inference
  3. M-RoPE Integration: Multimodal positional embeddings that connect visual and textual information
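
As a rough illustration of the Grouped Query Attention mentioned in step 2, the sketch below shares a small number of key/value heads across groups of query heads, which is what shrinks the KV cache at inference time. Head counts and dimensions are arbitrary examples, not the model's actual configuration:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_q_heads=16, num_kv_heads=4):
    """Toy GQA: every group of 4 query heads shares one key/value head,
    so the KV cache is 4x smaller than with standard multi-head attention.

    q: (seq, num_q_heads * head_dim); k, v: (seq, num_kv_heads * head_dim)
    """
    seq, q_dim = q.shape
    head_dim = q_dim // num_q_heads

    q = q.view(seq, num_q_heads, head_dim).transpose(0, 1)    # (Hq,  seq, d)
    k = k.view(seq, num_kv_heads, head_dim).transpose(0, 1)   # (Hkv, seq, d)
    v = v.view(seq, num_kv_heads, head_dim).transpose(0, 1)

    group = num_q_heads // num_kv_heads
    k = k.repeat_interleave(group, dim=0)                     # broadcast KV heads
    v = v.repeat_interleave(group, dim=0)

    out = F.scaled_dot_product_attention(q, k, v)             # (Hq, seq, d)
    return out.transpose(0, 1).reshape(seq, q_dim)

q = torch.randn(10, 16 * 64)
k = torch.randn(10, 4 * 64)
v = torch.randn(10, 4 * 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([10, 1024])
```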


NVIDIA NVILA-Lite

NVIDIA NVILA-Lite is an efficient vision-language model optimized for both accuracy and computational efficiency. Part of NVIDIA's NVILA family, it employs a "scale-then-compress" approach to process high-resolution images and long videos efficiently.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 2B | 2 billion | Edge deployments, fast inference, resource-limited environments |

🚧 LoRA Not Supported

NVILA-Lite does not support LoRA (Low-Rank Adaptation) fine-tuning. If you need LoRA-based training, consider using Qwen2.5-VL or InternVL3.5 instead.

Key Features

Scale-Then-Compress Approach

NVILA-Lite uses a unique two-stage processing method:

  1. Scale: Process images and videos at high spatial and temporal resolutions to capture fine details
  2. Compress: Reduce visual tokens efficiently without losing critical information

This approach enables the model to handle high-resolution images and long videos while maintaining efficiency. Benefits include:

  • Faster inference compared to similarly-sized models
  • Lower memory requirements during training
  • Efficient processing of high-resolution inputs (up to 4K images)
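
A conceptual sketch of the two stages above, using simple average pooling as a stand-in for NVILA-Lite's learned token compression (the real model compresses tokens with trained modules, not plain pooling):

```python
import torch
import torch.nn.functional as F

def scale_then_compress(image, patch=14, pool=3):
    """Conceptual sketch: patchify the image at full resolution ("scale"),
    then merge neighbouring visual tokens before the language model ("compress").

    image: (3, H, W) tensor with H and W divisible by patch * pool.
    Returns (num_tokens, token_dim) compressed visual tokens.
    """
    # "Scale": cut the full-resolution image into 14x14 patches (no downsizing).
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, gh, gw, 14, 14)
    gh, gw = patches.shape[1], patches.shape[2]
    tokens = patches.permute(1, 2, 0, 3, 4).reshape(gh, gw, -1)      # (gh, gw, 3*14*14)

    # "Compress": average-pool 3x3 neighbourhoods of tokens -> ~9x fewer tokens.
    grid = tokens.permute(2, 0, 1).unsqueeze(0)                      # (1, dim, gh, gw)
    pooled = F.avg_pool2d(grid, kernel_size=pool, stride=pool)
    return pooled.squeeze(0).flatten(1).transpose(0, 1)              # (tokens, dim)

img = torch.randn(3, 1008, 1008)           # high-resolution input: a 72x72 patch grid
print(scale_then_compress(img).shape)      # 576 tokens instead of 5,184 uncompressed
```
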
Training and Deployment Efficiency

NVILA-Lite is optimized for the complete model lifecycle:

  • Training: Reduced training costs compared to standard VLMs of similar capability
  • Fine-tuning: Efficient adaptation to custom datasets with fewer compute resources
  • Inference: Low latency during deployment, suitable for real-time applications
  • Edge Deployment: Small enough to run on edge devices with limited resources

Accuracy-Efficiency Balance

Despite its compact 2B parameter size, NVILA-Lite matches or exceeds the accuracy of many larger open-source and proprietary VLMs. This makes it ideal for:

  • Production deployments with strict latency requirements
  • Applications where model size is constrained (mobile, edge devices)
  • Scenarios requiring high throughput (processing many images per second)
  • Cost-sensitive deployments where compute resources are limited

Architecture Overview

NVILA-Lite integrates a vision encoder with a compact language model, using token compression techniques to maintain efficiency. The "scale-then-compress" methodology ensures that high-resolution details are captured before compression, preserving accuracy while reducing computational requirements.


NVIDIA Cosmos-Reason1

NVIDIA Cosmos-Reason1 is a 7B parameter vision-language model designed specifically for complex reasoning tasks. It excels at understanding relationships between visual elements and generating logical conclusions from multimodal inputs.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 7B | 7 billion | Complex reasoning, logical inference, analytical tasks |

Key Features

Advanced Multimodal Reasoning

Cosmos-Reason1 is optimized for tasks requiring deep understanding and logical inference:

  • Visual Reasoning: Understanding cause-and-effect relationships in images
  • Logical Inference: Drawing conclusions from visual evidence combined with textual context
  • Multi-Step Analysis: Breaking down complex problems into logical steps
  • Contextual Understanding: Considering broader context when analyzing visual information

Example use cases:

  • Analyzing diagnostic images with clinical context
  • Understanding complex diagrams and technical schematics
  • Solving visual reasoning puzzles and problems
  • Identifying anomalies that require contextual knowledge

Efficiency-Optimized Architecture

Cosmos-Reason1 balances computational efficiency with reasoning capability:

  • Optimized attention mechanisms for faster inference
  • Efficient memory utilization during complex reasoning tasks
  • Suitable for both cloud and edge deployments
  • Supports batch processing for high-throughput scenarios

Architecture Overview

Cosmos-Reason1 integrates a vision encoder with a 7B parameter language model, with specialized components designed to facilitate multi-step reasoning and logical inference across visual and textual modalities.


OpenGVLab InternVL3.5

InternVL3.5 is an 8B parameter vision-language model developed by OpenGVLab, designed for comprehensive understanding and generation across visual and textual modalities. The model provides strong performance on diverse multimodal tasks.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 8B | 8 billion | Balanced performance, general-purpose VLM applications |

Key Features

Enhanced Visual Understanding

InternVL3.5 excels at processing complex visual scenes with high accuracy:

  • Fine-Grained Recognition: Identifying small objects and subtle details
  • Scene Understanding: Comprehending relationships between multiple objects
  • Spatial Awareness: Understanding object positions and spatial relationships
  • Visual Attributes: Recognizing colors, textures, sizes, and other properties

The model is particularly effective for:

  • Detailed image analysis and description
  • Complex scene understanding
  • Fine-grained object classification
  • Visual attribute recognition

Seamless Text Integration

InternVL3.5 effectively combines visual and textual information:

  • Natural language understanding of visual content
  • Generation of detailed, accurate image descriptions
  • Question answering about image content
  • Following complex multimodal instructions

Scalable Architecture

With 8B parameters, InternVL3.5 offers:

  • Strong performance across diverse tasks
  • Reasonable resource requirements for most deployments
  • Good balance between accuracy and computational cost
  • Suitable for both research and production use

Architecture Overview

InternVL3.5 combines a sophisticated Vision Transformer (ViT) encoder with a language model decoder, enabling nuanced interpretation of multimodal inputs. The architecture is designed for comprehensive visual understanding and coherent text generation.


Coming Soon

Vi is continuously expanding its model support. The following architectures are coming soon:

🔔 Stay Updated

Want to be notified when these models become available? Contact us to join our early access program or subscribe to our updates.

DeepSeek OCR (Coming Soon)

A specialized OCR model optimized for text extraction and document understanding:

  • High accuracy for complex documents
  • Strong handling of handwritten text
  • Comprehensive multilingual support
  • Fast inference for large-scale document processing

LLaVA-NeXT (Coming Soon)

LLaVA-NeXT is an advanced vision-language model with state-of-the-art performance on multimodal reasoning tasks:

  • Enhanced visual understanding and reasoning capabilities
  • Improved instruction following
  • Better handling of complex visual scenes
  • Optimized for both training and inference efficiency

Choosing the Right Model

Select a model based on your specific requirements:

By Use Case

General vision-language tasks

Recommended: Qwen2.5-VL (7B) or InternVL3.5 (8B)

These models provide strong performance across diverse multimodal tasks, including visual question answering, phrase grounding, detailed image description, and document understanding.

Start with Qwen2.5-VL 7B for the best balance of performance and efficiency.

Resource-constrained environments

Recommended: NVILA-Lite (2B) or Qwen2.5-VL (3B)

Choose these models when:

  • Deploying to edge devices with limited compute
  • Requiring fast inference with low latency
  • Operating under strict memory constraints
  • Processing high volumes with limited resources

NVILA-Lite offers the best efficiency, while Qwen2.5-VL 3B provides broader capabilities.

Complex reasoning and analysis

Recommended: Cosmos-Reason1 (7B) or Qwen2.5-VL (32B)

These models excel at:

  • Multi-step logical reasoning
  • Cause-and-effect analysis
  • Contextual understanding
  • Complex problem-solving

Cosmos-Reason1 is optimized specifically for reasoning tasks, while Qwen2.5-VL 32B offers maximum capability across all task types.

OCR and text extraction

Recommended: Qwen2.5-VL (7B or 32B)

Qwen2.5-VL can handle OCR tasks along with other multimodal capabilities:

  • Document understanding
  • Text extraction from images
  • Multilingual text recognition
  • General vision-language tasks

Coming Soon: DeepSeek OCR will be available for specialized OCR tasks and document understanding.

Maximum accuracy

Recommended: Qwen2.5-VL (32B)

Choose the largest model when:

  • Accuracy is the top priority
  • Sufficient compute resources are available
  • Deploying to production with strict quality requirements
  • Handling complex, mission-critical tasks

The 32B model provides the highest accuracy across all task types but requires more computational resources.

By Model Size

| Size | Inference Speed | Accuracy | Memory Required | Best For |
| --- | --- | --- | --- | --- |
| 2B | Fastest | Good | Lowest | Edge devices, real-time applications |
| 3B | Very Fast | Good | Low | Lightweight deployments, specific tasks (OCR) |
| 7B | Fast | Very Good | Moderate | Most use cases, balanced requirements |
| 8B | Fast | Very Good | Moderate | General-purpose applications |
| 32B | Moderate | Excellent | High | Maximum accuracy, complex tasks |
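
If it helps to encode this guidance, the illustrative helper below mirrors the table and use-case recommendations above; the memory thresholds are rough assumptions, not platform requirements:

```python
def suggest_model(max_memory_gb: float, needs_lora: bool, priority: str) -> str:
    """Illustrative only: mirrors the guidance above with assumed thresholds.
    priority: "speed" | "balanced" | "reasoning" | "accuracy"
    """
    if max_memory_gb < 12:
        # Edge / low-memory deployments; note NVILA-Lite has no LoRA support.
        return "Qwen2.5-VL 3B" if needs_lora else "NVILA-Lite 2B"
    if priority == "accuracy" and max_memory_gb >= 80:
        return "Qwen2.5-VL 32B"
    if priority == "reasoning":
        return "Cosmos-Reason1 7B"
    return "Qwen2.5-VL 7B"  # default: best balance for most use cases

print(suggest_model(max_memory_gb=24, needs_lora=True, priority="balanced"))
```
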
💡 Testing Multiple Models

Not sure which model is best for your use case? You can easily create multiple workflows with different model architectures and compare their performance on your dataset. Monitor training runs and evaluate model performance to make an informed decision.


Model Configuration

After selecting your model architecture, you'll need to configure additional settings; the next steps below walk through each one.

Next Steps

Once you've chosen your model architecture:

  1. Configure model settings — Set training parameters and optimization options
  2. Define your system prompt — Provide instructions for model behavior
  3. Configure your dataset — Set up data splits and augmentation
  4. Create a workflow — Combine all settings into a reusable configuration

Common Questions

Can I change the model architecture after creating a workflow?

No, the model architecture is fixed when you create a workflow. To try a different architecture, create a new workflow with the desired model. You can maintain multiple workflows with different architectures to compare performance.

How do model sizes affect training time and cost?

Larger models require more training time and compute resources:

  • 2B-3B models: Fastest training, lowest cost
  • 7B-8B models: Moderate training time, reasonable cost
  • 32B models: Longer training time, higher cost but best accuracy

Training time also depends on your dataset size and training settings. Check your resource usage to monitor compute consumption.

Which model is best for beginners?

Start with Qwen2.5-VL 7B. It offers:

  • Excellent performance across diverse tasks
  • Reasonable training time and resource requirements
  • Strong community support and documentation
  • Good balance for learning and experimentation

Follow the quickstart guide to train your first model.

Can I use multiple model architectures in the same project?

Yes! You can create multiple workflows within the same training project, each using a different model architecture. This allows you to:

  • Compare performance across architectures
  • Choose the best model for your specific use case
  • Optimize for different deployment scenarios (cloud vs. edge)

Each workflow maintains its own configuration and training runs.

Do all models support the same tasks?

All currently available models support common VLM tasks like visual question answering and phrase grounding:

  • Cosmos-Reason1 is optimized specifically for reasoning tasks
  • General-purpose models (Qwen2.5-VL, InternVL3.5, NVILA-Lite) handle all standard VLM tasks

Coming Soon: DeepSeek OCR will be specialized for text extraction and document understanding tasks.

Choose based on your primary use case. See choosing the right model for detailed guidance.

What does 'no LoRA support' mean for NVILA-Lite?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique. NVILA-Lite uses a different training approach and doesn't support LoRA-based fine-tuning.

This doesn't limit NVILA-Lite's capabilities—it still supports full fine-tuning on your datasets. If you specifically need LoRA-based training (for memory efficiency or other reasons), choose Qwen2.5-VL or InternVL3.5 instead.

