Model Architectures
Understand the available vision-language model architectures and choose the right one for your use case
Vi supports multiple state-of-the-art vision-language model (VLM) architectures, each optimized for different use cases and performance requirements. Choose the right model based on your task complexity, resource constraints, and accuracy needs.
New to VLMs? Vision-language models can understand both images and text, allowing you to train AI that responds to natural language prompts. Learn more about VLM concepts and phrase grounding.
Available Model Architectures
Vi offers four powerful VLM architectures across different parameter sizes. The model size (measured in billions of parameters) generally correlates with capability—larger models offer better accuracy but require more compute resources.
- Qwen2.5-VL: Alibaba Cloud's multimodal VLM. Available in 3B, 7B, and 32B sizes.
- NVILA-Lite: Efficient lightweight VLM. Available in 2B size.
- Cosmos-Reason1: Reasoning-focused VLM. Available in 7B size.
- InternVL3.5: Advanced vision-language model. Available in 8B size.
Alibaba Cloud Qwen2.5-VL
Qwen2.5-VL is a powerful multimodal vision-language model developed by Alibaba Cloud, designed to process and understand text, images, and videos. The model excels at visual question answering, image understanding, and multimodal reasoning tasks.
Available Sizes
| Size | Parameters | Best For |
|---|---|---|
| 3B | 3 billion | Resource-constrained environments, faster inference, lightweight applications |
| 7B | 7 billion | Balanced performance and efficiency, most common use cases |
| 32B | 32 billion | Maximum accuracy, complex reasoning tasks, production deployments |
Recommended Starting Point: The 7B version offers an excellent balance between performance and resource requirements for most use cases. Start here unless you have specific constraints or accuracy requirements.
Key Features
Dynamic Resolution Processing
Qwen2.5-VL processes images and videos at their native resolutions without forced resizing or padding. This preserves fine details and improves accuracy, especially for:
- High-resolution images with small objects
- Documents with dense text
- Videos with rapid motion or small details
Unlike traditional models that resize all inputs to a fixed size (like 224×224), Qwen2.5-VL adapts to each image's dimensions.
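For context, the Hugging Face reference usage for Qwen2.5-VL exposes this behavior through `min_pixels` and `max_pixels` on the processor, which bound the number of visual tokens per image instead of forcing a fixed resize. This sits outside Vi's training pipeline, and the values below are illustrative; check the current model card for defaults.

```python
from transformers import AutoProcessor

# Illustrative values: bound each image to roughly 256-1280 visual tokens
# (each token covers a 28x28 pixel patch) rather than resizing to a fixed size.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```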
Multimodal Rotary Position Embedding (M-RoPE)
M-RoPE extends standard positional encoding to handle spatial and temporal dimensions. This helps the model understand:
- Spatial relationships — Where objects are located in images
- Temporal dynamics — How scenes change across video frames
- Text alignment — How visual elements correspond to text descriptions
This makes Qwen2.5-VL particularly effective for video understanding and complex visual reasoning tasks.
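As a rough illustration of the idea (not Qwen's implementation), each visual token receives three position indices, one per rotary axis: temporal, height, and width. The sketch below builds those index channels for a single image split into a small patch grid.

```python
import torch

# Illustrative only: build the three position-index channels that M-RoPE
# assigns to visual tokens (temporal, height, width) for one image split
# into a 4x6 patch grid. A video would use t_steps > 1.
t_steps, h_patches, w_patches = 1, 4, 6

t_ids = torch.arange(t_steps).view(-1, 1, 1).expand(t_steps, h_patches, w_patches)
h_ids = torch.arange(h_patches).view(1, -1, 1).expand(t_steps, h_patches, w_patches)
w_ids = torch.arange(w_patches).view(1, 1, -1).expand(t_steps, h_patches, w_patches)

# One row per rotary axis, one column per visual token.
position_ids = torch.stack([t_ids, h_ids, w_ids]).reshape(3, -1)
print(position_ids.shape)  # torch.Size([3, 24])
```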
Extended Context Length
Qwen2.5-VL supports up to 128K tokens in its context window, enabling:
- Analysis of long-form videos
- Processing of multi-page documents
- Complex conversations about multiple images
- Detailed system prompts with extensive examples
This extended context is especially valuable for document understanding and video analysis tasks.
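As an example of how a multi-image conversation is expressed outside Vi, the Hugging Face chat-template pattern below packs two images and a question into a single prompt that must fit within the context window. The file paths are placeholders.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder paths: two document pages discussed in a single turn.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page_1.png"},
        {"type": "image", "image": "page_2.png"},
        {"type": "text", "text": "Compare the two pages and summarize the differences."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Text tokens only; image placeholder tokens are expanded once pixel values are processed.
print(len(processor.tokenizer(prompt).input_ids), "text tokens")
```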
Window Attention Mechanism
The Vision Transformer (ViT) encoder uses window attention to accelerate training and inference. This optimization:
- Reduces computational complexity for large images
- Enables faster training iterations
- Improves inference speed without sacrificing accuracy
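The sketch below shows the core idea of window partitioning in isolation: attention is computed inside each local window of patches rather than across the full grid, so cost grows with the number of windows instead of quadratically with image size. This is an illustration, not the actual encoder code.

```python
import torch

# Minimal sketch of window partitioning: group a patch grid into non-overlapping
# local windows so attention runs within each window instead of over all patches.
def window_partition(patches: torch.Tensor, window: int) -> torch.Tensor:
    """patches: (H, W, C) grid of patch embeddings -> (num_windows, window*window, C)."""
    H, W, C = patches.shape
    x = patches.view(H // window, window, W // window, window, C)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, C)
    return x

grid = torch.randn(16, 16, 64)               # 16x16 patches, 64-dim embeddings
windows = window_partition(grid, window=8)   # 4 windows of 64 tokens each
print(windows.shape)                         # torch.Size([4, 64, 64])
```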
Architecture Overview
The Qwen2.5-VL architecture consists of three main components:
- Vision Encoder: A Vision Transformer (ViT) with approximately 600 million parameters, supporting dynamic resolution for both images and videos
- Language Model Decoder: Based on the Qwen2.5 transformer with Grouped Query Attention (GQA), enabling efficient inference
- M-RoPE Integration: Multimodal positional embeddings that connect visual and textual information
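To make the Grouped Query Attention point concrete, the toy example below shares two key/value heads across eight query heads, which is what shrinks the KV cache and speeds up decoding. The head counts and dimensions are made up for illustration and are not Qwen's actual configuration.

```python
import torch

# Illustrative GQA: 8 query heads share 2 key/value heads, so each group of
# 4 query heads attends with the same K/V. Shapes are arbitrary examples.
batch, seq, head_dim = 1, 16, 32
n_q_heads, n_kv_heads = 8, 2

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each K/V head for its group of query heads (8 // 2 = 4 per group).
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 16, 32])
```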
Resources
- Hugging Face: Qwen2.5-VL Model Collection
- Official Documentation: Qwen2.5-VL Technical Guide
- Model Training Guide: Qwen2.5-VL Training Documentation
NVIDIA NVILA-Lite
NVIDIA NVILA-Lite is an efficient vision-language model optimized for both accuracy and computational efficiency. Part of NVIDIA's NVILA family, it employs a "scale-then-compress" approach to process high-resolution images and long videos efficiently.
Available Sizes
| Size | Parameters | Best For |
|---|---|---|
| 2B | 2 billion | Edge deployments, fast inference, resource-limited environments |
LoRA Not Supported: NVILA-Lite does not support LoRA (Low-Rank Adaptation) fine-tuning. If you need LoRA-based training, consider using Qwen2.5-VL or InternVL3.5 instead.
Key Features
Scale-Then-Compress Approach
NVILA-Lite uses a unique two-stage processing method:
- Scale: Process images and videos at high spatial and temporal resolutions to capture fine details
- Compress: Reduce visual tokens efficiently without losing critical information
This approach enables the model to handle high-resolution images and long videos while maintaining efficiency. Benefits include:
- Faster inference compared to similarly-sized models
- Lower memory requirements during training
- Efficient processing of high-resolution inputs (up to 4K images)
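A conceptual sketch of the compress step (not NVILA's actual code): after the encoder runs at high resolution, neighboring visual tokens are pooled so the language model receives far fewer tokens while coarse detail is preserved.

```python
import torch
import torch.nn.functional as F

# Illustrative compression: pool a high-resolution grid of visual tokens 2x2
# so the language model sees 4x fewer tokens. Sizes are hypothetical.
visual_tokens = torch.randn(1, 1152, 48, 48)             # (batch, dim, H, W) patch grid
compressed = F.avg_pool2d(visual_tokens, kernel_size=2)  # -> (1, 1152, 24, 24)

print(visual_tokens.flatten(2).shape[-1], "tokens before,",
      compressed.flatten(2).shape[-1], "tokens after")    # 2304 tokens before, 576 after
```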
Training and Deployment Efficiency
NVILA-Lite is optimized for the complete model lifecycle:
- Training: Reduced training costs compared to standard VLMs of similar capability
- Fine-tuning: Efficient adaptation to custom datasets with fewer compute resources
- Inference: Low latency during deployment, suitable for real-time applications
- Edge Deployment: Small enough to run on edge devices with limited resources
Accuracy-Efficiency Balance
Despite its compact 2B parameter size, NVILA-Lite matches or exceeds the accuracy of many larger open-source and proprietary VLMs. This makes it ideal for:
- Production deployments with strict latency requirements
- Applications where model size is constrained (mobile, edge devices)
- Scenarios requiring high throughput (processing many images per second)
- Cost-sensitive deployments where compute resources are limited
Architecture Overview
NVILA-Lite integrates a vision encoder with a compact language model, using token compression techniques to maintain efficiency. The "scale-then-compress" methodology ensures that high-resolution details are captured before compression, preserving accuracy while reducing computational requirements.
Resources
- Official Documentation: NVILA: Efficient Frontier Visual Language Models
- Research Paper: NVILA Technical Report
NVIDIA Cosmos-Reason1
NVIDIA Cosmos-Reason1 is a 7B parameter vision-language model designed specifically for complex reasoning tasks. It excels at understanding relationships between visual elements and generating logical conclusions from multimodal inputs.
Available Sizes
| Size | Parameters | Best For |
|---|---|---|
| 7B | 7 billion | Complex reasoning, logical inference, analytical tasks |
Key Features
Advanced Multimodal Reasoning
Cosmos-Reason1 is optimized for tasks requiring deep understanding and logical inference:
- Visual Reasoning: Understanding cause-and-effect relationships in images
- Logical Inference: Drawing conclusions from visual evidence combined with textual context
- Multi-Step Analysis: Breaking down complex problems into logical steps
- Contextual Understanding: Considering broader context when analyzing visual information
Example use cases:
- Analyzing diagnostic images with clinical context
- Understanding complex diagrams and technical schematics
- Solving visual reasoning puzzles and problems
- Identifying anomalies that require contextual knowledge
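Reasoning-focused models respond well to prompts that explicitly ask for stepwise analysis. The structure below is illustrative only (a generic chat-message format with a placeholder image path), not a specific Cosmos-Reason1 API.

```python
# Illustrative prompt structure for a reasoning task: the system prompt asks
# for explicit steps and visual evidence before a final verdict.
reasoning_prompt = [
    {
        "role": "system",
        "content": "You are an inspection assistant. Reason step by step, "
                   "cite the visual evidence for each step, then give a verdict.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assembly_line_frame.jpg"},  # placeholder path
            {"type": "text", "text": "Is the component on the left seated correctly? Explain why."},
        ],
    },
]
```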
Efficiency-Optimized Architecture
Cosmos-Reason1 balances computational efficiency with reasoning capability:
- Optimized attention mechanisms for faster inference
- Efficient memory utilization during complex reasoning tasks
- Suitable for both cloud and edge deployments
- Supports batch processing for high-throughput scenarios
Architecture Overview
Cosmos-Reason1 integrates a vision encoder with a 7B parameter language model, with specialized components designed to facilitate multi-step reasoning and logical inference across visual and textual modalities.
Resources
- Hugging Face: NVIDIA Cosmos-Reason1
- Official Documentation: NVIDIA Cosmos-Reason1 Research
OpenGVLab InternVL3.5
InternVL3.5 is an 8B parameter vision-language model developed by OpenGVLab, designed for comprehensive understanding and generation across visual and textual modalities. The model provides strong performance on diverse multimodal tasks.
Available Sizes
| Size | Parameters | Best For |
|---|---|---|
| 8B | 8 billion | Balanced performance, general-purpose VLM applications |
Key Features
Enhanced Visual Understanding
InternVL3.5 excels at processing complex visual scenes with high accuracy:
- Fine-Grained Recognition: Identifying small objects and subtle details
- Scene Understanding: Comprehending relationships between multiple objects
- Spatial Awareness: Understanding object positions and spatial relationships
- Visual Attributes: Recognizing colors, textures, sizes, and other properties
The model is particularly effective for:
- Detailed image analysis and description
- Complex scene understanding
- Fine-grained object classification
- Visual attribute recognition
Seamless Text Integration
InternVL3.5 effectively combines visual and textual information:
- Natural language understanding of visual content
- Generation of detailed, accurate image descriptions
- Question answering about image content
- Following complex multimodal instructions
Scalable Architecture
With 8B parameters, InternVL3.5 offers:
- Strong performance across diverse tasks
- Reasonable resource requirements for most deployments
- Good balance between accuracy and computational cost
- Suitable for both research and production use
Architecture Overview
InternVL3.5 combines a sophisticated Vision Transformer (ViT) encoder with a language model decoder, enabling nuanced interpretation of multimodal inputs. The architecture is designed for comprehensive visual understanding and coherent text generation.
Resources
- Hugging Face: OpenGVLab InternVL3.5
- Official Documentation: InternVL3.5 Documentation
Coming Soon
Vi is continuously expanding its model support. The following architectures are coming soon:
- DeepSeek OCR: Specialized OCR model with advanced document understanding capabilities
- LLaVA-NeXT: Advanced multimodal reasoning with improved visual comprehension
Stay Updated: Want to be notified when these models become available? Contact us to join our early access program or subscribe to our updates.
DeepSeek OCR (Coming Soon)
A specialized OCR model optimized for text extraction and document understanding:
- High accuracy for complex documents
- Strong handling of handwritten text
- Comprehensive multilingual support
- Fast inference for large-scale document processing
LLaVA-NeXT (Coming Soon)
LLaVA-NeXT is an advanced vision-language model with state-of-the-art performance on multimodal reasoning tasks:
- Enhanced visual understanding and reasoning capabilities
- Improved instruction following
- Better handling of complex visual scenes
- Optimized for both training and inference efficiency
Choosing the Right Model
Select a model based on your specific requirements:
By Use Case
General vision-language tasks
Recommended: Qwen2.5-VL (7B) or InternVL3.5 (8B)
These models provide strong performance across diverse multimodal tasks including:
- Visual question answering
- Phrase grounding
- Image understanding and description
- Multi-image reasoning
Start with Qwen2.5-VL 7B for the best balance of performance and efficiency.
Resource-constrained environments
Recommended: NVILA-Lite (2B) or Qwen2.5-VL (3B)
Choose these models when:
- Deploying to edge devices with limited compute
- Requiring fast inference with low latency
- Operating under strict memory constraints
- Processing high volumes with limited resources
NVILA-Lite offers the best efficiency, while Qwen2.5-VL 3B provides broader capabilities.
Complex reasoning and analysis
Recommended: Cosmos-Reason1 (7B) or Qwen2.5-VL (32B)
These models excel at:
- Multi-step logical reasoning
- Cause-and-effect analysis
- Contextual understanding
- Complex problem-solving
Cosmos-Reason1 is optimized specifically for reasoning tasks, while Qwen2.5-VL 32B offers maximum capability across all task types.
OCR and text extraction
Recommended: Qwen2.5-VL (7B or 32B)
Qwen2.5-VL can handle OCR tasks along with other multimodal capabilities:
- Document understanding
- Text extraction from images
- Multilingual text recognition
- General vision-language tasks
Coming Soon: DeepSeek OCR will be available for specialized OCR tasks and document understanding.
Maximum accuracy
Recommended: Qwen2.5-VL (32B)
Choose the largest model when:
- Accuracy is the top priority
- Sufficient compute resources are available
- Deploying to production with strict quality requirements
- Handling complex, mission-critical tasks
The 32B model provides the highest accuracy across all task types but requires more computational resources.
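The recommendations above can be condensed into a small helper for quick reference. It is purely illustrative; the function and its arguments are not part of the Vi SDK.

```python
# Hypothetical helper that encodes the guidance above; names are illustrative.
def recommend_architecture(task: str, edge_constrained: bool = False, max_accuracy: bool = False) -> str:
    if max_accuracy:
        return "Qwen2.5-VL (32B)"
    if edge_constrained:
        return "NVILA-Lite (2B)"      # or "Qwen2.5-VL (3B)" for broader capabilities
    if task == "reasoning":
        return "Cosmos-Reason1 (7B)"
    if task == "ocr":
        return "Qwen2.5-VL (7B)"      # DeepSeek OCR once it becomes available
    return "Qwen2.5-VL (7B)"          # general-purpose default; InternVL3.5 (8B) is a close alternative

print(recommend_architecture("ocr"))                          # Qwen2.5-VL (7B)
print(recommend_architecture("vqa", edge_constrained=True))   # NVILA-Lite (2B)
```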
By Model Size
| Size | Inference Speed | Accuracy | Memory Required | Best For |
|---|---|---|---|---|
| 2B | Fastest | Good | Lowest | Edge devices, real-time applications |
| 3B | Very Fast | Good | Low | Lightweight deployments, specific tasks (OCR) |
| 7B | Fast | Very Good | Moderate | Most use cases, balanced requirements |
| 8B | Fast | Very Good | Moderate | General-purpose applications |
| 32B | Moderate | Excellent | High | Maximum accuracy, complex tasks |
Testing Multiple Models: Not sure which model is best for your use case? You can create multiple workflows with different model architectures and compare their performance on your dataset. Monitor training runs and evaluate model performance to make an informed decision.
Model Configuration
After selecting your model architecture, you'll need to configure additional settings:
- Model settings: Configure training parameters, batch size, learning rate, and optimization settings
- System prompts: Define instructions and context for your model's behavior
Next Steps
Once you've chosen your model architecture:
- Configure model settings — Set training parameters and optimization options
- Define your system prompt — Provide instructions for model behavior
- Configure your dataset — Set up data splits and augmentation
- Create a workflow — Combine all settings into a reusable configuration
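Conceptually, a workflow bundles these choices together. The dictionary below is a hypothetical sketch of what that bundle covers; the keys do not correspond to the actual Vi workflow schema or SDK.

```python
# Hypothetical workflow outline (illustrative keys, not the Vi schema):
workflow = {
    "model": {"architecture": "qwen2.5-vl", "size": "7b"},
    "training": {"batch_size": 8, "learning_rate": 2e-5, "epochs": 3},
    "system_prompt": "You are an assistant that answers questions about retail shelf images.",
    "dataset": {
        "split": {"train": 0.8, "val": 0.1, "test": 0.1},
        "augmentation": ["flip", "color_jitter"],
    },
}
```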
Common Questions
Can I change the model architecture after creating a workflow?
No, the model architecture is fixed when you create a workflow. To try a different architecture, create a new workflow with the desired model. You can maintain multiple workflows with different architectures to compare performance.
How do model sizes affect training time and cost?
Larger models require more training time and compute resources:
- 2B-3B models: Fastest training, lowest cost
- 7B-8B models: Moderate training time, reasonable cost
- 32B models: Longer training time, higher cost but best accuracy
Training time also depends on your dataset size and training settings. Check your resource usage to monitor compute consumption.
Which model is best for beginners?
Start with Qwen2.5-VL 7B. It offers:
- Excellent performance across diverse tasks
- Reasonable training time and resource requirements
- Strong community support and documentation
- Good balance for learning and experimentation
Follow the quickstart guide to train your first model.
Can I use multiple model architectures in the same project?
Yes! You can create multiple workflows within the same training project, each using a different model architecture. This allows you to:
- Compare performance across architectures
- Choose the best model for your specific use case
- Optimize for different deployment scenarios (cloud vs. edge)
Each workflow maintains its own configuration and training runs.
Do all models support the same tasks?
All currently available models support common VLM tasks such as visual question answering and phrase grounding, with some differences in emphasis:
- Cosmos-Reason1 is optimized specifically for reasoning tasks
- General-purpose models (Qwen2.5-VL, InternVL3.5, NVILA-Lite) handle all standard VLM tasks
Coming Soon: DeepSeek OCR will be specialized for text extraction and document understanding tasks.
Choose based on your primary use case. See choosing the right model for detailed guidance.
What does 'no LoRA support' mean for NVILA-Lite?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique. NVILA-Lite uses a different training approach and doesn't support LoRA-based fine-tuning.
This doesn't limit NVILA-Lite's capabilities—it still supports full fine-tuning on your datasets. If you specifically need LoRA-based training (for memory efficiency or other reasons), choose Qwen2.5-VL or InternVL3.5 instead.
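For background, this is roughly what LoRA-based fine-tuning looks like with the open-source `peft` library, outside Vi: only small low-rank adapter matrices are trained while the base weights stay frozen. Module names and hyperparameters below are illustrative and vary by architecture.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA setup with peft (not Vi's training pipeline): attach
# low-rank adapters to the attention projections and freeze everything else.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # example modules; depends on the model
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```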
Related resources
- Model settings — Configure training parameters and optimization
- System prompts — Define model instructions and behavior
- Dataset configuration — Set up data splits and processing
- Create a workflow — Combine all settings into a reusable configuration
- Evaluate models — Compare model performance and accuracy
- VLM concepts — Learn about vision-language models
- Resource usage — Monitor compute consumption and costs
- Configure your model — Complete model configuration guide
- Train a model — Complete training workflow
- Quickstart — End-to-end training tutorial
- Vi SDK — Python SDK for model management
- Contact us — Get help from the Datature team
Need help?
We're here to support your VLMOps journey. Reach out to the Datature team through the contact page for help at any stage.