Model Architectures

Understand the available vision-language model architectures and choose the right one for your use case

Vi supports multiple state-of-the-art vision-language model (VLM) architectures, each optimized for different use cases and performance requirements. Choose the right model based on your task complexity, resource constraints, and accuracy needs.

💡 New to VLMs?

Vision-language models can understand both images and text, allowing you to train AI that responds to natural language prompts. Learn more about VLM concepts and phrase grounding.

Available Model Architectures

Vi offers four powerful VLM architectures across different parameter sizes. The model size (measured in billions of parameters) generally correlates with capability—larger models offer better accuracy but require more compute resources.


Qwen Qwen2.5-VL

Qwen2.5-VL is a powerful multimodal vision-language model developed by Alibaba Cloud, designed to process and understand text, images, and videos. The model excels at visual question answering, image understanding, and multimodal reasoning tasks.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 3B | 3 billion | Resource-constrained environments, faster inference, lightweight applications |
| 7B | 7 billion | Balanced performance and efficiency, most common use cases |
| 32B | 32 billion | Maximum accuracy, complex reasoning tasks, production deployments |

📘 Recommended Starting Point

The 7B version offers an excellent balance between performance and resource requirements for most use cases. Start here unless you have specific constraints or accuracy requirements.

Key Features

Dynamic Resolution Processing

Qwen2.5-VL processes images and videos at their native resolutions without forced resizing or padding. This preserves fine details and improves accuracy, especially for:

  • High-resolution images with small objects
  • Documents with dense text
  • Videos with rapid motion or small details

Unlike traditional models that resize all inputs to a fixed size (like 224×224), Qwen2.5-VL adapts to each image's dimensions.
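
As a concrete example, here is a minimal sketch of dynamic-resolution inference using the Hugging Face transformers integration for Qwen2.5-VL (shown for illustration; this is not the Vi training API). The min_pixels/max_pixels bounds and the document_page.png path are placeholder values:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# min_pixels / max_pixels bound the dynamic-resolution token budget per image;
# within those bounds the image keeps its native aspect ratio and detail.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document_page.png"},  # placeholder path
        {"type": "text", "text": "Read the fine print in the bottom-right corner."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```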

Multimodal Rotary Position Embedding (M-RoPE)

M-RoPE extends standard positional encoding to handle spatial and temporal dimensions. This helps the model understand:

  • Spatial relationships — Where objects are located in images
  • Temporal dynamics — How scenes change across video frames
  • Text alignment — How visual elements correspond to text descriptions

This makes Qwen2.5-VL particularly effective for video understanding and complex visual reasoning tasks.
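
As a rough intuition for how M-RoPE differs from ordinary 1-D RoPE, the toy sketch below splits the rotary channels across temporal, height, and width position indices. This is an illustration of the idea only, not the actual Qwen2.5-VL implementation, which uses its own channel split and frequency schedule:

```python
import torch

def mrope_tables(t_pos, h_pos, w_pos, dim=128, base=10000.0):
    """Toy multimodal RoPE: split the rotary channels across three position
    axes (frame index, patch row, patch column) instead of one sequence index.

    t_pos, h_pos, w_pos: 1-D integer tensors, one entry per token.
    Returns cos/sin tables of shape (num_tokens, dim), used like standard RoPE.
    """
    half = dim // 2
    splits = (half // 3, half // 3, half - 2 * (half // 3))  # channels per axis
    angles = []
    for pos, n in zip((t_pos, h_pos, w_pos), splits):
        inv_freq = 1.0 / (base ** (torch.arange(n, dtype=torch.float32) / n))
        angles.append(pos[:, None].float() * inv_freq[None, :])  # (tokens, n)
    ang = torch.cat(angles, dim=-1)          # (tokens, dim/2)
    ang = torch.cat([ang, ang], dim=-1)      # duplicate for the rotate-half trick
    return ang.cos(), ang.sin()

# Text tokens use identical t/h/w indices, so this reduces to ordinary 1-D RoPE;
# image and video tokens vary the row/column (and frame) indices, which is how
# the model sees where and when each visual patch appears.
cos, sin = mrope_tables(torch.zeros(4, dtype=torch.long),
                        torch.tensor([0, 0, 1, 1]),
                        torch.tensor([0, 1, 0, 1]))
print(cos.shape)  # torch.Size([4, 128])
```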

Extended Context Length

Qwen2.5-VL supports up to 128K tokens in its context window, enabling:

  • Analysis of long-form videos
  • Processing of multi-page documents
  • Complex conversations about multiple images
  • Detailed system prompts with extensive examples

This extended context is especially valuable for document understanding and video analysis tasks.
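
As a back-of-the-envelope illustration of what a 128K-token window buys you, the sketch below estimates visual tokens per scanned page, assuming roughly one visual token per 28×28-pixel block (14-pixel patches merged 2×2). The real processor also rescales and caps resolution, so treat these numbers as rough approximations:

```python
import math

def approx_visual_tokens(width_px: int, height_px: int) -> int:
    """Very rough estimate: ~one visual token per 28x28-pixel block
    (14-px patches merged 2x2). Ignores the processor's resize limits."""
    return math.ceil(width_px / 28) * math.ceil(height_px / 28)

context_window = 128_000                      # advertised context length (tokens)
reserved_for_text = 4_000                     # prompt, instructions, and answers
per_page = approx_visual_tokens(1240, 1754)   # an A4 page scanned at ~150 DPI

pages_that_fit = (context_window - reserved_for_text) // per_page
print(f"~{per_page} visual tokens per page, ~{pages_that_fit} pages per request")
```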

Window Attention Mechanism

The Vision Transformer (ViT) encoder uses window attention to accelerate training and inference. This optimization:

  • Reduces computational complexity for large images
  • Enables faster training iterations
  • Improves inference speed without sacrificing accuracy
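
A toy sketch of the mechanism described above: patch tokens are grouped into non-overlapping tiles and attention is computed only within each tile, so cost grows with the window size rather than with the full image. The projections and sizes here are placeholders, not the actual encoder configuration:

```python
import torch
import torch.nn.functional as F

def window_attention(x, grid_h, grid_w, window=8, num_heads=4):
    """Toy window attention: tokens attend only within non-overlapping
    window x window tiles of the patch grid, not across the whole image.

    x: (grid_h * grid_w, dim) patch embeddings for one image.
    """
    dim = x.shape[-1]
    head_dim = dim // num_heads

    # Fold the token sequence back into its 2-D patch grid, then cut tiles.
    x = x.view(grid_h // window, window, grid_w // window, window, dim)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, dim)  # (tiles, win^2, dim)

    # Stand-in for the learned q/k/v projections of a real ViT block.
    q = k = v = x
    def split_heads(t):  # (tiles, win^2, dim) -> (tiles, heads, win^2, head_dim)
        return t.view(t.shape[0], t.shape[1], num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
    return out.transpose(1, 2).reshape(-1, window * window, dim)

tokens = torch.randn(32 * 32, 256)               # a 32x32 patch grid
print(window_attention(tokens, 32, 32).shape)    # torch.Size([16, 64, 256])
```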

Architecture Overview

The Qwen2.5-VL architecture consists of three main components:

  1. Vision Encoder: A Vision Transformer (ViT) with approximately 600 million parameters, supporting dynamic resolution for both images and videos
  2. Language Model Decoder: Based on the Qwen2.5 transformer with Grouped Query Attention (GQA), enabling efficient inference
  3. M-RoPE Integration: Multimodal positional embeddings that connect visual and textual information
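
As a rough illustration of the Grouped Query Attention mentioned in step 2, the sketch below shares a small number of key/value heads across groups of query heads, which is what shrinks the KV cache at inference time. Head counts and dimensions are arbitrary examples, not the model's actual configuration:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_q_heads=16, num_kv_heads=4):
    """Toy GQA: every group of 4 query heads shares one key/value head,
    so the KV cache is 4x smaller than with standard multi-head attention.

    q: (seq, num_q_heads * head_dim); k, v: (seq, num_kv_heads * head_dim)
    """
    seq, q_dim = q.shape
    head_dim = q_dim // num_q_heads

    q = q.view(seq, num_q_heads, head_dim).transpose(0, 1)    # (Hq,  seq, d)
    k = k.view(seq, num_kv_heads, head_dim).transpose(0, 1)   # (Hkv, seq, d)
    v = v.view(seq, num_kv_heads, head_dim).transpose(0, 1)

    group = num_q_heads // num_kv_heads
    k = k.repeat_interleave(group, dim=0)                     # broadcast KV heads
    v = v.repeat_interleave(group, dim=0)

    out = F.scaled_dot_product_attention(q, k, v)             # (Hq, seq, d)
    return out.transpose(0, 1).reshape(seq, q_dim)

q = torch.randn(10, 16 * 64)
k = torch.randn(10, 4 * 64)
v = torch.randn(10, 4 * 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([10, 1024])
```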


NVIDIA NVILA-Lite

NVIDIA NVILA-Lite is an efficient vision-language model optimized for both accuracy and computational efficiency. Part of NVIDIA's NVILA family, it employs a "scale-then-compress" approach to process high-resolution images and long videos efficiently.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 2B | 2 billion | Edge deployments, fast inference, resource-limited environments |

🚧 LoRA Not Supported

NVILA-Lite does not support LoRA (Low-Rank Adaptation) fine-tuning. If you need LoRA-based training, consider using Qwen2.5-VL or InternVL3.5 instead.

Key Features

Scale-Then-Compress Approach

NVILA-Lite uses a unique two-stage processing method:

  1. Scale: Process images and videos at high spatial and temporal resolutions to capture fine details
  2. Compress: Reduce visual tokens efficiently without losing critical information

This approach enables the model to handle high-resolution images and long videos while maintaining efficiency. Benefits include:

  • Faster inference compared to similarly-sized models
  • Lower memory requirements during training
  • Efficient processing of high-resolution inputs (up to 4K images)
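
A conceptual sketch of the two stages above, using simple average pooling as a stand-in for NVILA-Lite's learned token compression (the real model compresses tokens with trained modules, not plain pooling):

```python
import torch
import torch.nn.functional as F

def scale_then_compress(image, patch=14, pool=3):
    """Conceptual sketch: patchify the image at full resolution ("scale"),
    then merge neighbouring visual tokens before the language model ("compress").

    image: (3, H, W) tensor with H and W divisible by patch * pool.
    Returns (num_tokens, token_dim) compressed visual tokens.
    """
    # "Scale": cut the full-resolution image into 14x14 patches (no downsizing).
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, gh, gw, 14, 14)
    gh, gw = patches.shape[1], patches.shape[2]
    tokens = patches.permute(1, 2, 0, 3, 4).reshape(gh, gw, -1)      # (gh, gw, 3*14*14)

    # "Compress": average-pool 3x3 neighbourhoods of tokens -> ~9x fewer tokens.
    grid = tokens.permute(2, 0, 1).unsqueeze(0)                      # (1, dim, gh, gw)
    pooled = F.avg_pool2d(grid, kernel_size=pool, stride=pool)
    return pooled.squeeze(0).flatten(1).transpose(0, 1)              # (tokens, dim)

img = torch.randn(3, 1008, 1008)           # high-resolution input: a 72x72 patch grid
print(scale_then_compress(img).shape)      # 576 tokens instead of 5,184 uncompressed
```
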
Training and Deployment Efficiency

NVILA-Lite is optimized for the complete model lifecycle:

  • Training: Reduced training costs compared to standard VLMs of similar capability
  • Fine-tuning: Efficient adaptation to custom datasets with fewer compute resources
  • Inference: Low latency during deployment, suitable for real-time applications
  • Edge Deployment: Small enough to run on edge devices with limited resources

Accuracy-Efficiency Balance

Despite its compact 2B parameter size, NVILA-Lite matches or exceeds the accuracy of many larger open-source and proprietary VLMs. This makes it ideal for:

  • Production deployments with strict latency requirements
  • Applications where model size is constrained (mobile, edge devices)
  • Scenarios requiring high throughput (processing many images per second)
  • Cost-sensitive deployments where compute resources are limited

Architecture Overview

NVILA-Lite integrates a vision encoder with a compact language model, using token compression techniques to maintain efficiency. The "scale-then-compress" methodology ensures that high-resolution details are captured before compression, preserving accuracy while reducing computational requirements.


NVIDIA Cosmos-Reason1

NVIDIA Cosmos-Reason1 is a 7B parameter vision-language model designed specifically for complex reasoning tasks. It excels at understanding relationships between visual elements and generating logical conclusions from multimodal inputs.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 7B | 7 billion | Complex reasoning, logical inference, analytical tasks |

Key Features

Advanced Multimodal Reasoning

Cosmos-Reason1 is optimized for tasks requiring deep understanding and logical inference:

  • Visual Reasoning: Understanding cause-and-effect relationships in images
  • Logical Inference: Drawing conclusions from visual evidence combined with textual context
  • Multi-Step Analysis: Breaking down complex problems into logical steps
  • Contextual Understanding: Considering broader context when analyzing visual information

Example use cases:

  • Analyzing diagnostic images with clinical context
  • Understanding complex diagrams and technical schematics
  • Solving visual reasoning puzzles and problems
  • Identifying anomalies that require contextual knowledge

Efficiency-Optimized Architecture

Cosmos-Reason1 balances computational efficiency with reasoning capability:

  • Optimized attention mechanisms for faster inference
  • Efficient memory utilization during complex reasoning tasks
  • Suitable for both cloud and edge deployments
  • Supports batch processing for high-throughput scenarios

Architecture Overview

Cosmos-Reason1 integrates a vision encoder with a 7B parameter language model, with specialized components designed to facilitate multi-step reasoning and logical inference across visual and textual modalities.


OpenGVLab InternVL3.5

InternVL3.5 is an 8B parameter vision-language model developed by OpenGVLab, designed for comprehensive understanding and generation across visual and textual modalities. The model provides strong performance on diverse multimodal tasks.

Available Sizes

| Size | Parameters | Best For |
| --- | --- | --- |
| 8B | 8 billion | Balanced performance, general-purpose VLM applications |

Key Features

Enhanced Visual Understanding

InternVL3.5 excels at processing complex visual scenes with high accuracy:

  • Fine-Grained Recognition: Identifying small objects and subtle details
  • Scene Understanding: Comprehending relationships between multiple objects
  • Spatial Awareness: Understanding object positions and spatial relationships
  • Visual Attributes: Recognizing colors, textures, sizes, and other properties

The model is particularly effective for:

  • Detailed image analysis and description
  • Complex scene understanding
  • Fine-grained object classification
  • Visual attribute recognition

Seamless Text Integration

InternVL3.5 effectively combines visual and textual information:

  • Natural language understanding of visual content
  • Generation of detailed, accurate image descriptions
  • Question answering about image content
  • Following complex multimodal instructions

Scalable Architecture

With 8B parameters, InternVL3.5 offers:

  • Strong performance across diverse tasks
  • Reasonable resource requirements for most deployments
  • Good balance between accuracy and computational cost
  • Suitable for both research and production use

Architecture Overview

InternVL3.5 combines a sophisticated Vision Transformer (ViT) encoder with a language model decoder, enabling nuanced interpretation of multimodal inputs. The architecture is designed for comprehensive visual understanding and coherent text generation.


Coming Soon

Vi is continuously expanding its model support. The following architectures are coming soon:

🔔 Stay Updated

Want to be notified when these models become available? Contact us to join our early access program or subscribe to our updates.

DeepSeek OCR (Coming Soon)

A specialized OCR model optimized for text extraction and document understanding:

  • High accuracy for complex documents
  • Strong handling of handwritten text
  • Comprehensive multilingual support
  • Fast inference for large-scale document processing

LLaVA-NeXT (Coming Soon)

LLaVA-NeXT is an advanced vision-language model with state-of-the-art performance on multimodal reasoning tasks:

  • Enhanced visual understanding and reasoning capabilities
  • Improved instruction following
  • Better handling of complex visual scenes
  • Optimized for both training and inference efficiency

Choosing the Right Model

Select a model based on your specific requirements:

By Use Case

General vision-language tasks

Recommended: Qwen2.5-VL (7B) or InternVL3.5 (8B)

These models provide strong performance across diverse multimodal tasks, including visual question answering, phrase grounding, detailed image description, and document understanding.

Start with Qwen2.5-VL 7B for the best balance of performance and efficiency.

Resource-constrained environments

Recommended: NVILA-Lite (2B) or Qwen2.5-VL (3B)

Choose these models when:

  • Deploying to edge devices with limited compute
  • Requiring fast inference with low latency
  • Operating under strict memory constraints
  • Processing high volumes with limited resources

NVILA-Lite offers the best efficiency, while Qwen2.5-VL 3B provides broader capabilities.

Complex reasoning and analysis

Recommended: Cosmos-Reason1 (7B) or Qwen2.5-VL (32B)

These models excel at:

  • Multi-step logical reasoning
  • Cause-and-effect analysis
  • Contextual understanding
  • Complex problem-solving

Cosmos-Reason1 is optimized specifically for reasoning tasks, while Qwen2.5-VL 32B offers maximum capability across all task types.

OCR and text extraction

Recommended: Qwen2.5-VL (7B or 32B)

Qwen2.5-VL can handle OCR tasks along with other multimodal capabilities:

  • Document understanding
  • Text extraction from images
  • Multilingual text recognition
  • General vision-language tasks

Coming Soon: DeepSeek OCR will be available for specialized OCR tasks and document understanding.

Maximum accuracy

Recommended: Qwen2.5-VL (32B)

Choose the largest model when:

  • Accuracy is the top priority
  • Sufficient compute resources are available
  • Deploying to production with strict quality requirements
  • Handling complex, mission-critical tasks

The 32B model provides the highest accuracy across all task types but requires more computational resources.

By Model Size

| Size | Inference Speed | Accuracy | Memory Required | Best For |
| --- | --- | --- | --- | --- |
| 2B | Fastest | Good | Lowest | Edge devices, real-time applications |
| 3B | Very Fast | Good | Low | Lightweight deployments, specific tasks (OCR) |
| 7B | Fast | Very Good | Moderate | Most use cases, balanced requirements |
| 8B | Fast | Very Good | Moderate | General-purpose applications |
| 32B | Moderate | Excellent | High | Maximum accuracy, complex tasks |
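
If it helps to encode this guidance, the illustrative helper below mirrors the table and use-case recommendations above; the memory thresholds are rough assumptions, not platform requirements:

```python
def suggest_model(max_memory_gb: float, needs_lora: bool, priority: str) -> str:
    """Illustrative only: mirrors the guidance above with assumed thresholds.
    priority: "speed" | "balanced" | "reasoning" | "accuracy"
    """
    if max_memory_gb < 12:
        # Edge / low-memory deployments; note NVILA-Lite has no LoRA support.
        return "Qwen2.5-VL 3B" if needs_lora else "NVILA-Lite 2B"
    if priority == "accuracy" and max_memory_gb >= 80:
        return "Qwen2.5-VL 32B"
    if priority == "reasoning":
        return "Cosmos-Reason1 7B"
    return "Qwen2.5-VL 7B"  # default: best balance for most use cases

print(suggest_model(max_memory_gb=24, needs_lora=True, priority="balanced"))
```
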
💡 Testing Multiple Models

Not sure which model is best for your use case? You can easily create multiple workflows with different model architectures and compare their performance on your dataset. Monitor training runs and evaluate model performance to make an informed decision.


Model Configuration

After selecting your model architecture, you'll need to configure additional settings; the next steps below walk through each one.

Next Steps

Once you've chosen your model architecture:

  1. Configure model settings — Set training parameters and optimization options
  2. Define your system prompt — Provide instructions for model behavior
  3. Configure your dataset — Set up data splits and augmentation
  4. Create a workflow — Combine all settings into a reusable configuration

Common Questions

Can I change the model architecture after creating a workflow?

No, the model architecture is fixed when you create a workflow. To try a different architecture, create a new workflow with the desired model. You can maintain multiple workflows with different architectures to compare performance.

How do model sizes affect training time and cost?

Larger models require more training time and compute resources:

  • 2B-3B models: Fastest training, lowest cost
  • 7B-8B models: Moderate training time, reasonable cost
  • 32B models: Longer training time, higher cost but best accuracy

Training time also depends on your dataset size and training settings. Check your resource usage to monitor compute consumption.

Which model is best for beginners?

Start with Qwen2.5-VL 7B. It offers:

  • Excellent performance across diverse tasks
  • Reasonable training time and resource requirements
  • Strong community support and documentation
  • Good balance for learning and experimentation

Follow the quickstart guide to train your first model.

Can I use multiple model architectures in the same project?

Yes! You can create multiple workflows within the same training project, each using a different model architecture. This allows you to:

  • Compare performance across architectures
  • Choose the best model for your specific use case
  • Optimize for different deployment scenarios (cloud vs. edge)

Each workflow maintains its own configuration and training runs.

Do all models support the same tasks?

All currently available models support common VLM tasks like visual question answering and phrase grounding:

  • Cosmos-Reason1 is optimized specifically for reasoning tasks
  • General-purpose models (Qwen2.5-VL, InternVL3.5, NVILA-Lite) handle all standard VLM tasks

Coming Soon: DeepSeek OCR will be specialized for text extraction and document understanding tasks.

Choose based on your primary use case. See choosing the right model for detailed guidance.

What does 'no LoRA support' mean for NVILA-Lite?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique. NVILA-Lite uses a different training approach and doesn't support LoRA-based fine-tuning.

This doesn't limit NVILA-Lite's capabilities—it still supports full fine-tuning on your datasets. If you specifically need LoRA-based training (for memory efficiency or other reasons), choose Qwen2.5-VL or InternVL3.5 instead.

